ICLR 2016 VAE Summary — 鈴木雅大 (Masahiro Suzuki)


Page 1

ICLR 2016 VAE Summary
鈴木雅大 (Masahiro Suzuki)

Page 2

About this talk
- Today's content
  - This talk focuses on the VAE-related work presented at ICLR.
- ICLR 2016
  - May 2-4, 2016, San Juan, Puerto Rico
  - Number of papers: conference track 80, workshop 55

Page 3

Trends at ICLR 2016

Reinforcement Learning
Unsupervised Learning
Incorporating Structure
Compressing Networks
Initializing Networks
Backprop Tricks
Attention
Deep Metric Learning
Computer Vision Applications
Visualizing Networks
Do Deep Convolutional Nets Really Need to be Deep?
Training-Free Methods
Geometric Methods
Gaussian Processes and Auto Encoders
ResNet

Source: http://www.computervisionblog.com

Page 4

VAE papers at ICLR
- Five papers accepted at ICLR are VAEs or are closely related to VAEs:
  - Importance Weighted Autoencoders
  - The Variational Fair Autoencoder
  - Generating Images from Captions with Attention
  - Variational Gaussian Process
  - Variationally Auto-Encoded Deep Gaussian Processes
- This talk explains VAEs from the basics and then covers these papers.

Page 5

Discriminative and generative models
- A discriminative model is only interested in separating the data.
  - Deep neural networks, SVMs, etc. are discriminative models (strictly speaking, discriminant functions).
- A generative model also considers the source that generates the data.

Discriminative model (discriminant function): directly models the classification probability (posterior) p(C_k | x).
Generative model: models the data distribution (joint distribution) p(x, C_k), from which the posterior follows as p(C_k | x) = p(x, C_k) / p(x).

Page 6

Variational inference and VAE

Page 7

Background: variational inference
- Learning a generative model = we want to model the distribution p(x) from data.
  → This is done by maximizing the likelihood p(x).
- When a latent variable z is also modeled, so that p(x) = ∫ p(x, z) dz ...
  - the likelihood can no longer be maximized directly.
  - Instead, we maximize a lower bound that always bounds the log-likelihood from below.
- Introduce a distribution q(z|x) that approximates p(z|x).
- The log-likelihood then decomposes as

log p(x) = L(x) + KL(q(z|x) || p(z|x))

where L(x) is the lower bound and the KL term is the gap between the true posterior and its approximation; the KL term is always ≥ 0.
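For reference, this decomposition follows in a few lines from the definition of the lower bound; a standard derivation, written out here in LaTeX:

\begin{align}
\log p(x) &= \mathbb{E}_{q(z|x)}[\log p(x)]
           = \mathbb{E}_{q(z|x)}\left[\log \frac{p(x,z)}{p(z|x)}\right] \\
          &= \mathbb{E}_{q(z|x)}\left[\log \frac{p(x,z)}{q(z|x)}\right]
           + \mathbb{E}_{q(z|x)}\left[\log \frac{q(z|x)}{p(z|x)}\right] \\
          &= \underbrace{\mathcal{L}(x)}_{\text{lower bound}}
           + \underbrace{\mathrm{KL}\big(q(z|x)\,\|\,p(z|x)\big)}_{\ge 0}.
\end{align}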

[Excerpt from PRML, Ch. 9 (Figs. 9.12-9.13): in the E step, q(Z) is set to the posterior p(Z|X, θ_old), so the lower bound L(q, θ_old) rises to the log-likelihood ln p(X|θ_old) and KL(q||p) vanishes; substituting this q into the bound gives L(q, θ) = Q(θ, θ_old) + const, the expected complete-data log-likelihood. In the M step, q(Z) is held fixed and L(q, θ) is maximized with respect to θ to give θ_new; because the KL divergence is nonnegative, ln p(X|θ) increases by at least as much as the lower bound does.]

Page 8

Variational Autoencoder
- Variational Autoencoder [Kingma+ 13][Rezende+ 14]
  - A generative model whose probability distributions are represented by deep neural networks.
  - For simplicity, assume a single latent variable z.
- The lower bound to be maximized is

L(x) = E_{q_φ(z|x)}[ log ( p_θ(x, z) / q_φ(z|x) ) ] ≈ (1/T) Σ_{t=1}^{T} log ( p_θ(x, z^(t)) / q_φ(z^(t)|x) )

where z^(t) = μ + diag(σ) ⊙ ε, ε ~ N(0, I)  (the reparameterization trick).

[Diagram] The approximate posterior q(z|x), which maps x to z, is viewed as the encoder; the generative side, z ~ p(z) and x ~ p(x|z), is viewed as the decoder.
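The reparameterization trick above is what makes the bound differentiable with respect to the encoder parameters. Below is a minimal PyTorch-style sketch of a single-sample (T = 1) estimate of L(x); the callables encoder, decoder, prior_logp, and likelihood_logp are assumed interfaces for this illustration, not the original implementation.

import math
import torch

def elbo_estimate(x, encoder, decoder, prior_logp, likelihood_logp):
    # encoder(x) is assumed to return the mean and log-variance of q_phi(z|x)
    mu, log_var = encoder(x)
    sigma = torch.exp(0.5 * log_var)

    # reparameterization trick: z = mu + sigma * eps with eps ~ N(0, I)
    eps = torch.randn_like(sigma)
    z = mu + sigma * eps

    # log q_phi(z|x) under a diagonal Gaussian
    log_q = -0.5 * (log_var + (z - mu) ** 2 / torch.exp(log_var)
                    + math.log(2 * math.pi)).sum(dim=-1)

    # log p_theta(x, z) = log p(z) + log p_theta(x|z); both supplied by the caller
    log_joint = prior_logp(z) + likelihood_logp(decoder(z), x)

    # single-sample (T = 1) Monte Carlo estimate of the lower bound L(x)
    return (log_joint - log_q).mean()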

Page 9

Modeling the VAE
- The distributions are parameterized by neural networks.

[Diagram] Inference model (encoder network) → sampling of z → generative model (decoder network).

[Excerpt from a paper figure (an ICLR 2017 submission on a multimodal VAE, MVAE) illustrating how encoder and decoder distributions are parameterized by deep neural networks. For a Gaussian encoder q_φ(z|x, w) over two modalities x and w, the mean and log-variance are produced by MLPs:

y(x) = MLP_φx(x),  y(w) = MLP_φw(w),
μ_φ = Linear(y(x), y(w)),  log σ_φ² = Tanh(Linear(y(x), y(w))),

where Linear(a, b) denotes a linear layer with multiple input branches. Each decoder, p_θx(x|z) and p_θw(w|z), gets its own network, and the output distribution depends on the modality: Gaussian when the representation is continuous, Bernoulli when it is binary, e.g. μ_θw = Linear(MLP_θw(z)).]


[Excerpt, continued: background on VAEs from the same paper.] Variational autoencoders (VAEs) (Kingma & Welling, 2014; Rezende et al., 2014) are recently proposed deep generative models. Given observed variables x and corresponding latent variables z, the generative process is

z ~ p(z);  x ~ p_θ(x|z),   (1)

where θ are the parameters of p. In variational inference we introduce q_φ(z|x), with parameters φ, to approximate the posterior p_θ(z|x). The goal is to maximize the marginal likelihood p(x) = ∫ p_θ(x|z) p(z) dz, but this distribution is intractable, so we instead maximize its lower bound L(x):

log p(x) = log ∫ p_θ(x, z) dz
         = log ∫ q_φ(z|x) [ p_θ(x, z) / q_φ(z|x) ] dz
         ≥ ∫ q_φ(z|x) log [ p_θ(x, z) / q_φ(z|x) ] dz
         = −D_KL( q_φ(z|x) || p(z) ) + E_{q_φ(z|x)}[ log p_θ(x|z) ]
         = L(x).   (2)

Following Kingma & Welling (2014), q_φ(z|x) is called the encoder and p_θ(x|z) the decoder. The bound is optimized with respect to θ and φ using the stochastic gradient variational Bayes (SGVB) algorithm, also known as stochastic back-propagation.

[Diagram] Encoder: approximate posterior q(z|x). Generative model (decoder): z ~ p(z), x ~ p(x|z).
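As a concrete illustration of "parameterizing the distributions with neural networks", here is a minimal PyTorch-style sketch of a Gaussian encoder and a Bernoulli decoder; the layer sizes and module names are assumptions chosen for the example (e.g. 784-dimensional binarized inputs), not the original code.

import torch
import torch.nn as nn

class GaussianEncoder(nn.Module):
    # q_phi(z|x) = N(z | mu(x), diag(sigma(x)^2))
    def __init__(self, x_dim=784, h_dim=200, z_dim=20):
        super().__init__()
        self.hidden = nn.Sequential(nn.Linear(x_dim, h_dim), nn.ReLU())
        self.mu = nn.Linear(h_dim, z_dim)
        self.log_var = nn.Linear(h_dim, z_dim)

    def forward(self, x):
        h = self.hidden(x)
        return self.mu(h), self.log_var(h)

class BernoulliDecoder(nn.Module):
    # p_theta(x|z) = Bernoulli(x | mu(z)), suitable for binarized data
    def __init__(self, x_dim=784, h_dim=200, z_dim=20):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(z_dim, h_dim), nn.ReLU(),
                                 nn.Linear(h_dim, x_dim), nn.Sigmoid())

    def forward(self, z):
        return self.net(z)  # Bernoulli mean for each pixel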

Page 10

How do we learn a good model?
- What we maximize is the lower bound, but what we really want to maximize is the (log-)likelihood.
  - It is enough if the lower bound approximates the log-likelihood well.
  - This depends entirely on how well the approximate posterior can match the true posterior:

L(x) = log p(x) − KL(q(z|x) || p(z|x))

If the approximate posterior can match the true posterior, the KL divergence becomes 0 and the bound equals the log-likelihood.

- In practice, however, the approximate posterior is constrained by the form of the VAE's lower bound.
  - If a posterior sample fails to explain x even slightly, it incurs a large penalty.
- Solution:
  - Design a new lower bound that approximates the log-likelihood more closely.

Page 11

Importance Weighted AE
- Importance Weighted Autoencoders [Burda+ 15; ICLR 2016]
  - Proposes the following new lower bound: an importance-weighted estimator with k samples

L_k(x) = E_{z^(1),...,z^(k) ~ q_φ(z|x)}[ log (1/k) Σ_{i=1}^{k} p_θ(x, z^(i)) / q_φ(z^(i)|x) ]

- This bound is proven to satisfy

log p(x) ≥ L_{k+1}(x) ≥ L_k(x) ≥ L_1(x) = L(x)

- Simply increasing the number of samples relaxes the constraint and brings the bound closer to the true log-likelihood.
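In practice the k-sample bound is computed from the log importance weights with a numerically stable log-sum-exp. A small PyTorch sketch, assuming the log weights have already been computed:

import math
import torch

def iwae_bound_from_log_weights(log_w):
    # log_w[i, n] = log p_theta(x_n, z_n^(i)) - log q_phi(z_n^(i) | x_n)
    # for k importance samples (rows) and a minibatch (columns)
    k = log_w.shape[0]
    # L_k(x) = E[ log (1/k) sum_i w_i ], computed stably with logsumexp
    return (torch.logsumexp(log_w, dim=0) - math.log(k)).mean()

# k = 1 recovers the ordinary VAE bound; larger k gives a tighter bound.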

Page 12

IWAE: experimental results
- The test likelihood is confirmed to improve.

[Excerpt from the IWAE paper (ICLR 2016), Table 1: density estimation results (NLL in nats, lower is better) and number of active latent dimensions. For two-layer models, "k1+k2" denotes k1 active units in the first stochastic layer and k2 in the second.]

# stoch. layers | k  | MNIST VAE NLL (active) | MNIST IWAE NLL (active) | OMNIGLOT VAE NLL (active) | OMNIGLOT IWAE NLL (active)
1               | 1  | 86.76 (19)             | 86.76 (19)              | 108.11 (28)               | 108.11 (28)
1               | 5  | 86.47 (20)             | 85.54 (22)              | 107.62 (28)               | 106.12 (34)
1               | 50 | 86.35 (20)             | 84.78 (25)              | 107.80 (28)               | 104.67 (41)
2               | 1  | 85.33 (16+5)           | 85.33 (16+5)            | 107.58 (28+4)             | 107.56 (30+5)
2               | 5  | 85.01 (17+5)           | 83.89 (21+5)            | 106.31 (30+5)             | 104.79 (38+6)
2               | 50 | 84.78 (17+5)           | 82.90 (26+7)            | 106.30 (30+5)             | 103.38 (44+7)

The generative performance of IWAEs improved with increasing k, while that of VAEs benefitted only slightly (the two algorithms are identical at k = 1). Two-layer models achieved better generative performance than one-layer models. On MNIST, the two-layer IWAE with k = 50 reaches a log-likelihood of -82.90 nats, compared with about -84.55 for deep belief networks and -84.13 for deep autoregressive networks; on OMNIGLOT the best IWAE reaches -103.38 nats, slightly worse than an RBM at around -100 nats. Overfitting was not a serious issue for either model. Both VAEs and IWAEs use far fewer latent dimensions than their capacity allows: a dimension u is counted as active if A_u = Cov_x( E_{u~q(u|x)}[u] ) > 10^-2, removing inactive dimensions changes the test log-likelihood by less than 0.06 nats, and for k > 1 the IWAE consistently learned more active dimensions than the VAE.
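A small sketch (with assumed tensor shapes, not the authors' code) of the active-units statistic A_u used in the table above:

import torch

def count_active_units(posterior_means, threshold=1e-2):
    # posterior_means: (num_datapoints, z_dim); row n holds E_{u ~ q(u|x_n)}[u]
    a_u = posterior_means.var(dim=0)   # A_u = Cov_x( E_{u~q(u|x)}[u] ) per dimension
    return int((a_u > threshold).sum())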

Page 13

Conditional VAE and semi-supervised learning

Page 14

Semi-supervised learning with VAEs
- Semi-Supervised Learning with Deep Generative Models [Kingma+ 2014; NIPS 2014]
  - Semi-supervised learning with a conditional VAE (CVAE).
- The lower bound of the conditional VAE is

L(x|y) = E_{q_φ(z|x,y)}[ log ( p_θ(x, z|y) / q_φ(z|x, y) ) ]

- The overall lower bound is therefore

L(x) + L(x|y) + α E[ −log q_φ(y|x) ]

- The last term is the model that predicts the label.

[Graphical model] The label y and the latent variable z together generate x.
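A schematic sketch of how these three terms are usually combined in an implementation, reading them as a loss to be minimized over the labeled and unlabeled data; elbo_u, elbo_xy, and label_log_prob are assumed helper functions for this illustration, not the paper's code.

def semi_supervised_loss(x_unlabeled, x_labeled, y_labeled,
                         elbo_u, elbo_xy, label_log_prob, alpha=0.1):
    # negative bound on unlabeled data, -L(x)
    loss_u = -elbo_u(x_unlabeled)
    # negative conditional bound on labeled data, -L(x|y)
    loss_l = -elbo_xy(x_labeled, y_labeled)
    # classification term alpha * E[-log q_phi(y|x)] on labeled data
    loss_cls = -alpha * label_log_prob(x_labeled, y_labeled)
    return loss_u + loss_l + loss_cls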

Page 15

Problems with CVAE semi-supervised learning
- In the model, the label and the latent variable are independent.
- However, because the approximate posterior is q_φ(z|x, y), a dependence between y and z arises.
- We want a latent variable that keeps the information in the data while being independent of the label.
  - If we regard y as a domain, we should be able to obtain a representation with the domain removed.

[Graphical model] y and z together generate x.

Page 16

Variational Fair Autoencoder
- The Variational Fair Autoencoder [Louizos+ 15; ICLR 2016]
  - To make the latent representation independent of s (the "sensitive" variable; the label in the previous slide's terms), the following maximum mean discrepancy (MMD) is additionally minimized.
  - The aim is to remove the difference between the latent variables obtained for s = 0 and for s = 1.
  - This penalty is added to the VAE lower bound.
- The MMD is normally computed through a kernel.
  - However, computing a high-dimensional Gram matrix inside SGD is expensive, so the feature map is instead approximated in the form given below.

[Excerpt from the VFAE paper (Sections 2.3-2.4): even though the model encourages statistical independence between s and z1 a priori, some dependence can remain in the approximate marginal posterior q_φ(z1|s), e.g. when the label y is correlated with s and lets information about s "leak" into the posterior. The lower bound is therefore penalized with a Maximum Mean Discrepancy (MMD) term (Gretton et al., 2006).

MMD asks whether two datasets {x} ~ P0 and {x'} ~ P1 come from the same distribution by comparing empirical statistics ψ(·):

|| (1/N0) Σ_i ψ(x_i) − (1/N1) Σ_i ψ(x'_i) ||²   (6)

Expanding the square yields an estimator of inner products to which the kernel trick can be applied, giving the MMD estimator

ℓ_MMD(X, X') = (1/N0²) Σ_n Σ_m k(x_n, x_m) + (1/N1²) Σ_n Σ_m k(x'_n, x'_m) − (2/(N0 N1)) Σ_n Σ_m k(x_n, x'_m).   (7)

Asymptotically, for a universal kernel such as the Gaussian kernel k(x, x') = exp(−γ ||x − x'||²), ℓ_MMD(X, X') is 0 if and only if P0 = P1; minimizing MMD can be viewed as matching all moments of P0 and P1. Adding this penalty between the marginal posteriors q_φ(z1|s = 0) and q_φ(z1|s = 1) (for binary nuisance information s) to the VAE bound gives the Variational Fair Autoencoder:

F_VFAE(φ, θ; x_n, x_m, s_n, s_m, y_n) = F_VAE(φ, θ; x_n, x_m, s_n, s_m, y_n) − λ ℓ_MMD(Z1_{s=0}, Z1_{s=1})   (8)

where ℓ_MMD(Z1_{s=0}, Z1_{s=1}) = || E_{p(x|s=0)}[E_{q(z1|x,s=0)}[ψ(z1)]] − E_{p(x|s=1)}[E_{q(z1|x,s=1)}[ψ(z1)]] ||².   (9)

Fast MMD via random Fourier features: a naive minibatch implementation would require an M×M Gram matrix per minibatch of size M. Instead, random kitchen sinks (Rahimi & Recht, 2009) give a feature expansion such that the simple estimator (6) approximates the full MMD (7). Draw a random K×D matrix W (K the dimensionality of x, D the number of random features) with standard Gaussian entries; the feature expansion is

ψ_W(x) = sqrt(2/D) · cos( sqrt(2/γ) · x W + b ),   (10)

where b is a D-dimensional random vector with entries drawn uniformly from [0, 2π]. This estimator is fairly accurate and much faster than the full MMD penalty; the paper uses D = 500. When s has more than two states, the penalty Σ_k ℓ_MMD(Z1_{s=k}, Z1) is applied between each marginal q(z|s = k) and q(z).]
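A minimal NumPy sketch of the fast MMD penalty via random Fourier features (Eq. 10 above); the number of features D and the kernel bandwidth gamma are illustrative choices, not the paper's implementation.

import numpy as np

def random_fourier_features(x, W, b, gamma):
    # x: (N, K) batch of latent samples; W: (K, D); b: (D,)
    D = W.shape[1]
    return np.sqrt(2.0 / D) * np.cos(np.sqrt(2.0 / gamma) * x @ W + b)

def fast_mmd(z0, z1, D=500, gamma=1.0, seed=0):
    # approximate MMD between latent samples for s = 0 (z0) and s = 1 (z1)
    rng = np.random.default_rng(seed)
    K = z0.shape[1]
    W = rng.standard_normal((K, D))
    b = rng.uniform(0.0, 2.0 * np.pi, size=D)
    psi0 = random_fourier_features(z0, W, b, gamma).mean(axis=0)
    psi1 = random_fourier_features(z1, W, b, gamma).mean(axis=0)
    return np.sum((psi0 - psi1) ** 2)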


Page 17

Experiments: verifying fairness
- Check whether the information about s has been removed from z.
  - Evaluated by the accuracy of classifying s from z.

[Excerpt: Figure 3 of the VFAE paper shows fair-classification results on the Adult, German, and Health datasets; the columns report Random/RF/LR accuracy on s, discrimination against s, and Random/Model accuracy on y. A "fair" encoding should give low accuracy on s (LR is a linear classifier, RF a nonlinear one), low discrimination against s, and high accuracy on y. Figure 4 shows t-SNE visualizations of the Adult dataset: with the independence properties and the MMD penalty, the nuisance groups (males in blue, females in red) become practically indistinguishable in the latent space.]
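A small sketch of this evaluation protocol (predict s from the learned codes z with a linear and a nonlinear classifier); scikit-learn is used here purely for illustration and is an assumption, not the paper's tooling.

from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def fairness_scores(Z, s):
    # Z: (N, z_dim) latent codes; s: (N,) sensitive variable
    # Accuracy close to chance means little information about s remains in z.
    lr_acc = cross_val_score(LogisticRegression(max_iter=1000), Z, s, cv=5).mean()
    rf_acc = cross_val_score(RandomForestClassifier(n_estimators=100), Z, s, cv=5).mean()
    return {"LR": lr_acc, "RF": rf_acc}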

Page 18

Experiments: domain adaptation
- Domain adaptation between different domains.
  - Semi-supervised setting (no labels for the target domain).
  - The Amazon reviews dataset; y is the sentiment (positive or negative).
- Result:
  - VFAE beat prior work ([Ganin+ 15]) on 9 of the 12 tasks.

[Excerpt from the VFAE paper. Figures 1 and 2 show the unsupervised model, where x is generated from the nuisance variable s and a latent z (z ~ p(z); x ~ p_θ(x|z, s), so the latent representation is marginally independent of s), and the semi-supervised model, which adds a second layer of latent variables z2 and the label y so that label information is "injected" during feature extraction. Both the generative model (decoder) p_θ(x|z, s) and the variational posterior (encoder) q_φ(z|x, s) are deep neural networks trained jointly with SGVB. Table 1 reports domain adaptation results on the Amazon reviews dataset; the DANN column is taken from Ganin et al. (2015). The "S" columns give the accuracy of predicting the domain s from the representation (lower, i.e. closer to chance = 0.5, is better; RF is a random forest, LR logistic regression), and the "Y" columns give sentiment accuracy (higher is better).]

Source - Target        | S: RF | S: LR | Y: VFAE | Y: DANN
books - dvd            | 0.535 | 0.564 | 0.799   | 0.784
books - electronics    | 0.541 | 0.562 | 0.792   | 0.733
books - kitchen        | 0.537 | 0.583 | 0.816   | 0.779
dvd - books            | 0.537 | 0.563 | 0.755   | 0.723
dvd - electronics      | 0.538 | 0.566 | 0.786   | 0.754
dvd - kitchen          | 0.543 | 0.589 | 0.822   | 0.783
electronics - books    | 0.562 | 0.590 | 0.727   | 0.713
electronics - dvd      | 0.556 | 0.586 | 0.765   | 0.738
electronics - kitchen  | 0.536 | 0.570 | 0.850   | 0.854
kitchen - books        | 0.560 | 0.593 | 0.720   | 0.709
kitchen - dvd          | 0.561 | 0.599 | 0.733   | 0.740
kitchen - electronics  | 0.533 | 0.565 | 0.838   | 0.843

[The excerpt also reports results on the Extended Yale B dataset (Table 2): on the original x, the lighting condition s is predicted almost perfectly (RF 0.952, LR 0.961) and accuracy on y is 0.78; with the VFAE, accuracy on s drops to roughly chance (0.435 / 0.565) while accuracy on y rises to 0.846, above the NN + MMD baseline (0.82) of Li et al. (2014).]

Page 19

Uses of the CVAE
- A conditional VAE can generate images conditioned on labels and similar information.
  - It can also generate data that does not exist among the training samples.
- Conditioning on digit labels [Kingma+ 2014; NIPS 2014].

[Excerpt from [Kingma+ 2014], Figure 1: (a) handwriting styles for MNIST obtained by fixing the class label and varying the 2-D latent variable z; (b, c) MNIST and SVHN analogies, where the latent z inferred from a test image (leftmost column) is held fixed while the class label y is varied across columns, demonstrating the disentanglement of style from class; on SVHN the model fixes the style of the house number and varies the digit. Tables 2 and 3 report semi-supervised classification with 1000 labels: on SVHN, M1+M2 reaches 36.02 versus 77.93 for KNN and 66.55 for TSVM; on NORB, M1+TSVM reaches 18.79. The model also improves significantly on stochastic feed-forward networks (SFNN) while allowing efficient joint inference.]

Page 20

Conditional alignDRAW
- Generating Images from Captions with Attention [Mansimov+ 16; ICLR 2016]
  - A model that conditions DRAW on a bidirectional RNN.
- DRAW [Gregor+ 14]
  - Brings RNNs into the VAE framework.
  - The image is overwritten (refined) at each time step.
  - Attention is modeled by looking at the difference from the previous step.

[Excerpt from the DRAW paper (Gregor et al., 2015). Rather than generating an image in a single pass, DRAW iteratively constructs the scene through an accumulation of modifications emitted by the decoder, each of which is observed by the encoder; its attention mechanism is fully differentiable, so it can be trained with standard backpropagation rather than policy gradients. Figure 2 (left) shows a conventional variational autoencoder: a sample z is drawn from the prior P(z) and passed through a feed-forward decoder to compute P(x|z), the encoder produces the approximate posterior Q(z|x), and training minimizes KL(Q(Z|x)||P(Z)) − log P(x|z). Figure 2 (right) shows the DRAW network: at each time step a latent sample z_t is passed to a recurrent decoder that modifies part of a canvas matrix c_t, and the final canvas c_T is used to compute P(x|z_{1:T}); during inference the input is read at every time step (the previous RNN states specify where to read) and fed to a recurrent encoder that outputs Q(z_t|x, z_{1:t−1}). Both encoder and decoder are LSTMs. Figures 7-9 show generation results: without attention the network first produces a very blurry MNIST image and refines it, whereas with attention it constructs the digit by tracing the lines, much like a person with a pen; on two-digit 60×60 MNIST it typically generates one digit and then the other; generated SVHN house numbers are highly realistic, and the nearest training images in L2 distance are visually similar but show different numbers.]
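A highly simplified sketch of the DRAW-style generation loop described above; the decoder RNN, the write operator, and the prior sampler are assumed callables, and attention and the inference side are omitted.

import torch

def draw_generate(decoder_rnn, write, prior_sample, T, canvas_shape):
    # c_t = c_{t-1} + write(h_t): the image is built up over T steps
    canvas = torch.zeros(canvas_shape)
    h = None                        # decoder hidden state
    for t in range(T):
        z_t = prior_sample()        # z_t ~ P(z_t)
        h = decoder_rnn(z_t, h)     # recurrent decoder step
        canvas = canvas + write(h)  # add this step's modification to the canvas
    return torch.sigmoid(canvas)    # mean of the Bernoulli over pixels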

Page 21

Conditional alignDRAW
- Overall structure of conditional alignDRAW.
  - DRAW conditioned on a bidirectional RNN over the caption.
  - The conditioning signal is a weighted sum of the bidirectional RNN outputs.

[Excerpt from the alignDRAW paper (Mansimov et al., ICLR 2016). Figure 2: the caption y is encoded with a bidirectional RNN; the generative RNN takes a latent sequence z_{1:T} together with a dynamic caption representation s_{1:T} to build up the canvas c_T, from which the image x is generated, and an inference RNN computes the approximate posterior Q over the latent sequence. Unlike the original DRAW, where the latents are independent spherical Gaussians, the prior over Z_t depends on the previous generative hidden state (except P(Z_1) = N(0, I)):

P(Z_t | Z_{1:t−1}) = N( μ(h^gen_{t−1}), σ(h^gen_{t−1}) ),  μ = tanh(W_μ h^gen_{t−1}),  σ = exp( tanh(W_σ h^gen_{t−1}) ).

An image is generated by iterating, for t = 1, ..., T:

z_t ~ P(Z_t | Z_{1:t−1})                          (1)
s_t = align(h^gen_{t−1}, h^lang)                  (2)
h^gen_t = LSTM_gen(h^gen_{t−1}, [z_t, s_t])       (3)
c_t = c_{t−1} + write(h^gen_t)                    (4)
x ~ P(x | y, Z_{1:T}) = Π_i Bern( σ(c_{T,i}) )    (5)

The align operator computes a dynamic sentence representation as a weighted sum of the caption hidden states h^lang_1, ..., h^lang_N:

s_t = α^t_1 h^lang_1 + α^t_2 h^lang_2 + ... + α^t_N h^lang_N,   (6)

with alignment probabilities

α^t_k = exp( v^T tanh(U h^lang_k + W h^gen_{t−1} + b) ) / Σ_{i=1}^{N} exp( v^T tanh(U h^lang_i + W h^gen_{t−1} + b) ),   (7)

where v ∈ R^l, U ∈ R^{l×m}, W ∈ R^{l×n}, and b ∈ R^l are learned parameters. The write operator places a p×p patch onto the canvas through 1-D Gaussian filter banks whose locations and scales are computed from h^gen_t, and the final canvas is passed through a sigmoid to give a conditional Bernoulli distribution over pixels (at generation time the conditional mean σ(c_T) is used directly). The model is trained to maximize a variational lower bound on log P(x|y), which decomposes into a reconstruction term and per-step KL terms between Q(Z_t | Z_{1:t−1}, y, x) and P(Z_t | Z_{1:t−1}, y); the inference LSTM reads a patch of the error image x̂_t = x − σ(c_{t−1}) at each step.]

[Slide diagram: the caption y and the latent sequence z jointly generate the image x.]
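A small NumPy sketch of the align operator (Eqs. 6-7 above); the matrix shapes follow the excerpt (U ∈ R^{l×m}, W ∈ R^{l×n}, v, b ∈ R^l), but the concrete dimensions and array layout are assumptions for the example.

import numpy as np

def align(h_gen_prev, h_lang, U, W, v, b):
    # h_gen_prev: (n,) previous generative hidden state
    # h_lang: (N, m) caption hidden states from the bidirectional RNN
    scores = np.tanh(h_lang @ U.T + h_gen_prev @ W.T + b) @ v  # (N,)
    alpha = np.exp(scores - scores.max())
    alpha = alpha / alpha.sum()        # alignment probabilities alpha^t_1..N
    return alpha @ h_lang              # s_t: weighted sum of caption states, shape (m,)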

Page 22

Experiment: MNIST with captions
- Trained on MNIST with artificial captions.
  - A caption specifies where the digit(s) should be placed.
- Left column: configurations present in the training data; right column: configurations that are not.
- Images with multiple digits are also generated correctly.

Published as a conference paper at ICLR 2016

Figure 6: Examples of generating 60⇥ 60 MNIST images corresponding to respective captions. The captionson the left column were part of the training set. The digits described in the captions on the right column werehidden during training for the respective configurations.

APPENDIX A: MNIST WITH CAPTIONS

As an additional experiment, we trained our model on the MNIST dataset with artificial captions.Either one or two digits from the MNIST training dataset were placed on a 60 ⇥ 60 blank image.One digit was placed in one of the four (top-left, top-right, bottom-left or bottom-right) cornersof the image. Two digits were either placed horizontally or vertically in non-overlapping fashion.The corresponding artificial captions specified the identity of each digit along with their relativepositions, e.g. “The digit three is at the top of the digit one”, or “The digit seven is at the bottom leftof the image”.

The generated images together with the attention alignments are displayed in Figure 6. The modelcorrectly displayed the specified digits at the described positions and even managed to generalizereasonably well to the configurations that were never present during training. In the case of gener-ating two digits, the model would dynamically attend to the digit in the caption it was drawing atthat particular time-step. Similarly, in the setting where the caption specified only a single digit, themodel would correctly attend to the digit in the caption during the whole generation process. In bothcases, the model placed small attention values on the words describing the position of digits in theimages.

APPENDIX B: TRAINING DETAILS

HYPERPARAMETERS

Each parameter in alignDRAW was initialized by sampling from a Gaussian distribution with mean0 and standard deviation 0.01. The model was trained using RMSprop with an initial learning rateof 0.001. For the Microsoft COCO task, we trained our model for 18 epochs. The learning ratewas reduced to 0.0001 after 11 epochs. For the MNIST with Captions task, the model was trainedfor 150 epochs and the learning rate was reduced to 0.0001 after 110 epochs. During each epoch,randomly created 10, 000 training samples were used for learning. The norm of the gradients wasclipped at 10 during training to avoid the exploding gradients problem.

We used a vocabulary size of K = 25323 and K = 22 for the Microsoft COCO and MNIST withCaptions datasets respectively. All capital letters in the words were converted to small letters as apreprocessing step. For all tasks, the hidden states

�!h

lang

i

and �h

lang

i

in the language model had128 units. Hence the dimensionality of the concatenated state of the Bidirectional LSTM h

lang

i

=

[

�!h

lang

i

,

�h

lang

i

] was 256. The parameters in the align operator (Eq. 7) had a dimensionality ofl = 512, so that v 2 R512, U 2 R512⇥256, W 2 R512⇥n

gen

and b 2 R512. The architecturalconfigurations of the alignDRAW models are shown in Table 2.


Page 23: Iclr2016 vaeまとめ

Experiment: Microsoft COCO dataset
¤ Change only part of the caption (the underlined words)

¤ Generate images from captions that do not exist in the dataset


A yellow school bus parked in a parking lot. / A red school bus parked in a parking lot. / A green school bus parked in a parking lot. / A blue school bus parked in a parking lot.

The decadent chocolate desert is on the table. / A bowl of bananas is on the table. / A vintage photo of a cat. / A vintage photo of a dog.

Figure 3: Top: Examples of changing the color while keeping the caption fixed. Bottom: Examples of changing the object while keeping the caption fixed. The shown images are the probabilities σ(c_T). Best viewed in colour.

The expectation can be approximated by L Monte Carlo samples z^l_{1:T} from Q(Z_{1:T} | y, x):

L ≈ (1/L) Σ_{l=1}^{L} [ log p(x | y, z^l_{1:T}) − Σ_{t=2}^{T} D_KL( Q(Z_t | z^l_{1:t−1}, y, x) || P(Z_t | z^l_{1:t−1}, y) ) ] − D_KL( Q(Z_1 | x) || P(Z_1) ).  (15)

The model can be trained using stochastic gradient descent. In all of our experiments, we used only a single sample from Q(Z_{1:T} | y, x) for parameter learning. Training details, hyperparameter settings, and the overall model architecture are specified in Appendix B. The code is available at https://github.com/emansim/text2image.
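A minimal sketch of the single-sample (L = 1) estimate of Eq. 15, assuming diagonal-Gaussian Q and P terms so that each KL is available in closed form; the helper names and parameter layout are illustrative, not the released implementation.

```python
import numpy as np

def gaussian_kl(mu_q, logvar_q, mu_p, logvar_p):
    """KL( N(mu_q, exp(logvar_q)) || N(mu_p, exp(logvar_p)) ), summed over dimensions."""
    return 0.5 * np.sum(
        logvar_p - logvar_q
        + (np.exp(logvar_q) + (mu_q - mu_p) ** 2) / np.exp(logvar_p)
        - 1.0
    )

def elbo_single_sample(log_px, q_params, p_params, q1_params, p1_params):
    """Single-sample estimate of Eq. 15.

    log_px    : log p(x | y, z_{1:T}) evaluated at one sampled trajectory
    q_params  : list of (mu, logvar) for Q(Z_t | z_{1:t-1}, y, x), t = 2..T
    p_params  : list of (mu, logvar) for P(Z_t | z_{1:t-1}, y),   t = 2..T
    q1_params, p1_params : (mu, logvar) for Q(Z_1 | x) and P(Z_1)
    """
    kl_steps = sum(gaussian_kl(*q, *p) for q, p in zip(q_params, p_params))
    kl_first = gaussian_kl(*q1_params, *p1_params)
    return log_px - kl_steps - kl_first
```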

3.4 GENERATING IMAGES FROM CAPTIONS

During the image generation step, we discard the inference network and instead sample from the prior distribution. Due to the blurriness of samples generated by the DRAW model, we perform an additional post-processing step where we use an adversarial network trained on residuals of a Laplacian pyramid conditioned on the skip-thought representation (Kiros et al., 2015) of the captions to sharpen the generated images, similar to (Denton et al., 2015). By fixing the prior of the adversarial generator to its mean, it gets treated as a deterministic neural network that allows us to define the conditional data term in Eq. 14 on the sharpened images and estimate the variational lower bound accordingly.

4 EXPERIMENTS

4.1 MICROSOFT COCO

Microsoft COCO (Lin et al., 2014) is a large dataset containing 82,783 images, each annotated with at least 5 captions. The rich collection of images with a wide variety of styles, backgrounds and objects makes the task of learning a good generative model very challenging. For consistency with related work on caption generation, we used only the first five captions when training and evaluating our model. The images were resized to 32×32 pixels for consistency with other tiny image datasets (Krizhevsky, 2009). In the following subsections, we analyzed both the qualitative and quantitative aspects of our model as well as compared its performance with that of other, related generative models.2 Appendix A further reports some additional experiments using the MNIST dataset.

4.1.1 ANALYSIS OF GENERATED IMAGES

The main goal of this work is to learn a model that can understand the semantic meaning expressed in the textual descriptions of images, such as the properties of objects, the relationships between them, and then use that knowledge to generate relevant images. To examine the understanding of

2 To see more generated images, visit http://www.cs.toronto.edu/~emansim/cap2im.html


A stop sign is flying in blue skies. / A herd of elephants flying in the blue skies. / A toilet seat sits open in the grass field. / A person skiing on sand clad vast desert.

Figure 1: Examples of generated images based on captions that describe novel scene compositions that are highly unlikely to occur in real life. The captions describe a common object doing unusual things or set in a strange location.

2 RELATED WORK

Deep Neural Networks have achieved significant success in various tasks such as image recognition (Krizhevsky et al., 2012), speech transcription (Graves et al., 2013), and machine translation (Bahdanau et al., 2015). While most of the recent success has been achieved by discriminative models, generative models have not yet enjoyed the same level of success. Most of the previous work in generative models has focused on variants of Boltzmann Machines (Smolensky, 1986; Salakhutdinov & Hinton, 2009) and Deep Belief Networks (Hinton et al., 2006). While these models are very powerful, each iteration of training requires a computationally costly step of MCMC to approximate derivatives of an intractable partition function (normalization constant), making it difficult to scale them to large datasets.

Kingma & Welling (2014) and Rezende et al. (2014) have introduced the Variational Auto-Encoder (VAE), which can be seen as a neural network with continuous latent variables. The encoder is used to approximate a posterior distribution and the decoder is used to stochastically reconstruct the data from latent variables. Gregor et al. (2015) further introduced the Deep Recurrent Attention Writer (DRAW), extending the VAE approach by incorporating a novel differentiable attention mechanism.

Generative Adversarial Networks (GANs) (Goodfellow et al., 2014) are another type of generative models that use noise-contrastive estimation (Gutmann & Hyvarinen, 2010) to avoid calculating an intractable partition function. The model consists of a generator that generates samples using a uniform distribution and a discriminator that discriminates between real and generated images. Recently, Denton et al. (2015) have scaled those models by training conditional GANs at each level of a Laplacian pyramid of images.

While many of the previous approaches have focused on unconditional models or models conditioned on labels, in this paper we develop a generative model of images conditioned on captions.

3 MODEL

Our proposed model defines a generative process of images conditioned on captions. In particular, captions are represented as a sequence of consecutive words and images are represented as a sequence of patches drawn on a canvas c_t over time t = 1, ..., T. The model can be viewed as a part of the sequence-to-sequence framework (Sutskever et al., 2014; Cho et al., 2014; Srivastava et al., 2015).

3.1 LANGUAGE MODEL: THE BIDIRECTIONAL ATTENTION RNN

Let y be the input caption, represented as a sequence of 1-of-K encoded words y = (y_1, y_2, ..., y_N), where K is the size of the vocabulary and N is the length of the sequence. We obtain the caption sentence representation by first transforming each word y_i to an m-dimensional vector representation h^lang_i, i = 1, ..., N, using the Bidirectional RNN. In a Bidirectional RNN, the two LSTMs (Hochreiter & Schmidhuber, 1997) with forget gates (Gers et al., 2000) process the input sequence from both forward and backward directions. The Forward LSTM computes the sequence of forward hidden states [→h^lang_1, →h^lang_2, ..., →h^lang_N], whereas the Backward LSTM computes the sequence of backward hidden states [←h^lang_1, ←h^lang_2, ..., ←h^lang_N]. These hidden states are then concatenated together into the sequence h^lang = [h^lang_1, h^lang_2, ..., h^lang_N], with h^lang_i = [→h^lang_i, ←h^lang_i], 1 ≤ i ≤ N.
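The concatenation of forward and backward states can be sketched with a generic recurrent cell standing in for the LSTM; the toy dimensions below (64-dimensional word vectors, 128 units per direction) are assumptions for illustration only.

```python
import numpy as np

def rnn_pass(embeddings, step, h0):
    """Run a recurrent cell over a sequence and return all hidden states."""
    h, states = h0, []
    for e in embeddings:
        h = step(e, h)
        states.append(h)
    return states

def bidirectional_states(embeddings, step_fwd, step_bwd, h0):
    """Concatenate forward and backward states: h_i = [fwd_i, bwd_i]."""
    fwd = rnn_pass(embeddings, step_fwd, h0)
    bwd = rnn_pass(embeddings[::-1], step_bwd, h0)[::-1]
    return [np.concatenate([f, b]) for f, b in zip(fwd, bwd)]

# toy usage: 128-dimensional states per direction give 256-dimensional h_i
rng = np.random.default_rng(1)
Wf = rng.normal(size=(128, 128 + 64)) * 0.1
Wb = rng.normal(size=(128, 128 + 64)) * 0.1
step_f = lambda e, h: np.tanh(Wf @ np.concatenate([h, e]))
step_b = lambda e, h: np.tanh(Wb @ np.concatenate([h, e]))
caption = [rng.normal(size=64) for _ in range(7)]   # 7 word embeddings
h_lang = bidirectional_states(caption, step_f, step_b, np.zeros(128))
print(len(h_lang), h_lang[0].shape)                 # 7 (256,)
```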


Page 24: Iclr2016 vaeまとめ

Gaussian processes and VAEs

Page 25: Iclr2016 vaeまとめ

Gaussian processes
¤ What is a Gaussian process?

¤ A probability distribution over functions
¤ For any dataset of D-dimensional input vectors, the joint distribution of the corresponding vector of function outputs is always Gaussian

¤ It is completely described by the mean vector μ and the covariance matrix K_{X,X}

show that the proposed model outperforms state of the art stand-alone deep learning architectures and Gaussian processes with advanced kernel learning procedures on a wide range of datasets, demonstrating its practical significance. We achieve scalability while retaining non-parametric model structure by leveraging the very recent KISS-GP approach (Wilson and Nickisch, 2015) and extensions in Wilson et al. (2015) for efficiently representing kernel functions, to produce scalable deep kernels.

3 Gaussian Processes

We briefly review the predictive equations and marginal likelihood for Gaussian processes (GPs), and the associated computational requirements, following the notational conventions in Wilson et al. (2015). See, for example, Rasmussen and Williams (2006) for a comprehensive discussion of GPs.

We assume a dataset D of n input (predictor) vectors X = {x_1, ..., x_n}, each of dimension D, which index an n × 1 vector of targets y = (y(x_1), ..., y(x_n))^T. If f(x) ~ GP(μ, k_γ), then any collection of function values f has a joint Gaussian distribution,

f = f(X) = [f(x_1), ..., f(x_n)]^T ~ N(μ, K_{X,X}),  (1)

with a mean vector, μ_i = μ(x_i), and covariance matrix, (K_{X,X})_{ij} = k_γ(x_i, x_j), determined from the mean function and covariance kernel of the Gaussian process. The kernel, k_γ, is parametrized by γ. Assuming additive Gaussian noise, y(x) | f(x) ~ N(y(x); f(x), σ²), the predictive distribution of the GP evaluated at the n* test points indexed by X* is given by

f* | X*, X, y, γ, σ² ~ N(E[f*], cov(f*)),  (2)
E[f*] = μ_{X*} + K_{X*,X}[K_{X,X} + σ²I]^{−1} y,
cov(f*) = K_{X*,X*} − K_{X*,X}[K_{X,X} + σ²I]^{−1} K_{X,X*}.

K_{X*,X}, for example, is an n* × n matrix of covariances between the GP evaluated at X* and X. μ_{X*} is the n* × 1 mean vector, and K_{X,X} is the n × n covariance matrix evaluated at training inputs X. All covariance (kernel) matrices implicitly depend on the kernel hyperparameters γ.

GPs with RBF kernels correspond to models which have an infinite basis expansion in a dual space, and have compelling theoretical properties: these models are universal approximators, and have prior support to within an arbitrarily small epsilon band of any continuous function (Micchelli et al., 2006). Indeed the properties of the distribution over functions induced by a Gaussian process are controlled by the kernel function. For example, the popular RBF kernel,

k_RBF(x, x′) = exp(−(1/2)||x − x′||²/ℓ²),  (3)

encodes the inductive bias that function values closer together in the input space, in the Euclidean sense, are more correlated. The complexity of the functions in the input space
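A minimal numpy sketch of the predictive equations (2), assuming a zero mean function and the RBF kernel of Eq. (3); the noise level, lengthscale and toy data are arbitrary choices for illustration.

```python
import numpy as np

def rbf_kernel(A, B, lengthscale=1.0):
    """k(x, x') = exp(-0.5 * ||x - x'||^2 / lengthscale^2), as in Eq. (3)."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * sq / lengthscale ** 2)

def gp_predict(X, y, X_star, noise_var=0.1, lengthscale=1.0):
    """GP predictive mean and covariance, Eq. (2), with a zero mean function."""
    K = rbf_kernel(X, X, lengthscale)
    K_s = rbf_kernel(X_star, X, lengthscale)
    K_ss = rbf_kernel(X_star, X_star, lengthscale)
    Kn = K + noise_var * np.eye(len(X))            # K_{X,X} + sigma^2 I
    mean = K_s @ np.linalg.solve(Kn, y)            # E[f*]
    cov = K_ss - K_s @ np.linalg.solve(Kn, K_s.T)  # cov(f*)
    return mean, cov

# toy 1-D regression example
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(20, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=20)
X_star = np.linspace(-3, 3, 50)[:, None]
mean, cov = gp_predict(X, y, X_star)
print(mean.shape, cov.shape)  # (50,) (50, 50)
```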



Figure 15.2 Left: some functions sampled from a GP prior with SE kernel. Right: some samples from a GP posterior, after conditioning on 5 noise-free observations. The shaded area represents E[f(x)] ± 2 std(f(x)). Based on Figure 2.2 of (Rasmussen and Williams 2006). Figure generated by gprDemoNoiseFree.

15.2.1 Predictions using noise-free observations

Suppose we observe a training set D = {(x_i, f_i), i = 1 : N}, where f_i = f(x_i) is the noise-free observation of the function evaluated at x_i. Given a test set X* of size N* × D, we want to predict the function outputs f*.

If we ask the GP to predict f(x) for a value of x that it has already seen, we want the GP to return the answer f(x) with no uncertainty. In other words, it should act as an interpolator of the training data. This will only happen if we assume the observations are noiseless. We will consider the case of noisy observations below.

Now we return to the prediction problem. By definition of the GP, the joint distribution has the following form

[f; f*] ~ N( [μ; μ*], [K, K*; K*^T, K**] )  (15.6)

where K = κ(X, X) is N × N, K* = κ(X, X*) is N × N*, and K** = κ(X*, X*) is N* × N*. By the standard rules for conditioning Gaussians (Section 4.3), the posterior has the following form

p(f* | X*, X, f) = N(f* | μ*, Σ*)  (15.7)
μ* = μ(X*) + K*^T K^{−1} (f − μ(X))  (15.8)
Σ* = K** − K*^T K^{−1} K*  (15.9)

This process is illustrated in Figure 15.2. On the left we show samples from the prior, p(f|X), where we use a squared exponential kernel, aka Gaussian kernel or RBF kernel. In 1d, this is given by

κ(x, x′) = σ_f² exp(−(1/(2ℓ²))(x − x′)²)  (15.10)

Here ℓ controls the horizontal length scale over which the function varies, and σ_f² controls the vertical variation. (We discuss how to estimate such kernel parameters below.) On the right we


Page 26: Iclr2016 vaeまとめ

Deep Gaussian processes
¤ To represent more complex samples, stack multiple layers via process composition [Lawrence & Moore, 07]

➡ Deep Gaussian process (deep GP)

¤ Consider a multi-layer graphical model as follows
¤ Here Y is the data and the X_l are latent variables.

Figure 1: A deep Gaussian process with two hidden layers (X_3 → X_2 → X_1 → Y, with each mapping f_l ~ GP).

Figure 2: Samples from a deep GP showing the generation of features. The upper plot shows the successive non-linear warping of a two-dimensional input space. The red circles correspond to specific locations in the input space for which a feature (a "loop") is created in layer 1. As can be seen, as we traverse the hierarchy towards the right, this feature is maintained in the next layers and is potentially further transformed.

2 DEEP GAUSSIAN PROCESSES

Gaussian processes provide flexible, non-parametric, probabilistic approaches to function estimation. However, their tractability comes at a price: they can only represent a restricted class of functions. Indeed, even though sophisticated definitions and combinations of covariance functions can lead to powerful models (Durrande et al., 2011; Gonen & Alpaydin, 2011; Hensman et al., 2013; Duvenaud et al., 2013; Wilson & Adams, 2013), the assumption about joint normal distribution of instantiations of the latent function remains; this limits the applicability of the models. One line of recent research to address this limitation focused on function composition (Snelson et al., 2004; Calandra et al., 2014). Inspired by deep neural networks, a deep Gaussian process instead employs process composition (Lawrence & Moore, 2007; Damianou et al., 2011; Lazaro-Gredilla, 2012; Damianou & Lawrence, 2013; Hensman & Lawrence, 2014).

A deep GP is a deep directed graphical model that consists of multiple layers of latent variables and employs Gaussian processes to govern the mapping between consecutive layers (Lawrence & Moore, 2007; Damianou, 2015). Observed outputs are placed in the down-most layer and observed inputs (if any) are placed in the upper-most layer, as illustrated in Figure 1. More formally, consider a set of data Y ∈ R^{N×D} with N datapoints and D dimensions. A deep GP then defines L layers of latent variables, {X_l}_{l=1}^L, X_l ∈ R^{N×Q_l}, through the following nested noise model definition:

Y = f_1(X_1) + ε_1,  ε_1 ~ N(0, σ_1²I)  (1)
X_{l−1} = f_l(X_l) + ε_l,  ε_l ~ N(0, σ_l²I),  l = 2 ... L  (2)

where the functions f_l are drawn from Gaussian processes with covariance functions k_l, i.e. f_l(x) ~ GP(0, k_l(x, x′)). In the unsupervised case, the top hidden layer is assigned a unit Gaussian as a fairly uninformative prior which also provides soft regularization, i.e. X_L ~ N(0, I). In the supervised learning scenario, the inputs of the top hidden layer are observed and govern its hidden outputs.

The expressive power of a deep GP is significantly greater than that of a standard GP, because the successive warping of latent variables through the hierarchy allows for modeling non-stationarities and sophisticated, non-parametric functional "features" (see Figure 2). Similarly to how a GP is the limit of an infinitely wide neural network, a deep GP is the limit where the parametric function composition of a deep neural network turns into a process composition. Specifically, a deep neural network can be written as:

g(x) = V^T φ_L(W_{L−1} φ_{L−1}(... W_2 φ_1(U_1 x))),  (3)

where W, U and V are parameter matrices and φ(·) denotes an activation function. By non-parametrically treating the stacked function composition g(h) = V^T φ(Uh) as process composition we obtain the deep GP definition of Equation 2.
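A toy sketch of sampling from the generative process of Eqs. (1)-(2) with two hidden layers, drawing each layer's function values from a GP prior evaluated at the previously sampled inputs. This only illustrates the forward (sampling) direction, not the variational inference scheme discussed later; kernel choices and noise levels are assumptions.

```python
import numpy as np

def rbf(A, B, lengthscale=1.0):
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * sq / lengthscale ** 2)

def sample_gp_layer(X_in, out_dim, noise_std, rng, jitter=1e-6):
    """Draw f ~ GP(0, k) at inputs X_in (one draw per output dimension) and add noise."""
    K = rbf(X_in, X_in) + jitter * np.eye(len(X_in))
    L = np.linalg.cholesky(K)
    F = L @ rng.normal(size=(len(X_in), out_dim))
    return F + noise_std * rng.normal(size=F.shape)

rng = np.random.default_rng(0)
N = 200
X3 = rng.normal(size=(N, 1))                                   # top layer: X_L ~ N(0, I)
X2 = sample_gp_layer(X3, out_dim=1, noise_std=0.01, rng=rng)   # X2 = f3(X3) + eps
X1 = sample_gp_layer(X2, out_dim=1, noise_std=0.01, rng=rng)   # X1 = f2(X2) + eps
Y  = sample_gp_layer(X1, out_dim=2, noise_std=0.01, rng=rng)   # Y  = f1(X1) + eps
print(Y.shape)  # (200, 2)
```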


Page 27: Iclr2016 vaeまとめ

Sampling from a deep Gaussian process

Page 28: Iclr2016 vaeまとめ

VAE-DGP
¤ A framework for variational inference in DGPs has already been proposed [Damianou & Lawrence 13], but it could only be trained on small datasets.
¤ This is due to the inversion of the covariance matrix and the huge number of variational parameters.

¤ Treat the inference of the DGP as the recognition model (encoder) of a VAE.
¤ The added constraint reduces the number of parameters and makes inference faster.
¤ Overfitting is suppressed compared to the conventional DGP.

➡ VAE-DGP

Variationally Auto-Encoded Deep Gaussian Processes [Dai+ 15; ICLR 2016]

Figure 3: A deep Gaussian process with three hidden layers and back-constraints: the recognition functions {g_1(y^(n))}_{n=1}^N, {g_2(μ^(n)_1)}_{n=1}^N and {g_3(μ^(n)_2)}_{n=1}^N constrain the variational posteriors of the layers.

2.1 VARIATIONAL INFERENCE

In a standard GP model, inference is performed by analytically integrating out the latent function f. In the DGP case, the latent variables have to additionally be integrated out, to obtain the marginal likelihood of DGPs over the observed data:

p(Y) = ∫ p(Y | X_1) ∏_{l=2}^{L} p(X_{l−1} | X_l) p(X_L) dX_1 ... dX_L.  (4)

The above marginal likelihood and the following derivation aim at unsupervised learning problems; however, it is straight-forward to extend the formulation to the supervised scenario by assuming observed X_L. Bayesian inference in DGPs involves optimizing the model hyper-parameters with respect to the marginal likelihood and inferring the posterior distributions of latent variables for training/testing data. The exact inference of DGPs is intractable due to the intractable integral in (4). Approximate inference techniques such as variational inference and EP have been developed (Damianou & Lawrence, 2013; Bui et al., 2015). By taking a variational approach, i.e. assuming a variational posterior distribution of latent variables, q({X_l}_{l=1}^L) = ∏_{l=1}^L q(X_l), a lower bound of the log marginal distribution can be derived as

L = Σ_{l=1}^{L} F_l + Σ_{l=1}^{L−1} H(q(X_l)) − KL(q(X_L) || p(X_L)),  (5)

where F_1 = ⟨log p(Y|X_1)⟩_{q(X_1)} and F_l = ⟨log p(X_{l−1}|X_l)⟩_{q(X_{l−1}) q(X_l)}, l = 2 ... L, are known as the free energies for individual layers. H(q(X_l)) denotes the entropy of the variational distribution q(X_l) and KL(q(X_L) || p(X_L)) denotes the Kullback-Leibler divergence between q(X_L) and p(X_L). According to the model definition, both p(Y|X_1) and p(X_{l−1}|X_l) are Gaussian processes. The variational distribution of X_l is typically parameterized as a Gaussian distribution q(X_l) = ∏_{n=1}^N N(x^(n)_l | μ^(n)_l, Σ^(n)_l).

3 VARIATIONAL AUTO-ENCODED MODEL

Damianou & Lawrence (2013) provide a tractable variational inference method for DGP by deriving a closed-form lower bound of the marginal likelihood. While successfully demonstrating strengths of DGP, the experiments that they show are limited to very small scales (hundreds of datapoints). The limitation on scalability is mostly due to the computationally expensive covariance matrix inversion and the large number of variational parameters (growing linearly with the size of data).

To scale up DGP to handle large datasets, we propose a new deep generative model, by augmenting DGP with a variationally auto-encoded inference mechanism. We refer to this inference mechanism as a recognition model (see Figure 3). A recognition model provides us with a mechanism for constraining the variational posterior distributions of latent variables. Instead of representing variational posteriors as individual variational parameters, which become a big burden to optimization, we define them as a transformation of observed data. This allows us to reduce the number of parameters for optimization (which no longer grow linearly with the size of data) and to perform fast inference at test time. A similar constraint mechanism has been referred to as a "back-constraint" in the GP literature. Lawrence & Quinonero Candela (2006) constrained the latent inputs of a GP with a parametric model to enforce local distance preservation in the inputs; Ek et al. (2008) followed the same approach for constraining the latent space with information from additional views of the data. Our formulation differs from the above in that we rather constrain a whole latent posterior
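The key idea of the recognition model is that the variational parameters of q(X_l) are produced by a mapping from the data rather than optimized freely per datapoint. A minimal MLP-style sketch of that idea follows; the layer widths, the tanh/softplus choices and the class name are assumptions for illustration, not the paper's architecture.

```python
import numpy as np

def softplus(a):
    return np.log1p(np.exp(a))

class RecognitionMLP:
    """Maps each datapoint y_n to the parameters (mu_n, sigma2_n) of q(x_n)."""

    def __init__(self, d_in, d_hidden, d_latent, rng):
        self.W1 = rng.normal(size=(d_hidden, d_in)) * 0.01
        self.W2 = rng.normal(size=(2 * d_latent, d_hidden)) * 0.01
        self.d_latent = d_latent

    def __call__(self, Y):
        H = np.tanh(Y @ self.W1.T)                 # (N, d_hidden)
        out = H @ self.W2.T                        # (N, 2 * d_latent)
        mu = out[:, :self.d_latent]
        sigma2 = softplus(out[:, self.d_latent:])  # positive variances
        return mu, sigma2

rng = np.random.default_rng(0)
Y = rng.normal(size=(1000, 560))                   # e.g. 20x28 images, flattened
g1 = RecognitionMLP(d_in=560, d_hidden=300, d_latent=20, rng=rng)
mu1, sigma2_1 = g1(Y)                              # parameters of q(X_1), tied to the data
print(mu1.shape, sigma2_1.shape)                   # (1000, 20) (1000, 20)
```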

Page 29: Iclr2016 vaeまとめ

Experiment: Missing-data imputation
¤ Imputation of missing regions in the test data

¤ The rightmost image in each example is the original image

Figure 5: (a) The samples generated from VAE-DGP trained on the combination of Frey faces and Yale faces (Frey-Yale). (b) Imputation from the test set of Frey-Yale. (c) Imputation from the test set of SVHN. The gray color indicates the missing area. The 1st column shows the input images, the 2nd column shows the imputed images and the 3rd column shows the original full images.

where K_{F1F1}, K_{U1U1} are the covariance matrices of F_1 and U_1 respectively, K_{F1U1} is the cross-covariance matrix between F_1 and U_1, and ψ_1 = Tr(⟨K_{F1F1}⟩_{q(X_1)}), Ψ_1 = ⟨K_{F1U1}⟩_{q(X_1)} and Φ_1 = ⟨K_{F1U1}^T K_{F1U1}⟩_{q(X_1)}, and Λ_1 = K_{U1U1} + Φ_1. This enables data-parallelism by distributing the computation that depends on individual datapoints and only collecting the intermediate results that do not scale with the size of data. Gal et al. (2014) and Dai et al. (2014) exploit a similar formulation for distributing the computation of BGPLVM; however, in their formulations, the gradients of variational parameters that depend on individual datapoints have to be collected centrally. Such collection severely limits the scalability of the model.

For hidden layers, the free energy terms are slightly different. Their data-dependent terms additionally involve the expectation with respect to the variational distribution of output variables: Tr(⟨X_{l−1}^T X_{l−1}⟩_{q(X_{l−1})}), Tr(Λ_{l−1} Ψ_l^T ⟨X_{l−1} X_{l−1}^T⟩_{q(X_{l−1})} Ψ_l), Ψ_l and Φ_l. The first term can be naturally reformulated as a sum across datapoints:

Tr(⟨X_{l−1}^T X_{l−1}⟩_{q(X_{l−1})}) = Σ_{n=1}^{N} (μ^(n)_{l−1})^T μ^(n)_{l−1} + Tr(Σ^(n)_{l−1}).  (11)

For the second term, we can rewrite ⟨X_{l−1} X_{l−1}^T⟩_{q(X_{l−1})} = R_{l−1}^T R_{l−1} + A_{l−1} A_{l−1}, where R_{l−1} = [(μ^(1)_{l−1})^T, ..., (μ^(N)_{l−1})^T], A_{l−1} = diag(α^(1)_{l−1}, ..., α^(N)_{l−1}) and α^(n)_{l−1} = Tr(Σ^(n)_{l−1})^{1/2}. This enables us to formulate it into a distributable form:

Tr(Λ_{l−1} Ψ_l^T ⟨X_{l−1} X_{l−1}^T⟩_{q(X_{l−1})} Ψ_l) = Tr( Λ_{l−1} (Ψ_l^T R_{l−1}^T)(R_{l−1} Ψ_l) ) + Tr( Λ_{l−1} ( Σ_{n=1}^{N} Ψ_l^(n) α^(n)_{l−1} )( Σ_{n=1}^{N} Ψ_l^(n) α^(n)_{l−1} )^T ).  (12)

With the above formulations, we obtain a distributable variational lower bound. For optimization, the gradients of all the model and variational parameters can be derived with respect to the lower bound. As the variational distributions q({X_l}_{l=1}^L) are computed according to the recognition model, the gradients of q({X_l}_{l=1}^L) are back-propagated (through the recognition model), which allows computing the gradients of its parameters.

5 EXPERIMENTS

As a probabilistic generative model, VAE-DGP is applicable to a range of different tasks such as data generation, data imputation, etc. In this section we evaluate our model in a variety of problems and compare it with the alternatives in the literature.


5.1 UNSUPERVISED LEARNING

Table 1: Log-likelihood for the MNIST test data with different models. The baselines are DBN and Stacked CAE (Bengio et al., 2013), Deep GSN (Bengio et al., 2014), Adversarial nets (Goodfellow et al., 2014) and GMMN+AE (Li et al., 2015).

Model              | MNIST
DBN                | 138 ± 2
Stacked CAE        | 121 ± 1.6
Deep GSN           | 214 ± 1.1
Adversarial nets   | 225 ± 2
GMMN+AE            | 282 ± 2
VAE-DGP (5)        | 301.67
VAE-DGP (10-50)    | 674.86
VAE-DGP (5-20-50)  | 723.65

Figure 6: Samples of imputation on the test sets. The gray color indicates the missing area. The 1st column shows the input images, the 2nd column shows the imputed images and the 3rd column shows the original full images.

We first apply our model to the combination of Frey faces and Yale faces (Frey-Yale). The Frey faces set contains 1956 20×28 frames taken from a video clip. The Yale faces set contains 2414 images, which are resized to 20×28. We take the last 200 frames from the Frey faces and 300 images randomly from Yale faces as the test set and use the rest for training. The intensity of the original gray-scale images is normalized to [0, 1]. The applied VAE-DGP has two hidden layers (a 2D top hidden layer and a 20D middle hidden layer). The exponentiated quadratic kernel is used for all the layers with 100 inducing points. All the MLPs in the recognition model have two hidden layers with widths (500-300). As a generative model, we can draw samples from the learned model by sampling first from the prior distribution of the top hidden layer (a 2D unit Gaussian distribution in this case) and layer-wise downwards. The generated images are shown in Figure 5a.

To evaluate the ability of our model to learn the data distribution, we train the VAE-DGP on MNIST (LeCun et al., 1998). We use the whole training set for learning, which consists of 60,000 28×28 images. The intensity of the original gray-scale images is normalized to [0, 1]. We train our model with three different model settings (one, two and three hidden layers). The trained models are evaluated by the log-likelihood of the test set3, which consists of 10,000 images. The results are shown in Table 1 along with some baseline performances taken from the literature. The numbers in the parentheses indicate the dimensionality of hidden layers from top to bottom. The exponentiated quadratic kernel is used for all the layers with 300 inducing points. All the MLPs in the recognition model have two hidden layers with widths (500-300). All our models are trained as a whole from a randomly initialized recognition model.

5.2 DATA IMPUTATION

We demonstrate the model's ability to impute missing data by showing half of the images in the test set. We use the learned VAE-DGP to impute the other half of the images. This is a challenging problem because there might be ambiguities in the answers. For instance, by showing the right half of a digit "8", the answers "3" and "8" are both reasonable. We show the imputation performance for the test images in Frey-Yale and MNIST in Fig. 5b and Fig. 6 respectively. We also apply VAE-DGP to the street view house number dataset (SVHN) (Netzer et al., 2011). We use three hidden layers with the dimensionality of latent space from top to bottom (5-30-500). The top two hidden layers use the exponentiated quadratic kernel and the observed layer uses the linear kernel with 500 inducing points. The learned model is used for imputing the images in the test set (see Fig. 5c).

3 As a non-parametric model, the test log-likelihood of VAE-DGP is formulated as (1/N*) log p(Y*|Y), where Y* is the test data and Y is the training data. As the true test log-likelihood is intractable, we approximate it as (1/N*)(L(Y*, Y) − L(Y)).


Page 30: Iclr2016 vaeまとめ

Experiment: Quantitative evaluation
¤ Log-likelihood (MNIST)

¤ Supervised learning (regression)
¤ Datasets:

¤ The Abalone dataset

¤ The Creep dataset


Figure 7: Bayesian optimization experiments for the Branin function using a standard GP and our VAE-DGP.

Table 2: MSE obtained from our VAE-DGP, standard GP and linear regression for the Abalone and Creep benchmarks.

Model     | Abalone
VAE-DGP   | 825.31 ± 64.35
GP        | 888.96 ± 78.22
Lin. Reg. | 917.31 ± 53.76

Model     | Creep
VAE-DGP   | 575.39 ± 29.10
GP        | 602.11 ± 29.59
Lin. Reg. | 1865.76 ± 23.36

5.3 SUPERVISED LEARNING AND BAYESIAN OPTIMIZATION

In this section we consider two supervised learning problem instances: regression and Bayesian optimization (BO) (Osborne, 2010; Snoek et al., 2012b). We demonstrate the utility of VAE-DGP in these settings by evaluating its performance in terms of predictive accuracy and predictive uncertainty quantification. For these experiments we use a VAE-DGP with one hidden layer (and one observed input layer) and exponentiated quadratic covariance functions. Furthermore, we incorporate the deep GP modification of Duvenaud et al. (2014) so that the observed input layer has an additional connection to the output layer. Duvenaud et al. (2014) showed that this modification increases the general stability of the method. Since the sample size of the data considered for supervised learning is relatively small, we do not use the recognition model to back-constrain the variational distributions.

In the regression experiments we use the Abalone dataset (4177 1-dimensional outputs and 8-dimensional inputs) from UCI and the Creep dataset (2066 1-dimensional outputs and 30-dimensional inputs) from (Cole et al., 2000). A typical split for this data is to use 1000 (Abalone) and 800 (Creep) instances for training. We used 100 inducing inputs for each layer and performed 4 runs with different random splits. We summarize the results in Table 2.

Next, we show how the VAE-DGP can be used in the context of probabilistic numerics, in particular for Bayesian optimization (BO) (Osborne, 2010; Snoek et al., 2012b). In BO, the goal is to find x_min = argmin_{x ∈ X} f(x) for X ⊂ R^Q where a limited number of evaluations are available. Typically, a GP is used to fit the available data, as a surrogate model. The GP is iteratively updated with new function evaluations and used to build an acquisition function able to guide the collection of new observations of f. This is done by balancing exploration (regions with large uncertainty) and exploitation (regions with a low mean). In BO, the model is a crucial element of the process: it should be able to express complex classes of functions and to provide coherent estimates of the function uncertainty. In this experiment we use the non-stationary Branin function4 to compare the performance of standard GPs and the VAE-DGP in the context of BO. We used the popular expected improvement (Jones et al., 1998) acquisition function and we ran 10 replicates of the experiment using different initializations, each kicking off optimization with 3 points randomly selected from the function's domain X. In each replicate we iteratively collected 20 evaluations of the function. In the VAE-DGP we used 30 inducing points. Figure 7 shows that using the VAE-DGP as a surrogate model results in a gain, especially in the first steps of the optimization. This is due to the ability of VAE-DGP to deal with the non-stationary components of the function and to model a much richer class of distributions (e.g. multi-modal) in the output layer (as opposed to the standard GP which assumes joint Gaussianity in the outputs).

4 See http://www.sfu.ca/~ssurjano/optimization.html for details. The default domain is used in the experiments.


Page 31: Iclr2016 vaeまとめ

Mean-field approximation in variational inference
¤ So far we have taken the approximate distribution of the VAE to be q(z|x)

¤ q(z|x) is represented by a neural network

¤ In general, the approximate distribution is factorized with a mean-field approximation.

¤ Richer approximate distributions can also be considered
¤ Treat the parameters λ as random variables and place a prior on them (hierarchical variational models)

log p(x) = L(x) + KL(q(z|x) || p(z|x))


Variational Models• We want to compute posterior p(z|x) (z: latent variables, x: data)

• Variational inference seeks to minimize for a family q(z;�)

KL(q(z;�)||p(z|x))

• Maximizing evidence lower bound (ELBO)

log p(x) � Eq(z;�)[log p(x|z)]�KL(q(z;�)||p(z))

• (Common) Mean-field distribution q(z;�) =Y

i

q(zi;�i)

• Hierarchical variational models

• (Newer) Interpret the family as a variational model for posterior latent variables z (introducing new latent variables)[1]

Lawrence, N. (2000). Variational Inference in Probabilistic Models. PhD thesis.
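As a concrete illustration of the ELBO bullet above (the sketch referenced in that list), the following minimal numpy example, which is my own and not taken from any of the papers, estimates E_{q(z;λ)}[log p(x|z)] − KL(q(z;λ) || p(z)) for a fully factorized Gaussian q and a standard normal prior; the toy likelihood p(x|z) = N(x; z, I) is an assumption made purely for the demo.

import numpy as np

rng = np.random.default_rng(0)

def elbo_estimate(x, mu, log_sigma, log_likelihood, n_samples=10):
    """Monte Carlo ELBO for a mean-field Gaussian q(z; lambda) = prod_i N(z_i; mu_i, sigma_i^2)."""
    sigma = np.exp(log_sigma)
    # Reparameterized samples z = mu + sigma * eps, with eps ~ N(0, I)
    eps = rng.standard_normal((n_samples, mu.shape[0]))
    z = mu + sigma * eps
    # Expected log-likelihood term E_q[log p(x|z)], estimated by Monte Carlo
    rec = np.mean([log_likelihood(x, z_s) for z_s in z])
    # Analytic KL(q || N(0, I)) for a diagonal Gaussian
    kl = 0.5 * np.sum(sigma**2 + mu**2 - 1.0 - 2.0 * log_sigma)
    return rec - kl

# Hypothetical toy model: p(x|z) = N(x; z, I), so log p(x|z) is a Gaussian log-density.
def toy_log_likelihood(x, z):
    return -0.5 * np.sum((x - z)**2 + np.log(2 * np.pi))

x = np.array([0.5, -1.0])
print(elbo_estimate(x, mu=np.zeros(2), log_sigma=np.zeros(2), log_likelihood=toy_log_likelihood))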

Page 32: Iclr2016 vaeまとめ

Variational Gaussian processes ¤ The Variational Gaussian Process [Tran+ 15; ICLR 2016]

¤ Proposes a very powerful variational model

¤ Take D = {(s_n, t_n)} as variational data (parameters) and consider the following generative process for z: ¤ draw a latent input ξ ¤ draw the non-linear mapping f from a Gaussian process conditioned on D

¤ draw the latent variables z

[Figure: "Variational Gaussian Processes" slides (pp. 7-8), illustrating this generative process]

Page 33: Iclr2016 vaeまとめ

Likelihood of the variational Gaussian process ¤ From the generative process on the previous page, the marginal density of the latent variables z is

¤ For an approximate distribution modeled in this way, whatever distribution p(z|x) is, there exist parameters such that KL(q || p(z|x)) = 0 (universal approximation) ¤ in other words, this is a far more flexible model than any previous approach.


as parameters to a mean-field distribution. The random mappings are drawn conditional on "variational data," which is itself learned as part of variational inference. We will show that the VGP enables samples from the mean-field to follow arbitrarily complex posteriors.

Let D = {(s_n, t_n)}_{n=1}^m be variational data, comprising input-output pairs that are parameters to the variational distribution. (This idea appears in a different context in Blei & Lafferty (2006).) The VGP specifies the following generative process for posterior latent variables z (a sampling sketch in code follows Eq. 4 below):

1. Draw latent input ξ ∈ R^c: ξ ∼ N(0, I).

2. Draw non-linear mapping f : R^c → R^d conditioned on D: f ∼ ∏_{i=1}^d GP(0, K_ξξ) | D.

3. Draw approximate posterior samples z ∈ supp(p): z = (z_1, . . . , z_d) ∼ ∏_{i=1}^d q(z_i | f_i(ξ)).

Figure 1 displays a graphical model for the VGP. Marginalizing over all non-linear mappings and latent inputs, the VGP is

q_VGP(z; θ, D) = ∫∫ [ ∏_{i=1}^d q(z_i | f_i(ξ)) ] [ ∏_{i=1}^d GP(f_i; 0, K_ξξ) | D ] N(ξ; 0, I) df dξ,   (4)

which is parameterized by the kernel hyperparameters θ and the variational data D.
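The following is a minimal numpy sketch of one draw from the generative process behind Eq. 4, under simplifying assumptions of my own: an exponentiated quadratic kernel, a unit-variance Gaussian mean-field q(z_i | f_i(ξ)) = N(z_i; f_i(ξ), 1), and small toy variational data D = {(s_n, t_n)}. It only illustrates the three sampling steps, not the authors' implementation.

import numpy as np

rng = np.random.default_rng(0)

def rbf(A, B, lengthscale=1.0):
    """Exponentiated quadratic kernel matrix between the row vectors of A and B."""
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-0.5 * d2 / lengthscale**2)

def sample_vgp(S, T, c, jitter=1e-8):
    """One draw z ~ q_VGP(z) given variational data D = {(s_n, t_n)}: S is (m x c), T is (m x d)."""
    m, d = T.shape
    xi = rng.standard_normal((1, c))                     # 1. latent input xi ~ N(0, I_c)
    Kss = rbf(S, S) + jitter * np.eye(m)
    Kxs = rbf(xi, S)
    Kxx = rbf(xi, xi)
    A = np.linalg.solve(Kss, Kxs.T)                      # K_ss^{-1} K_sx
    mean = (Kxs @ np.linalg.solve(Kss, T)).ravel()       # GP conditional mean, one value per output dim
    var = (Kxx - Kxs @ A).item()                         # conditional variance at xi, shared across dims
    f = mean + np.sqrt(max(var, 0.0)) * rng.standard_normal(d)   # 2. f_i(xi) | D
    z = f + rng.standard_normal(d)                       # 3. z_i ~ q(z_i | f_i(xi)) = N(f_i(xi), 1)
    return z

# Toy variational data: m = 5 pairs, latent input dim c = 2, latent variable dim d = 3.
S = rng.standard_normal((5, 2))
T = rng.standard_normal((5, 3))
print(sample_vgp(S, T, c=2))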

As a variational model, the VGP forms an infinite ensemble of mean-field distributions. A mean-field distribution is specified conditional on a fixed function f(·) and input ξ; the d outputs f_i(ξ) = λ_i are the mean-field's parameters. The VGP is a form of hierarchical variational model (Ranganath et al., 2015); it places a continuous Bayesian nonparametric prior over mean-field parameters.

Note that the VGP evaluates the d draws from a GP at the same latent input ξ, which induces correlation between their outputs, the mean-field parameters. In turn, this induces correlation between latent variables of the variational model, correlations that are not captured in classical mean-field. Finally, the complex non-linear mappings drawn from the GP make the VGP a flexible model for complex discrete and continuous posteriors.

We emphasize that the VGP needs variational data because, unlike typical GP regression, there is no observed data available to learn a distribution over non-linear mappings. The variational data appear in the conditional distribution of f in Eq. 3, anchoring the random non-linear mappings at certain input-output pairs. Thus, when we optimize the VGP, the learned variational data enable a complex distribution of variational parameters f(ξ).

We also study the generative process of the VGP by limiting it in various ways; in doing so we recover well-known models as special cases. In particular, we recover the discrete mixture of mean-field distributions (Bishop et al., 1998; Jaakkola & Jordan, 1998; Lawrence, 2000), which is a classically studied model with dependencies between latent variables, as well as a form of factor analysis (Tipping & Bishop, 1999) in the variational space. The mathematical details are in Appendix A.

2.4 UNIVERSAL APPROXIMATION THEOREM

To gain intuition about how the VGP adapts to complex distributions, we analyze the role of the GP. For simplicity, suppose the latent variables z are real-valued, and the VGP treats the output of the function draws from the GP as posterior samples. Consider the optimal function f*, i.e., the transformation such that when we draw ξ ∼ N(0, I) and calculate z = f*(ξ), the resulting marginal of z is the posterior distribution.

An explicit construction of f* exists if the dimension of the latent input ξ is equal to the number of latent variables. Let P^{-1} denote the inverse posterior CDF and Φ the standard normal CDF. Using techniques common in the copula literature (Nelsen, 2006), the optimal function is

f*(ξ) = P^{-1}(Φ(ξ_1), . . . , Φ(ξ_d)).

Imagine generating samples z using this function. For latent input ξ ∼ N(0, I), the standard normal CDF Φ applies the probability integral transform: it squashes ξ_i such that its output u_i = Φ(ξ_i) follows a uniform distribution. The inverse posterior CDF then transforms the uniform random variables, P^{-1}(u_1, . . . , u_d) = z, to follow the posterior. The function provides exact samples.
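To see this construction run, here is a one-dimensional toy example of my own: the "posterior" is an Exponential(1) distribution, whose quantile function is available in closed form, so f*(ξ) = P^{-1}(Φ(ξ)) can be evaluated exactly.

import numpy as np
from scipy.stats import norm, expon

rng = np.random.default_rng(0)

xi = rng.standard_normal(100_000)      # latent input xi ~ N(0, 1)
u = norm.cdf(xi)                       # probability integral transform: u ~ Uniform(0, 1)
z = expon.ppf(u)                       # f*(xi) = P^{-1}(Phi(xi)): exact samples from the target

print(z.mean(), z.var())               # both close to 1, the Exponential(1) moments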


Recap: log p(x) = L(x) + KL(q(z|x) || p(z|x))


Figure 2: Sequence of domain mappings during inference, from variational latent variable space R to posterior latent variable space Q to data space P. We perform variational inference in the posterior space and auxiliary inference in the variational space.

The VGP addresses the task of posterior inference by learning f*: conditional on variational data, which are parameters to learn, the distribution of the GP learns to concentrate around this optimal function during inference. This perspective provides intuition behind the following result.

Theorem 1 (Universal approximation). Let q(z; θ, D) denote the variational Gaussian process. For any posterior distribution p(z | x) with a finite number of latent variables and continuous quantile function (inverse CDF), there exists a set of parameters (θ, D) such that

KL(q(z; θ, D) || p(z | x)) = 0.

See Appendix B for a proof. Theorem 1 states that any posterior distribution with strictly positive density can be represented by a VGP. Thus the VGP is a flexible model for learning posterior distributions.

3 BLACK BOX INFERENCE

3.1 VARIATIONAL OBJECTIVE

We derive an algorithm for performing black box inference over a wide class of generative models. The original ELBO (Eq. 1) is analytically intractable due to the log density log q_VGP(z) (Eq. 4). We derive a tractable variational objective inspired by auto-encoders.

Specifically, a tractable lower bound to the model evidence log p(x) can be derived by subtracting an expected KL divergence term from the ELBO:

log p(x) ≥ E_{q_VGP}[log p(x | z)] − KL(q_VGP(z) || p(z)) − E_{q_VGP}[ KL(q(ξ, f | z) || r(ξ, f | z)) ],

where r(ξ, f | z) is an auxiliary model. Such an objective has been considered independently by Salimans et al. (2015) and Ranganath et al. (2015). Variational inference is performed in the posterior latent variable space, minimizing KL(q || p) to learn the variational model; for this to occur, auxiliary inference is performed in the variational latent variable space, minimizing KL(q || r) to learn an auxiliary model. See Figure 2.

Unlike previous approaches, we rewrite this variational objective to connect to auto-encoders:

L̃ = E_{q_VGP}[log p(x | z)] − E_{q_VGP}[ KL(q(z | f(ξ)) || p(z)) + KL(q(ξ, f) || r(ξ, f | z)) ],   (5)

where the KL divergences are now taken over tractable distributions (see Appendix C). In auto-encoder parlance, we maximize the expected negative reconstruction error, regularized by an expected divergence between the variational model and the original model's prior, and an expected divergence between the auxiliary model and the variational model's prior. This is simply a nested instantiation of the variational auto-encoder bound (Kingma & Welling, 2014): a KL divergence between the inference model and a prior is taken as a regularizer on both the posterior and variational spaces. This interpretation justifies the previously proposed bound for variational models; as we shall see, it also enables lower variance gradients during stochastic optimization.
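A schematic sketch of how the three terms of Eq. 5 combine, under a simplifying assumption that is mine and not the paper's: every distribution involved is a diagonal Gaussian with known parameters, and the reconstruction term E[log p(x|z)] is assumed to be estimated by Monte Carlo elsewhere. The actual VGP terms are derived in Appendix C of the paper.

import numpy as np

def kl_diag_gauss(mu_q, var_q, mu_p, var_p):
    """Analytic KL( N(mu_q, diag var_q) || N(mu_p, diag var_p) )."""
    return 0.5 * np.sum(np.log(var_p / var_q) + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0)

def eq5_estimate(expected_log_lik, mu_z, var_z, mu_q_xf, var_q_xf, mu_r_xf, var_r_xf):
    """Schematic Eq. 5: reconstruction - KL(q(z|f(xi)) || p(z)) - KL(q(xi,f) || r(xi,f|z))."""
    kl_prior = kl_diag_gauss(mu_z, var_z, np.zeros_like(mu_z), np.ones_like(var_z))  # p(z) = N(0, I)
    kl_aux = kl_diag_gauss(mu_q_xf, var_q_xf, mu_r_xf, var_r_xf)
    return expected_log_lik - kl_prior - kl_aux

# Toy numbers only, to show the call signature.
d, cd = 3, 5
print(eq5_estimate(-120.0,
                   np.zeros(d), np.ones(d),
                   np.zeros(cd), np.ones(cd),
                   0.1 * np.ones(cd), 0.5 * np.ones(cd)))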


Page 34: Iclr2016 vaeまとめ

The lower bound ¤ Training maximizes the auto-encoder-style lower bound above (Eq. 5)

¤ Intuitively, the picture is as follows ¤ the approximate (variational) model generates z from x

¤ the auxiliary model generates the mapping f and the latent input ξ from x and z


Reconstruction error / regularization term / auxiliary model term (the three terms of Eq. 5)


3.2 AUTO-ENCODING VARIATIONAL MODELS

Inference networks provide a flexible parameterization of approximating distributions, as used in Helmholtz machines (Hinton & Zemel, 1994), deep Boltzmann machines (Salakhutdinov & Larochelle, 2010), and variational auto-encoders (Kingma & Welling, 2014; Rezende et al., 2014). It replaces local variational parameters with global parameters coming from a neural network. Specifically, for latent variables z_n which correspond to a data point x_n, an inference network specifies a neural network which takes x_n as input and its local variational parameters λ_n as output. This amortizes inference by only defining a set of global parameters.

To auto-encode the VGP we specify inference networks to parameterize both the variational and auxiliary models. Unique from other auto-encoder approaches, we let the auxiliary model take both the observed data point x_n and the variational data point z_n as input:

x_n ↦ q(z_n | x_n; θ_n),   x_n, z_n ↦ r(ξ_n, f_n | x_n, z_n; φ_n),

where q has local variational parameters given by the variational data D_n, and r is specified as a fully factorized Gaussian with local variational parameters φ_n = (μ_n ∈ R^{c+d}, σ²_n ∈ R^{c+d}).¹

Note that by letting r's inference network take both x_n and z_n as input, we avoid the restrictive explicit specification of r(ε, f | z). This idea was first suggested, but not implemented, in Ranganath et al. (2015).
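A minimal sketch of the two amortization maps described above, with toy one-hidden-layer numpy networks and toy sizes of my own choosing (the paper fixes the number of variational data points to 500); only the input/output interfaces matter here, not the architecture.

import numpy as np

rng = np.random.default_rng(0)

def mlp(in_dim, out_dim, hidden=64):
    """A tiny one-hidden-layer network; returns a forward function with fixed random weights."""
    W1 = 0.1 * rng.standard_normal((in_dim, hidden)); b1 = np.zeros(hidden)
    W2 = 0.1 * rng.standard_normal((hidden, out_dim)); b2 = np.zeros(out_dim)
    def forward(v):
        return np.tanh(v @ W1 + b1) @ W2 + b2
    return forward

x_dim, z_dim, c, m = 784, 50, 50, 20   # toy sizes: data dim, latent dim d, latent input dim c, number of variational pairs

# q-side inference network: x_n -> variational data D_n = {(s_k, t_k)}, with s_k in R^c, t_k in R^d
q_net = mlp(x_dim, m * (c + z_dim))
# r-side inference network: (x_n, z_n) -> (mu_n, log var_n) of the factorized Gaussian r over (xi_n, f_n)
r_net = mlp(x_dim + z_dim, 2 * (c + z_dim))

x_n = rng.standard_normal(x_dim)
D_n = q_net(x_n).reshape(m, c + z_dim)               # local variational data produced from x_n alone
z_n = rng.standard_normal(z_dim)                     # stand-in for a sample from q(z_n | x_n)
out = r_net(np.concatenate([x_n, z_n]))
mu_n, log_var_n = out[: c + z_dim], out[c + z_dim:]  # local parameters of r(xi_n, f_n | x_n, z_n)
print(D_n.shape, mu_n.shape, log_var_n.shape)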

3.3 STOCHASTIC OPTIMIZATION

We maximize the variational objective L̃(θ, φ) over both θ and φ, where θ newly denotes both the kernel hyperparameters and the inference network's parameters for the VGP, and φ denotes the inference network's parameters for the auxiliary model. Following the standard procedure in black box methods, we write the gradient as an expectation and apply stochastic approximations (Robbins & Monro, 1951), sampling from the variational model and evaluating stochastic gradients.

First, we reduce variance of the stochastic gradients by analytically deriving any tractable expectations. The KL divergence between r(ξ, f | z) and q(ξ, f) is analytic, as we've specified both joint distributions to be Gaussian. The KL divergence between q(z | f(ξ)) and p(z) is standard and used to reduce variance in traditional variational auto-encoders: it is analytic for widely used deep generative models such as the deep latent Gaussian model (Rezende et al., 2014) and the deep recurrent attentive writer (Gregor et al., 2015). See Appendix C for these calculations.

To derive black box gradients, we can first reparameterize the VGP, separating noise generation of samples from the parameters in its generative process (Kingma & Welling, 2014; Rezende et al., 2014). The GP easily enables reparameterization: for latent inputs ξ ∼ N(0, I), the transformation f(ξ; θ) = Lξ + K_ξs K_ss^{-1} t_i is a location-scale transform, where L L^T = K_ξξ − K_ξs K_ss^{-1} K_ξs^T.

This is equivalent to evaluating ξ with a random mapping from the GP. Suppose the mean-field q(z | f(ξ)) is also reparameterizable, and let ε ∼ w such that z(ε; f) ∼ q(z | f(ξ)). This two-level reparameterization is equivalent to the generative process for z outlined in Section 2.3.
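The location-scale transform above can be written out directly; this is a sketch under assumptions of my own (an exponentiated quadratic kernel, toy variational data, a single shared conditional variance, and the paper's convention that the latent input dimension equals the number of latent variables, so ξ doubles as the standard normal noise for the d independent outputs).

import numpy as np

rng = np.random.default_rng(0)

def rbf(A, B, lengthscale=1.0):
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-0.5 * d2 / lengthscale**2)

def reparam_f(xi, S, T, jitter=1e-8):
    """f(xi; theta) = L xi + K_xs K_ss^{-1} t, with L L^T = K_xx - K_xs K_ss^{-1} K_xs^T.
    Evaluated at the single latent input xi, the conditional variance is a scalar shared
    by the d output dimensions, so L reduces to its square root."""
    m, d = T.shape
    Kss = rbf(S, S) + jitter * np.eye(m)
    kxs = rbf(xi[None, :], S)                               # 1 x m cross-covariances at xi
    kxx = rbf(xi[None, :], xi[None, :]).item()              # scalar prior variance at xi
    mean = (kxs @ np.linalg.solve(Kss, T)).ravel()          # K_xs K_ss^{-1} t_i, one per output
    var = kxx - (kxs @ np.linalg.solve(Kss, kxs.T)).item()  # conditional variance
    L = np.sqrt(max(var, 0.0))
    return L * xi + mean                                    # deterministic map of the noise xi

d = 3
S = rng.standard_normal((5, d))     # variational inputs s_n (here c = d)
T = rng.standard_normal((5, d))     # variational outputs t_n
xi = rng.standard_normal(d)         # xi ~ N(0, I), reused as the noise
print(reparam_f(xi, S, T))

Because the map from ξ to f is deterministic given the kernel parameters and variational data, gradients can flow through it, which is the point of the reparameterization in Eq. 6 below.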

We now rewrite the variational objective as

L̃(θ, φ) = E_{N(ξ)}[ E_{w(ε)}[ log p(x | z(ε; f(ξ; θ))) ] ]
         − E_{N(ξ)}[ E_{w(ε)}[ KL(q(z | f(ξ; θ)) || p(z)) + KL(q(ξ, f; θ) || r(ξ, f | z(ε; f(ξ; θ)); φ)) ] ].   (6)

Eq. 6 enables gradients to move inside the expectations and backpropagate over the nested reparameterization. Thus we can take unbiased stochastic gradients, which exhibit low variance due to both the analytic KL terms and reparameterization. The gradients are derived in Appendix D, including the case when the first KL is analytically intractable.

An outline is given in Algorithm 1. For massive data, we apply subsampling on x (Hoffman et al., 2013). For gradients of the model log-likelihood, we employ convenient differentiation tools such as those in Stan and Theano (Carpenter et al., 2015; Bergstra et al., 2010). For non-differentiable latent variables z, or mean-field distributions without efficient reparameterizations, we apply the score function estimator for gradients of expectations with respect to the mean-field (Ranganath et al., 2014); a minimal sketch follows the footnote below.

¹ We let the kernel hyperparameters of the VGP be fixed across data points.
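For the non-reparameterizable case mentioned above (the sketch referenced there), here is a minimal score function (REINFORCE-style) estimator of the gradient of E_{q(z;λ)}[g(z)] with respect to λ, for a Bernoulli mean-field q; the objective g is a hypothetical stand-in for whatever term's expectation is being differentiated.

import numpy as np

rng = np.random.default_rng(0)

def score_function_grad(g, logits, n_samples=2000):
    """Estimate d/d(logits) of E_{q(z; logits)}[g(z)], with q(z_i) = Bernoulli(sigmoid(logits_i)),
    using the identity grad E[g] = E[ g(z) * grad log q(z) ]."""
    p = 1.0 / (1.0 + np.exp(-logits))
    z = (rng.random((n_samples, logits.shape[0])) < p).astype(float)   # samples from the mean-field q
    score = z - p                     # grad_logits log q(z) under the sigmoid parameterization
    gz = np.array([g(z_s) for z_s in z])[:, None]
    return np.mean(gz * score, axis=0)

# For g(z) = sum(z) and logits = 0, the true gradient is 0.25 per coordinate (noisy estimate below).
print(score_function_grad(lambda z: np.sum(z), logits=np.zeros(4)))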



Page 35: Iclr2016 vaeまとめ

Experiments: log-likelihood ¤ The negative log-likelihood enters the previously unreached 70s

¤ The best model uses DRAW for the generative part and the VGP as the approximate distribution

Published as a conference paper at ICLR 2016

Model                                   − log p(x)
DLGM + VAE [1]                          86.76
DLGM + HVI (8 leapfrog steps) [2]       85.51   88.30
DLGM + NF (k = 80) [3]                  85.10
EoNADE-5 2hl (128 orderings) [4]        84.68
DBN 2hl [5]                             84.55
DARN 1hl [6]                            84.13
Convolutional VAE + HVI [2]             81.94   83.49
DLGM 2hl + IWAE (k = 50) [1]            82.90
DRAW [7]                                80.97
DLGM 1hl + VGP                          84.79
DLGM 2hl + VGP                          81.32
DRAW + VGP                              79.88

Table 1: Negative predictive log-likelihood for binarized MNIST. Previous best results are [1] (Burda et al., 2016), [2] (Salimans et al., 2015), [3] (Rezende & Mohamed, 2015), [4] (Raiko et al., 2014), [5] (Murray & Salakhutdinov, 2009), [6] (Gregor et al., 2014), [7] (Gregor et al., 2015).

nonparametric priors such as an infinite mixture of mean-field distributions, the GP enables black box inference with lower variance gradients: it applies a location-scale transform for reparameterization and has analytically tractable KL terms.

Transformations, which convert samples from a tractable distribution to the posterior, are a classic technique in Bayesian inference. They were first studied in Monte Carlo methods, where they are core to the development of methods such as path sampling, annealed importance sampling, and sequential Monte Carlo (Gelman & Meng, 1998; Neal, 1998; Chopin, 2002). These methods can be recast as specifying a discretized mapping f_t for times t_0 < . . . < t_k, such that for draws ξ from the tractable distribution, f_{t_0}(ξ) outputs the same samples and f_{t_k}(ξ) outputs exact samples following the posterior. By applying the sequence in various forms, the transformation bridges the tractable distribution to the posterior. Specifying a good transformation, termed a "schedule" in the literature, is crucial to the efficiency of these methods. Rather than specify it explicitly, the VGP adaptively learns this transformation and avoids discretization.

Limiting the VGP in various ways recovers well-known probability models as variational approximations. Specifically, we recover the discrete mixture of mean-field distributions (Bishop et al., 1998; Jaakkola & Jordan, 1998). We also recover a form of factor analysis (Tipping & Bishop, 1999) in the variational space. Mathematical details are in Appendix A.

5 EXPERIMENTS

Following standard benchmarks for variational inference in deep learning, we learn generative models of images. In particular, we learn the deep latent Gaussian model (DLGM) (Rezende et al., 2014), a layered hierarchy of Gaussian random variables following neural network architectures, and the recently proposed Deep Recurrent Attentive Writer (DRAW) (Gregor et al., 2015), a latent attention model that iteratively constructs complex images using a recurrent architecture and a sequence of variational auto-encoders (Kingma & Welling, 2014).

For the learning rate we apply a version of RMSProp (Tieleman & Hinton, 2012), in which we scale the value with a decaying schedule 1/t^{1/2+ε} for ε > 0. We fix the size of the variational data to be 500 across all experiments and set the latent input dimension equal to the number of latent variables.

5.1 BINARIZED MNIST

The binarized MNIST data set (Salakhutdinov & Murray, 2008) consists of 28x28 pixel images with binary-valued outcomes. Training a DLGM, we apply two stochastic layers of 100 random variables and 50 random variables respectively, and in-between each stochastic layer is a deterministic


Page 36: Iclr2016 vaeまとめ

Summary ¤ This talk focused on the VAE-related work at ICLR 2016

¤ Covered variational inference and VAEs ¤ conditional VAEs and semi-supervised learning ¤ Gaussian processes and VAEs

¤ Impressions ¤ got a rough sense of the trends at ICLR ¤ it was hard to summarize everything