Improving Variational Inference with Inverse Autoregressive Flow
Jan. 19, 2017
Tatsuya Shirakawa ([email protected])

Paper by: Diederik P. Kingma (OpenAI), Tim Salimans (OpenAI), Rafal Jozefowicz (OpenAI), Xi Chen (OpenAI), Ilya Sutskever (OpenAI), Max Welling (University of Amsterdam)


1

Variational Autoencoder (VAE)

log p(x) ≥ 𝔼_{q(z|x)}[log p(x, z) − log q(z|x)]
         = log p(x) − D_KL(q(z|x) ∥ p(z|x))
         = 𝔼_{q(z|x)}[log p(x|z)] − D_KL(q(z|x) ∥ p(z))
         =: ℒ(x; θ)

Generative model: z ~ p(z; η), x ~ p(x|z; η)

Optimization:

maximize over η: (1/N) Σ_{n=1}^{N} log p(x_n; η)

Inference model: z ~ q(z|x; ν)

Optimization (ELBO):

maximize over θ = (η, ν): (1/N) Σ_{n=1}^{N} ℒ(x_n; θ)

[Figure: optimization minimizes D_KL(q ∥ p) between the approximate posterior q(z|x; ν) and the true posterior p(z|x; μ), with optima ν*, μ*]
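As a concrete (toy) illustration of the objective above, the single-batch Monte Carlo ELBO estimate for a diagonal-Gaussian q and a linear-Gaussian decoder can be sketched as below. The decoder `W`, the encoder outputs `mu`/`log_sigma`, and all names are illustrative assumptions, not the paper's architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical tiny model: p(z) = N(0, I), p(x|z) = N(W z, I),
# q(z|x) = N(mu(x), diag(sigma(x)^2)).
D_x, D_z = 4, 2
W = rng.normal(size=(D_x, D_z))

def elbo(x, mu, log_sigma, n_samples=64):
    """Monte Carlo estimate of L(x) = E_q[log p(x|z)] - D_KL(q(z|x) || p(z))."""
    sigma = np.exp(log_sigma)
    eps = rng.normal(size=(n_samples, D_z))
    z = mu + sigma * eps                       # reparameterized samples from q
    x_mean = z @ W.T                           # decoder mean
    log_px_z = -0.5 * np.sum((x - x_mean) ** 2 + np.log(2 * np.pi), axis=1)
    # KL between a diagonal Gaussian and the standard normal, in closed form
    kl = 0.5 * np.sum(sigma ** 2 + mu ** 2 - 1.0 - 2.0 * log_sigma)
    return log_px_z.mean() - kl

x = rng.normal(size=D_x)
print(elbo(x, mu=np.zeros(D_z), log_sigma=np.zeros(D_z)))
```

In a real VAE, `mu` and `log_sigma` come from an encoder network and the gradients of this estimate are taken with respect to both η and ν.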

2

Requirements for the inference model q(z|x)

Computational tractability:
1. Computationally cheap to compute and differentiate
2. Computationally cheap to sample from
3. Parallel computation

Accuracy:
4. Sufficiently flexible to match the true posterior p(z|x)


3

Previous Designs of q(z|x)

Basic designs:
- Diagonal Gaussian distribution
- Full-covariance Gaussian distribution

Designs based on change of variables:
- NICE: L. Dinh et al., "NICE: Non-linear Independent Components Estimation", 2014
- Normalizing Flow: D. J. Rezende et al., "Variational Inference with Normalizing Flows", ICML 2015

Designs based on adding auxiliary variables:
- Hamiltonian Flow / Hamiltonian Variational Inference: T. Salimans et al., "Markov Chain Monte Carlo and Variational Inference: Bridging the Gap", 2014

4

Diagonal / Full-Covariance Gaussian Distribution

Diagonal: efficient but not flexible
q(z|x) = Π_i N(z_i | μ_i(x), σ_i(x))

Full covariance: not efficient and not flexible (unimodal)
q(z|x) = N(z | μ(x), Σ(x))

Checklist (diagonal / full covariance):
1. Computationally cheap to compute and differentiate: ✓ / ✗
2. Computationally cheap to sample from: ✓ / ✗
3. Parallel computation: ✓ / ✗
4. Sufficiently flexible to match the true posterior p(z|x): ✗
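The efficiency gap between the two designs shows up already in sampling. A minimal sketch (helper names are illustrative): the diagonal case is an elementwise O(d) operation, while the full-covariance case needs an O(d³) Cholesky factorization that couples all dimensions.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_diag(mu, sigma, eps):
    # Diagonal Gaussian: O(d) per sample, trivially parallel across dimensions
    return mu + sigma * eps

def sample_full(mu, Sigma, eps):
    # Full covariance: needs a Cholesky factor L with L @ L.T == Sigma,
    # an O(d^3) decomposition that couples all dimensions
    L = np.linalg.cholesky(Sigma)
    return mu + L @ eps
```

With Σ = I the two coincide, which is a convenient sanity check.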

5

Change-of-Variables-Based Methods

Transform q(z_0|x) into a more powerful distribution q(z_T|x) via sequential application of change of variables:

z_t = f(z_{t−1})
q(z_t|x) = q(z_{t−1}|x) |det df(z_{t−1})/dz_{t−1}|^{−1}

⇒ log q(z_T|x) = log q(z_0|x) − Σ_{t=1}^{T} log |det df(z_{t−1})/dz_{t−1}|

• NICE: L. Dinh et al., "NICE: Non-linear Independent Components Estimation", 2014
• Normalizing Flow: D. J. Rezende et al., "Variational Inference with Normalizing Flows", ICML 2015
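The log-density identity above can be sketched numerically. Here the flow steps are simple invertible elementwise affine maps z_t = a_t · z_{t−1} + b_t (an assumption for illustration; real flows use the richer transforms on the following slides), whose Jacobian determinant is Π_i a_{t,i}.

```python
import numpy as np

def flow_log_density(z0, log_q0, steps):
    """Push z0 through invertible affine steps, accumulating log q(z_T|x)."""
    z, log_q = z0, log_q0
    for a, b in steps:                            # each step needs a != 0
        z = a * z + b
        log_q = log_q - np.sum(np.log(np.abs(a)))  # minus log |det df/dz|
    return z, log_q
```

For a standard-normal base density in 1-d, pushing through z ↦ 2z + 1 should reproduce the analytic density of N(1, 2²), which makes the identity easy to verify.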

6

Normalizing Flow

Transformation via z_t = z_{t−1} + u_t f(w_t^T z_{t−1} + b_t)

Key features:
- Determinants are computable

Drawbacks:
- Information goes through a single bottleneck

1. Computationally cheap to compute and differentiate: ✓
2. Computationally cheap to sample from: ✓
3. Parallel computation: ✗
4. Sufficiently flexible to match the true posterior p(z|x): ✗

[Figure: the scalar w_t^T z_{t−1} + b_t is a single bottleneck between z_{t−1} and z_t = z_{t−1} + u_t f(w_t^T z_{t−1} + b_t)]
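A single planar-flow step and its log-determinant can be sketched as below, taking f = tanh as in the original paper; the log |det Jacobian| follows from the matrix determinant lemma, det(I + u ψᵀ) = 1 + u·ψ.

```python
import numpy as np

def planar_step(z, u, w, b):
    """One normalizing-flow step z' = z + u * tanh(w.z + b)."""
    a = np.tanh(w @ z + b)         # the scalar "single bottleneck"
    psi = (1.0 - a**2) * w         # d tanh(w.z + b) / dz = (1 - tanh^2) w
    log_det = np.log(np.abs(1.0 + u @ psi))
    return z + u * a, log_det
```

Because the whole vector z is squeezed through one scalar activation per step, many steps are needed for complex posteriors, which is the flexibility drawback noted above.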

7

Hamiltonian Flow / Hamiltonian Variational Inference

ELBO with auxiliary variables y:

log p(x) ≥ log p(x) − D_KL(q(z|x) ∥ p(z|x)) − 𝔼_{q(z|x)}[D_KL(q(y|x, z) ∥ r(y|x, z))] =: ℒ(x)

Drawing (y, z) via HMC:

(y_t, z_t) ~ HMC(y_t, z_t | y_{t−1}, z_{t−1})

Key features:
- Capability to sample from the exact posterior

Drawbacks:
- Long mixing time and lower ELBO

1. Computationally cheap to compute and differentiate: ✗
2. Computationally cheap to sample from: ✗
3. Parallel computation: ✗
4. Sufficiently flexible to match the true posterior p(z|x): ✓

8

NICE

Transform only half of z at each step:

z_t = (z_t^α, z_t^β) = (z_{t−1}^α, z_{t−1}^β + f(x, z_{t−1}^α))

Key features:
- The determinant of the Jacobian, det(dz_t/dz_{t−1}), is always 1

Drawbacks:
- Limited form of transformation
- Less powerful than Normalizing Flow (next)

1. Computationally cheap to compute and differentiate: ✓
2. Computationally cheap to sample from: ✓
3. Parallel computation: ✗
4. Sufficiently flexible to match the true posterior p(z|x): ✗
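An additive coupling layer in the spirit of NICE can be sketched as below (illustrative only; the real model uses learned networks for the shift). Half of z passes through unchanged, so the Jacobian is triangular with unit diagonal, the determinant is exactly 1, and inversion is just subtracting the same shift.

```python
import numpy as np

def nice_forward(z, shift_fn):
    """Additive coupling: keep z_a, shift z_b by f(z_a); det Jacobian = 1."""
    d = z.shape[0] // 2
    z_a, z_b = z[:d], z[d:]
    return np.concatenate([z_a, z_b + shift_fn(z_a)])

def nice_inverse(z, shift_fn):
    """Exact inverse of nice_forward: subtract the same shift."""
    d = z.shape[0] // 2
    z_a, z_b = z[:d], z[d:]
    return np.concatenate([z_a, z_b - shift_fn(z_a)])
```

The limited form of the transform is visible here: each step only translates one half of the variables as a function of the other half.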

9

Autoregressive Flow (proposed)

Autoregressive flow (dμ_{t,i}/dz_{t,j} = dσ_{t,i}/dz_{t,j} = 0 if i ≤ j):

z_{t,i} = μ_{t,i}(z_{t,0:i−1}) + σ_{t,i}(z_{t,0:i−1}) ⊙ z_{t−1,i}

Key features:
- Powerful
- Easy to compute det(∂z_t/∂z_{t−1}) = Π_i σ_{t,i}

Drawbacks:
- Difficult to parallelize

1. Computationally cheap to compute and differentiate: ✓
2. Computationally cheap to sample from: ✓
3. Parallel computation: ✗
4. Sufficiently flexible to match the true posterior p(z|x): ✓
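The parallelization drawback is easiest to see in code: computing z_t from z_{t−1} requires the already-computed prefix z_{t,0:i−1} at every step, so the loop is inherently sequential. `mu_fn` and `sigma_fn` below are illustrative stand-ins for learned autoregressive networks.

```python
import numpy as np

def af_sample(z_prev, mu_fn, sigma_fn):
    """Autoregressive flow step: dimension i needs z[:i], so no parallelism."""
    z = np.empty_like(z_prev)
    for i in range(len(z_prev)):
        z[i] = mu_fn(i, z[:i]) + sigma_fn(i, z[:i]) * z_prev[i]
    return z
```

With constant μ = 0 and σ = c the step reduces to elementwise scaling, a handy degenerate check.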

10

Inverse Autoregressive Flow (proposed)

Inverting AF (μ_t, σ_t are also autoregressive):

z_t = (z_{t−1} − μ_t(z_{t−1})) / σ_t(z_{t−1})

Key features:
- Equally powerful as AF
- Easy to compute det(∂z_t/∂z_{t−1}) = 1 / Π_i σ_{t,i}(z_{t−1})
- Parallelizable

1. Computationally cheap to compute and differentiate: ✓
2. Computationally cheap to sample from: ✓
3. Parallel computation: ✓
4. Sufficiently flexible to match the true posterior p(z|x): ✓
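Contrast with the AF loop above: in the inverse direction, μ_t and σ_t depend only on z_{t−1}, which is fully known, so every dimension of z_t is computed at once. A minimal sketch, with `mu_fn`/`sigma_fn` again standing in for autoregressive networks (e.g. MADE):

```python
import numpy as np

def iaf_step(z_prev, mu_fn, sigma_fn):
    """One IAF step; all dimensions computed in parallel from z_prev."""
    mu, sigma = mu_fn(z_prev), sigma_fn(z_prev)
    z = (z_prev - mu) / sigma
    log_det = -np.sum(np.log(sigma))  # det dz_t/dz_{t-1} = 1 / prod_i sigma_i
    return z, log_det
```

The log-determinant is the negated sum of log σ, matching 1 / Π_i σ_{t,i} on the slide.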

11

IAF through Masked Autoencoder (MADE)

Modeling autoregressive μ_t and σ_t with MADE:
- Remove paths from future inputs in autoencoders by introducing masks
- MADE is a probabilistic model: p(x) = Π_i p(x_i | x_{0:i−1})
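The masking idea can be sketched for a single hidden layer: assign each unit a degree and zero out any weight that would let output i see input j with j ≥ i. This is an illustrative construction of MADE-style masks, not the exact recipe used in the paper's networks.

```python
import numpy as np

def made_masks(d, d_hidden, rng=np.random.default_rng(0)):
    """Masks enforcing strict autoregressive structure (requires d >= 2)."""
    deg_in = np.arange(1, d + 1)                  # input degrees 1..d
    deg_hid = rng.integers(1, d, size=d_hidden)   # hidden degrees 1..d-1
    # hidden unit k may see input j only if deg_hid[k] >= deg_in[j]
    mask_in = (deg_hid[:, None] >= deg_in[None, :]).astype(float)
    # output i may see hidden unit k only if deg_in[i] > deg_hid[k]
    mask_out = (deg_in[:, None] > deg_hid[None, :]).astype(float)
    return mask_in, mask_out   # elementwise-multiply with W1 and W2
```

Multiplying the two masks gives the output-to-input connectivity, which must be strictly lower triangular: output i depends only on inputs 0..i−1.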

12

Experiments

IAF is evaluated on image generation models.

Models for MNIST:
- Convolutional VAE with ResNet blocks
- IAF = 2-layer MADE
- IAF transformations are stacked with the ordering reversed alternately

Models for CIFAR-10 (very complicated)
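The stacking scheme mentioned for MNIST can be sketched as below: apply an IAF step, then reverse the variable ordering before the next step so that dimensions conditioned last in one step come first in the next. The μ/σ functions are illustrative stand-ins for the 2-layer MADE networks.

```python
import numpy as np

def stacked_iaf(z, steps):
    """Stack IAF steps, reversing the ordering between consecutive steps."""
    log_det = 0.0
    for mu_fn, sigma_fn in steps:
        mu, sigma = mu_fn(z), sigma_fn(z)
        z = (z - mu) / sigma
        log_det -= np.sum(np.log(sigma))
        z = z[::-1]                    # reverse ordering for the next step
    return z, log_det
```

With identity steps (μ = 0, σ = 1) and an even number of steps the composition is the identity, which gives a simple sanity check on the bookkeeping.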

13

MNIST

14

CIFAR-10

15

IAF in 1 Slide

[Figure: a chain of IAF steps transforms q(z_0|x; ν_0) through q(z_t|x; ν_t) to q(z_T|x; ν_T), shrinking D_KL(q ∥ p) toward the true posterior p(z|x; μ); one direction of the chain is the Autoregressive Flow, the other the Inverse Autoregressive Flow]

IAF is:
✓ Easy to compute and differentiate
✓ Easy to sample from
✓ Parallelizable
✓ Flexible

We are hiring!
http://www.abeja.asia/
https://www.wantedly.com/companies/abeja