Improving Variational Inference with Inverse Autoregressive Flow
Jan. 19, 2017
Tatsuya Shirakawa ([email protected])

Paper by: Diederik P. Kingma (OpenAI), Tim Salimans (OpenAI), Rafal Jozefowicz (OpenAI), Xi Chen (OpenAI), Ilya Sutskever (OpenAI), Max Welling (University of Amsterdam)


1

Variational Autoencoder (VAE)

log p(x) ≥ 𝔼_{q(z|x)}[log p(x, z) − log q(z|x)]
         = log p(x) − D_KL(q(z|x) ∥ p(z|x))
         = 𝔼_{q(z|x)}[log p(x|z)] − D_KL(q(z|x) ∥ p(z))
         =: ℒ(x; θ)

Generative model: z ~ p(z; η), x ~ p(x|z; η)

Optimization:

maximize over η: (1/N) Σ_{n=1}^{N} log p(x_n; η)

Inference model: z ~ q(z|x; ν)

Optimization (ELBO):

maximize over θ = (η, ν): (1/N) Σ_{n=1}^{N} ℒ(x_n; θ)

[Figure: optimization minimizes D_KL(q ∥ p) between the approximate posterior q(z|x; ν) and the true posterior p(z|x; μ), with optima ν*, μ*]
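As a concrete (toy) illustration of the objective above, the single-batch Monte Carlo ELBO estimate for a diagonal-Gaussian q and a linear-Gaussian decoder can be sketched as below. The decoder `W`, the encoder outputs `mu`/`log_sigma`, and all names are illustrative assumptions, not the paper's architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical tiny model: p(z) = N(0, I), p(x|z) = N(W z, I),
# q(z|x) = N(mu(x), diag(sigma(x)^2)).
D_x, D_z = 4, 2
W = rng.normal(size=(D_x, D_z))

def elbo(x, mu, log_sigma, n_samples=64):
    """Monte Carlo estimate of L(x) = E_q[log p(x|z)] - D_KL(q(z|x) || p(z))."""
    sigma = np.exp(log_sigma)
    eps = rng.normal(size=(n_samples, D_z))
    z = mu + sigma * eps                       # reparameterized samples from q
    x_mean = z @ W.T                           # decoder mean
    log_px_z = -0.5 * np.sum((x - x_mean) ** 2 + np.log(2 * np.pi), axis=1)
    # KL between a diagonal Gaussian and the standard normal, in closed form
    kl = 0.5 * np.sum(sigma ** 2 + mu ** 2 - 1.0 - 2.0 * log_sigma)
    return log_px_z.mean() - kl

x = rng.normal(size=D_x)
print(elbo(x, mu=np.zeros(D_z), log_sigma=np.zeros(D_z)))
```

In a real VAE, `mu` and `log_sigma` come from an encoder network and the gradients of this estimate are taken with respect to both η and ν.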

2

Requirements for the inference model q(z|x)

Computational tractability:
1. Computationally cheap to compute and differentiate
2. Computationally cheap to sample from
3. Parallel computation

Accuracy:
4. Sufficiently flexible to match the true posterior p(z|x)


3

Previous Designs of q(z|x)

Basic designs:
- Diagonal Gaussian distribution
- Full-covariance Gaussian distribution

Designs based on change of variables:
- NICE: L. Dinh et al., "NICE: Non-linear Independent Components Estimation", 2014
- Normalizing Flow: D. J. Rezende et al., "Variational Inference with Normalizing Flows", ICML 2015

Designs based on adding auxiliary variables:
- Hamiltonian Flow / Hamiltonian Variational Inference: T. Salimans et al., "Markov Chain Monte Carlo and Variational Inference: Bridging the Gap", 2014

4

Diagonal / Full-Covariance Gaussian Distribution

Diagonal: efficient but not flexible
q(z|x) = Π_i N(z_i | μ_i(x), σ_i(x))

Full covariance: not efficient and not flexible (unimodal)
q(z|x) = N(z | μ(x), Σ(x))

Checklist (diagonal / full covariance):
1. Computationally cheap to compute and differentiate: ✓ / ✗
2. Computationally cheap to sample from: ✓ / ✗
3. Parallel computation: ✓ / ✗
4. Sufficiently flexible to match the true posterior p(z|x): ✗
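The efficiency gap between the two designs shows up already in sampling. A minimal sketch (helper names are illustrative): the diagonal case is an elementwise O(d) operation, while the full-covariance case needs an O(d³) Cholesky factorization that couples all dimensions.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_diag(mu, sigma, eps):
    # Diagonal Gaussian: O(d) per sample, trivially parallel across dimensions
    return mu + sigma * eps

def sample_full(mu, Sigma, eps):
    # Full covariance: needs a Cholesky factor L with L @ L.T == Sigma,
    # an O(d^3) decomposition that couples all dimensions
    L = np.linalg.cholesky(Sigma)
    return mu + L @ eps
```

With Σ = I the two coincide, which is a convenient sanity check.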

5

Change-of-Variables-Based Methods

Transform q(z_0|x) into a more powerful distribution q(z_T|x) via sequential application of change of variables:

z_t = f(z_{t−1})
q(z_t|x) = q(z_{t−1}|x) |det df(z_{t−1})/dz_{t−1}|^{−1}

⇒ log q(z_T|x) = log q(z_0|x) − Σ_{t=1}^{T} log |det df(z_{t−1})/dz_{t−1}|

• NICE: L. Dinh et al., "NICE: Non-linear Independent Components Estimation", 2014
• Normalizing Flow: D. J. Rezende et al., "Variational Inference with Normalizing Flows", ICML 2015
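The log-density identity above can be sketched numerically. Here the flow steps are simple invertible elementwise affine maps z_t = a_t · z_{t−1} + b_t (an assumption for illustration; real flows use the richer transforms on the following slides), whose Jacobian determinant is Π_i a_{t,i}.

```python
import numpy as np

def flow_log_density(z0, log_q0, steps):
    """Push z0 through invertible affine steps, accumulating log q(z_T|x)."""
    z, log_q = z0, log_q0
    for a, b in steps:                            # each step needs a != 0
        z = a * z + b
        log_q = log_q - np.sum(np.log(np.abs(a)))  # minus log |det df/dz|
    return z, log_q
```

For a standard-normal base density in 1-d, pushing through z ↦ 2z + 1 should reproduce the analytic density of N(1, 2²), which makes the identity easy to verify.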

6

Normalizing Flow

Transformation via z_t = z_{t−1} + u_t f(w_t^T z_{t−1} + b_t)

Key features:
- Determinants are computable

Drawbacks:
- Information goes through a single bottleneck

1. Computationally cheap to compute and differentiate: ✓
2. Computationally cheap to sample from: ✓
3. Parallel computation: ✗
4. Sufficiently flexible to match the true posterior p(z|x): ✗

[Figure: the scalar w_t^T z_{t−1} + b_t is a single bottleneck between z_{t−1} and z_t = z_{t−1} + u_t f(w_t^T z_{t−1} + b_t)]
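A single planar-flow step and its log-determinant can be sketched as below, taking f = tanh as in the original paper; the log |det Jacobian| follows from the matrix determinant lemma, det(I + u ψᵀ) = 1 + u·ψ.

```python
import numpy as np

def planar_step(z, u, w, b):
    """One normalizing-flow step z' = z + u * tanh(w.z + b)."""
    a = np.tanh(w @ z + b)         # the scalar "single bottleneck"
    psi = (1.0 - a**2) * w         # d tanh(w.z + b) / dz = (1 - tanh^2) w
    log_det = np.log(np.abs(1.0 + u @ psi))
    return z + u * a, log_det
```

Because the whole vector z is squeezed through one scalar activation per step, many steps are needed for complex posteriors, which is the flexibility drawback noted above.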

7

Hamiltonian Flow / Hamiltonian Variational Inference

ELBO with auxiliary variables y:

log p(x) ≥ log p(x) − D_KL(q(z|x) ∥ p(z|x)) − 𝔼_{q(z|x)}[D_KL(q(y|x, z) ∥ r(y|x, z))] =: ℒ(x)

Drawing (y, z) via HMC:

(y_t, z_t) ~ HMC(y_t, z_t | y_{t−1}, z_{t−1})

Key features:
- Capability to sample from the exact posterior

Drawbacks:
- Long mixing time and lower ELBO

1. Computationally cheap to compute and differentiate: ✗
2. Computationally cheap to sample from: ✗
3. Parallel computation: ✗
4. Sufficiently flexible to match the true posterior p(z|x): ✓

8

NICE

Transform only half of z at each step:

z_t = (z_t^α, z_t^β) = (z_{t−1}^α, z_{t−1}^β + f(x, z_{t−1}^α))

Key features:
- The determinant of the Jacobian, det(dz_t/dz_{t−1}), is always 1

Drawbacks:
- Limited form of transformation
- Less powerful than Normalizing Flow (next)

1. Computationally cheap to compute and differentiate: ✓
2. Computationally cheap to sample from: ✓
3. Parallel computation: ✗
4. Sufficiently flexible to match the true posterior p(z|x): ✗
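An additive coupling layer in the spirit of NICE can be sketched as below (illustrative only; the real model uses learned networks for the shift). Half of z passes through unchanged, so the Jacobian is triangular with unit diagonal, the determinant is exactly 1, and inversion is just subtracting the same shift.

```python
import numpy as np

def nice_forward(z, shift_fn):
    """Additive coupling: keep z_a, shift z_b by f(z_a); det Jacobian = 1."""
    d = z.shape[0] // 2
    z_a, z_b = z[:d], z[d:]
    return np.concatenate([z_a, z_b + shift_fn(z_a)])

def nice_inverse(z, shift_fn):
    """Exact inverse of nice_forward: subtract the same shift."""
    d = z.shape[0] // 2
    z_a, z_b = z[:d], z[d:]
    return np.concatenate([z_a, z_b - shift_fn(z_a)])
```

The limited form of the transform is visible here: each step only translates one half of the variables as a function of the other half.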

9

Autoregressive Flow (proposed)

Autoregressive flow (dμ_{t,i}/dz_{t,j} = dσ_{t,i}/dz_{t,j} = 0 if i ≤ j):

z_{t,i} = μ_{t,i}(z_{t,0:i−1}) + σ_{t,i}(z_{t,0:i−1}) ⊙ z_{t−1,i}

Key features:
- Powerful
- Easy to compute det(∂z_t/∂z_{t−1}) = Π_i σ_{t,i}

Drawbacks:
- Difficult to parallelize

1. Computationally cheap to compute and differentiate: ✓
2. Computationally cheap to sample from: ✓
3. Parallel computation: ✗
4. Sufficiently flexible to match the true posterior p(z|x): ✓
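The parallelization drawback is easiest to see in code: computing z_t from z_{t−1} requires the already-computed prefix z_{t,0:i−1} at every step, so the loop is inherently sequential. `mu_fn` and `sigma_fn` below are illustrative stand-ins for learned autoregressive networks.

```python
import numpy as np

def af_sample(z_prev, mu_fn, sigma_fn):
    """Autoregressive flow step: dimension i needs z[:i], so no parallelism."""
    z = np.empty_like(z_prev)
    for i in range(len(z_prev)):
        z[i] = mu_fn(i, z[:i]) + sigma_fn(i, z[:i]) * z_prev[i]
    return z
```

With constant μ = 0 and σ = c the step reduces to elementwise scaling, a handy degenerate check.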

10

Inverse Autoregressive Flow (proposed)

Inverting AF (μ_t, σ_t are also autoregressive):

z_t = (z_{t−1} − μ_t(z_{t−1})) / σ_t(z_{t−1})

Key features:
- Equally powerful as AF
- Easy to compute det(∂z_t/∂z_{t−1}) = 1 / Π_i σ_{t,i}(z_{t−1})
- Parallelizable

1. Computationally cheap to compute and differentiate: ✓
2. Computationally cheap to sample from: ✓
3. Parallel computation: ✓
4. Sufficiently flexible to match the true posterior p(z|x): ✓
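Contrast with the AF loop above: in the inverse direction, μ_t and σ_t depend only on z_{t−1}, which is fully known, so every dimension of z_t is computed at once. A minimal sketch, with `mu_fn`/`sigma_fn` again standing in for autoregressive networks (e.g. MADE):

```python
import numpy as np

def iaf_step(z_prev, mu_fn, sigma_fn):
    """One IAF step; all dimensions computed in parallel from z_prev."""
    mu, sigma = mu_fn(z_prev), sigma_fn(z_prev)
    z = (z_prev - mu) / sigma
    log_det = -np.sum(np.log(sigma))  # det dz_t/dz_{t-1} = 1 / prod_i sigma_i
    return z, log_det
```

The log-determinant is the negated sum of log σ, matching 1 / Π_i σ_{t,i} on the slide.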

11

IAF through Masked Autoencoder (MADE)

Modeling autoregressive μ_t and σ_t with MADE:
- Remove paths from future inputs in autoencoders by introducing masks
- MADE is a probabilistic model: p(x) = Π_i p(x_i | x_{0:i−1})
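The masking idea can be sketched for a single hidden layer: assign each unit a degree and zero out any weight that would let output i see input j with j ≥ i. This is an illustrative construction of MADE-style masks, not the exact recipe used in the paper's networks.

```python
import numpy as np

def made_masks(d, d_hidden, rng=np.random.default_rng(0)):
    """Masks enforcing strict autoregressive structure (requires d >= 2)."""
    deg_in = np.arange(1, d + 1)                  # input degrees 1..d
    deg_hid = rng.integers(1, d, size=d_hidden)   # hidden degrees 1..d-1
    # hidden unit k may see input j only if deg_hid[k] >= deg_in[j]
    mask_in = (deg_hid[:, None] >= deg_in[None, :]).astype(float)
    # output i may see hidden unit k only if deg_in[i] > deg_hid[k]
    mask_out = (deg_in[:, None] > deg_hid[None, :]).astype(float)
    return mask_in, mask_out   # elementwise-multiply with W1 and W2
```

Multiplying the two masks gives the output-to-input connectivity, which must be strictly lower triangular: output i depends only on inputs 0..i−1.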

12

Experiments

IAF is evaluated on image generation models.

Models for MNIST:
- Convolutional VAE with ResNet blocks
- IAF = 2-layer MADE
- IAF transformations are stacked with the ordering reversed alternately

Models for CIFAR-10 (very complicated)
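The stacking scheme mentioned for MNIST can be sketched as below: apply an IAF step, then reverse the variable ordering before the next step so that dimensions conditioned last in one step come first in the next. The μ/σ functions are illustrative stand-ins for the 2-layer MADE networks.

```python
import numpy as np

def stacked_iaf(z, steps):
    """Stack IAF steps, reversing the ordering between consecutive steps."""
    log_det = 0.0
    for mu_fn, sigma_fn in steps:
        mu, sigma = mu_fn(z), sigma_fn(z)
        z = (z - mu) / sigma
        log_det -= np.sum(np.log(sigma))
        z = z[::-1]                    # reverse ordering for the next step
    return z, log_det
```

With identity steps (μ = 0, σ = 1) and an even number of steps the composition is the identity, which gives a simple sanity check on the bookkeeping.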

13

MNIST

14

CIFAR-10

15

IAF in 1 Slide

[Figure: a chain of IAF steps transforms q(z_0|x; ν_0) through q(z_t|x; ν_t) to q(z_T|x; ν_T), shrinking D_KL(q ∥ p) toward the true posterior p(z|x; μ); one direction of the chain is the Autoregressive Flow, the other the Inverse Autoregressive Flow]

IAF is:
✓ Easy to compute and differentiate
✓ Easy to sample from
✓ Parallelizable
✓ Flexible

We are hiring!
http://www.abeja.asia/
https://www.wantedly.com/companies/abeja