
Bayesian Reasoning and Deep Learning

    Shakir Mohamed

@shakir_za | shakirm.com

    !""#$%&'

    9 October 2015


    Abstract

Deep learning and Bayesian machine learning are currently two of the most active areas of machine learning research. Deep learning provides a powerful class of models and an easy framework for learning that now provides state-of-the-art methods for applications ranging from image classification to speech recognition. Bayesian reasoning provides a powerful approach for information integration, inference and decision making that has established it as the key tool for data-efficient learning, uncertainty quantification and robust model composition, widely used in applications ranging from information retrieval to large-scale ranking. Each of these research areas has shortcomings that can be effectively addressed by the other, pointing towards a needed convergence of these two areas of machine learning; the complementary aspects of these two research areas are the focus of this talk. Using the tools of auto-encoders and latent variable models, we shall discuss some of the ways in which our machine learning practice is enhanced by combining deep learning with Bayesian reasoning. This is an essential, and ongoing, convergence that will only continue to accelerate, and it provides some of the most exciting prospects, some of which we shall discuss, for contemporary machine learning research.


    Deep Learning

+ Rich non-linear models for classification and sequence prediction.
+ Scalable learning using stochastic approximations, and conceptually simple.
+ Easily composable with other gradient-based methods.
- Only point estimates.
- Hard to score models, do model selection and complexity penalisation.

    A framework for constructing flexible models


    Bayesian Reasoning

+ Unified framework for model building, inference, prediction and decision making.
+ Explicit accounting for uncertainty and variability of outcomes.
+ Robust to overfitting; tools for model selection and composition.
- Mainly conjugate and linear models.
- Potentially intractable inference, leading to expensive computation or long simulation times.

    A framework for inference and decision making


    Two Streams of Machine Learning

Bayesian Reasoning
+ Unified framework for model building, inference, prediction and decision making.
+ Explicit accounting for uncertainty and variability of outcomes.
+ Robust to overfitting; tools for model selection and composition.
- Mainly conjugate and linear models.
- Potentially intractable inference, computationally expensive or long simulation times.

Deep Learning
+ Rich non-linear models for classification and sequence prediction.
+ Scalable learning using stochastic approximation, and conceptually simple.
+ Easily composable with other gradient-based methods.
- Only point estimates.
- Hard to score models, do selection and complexity penalisation.


    Outline

Complementary strengths that we should expect to be successfully combined: Bayesian Reasoning + Deep Learning.

1. Why is this a good idea?
   • Review of deep learning
   • Limitations of maximum likelihood and MAP estimation
2. How can we achieve this convergence?
   • Case study using auto-encoders and latent variable models
   • Approximate Bayesian inference
3. What else can we do?
   • Semi-supervised learning, classification, better inference and more


A (Statistical) Review of Deep Learning
Generalised Linear Regression

η = w⊤x + b
p(y|x) = p(y | g(η); θ)

Maximum likelihood estimation: optimise the negative log-likelihood
L(θ) = −log p(y | g(η); θ)

• g(·) is an inverse link function that we'll refer to as an activation function.

Target      | Regression  | Link                          | Inverse link                    | Activation
Real        | Linear      | Identity                      | Identity                        |
Binary      | Logistic    | Logit  log μ/(1−μ)            | Sigmoid  1/(1 + exp(−η))        | Sigmoid
Binary      | Probit      | Inv. Gauss CDF  Φ⁻¹(μ)        | Gauss CDF  Φ(η)                 | Probit
Binary      | Gumbel      | Compl. log-log  log(−log(μ))  | Gumbel CDF  e^(−e^(−x))         |
Binary      | Logistic    | Hyperbolic tangent            | tanh(η)                         | Tanh
Categorical | Multinomial | Multin. logit                 | exp(η_i) / Σ_j exp(η_j)         | Softmax
Counts      | Poisson     | log(μ)                        | exp(ν)                          |
Counts      | Poisson     | √(μ)                          | ν²                              |
Non-neg.    | Gamma       | Reciprocal  1/μ               | 1/ν                             |
Sparse      | Tobit       | max                           | max(0, ν)                       | ReLU
Ordered     | Ordinal     | Cum. logit                    | σ(φ_k − ν)                      |

• The basic function η can be any linear function, e.g., affine, convolution.


A (Statistical) Review of Deep Learning
Recursive Generalised Linear Regression
A general framework for building non-linear, parametric models.

• Recursively compose the basic linear functions; this gives a deep neural network:
  E[y] = h_L ∘ … ∘ h_l ∘ … ∘ h_0(x)

Problem: overfitting of MLE, leading to limited generalisation.
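To illustrate the composition itself, here is a small sketch of the recursion E[y] = h_L ∘ … ∘ h_0(x): each layer is a basic linear (affine) function followed by an activation. The layer sizes and the tanh/sigmoid choices are assumptions, not anything prescribed by the talk.

```python
# Minimal sketch (illustrative): a deep network as a recursive composition of
# affine maps and element-wise activations.
import numpy as np

rng = np.random.default_rng(1)
sizes = [5, 32, 32, 1]                   # input -> two hidden layers -> output
params = [(rng.normal(scale=0.1, size=(m, n)), np.zeros(n))
          for m, n in zip(sizes[:-1], sizes[1:])]

def forward(x, params):
    h = x
    for i, (W, b) in enumerate(params):
        eta = h @ W + b                  # basic linear (affine) function
        last = i == len(params) - 1
        h = 1 / (1 + np.exp(-eta)) if last else np.tanh(eta)  # activation / inverse link
    return h                             # E[y] for a Bernoulli output layer

x = rng.normal(size=(4, 5))
print(forward(x, params))                # predicted means, shape (4, 1)
```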


    A (Statistical) Review of Deep Learning

Regularisation Strategies for Deep Networks

• Regularisation is essential to overcome the limitations of maximum likelihood estimation.
• Regularisation, penalised regression, shrinkage: a wide range of regularisation techniques is available:
  • Large data sets, input noise/jittering, and data augmentation/expansion.
  • L2/L1 regularisation (weight decay, Gaussian prior).
  • Binary or Gaussian dropout.
  • Batch normalisation.

A more robust loss function using MAP estimation instead.
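As a concrete instance of the MAP view of regularisation, the sketch below (illustrative, with an assumed prior scale lam) adds the L2 penalty implied by a Gaussian prior on the weights to the negative log-likelihood of the Bernoulli GLM above.

```python
# Minimal sketch (illustrative): MAP estimation with a Gaussian prior on the
# weights is the familiar L2 weight decay added to the negative log-likelihood.
import numpy as np

def map_loss_and_grad(w, X, y, lam=1e-2):
    mu = 1 / (1 + np.exp(-(X @ w)))               # Bernoulli-sigmoid GLM
    nll = -np.mean(y * np.log(mu) + (1 - y) * np.log(1 - mu))
    penalty = 0.5 * lam * np.sum(w ** 2)          # from -log N(w | 0, I/lam), up to a constant
    grad = X.T @ (mu - y) / len(y) + lam * w
    return nll + penalty, grad

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = (X @ np.array([1.0, -2.0, 0.5]) > 0) * 1.0
loss, grad = map_loss_and_grad(np.zeros(3), X, y)
print(loss, grad)
```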


    More Robust Learning

    MAP estimators and limitations

    "Power of MAP estimators is that they providesome robustness to overfitting.

    "Creates sensitivities to parameterisation.

    1. Sensitivities a!ect gradients and can make learning hard

    2. Still no way to measure confidence of our model.

    Can generate frequentist confidence intervalsand bootstrap estimates.

    Invariant MAP estimators and exploiting naturalgradients, trust region methods and other

    improved optimisation.


    Towards Bayesian Reasoning

Issues arise as a consequence of:
• Reasoning only about the most likely solution, and
• Not maintaining knowledge of the underlying variability (and averaging over this).

Proposed solutions have not fully dealt with these underlying issues.

Given this powerful model class and invaluable tools for regularisation and optimisation, let us develop a pragmatic Bayesian approach for probabilistic reasoning in deep networks: Bayesian reasoning over some, but not all, parts of our models (yet).


    Outline

Complementary strengths that we should expect to be successfully combined: Bayesian Reasoning + Deep Learning.

1. Why is this a good idea?
   • Review of deep learning
   • Limitations of maximum likelihood and MAP estimation
2. How can we achieve this convergence?
   • Case study using auto-encoders and latent variable models
   • Approximate Bayesian inference
3. What else can we do?
   • Semi-supervised learning, classification, better inference and more


Dimensionality Reduction and Auto-encoders

Unsupervised learning and auto-encoders:
• A generic tool for dimensionality reduction and feature extraction.
• Minimise reconstruction error using an encoder and a decoder:
  Data y → Encoder f(·) → code z = f(y) → Decoder g(·) → reconstruction y* = g(z)
  L = ‖y − g(f(y))‖₂², or probabilistically, L = −log p(y | g(z))

+ Non-linear dimensionality reduction using deep networks for the encoder and decoder.
+ Easy to implement as a single computational graph and train using SGD.
- No natural handling of missing data.
- No representation of the variability of the representation space.
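A minimal sketch of such an auto-encoder, assuming linear encoder and decoder maps and a squared reconstruction error; all sizes and the learning rate are illustrative.

```python
# Minimal sketch (illustrative): a deterministic auto-encoder with a linear
# encoder f and decoder g, trained by gradient descent on the reconstruction error.
import numpy as np

rng = np.random.default_rng(2)
Y = rng.normal(size=(200, 10))           # data y
We = rng.normal(scale=0.1, size=(10, 2)) # encoder weights: z = f(y) = y @ We
Wd = rng.normal(scale=0.1, size=(2, 10)) # decoder weights: y* = g(z) = z @ Wd

for step in range(500):
    Z = Y @ We                           # encode
    Y_hat = Z @ Wd                       # decode
    err = Y_hat - Y                      # reconstruction error
    grad_Wd = Z.T @ err / len(Y)
    grad_We = Y.T @ (err @ Wd.T) / len(Y)
    We -= 0.1 * grad_We
    Wd -= 0.1 * grad_Wd

print(np.mean((Y - (Y @ We) @ Wd) ** 2)) # final mean squared reconstruction error
```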


    Dimensionality Reduction and Auto-encoders

Some questions about auto-encoders:
• What is the model we are interested in?
• Why use an encoder?
• How do we regularise?

Best to be explicit about the:
• Probabilistic model of interest, and
• Mechanism we use for inference.


Density Estimation and Latent Variable Models

Latent variable models:
• Generic and flexible model class for density estimation.
• Specifies a generative process that gives rise to the data (graphical model: latent z generates observed y with parameters W, for n = 1, …, N).

Latent Gaussian models:
• Probabilistic PCA, factor analysis (FA), Bayesian Exponential Family PCA (BXPCA).

BXPCA:
  Latent variable:    z ~ N(z | μ, Σ)
  Observation model:  y ~ Expon(y | η),  η = Wz + b  (exponential-family natural parameters η)

Use our knowledge of deep learning to design even richer models.
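The generative process can be written down directly. The sketch below samples from a BXPCA-style latent Gaussian model, assuming a Bernoulli observation model as the exponential-family member; the dimensions are illustrative.

```python
# Minimal sketch (illustrative): ancestral sampling from a latent Gaussian model
# with natural parameters eta = W z + b and a Bernoulli observation model.
import numpy as np

rng = np.random.default_rng(3)
K, D, N = 2, 10, 5                       # latent dim, data dim, number of samples
W = rng.normal(size=(D, K))
b = np.zeros(D)

z = rng.normal(size=(N, K))              # z ~ N(0, I)
eta = z @ W.T + b                        # natural parameters
p = 1 / (1 + np.exp(-eta))               # Bernoulli mean via the sigmoid link
y = (rng.uniform(size=p.shape) < p) * 1  # y ~ Bernoulli(p)
print(y)
```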


    Deep Generative Models

Rich extension of the previous model using deep neural networks.
E.g., non-linear factor analysis, non-linear Gaussian belief networks, deep latent Gaussian models (DLGM).

DLGM:
  Latent variables (stochastic layers):  z_l ~ N(z_l | f_l(z_{l+1}), Σ_l),  f_l(z) = σ(W h(z) + b)
  Deterministic layers:                  h_i(x) = σ(A x + c)
  Observation model:                     y ~ Expon(y | η),  η = W h_1 + b

Can also use non-exponential family observation models.
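A sketch of ancestral sampling from a small DLGM, assuming two stochastic layers with tanh deterministic layers and a Gaussian observation model; the architecture is illustrative rather than the one used in the talk.

```python
# Minimal sketch (illustrative): ancestral sampling from a two-layer deep
# latent Gaussian model.
import numpy as np

rng = np.random.default_rng(4)
dims = {"z2": 2, "h": 16, "z1": 4, "y": 10}
A  = rng.normal(scale=0.5, size=(dims["z2"], dims["h"]))   # deterministic layer weights
W1 = rng.normal(scale=0.5, size=(dims["h"], dims["z1"]))   # mean of z1 given z2
W0 = rng.normal(scale=0.5, size=(dims["z1"], dims["y"]))   # observation parameters

z2 = rng.normal(size=dims["z2"])                 # top layer: z2 ~ N(0, I)
h  = np.tanh(z2 @ A)                             # deterministic layer h(z2)
z1 = h @ W1 + 0.1 * rng.normal(size=dims["z1"])  # z1 ~ N(f(z2), 0.01 I)
y  = z1 @ W0 + 0.1 * rng.normal(size=dims["y"])  # Gaussian observation model
print(y)
```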


    Deep Latent Gaussian Models

Our inferential tasks are:
1. Explain the data: infer the posterior over latent variables, p(z | y, W) ∝ p(y | z, W) p(z).
2. Make predictions: p(y* | y) = ∫ p(y* | z, W) p(z | y, W) dz.
3. Choose the best model: evaluate the marginal likelihood p(y | W) = ∫ p(y | z, W) p(z) dz.


    Variational Inference

Use tools from approximate inference to handle intractable integrals:

  F(y, q) = E_q(z)[log p(y|z)] − KL[q(z) ‖ p(z)]
            (reconstruction)     (penalty)

Choose q(z) from a tractable approximation class so as to be close, in KL[q(z|y) ‖ p(z|y)], to the true posterior.

• Reconstruction cost: the expected log-likelihood measures how well samples from q(z) are able to explain the data y.
• Penalty: the explanation of the data, q(z), doesn't deviate too far from your beliefs p(z) (Ockham's razor).

    Penalty is derived from your model and does not need to be designed.
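A toy sketch of the free energy in practice: a Monte Carlo estimate of the reconstruction term plus an analytic KL penalty, for a scalar latent variable with a standard normal prior and a Bernoulli likelihood. The model and variational parameters below are assumptions, chosen only for illustration.

```python
# Minimal sketch (illustrative): Monte Carlo estimate of
# F(y, q) = E_q[log p(y|z)] - KL[q(z) || p(z)] for a toy model.
import numpy as np

rng = np.random.default_rng(5)
y = np.array([1., 0., 1., 1.])           # observed binary data
w, b = 2.0, 0.0                          # fixed "decoder" parameters
m, log_s = 0.5, -1.0                     # variational parameters of q(z) = N(m, s^2)
s = np.exp(log_s)

z = m + s * rng.normal(size=1000)        # samples z ~ q(z)
p = 1 / (1 + np.exp(-(w * z[:, None] + b)))
log_lik = np.sum(y * np.log(p) + (1 - y) * np.log(1 - p), axis=1)
recon = log_lik.mean()                   # E_q[log p(y|z)], Monte Carlo
kl = 0.5 * (s**2 + m**2 - 1.0) - log_s   # KL[N(m, s^2) || N(0, 1)], analytic
print("free energy F =", recon - kl)
```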


    Amortised Variational Inference

Approximate posterior distribution q(z): the best match to the true posterior p(z|y), one of the unknown inferential quantities of interest to us.

Data y → Inference network (encoder) q_φ(z|y) → z ~ q(z|y)

• Inference network: q is an encoder or inverse model.
• The parameters of q are now a set of global parameters used for inference of all data points, test and train. Amortise (spread) the cost of inference over all data.
• Encoders provide an efficient mechanism for amortised posterior inference.

  F(y, q) = E_{q_φ(z)}[log p_θ(y|z)] − KL[q_φ(z) ‖ p(z)]
            (reconstruction)            (penalty; q_φ is the approximate posterior)


    Auto-encoders and Inference in DGMs

Model (decoder): likelihood p(y|z).
Inference (encoder): variational distribution q(z|y).
Stochastic encoder-decoder systems implement variational inference.

  F(y, q) = E_q(z|y)[log p(y|z)] − KL[q(z|y) ‖ p(z)]

Data y → Inference network q(z|y) → z ~ q(z|y) → Model p(y|z) → y ~ p(y|z)

This specific combination, variational inference in latent variable models using inference networks, is the Variational Auto-encoder. But don't forget what your model is, and what inference you use.
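A minimal sketch of one evaluation of the variational auto-encoder objective, assuming small one-hidden-layer encoder and decoder networks, a diagonal Gaussian q(z|y) with the reparameterisation trick, and a Bernoulli likelihood; sizes and initialisations are illustrative.

```python
# Minimal sketch (illustrative, not the talk's implementation): one forward
# evaluation of the VAE objective F(y, q).
import numpy as np

rng = np.random.default_rng(6)
D, H, K = 20, 32, 2                          # data dim, hidden units, latent dim
y = (rng.uniform(size=(8, D)) < 0.3) * 1.0   # a mini-batch of binary data

# encoder (inference network): y -> mu, log_sigma of q(z|y)
We, Wmu, Wls = (rng.normal(scale=0.1, size=s) for s in [(D, H), (H, K), (H, K)])
# decoder (model): z -> Bernoulli logits of p(y|z)
Wd, Wo = (rng.normal(scale=0.1, size=s) for s in [(K, H), (H, D)])

h_enc = np.tanh(y @ We)
mu, log_sigma = h_enc @ Wmu, h_enc @ Wls
eps = rng.normal(size=mu.shape)
z = mu + np.exp(log_sigma) * eps             # z ~ q(z|y), reparameterised

h_dec = np.tanh(z @ Wd)
logits = h_dec @ Wo
log_p = np.sum(y * logits - np.logaddexp(0.0, logits), axis=1)        # log p(y|z)
kl = 0.5 * np.sum(np.exp(2 * log_sigma) + mu**2 - 1 - 2 * log_sigma, axis=1)
elbo = np.mean(log_p - kl)                   # F(y, q), to be maximised
print("ELBO estimate:", elbo)
```

In practice this objective is maximised with stochastic gradients through the reparameterised sample, exactly as for any other deep-network loss.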


What Have We Gained?

  F(y, q) = E_q(z)[log p(y|z)] − KL[q(z) ‖ p(z)]

+ Transformed auto-encoders into more interesting deep generative models.
+ A rich new class of density estimators built with non-linear models.
+ Used a principled approach for deriving loss functions that automatically include appropriate penalty functions.
+ Explained how an encoder enters into our models and why this is a good idea.
+ Able to answer all of our desired inferential questions.
+ Knowledge of the uncertainty associated with our latent variables.


What Have We Gained? (continued)

+ Able to score our models and do model selection using the free energy.
+ Can impute missing data under any missingness assumption.
+ Can still combine with natural gradients and improved optimisation tools.
+ Easy implementation: a single computational graph and simple Monte Carlo gradient estimators.
+ Computational complexity the same as any large-scale deep learning system.

    A true marriage of Bayesian Reasoning and Deep Learning


Data Visualisation

MNIST handwritten digits (28×28 images).
Figure: a DLGM trained on MNIST; class labels shown in the 2D latent space, and samples drawn from the 2D latent model.


Visualising MNIST in 3D
Figure: MNIST visualised in the 3D latent space of a DLGM.


Data Simulation
Figure: samples generated by the DLGM alongside real data samples.


Missing Data Imputation
Figure: DLGM imputation with 10% and 50% of pixels observed; original data, unobserved pixels, and the inferred images.


    Outline

Complementary strengths that we should expect to be successfully combined: Bayesian Reasoning + Deep Learning.

1. Why is this a good idea?
   • Review of deep learning
   • Limitations of maximum likelihood and MAP estimation
2. How can we achieve this convergence?
   • Auto-encoders and latent variable models
   • Approximate and variational inference
3. What else can we do?
   • Semi-supervised learning, recurrent networks, classification, better inference and more


    Semi-supervised Learning

Can extend the marriage of Bayesian reasoning and deep learning to the problem of semi-supervised classification.

Semi-supervised DLGM: latent variables z and (partially observed) labels y jointly generate the data x, for n = 1, …, N.


    Analogical Reasoning

Figure: analogies generated by the semi-supervised DLGM.


    Generative Models with Attention

We can also combine other tools from deep learning to design even more powerful generative models: recurrent networks and attention.

DRAW.
Figure: the DRAW architecture compared with a standard variational auto-encoder. In the VAE, an encoder FNN maps x to a sample z and a decoder FNN gives P(x|z). In DRAW, at each time step a read operation, encoder RNN, latent sample from Q(z_{t+1} | x, z_{1:t}), decoder RNN and write operation update the canvas c_{t−1} → c_t → … → c_T; encoding is inference and decoding is the generative model.


    Uncertainty on Model Parameters

We can also maintain uncertainty on the model parameters themselves: Bayesian neural networks place prior distributions over the weights of the network.

Figure: a neural network for p(y|x) with distributions over the weights W1, W2, W3 of its layers h1, h2, for n = 1, …, N.


    In Review

Deep learning as a framework for building highly flexible non-linear parametric models; but regularisation and accounting for uncertainty and lack of knowledge are still needed.

Bayesian reasoning as a general framework for inference that allows us to account for uncertainty, and a principled approach to regularisation and model scoring.

Combined Bayesian reasoning with auto-encoders, and showed just how much can be gained by a marriage of these two streams of machine learning research.


    Thanks to many people:

Danilo Rezende, Ivo Danihelka, Karol Gregor, Charles Blundell, Theophane Weber, Andriy Mnih, Daan Wierstra (Google DeepMind), Durk Kingma, Max Welling (U. Amsterdam)

Thank You.


Some References

Probabilistic Deep Learning:
• Rezende, D. J., Mohamed, S., and Wierstra, D. (2014). Stochastic backpropagation and approximate inference in deep generative models. ICML.
• Kingma, D. P., and Welling, M. (2014). Auto-encoding variational Bayes. ICLR.
• Mnih, A., and Gregor, K. (2014). Neural variational inference and learning in belief networks. ICML.
• Gregor, K., et al. (2014). Deep autoregressive networks. ICML.
• Kingma, D. P., Mohamed, S., Rezende, D. J., and Welling, M. (2014). Semi-supervised learning with deep generative models. NIPS, pp. 3581-3589.
• Gregor, K., Danihelka, I., Graves, A., and Wierstra, D. (2015). DRAW: A recurrent neural network for image generation. arXiv preprint arXiv:1502.04623.
• Rezende, D. J., and Mohamed, S. (2015). Variational inference with normalizing flows. arXiv preprint arXiv:1505.05770.
• Blundell, C., Cornebise, J., Kavukcuoglu, K., and Wierstra, D. (2015). Weight uncertainty in neural networks. arXiv preprint arXiv:1505.05424.
• Hernández-Lobato, J. M., and Adams, R. P. (2015). Probabilistic backpropagation for scalable learning of Bayesian neural networks. arXiv preprint arXiv:1502.05336.
• Gal, Y., and Ghahramani, Z. (2015). Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. arXiv preprint arXiv:1506.02142.


What is a Variational Method?

Variational principle: a general family of methods for approximating complicated densities by a simpler class of densities.
• Deterministic approximation procedures with bounds on probabilities of interest.
• Fit the variational parameters, e.g., choose q(z) within an approximation class to minimise KL[q(z|y) ‖ p(z|y)] to the true posterior.


    From IS to Variational Inference

Integral problem:         log p(y) = log ∫ p(y|z) p(z) dz
Proposal:                 log p(y) = log ∫ [p(y|z) p(z) / q(z)] q(z) dz
Importance weight:        log p(y) = log E_q(z)[ p(y|z) p(z) / q(z) ]
Jensen's inequality:      log ∫ p(x) g(x) dx ≥ ∫ p(x) log g(x) dx
                          ⇒ log p(y) ≥ ∫ q(z) log [p(y|z) p(z) / q(z)] dz
                                       = ∫ q(z) log p(y|z) dz − ∫ q(z) log [q(z) / p(z)] dz
Variational lower bound:  log p(y) ≥ E_q(z)[log p(y|z)] − KL[q(z) ‖ p(z)]
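The bound can be checked numerically on a toy model where the evidence is tractable. The sketch below assumes p(z) = N(0, 1) and p(y|z) = N(y | z, 1), so log p(y) = log N(y | 0, 2), and compares it with Monte Carlo estimates of the lower bound for two choices of q.

```python
# Minimal sketch (illustrative): the variational lower bound never exceeds
# log p(y), and is tight when q(z) equals the true posterior.
import numpy as np

rng = np.random.default_rng(7)
y = 1.3                                                     # a single observation
log_py = -0.5 * (np.log(2 * np.pi * 2.0) + y**2 / 2.0)      # exact log N(y | 0, 2)

def elbo(m, s, n=100000):
    z = m + s * rng.normal(size=n)                          # z ~ q(z) = N(m, s^2)
    log_lik = -0.5 * (np.log(2 * np.pi) + (y - z) ** 2)     # log p(y|z)
    kl = 0.5 * (s**2 + m**2 - 1.0) - np.log(s)              # KL[q || p(z)], analytic
    return log_lik.mean() - kl

print("log p(y)             =", log_py)
print("ELBO, poor q         =", elbo(0.0, 1.0))
print("ELBO, near-optimal q =", elbo(y / 2.0, np.sqrt(0.5)))  # true posterior is N(y/2, 1/2)
```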



Minimum Description Length (MDL)

  F(y, q) = E_q(z)[log p(y|z)] − KL[q(z) ‖ p(z)]
            (data code-length)   (hypothesis code)

• Regularity in our data that can be explained with latent variables implies that the data is compressible.
• MDL: inference seen as a problem of compression; we must find the ideal shortest message for our data y: the marginal likelihood.
• Must introduce an approximation to the ideal message.
  Stochastic encoder: variational distribution q(z|y).
  Decoder: likelihood p(y|z).

Stochastic encoder-decoder systems implement variational inference.


Denoising Auto-encoders (DAE)

DAE: a mechanism for finding representations or features of data (i.e., latent variable explanations).
  Stochastic encoder: variational distribution q(z|y).
  Decoder: likelihood p(y|z).

  F(y, q) = E_q(z)[log p(y|z)] − penalty(z, y)
            (reconstruction)     (hand-designed penalty)

Stochastic encoder-decoder systems implement variational inference. The variational approach requires you to be explicit about your assumptions: the penalty is derived from your model and does not need to be designed.


    Amortising the Cost of Inference

• Inference network: q is an encoder or inverse model.
• The parameters of q are now a set of global parameters used for inference of all data points, test and train.
• Share the cost of inference (amortise) over all data.
• Combines easily with mini-batches and Monte Carlo expectations.
• Can jointly optimise variational and model parameters: no need for alternating optimisation.

Variational EM, repeat:
  E-step: for n = 1, …, N, fit the per-data-point variational parameters by maximising E_q(z_n)[log p(y_n|z_n)] − KL[q(z_n) ‖ p(z_n)].
  M-step: update the model parameters using (1/N) Σ_n log p(y_n|z_n).

Instead of solving this optimisation for every data point n, we can instead use a model: the inference network q(z|y).


Implementing your Variational Algorithm

• Avoid deriving pages of gradient updates for variational inference; ideally we want probabilistic programming using variational inference.
• Variational inference turns integration into optimisation: optimise F(y, q) = E_q(z)[log p(y|z) + log p(z) − log q(z)].
• Automated tools:
  Differentiation: Theano, Torch7, Stan.
  Message passing: infer.NET (http://infer.net/).

Figure: the computational graph. The forward pass evaluates the inference network q(z|x), the prior p(z) and the model p(x|z) (including the entropy H[q(z)]); the backward pass propagates gradients of log p(x|z) and log p(z).

• Stochastic gradient descent and other preconditioned optimisation.
• The same code can run on GPUs or on distributed clusters.
• Probabilistic models are modular and can easily be combined.

Stochastic Backpropagation

A Monte Carlo method that works with continuous latent variables.
• Can use any likelihood function; avoids the need for additional lower bounds. Low-variance, unbiased estimator of the gradient.
• Can use just one sample from the base distribution.
• Possible for many distributions with location-scale or other known transformations, such as the CDF.

Original problem:             E_q(z)[f(z)]
Reparameterisation:           z ~ N(μ, σ²)  ⇔  z = μ + σε,  ε ~ N(0, 1)
                              E_q(z)[f(z)] = E_{N(ε|0,1)}[f(μ + σε)]
Backpropagation with MC:      ∇_{θ={μ,σ}} E_q(z)[f(z)] = E_{N(ε|0,1)}[∇_θ f(μ + σε)]
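A sketch of the reparameterisation estimator on a test function with a known answer (f(z) = z², so the gradients with respect to μ and σ are 2μ and 2σ); the test function is an assumption chosen only so that the estimate is checkable.

```python
# Minimal sketch (illustrative): the reparameterisation (pathwise) gradient
# estimator for d/d{mu, sigma} E_{N(z|mu, sigma^2)}[f(z)] with f(z) = z^2.
import numpy as np

rng = np.random.default_rng(8)
mu, sigma = 0.7, 1.4
eps = rng.normal(size=100000)            # base distribution N(0, 1)
z = mu + sigma * eps                     # reparameterised sample z ~ N(mu, sigma^2)

df_dz = 2 * z                            # f(z) = z^2  ->  f'(z) = 2z
grad_mu = np.mean(df_dz * 1.0)           # dz/dmu = 1
grad_sigma = np.mean(df_dz * eps)        # dz/dsigma = eps

print("estimated:", grad_mu, grad_sigma)
print("exact:    ", 2 * mu, 2 * sigma)
```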



Monte Carlo Control Variate Estimators

A more general Monte Carlo approach that can be used with both discrete and continuous latent variables.

Property of the score function:   ∇ log q(z|x) = ∇q(z|x) / q(z|x)

Original problem:    ∇ E_q(z)[log p(y|z)]
Score ratio:         E_q(z)[log p(y|z) ∇ log q(z|y)]
MCCV estimate:       E_q(z)[(log p(y|z) − c) ∇ log q(z|y)]

c is known as a control variate and is used to control the variance of the estimator.
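A sketch of the score-function estimator and its control-variate version on the same checkable test function f(z) = z² with a Gaussian q(z); setting the baseline c to the exact E[f(z)] is purely for illustration.

```python
# Minimal sketch (illustrative): the score-function gradient estimator with a
# constant control variate c, for d/dmu E_{N(z|mu, sigma^2)}[f(z)], f(z) = z^2,
# whose exact gradient with respect to mu is 2*mu.
import numpy as np

rng = np.random.default_rng(9)
mu, sigma = 0.7, 1.4
z = rng.normal(mu, sigma, size=100000)   # z ~ q(z) = N(mu, sigma^2)

f = z ** 2
score = (z - mu) / sigma ** 2            # d/dmu log q(z), the score function

plain = f * score                        # score-ratio estimator
c = mu ** 2 + sigma ** 2                 # control variate: here the exact E[f(z)]
mccv = (f - c) * score                   # MCCV estimator: same mean, lower variance

print("exact gradient:", 2 * mu)
print("score estimator: mean %.3f  var %.1f" % (plain.mean(), plain.var()))
print("MCCV estimator:  mean %.3f  var %.1f" % (mccv.mean(), mccv.var()))
```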