Distributed Bayesian Learning with Stochastic Natural ...€¦parameter server: • parameter x...

Distributed Bayesian Learning with Stochastic Natural-gradient EP and the Posterior Server

Yee Whye Teh

in collaboration with: Minjie Xu, Balaji Lakshminarayanan, Leonard Hasenclever, Thibaut Lienart, Stefan Webb, Sebastian Vollmer, Charles Blundell

Distributed Bayesian Learning with SNEP and Posterior Server Yee Whye Teh

Bayesian Learning

! Parameter vector X.

! Data items Y = y1, y2,... yN.

! Model:

! Aim:

! Inference algorithms:!Variational inference: parametrise posterior as qθ and optimize θ. !Markov chain Monte Carlo: construct samples X1…Xn ~ p(X|Y).

p(X,Y ) = p(X)NY

p(yi|X)

y1 y2 y3 y4 ..... yN

p(X|Y ) =p(X)p(Y |X)

Machine Learning on Distributed Systems

y1i y2i y3i y4i

! Distributed storage

! Distributed computation

! costly network communications

Parameter Server

! Parameter server [Ahmed et al 2012], DistBelief network [Dean et al 2012].

parameter server:• parameter x

worker:• xi = x• updates to xi’ • returns Δxi = xi’ - xi

y1i y2i y3i y4i

! Only communication at the combination stage.

Embarassingly Parallel MCMC Sampling

y1i y2i y3i y4i

Treat as independentinference problems.Collect samples.

“Combine” samples together.

{Xji}j=1...m,i=1...n

{Xi}i=1...n

[Scott et al 2013, Neiswanger et al 2013, Wang & Dunson 2013, Stanislav et al 2014]

Embarassingly Parallel MCMC Sampling

! Unclear how to combine worker samples well.

! Particularly if local posteriors on worker machines do not overlap.

Figure from Wang & Dunson

Main Idea

! Identify regions of high (global) posterior probability mass.

! Shift each local posterior to agree with high probability region, and draw samples from these.

! How to find high probability region?!Defined in terms of low order moments.!Use information gained from local posterior samples (using small amount of communication). Figure from Wang & Dunson

Tilting Local Posteriors

! Each worker machine j has access only to its data subset.

! where pj(X) is a local prior and pj(X | yj) is local posterior.

! Adapt local priors pj(X) so that local posterior agree on certain moments

! Use expectation propagation (EP) [Minka 2001] to adapt local priors.

pj(X | yj) = pj(X)IY

p(yji |X)

Epj(X|yj)[s(X)] = s0 8j

Expectation Propagation

! If N is large, the worker j likelihood term p(yj | X) should be well approximated by Gaussian

! Parameters fit iteratively to minimize KL divergence:

! Optimal qj is such that first two moments of agree with ! Moments of local posterior estimated using MCMC sampling.! At convergence, first two moments of all local posteriors agree.

p(yj |X) ⇡ qj(X) = N (X;µj ,⌃j)

[Minka 2001]

p(X | y) ⇡ pj(X | y) / p(yj |X) p(X)Y

| {z }pj(X)

qnewj (·) = arg minN (·;µ,⌃)

KL�pj(· | y) kN (·;µ,⌃)pj(·)

N (·;µ,⌃)pj(·) pj(·|y)

Posterior Server Architecture

Likelihood

Posterior

Cavity

Likelihoodapproximation

Likelihood

Cavity

Likelihood

Cavity

Posterior Server

Worker WorkerWorker

Network

Bayesian Logistic Regression

! Simulated dataset.!d=20, # data items N=1000.

! NUTS based sampler.!# workers m = 4,10,50.!# MCMC iters T = 1000,1000,10000.

! # EP iters k given as vertical lines.200 400 600 800 1000 1200 1400

−2.5

−1.5

−0.5

k × T × N/m × 103

100 200 300 400 500 600−2.5

−1.5

−0.5

k × T × N/m × 103

250 500 750 1000 1250 1500−2.5

−1.5

−0.5

k × T × N/m × 103

Bayesian Logistic Regression

! MSE of posterior mean, as function of total # iterations.

3.2 6.4 9.6 12.8 16 19.2x 105

10−6

10−4

10−2

k × T × m

SMS(s)SMS(a)SCOTNEIS(p)NEIS(n)WANG

Stochastic Natural-gradient EP

! EP has no guarantee of convergence.! EP technically cannot handle stochasticity in moment estimates.! Long MCMC run needed for good moment estimates.! Fails for neural nets and other complex high-dimensional models.

! Stochastic Natural-gradient EP:!Alternative variational algorithm to EP.!Convergent, even with Monte Carlo estimates of moments.!Double-loop algorithm [Welling & Teh 2001, Yuille 2002, Heskes & Zoeter 2002]

Demonstrative Example

EP (500 samples) SNEP (500 samples)

Comparison to Maximum Likelihood SGD

! Maximum likelihood via SGD:!DistBelief [Dean et al 2012]!Elastic-averaging SGD [Zhang et al 2015]

! Separate likelihood approximations and states per worker.!Worker parameters not forced to be exactly same.

! Each worker learns to approximate its own likelihood. !Can be achieved without detailed knowledge from other workers.

! Diagonal Gaussian exponential family. !Variance estimates are important to learning.

Likelihood

Posterior

Cavity

Likelihood

Cavity

Likelihood

Cavity

Posterior Server

Worker WorkerWorker

Network

Experiments on Distributed Bayesian Neural Networks

! Bayesian approach to learning neural network:! compute parameter posterior given complex neural network likelihood.!Diagonal covariance Gaussian prior and exponential-family approximation.

! Two datasets and architectures: MNIST fully-connected, CIFAR10 convnet.

! Implementation in Julia.!Workers are cores on a server.!Sampler is stochastic gradient Langevin dynamics [Welling & Teh 2011].

!Adagrad [Duchi et al 2011]/RMSprop [Tieleman & Hinton 2012] type adaptation.

!Evaluated on test accuracy.

MNIST 500x300

epochs per worker0 5 10 15 20

1248Adam

rror i

Varying the number of workers

MNIST 500x300

rror i

Power SNEP - Varying the number of workers

MNIST 500x300

SNEPp-SNEPA-SGDEASGDAdam

rror i

Comparison of distributed methods (8 workers)

MNIST Very Deep MLP

epochs per worker0 10 20 30 40 50

ASGDEASGDp-SNEP

rror i

16 workers

CIFAR10 ConvNet

epochs per worker0 10 20 30

SNEPp-SNEPA-SGDEASGDAdam

rror i

Comparison of distributed methods (8 workers)

Concluding Remarks

! Novel distributed learning based on a combination of Monte Carlo and a convergent alternative to expectation propagation.

! Combination of variational and MCMC algorithms.!Advantageous over both pure variational and pure MCMC algorithms.

! Being Bayesian can be advantageous computationally in distributed setting.

! Thank you!

Distributed Bayesian Learning with Stochastic Natural ...€¦parameter server: • parameter x...

Documents

repo-nkm.batan.go.idrepo-nkm.batan.go.id/1950/1/2016 NUR ROKHIM SEMINAR KIMIA PEMBANGUN… · 1y = parameter stabilitas horisontal (fungsi x dan Stabilitas). 1z = parameter stabilitas

X-431 製品カタログ - AMTECSgMW. OBO 75 VW/AudiØ X-431 lax; 0570-013-431 MORE POWER. MORE VALUE. MORE PROFITS. Standard Features LAUNCH S sten Parameter BL 320.2405 . I/O/COWlpt

KESAN PARAMETER-PARAMETER PENGGORENGAN KE …eprints.usm.my/31405/1/VASANTI_NAIR_CHANDERAN_K.pdfkesan parameter-parameter penggorengan ke atas perubahan-perubahan fizikal keropok dan

Symbolbibliothek – Library Of Signsschuh-anlagentechnik.de/files/Downloads/D_GB_Symbolbibliothek_20… · No. Benennung Designation Symbol X Messgröße Schwingung Parameter vibration

FUNDAMENTAL PARAMETER METHOD USING SCATTERING X-RAYS IN X ... · FUNDAMENTAL PARAMETER METHOD USING SCATTERING X-RAYS IN X-RAY FLUORESCENCE ANALYSIS Yoshiyuki Kataoka1, Naoki Kawahara1,

Bab X Estimasi Statistik - elearning.gunadarma.ac.idelearning.gunadarma.ac.id/docmodul/statistika_untuk_ekonomi_dan_bisnis/... · sebagai indikator dari nilai parameter populasi yang

Der Parameter „Migrationsmatrix“mammen.vwl.uni-mannheim.de/.../Materialien/Migrationsmatrix_I.pdf · Der Parameter „Migrationsmatrix“ Teil I Anne-Christine Barthel ... Diskrete

Funktionen, Felder und Parameter- übergabe. Funktionsaufruf mit Feld als Parameter: Parameter = Name des Feldes

Hochwertiger Multi - Parameter Überwachungsmonitor...Compact 7NOTFALL Hochwertiger Multi - Parameter Überwachungsmonitor • 10.4" TFT Farbdisplay ( 800 x 600 Pixel) • 7 - Kanal

Metode Statistika Pertemuan IX-X 211_Metoda Statistika... · 2020. 5. 5. · Metode Statistika Pertemuan IX-X Statistika Inferensia: Pendugaan Parameter Populasi : Parameter Contoh

Parameter parameter Uap Air dalam Meteorologi Fisik

kompakter und kosteneffektiver Multi - Parameter ......Compact 5NOTFALL kompakter und kosteneffektiver Multi - Parameter Überwachungsmonitor • 7" TFT Farbdisplay ( 800 x 480 Pixel)

FUNGSI NAIK DAN TURUN - Arisgunaryati's Blog | Be … · Web viewMendeferensialkan Persamaan Bentuk Parameter t = parameter Jika y’= maka = = Contoh: x= 2 – t y=t2 – 6t + 5

Page 1 Network Tuning Parameter 2003. x. xx 강사 : 최원규 과장 HPCS/ESC

TUGAS AKHIR IDENTIFIKASI PARAMETER MODEL MATEMATIKA …repository.its.ac.id/3441/7/1212100014-Undergraduate-Theses.pdf · i tugas akhir – sm141501 identifikasi parameter model matematika

Technisches Handbuch - myGEKKO€¦ · GEK.GAT.RAB.USB1 Technische Daten: Parameter Wert Gehäuse: Kunststoff, transparent Abmessungen (B x H x T): 90mm x 21mm x 12mm Anzeigeelemente:

part1 - Y-ADAGIO · Web view;; -- Scheme --;;;; parameter-build.scm;; -- make parameter from spec;;;; -- print objects and cr. (define (print . x) (map (lambda (e) (write e)(display

Penentuan Waktu Simpan Produk Minuman Sari Ginseng Merk “ X “ Berdasarkan Parameter Mikrobiologi.pdf

Algoritma Pemrograman - Ifa's · Pada bahasa pemrograman, seperti halnya parameter keluaran, parameter masukan/keluaran juga sering dinamakan parameter acuan (reference parameter

x . . . . . . Parameter