Probabilistic Backpropagation for Scalable Learning of Bayesian Neural Networks

José Miguel Hernández-Lobato

joint work with Ryan P. Adams

July 9, 2015


Motivation

• Multilayer neural networks trained with backpropagation have state-of-the-art results in many regression problems, but they...

• Require tuning of hyper-parameters.
• Can be affected by overfitting problems.
• Lack estimates of uncertainty in their predictions.

In principle, the Bayesian approach can solve these problems, but most Bayesian methods lack scalability.

Probabilistic Multilayer Neural Networks

• $L$ layers with weight matrices $\mathcal{W} = \{W_l\}_{l=1}^{L}$ and network output $z_L$.

• ReLU activations for the hidden layers: $a(x) = \max(x, 0)$.

• The likelihood: $p(\mathbf{y} \mid \mathcal{W}, \mathbf{X}, \gamma) = \prod_{n=1}^{N} \mathcal{N}(y_n \mid z_L(\mathbf{x}_n \mid \mathcal{W}), \gamma^{-1})$, with each likelihood factor denoted $f_n$.

• The priors: $p(\mathcal{W} \mid \lambda) = \prod_{l=1}^{L} \prod_{i=1}^{V_l} \prod_{j=1}^{V_{l-1}+1} \mathcal{N}(w_{ij,l} \mid 0, \lambda^{-1})$, with each weight-prior factor denoted $g_k$,
  $p(\lambda) = \text{Gamma}(\lambda \mid \alpha_0^\lambda, \beta_0^\lambda) \equiv h$, $\quad p(\gamma) = \text{Gamma}(\gamma \mid \alpha_0^\gamma, \beta_0^\gamma) \equiv s$.

The posterior approximation is

$q(\mathcal{W}, \gamma, \lambda) = \left[ \prod_{l=1}^{L} \prod_{i=1}^{V_l} \prod_{j=1}^{V_{l-1}+1} \mathcal{N}(w_{ij,l} \mid m_{ij,l}, v_{ij,l}) \right] \text{Gamma}(\gamma \mid \alpha^\gamma, \beta^\gamma)\, \text{Gamma}(\lambda \mid \alpha^\lambda, \beta^\lambda).$
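
To make the notation concrete, here is a minimal NumPy sketch (not the authors' implementation) of how the factorized Gaussian approximation q over the weights can be stored and used; the helper names, the unit initial variances, and the omitted fan-in scaling used in the paper are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_q(layer_sizes):
    """Factorized Gaussian q: one (mean, variance) pair per weight.
    layer_sizes e.g. [d, 50, 1]; the extra +1 column holds a bias weight."""
    q = []
    for v_in, v_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        q.append({"m": rng.standard_normal((v_out, v_in + 1)) / np.sqrt(v_in + 1),
                  "v": np.ones((v_out, v_in + 1))})
    return q

def sample_network(q):
    """Draw one set of weight matrices W ~ q."""
    return [layer["m"] + np.sqrt(layer["v"]) * rng.standard_normal(layer["m"].shape)
            for layer in q]

def z_L(W, x):
    """Deterministic network output z_L(x | W) with ReLU hidden layers."""
    z = x
    for l, W_l in enumerate(W):
        a = W_l @ np.append(z, 1.0)                  # append 1 for the bias column
        z = np.maximum(a, 0.0) if l < len(W) - 1 else a
    return z
```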


Probabilistic Backpropagation (PBP)

[Diagram omitted: the network output z_L and the modeling noise γ enter the likelihood factor; incorporating this factor into q requires its normalization constant Z, which is easy to compute if z_L is Gaussian distributed when the weights follow q.]
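
For reference, the generic Gaussian ADF/EP update that such a step relies on can be written with Minka-style moment-matching identities. The sketch below is a generic NumPy version, not the authors' exact code, and assumes the gradients of log Z with respect to the current mean and variance are already available (e.g. from the backward pass described later).

```python
import numpy as np

def adf_update(m, v, dlogZ_dm, dlogZ_dv):
    """Moment-matched update of a Gaussian factor N(w | m, v) after absorbing one
    likelihood factor with normalizer Z (Minka-style identities):
        m_new = m + v * dlogZ/dm
        v_new = v - v**2 * ((dlogZ/dm)**2 - 2 * dlogZ/dv)
    Arguments can be NumPy arrays of matching shape (one entry per weight)."""
    m_new = m + v * dlogZ_dm
    v_new = v - v ** 2 * (dlogZ_dm ** 2 - 2.0 * dlogZ_dv)
    return m_new, np.maximum(v_new, 1e-12)           # guard against tiny negative variances
```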


Forward Pass

Propagate distributions through the network, approximating them with Gaussians by moment matching.

Once we compute log Z, we obtain its gradients by backpropagation.
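
As an illustration of this moment-matching step, the following NumPy/SciPy sketch pushes means and variances through one linear layer with independent Gaussian weights and then through a ReLU; bias terms and the fan-in scaling used in the paper are omitted, and the function names are only illustrative.

```python
import numpy as np
from scipy.stats import norm

def linear_moments(m_z, v_z, m_w, v_w):
    """Mean/variance of a = W z when the weights (m_w, v_w) and inputs (m_z, v_z)
    are independent Gaussians; these two moments are exact for this layer."""
    m_a = m_w @ m_z
    v_a = v_w @ (m_z ** 2) + (m_w ** 2) @ v_z + v_w @ v_z
    return m_a, v_a

def relu_moments(m_a, v_a):
    """Gaussian moment matching of b = max(a, 0) when a ~ N(m_a, v_a)."""
    s = np.sqrt(v_a)
    alpha = m_a / s
    m_b = m_a * norm.cdf(alpha) + s * norm.pdf(alpha)
    e_b2 = (m_a ** 2 + v_a) * norm.cdf(alpha) + m_a * s * norm.pdf(alpha)
    return m_b, np.maximum(e_b2 - m_b ** 2, 1e-12)
```

Chaining these two functions layer by layer yields the Gaussian approximation to the output z_L whose normalizer log Z is differentiated in the backward pass.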


Backward Pass

Like in classic backpropagation, we obtain a recursion in terms of deltas:

$\delta_j^m = \dfrac{\partial \log Z}{\partial m_j^a} = \sum_{k \in O(j)} \left\{ \delta_k^m \dfrac{\partial m_k^a}{\partial m_j^a} + \delta_k^v \dfrac{\partial v_k^a}{\partial m_j^a} \right\},$

$\delta_j^v = \dfrac{\partial \log Z}{\partial v_j^a} = \sum_{k \in O(j)} \left\{ \delta_k^m \dfrac{\partial m_k^a}{\partial v_j^a} + \delta_k^v \dfrac{\partial v_k^a}{\partial v_j^a} \right\}.$

Can be automatically implemented with Theano or autograd.
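
Here is a small autograd sketch of that idea for a toy one-hidden-layer network with no bias terms; the dictionary keys and shapes are illustrative assumptions. `grad` returns the derivatives of log Z with respect to every weight mean and variance in q, which is what the update step needs.

```python
import autograd.numpy as np
from autograd.scipy.stats import norm
from autograd import grad

def forward_moments(q, x):
    """Moment-matching forward pass for a toy 1-hidden-layer network (no biases)."""
    m_a = q["m1"] @ x
    v_a = q["v1"] @ (x ** 2)                              # inputs are deterministic here
    s = np.sqrt(v_a)
    alpha = m_a / s
    m_z = m_a * norm.cdf(alpha) + s * norm.pdf(alpha)     # ReLU moment matching
    v_z = (m_a ** 2 + v_a) * norm.cdf(alpha) + m_a * s * norm.pdf(alpha) - m_z ** 2
    m_out = q["m2"] @ m_z
    v_out = q["v2"] @ (m_z ** 2) + (q["m2"] ** 2) @ v_z + q["v2"] @ v_z
    return m_out[0], v_out[0]

def log_Z(q, x, y, gamma):
    """log of Z = N(y | m_zL, v_zL + 1/gamma) for one data point."""
    m_zL, v_zL = forward_moments(q, x)
    var = v_zL + 1.0 / gamma
    return -0.5 * np.log(2.0 * np.pi * var) - 0.5 * (y - m_zL) ** 2 / var

grad_log_Z = grad(log_Z)   # derivatives w.r.t. every entry of q, computed by backprop
```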


Variational Inference

Graves [2011] proposes a variational method for Bayesian neural networks that optimizes the ELBO

$\mathcal{L}_\mathcal{D}(\phi) - D_{\mathrm{KL}}[q_\phi(w) \,\|\, p(w)]$

where

$\mathcal{L}_\mathcal{D}(\phi) = \sum_{(\mathbf{x}_n, y_n) \in \mathcal{D}} \mathbb{E}_{q_\phi(w)}[\log p(y_n \mid w, \mathbf{x}_n)].$

A doubly stochastic approximation to $\mathcal{L}_\mathcal{D}(\phi)$ is obtained as

$\mathcal{L}_\mathcal{D}(\phi) \approx |\mathcal{D}| \log p(y_n \mid \mathbf{x}_n, w'),$

where $w' \sim q_\phi(w)$. SGD can then be used to optimize the ELBO.

However, sampling from $q_\phi(w)$ introduces additional noise that significantly reduces the convergence speed.
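
A sketch of this doubly stochastic estimator for a fully factorized Gaussian q_phi with means m and log-variances s; the reparameterized sampling, the closed-form KL to a standard normal prior, and the log_lik argument (the model's log-likelihood for one data point) are illustrative assumptions rather than Graves' exact formulation.

```python
import numpy as np

rng = np.random.default_rng(0)

def kl_to_standard_normal(m, s):
    """Closed-form KL[ N(m, exp(s)) || N(0, 1) ], summed over all weights."""
    return 0.5 * np.sum(np.exp(s) + m ** 2 - 1.0 - s)

def elbo_estimate(m, s, x_n, y_n, n_data, log_lik):
    """Doubly stochastic ELBO estimate: one data point (x_n, y_n) and one sample
    w' ~ q_phi(w), with the likelihood term scaled up by the data set size |D|."""
    w = m + np.exp(0.5 * s) * rng.standard_normal(m.shape)    # w' ~ q_phi(w)
    return n_data * log_lik(y_n, x_n, w) - kl_to_standard_normal(m, s)
```

An SGD step then follows the stochastic gradient of this estimate with respect to (m, s).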


Results on Toy Dataset

40 training epochs.

100 hidden units.

BP and VI use SGD.

BP and VI tuned with Bayesian optimization (www.whetlab.com).

[Figure omitted: predictions on the toy dataset, four panels; x-axis from −6 to 4, y-axis from −50 to 50.]


Exhaustive Evaluation on 10 Datasets

Table: Characteristics of the analyzed data sets.

Dataset                          N        d
Boston Housing                   506      13
Concrete Compression Strength    1030     8
Energy Efficiency                768      8
Kin8nm                           8192     8
Naval Propulsion                 11,934   16
Combined Cycle Power Plant       9568     4
Protein Structure                45,730   9
Wine Quality Red                 1599     11
Yacht Hydrodynamics              308      6
Year Prediction MSD              515,345  90

Always 50 hidden units, except in Year and Protein where we use 100. 40 training epochs.


Average Test RMSE

Table: Average test RMSE and standard errors.

Dataset       VI              BP              PBP
Boston        4.320±0.2914    3.228±0.1951    3.010±0.1850
Concrete      7.128±0.1230    5.977±0.2207    5.552±0.1022
Energy        2.646±0.0813    1.185±0.1242    1.729±0.0464
Kin8nm        0.099±0.0009    0.091±0.0015    0.096±0.0008
Naval         0.005±0.0005    0.001±0.0001    0.006±0.0000
Power Plant   4.327±0.0352    4.182±0.0402    4.116±0.0332
Protein       4.842±0.0305    4.539±0.0288    4.731±0.0129
Wine          0.646±0.0081    0.645±0.0098    0.635±0.0078
Yacht         6.887±0.6749    1.182±0.1645    0.922±0.0514
Year          9.034±NA        8.932±NA        8.881±NA
# Wins        0               4               6


Average Training Time in Seconds

PBP does not need to optimize learning rates or regularization parameters and is run only once.

Table: Average running time in seconds.

Problem       VI        BP       PBP
Boston        1035      677      13
Concrete      1085      758      24
Energy        2011      675      19
Kin8nm        5604      2001     156
Naval         8373      2351     220
Power Plant   2955      2114     178
Protein       7691      4831     485
Wine          1195      917      50
Yacht         954       626      12
Year          142,077   65,131   6119

These results are for the Theano implementation of PBP. C code for PBP based on OpenBLAS is about 5 times faster.


Results with Multiple Hidden Layers

Performance of networks with up to 4 hidden layers.

Same number of units in each hidden layer as before.

Table: Average test RMSE.

                           BP                                              PBP
Dataset        1 Layer     2 Layers    3 Layers    4 Layers    1 Layer     2 Layers    3 Layers    4 Layers
Boston         3.23±0.20   3.18±0.24   3.02±0.18   2.87±0.16   3.01±0.18   2.80±0.16   2.94±0.16   3.09±0.15
Concrete       5.98±0.22   5.40±0.13   5.57±0.13   5.53±0.14   5.67±0.09   5.24±0.12   5.73±0.11   5.96±0.16
Energy         1.18±0.12   0.68±0.04   0.63±0.03   0.67±0.03   1.80±0.05   0.90±0.05   1.24±0.06   1.18±0.06
Kin8nm         0.09±0.00   0.07±0.00   0.07±0.00   0.07±0.00   0.10±0.00   0.07±0.00   0.07±0.00   0.07±0.00
Naval          0.00±0.00   0.00±0.00   0.00±0.00   0.00±0.00   0.01±0.00   0.00±0.00   0.01±0.00   0.00±0.00
Power Plant    4.18±0.04   4.22±0.07   4.11±0.04   4.18±0.06   4.12±0.03   4.03±0.03   4.06±0.04   4.08±0.04
Protein        4.54±0.02   4.18±0.03   4.02±0.03   3.95±0.02   4.69±0.01   4.24±0.01   4.10±0.02   3.98±0.03
Wine           0.65±0.01   0.65±0.01   0.65±0.01   0.65±0.02   0.63±0.01   0.64±0.01   0.64±0.01   0.64±0.01
Yacht          1.18±0.16   1.54±0.19   1.11±0.09   1.27±0.13   1.01±0.05   0.85±0.05   0.89±0.10   1.71±0.23
Year           8.93±NA     8.98±NA     8.93±NA     9.04±NA     8.87±NA     8.92±NA     8.87±NA     8.93±NA
# Wins         0           1           1           1           2           5           0           0


Active Learning

[Figure and equations omitted: active-learning objective based on the information-based criterion of MacKay [1992a], with the annotation "Constant!"; see also MacKay [1992b] and Jylänki et al. [2014].]


Summary

PBP...

• produces state-of-the-art scalable Bayesian inference in NNs.

• works similarly to backpropagation.

• does not require hyper-parameter tuning.

• performs similarly to BP tuned with Bayesian optimization, at a lower cost.

• produces accurate uncertainty estimates.

Fast C and Theano code available at https://github.com/HIPS

Thank you for your attention!


References I

A. Graves. Practical variational inference for neural networks. In Advances in Neural Information Processing Systems 24, pages 2348–2356. Curran Associates, Inc., 2011.

P. Jylänki, A. Nummenmaa, and A. Vehtari. Expectation propagation for neural networks with sparsity-promoting priors. The Journal of Machine Learning Research, 15(1):1849–1901, 2014.

D. J. C. MacKay. Information-based objective functions for active data selection. Neural Computation, 4(4):590–604, 1992a.

D. J. C. MacKay. A practical Bayesian framework for backpropagation networks. Neural Computation, 4(3):448–472, 1992b.
