Probabilistic Backpropagation for Scalable Learning of Bayesian Neural Networks

José Miguel Hernández-Lobato

joint work with Ryan P. Adams

July 9, 2015


Motivation

• Multilayer neural networks trained with backpropagation have state-of-the-art results in many regression problems, but they...

• Require tuning of hyper-parameters.
• Can be affected by overfitting problems.
• Lack estimates of uncertainty in their predictions.

In principle, the Bayesian approach can solve these problems, but most Bayesian methods lack scalability.

Probabilistic Multilayer Neural Networks

• $L$ layers with weight matrices $\mathcal{W} = \{W_l\}_{l=1}^{L}$ and network output $z_L$.

• ReLU activations for the hidden layers: $a(x) = \max(x, 0)$.

• The likelihood: $p(\mathbf{y} \mid \mathcal{W}, \mathbf{X}, \gamma) = \prod_{n=1}^{N} \mathcal{N}(y_n \mid z_L(\mathbf{x}_n \mid \mathcal{W}), \gamma^{-1})$, with each likelihood factor denoted $f_n$.

• The priors: $p(\mathcal{W} \mid \lambda) = \prod_{l=1}^{L} \prod_{i=1}^{V_l} \prod_{j=1}^{V_{l-1}+1} \mathcal{N}(w_{ij,l} \mid 0, \lambda^{-1})$, with each weight-prior factor denoted $g_k$,
  $p(\lambda) = \text{Gamma}(\lambda \mid \alpha_0^\lambda, \beta_0^\lambda) \equiv h$, $\quad p(\gamma) = \text{Gamma}(\gamma \mid \alpha_0^\gamma, \beta_0^\gamma) \equiv s$.

The posterior approximation is

$q(\mathcal{W}, \gamma, \lambda) = \left[ \prod_{l=1}^{L} \prod_{i=1}^{V_l} \prod_{j=1}^{V_{l-1}+1} \mathcal{N}(w_{ij,l} \mid m_{ij,l}, v_{ij,l}) \right] \text{Gamma}(\gamma \mid \alpha^\gamma, \beta^\gamma)\, \text{Gamma}(\lambda \mid \alpha^\lambda, \beta^\lambda).$
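
To make the notation concrete, here is a minimal NumPy sketch (not the authors' implementation) of how the factorized Gaussian approximation q over the weights can be stored and used; the helper names, the unit initial variances, and the omitted fan-in scaling used in the paper are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_q(layer_sizes):
    """Factorized Gaussian q: one (mean, variance) pair per weight.
    layer_sizes e.g. [d, 50, 1]; the extra +1 column holds a bias weight."""
    q = []
    for v_in, v_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        q.append({"m": rng.standard_normal((v_out, v_in + 1)) / np.sqrt(v_in + 1),
                  "v": np.ones((v_out, v_in + 1))})
    return q

def sample_network(q):
    """Draw one set of weight matrices W ~ q."""
    return [layer["m"] + np.sqrt(layer["v"]) * rng.standard_normal(layer["m"].shape)
            for layer in q]

def z_L(W, x):
    """Deterministic network output z_L(x | W) with ReLU hidden layers."""
    z = x
    for l, W_l in enumerate(W):
        a = W_l @ np.append(z, 1.0)                  # append 1 for the bias column
        z = np.maximum(a, 0.0) if l < len(W) - 1 else a
    return z
```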


Probabilistic Backpropagation (PBP)

[Diagram omitted: the network output z_L and the modeling noise γ enter the likelihood factor; incorporating this factor into q requires its normalization constant Z, which is easy to compute if z_L is Gaussian distributed when the weights follow q.]
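
For reference, the generic Gaussian ADF/EP update that such a step relies on can be written with Minka-style moment-matching identities. The sketch below is a generic NumPy version, not the authors' exact code, and assumes the gradients of log Z with respect to the current mean and variance are already available (e.g. from the backward pass described later).

```python
import numpy as np

def adf_update(m, v, dlogZ_dm, dlogZ_dv):
    """Moment-matched update of a Gaussian factor N(w | m, v) after absorbing one
    likelihood factor with normalizer Z (Minka-style identities):
        m_new = m + v * dlogZ/dm
        v_new = v - v**2 * ((dlogZ/dm)**2 - 2 * dlogZ/dv)
    Arguments can be NumPy arrays of matching shape (one entry per weight)."""
    m_new = m + v * dlogZ_dm
    v_new = v - v ** 2 * (dlogZ_dm ** 2 - 2.0 * dlogZ_dv)
    return m_new, np.maximum(v_new, 1e-12)           # guard against tiny negative variances
```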


Forward Pass

Propagate distributions through the network, approximating them with Gaussians by moment matching.

Once we compute log Z, we obtain its gradients by backpropagation.
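
As an illustration of this moment-matching step, the following NumPy/SciPy sketch pushes means and variances through one linear layer with independent Gaussian weights and then through a ReLU; bias terms and the fan-in scaling used in the paper are omitted, and the function names are only illustrative.

```python
import numpy as np
from scipy.stats import norm

def linear_moments(m_z, v_z, m_w, v_w):
    """Mean/variance of a = W z when the weights (m_w, v_w) and inputs (m_z, v_z)
    are independent Gaussians; these two moments are exact for this layer."""
    m_a = m_w @ m_z
    v_a = v_w @ (m_z ** 2) + (m_w ** 2) @ v_z + v_w @ v_z
    return m_a, v_a

def relu_moments(m_a, v_a):
    """Gaussian moment matching of b = max(a, 0) when a ~ N(m_a, v_a)."""
    s = np.sqrt(v_a)
    alpha = m_a / s
    m_b = m_a * norm.cdf(alpha) + s * norm.pdf(alpha)
    e_b2 = (m_a ** 2 + v_a) * norm.cdf(alpha) + m_a * s * norm.pdf(alpha)
    return m_b, np.maximum(e_b2 - m_b ** 2, 1e-12)
```

Chaining these two functions layer by layer yields the Gaussian approximation to the output z_L whose normalizer log Z is differentiated in the backward pass.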


Backward Pass

Like in classic backpropagation, we obtain a recursion in terms of deltas:

$\delta_j^m = \dfrac{\partial \log Z}{\partial m_j^a} = \sum_{k \in O(j)} \left\{ \delta_k^m \dfrac{\partial m_k^a}{\partial m_j^a} + \delta_k^v \dfrac{\partial v_k^a}{\partial m_j^a} \right\},$

$\delta_j^v = \dfrac{\partial \log Z}{\partial v_j^a} = \sum_{k \in O(j)} \left\{ \delta_k^m \dfrac{\partial m_k^a}{\partial v_j^a} + \delta_k^v \dfrac{\partial v_k^a}{\partial v_j^a} \right\}.$

Can be automatically implemented with Theano or autograd.
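
Here is a small autograd sketch of that idea for a toy one-hidden-layer network with no bias terms; the dictionary keys and shapes are illustrative assumptions. `grad` returns the derivatives of log Z with respect to every weight mean and variance in q, which is what the update step needs.

```python
import autograd.numpy as np
from autograd.scipy.stats import norm
from autograd import grad

def forward_moments(q, x):
    """Moment-matching forward pass for a toy 1-hidden-layer network (no biases)."""
    m_a = q["m1"] @ x
    v_a = q["v1"] @ (x ** 2)                              # inputs are deterministic here
    s = np.sqrt(v_a)
    alpha = m_a / s
    m_z = m_a * norm.cdf(alpha) + s * norm.pdf(alpha)     # ReLU moment matching
    v_z = (m_a ** 2 + v_a) * norm.cdf(alpha) + m_a * s * norm.pdf(alpha) - m_z ** 2
    m_out = q["m2"] @ m_z
    v_out = q["v2"] @ (m_z ** 2) + (q["m2"] ** 2) @ v_z + q["v2"] @ v_z
    return m_out[0], v_out[0]

def log_Z(q, x, y, gamma):
    """log of Z = N(y | m_zL, v_zL + 1/gamma) for one data point."""
    m_zL, v_zL = forward_moments(q, x)
    var = v_zL + 1.0 / gamma
    return -0.5 * np.log(2.0 * np.pi * var) - 0.5 * (y - m_zL) ** 2 / var

grad_log_Z = grad(log_Z)   # derivatives w.r.t. every entry of q, computed by backprop
```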


Variational Inference

Graves [2011] proposes a variational method for Bayesian neural networks that optimizes the ELBO

$\mathcal{L}_\mathcal{D}(\phi) - D_{\mathrm{KL}}[q_\phi(w) \,\|\, p(w)]$

where

$\mathcal{L}_\mathcal{D}(\phi) = \sum_{(\mathbf{x}_n, y_n) \in \mathcal{D}} \mathbb{E}_{q_\phi(w)}[\log p(y_n \mid w, \mathbf{x}_n)].$

A doubly stochastic approximation to $\mathcal{L}_\mathcal{D}(\phi)$ is obtained as

$\mathcal{L}_\mathcal{D}(\phi) \approx |\mathcal{D}| \log p(y_n \mid \mathbf{x}_n, w'),$

where $w' \sim q_\phi(w)$. SGD can then be used to optimize the ELBO.

However, sampling from $q_\phi(w)$ introduces additional noise that significantly reduces the convergence speed.
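
A sketch of this doubly stochastic estimator for a fully factorized Gaussian q_phi with means m and log-variances s; the reparameterized sampling, the closed-form KL to a standard normal prior, and the log_lik argument (the model's log-likelihood for one data point) are illustrative assumptions rather than Graves' exact formulation.

```python
import numpy as np

rng = np.random.default_rng(0)

def kl_to_standard_normal(m, s):
    """Closed-form KL[ N(m, exp(s)) || N(0, 1) ], summed over all weights."""
    return 0.5 * np.sum(np.exp(s) + m ** 2 - 1.0 - s)

def elbo_estimate(m, s, x_n, y_n, n_data, log_lik):
    """Doubly stochastic ELBO estimate: one data point (x_n, y_n) and one sample
    w' ~ q_phi(w), with the likelihood term scaled up by the data set size |D|."""
    w = m + np.exp(0.5 * s) * rng.standard_normal(m.shape)    # w' ~ q_phi(w)
    return n_data * log_lik(y_n, x_n, w) - kl_to_standard_normal(m, s)
```

An SGD step then follows the stochastic gradient of this estimate with respect to (m, s).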


Results on Toy Dataset

40 training epochs.

100 hidden units.

BP and VI use SGD.

BP and VI tuned with Bayesian optimization (www.whetlab.com).

[Figure omitted: predictions on the toy dataset, four panels; x-axis from −6 to 4, y-axis from −50 to 50.]


Exhaustive Evaluation on 10 Datasets

Table: Characteristics of the analyzed data sets.

Dataset                          N        d
Boston Housing                   506      13
Concrete Compression Strength    1030     8
Energy Efficiency                768      8
Kin8nm                           8192     8
Naval Propulsion                 11,934   16
Combined Cycle Power Plant       9568     4
Protein Structure                45,730   9
Wine Quality Red                 1599     11
Yacht Hydrodynamics              308      6
Year Prediction MSD              515,345  90

Always 50 hidden units, except in Year and Protein where we use 100. 40 training epochs.


Average Test RMSE

Table: Average test RMSE and standard errors.

Dataset       VI              BP              PBP
Boston        4.320±0.2914    3.228±0.1951    3.010±0.1850
Concrete      7.128±0.1230    5.977±0.2207    5.552±0.1022
Energy        2.646±0.0813    1.185±0.1242    1.729±0.0464
Kin8nm        0.099±0.0009    0.091±0.0015    0.096±0.0008
Naval         0.005±0.0005    0.001±0.0001    0.006±0.0000
Power Plant   4.327±0.0352    4.182±0.0402    4.116±0.0332
Protein       4.842±0.0305    4.539±0.0288    4.731±0.0129
Wine          0.646±0.0081    0.645±0.0098    0.635±0.0078
Yacht         6.887±0.6749    1.182±0.1645    0.922±0.0514
Year          9.034±NA        8.932±NA        8.881±NA
# Wins        0               4               6


Average Training Time in Seconds

PBP does not need to optimize learning rates or regularization parameters and is run only once.

Table: Average running time in seconds.

Problem       VI        BP       PBP
Boston        1035      677      13
Concrete      1085      758      24
Energy        2011      675      19
Kin8nm        5604      2001     156
Naval         8373      2351     220
Power Plant   2955      2114     178
Protein       7691      4831     485
Wine          1195      917      50
Yacht         954       626      12
Year          142,077   65,131   6119

These results are for the Theano implementation of PBP. C code for PBP based on OpenBLAS is about 5 times faster.


Results with Multiple Hidden Layers

Performance of networks with up to 4 hidden layers.

Same number of units in each hidden layer as before.

Table: Average test RMSE.

                           BP                                              PBP
Dataset        1 Layer     2 Layers    3 Layers    4 Layers    1 Layer     2 Layers    3 Layers    4 Layers
Boston         3.23±0.20   3.18±0.24   3.02±0.18   2.87±0.16   3.01±0.18   2.80±0.16   2.94±0.16   3.09±0.15
Concrete       5.98±0.22   5.40±0.13   5.57±0.13   5.53±0.14   5.67±0.09   5.24±0.12   5.73±0.11   5.96±0.16
Energy         1.18±0.12   0.68±0.04   0.63±0.03   0.67±0.03   1.80±0.05   0.90±0.05   1.24±0.06   1.18±0.06
Kin8nm         0.09±0.00   0.07±0.00   0.07±0.00   0.07±0.00   0.10±0.00   0.07±0.00   0.07±0.00   0.07±0.00
Naval          0.00±0.00   0.00±0.00   0.00±0.00   0.00±0.00   0.01±0.00   0.00±0.00   0.01±0.00   0.00±0.00
Power Plant    4.18±0.04   4.22±0.07   4.11±0.04   4.18±0.06   4.12±0.03   4.03±0.03   4.06±0.04   4.08±0.04
Protein        4.54±0.02   4.18±0.03   4.02±0.03   3.95±0.02   4.69±0.01   4.24±0.01   4.10±0.02   3.98±0.03
Wine           0.65±0.01   0.65±0.01   0.65±0.01   0.65±0.02   0.63±0.01   0.64±0.01   0.64±0.01   0.64±0.01
Yacht          1.18±0.16   1.54±0.19   1.11±0.09   1.27±0.13   1.01±0.05   0.85±0.05   0.89±0.10   1.71±0.23
Year           8.93±NA     8.98±NA     8.93±NA     9.04±NA     8.87±NA     8.92±NA     8.87±NA     8.93±NA
# Wins         0           1           1           1           2           5           0           0


Active Learning

[Figure and equations omitted: active-learning objective based on the information-based criterion of MacKay [1992a], with the annotation "Constant!"; see also MacKay [1992b] and Jylänki et al. [2014].]


Summary

PBP...

• produces state-of-the-art scalable Bayesian inference in NNs.

• works similarly to backpropagation.

• does not require hyper-parameter tuning.

• performs similarly to BP tuned with Bayesian optimization, at a lower cost.

• produces accurate uncertainty estimates.

Fast C and Theano code available at https://github.com/HIPS

Thank you for your attention!


References I

A. Graves. Practical variational inference for neural networks. In Advances in Neural Information Processing Systems 24, pages 2348–2356. Curran Associates, Inc., 2011.

P. Jylänki, A. Nummenmaa, and A. Vehtari. Expectation propagation for neural networks with sparsity-promoting priors. The Journal of Machine Learning Research, 15(1):1849–1901, 2014.

D. J. C. MacKay. Information-based objective functions for active data selection. Neural Computation, 4(4):590–604, 1992a.

D. J. C. MacKay. A practical Bayesian framework for backpropagation networks. Neural Computation, 4(3):448–472, 1992b.
