
Hands-on Bayesian Neural Networks – a Tutorial for Deep Learning Users

Laurent Valentin Jospin1, Hamid Laga2, Farid Boussaid1, Wray Buntine3, Mohammed Bennamoun1, Senior Member, IEEE,

1University of Western Australia, 2Murdoch University, 3Monash University
[email protected], [email protected], {farid.boussaid,mohammed.bennamoun}@uwa.edu.au, [email protected]

Abstract—Modern deep learning methods constitute incredibly powerful tools to tackle a myriad of challenging problems. However, since deep learning methods operate as black boxes, the uncertainty associated with their predictions is often challenging to quantify. Bayesian statistics offer a formalism to understand and quantify the uncertainty associated with deep neural network predictions. This tutorial provides an overview of the relevant literature and a complete toolset to design, implement, train, use and evaluate Bayesian Neural Networks, i.e., Stochastic Artificial Neural Networks trained using Bayesian methods.

Index Terms—Bayesian methods, Bayesian Deep Learning, Bayesian neural networks, Approximate Bayesian methods

I. INTRODUCTION

Deep learning has led to a revolution in machine learning, providing solutions to tackle ever more challenging problems. However, deep learning models are prone to overfitting, which adversely affects their generalization capabilities [1]. They also tend to be overconfident about their predictions when they provide a confidence interval. This is problematic for applications where silent failure can lead to dramatic outcomes, e.g., self driving cars [2], medical diagnosis [3] or finance [4]. Consequently, many approaches have been proposed to mitigate this risk [5]. Among them, the Bayesian paradigm provides a rigorous framework to analyse and train uncertainty-aware neural networks, and more generally, to support the development of learning algorithms.

The Bayesian paradigm in statistics contrasts with the frequentist paradigm, with a major area of distinction being in hypothesis testing [6]. It is based on two simple ideas: (1) that probability is a measure of belief in the occurrence of events, rather than the limit in the frequency of occurrence when the number of samples goes towards infinity as assumed in the frequentist paradigm, and (2) that prior beliefs influence posterior beliefs. Bayes' theorem, which states that:

P(H|D) = P(D|H)P(H) / P(D) = P(D,H) / ∫_H P(D,H′) dH′,   (1)

summarizes this interpretation. This formula is still true in the frequentist interpretation, where H and D are considered as sets of outcomes. The Bayesian interpretation considers H to be a hypothesis, about which one holds some prior belief, and D to be some data, that will update one's belief about H. P(D|H) is called the likelihood. It encodes the aleatoric uncertainty in the model, i.e., the uncertainty due to the noise in the process. P(H) is the prior and P(D) = ∫_H P(D,H′) dH′ the evidence. P(H|D) is called the posterior. It encodes the epistemic uncertainty, i.e., the uncertainty due to the lack of data. P(D|H)P(H) = P(D,H) is the joint probability of D and H.

Using Bayes' formula to train a predictor can be understood as learning from the data D. In other words, the Bayesian paradigm does not only offer a solid approach for the quantification of uncertainty in deep learning models. It also provides a mathematical framework to understand many regularization techniques and learning strategies that are already used in classic deep learning [7] (Sec. IV-C3).

There is a rich literature about Bayesian neural networks (BNNs), i.e., stochastic neural networks trained using a Bayesian approach, or the larger field of Bayesian (deep) learning [8, 9, 10]. However, navigating through this literature is challenging without some prior background in Bayesian statistics. This brings an additional layer of complexity for deep learning practitioners interested in building and using BNNs.

This paper, conceived as a tutorial, presents a unified workflow to design, implement, train and evaluate a BNN (Fig. 1). It also provides an overview of the relevant literature. A large number of approaches have been developed to efficiently train and use BNNs. A good knowledge of those different methods is a prerequisite for an efficient use of BNNs in big data applications of deep learning.

The remaining parts of this paper are organised as follows. Section II introduces the concept of a BNN. Section III presents the motivations for and applications of BNNs. Section IV explains how to design the stochastic model associated with a BNN. Section V explores the most important algorithms used for Bayesian inference and how they were adapted for deep learning. Section VI reviews BNN simplification methods. Section VII presents the methods used to evaluate the performance of a BNN. Finally, Section VIII concludes the paper.

II. WHAT IS A BAYESIAN NEURAL NETWORK?

A BNN is defined slightly differently across the literature, but a common definition is that a BNN is a stochastic artificial neural network trained using Bayesian inference.

The goal of Artificial Neural Networks (ANNs) is to represent an arbitrary function y = Φ(x).


Fig. 1: Workflow to design (a), train (b) and use a BNN for predictions (c). Panel (a): the stochastic model (prior, variational posterior) and the functional model. Panel (b): inference (training) via MCMC (Metropolis-Hastings, HMC, NUTS, SGLD, ...) or variational inference (Bayes by backprop, MC-Dropout, deep ensembles, ...). Panel (c): posterior, marginal, uncertainty, MAP and summary, computed from the training data and a new input.

Traditional ANNs such as feedforward networks and recurrent networks are built using a succession of one input layer, a number of hidden layers and one output layer. In the simplest architecture of feedforward networks, each layer l is represented as a linear transformation, followed by a nonlinear operation s, also known as an activation function:

l_0 = x,
l_i = s_i(W_i l_{i−1} + b_i)   ∀ i ∈ [1, n],
y = l_n.   (2)

A given ANN architecture represents a set of functions isomorphic to the set of possible parameters θ = (W, b), where W are the weights of the network and b the biases. Deep learning is the process of regressing the parameters θ from the training data D. D is composed of a series of inputs x and their corresponding labels y. The standard approach is to approximate a minimal cost point estimate of the network parameters θ, i.e., a single value for each parameter (Fig. 2a), using the back-propagation algorithm, with all other possible parametrizations of the network discarded. The cost function is often defined as the log likelihood of the training set, sometimes with a regularization term included. From a statistician's point of view, this is a Maximum Likelihood Estimation (MLE), or a Maximum A Posteriori (MAP) estimation when regularization is used.
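To make Equation (2) concrete, the following sketch implements the functional model of a small point-estimate feedforward network in NumPy; the layer sizes, the tanh activation and the random parameter values are illustrative assumptions, not choices prescribed by the text:

import numpy as np

def feedforward(x, weights, biases, activations):
    # Equation (2): l_0 = x, l_i = s_i(W_i l_{i-1} + b_i), y = l_n
    l = x
    for W, b, s in zip(weights, biases, activations):
        l = s(W @ l + b)
    return l

# Illustrative network: 2 inputs, 3 hidden units, 1 output
rng = np.random.default_rng(0)
theta_W = [rng.normal(size=(3, 2)), rng.normal(size=(1, 3))]
theta_b = [np.zeros(3), np.zeros(1)]
acts = [np.tanh, lambda z: z]  # identity on the output layer

y = feedforward(np.array([0.5, -1.2]), theta_W, theta_b, acts)

A point estimate network keeps a single value of (W, b); the Bayesian treatment discussed below replaces it with a distribution.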

The point estimate approach, which is the traditional approach in deep learning, is relatively easy to deploy with modern algorithms and software packages, but tends to lack explainability. The final model might also generalize in unforeseen and overconfident ways on out-of-training-distribution data points [11, 12]. This property, in addition to the inability of ANNs to say "I don't know", is problematic for many critical applications. Of all the techniques that exist to mitigate this [13], stochastic neural networks have proved to be one of the most generic and flexible.

Stochastic neural networks are a type of ANN built by introducing stochastic components into the network. This is performed by giving the network either a stochastic activation (Fig. 2b) or stochastic weights (Fig. 2c) in order to simulate multiple possible models θ with their associated probability distribution p(θ). Thus, they can be considered as a special case of ensemble learning [14].

The main motivation behind ensemble learning comes from the observation that aggregating the predictions of a large set of average-performing but independent predictors can lead to better predictions than a single well-performing expert predictor [15].

Fig. 2: Point estimate neural network (a), stochastic neural network with a probability distribution for the activations (b), and with a probability distribution over the weights (c).

Stochastic neural networks might improve their performance over their point-estimate counterparts in a similar fashion, but this is not their main aim. Rather, the main goal of using a stochastic neural network architecture is to obtain a better idea of the uncertainty associated with the underlying processes. This is accomplished by comparing the predictions of multiple sampled model parametrizations θ. If the different models agree, then the uncertainty is low. If they disagree, then the uncertainty is high. This process can be summarized as follows:

θ ∼ p(θ),
y = Φ_θ(x) + ε,   (3)

where ε represents random noise to account for the fact that the function Φ is only an approximation.

A BNN can then be defined as any stochastic artificial neural network trained using Bayesian inference [16]. To design a BNN, the first step is the choice of a deep neural network architecture, i.e., a functional model. Then, one has to choose a stochastic model, i.e., a prior distribution over the possible model parametrizations p(θ) and a prior confidence in the predictive power of the model p(y|x,θ) (Fig. 1a). The model parametrization can be considered to be the hypothesis H and the training set is the data D. The choice of a BNN's stochastic model is somehow equivalent to the choice of a loss function when training a point estimate neural network, see Section IV-C3. In the rest of this paper, we will denote the model parameters by θ, the training set by D, the training inputs by D_x, and the training labels by D_y. By applying Bayes' theorem, and enforcing independence between the model parameters and the inputs, the Bayesian posterior can then be written as:

p(θ|D) = p(D_y|D_x,θ) p(θ) / ∫_θ p(D_y|D_x,θ′) p(θ′) dθ′ ∝ p(D_y|D_x,θ) p(θ).   (4)


Computing this distribution, and sampling it using standard methods, is usually an intractable problem, especially since computing the evidence ∫_θ p(D_y|D_x,θ′) p(θ′) dθ′ is hard. To address this, two broad approaches are available: Markov Chain Monte Carlo and Variational Inference. These are presented in more detail in Section V.

When using a BNN for prediction, the distribution in which one is really interested is the marginal probability distribution p(y|x,D) [17], which quantifies the model's uncertainty on its prediction. Given p(θ|D), p(y|x,D) can be computed as:

p(y|x,D) = ∫_θ p(y|x,θ′) p(θ′|D) dθ′.   (5)

In practice, p(y|x,D) is sampled indirectly using Equation (3). The final prediction is summarized by a few statistics computed using a Monte-Carlo approach (Fig. 1c). Algorithm 1 summarizes the inference process for a BNN.

Algorithm 1 Inference procedure for a BNN
Define p(θ|D) = p(D_y|D_x,θ) p(θ) / ∫_θ p(D_y|D_x,θ′) p(θ′) dθ′
for i = 0 to N do
    Draw θ_i ∼ p(θ|D)
    y_i = Φ_{θ_i}(x)
end for
return Y = {y_i | i ∈ [0, N)}, Θ = {θ_i | i ∈ [0, N)}

In Algorithm 1, Y is a set of samples from p(y|x,D) and Θ a collection of samples from p(θ|D). Usually, aggregates are computed on those samples to summarize the uncertainty of the BNN and get an estimator for the output y. This estimator is designated as ŷ.

To summarize the predictions of a BNN used to perform regression, the procedure that is usually used is model averaging [18]:

ŷ = (1/|Θ|) ∑_{θ_i∈Θ} Φ_{θ_i}(x).   (6)

This approach is so common in ensemble learning that it is sometimes called ensembling. To quantify uncertainty, the covariance matrix can be computed as follows:

Σ_{y|x,D} = (1/(|Θ|−1)) ∑_{θ_i∈Θ} (Φ_{θ_i}(x) − ŷ)(Φ_{θ_i}(x) − ŷ)ᵀ.   (7)

When performing classification, the average model prediction will give the relative probability of each class, which can be considered as a measure of uncertainty:

p̂ = (1/|Θ|) ∑_{θ_i∈Θ} Φ_{θ_i}(x).   (8)

The final prediction is taken as the most likely class:

ŷ = arg max_i  p̂_i ∈ p̂.   (9)

If the cost of a false positive varies across different classes,the minimal risk prediction should be selected instead.
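Given an array of predictions produced by the sampled parametrizations Θ of Algorithm 1, Equations (6) to (9) amount to a few lines of NumPy; the dummy prediction arrays below are placeholders for the outputs of an actual BNN:

import numpy as np

def summarize_regression(preds):
    # preds: shape (num_samples, output_dim), one row per Phi_{theta_i}(x)
    y_hat = preds.mean(axis=0)                      # Equation (6): model averaging
    centered = preds - y_hat
    cov = centered.T @ centered / (len(preds) - 1)  # Equation (7): covariance matrix
    return y_hat, cov

def summarize_classification(probs):
    # probs: shape (num_samples, num_classes), one row of class probabilities per sample
    p_hat = probs.mean(axis=0)                      # Equation (8): averaged probabilities
    return int(np.argmax(p_hat)), p_hat             # Equation (9): most likely class

# Dummy usage with 100 sampled models and a two-dimensional output
preds = np.random.default_rng(1).normal(size=(100, 2))
y_hat, cov = summarize_regression(preds)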

This definition considers BNNs as discriminative models, i.e., models that aim to reconstruct a target variable y given observations x. This excludes generative models, even if there exist examples of generative ANNs based on the Bayesian formalism, e.g., VAEs [19]. Those are out of the scope of this tutorial.

III. ADVANTAGES OF BAYESIAN METHODS FOR DEEP LEARNING

One of the major criticisms of Bayesian methods is that they rely on prior knowledge. This is especially true in deep learning, as getting any insight about plausible parametrizations for a given model before training is very hard. Thus, why use Bayesian methods for deep learning? Discriminative models implicitly represent the conditional probability p(y|x,θ), and Bayes' formula is an appropriate tool to invert conditional probabilities, even if one has little insight about p(θ) a priori. While there are strong theoretical principles and schema upon which this use of Bayes' formula can be based [20], this section focuses on some practical benefits of using BNNs.

First, Bayesian methods provide a natural approach to quantify uncertainty in deep learning, since BNNs have better calibration than classical neural networks [21, 22, 23], i.e., their uncertainty is more consistent with the observed errors. They are less often overconfident or underconfident.

Second, a BNN allows distinguishing between the epistemic uncertainty p(θ|D) and the aleatoric uncertainty p(y|x,θ) [24, 25]. This makes BNNs very data-efficient, since they can learn from a small dataset without overfitting. At prediction time, out-of-training-distribution points will have a high epistemic uncertainty.

Third, the no-free-lunch theorem for machine learning [26] can be interpreted as saying that any supervised learning algorithm includes some implicit prior. Bayesian methods, when used correctly, will at least make the prior explicit. Integrating prior knowledge into ANNs, which work as black boxes, is hard, but it is not impossible. In Bayesian deep learning, priors are often considered as soft constraints, like regularization, or via data transformations like data augmentation. Most regularization methods used for point-estimate neural networks can be understood from a Bayesian perspective as setting a prior, see Section IV-C3.

Finally, the Bayesian paradigm enables the analysis of learning methods. A number of methods initially not presented as Bayesian can be implicitly understood as being approximate Bayesian, e.g., regularization (Section IV-C3) or ensembling (Section V-E2b). In fact, most of the BNNs used in practice rely on methods that are approximately or implicitly Bayesian (Sec. V-E), since the exact algorithms are too expensive. The Bayesian paradigm also provides a systematic framework to design new learning and regularization strategies, even for point-estimate models.

BNNs have been used in many fields to quantify uncertainty, e.g., in computer vision [27], network traffic monitoring [28], aviation [29], civil engineering [30, 31], hydrology [32], astronomy [33], electronics [34], and medicine [35]. BNNs are also useful in (1) active learning [36, 37], where an oracle (e.g., a human annotator, a crowd, an expensive algorithm) can label new points from an unlabeled dataset U and the model has to decide which points should be submitted to the oracle to maximise its performance while minimizing the calls to the oracle, and (2) online learning [38], where the model is retrained multiple times as new data become available. For active learning, data points in the training set with high epistemic uncertainty are scheduled to be labelled with higher priority (see Algorithm 2), while for online learning, previous posteriors can be recycled as priors when new data becomes available to avoid the so-called problem of catastrophic forgetting [39] (see Algorithm 3).


Algorithm 2 Active learning loop with a BNN
while U ≠ ∅ and Σ_{y|x_max,D} < threshold and C < MaxC do
    Draw Θ = {θ_i ∼ p(θ|D) | i ∈ [0, N)}
    for x ∈ U do
        Σ_{y|x,D} = (1/(|Θ|−1)) ∑_{θ_i∈Θ} (Φ_{θ_i}(x) − ŷ)(Φ_{θ_i}(x) − ŷ)ᵀ
        if Σ_{y|x,D} > Σ_{y|x_max,D} then
            x_max = x
        end if
    end for
    D_x = D_x ∪ {x_max}
    D_y = D_y ∪ {Oracle(x_max)}
    U = U \ {x_max}
    C = C + 1
end while

Algorithm 3 Online learning loop with a BNN
Define p(θ)_0 = p(θ)
while true do
    Define p(θ|D_i) = p(D_{y,i}|D_{x,i},θ) p(θ)_i / ∫_θ p(D_{y,i}|D_{x,i},θ′) p(θ′)_i dθ′
    Define p(θ)_{i+1} = p(θ|D_i)
end while

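As an illustration of the acquisition step in Algorithm 2, the sketch below scores every unlabeled input by its predictive covariance and returns the most uncertain one. The covariance comparison of Algorithm 2 is made concrete here by comparing traces, one possible scalar summary, and predict_samples is a hypothetical helper returning the N sampled predictions Φ_{θ_i}(x):

import numpy as np

def most_uncertain_point(unlabeled_inputs, predict_samples):
    # predict_samples(x) -> array of shape (N, output_dim), one row per sampled theta_i
    best_x, best_score = None, -np.inf
    for x in unlabeled_inputs:
        preds = predict_samples(x)
        centered = preds - preds.mean(axis=0)
        cov = centered.T @ centered / (len(preds) - 1)   # Sigma_{y|x,D} as in Equation (7)
        score = float(np.trace(cov))                     # scalar summary of the covariance
        if score > best_score:
            best_x, best_score = x, score
    return best_x, best_score                            # x_max, to be sent to the oracle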

IV. SETTING THE STOCHASTIC MODEL FOR A BAYESIAN NEURAL NETWORK

Designing a BNN requires choosing a functional model and a stochastic model. This tutorial will not cover the design of the functional model, as almost any model used for point estimate networks can be used as a functional model for a BNN and a rich literature on the subject exists already, see for example [40]. Instead, this section will focus on how to design the stochastic model. Section IV-A presents Probabilistic Graphical Models (PGMs) as a tool used to represent the relationships between the model's stochastic variables. Section IV-B details how to derive the posterior for a BNN from its PGM. Section IV-C discusses how to choose the probability laws used as priors. Finally, Section IV-D presents how the choice of a PGM can change the degree of supervision or embed other forms of prior knowledge into the model.

A. Probabilistic graphical models

Probabilistic Graphical Models (PGMs) use graphs to represent the interdependence of multivariate stochastic variables and decompose their probability distributions accordingly. PGMs cover a large variety of models. The kind of PGMs this tutorial focuses on are Bayesian Belief Networks (BBNs), which are PGMs whose graphs are acyclic and directed. We refer the reader to [41] for more details on how to represent learning algorithms using general PGMs.

Fig. 3: PGM symbols. Observed variables are in colored circles (a), unobserved variables are in white circles (b), deterministic variables are in dashed circles (c) and parameters are in rectangles (d). Plates, represented as a rectangle around a subgraph, indicate multiple independent instances of the subgraph for a batch of variables B (e).


In a PGM, variables v_i are the nodes in the graph. Different symbols are used to distinguish the nature of the considered variables (Fig. 3). A directed link (i.e., the only type of link allowed in a BBN) means that the probability distribution of the target variable is defined conditioned on the source variable. The fact that the BBN is acyclic allows the computation of the joint probability distribution of all the variables v_i in the graph:

p(v_1, ..., v_n) = ∏_{i=1}^{n} p(v_i | parents(v_i)).   (10)

The type of distribution used to define all the conditional probabilities p(v_i|parents(v_i)) will depend on the context. Once the p(v_i|parents(v_i)) are defined, the BBN describes a data generation process. Parents are sampled before their children. This is always possible since the graph is acyclic. All the variables together represent a sample from the joint probability distribution p(v_1, ..., v_n).

Models usually learn from multiple examples sampled from the same distribution. To highlight this fact, the plate notation (Fig. 3e) has been introduced. A plate indicates that the variables (v_1, ..., v_n) in the subgraph encapsulated by the plate are copied along a given batch dimension. A plate implies independence between all the duplicated nodes. This fact can be exploited to compute the joint probability of a batch B = {(v_1, ..., v_n)_b : b = 1, ..., |B|} as:

p(B) = ∏_{(v_1,...,v_n)∈B} p(v_1, ..., v_n).   (11)

In a PGM, the observed variables, depicted in Fig. 3a using colored circles, are treated as the data. The unobserved, also called latent, variables, represented by white circles in Fig. 3b, are treated as the hypothesis. From the joint probability derived from the PGM, defining the posterior for the latent variables given the observed variables is straightforward using Bayes' formula:

p(v_latent|v_obs) ∝ p(v_obs, v_latent).   (12)

The joint distribution p(v_obs, v_latent) is then used by the different inference algorithms (see Section V).
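For readers less familiar with PGMs, the toy sketch below applies Equation (10) to a two-node chain θ → y: parents are sampled before their children (ancestral sampling) and the joint density is the product of the conditional densities. The Gaussian choices are purely illustrative:

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# Toy BBN: theta -> y, with p(theta) = N(0, 1) and p(y|theta) = N(theta, 0.5^2)
def sample_joint():
    theta = rng.normal(0.0, 1.0)   # sample the parent first
    y = rng.normal(theta, 0.5)     # then the child, conditioned on the parent
    return theta, y

def log_joint(theta, y):
    # Equation (10): p(theta, y) = p(theta) * p(y|theta)
    return norm.logpdf(theta, loc=0.0, scale=1.0) + norm.logpdf(y, loc=theta, scale=0.5)

theta, y = sample_joint()
print(log_joint(theta, y))   # treating theta as latent, Equation (12) gives p(theta|y) up to a constant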


Fig. 4: The BBNs corresponding to (a) coefficients as stochastic variables and (b) activations as stochastic variables.

B. Defining the stochastic model of a BNN from a PGM

Consider the two models presented in Fig. 4, with both the BNN and the corresponding BBN depicted. The BNN with stochastic weights (Fig. 4a), if meant to do regression, could represent the following data generation process:

θ ∼ p(θ) = N(µ, Σ),
y ∼ p(y|x,θ) = N(Φ_θ(x), Σ).   (13)

The choice of using normal laws N(µ, Σ) is arbitrary, but common in practice for its good mathematical properties.

For classification, the model would have a categorical law Cat(p_i) to sample the prediction instead:

θ ∼ p(θ) = N(µ, Σ),
y ∼ p(y|x,θ) = Cat(Φ_θ(x)).   (14)

Then, one can use the fact that multiple data points from the training set are independent, as indicated by the plate notation in Fig. 4, to write the probability of the training set as:

p(D_y|D_x,θ) = ∏_{(x,y)∈D} p(y|x,θ).   (15)
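The data generation process of Equations (13) and (15) can be simulated directly. Below is a minimal sketch for a tiny regression network with an isotropic Gaussian prior on the weights and a fixed observation noise; the network shape and both standard deviations are illustrative assumptions:

import numpy as np

rng = np.random.default_rng(0)

def phi(x, theta):
    # A tiny functional model: one hidden tanh layer, scalar output
    W1, b1, W2, b2 = theta
    return W2 @ np.tanh(W1 @ x + b1) + b2

def sample_theta(prior_std=1.0):
    # Equation (13): theta ~ N(0, prior_std^2 I)
    return (rng.normal(0, prior_std, size=(3, 2)), rng.normal(0, prior_std, size=3),
            rng.normal(0, prior_std, size=(1, 3)), rng.normal(0, prior_std, size=1))

def sample_y(x, theta, noise_std=0.1):
    # Equation (13): y ~ N(Phi_theta(x), noise_std^2 I)
    return phi(x, theta) + rng.normal(0, noise_std, size=1)

# Equation (15): the likelihood of the training set factorizes over independent pairs (x, y)
theta = sample_theta()
D_x = rng.normal(size=(5, 2))
D_y = np.array([sample_y(x, theta) for x in D_x])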

In the case of stochastic activations (Fig. 4b), the data generation process might become:

l_0 = x,
l_i ∼ p(l_i|l_{i−1}) = s_i(N(W_i l_{i−1} + b_i, Σ))   ∀ i ∈ [1, n],
y = l_n.   (16)

The formulation of the joint probability is slightly more complex, as we have to account for the chain of dependencies spanned by the BBN over the multiple latent variables l_{[1,n−1]}:

p(D_y, l_{[1,n−1]}|D_x) = ∏_{(l_0,l_n)∈D} ( ∏_{i=1}^{n} p(l_i|l_{i−1}) ).   (17)

It is sometimes possible, and often desirable, to define p(l_i|l_{i−1}) such that the BNN described in Fig. 4a and the one described in Fig. 4b can be considered equivalent. For instance, sampling l as:

W ∼ N(µ_W, Σ_W),
b ∼ N(µ_b, Σ_b),
l = s(W l_{−1} + b)   (18)

is equivalent to sampling l as:

l ∼ s(N(µ_W l_{−1} + µ_b, (I ⊗ l_{−1})ᵀ Σ_W (I ⊗ l_{−1}) + Σ_b)),   (19)

where ⊗ denotes a Kronecker product.

The basic Bayesian regression architecture depicted in Fig. 4a is more common in practice. The alternative depicted in Fig. 4b is sometimes used, as it allows compressing the number of variational parameters when using Variational Inference [42] (see Section V).

C. Setting the priors

Setting the prior of a Deep Neural Network is often not an intuitive task. The main problem is that it is not really explicit how models with a very large number of parameters and a nontrivial architecture like ANNs will generalize for a given parametrization [43]. We first present the common practice, discuss the associated issues related to the statistical unidentifiability of ANNs, and then show the link between the prior for BNNs and regularization for the point-estimate algorithms. Finally, we present a method to build the prior from high level knowledge.

1) A good default prior: For basic architectures such as Bayesian regression (Fig. 4a), a standard procedure is to use a normal prior with a zero mean 0 and a diagonal covariance σI on the coefficients of the network:

p(θ) = N(0, σI).   (20)

This approach is equivalent to a weighted ℓ2 regularization (with weights 1/σ) when training a point estimate network, as will be demonstrated in Section IV-C3. The documentation of the probabilistic programming language Stan [44] provides some examples on how to choose σ knowing the expected scale of the considered parameters [45].

Yet, while this approach is often used in practice, there is no theoretical argument that makes it better than any other formulation [46]. The normal law is preferred due to its mathematical properties and the simple formulation of its log, which is used in most of the learning algorithms.

2) Addressing unidentifiability in Bayesian neural networks: One of the main problems with Bayesian deep learning is that deep neural networks are over-parametrized models, i.e., they have many equivalent parametrizations [47]. This is an example of statistical unidentifiability and can lead to complex multimodal posteriors that are hard to sample and approximate when training a BNN. There are two solutions to deal with this issue: (1) changing the functional model parametrization, or (2) constraining the support of the prior to remove unidentifiability.

The two most common classes of non-uniqueness in ANNs that one might want to address are weight-space symmetry and scaling symmetry [48]. Neither is a concern for point estimate neural networks, but they might be for BNNs. Weight-space symmetry implies that one can build an equivalent parametrization of an ANN with at least one hidden layer by permuting two rows in the weights W_i and their corresponding biases b_i of one of the hidden layers, together with the corresponding columns in the following layer's weight matrix W_{i+1}. This means that as the number of hidden layers and the number of units in the hidden layers grow, the number of equivalent representations, which would roughly correspond to the modes in the posterior distribution, grows factorially. A mitigation strategy is to enforce the bias vector in each layer to be sorted in ascending or descending order. However, the practical effect of doing so may be to degrade optimisation: weight-space symmetry may implicitly support the exploration of the parameter space during the early stages of the optimisation.

Scaling symmetry is an unidentifiability problem arising when using non-linearities with the property s(αx) = αs(x), which is the case for ReLU and Leaky-ReLU, two popular non-linearities in modern machine learning. In this case, assigning the weights W_l, W_{l+1} to two consecutive layers l and l+1 becomes strictly equivalent to assigning αW_l, (1/α)W_{l+1}. This can reduce the convergence speed for point estimate neural networks, a problem that is addressed in practice with various activation normalization techniques [49]. BNNs are slightly more complex, as the scaling symmetry influences the posterior shape, making it harder to approximate. Givens transformations (also called Givens rotations) have been proposed as a means to constrain the norm of the hidden layers [48] and address the scaling symmetry issue. In practice, using a Gaussian prior already reduces the scaling symmetry problem, as it favors weights with the same Frobenius norm on each layer. A soft version of the activation normalization can also be implemented by using a consistency condition, see Section IV-C4. The additional complexity associated with sampling the network parameters in a constrained space to perfectly remove the scaling symmetry is prohibitive from a computational point of view. We provide, in the supplementary material, additional discussion about this issue using the practical example "Paperfold" (Practical example III).

3) The link between regularization and priors: The usual learning procedure for a point-estimate neural network is to find the set of parameters θ that minimize a loss function built using the training set D:

θ̂ = arg min_θ  loss_{D_x,D_y}(θ).   (21)

Assuming that the loss is defined as minus the log-likelihood function, up to an additive constant, the problem can be rewritten as:

θ̂ = arg max_θ  p(D_y|D_x,θ),   (22)

which would be the first half of the model according to the Bayesian paradigm. Now, assume that we also have a prior for θ, and we want to find the most likely point estimate from the posterior. The problem then becomes:

θ̂ = arg max_θ  p(D_y|D_x,θ) p(θ).   (23)

Next, one would go back to a log-likelihood formulation:

θ̂ = arg min_θ  loss_{D_x,D_y}(θ) + reg(θ),   (24)

which is easier to optimize. This formulation is how regularization is usually applied in machine learning and in many other fields. Another argument, less formal, is that regularization acts as a soft constraint on the search space, in a manner that is similar to what a prior does for a posterior.
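As a concrete instance of this correspondence, take the Gaussian prior of Equation (20). Its negative log density is, up to an additive constant, a weighted squared ℓ2 norm, so the MAP problem of Equation (23) reduces to Equation (24) with a weight decay term:

−log p(θ) = −log N(θ; 0, σI) = (1/(2σ)) ‖θ‖²_2 + const,   so   reg(θ) = (1/(2σ)) ‖θ‖²_2,

i.e., the weighted ℓ2 regularization (with weights proportional to 1/σ) announced in Section IV-C1.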

4) Prior with a consistency condition: Regularization can also be implemented with a consistency condition C(θ, x), which is a function used to measure how well the model respects some hypothesis given a parametrization θ and an input x. For example, C can be set to favor sparse or regular predictions, to encourage monotonicity of predictions with respect to some input variables (e.g., the probability of getting the flu increases with age), or to favor decision boundaries in low density regions when using semi-supervised learning, see Section IV-D1. C can be seen as the relative log-likelihood of a prediction given the input x and parameter set θ, and thus can be included in the prior. To this end, C should be averaged over all possible inputs:

C(θ) = ∫_x C(θ, x) p(x) dx.   (25)

In practice, as p(x) is unknown, C(θ) is approximated from the features in the training set:

C(θ) ≈ (1/|D_x|) ∑_{x∈D_x} C(θ, x).   (26)

We can now write a function proportional to the prior with the consistency condition included:

p(θ|D_x) ∝ p(θ) exp( −(1/|D_x|) ∑_{x∈D_x} C(θ, x) ),   (27)

where p(θ) is the prior without the consistency condition.
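The sketch below shows one way Equations (26) and (27) could be evaluated in practice. The consistency condition used here, the squared change of the prediction under a small perturbation of the input, as well as the model phi and the base log-prior, are illustrative assumptions:

import numpy as np

rng = np.random.default_rng(0)

def consistency(theta, x, phi, eps=0.01):
    # C(theta, x): penalize predictions that change a lot under a small perturbation of x
    x_perturbed = x + eps * rng.normal(size=x.shape)
    return float(np.sum((phi(x_perturbed, theta) - phi(x, theta)) ** 2))

def log_prior_with_consistency(theta, D_x, phi, log_prior):
    C_bar = np.mean([consistency(theta, x, phi) for x in D_x])   # Equation (26)
    return log_prior(theta) - C_bar                              # Equation (27), up to a constant

# Illustrative usage with a dummy linear model and a standard normal prior
phi = lambda x, theta: theta @ x
log_prior = lambda theta: -0.5 * float(theta @ theta)
theta, D_x = rng.normal(size=3), rng.normal(size=(10, 3))
print(log_prior_with_consistency(theta, D_x, phi, log_prior))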

D. Degree of supervision and alternative forms of prior knowledge

The architecture presented in Section IV-B focuses mainly on the use of BNNs in a supervised learning setting. However, in real world applications, getting ground truth labels can be expensive. Thus, new learning strategies should be adopted [50]. We will now present how to adapt BNNs for different degrees of supervision. While doing so, we will also demonstrate how PGMs in general and BBNs in particular are useful to design or interpret learning strategies. In particular, the formulation of the Bayesian posterior, which is derived from the different PGMs presented in Fig. 5, can also be used for a point estimate neural network to obtain a suitable loss function to search for a MAP estimator for the parameters (Section IV-C3). We also provide a practical example in the supplementary material to illustrate how such strategies can be implemented for an actual BNN (Practical example II).

1) Noisy labels and semi-supervised learning: The inputs in the training sets can be uncertain, either because the labels D_y are corrupted by noise [51], or because labels are missing for a number of points. In the case of noisy labels, one should extend the BBN to add a new variable for the noisy labels, conditioned on the true labels y (Fig. 5a). It is common, as the noise level itself is often unknown, to add a variable σ to characterize the noise. Frenay et al. [52] proposed a taxonomy of the different approaches to integrate σ in a PGM (Fig. 6). They distinguish three cases: noise completely at random (NCAR); noise at random (NAR); and noise not at random (NNAR) models. In the NCAR model, the noise σ is independent of any other variable, i.e., it is homoscedastic. In the NAR model, σ is dependent on the true label y but remains independent of the features, while the NNAR model also accounts for the influence of the features x, e.g., if the level of noise in an image increases the chances that the image has been mislabeled. Both the NAR and NNAR models represent heteroscedastic noise, i.e., the antonym of homoscedastic noise.

These noise aware PGMs are slightly more complex than a purely supervised BNN as presented in Section IV-B. However, they can be treated in a similar fashion, by deriving the formula for the posterior from the PGM (Eq. 12) and applying the chosen inference algorithm.


Fig. 5: Different examples of PGMs to adapt the learning strategy for a given BNN (with stochastic weights): (a) noisy labels, (b) semi-supervised learning, (c) data augmentation, (d) meta-learning, (e) self-supervised learning.

For the NNAR model, the most general of the three, the posterior becomes:

p(y, σ, θ|D) ∝ p(D_y|y,σ) p(σ|D_x,y) p(y|D_x,θ) p(θ).   (28)

During the prediction phase, y and σ can just be disregarded for each tuple (y, σ, θ) sampled from the posterior.

In the case of partially labelled data (Fig. 5b), also known as semi-supervised learning, the dataset D is split into labelled (L) and unlabelled (U) examples. In theory, this PGM can be considered equivalent to the one used in the supervised learning case depicted in Fig. 4a, but in this case the unobserved data U would bring no information. The additional information of unlabelled data comes from the prior and only the prior. In traditional machine learning, the most common approaches to implement semi-supervised learning are either to use some kind of data-driven regularization [53] or to rely on pseudo labels [54]. In that respect, Bayesian learning is no different.

Data-driven regularization implies modifying the prior assumptions, and thus the stochastic model, to be able to extract meaningful information from the unlabeled dataset U. There are two common ways to approach this process. The first one is to condition the prior distribution of the model parameters on the unlabeled examples to favor certain properties of the model, such as a decision boundary in a low density region, using a distribution p(θ|U) instead of p(θ). This implies writing the stochastic model as:

p(θ|D) ∝ p(L_y|L_x,θ) p(θ|U),   (29)

where p(θ|U) is a prior with a consistency condition, as defined in Equation (27). The consistency condition usually expresses the fact that points that are close together should lead to the same prediction, e.g., graph Laplacian norm regularization [55].

Fig. 6: BBNs corresponding to (a) the noise completely at random (NCAR), (b) noise at random (NAR) and (c) noise not at random (NNAR) models from [52].


The second way is to assume some kind of dependency across the observed and unobserved labels in the dataset. This type of semi-supervised Bayesian learning relies either on undirected PGMs [56] to build the prior, or at least does not assume independence between different training pairs (x, y) [57]. To keep things simple, we represent this fact by dropping the plate around y in Fig. 5b. The posterior is written in the usual way (Equation (4)). The main difference is that p(D_y|D_x,θ) is chosen to enforce some kind of consistency across the dataset. For example, one can assume that two close points are likely to have similar labels y, with a level of uncertainty that increases with the distance.

Both approaches have a similar effect, and the choice of one over the other will depend on the mathematical formulation favored to build the model.

The semi-supervised learning strategy can also be reformulated as having a weak predictor capable of generating some pseudo labels, sometimes with some confidence level. Many of the algorithms that are used for semi-supervised learning use an initial version of the model trained with the labeled examples [58] to generate the pseudo labels and train the final model with those labels. This is problematic for BNNs. When the prediction uncertainty is accounted for, it becomes impossible to reduce the uncertainty associated with the unlabelled data, at least not without an additional hypothesis in the prior. Even if less common in practice, using a simpler model [59] to get the pseudo labels can help mitigate that problem.

2) Data augmentation: Data augmentation in machine learning is a strategy used to significantly increase the diversity of the data available to train deep models, without actually collecting new data. It relies on transformations that act on the input but have no or very low probability of changing the label (or at least do so in a predictable way) to generate an augmented dataset A(D). Examples of such transformations include applying rotations, flipping or adding noise in the case of images. Data augmentation is now at the forefront of state-of-the-art techniques in computer vision [54] and increasingly in natural language processing [60].

The augmented dataset A(D) could contain an infinite set of possible variants of the initial dataset, e.g., when using continuous transforms such as rotations or additional noise. To achieve this in practice, A(D) is sampled on the fly during training, rather than generating in advance all possible augmentations in the training set. This process is straightforward when training point estimate neural networks, but there are some subtleties when applying it with Bayesian statistics. The main concern is that the posterior of interest is p(θ|D, Aug), where Aug represents some knowledge about the augmentation process, not p(θ|A(D), D), since A(D) is not observed. From a Bayesian perspective, the additional information is brought by the knowledge of the augmentation process rather than by some additional data. Stated otherwise, the data augmentation is a part of the stochastic model (Fig. 5c).

The idea is that if one is given data D, then one could also have been given data D′, where each element in D is replaced by an augmentation. Then, D′ is a different perspective of the data D. To model this, we have the augmentation distribution p(x′|x, Aug) that augments the observed data using the augmentation model Aug to generate (probabilistically) x′, which is data in the vicinity of x (Fig. 5c). x′ can then be marginalized out to simplify the stochastic model. The posterior is given by:

p(θ|x, y, Aug) ∝ ( ∫_{x′} p(y|x′,θ) p(x′|x, Aug) dx′ ) p(θ).   (30)

This is a probabilistic counterpart to vicinal risk [61]. The integral in Equation (30) can be approximated using Monte-Carlo integration, by sampling a small set of augmentations A_x according to p(x′|x, Aug) and averaging:

p(y|x,θ, Aug) ≈ (1/|A_x|) ∑_{x′∈A_x} p(y|x′,θ).   (31)

When training using a Monte-Carlo-based estimate of the loss, A_x can contain as few as a single element, as long as it is re-sampled for each optimization iteration. This greatly simplifies the evaluation of Equation (31).
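A minimal sketch of the Monte-Carlo estimate in Equation (31), in the single-augmentation regime described above. Here augment stands for a sampler of p(x′|x, Aug) and likelihood for p(y|x′, θ); both, as well as the Gaussian jitter and linear model of the usage lines, are hypothetical placeholders:

import numpy as np

def augmented_likelihood(x, y, theta, likelihood, augment, num_aug=1, rng=None):
    # Equation (31): p(y|x, theta, Aug) ~= average of p(y|x', theta) over sampled augmentations x'
    rng = np.random.default_rng() if rng is None else rng
    return float(np.mean([likelihood(y, augment(x, rng), theta) for _ in range(num_aug)]))

# Illustrative usage: Gaussian jitter as augmentation, Gaussian likelihood around a linear model
rng = np.random.default_rng(0)
augment = lambda x, rng: x + 0.05 * rng.normal(size=x.shape)
likelihood = lambda y, x, theta: float(np.exp(-0.5 * (y - theta @ x) ** 2))
theta, x, y = rng.normal(size=3), rng.normal(size=3), 0.3
p = augmented_likelihood(x, y, theta, likelihood, augment, num_aug=1, rng=rng)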

An extension of this approach works in the context of semi-supervised learning, where one adds a training cost to encourage consistency of predictions under augmentation [62, 54], and where unlabeled data is used to build the samples for the consistency term. Note that this does not add labelling to the unlabeled examples, but only adds a term to encourage consistency between the labels for an unlabeled data point and its augmentation.

3) Meta learning, transfer learning, and self-supervised learning: Meta learning [63], in the broadest sense, is the use of machine learning algorithms to assist in the training and optimization of other machine learning models. The meta-knowledge acquired by meta learning can be distinguished from standard knowledge in the sense that it is applicable to a set of related tasks rather than a single task. Transfer learning designates methods that reuse some intermediate knowledge acquired on a given problem to address a different problem. In deep learning, it is used mostly for domain adaptation, when labelled data are abundant in a domain that is in some way similar to the domain of interest, but scarce in the domain of interest [64]. Alternatively, pre-trained models [65] can be used to study large architectures whose complete training would be very computationally expensive. Self-supervised learning is a learning strategy where the data itself provides the labels [66]. Since the labels directly obtainable from the data do not match the task of interest, the problem is approached by learning a pretext (or proxy) task in addition to the task of interest. The use of self-supervision is now generally regarded as an essential step in some areas. For instance, in natural language processing, most state-of-the-art methods use these pre-trained models [65].

A common approach for meta learning in Bayesian statistics is to recast the problem as hierarchical Bayes [67], with the prior p(θ_t|ξ) for each task t conditioned on a new global variable ξ (Fig. 5d). ξ can represent continuous meta-parameters or discrete information about the structure of the BNN, i.e., to learn probable functional models, or the underlying subgraph of the PGM, i.e., to learn probable stochastic models. Multiple levels can be added to organise the tasks in a more complex hierarchy if needed. Here, we present only the case with one level, since the generalization is straightforward. With this broad Bayesian understanding of meta learning, both transfer learning and self-supervised learning are special cases of meta learning. The general posterior becomes:

p(θ, ξ|D) ∝ ( ∏_{t∈T} p(D_y^t|D_x^t, θ_t) p(θ_t|ξ) ) p(ξ).   (32)

In practice, the problem is often approached with empirical Bayes (Section V-D), and only a point estimate ξ̂ is considered for the global variable, ideally the MAP estimate obtained by marginalizing p(θ, ξ|D) over θ and selecting the most likely point ξ, but this is not always the case.

In transfer learning, the usual approach would be to set ξ = θ_m, with θ_m the coefficients of the main task. The new prior can then be obtained from ξ, for example:

p(θ|ξ) = N((τ(ξ), 0), σI),   (33)

where τ is a selection of the parameters to transfer and σ is a parameter to tune manually. Unselected parameters are assigned a new prior, with a mean of 0 by convention. If a BNN has been trained for the main task, then σ can be estimated from the previous posterior, with an increment to account for the additional uncertainty caused by the domain shift.

Self-supervised learning can be implemented in two steps: first learning the pretext task, and then using transfer learning. This can be considered overly complex, but might be required if the pretext task has a high computational complexity (e.g., BERT models in natural language processing [65]). Recent contributions have shown that jointly learning the pretext task and the final task (Fig. 5e) can improve the results obtained in self-supervised learning [68]. This approach, which is closer to hierarchical Bayes, also allows setting the prior a single time while still retaining the benefits of self-supervised learning.

V. BAYESIAN INFERENCE ALGORITHMS

A priori, one does not have to undergo a learning phase when using a BNN; one can just sample the posterior and do model averaging, see Algorithm 1. However, sampling the posterior is not easy in the general case. While the conditional probability P(D|H) of the data and the probability P(H) of the model are given by the stochastic model, the integral for the evidence term ∫_H P(D|H′)P(H′) dH′ might be excessively difficult to compute. For non-trivial models, and even if the evidence has been computed, directly sampling the posterior is really hard due to the high dimensionality of the sampling space. Instead of using traditional methods, e.g., inversion sampling or rejection sampling, to sample the posterior, dedicated algorithms are used. The most popular ones are Markov Chain Monte Carlo (MCMC) methods, a family of algorithms that exactly sample the posterior, and Variational Inference, a method for learning an approximation of the posterior (Fig. 1).

This section reviews these methods. First, we introduce MCMC and Variational Inference as they are used in traditional Bayesian statistics. Then, we review different simplifications or approximations that have been proposed for deep learning. We also provide a practical example in the supplementary material, which compares different learning strategies (Practical example III).

A. Markov Chain Monte Carlo (MCMC)

The idea behind MCMC methods is to construct a Markov chain, a sequence of random samples S_i, which probabilistically depend only on the previous sample S_{i−1}, such that the S_i are distributed following a desired distribution. Unlike standard sampling methods such as rejection or inversion sampling, most MCMC algorithms require an initial burn-in time before the Markov chain converges to the desired distribution. Also, the successive S_i's might be autocorrelated. This means that a large set of samples Θ has to be generated and subsampled to get approximately independent samples from the underlying distribution. The final collection of samples Θ has to be stored after training, which is expensive for most deep learning models.

Despite their inherent drawbacks, MCMC methods can be considered among the best available and the most popular solutions for sampling from exact posterior distributions in Bayesian statistics [69]. However, not all MCMC algorithms are relevant for Bayesian deep learning. Gibbs sampling [70], for example, is very popular in general statistics and unsupervised machine learning, but is very ill-suited for BNNs. The most relevant MCMC methods for BNNs are the Metropolis-Hastings algorithms [71]. The property which makes Metropolis-Hastings algorithms popular is that they do not require knowledge of the exact probability distribution P(x) to sample from. Instead, a function f(x) that is proportional to that distribution is sufficient. This is the case for a Bayesian posterior distribution, which is usually quite easy to compute except for the evidence term.

The Metropolis-Hastings algorithm, see Algorithm 4, starts with a random initial guess, x_0, and then samples a new candidate point x′ around the previous x, using a proposal distribution Q(x′|x). If x′ is more likely than x according to the target distribution, it is accepted. If it is less likely, it is accepted with a certain probability or rejected otherwise.

The acceptance probability p of Algorithm 4 can be shown to yield the most accepting of all reversible rejection MCMC samplers, skipping a proposed sample fewer times on average.

Algorithm 4 Metropolis-Hastings
Draw x_0 ∼ Initial
n = 0
while n < N do
    Draw x′ ∼ Q(x′|x_n)
    p = min(1, Q(x_n|x′) f(x′) / (Q(x′|x_n) f(x_n)))
    Draw k ∼ Bernoulli(p)
    if k then
        x_{n+1} = x′
        n = n + 1
    end if
end while

Furthermore, the probability p can be simplified if Q is chosen to be symmetric, i.e., Q(x′|x_n) = Q(x_n|x′). The formula for the acceptance rate then becomes:

p = min(1, f(x′) / f(x_n)).   (34)

In this situation, the algorithm is simply called the Metropolis method. Common choices for Q can be a normal distribution Q(x′|x_n) = N(x_n, σ²), or a uniform distribution Q(x′|x_n) = U(x_n − ε, x_n + ε), centered around the previous sample. To deal with non-symmetric proposal distributions, e.g., to accommodate a constraint in the model like a bounded domain, one has to take into account the correction term imposed by the full Metropolis-Hastings algorithm.

The spread of Q(x′|x_n) has to be tweaked. If it is too large, the rejection rate will be too high. If it is too small, the samples will be more autocorrelated. There is no general method to tweak those parameters. Yet, a clever strategy to propose the new sample x′ can reduce their impact. This is why the Hamiltonian Monte-Carlo method has been proposed.

The Hamiltonian Monte-Carlo (HMC) algorithm [72] is another example of a Metropolis-Hastings algorithm for continuous distributions, designed with a clever scheme to draw a new proposal x′ so that as few samples as possible are rejected and there is as little correlation as possible between samples. Moreover, the burn-in time is extremely short.

Most software packages for Bayesian statistics implement the No-U-Turn Sampler (NUTS for short) [73], which is an improvement over the classic HMC algorithm allowing the hyperparameters of the algorithm to be tweaked automatically instead of having to be set manually.

B. Variational inference

MCMC algorithms are the go-to tools for sampling from the exact posterior. However, their lack of scalability has made them less popular for BNNs, given the size of the models under consideration. Variational Inference [74], which scales better than MCMC algorithms, has gained a lot of popularity. Variational Inference is not an exact method. Rather than allowing sampling from the exact posterior, the idea is to have a distribution q_φ(H), called the variational distribution, parametrized by a set of parameters φ. The values of the parameters φ are then learned such that the variational distribution q_φ(H) is as close as possible to the exact posterior P(H|D). The measure of closeness most readily used is the KL-divergence:

D_KL(q_φ‖P) = ∫_H q_φ(H′) log( q_φ(H′) / P(H′|D) ) dH′.   (35)


Fig. 7: Typical training curve for Bayes by backprop (−ELBO/|D| on a log scale against epochs, with the per-iteration loss and its average).

There is an apparent problem here, which is that to compute D_KL(q_φ‖P), one needs to compute P(H|D) anyway. To overcome this, a different, easily derived formula called the Evidence Lower Bound (ELBO) serves as a loss:

∫_H q_φ(H′) log( P(H′, D) / q_φ(H′) ) dH′ = log(P(D)) − D_KL(q_φ‖P).   (36)

Since log(P(D)) does not depend on the variational parameters φ, minimizing D_KL(q_φ‖P) is equivalent to maximizing the ELBO.

The most popular method to optimize the ELBO is Stochastic Variational Inference (SVI) [75], which is in fact the stochastic gradient descent method applied to Variational Inference. This allows the algorithm to scale to the large datasets that are encountered in modern machine learning, since the ELBO can be computed on a single mini-batch at each iteration.

Convergence, when learning the posterior with SVI, will be slow compared to the usual gradient descent. Moreover, most implementations use a small number of samples to evaluate the ELBO, often just one, before taking a gradient step. In other words, the ELBO estimate will be noisy at each iteration.

In traditional machine learning and statistics, q_φ(H) is mostly constructed from distributions in the exponential family, e.g., multivariate normals [76], Gammas and Dirichlets. The ELBO can then be dramatically simplified into components [77], leading to a generalization of the well known expectation-maximization algorithm. To account for correlations between the large number of parameters, certain approximations are made. For instance, block diagonal [78] or low rank plus diagonal [79] covariance matrices can be used.

C. Bayes by backpropagation

Variational inference offers a good mathematical tool for Bayesian inference, but it needs to be adapted to deep learning. The main problem is that stochasticity stops backpropagation from functioning at the internal nodes of a network [41]. Different solutions have been proposed to mitigate this problem, including probabilistic backpropagation [80] and Bayes-by-backprop [81], the latter of which may look more familiar to deep learning practitioners. We will thus focus on it in this tutorial. Bayes-by-backprop is indeed a practical implementation of SVI combined with a reparametrization trick [82] to ensure backpropagation works as usual.

The idea is to use a random variable ε ∼ q(ε) as a non-variational source of noise. θ is not sampled directly but obtained via a deterministic transformation t(ε, φ) such that θ = t(ε, φ) follows q_φ(θ). ε is sampled and thus changes at each iteration, but can still be considered as a constant with regard to the other variables. All other transformations being non-stochastic, backpropagation works as usual for the variational parameters φ. The general formula for the ELBO becomes:

∫_ε q_φ(t(ε,φ)) log( P(t(ε,φ), D) / q_φ(t(ε,φ)) ) |Det(∇_ε t(ε,φ))| dε.   (37)

This is tedious to work with. Instead, to estimate the gradient of the ELBO, Blundell et al. [81] proposed to use the fact that if q_φ(θ)dθ = q(ε)dε, then for a differentiable function f(θ, φ) we have:

∂/∂φ ∫_θ q_φ(θ′) f(θ′, φ) dθ′ = ∫_ε q(ε) ( (∂f(θ,φ)/∂θ)(∂θ/∂φ) + ∂f(θ,φ)/∂φ ) dε.   (38)

A proof is provided in [81]. We also provide in Appendix A an alternative proof, which gives more details on when we can assume q_φ(θ)dθ = q(ε)dε. A sufficient condition is for t(ε, φ) to be invertible with respect to ε and for the distributions q(ε) and q_φ(θ) to not be degenerate.

For the case where the weights are treated as the stochastic variables, and thus the hypothesis H, the training loop can then be implemented as described in Algorithm 5.

Algorithm 5 Bayes-by-backprop
φ = φ_0
for i = 0 to N do
    Draw ε ∼ q(ε)
    θ = t(ε, φ)
    f(θ, φ) = log(q_φ(θ)) − log(p(D_y|D_x,θ) p(θ))
    Δ_φ f = backprop_φ(f)
    φ = φ − α Δ_φ f
end for

The objective function f corresponds to an estimate of the ELBO from a single sample. This means that the gradient estimate will be noisy. The convergence graph will also be much noisier than in the case of classic backpropagation (Fig. 7). To get a better estimate of the convergence, one can average the loss over multiple epochs.

Algorithm 5 is very similar to the classical training loop for point-estimate deep learning. This means that most techniques used for optimization in deep learning are straightforward to use for Bayes-by-backprop, e.g., using the Adam optimizer [83] instead of stochastic gradient descent.

Note also that, while Bayes-by-backprop has been presented here for BNNs with stochastic weights, adapting it to BNNs with stochastic activations is straightforward. In that case, the activations l represent the hypothesis H and the weights θ are part of the variational parameters φ.
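Below is a minimal Python sketch of Algorithm 5 on a toy Bayesian linear regression, assuming PyTorch; the diagonal Gaussian variational posterior, the standard normal prior and the noise level 0.1 are illustrative choices, not part of the original algorithm specification.

import torch
from torch.distributions import Normal

X = torch.randn(100, 5)
y = X @ torch.tensor([1., -2., 0.5, 0., 3.]) + 0.1 * torch.randn(100)

mu = torch.zeros(5, requires_grad=True)        # variational parameters phi = (mu, rho)
rho = torch.full((5,), -3.0, requires_grad=True)
opt = torch.optim.SGD([mu, rho], lr=1e-2)

for step in range(1000):
    opt.zero_grad()
    sigma = torch.nn.functional.softplus(rho)
    eps = torch.randn(5)                       # draw eps ~ q(eps)
    theta = mu + sigma * eps                   # theta = t(eps, phi)
    log_q = Normal(mu, sigma).log_prob(theta).sum()          # log q_phi(theta)
    log_prior = Normal(0., 1.).log_prob(theta).sum()         # log p(theta)
    log_lik = Normal(X @ theta, 0.1).log_prob(y).sum()       # log p(D_y | D_x, theta)
    f = log_q - log_lik - log_prior            # single-sample estimate of the negative ELBO
    f.backward()                               # Delta_phi f = backprop_phi(f)
    opt.step()                                 # phi = phi - alpha * Delta_phi f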

D. Learning the prior

Learning the prior first and the posterior afterwards is possible. This is meaningful if some aspects of the prior are set using prior knowledge, while the remaining free parameters are learned before obtaining the posterior. In standard Bayesian statistics, this is known as empirical Bayes, and it is usually a valid approximation when the dimension of the prior parameters being learned is far smaller than the dimension of the model parameters.

Given a parametrized prior distribution pξ(H), maximizing the likelihood of the data is a good method to learn the parameters ξ:

\xi = \arg\max_{\xi} P(D|\xi) = \arg\max_{\xi} \int_{H} p_\xi(D|H')\, p_\xi(H')\, dH'. \quad (39)

In general, directly finding ξ is an intractable problem. However, when using Variational Inference, the ELBO is the log-likelihood of the data minus the KL-divergence between qφ(θ) and the prior (Eq. 36):

\log(P(D|\xi)) = \mathrm{ELBO} + D_{KL}(q_\phi \| P). \quad (40)

This property means that maximizing the ELBO, now a function of both ξ and φ, is equivalent to maximizing a lower bound on the log-likelihood of the data, a bound which is tighter when qφ is from a general family better able to fit P. The Bayes-by-backprop algorithm presented in Section V-C only needs to be slightly modified, see Algorithm 6.

Algorithm 6 Bayes-by-backprop with parametric prior
ξ = ξ0
φ = φ0
for i = 0 to N do
    Draw ε ∼ q(ε)
    θ = t(ε, φ)
    f(θ, φ, ξ) = log(qφ(θ)) − log(pξ(Dy|Dx, θ)pξ(θ))
    Δξf = backpropξ(f)
    Δφf = backpropφ(f)
    ξ = ξ − αξΔξf
    φ = φ − αφΔφf
end for

E. Inference algorithms adapted for deep learning

So far, we have presented the fundamental theory to design and train BNNs. However, the aforementioned methods are still not easily applicable to most large-scale architectures used today in deep learning. Recent research has also shown that being only approximately Bayesian is enough to get a correctly calibrated model with uncertainty estimates [22]. This section presents how inference algorithms were adapted for deep learning, resulting in more efficient methods. Specific inference methods can still be classified either as MCMC algorithms, i.e., they generate a sequence of samples from the posterior, or as a form of Variational Inference, i.e., they learn the parameters of an intermediate distribution to approximate the posterior. All methods are summarized in Fig. 8.

1) Bayes via Dropout: Dropout was initially proposed as a regularization method [84]. It works by applying multiplicative noise to the target layer. The most commonly used type of noise is Bernoulli noise, but other types, such as the Gaussian noise of Gaussian Dropout [84], might be used instead.

Dropout is usually turned off at evaluation time, but leaving it on allows one to obtain a distribution over the output predictions [85, 86]. It turns out that this procedure, called Monte Carlo Dropout, is in fact Variational Inference with a variational distribution defined for each weight matrix as:

z_{i,j} \sim \mathrm{Bernoulli}(p_i), \quad W_i = M_i \cdot \mathrm{diag}(z_i), \quad (41)

with z_i being the random activation coefficients and M_i the weight matrix before dropout is applied. p_i is the activation probability for layer i and can be learned or set manually.

When used to train a BNN, dropout should not be seen as a regularization method, as it is part of the variational posterior, not the prior, meaning it should be coupled with a different kind of regularization [87], e.g., ℓ2 weight penalization. The equivalence between the objective function used for training with dropout and ℓ2 weight regularization:

L_{dropout} = \frac{1}{N} \sum_{D} f(\hat{y}, y) + \lambda \sum_{i} \theta_i^2, \quad (42)

and the ELBO, assuming a normal prior on the weights and the distribution presented in Equation 41 as variational posterior, has been demonstrated [85]. The argument is similar to the one presented in Section IV-C3.

MC-Dropout is a very convenient technique for Bayesian deep learning. It is straightforward to implement and requires little additional knowledge or modelling effort compared to traditional methods. It often leads to a faster training phase compared to other Variational Inference approaches. If a model has been trained with dropout layers, which are quite widespread in today's deep learning architectures, and with an additional form of regularization acting as the prior, it can be used as a BNN without any need to be retrained.

On the other hand, MC-Dropout might lack some expressiveness and may not fully capture the uncertainty associated with the model predictions [88]. It also lacks flexibility compared to other Bayesian methods for online or active learning.
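As an illustration, a minimal Python sketch of MC-Dropout at evaluation time is given below, assuming a PyTorch model that was trained with dropout layers and with weight decay acting as the prior; the function name is hypothetical.

import torch

def mc_dropout_predict(model, x, n_samples=50):
    model.train()   # keeps dropout active; batch-norm layers, if any, should be switched back to eval mode
    with torch.no_grad():
        samples = torch.stack([model(x) for _ in range(n_samples)])
    return samples.mean(0), samples.std(0)   # predictive mean and spread over dropout masks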

2) Bayes via stochastic gradient descent: Stochastic Gradient Descent (SGD) and related algorithms are at the core of modern machine learning. The initial goal of SGD is to provide an algorithm that converges to a point estimate corresponding to an optimal solution while having only noisy estimates of the gradient of the objective function, especially when the training data have to be split into mini-batches. The parameter update rule at time t can be written as:

\Delta\theta_t = \frac{\varepsilon_t}{2}\left( \frac{N}{n} \nabla \log(p(D_{t,y}|D_{t,x},\theta_t)) + \nabla \log(p(\theta_t)) \right), \quad (43)

where D_t is a mini-batch subsampled at time t from the complete dataset D, ε_t is the learning rate at time t, N is the size of the whole dataset and n the size of the mini-batch.

SGD, or related optimization algorithms like Adam [83], can be reinterpreted as a Markov Chain algorithm [89]. Usually, the hyperparameters of the algorithm are tweaked to ensure that the chain converges to a Dirac distribution, the final point estimate, in particular by reducing ε_t towards zero while ensuring that \sum_{t=0}^{\infty} \varepsilon_t = \infty. However, if the learning rate is instead reduced toward a strictly positive value, the underlying Markov Chain will converge to a stationary distribution. If a Bayesian prior is accounted for in the objective function, then this stationary distribution can be an approximation of the corresponding posterior.


Fig. 8: Summary of the different inference approaches to train a BNN, with their benefits, usecases and limitations (the figure also indicates that several of these approaches can be combined).

Classic methods (HMC, NUTS) (V.A). Benefits: state-of-the-art samplers that limit autocorrelation between samples. Usecases: small and critical models. Limitations: do not scale well to the large models used in modern deep learning.

SGLD and derivatives (V.E.2.a). Benefits: provide a well-behaved Markov Chain when mini-batches are used during training. Usecases: models with larger datasets. Limitations: focus on a single mode of the posterior, and samples become more and more autocorrelated.

MCMC (V.A). Benefits: directly samples the posterior. Usecases: small and average models. Limitations: requires storing all the generated samples for inference.

Warm restarts (V.E.2.a). Benefits: help an MCMC method explore different modes of the posterior. Usecases: combined with SGLD or any other MCMC sampler. Limitations: requires a new burn-in sequence for each restart.

Variational inference (V.B). Benefits: approximates the posterior with an easy-to-sample parametric distribution. Usecases: large-scale models. Limitations: is an approximation.

Bayes by backprop (V.C). Benefits: fits any parametric distribution as posterior. Usecases: large-scale models. Limitations: noisy gradient descent.

Monte Carlo-Dropout (V.E.1). Benefits: can transform any architecture using dropout and regularization into a BNN. Usecases: dropout-based models. Limitations: might lack expressive power.

Laplace approximation (V.E.2.b). Benefits: uses information generated at training time to transform any maximum-a-posteriori model into a BNN. Usecases: unimodal large-scale models. Limitations: focuses on a single mode of the posterior.

Deep ensembles (V.E.2.b). Benefits: help focusing on different modes of the posterior, i.e., different ways the model can generalize. Usecases: multimodal models, and combined with other VI methods. Limitations: cannot detect single-mode uncertainty when used alone.

a) MCMC algorithms based on SGD dynamics: To approximately sample the posterior using the SGD algorithm, a specific MCMC method, called Stochastic Gradient Langevin Dynamics (SGLD) [90], has been developed, see Algorithm 7. Coupling SGD with Langevin dynamics leads to a slightly modified update step:

\Delta\theta_t = \frac{\varepsilon_t}{2}\left( \frac{N}{n} \nabla \log(p(D_{t,y}|D_{t,x},\theta_t)) + \nabla \log(p(\theta_t)) \right) + \eta_t, \quad \eta_t \sim \mathcal{N}(0, \varepsilon_t). \quad (44)

Welling et al. [90] showed that this method leads to a Markov Chain which samples the posterior if ε_t goes toward zero. However, in that case, the successive samples become more and more autocorrelated. To address this problem, the authors proposed to stop reducing ε_t at some point, thus making the samples only an approximation of the posterior. Still, SGLD offers better theoretical guarantees than other MCMC methods when the dataset is split into mini-batches. This makes the algorithm useful in Bayesian deep learning.

To favor the exploration of the posterior, one can use warm restarts of the algorithm [91], i.e., restarting the algorithm at a new random position θ0 and with a large learning rate ε0. This offers multiple benefits. The main one is to avoid the mode collapse problem [92]. In the case of a BNN, the true Bayesian posterior is usually a complex multimodal distribution, as multiple, and sometimes not equivalent, parametrizations θ of the network can fit the training set. Favoring exploration over precise reconstruction can help to get a better picture of those different modes.

Algorithm 7 SGLD
Draw θ0 ∼ Initial
for t = 0 to E do
    Select mini-batch Dt,y, Dt,x ⊂ D
    f(θt) = (N/n) log(p(Dt,y|Dt,x, θt)) + log(p(θt))
    Δθf = backpropθ(f)
    Draw ηt ∼ N(0, εt)
    θt+1 = θt − (εt/2 · Δθf + ηt)
end for

Fig. 9: Different techniques to sample the posterior (MCMC methods, Variational Inference, deep ensembles with SGD). MCMC algorithms sample the true posterior but successive samples might be correlated; Variational Inference uses a parametric distribution which can suffer from mode collapse; deep ensembles focus on the modes of the distribution.

Then, as parameters sampled from the same mode are likely to make the model generalize in a similar manner, using warm restarts will enable a much better estimate of the epistemic uncertainty when processing unseen data, even if this approach provides only a very rough approximation of the exact posterior.

Like other MCMC methods, this approach still suffers from a huge memory footprint. This is why some authors have proposed methods that operate more like traditional Variational Inference and less like an MCMC algorithm.

b) Variational Inference based on SGD dynamics: Instead of an MCMC algorithm, the SGD dynamics can be used as a Variational Inference method to learn a distribution via the Laplace approximation. The Laplace approximation fits a Gaussian posterior using the maximum a posteriori estimate \hat{\theta} as mean and the inverse of the Hessian H of the loss (assuming the loss is the negative log posterior) as covariance matrix:

p(\theta|D) \approx \mathcal{N}(\hat{\theta}, H^{-1}). \quad (45)

Computing H^{-1} is usually intractable for large neural network architectures. Thus, approximations are used, most of the time by analysing the variance of the gradient descent algorithm [78, 79, 93]. However, while those methods are able to capture the fine shape of one mode of the posterior, they cannot fit multiple modes.
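As a rough illustration of the idea (not of the specific schemes of [78, 79, 93]), the diagonal of H can be approximated by summed squared per-sample gradients (the empirical Fisher) around a MAP estimate. The Python sketch below assumes PyTorch and a hypothetical neg_log_posterior function.

import torch

def diagonal_laplace(neg_log_posterior, theta_map, dataset):
    # theta_map is the MAP estimate; it must require gradients for autograd to work.
    theta_map = theta_map.clone().requires_grad_(True)
    h_diag = torch.zeros_like(theta_map)
    for sample in dataset:
        loss = neg_log_posterior(theta_map, sample)   # per-sample negative log posterior
        g, = torch.autograd.grad(loss, theta_map)
        h_diag += g ** 2                              # empirical Fisher accumulation
    # Gaussian approximation of the posterior with a diagonal covariance matrix.
    return torch.distributions.Normal(theta_map.detach(), (1.0 / h_diag.clamp_min(1e-8)).sqrt())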


Algorithm 8 Deep ensembles
for i = 0 to R do
    Draw θi ∼ Initial
    εt = ε0
    for j = 0 to N do
        f(θi,j) = log(p(Dy|Dx, θi,j)) + log(p(θi,j))
        Δθf = backpropθ(f)
        θi = θi − αθΔθf
    end for
end for

Lakshminarayanan et al. [92] proposed to use warm restarts to get different point-estimate networks instead of fitting a parametric distribution. This method, called deep ensembles, see Fig. 9 and Algorithm 8, has been used in the past to perform model averaging. The main contribution of [92] is to show that it enables well-calibrated error estimates. While Lakshminarayanan et al. [92] claim that their method is non-Bayesian, it has been shown that their approach can be understood from a Bayesian point of view [17, 94]. When regularization is used, the different point estimates should correspond to modes of a Bayesian posterior. This can be interpreted as approximating the posterior with a distribution parametrized as multiple Dirac deltas. It can be seen as a form of Variational Inference, even if, for such a variational distribution, computing the ELBO in a sense that is meaningful for traditional optimization is impossible:

q_\phi(\theta) = \sum_{\theta_i \in \phi} \alpha_{\theta_i} \delta_{\theta_i}(\theta), \quad (46)

with the \alpha_{\theta_i} being positive constants that sum to 1.
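In code, deep ensembles reduce to training R independently initialized networks and averaging their predictive distributions at test time; the following Python sketch assumes PyTorch and hypothetical make_model and train_fn callables.

import torch

def train_ensemble(make_model, train_fn, R=5):
    # Each member starts from a fresh random initialization (the warm-restart view).
    return [train_fn(make_model()) for _ in range(R)]

def ensemble_predict(members, x):
    with torch.no_grad():
        probs = torch.stack([torch.softmax(m(x), dim=-1) for m in members])
    # Equal-weight mixture of point estimates, i.e., Eq. (46) with alpha_i = 1/R.
    return probs.mean(0)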

VI. SIMPLIFYING BAYESIAN NEURAL NETWORKS

After training a BNN, one has to use Monte Carlo sampling at evaluation time to estimate uncertainty. This is a major drawback of BNNs. For MCMC-based methods, storing a large set of parametrizations Θ is also impractical. This section presents mitigation strategies reported in the literature.

A. Bayesian inference on the (n-)last layer(s) only

The architecture of deep neural networks makes it quite redundant to account for uncertainty over a large number of successive layers. Instead, recent research aims to use only a few stochastic layers, usually positioned at the end of the network [95, 96], see Fig. 10. With only a few stochastic layers, training and evaluation can be drastically sped up while still obtaining meaningful results from a Bayesian perspective. This approach can be seen as learning a point-estimate transformation followed by a shallow BNN.

Fig. 10: BBN and BNN corresponding to a last-layer model.

Training a BNN with some non-stochastic layers is similar to learning the parameters of the prior, as presented in Section V-D. The weights of the non-Bayesian layers should be considered as both prior and variational-posterior parameters.

B. Bayesian teachers

Using a BNN as a teacher is an idea derived from an approach used in Bayesian modelling [97]. The approach is to train a non-stochastic ANN to predict the marginal probability p(y|x, D) using a BNN as a teacher [98]. This is related to the idea of knowledge distillation [99, 100], where possibly several pre-trained knowledge sources can be used to train a more functional system.

To do so, the KL-divergence between a parametric distribution q_\omega(y|x), where ω are the coefficients of the student network, and p(y|x, D) is minimized:

\omega = \arg\min_{\omega} D_{KL}\left(p(y|x,D) \,\|\, q_\omega(y|x)\right). \quad (47)

As this is intractable, Korattikara et al. [98] proposed a Monte Carlo approximation:

\omega = \arg\min_{\omega} -\frac{1}{|\Theta|} \sum_{\theta_i \in \Theta} \mathbb{E}_{p(y|x,\theta_i)} \left[ \log\left(q_\omega(y|x)\right) \right]. \quad (48)

Here, ω can be estimated using a training dataset D′ that contains only the features x. During training, the probability p(y|x, θ) of the labels is given by the teacher BNN. Thus, D′ can be much larger than D. This helps the student network retain the calibration and uncertainty of the teacher.

Menon et al. [100] observed that, for classification problems, simply using the class probabilities output by a BNN teacher rather than one-hot labels helps the student to retain the calibration and uncertainty of the teacher.

A Bayesian teacher can also be used to compress a large set of samples generated using MCMC [101]. Instead of storing Θ, a generative model G, a GAN in [101], is trained against the MCMC samples to generate the coefficients θ_i at evaluation time. This approach is similar to Variational Inference, with G representing a parametric distribution, but the proposed algorithm allows training a much more complex model than the distributions usually considered for Variational Inference.
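A minimal Python sketch of the student objective of Eq. (48) for a classification problem is shown below, assuming PyTorch; teacher_samples stands for a list of models built from sampled parametrizations θ_i, and the function name is hypothetical.

import torch
import torch.nn.functional as F

def distillation_loss(student, teacher_samples, x):
    with torch.no_grad():
        # Monte Carlo estimate of the teacher's marginal predictive distribution p(y|x, D).
        p_teacher = torch.stack([torch.softmax(t(x), dim=-1) for t in teacher_samples]).mean(0)
    log_q_student = F.log_softmax(student(x), dim=-1)
    # Cross-entropy against the teacher's soft labels, i.e., a Monte Carlo -E[log q_omega(y|x)].
    return -(p_teacher * log_q_student).sum(dim=-1).mean()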

VII. PERFORMANCE METRICS OF BAYESIAN NEURAL NETWORKS

One big challenge with BNNs is how to evaluate their performance. They do not directly output a point-estimate prediction ŷ but a conditional probability distribution p(y|x, D), from which an optimal estimate ŷ can later be extracted. This means that both the predictive performance, i.e., the ability of the model to give correct answers, and the calibration, i.e., the fact that the network is neither overconfident nor underconfident about its predictions, have to be assessed.

The predictive performance, sometimes called sharpness in statistics, of a network can be assessed by treating the estimator ŷ as the prediction.


Fig. 11: Examples of calibration curves (observed probability as a function of predicted probability) for (a) an underconfident and (b) an overconfident model.

This procedure often depends on the type of data the network is meant to treat. Many different metrics, e.g., Mean Square Error (MSE), ℓn distances and cross-entropy, are used in practice. Covering these metrics is out of the scope of this tutorial; instead, we refer the reader to [102] for more details.

The standard method to assess the calibration of a model is a calibration curve, also called a reliability diagram [27, 103]. It is defined as a function p̂ : [0, 1] → [0, 1] that represents the observed probability p̂, or empirical frequency, as a function of the predicted probability p, see Fig. 11. If p̂ < p, then the model is overconfident; otherwise, it is underconfident. A well-calibrated model should have p̂ ≅ p. Using this approach requires first choosing a set of events E with different predicted probabilities and then measuring the empirical frequency of each event using a test set T.

For a binary classifier, the set of test events can be chosen as the sets of datapoints with predicted probabilities of acceptance in the interval [p − δ, p + δ] for a chosen δ, or alternatively [0, p] or [1 − p, 1] for small datasets. The empirical frequency is given by:

\hat{p} = \frac{\sum_{(y,\hat{y}) \in T} y \cdot I_{[p-\delta,\,p+\delta]}(\hat{y})}{\sum_{(y,\hat{y}) \in T} I_{[p-\delta,\,p+\delta]}(\hat{y})}, \quad (49)

where y is the observed binary label and ŷ the predicted probability of acceptance for a test point.

For multi-class classifiers, the calibration curve can be checked independently for each class against all the other classes. In this case, the problem is reduced to a binary classifier.
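A minimal Python sketch of this binning procedure for a binary classifier (Eq. 49) is given below, assuming numpy; the grid of probability levels and the half-width delta are illustrative choices.

import numpy as np

def calibration_curve(p_pred, y_true, delta=0.05, grid=np.linspace(0.05, 0.95, 10)):
    observed = []
    for p in grid:
        mask = np.abs(p_pred - p) <= delta      # indicator I_[p-delta, p+delta] on the predictions
        observed.append(y_true[mask].mean() if mask.any() else np.nan)
    return grid, np.array(observed)             # predicted levels and empirical frequencies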

Regression problems are slightly more complex, since the network does not output a confidence level, like a classifier, but a distribution over possible outputs. The solution is to use an intermediate statistic with a known probability distribution. Assuming independence between the y for a sufficiently large set of different, randomly selected inputs x, one can assume that the normalized sum of squared residuals (NSSR) follows a Chi-square law:

\mathrm{NSSR} = (y - \hat{y})^{\top} \Sigma_y^{-1} (y - \hat{y}) \sim \chi^2_{\mathrm{Dim}(y)}. \quad (50)

This allows attributing to each data point in the test set T a predicted probability, defined as the probability of observing a variance-normalized distance between the prediction and the true value equal to or lower than the actual distance. Formally, the predicted probability is computed as:

p_i = X^2_{\mathrm{Dim}(y)}(\mathrm{NSSR}_i) \quad \forall (y_i, x_i) \in T, \quad (51)

where X^2_{\mathrm{Dim}(y)} is the Chi-square cumulative distribution function with Dim(y) degrees of freedom. The observed probability can be computed as:

\hat{p}_i = \frac{1}{|T|} \sum_{j=1}^{|T|} I_{[0,\infty)}(p_j - p_i). \quad (52)

We present a practical computation of such a calibration curve for one of the practical examples we provide in the supplementary material (Practical example II).
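A minimal Python sketch of this regression calibration procedure (Eqs. 50-52) is given below, assuming numpy and scipy; the observed probability is computed here as the empirical frequency of test points whose predicted probability does not exceed each level, which matches the intent of Eq. (52) up to the orientation convention of the indicator.

import numpy as np
from scipy.stats import chi2

def regression_calibration(y_true, y_mean, y_cov):
    # y_true, y_mean: (T, d); y_cov: (T, d, d) predictive covariances.
    d = y_true.shape[1]
    r = y_true - y_mean
    nssr = np.einsum('ti,tij,tj->t', r, np.linalg.inv(y_cov), r)   # Eq. (50)
    p_pred = chi2.cdf(nssr, df=d)                                  # Eq. (51)
    p_obs = np.array([(p_pred <= p).mean() for p in p_pred])       # empirical frequency, cf. Eq. (52)
    return p_pred, p_obs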

Giving the whole calibration curve for a given stochastic model is often a good idea, as it allows one to see where the model is likely to be overconfident or underconfident. It also allows, to a certain extent, recalibrating the model [103]. However, providing a summary measure to ease comparison or interpretation might also be necessary. The Area Under the Curve (AUC) is a standard metric of the form:

\mathrm{AUC} = \int_0^1 \hat{p}\, dp. \quad (53)

An AUC of 0.5 indicates that the model is, on average, well calibrated.

The distance from the ideal calibration curve is also a good indicator of the calibration of a model:

d(p, \hat{p}) = \sqrt{\int_0^1 (p - \hat{p})^2\, dp}. \quad (54)

When d(p, p̂) = 0, the model is perfectly calibrated. Other measures have also been proposed; examples include the Expected Calibration Error and some discretized variants of the distance from the ideal calibration curve [12].

VIII. CONCLUSION

This tutorial covers the design, training and evaluation of BNNs. While their underlying principle is simple, i.e., just training an ANN with some probability distribution attached to its weights, designing efficient algorithms remains very challenging. Nonetheless, the potential applications of BNNs are huge. In particular, BNNs constitute a promising paradigm for applying deep learning in areas where a system is not allowed to fail to generalize without emitting a warning. Finally, Bayesian methods can help design new learning and regularization strategies. Thus, their relevance extends to traditional point-estimate models.

IX. ACKNOWLEDGEMENTS

This material is partially based on research sponsored by the Australian Research Council https://www.arc.gov.au/ (Grants DP150100294 and DP150104251), and the Air Force Research Laboratory and DARPA https://afrl.dodlive.mil/tag/darpa/ under agreement number FA8750-19-2-0501.

REFERENCES

[1] C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan,I. Goodfellow, and R. Fergus, “Intriguing properties of neuralnetworks,” arXiv preprint arXiv:1312.6199, 2013.

[2] Q. Rao and J. Frtunikj, “Deep learning for self-driving cars:Chances and challenges,” in Proceedings of the 1st Inter-national Workshop on Software Engineering for AI in Au-tonomous Systems, ser. SEFAIS ’18, 2018, pp. 35–38.


[3] J. Ker, L. Wang, J. Rao, and T. Lim, “Deep learning appli-cations in medical image analysis,” IEEE Access, vol. 6, pp.9375–9389, 2018.

[4] R. C. Cavalcante, R. C. Brasileiro, V. L. Souza, J. P. Nobrega,and A. L. Oliveira, “Computational intelligence and financialmarkets: A survey and future directions,” Expert Systems withApplications, vol. 55, pp. 194–211, 2016.

[5] H. M. D. Kabir, A. Khosravi, M. A. Hosen, and S. Nahavandi,“Neural network-based uncertainty quantification: A surveyof methodologies and applications,” IEEE Access, vol. 6, pp.36 218–36 234, 2018.

[6] A. Etz, Q. F. Gronau, F. Dablander, P. A. Edelsbrunner, andB. Baribault, “How to become a Bayesian in eight easy steps:An annotated reading list,” Psychonomic Bulletin & Review,vol. 25, pp. 219–234, 2018.

[7] N. G. Polson, V. Sokolov et al., “Deep learning: a Bayesianperspective,” Bayesian Analysis, vol. 12, no. 4, pp. 1275–1304,2017.

[8] J. Lampinen and A. Vehtari, “Bayesian approach for neuralnetworks—review and case studies,” Neural Networks, vol. 14,no. 3, pp. 257 – 274, 2001.

[9] D. M. Titterington, “Bayesian methods for neural networksand related models,” Statist. Sci., vol. 19, no. 1, pp. 128–139,02 2004.

[10] H. Wang and D.-Y. Yeung, “A survey on bayesian deeplearning,” ACM Comput. Surv., vol. 53, no. 5, Sep. 2020.

[11] C. Guo, G. Pleiss, Y. Sun, and K. Q. Weinberger, “Oncalibration of modern neural network,” in Proceedings of the34th International Conference on Machine Learning - Volume70, ser. ICML’17, 2017, pp. 1321–1330.

[12] J. Nixon, M. W. Dusenberry, L. Zhang, G. Jerfel, and D. Tran,“Measuring calibration in deep learning,” in The IEEE Con-ference on Computer Vision and Pattern Recognition (CVPR)Workshops, June 2019.

[13] D. Hendrycks and K. Gimpel, “A baseline for detecting mis-classified and out-of-distribution examples in neural networks,”in 5th International Conference on Learning Representations,ICLR 2017, Conference Track Proceedings, 2017.

[14] Z.-H. Zhou, Ensemble Methods: Foundations and Algorithms,1st ed. Chapman and Hall/CRC, 2012.

[15] F. Galton, “Vox Populi,” Nature, vol. 75, no. 1949, pp. 450–451, Mar 1907.

[16] D. J. C. MacKay, “A practical Bayesian framework for back-propagation networks,” Neural Computation, vol. 4, no. 3, pp.448–472, 1992.

[17] A. G. Wilson and P. Izmailov, “Bayesian deep learningand a probabilistic perspective of generalization,” CoRR, vol.abs/2002.08791, 2020. [Online]. Available: http://arxiv.org/abs/2002.08791

[18] Y. Gal and Z. Ghahramani, “Bayesian convolutional neuralnetworks with Bernoulli approximate variational inference,”in 4th International Conference on Learning Representations(ICLR) workshop track, 2016.

[19] D. P. Kingma and M. Welling, “Stochastic gradient vb and thevariational auto-encoder,” in Second International Conferenceon Learning Representations, ICLR, vol. 19, 2014.

[20] C. Robert, The Bayesian choice: from decision-theoretic foun-dations to computational implementation. Springer Science& Business Media, 2007.

[21] J. Mitros and B. M. Namee, “On the validity of Bayesianneural networks for uncertainty estimation,” in AICS, 2019.

[22] A. Kristiadi, M. Hein, and P. Hennig, “Being Bayesian,even just a bit, fixes overconfidence in ReLU networks,”CoRR, vol. abs/2002.10118, 2020. [Online]. Available:http://arxiv.org/abs/2002.10118

[23] Y. Ovadia, E. Fertig, J. Ren, Z. Nado, D. Sculley, S. Nowozin,J. Dillon, B. Lakshminarayanan, and J. Snoek, “Can youtrust your models uncertainty? evaluating predictive uncer-tainty under dataset shift,” in Advances in Neural Information

Processing Systems 32. Curran Associates, Inc., 2019, pp.13 991–14 002.

[24] A. D. Kiureghian and O. Ditlevsen, “Aleatory or epistemic?does it matter?” Structural Safety, vol. 31, no. 2, pp. 105–112,2009, risk Acceptance and Risk Communication.

[25] S. Depeweg, J.-M. Hernandez-Lobato, F. Doshi-Velez, andS. Udluft, “Decomposition of uncertainty in Bayesian deeplearning for efficient and risk-sensitive learning,” in Pro-ceedings of the 35th International Conference on MachineLearning, ser. Proceedings of Machine Learning Research,vol. 80, 2018, pp. 1184–1193.

[26] D. H. Wolpert, “The lack of a priori distinctions betweenlearning algorithms,” Neural Computation, vol. 8, no. 7, pp.1341–1390, 1996.

[27] A. Kendall and Y. Gal, “What uncertainties do we need inBayesian deep learning for computer vision?” in Proceedingsof the 31st International Conference on Neural InformationProcessing Systems, ser. NIPS’17, 2017, p. 5580–5590.

[28] T. Auld, A. W. Moore, and S. F. Gull, “Bayesian neuralnetworks for internet traffic classification,” IEEE Transactionson Neural Networks, vol. 18, no. 1, pp. 223–239, 2007.

[29] X. Zhang and S. Mahadevan, “Bayesian neural networks forflight trajectory prediction and safety assessment,” DecisionSupport Systems, vol. 131, p. 113246, 2020.

[30] S. Arangio and F. Bontempi, “Structural health monitoring of acable–stayed bridge with Bayesian neural networks,” Structureand Infrastructure Engineering, vol. 11, no. 4, pp. 575–587,2015.

[31] S. M. Bateni, D.-S. Jeng, and B. W. Melville, “Bayesian neuralnetworks for prediction of equilibrium and time-dependentscour depth around bridge piers,” Advances in EngineeringSoftware, vol. 38, no. 2, pp. 102–111, 2007.

[32] X. Zhang, F. Liang, R. Srinivasan, and M. Van Liew, “Estimat-ing uncertainty of streamflow simulation using bayesian neuralnetworks,” Water Resources Research, vol. 45, no. 2, 2009.

[33] A. D. Cobb, M. D. Himes, F. Soboczenski, S. Zorzan, M. D.O’Beirne, A. G. Baydin, Y. Gal, S. D. Domagal-Goldman,G. N. Arney, and D. A. and, “An ensemble of bayesianneural networks for exoplanetary atmospheric retrieval,” TheAstronomical Journal, vol. 158, no. 1, p. 33, jun 2019.

[34] F. Aminian and M. Aminian, “Fault diagnosis of analogcircuits using Bayesian neural networks with wavelet transformas preprocessor,” Journal of Electronic Testing, vol. 17, no. 1,pp. 29–36, Feb 2001.

[35] W. Beker, A. Wołos, S. Szymkuc, and B. A. Grzybowski,“Minimal–uncertainty prediction of general drug–likenessbased on Bayesian neural networks,” Nature Machine Intel-ligence, vol. 2, no. 8, pp. 457–465, Aug 2020.

[36] Y. Gal, R. Islam, and Z. Ghahramani, “Deep Bayesian activelearning with image data,” in Proceedings of the 34th Inter-national Conference on Machine Learning - Volume 70, ser.ICML’17, 2017, p. 1183–1192.

[37] T. Tran, T.-T. Do, I. Reid, and G. Carneiro, “Bayesiangenerative active deep learning,” CoRR, vol. abs/1904.11643,2019. [Online]. Available: http://arxiv.org/abs/1904.11643

[38] M. Opper and O. Winther, “A Bayesian approach to on-linelearning,” On-line learning in neural networks, pp. 363–378,1998.

[39] H. Ritter, A. Botev, and D. Barber, “Online structured Laplaceapproximations for overcoming catastrophic forgetting,” inProceedings of the 32nd International Conference on Neu-ral Information Processing Systems, ser. NIPS’18, 2018, pp.3742–3752.

[40] S. Pouyanfar, S. Sadiq, Y. Yan, H. Tian, Y. Tao, M. P. Reyes,M.-L. Shyu, S.-C. Chen, and S. S. Iyengar, “A survey ondeep learning: Algorithms, techniques, and applications,” ACMComput. Surv., vol. 51, no. 5, Sep. 2018.

[41] W. L. Buntine, “Operations for learning with graphical mod-els,” Journal of Artificial Intelligence Research, vol. 2, pp.


159–225, Dec 1994.
[42] Y. Wen, P. Vicol, J. Ba, D. Tran, and R. Grosse, “Flipout:

Efficient pseudo-independent weight perturbations on mini-batches,” in International Conference on Learning Represen-tations, 2018.

[43] C. Zhang, S. Bengio, M. Hardt, B. Recht, and O. Vinyals, “Un-derstanding deep learning requires rethinking generalization,”in 5th International Conference on Learning Representations,ICLR, 2017.

[44] B. Carpenter, A. Gelman, M. D. Hoffman, D. Lee,B. Goodrich, M. Betancourt, M. Brubaker, J. Guo, P. Li, andA. Riddell, “Stan: A probabilistic programming language,”Journal of statistical software, vol. 76, no. 1, 2017.

[45] A. Gelman and other Stan developers, “Prior choice recom-mendations,” 2020, retrieved from https://github.com/stan-dev/stan/wiki/Prior-Choice-Recommendations [last seen13.07.2020].

[46] D. Silvestro and T. Andermann, “Prior choice affectsability of Bayesian neural networks to identify unknowns,”CoRR, vol. abs/2005.04987, 2020. [Online]. Available:http://arxiv.org/abs/2005.04987

[47] K. P. Murphy, Machine Learning: A Probabilistic Perspective.The MIT Press, 2012.

[48] A. A. Pourzanjani, R. M. Jiang, B. Mitchell, P. J.Atzberger, and L. R. Petzold, “Bayesian inference over theStiefel manifold via the Givens representation,” CoRR, vol.abs/1710.09443, 2017. [Online]. Available: http://arxiv.org/abs/1710.09443

[49] J. L. Ba, J. R. Kiros, and G. E. Hinton, “Layer normalization,”CoRR, vol. arXiv:1607.06450, 2016, in NIPS 2016 DeepLearning Symposium.

[50] G.-J. Qi and J. Luo, “Small data challenges in big dataera: A survey of recent progress on unsupervised andsemi-supervised methods,” CoRR, vol. abs/1903.11260, 2019.[Online]. Available: http://arxiv.org/abs/1903.11260

[51] N. Natarajan, I. S. Dhillon, P. K. Ravikumar, and A. Tewari,“Learning with noisy labels,” in Advances in Neural Informa-tion Processing Systems 26. Curran Associates, Inc., 2013,pp. 1196–1204.

[52] B. Frenay and M. Verleysen, “Classification in the presence oflabel noise: A survey,” IEEE Transactions on Neural Networksand Learning Systems, vol. 25, no. 5, pp. 845–869, 2014.

[53] A. C. Tommi and T. Jaakkola, “On information regularization,”in In Proceedings of the 19th UAI, 2003.

[54] K. Sohn, D. Berthelot, C.-L. Li, Z. Zhang, N. Carlini, E. D.Cubuk, A. Kurakin, H. Zhang, and C. Raffel, “FixMatch:Simplifying semi-supervised learning with consistency andconfidence,” CoRR, vol. abs/2001.07685, 2020. [Online].Available: https://arxiv.org/abs/2001.07685

[55] M. Belkin, P. Niyogi, and V. Sindhwani, “Manifold regular-ization: A geometric framework for learning from labeled andunlabeled examples,” J. Mach. Learn. Res., vol. 7, pp. 2399–2434, Dec. 2006.

[56] S. Yu, B. Krishnapuram, R. Rosales, and R. B. Rao, “Bayesianco-training,” Journal of Machine Learning Research, vol. 12,no. 80, pp. 2649–2680, 2011.

[57] R. Kunwar, U. Pal, and M. Blumenstein, “Semi-supervisedonline Bayesian network learner for handwritten charactersrecognition,” in 2014 22nd International Conference on Pat-tern Recognition, 2014, pp. 3104–3109.

[58] D.-H. Lee, “Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks,” inWorkshop on challenges in representation learning, ICML,vol. 3, 2013.

[59] Z. Li, B. Ko, and H.-J. Choi, “Naive semi-supervised deeplearning using pseudo-label,” Peer-to-Peer Networking andApplications, vol. 12, no. 5, pp. 1358–1368, 2019.

[60] M. S. Bari, M. T. Mohiuddin, and S. Joty, “MultiMix: A robustdata augmentation strategy for cross-lingual nlp,” in ICML,

2020.
[61] O. Chapelle, J. Weston, L. Bottou, and V. Vapnik, “Vicinal risk

minimization,” in Advances in Neural Information ProcessingSystems 13. MIT Press, 2001, pp. 416–422.

[62] Q. Xie, Z. Dai, E. H. Hovy, M. Luong,and Q. V. Le, “Unsupervised data augmentation,”CoRR, vol. abs/1904.12848, 2019. [Online]. Available:http://arxiv.org/abs/1904.12848

[63] T. Hospedales, A. Antoniou, P. Micaelli, and A. Storkey,“Meta-learning in neural networks: A survey,” CoRR, vol.abs/2004.05439, 2020. [Online]. Available: http://arxiv.org/abs/2004.05439

[64] S. J. Pan and Q. Yang, “A survey on transfer learning,” IEEETransactions on Knowledge and Data Engineering, vol. 22,no. 10, pp. 1345–1359, 2010.

[65] X. Qiu, T. Sun, Y. Xu, Y. Shao, N. Dai, and X. Huang, “Pre-trained models for natural language processing: A survey,”CoRR, vol. abs/2003.08271, 2020.

[66] L. Jing and Y. Tian, “Self-supervised visual feature learningwith deep neural networks: A survey,” IEEE Transactions onPattern Analysis and Machine Intelligence, pp. 1–1, 2020.

[67] E. Grant, C. Finn, S. Levine, T. Darrell, and T. L. Grif-fiths, “Recasting gradient-based meta-learning as hierarchicalBayes,” in 6th International Conference on Learning Repre-sentations, ICLR 2018, Vancouver, BC, Canada, April 30 -May 3, 2018, Conference Track Proceedings, 2018.

[68] L. Beyer, X. Zhai, A. Oliver, and A. Kolesnikov, “S4L:Self-supervised semi-supervised learning,” in 2019 IEEE/CVFInternational Conference on Computer Vision (ICCV), 2019,pp. 1476–1485.

[69] R. Bardenet, A. Doucet, and C. Holmes, “On Markov ChainMonte Carlo methods for tall data,” J. Mach. Learn. Res.,vol. 18, no. 1, pp. 1515–1557, Jan. 2017.

[70] E. I. George, G. Casella, and E. I. George, “Explaining theGibbs sampler,” The American Statistician, 1992.

[71] S. Chib and E. Greenberg, “Understanding the Metropolis-Hastings algorithm,” The American Statistician, vol. 49, no. 4,pp. 327–335, 1995.

[72] R. M. Neal et al., “MCMC using Hamiltonian dynamics,”Handbook of Markov Chain Monte Carlo, vol. 2, no. 11, p. 2,2011.

[73] M. D. Hoffman and A. Gelman, “The No-U-Turn sampler:adaptively setting path lengths in Hamiltonian Monte Carlo,”Journal of Machine Learning Research, vol. 15, no. 1, pp.1593–1623, 2014.

[74] D. M. Blei, A. Kucukelbir, and J. D. McAuliffe, “Variationalinference: A review for statisticians,” Journal of the AmericanStatistical Association, vol. 112, no. 518, pp. 859–877, 2017.

[75] M. D. Hoffman, D. M. Blei, C. Wang, and J. Paisley, “Stochas-tic variational inference,” J. Mach. Learn. Res., vol. 14, no. 1,pp. 1303–1347, May 2013.

[76] A. Graves, “Practical variational inference for neural net-works,” in Advances in Neural Information Processing Systems24. Curran Associates, Inc., 2011, pp. 2348–2356.

[77] Z. Ghahramani and M. J. Beal, “Propagation algorithms forvariational Bayesian learning,” in Advances in Neural Infor-mation Processing Systems 13. MIT Press, 2001, pp. 507–513.

[78] H. Ritter, A. Botev, and D. Barber, “A scalable laplace approx-imation for neural networks,” in International Conference onLearning Representations, 2018.

[79] W. J. Maddox, P. Izmailov, T. Garipov, D. P. Vetrov, andA. G. Wilson, “A simple baseline for Bayesian uncertainty indeep learning,” in Advances in Neural Information ProcessingSystems 32. Curran Associates, Inc., 2019, pp. 13 153–13 164.

[80] J. M. Hernandez-Lobato and R. P. Adams, “Probabilisticbackpropagation for scalable learning of Bayesian neural net-works,” in Proceedings of the 32nd International Conferenceon International Conference on Machine Learning - Volume


37, ser. ICML’15, 2015, p. 1861–1869.
[81] C. Blundell, J. Cornebise, K. Kavukcuoglu, and D. Wierstra,

“Weight uncertainty in neural network,” in Proceedings ofthe 32nd International Conference on Machine Learning, ser.Proceedings of Machine Learning Research, vol. 37, 2015, pp.1613–1622.

[82] D. P. Kingma, M. Welling et al., “An introduction to vari-ational autoencoders,” Foundations and Trends® in MachineLearning, vol. 12, no. 4, pp. 307–392, 2019.

[83] D. Kingma and J. Ba, “Adam: A method for stochasticoptimization,” International Conference on Learning Repre-sentations, 12 2014.

[84] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, andR. Salakhutdinov, “Dropout: A simple way to prevent neu-ral networks from overfitting,” Journal of Machine LearningResearch, vol. 15, no. 56, pp. 1929–1958, 2014.

[85] Y. Gal and Z. Ghahramani, “Dropout as a Bayesian approxi-mation: Representing model uncertainty in deep learning,” inProceedings of the 33rd International Conference on MachineLearning - Volume 48, ser. ICML’16, 2016, p. 1050–1059.

[86] Y. Li and Y. Gal, “Dropout inference in Bayesian neuralnetworks with alpha-divergences,” in Proceedings of the 34thInternational Conference on Machine Learning - Volume 70,ser. ICML’17, 2017, pp. 2052–2061.

[87] J. Hron, A. Matthews, and Z. Ghahramani, “VariationalBayesian dropout: pitfalls and fixes,” in Proceedings of the35th International Conference on Machine Learning, ser. Pro-ceedings of Machine Learning Research, vol. 80, 2018, pp.2019–2028.

[88] A. Chan, A. Alaa, Z. Qian, and M. Van Der Schaar, “Unla-belled data improves Bayesian uncertainty calibration undercovariate shift,” in Proceedings of the 37th InternationalConference on Machine Learning, ser. Proceedings of MachineLearning Research, vol. 119. Virtual: PMLR, 13–18 Jul 2020,pp. 1392–1402.

[89] S. Mandt, M. D. Hoffman, and D. M. Blei, “Stochastic gradientdescent as approximate Bayesian inference,” The Journal ofMachine Learning Research, vol. 18, no. 1, pp. 4873–4907,2017.

[90] M. Welling and Y. W. Teh, “Bayesian learning via stochasticgradient Langevin dynamics,” in Proceedings of the 28thinternational conference on machine learning, ser. ICML ’11,2011, pp. 681–688.

[91] N. Seedat and C. Kanan, “Towards calibrated and scalableuncertainty representations for neural networks,” CoRR, vol.abs/1911.00104, 2019. [Online]. Available: http://arxiv.org/abs/1911.00104

[92] B. Lakshminarayanan, A. Pritzel, and C. Blundell, “Simple andscalable predictive uncertainty estimation using deep ensem-bles,” in Advances in Neural Information Processing Systems30. Curran Associates, Inc., 2017, pp. 6402–6413.

[93] M. Khan, D. Nielsen, V. Tangkaratt, W. Lin, Y. Gal, andA. Srivastava, “Fast and scalable Bayesian deep learning byweight-perturbation in Adam,” in Proceedings of the 35th In-ternational Conference on Machine Learning, ser. Proceedingsof Machine Learning Research, vol. 80, 2018, pp. 2611–2620.

[94] T. Pearce, F. Leibfried, A. Brintrup, M. Zaki, and A. Neely,“Uncertainty in neural networks: Approximately Bayesianensembling,” in AISTATS 2020, 2020.

[95] J. Zeng, A. Lesnikowski, and J. M. Alvarez, “The relevanceof Bayesian layer positioning to model uncertainty in deepBayesian active learning,” CoRR, vol. abs/1811.12535, 2018.[Online]. Available: http://arxiv.org/abs/1811.12535

[96] N. Brosse, C. Riquelme, A. Martin, S. Gelly, and EricMoulines, “On last-layer algorithms for classification:Decoupling representation from uncertainty estimation,”CoRR, vol. abs/2001.08049, 2020. [Online]. Available:http://arxiv.org/abs/2001.08049

[97] E. Snelson and Z. Ghahramani, “Compact approximations to

Bayesian predictive distributions,” in Proceedings of the 22ndInternational Conference on Machine Learning, ser. ICML’05, 2005, p. 840–847.

[98] A. Korattikara, V. Rathod, K. Murphy, and M. Welling,“Bayesian dark knowledge,” in Proceedings of the 28th Inter-national Conference on Neural Information Processing Sys-tems - Volume 2, ser. NIPS’15, 2015, pp. 3438–3446.

[99] G. Hinton, O. Vinyals, and J. Dean, “Distilling the knowledgein a neural network,” arXiv preprint arXiv:1503.02531, 2015,in NIPS 2014 Deep Learning Workshop.

[100] A. K. Menon, A. S. Rawat, S. J. Reddi, S. Kim, andS. Kumar, “Why distillation helps: a statistical perspective,”CoRR, vol. abs/2005.10419, 2020. [Online]. Available:https://arxiv.org/abs/2005.10419

[101] K.-C. Wang, P. Vicol, J. Lucas, L. Gu, R. Grosse, andR. Zemel, “Adversarial distillation of Bayesian neural networkposteriors,” in Proceedings of the 35th International Con-ference on Machine Learning, ser. Proceedings of MachineLearning Research, vol. 80, 2018, pp. 5190–5199.

[102] K. Janocha and W. M. Czarnecki, “On loss functions for deepneural networks in classification,” Schedae Informaticae, vol.1/2016, 2017. [Online]. Available: http://dx.doi.org/10.4467/20838476SI.16.004.6185

[103] V. Kuleshov, N. Fenner, and S. Ermon, “Accurate uncertaintiesfor deep learning using calibrated regression,” in Proceedingsof the 35th International Conference on Machine Learning, ser.Proceedings of Machine Learning Research, vol. 80, 2018, pp.2796–2804.

APPENDIX A
A PROOF OF EQUATION 38

Let us assume that we have a probability space (Ω, F, P), where Ω is a set of outcomes, F is a σ-algebra of Ω representing possible events and P is a measure defined on F and assigning value 1 to Ω, representing the probability of an event. Also assume that we have a probability distribution qφ(θ) for a given random variable θ, a probability distribution q(ε) for a given random variable ε and a functional relation t(ε, φ) such that t(ε, φ) is distributed like qφ(θ) and t(ε, φ) is a bijection with respect to ε. Thus, we have:

\mathbb{P}(\theta^{-1}(t(E,\phi))) = \mathbb{P}(\varepsilon^{-1}(E)) \quad \forall E \in \varepsilon(\mathcal{F}), \quad (55)

with

\varepsilon(\mathcal{F}) = \left\{ \{\varepsilon(\omega) : \omega \in e\} : e \in \mathcal{F} \right\},
\quad t(E,\phi) = \{ t(\varepsilon,\phi) : \varepsilon \in E \},
\quad \varepsilon^{-1}(E) = \bigcup_{e \in \mathcal{F} \,\wedge\, \varepsilon(e) \subseteq E} e,
\quad \theta^{-1}(t(E,\phi)) = \bigcup_{e \in \mathcal{F} \,\wedge\, \theta(e) \subseteq t(E,\phi)} e.

Since t(ε, φ) is a bijection with respect to ε, we have ε⁻¹(E) = θ⁻¹(t(E, φ)). This implies:

\int_{\theta \in t(E,\phi)} q_\phi(\theta)\, d\theta = \int_{\varepsilon \in E} q(\varepsilon)\, d\varepsilon \quad \forall E \in \varepsilon(\mathcal{F}), \quad (56)

which in turn implies:

q_\phi(\theta)\, d\theta = q(\varepsilon)\, d\varepsilon \quad (57)

for non-degenerate probability distributions qφ(θ) and q(ε). Now, given a differentiable function f(φ, θ), we have:

\int_{\theta \in t(\varepsilon(\Omega),\phi)} f(\phi,\theta)\, q_\phi(\theta)\, d\theta = \int_{\varepsilon \in \varepsilon(\Omega)} f(\phi, t(\varepsilon,\phi))\, q(\varepsilon)\, d\varepsilon, \quad (58)

which implies Equation (38).


Hands-on Bayesian Neural Networks – Supplementary material

I. PRACTICAL EXAMPLE – BAYESIAN MNIST

Over the years, MNIST [1] has become the most renowned toy dataset in deep learning. Coding a handwritten digit classifier based on this dataset is now the Hello World of deep neural network programming. The first practical example we introduce for this tutorial is thus just a plain old classifier for MNIST implemented as a BNN. The code is available on github [https://github.com/french-paragon/BayesianMnist]. The setup is as follows:

The purpose is to show a Bayesian Neural Network hello world project.

The problem is to train a BNN to perform hand-writtendigit recognition.

The dataset we are going to use is MNIST. However, to be able to test how the proposed BNN reacts to unseen data, we will remove one of the classes from the training set, so that the network never sees it during training.

The stochastic model is the plain Bayesian regression presented in Section IV-B of the main paper. We use a normal distribution as a prior for the network parameters θ, with a standard deviation of 5 for the weights and 25 for the biases, as we expect the scale of the biases to have a bit more variability than the weights. This is because, in a ReLU or leaky-ReLU network, the weights influence the local change in slope of a layer function while the biases indicate the position of the inflection points.

The functional model is a standard convolutional neural network with two convolutional layers followed by two fully connected layers (Fig. 1).

A. Training

We used Variational Inference to train this BNN. We use a Gaussian distribution, with diagonal covariance, as variational posterior (see Section V-C of the main paper). Since we expect the posterior to be multi-modal, we train an ensemble of BNNs instead of a single one (see Section V-E2b of the main paper).

Fig. 1: The neural network architecture used for the Bayesian MNIST practical example (input 1@28x28, convolution to 16@24x24, max-pool to 16@12x12, convolution to 32@8x8, max-pool to 32@4x4, flattened to 1x512, ReLU dense layer of size 1x128, dense output layer of size 1x10).
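For reference, a minimal Python sketch of the point-estimate backbone of Fig. 1 is given below, assuming PyTorch; in the actual BNN, the weights of these layers carry the prior and variational posterior described above rather than fixed values, and the exact placement of the ReLU non-linearities is an assumption.

import torch.nn as nn

backbone = nn.Sequential(
    nn.Conv2d(1, 16, kernel_size=5),   # 1@28x28 -> 16@24x24
    nn.ReLU(),
    nn.MaxPool2d(2),                   # -> 16@12x12
    nn.Conv2d(16, 32, kernel_size=5),  # -> 32@8x8
    nn.ReLU(),
    nn.MaxPool2d(2),                   # -> 32@4x4
    nn.Flatten(),                      # -> 1x512
    nn.Linear(512, 128),               # -> 1x128
    nn.ReLU(),
    nn.Linear(128, 10),                # -> 1x10 class scores
)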

Fig. 2: Average predicted digit probabilities, with within-sample and across-samples standard deviations, for (a) a seen class, (b) the unseen class and (c) a white noise input.

Note that the actual top-performing method on the MNIST leaderboard is also ensemble-based [2]. (The authors refer to a multi-column neural network, but the idea is the same.)

B. Results

We tested the final BNN against (1) the test set restricted to the classes the network has been trained on, (2) the test set restricted to the class the network has not been trained on, and (3) pure white noise. For each case, we report the average probabilities predicted by the BNN for all input images in the considered class (Fig. 2), along with the average standard deviation of the network output on a single sample and the standard deviation of the predictions across all samples. While not as informative and rigorous as a collection of calibration curves, this does indicate whether the BNN uncertainty matches the natural variability of the dataset. The reported results show that, for a class seen during training, the network gives the correct prediction and seems to be quite confident about its results. The variability per sample and across samples are coherent with one another and quite small. When presented with a digit from a class unseen during training, the network apparently tries to match it to different digits that might share some similarities. However, the low predicted probabilities, along with the high per-sample and across-samples standard deviations, show that the network is aware that it does not know about these unseen digits. As for the white noise, the average output is not perfectly flat but quite constant.


II. PRACTICAL EXAMPLE – SPARSE MEASURE

The second practical example we introduce is titled "Sparse measure".

The purpose is to present a small model illustrating how different training strategies can be integrated in a BNN. We also provide a Python implementation of this example in order to show how the different hypotheses we present in this Appendix translate to actual code [https://github.com/french-paragon/sparse measure bnn].

The problem is to learn a function f : Rn → Rm based on a dataset D = {(xi, yi) | i ∈ [1, N], xi ∈ Rn, yi ∈ Rm} where, for each tuple (xi, yi), only some elements of yi are actually measured. This scenario corresponds to any application of machine learning where measuring a raw signal, represented by the xi's, is easy, but one wants to train an algorithm to reconstruct a derived signal, represented by the yi's, which is much harder to acquire. On top of that, the elements of yi are measured with a certain level of uncertainty. Also, one element of yi, which is constant but unknown, has a much higher level of uncertainty (e.g., it has been found that one of the instruments used for data acquisition is defective, but no one remembers for which experiment it has been used). It is also known that the function f is continuous and will, with very high probability, map any given input to a point near a known hyperplane of Rm.

The dataset is generated at random for each experiment using Python. Inputs are sampled from a zero-mean multivariate normal distribution with a random correlation matrix. The input is then multiplied by a random projection matrix to R3 before a non-linear function is applied. A random set of three orthonormal vectors in Rm is then generated and used to reproject the values into the final output space, constrained on a known hyperplane. A final non-linear function is applied to the input and its result is added to the previously generated outputs with a very low gain factor, to ensure the actual output is still likely to be near the known hyperplane. Finally, noise is added to the training data, with one channel receiving much more noise than the other channels. We set n = m = 9. The measured values for the training set are then selected at random, the other values being set to 0. A set of binary masks is returned with the data to indicate which values have actually been measured.

The stochastic model is the most interesting part of this case study. Note that learning one function f : Rn → Rm is equivalent to learning m functions f : Rn → R, meaning a standard Bayesian regression would be enough to address this learning problem. However, doing so ignores the fact that the functions under consideration can be correlated with one another, so it is possible to do much more.

First, it is possible to extend the model to account for the noisy labels. To learn which output channel is actually more noisy than the others, a meta-learning approach can be used, with hierarchical Bayes applied not on the model parameters but on the noise model. If applied to standard point-estimate networks, this approach would be seen as learning a loss. The prior can be extended with a consistency condition to account for the fact that the unknown function projects points near a known hyperplane.

Fig. 3: BBN corresponding to the stochastic model we used for the sparse measures example.

Last but not least, a semi-supervised learning model can be used to share information across the training samples. The PGM corresponding to this approach is depicted in Fig. 3. We kept the plate for the dataset D as we will use a second consistency condition to implement the semi-supervised learning strategy. We also chose to use a last-layer-only BNN, mostly to illustrate how it can be done. We will also consider point estimates for the hierarchy of noise quantification variables.

The functional model Φξ,θ is a feed-forward artificial neural network with five hidden layers (Fig. 4). The first four layers are point-estimate layers, with parameters ξ, while the last hidden layer and the output layer are stochastic, with parameters θ. We also have three Monte Carlo dropout layers at the end of the network. The hourglass shape of the network is supposed to match the expected behaviour of the studied function, where the information is approximately located on a smaller dimensional space.

A. Base stochastic model

The base stochastic model is built as a Bayesian regression with a last-layer BNN architecture. The base probability of θ, without the consistency conditions that will be applied later on, has been set to a normal distribution with mean 0 and a standard deviation of 1 for the weights and 10 for the biases. The prior distribution of the output y knowing the input x and the parameters θ and ξ is then given by:

p(y|x,\theta,\xi) = \mathcal{N}(\Phi_{\xi,\theta}(x), \sigma_y), \quad (1)

Fig. 4: The neural network architecture we used for the sparse measure example (input, leaky-ReLU hidden neurons, point-estimate layers followed by stochastic layers with MC-dropout active, output).


where σy is the small uncertainty due to the fact that the functional model will never be able to perfectly fit the actual function f. We set σy = 0.1.

B. Noisy labels model

For this regression problem, we assume that the measurements have been corrupted by a zero-mean Gaussian noise, which is a convenient formulation and more often than not a reasonable approximation of the true noise model for continuous measurements (due to the central limit theorem). The mean of the noisy measurements ȳ is thus the unobserved, true function value y. For the standard deviation σy, we assume that it is σin = 0.1 for inliers and σout = 5 for outliers. The problem here is that we do not know which channel has been corrupted by the additional noise. We could just set a constant standard deviation for all channels slightly above σin, but instead we added an additional variable, po, a vector representing the probability that a given channel is an outlier. Since a single channel is corrupted by noise, we could constrain po to lie on the probability simplex. Instead, to let the model generalize to cases with more than a single noisy channel, we will impose that po ∈ [0, 1]m. For convenience, we will consider an unconstrained variable s ∈ Rm and define po as a function of s to simplify the optimisation:

p_o = \frac{1}{1 + e^{-s}}. \quad (2)

The value of σy is then determined by a Bernoulli distribution with parameter po:

b \sim \mathrm{Bernoulli}(p_o), \quad \sigma_y = b \cdot \sigma_{out} + (1 - b) \cdot \sigma_{in}. \quad (3)

The only thing left is to determine the prior for s. Assuming we have no idea which channel is noisy and which is not, a reasonable hypothesis is a Gaussian with mean 0 and a large covariance matrix Σs (we choose Σs = 2500I). Yet, since we know that only one channel is noisy, we can add a consistency condition to enforce that ‖po‖1 ≈ 1, or any expected number of noisy channels:

p(s) \propto \mathcal{N}(0, \Sigma_s)\, \exp\left[ -\gamma_s \left( \left\| \frac{1}{1 + e^{-s}} \right\|_1 - 1 \right)^2 \right], \quad (4)

where γs is a scaling factor representing the confidence one has in the model's capacity to fit the correct number of noisy channels. We set γs = 3000, as we are almost certain the model should be able to fit the single noisy channel. The probability of the measurement ȳ given y and σy is then given by:

p(\bar{y}|y, \sigma_y) = \mathcal{N}(y, \sigma_y). \quad (5)

Since we are not so interested in the complete posterior distribution for s, or even σy, we will learn point estimates of those variables instead and then derive a regularization term from the log of Equation (4). The final contribution of the learnable parameters s to the total loss, equal to −log(p(s)) up to an additive constant that one can just ignore, is:

\frac{1}{2}\, s^{\top} \Sigma_s^{-1} s + \gamma_s \left( \left\| \frac{1}{1 + e^{-s}} \right\|_1 - 1 \right)^2. \quad (6)

This is implemented in the outlierAwareSumSquareError learnable loss in the Python implementation.

C. Consistency condition

The function f is likely to map a given input near a known subspace of Rm, which can be represented as span(S), where S ∈ R^{m×3} is a given orthonormal matrix. The distance d_{y,S} between a prediction y and span(S) is then given by:

d_{y,S} = \left\| y - SS^{\top} y \right\|_2. \quad (7)

We then assume a priori that d_{y,S} knowing x follows a normal distribution, and use this to derive a consistency condition for p(θ, ξ|Dx):

p(\theta,\xi|D_x) \propto p(\theta)\, p(\xi)\, e^{ -\sum_{x \in D_x} \frac{1}{\sigma_d^2} \left\| \Phi_{\xi,\theta}(x) - SS^{\top}\Phi_{\xi,\theta}(x) \right\|_2^2 }, \quad (8)

where σd is the prior standard deviation of d_{y,S}, which we set to 1.

The additional contribution to the loss from the consistency condition is thus:

\sum_{x \in D_x} \frac{1}{\sigma_d^2} \left\| \Phi_{\xi,\theta}(x) - SS^{\top}\Phi_{\xi,\theta}(x) \right\|_2^2. \quad (9)
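A minimal Python sketch of this term, assuming PyTorch and batched predictions of shape (batch, m), could look as follows; the function name is illustrative.

import torch

def hyperplane_consistency(y_pred, S, sigma_d=1.0):
    # S is the m x 3 orthonormal matrix; y_pred @ S @ S.T is the projection onto span(S).
    residual = y_pred - y_pred @ S @ S.T
    return (residual ** 2).sum() / sigma_d ** 2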

D. Semi-supervised learning strategy

As stated in Section IV-D1 of the main paper, there are two ways one can implement a semi-supervised learning strategy for a BNN. The first one is by using a consistency condition (i.e., data-driven regularization). The second one is by assuming some kind of dependence across the samples. We use the former for this practical example, as data-driven regularization is generally easier to implement.

The intuition behind this example is that if two points in the input space are close to one another, then their images by f should also be close to one another. The simplest approach to formally measure this is just to assume that the norm of the difference in the output space, normalized by the distance in the input space, is small. This sets the consistency condition function C(θ, ξ, x) to be:

C(\theta,\xi,x) = \frac{1}{2} \sum_{x' \in D_x \setminus x} \frac{\left\| \Phi_{\xi,\theta}(x) - \Phi_{\xi,\theta}(x') \right\|_2^2}{\left\| x - x' \right\|_2^2}. \quad (10)

Averaging this consistency condition over the whole training set Dx is equivalent to computing, summed over all output channels of Φξ,θ, the quadratic form of the Laplacian matrix of the dense graph spanned by the training set Dx, where the weights of the edges are given by the inverse Euclidean distances of the points. Graph Laplacian regularization is itself a common method to implement semi-supervised learning [3].

This approach can be further improved. Assume the differences between the components of two random input vectors x and x′, as well as the differences between the components of their images by f, are normally distributed. This implies that:

\frac{\sigma^2_{\Delta x} \left\| f(x) - f(x') \right\|_2^2}{\sigma^2_{\Delta f} \left\| x - x' \right\|_2^2} \sim F(n, m), \quad (11)


Fig. 5: Comparison between the observed histogram of squared relative distances in the test set and the Fisher–Snedecor pdf with parameters 9 and 9.

Fig. 6: Prediction error distribution for the sparse measure example.

where F(n, m) is a Fisher–Snedecor distribution with parameters n and m, σΔx is the prior standard deviation of the difference of two random inputs, and σΔf is the prior standard deviation of the difference of the corresponding outputs. As shown in Fig. 5, such assumptions are not perfect but still match the data reasonably well, especially since those assumptions are just a prior and not an exact model. The most important point to notice is that both the observed and prior distributions have a heavy tail. This would not be the case with the consistency condition proposed in Equation 10. This naive approach would thus penalize outliers too heavily. We therefore implemented the following, corrected, consistency condition function in the sparse measures example:

C(\theta, \xi, x) = \sum_{x' \in D_x \setminus x} \left( \frac{n+m}{2} - 1 \right) \log(1 + F) - \left( \frac{n}{2} - 1 \right) \log(F), (12)

where F = \lambda_\Delta \frac{\left\| \Phi_{\xi,\theta}(x) - \Phi_{\xi,\theta}(x') \right\|_2^2}{\left\| x - x' \right\|_2^2} is the Fisher statistic.

The only parameter one has to define for this prior is λ_Δ, which is the ratio of σ²_{Δx} to σ²_{Δf}. We set λ_Δ = 1.
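A possible PyTorch transcription of this corrected condition is sketched below; the function name and the batching convention are ours, and the small eps constant is only there to avoid division by zero. It is not the repository's own code.

```python
import torch

def fisher_consistency(x: torch.Tensor, phi_out: torch.Tensor,
                       n: int, m: int, lambda_delta: float = 1.0,
                       eps: float = 1e-8) -> torch.Tensor:
    """Outlier-robust consistency of Eq. (12) for one mini-batch.

    x:       (N, n) unlabelled inputs.
    phi_out: (N, m) corresponding network outputs.
    n, m:    input and output dimensions (parameters of the F-distribution).
    """
    d_in = torch.cdist(x, x) ** 2
    d_out = torch.cdist(phi_out, phi_out) ** 2
    F = lambda_delta * d_out / (d_in + eps)          # Fisher statistic of Eq. (12)
    # Per-pair penalty of Eq. (12), akin to a negative log F-density up to constants
    penalty = ((n + m) / 2 - 1) * torch.log1p(F) - (n / 2 - 1) * torch.log(F + eps)
    # Exclude the x' = x terms on the diagonal
    mask = ~torch.eye(x.shape[0], dtype=torch.bool)
    return penalty[mask].sum() / x.shape[0]
```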

E. Results

The estimator for y reaches an RMSE of around 0.4, with an error distribution approximating a normal distribution. Variations can be observed over multiple runs, as a different function f is generated at random each time (Fig. 6).

[Figure: observed probability versus predicted probability, showing the calibration curve against the ideal curve.]
Fig. 7: Calibration curve for the sparse measure example.

The Python code also computes the calibration curve using the hypothesis for regression models presented in Section VII of the main paper. The results (Fig. 7) show that the model tends to be slightly underconfident, except for large outliers, where the model is overconfident. This is probably due to the fact that our simple variational inference model cannot fit the exact posterior and thus tends to be a bit too pessimistic about its predictions, to compensate for the outliers it is unable to fit efficiently.

III. PRACTICAL EXAMPLE – PAPERFOLD

The last practical example we introduce is titled "Paperfold". The code is available on GitHub [https://github.com/french-paragon/paperfold-bnn].

The purpose of this case study is to compare different inference approaches for BNNs. We introduce a model, and a corresponding dataset, small enough that all existing training methods for BNNs remain tractable for the corresponding architecture. The number of parameters is small enough that the posterior and the corresponding approximations are low dimensional, and can thus be visualized easily and sampled without issues using exact MCMC methods. On the other hand, the model is complex enough to display most of the issues one would encounter when studying a BNN posterior.

The problem is to fit a one-dimensional function of two parameters of the form z = f(x, y).

The dataset is made of eight (x, y, z) samples grouped into the vectors x, y, z with:

x = \left( \sqrt{2},\ \tfrac{\sqrt{2}}{2},\ 0,\ -\tfrac{\sqrt{2}}{2},\ -\sqrt{2},\ -\tfrac{\sqrt{2}}{2},\ 0,\ \tfrac{\sqrt{2}}{2} \right)^\top,
y = \left( 0,\ \tfrac{\sqrt{2}}{2},\ \sqrt{2},\ \tfrac{\sqrt{2}}{2},\ 0,\ -\tfrac{\sqrt{2}}{2},\ -\sqrt{2},\ -\tfrac{\sqrt{2}}{2} \right)^\top,
z = \left( 1,\ \tfrac{1}{2},\ 0,\ \tfrac{1}{2},\ 1,\ \tfrac{1}{2},\ 0,\ \tfrac{1}{2} \right)^\top. (13)
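For reference, the samples of Equation 13 can be transcribed directly in NumPy; this is a plain transcription of the vectors above, not the loading code of the repository.

```python
import numpy as np

s = np.sqrt(2.0)
# Eight (x, y, z) samples of Eq. (13); the inputs lie on a circle of radius sqrt(2)
x = np.array([s,  s / 2, 0.0, -s / 2, -s, -s / 2, 0.0,  s / 2])
y = np.array([0.0,  s / 2, s,  s / 2, 0.0, -s / 2, -s, -s / 2])
z = np.array([1.0, 0.5, 0.0, 0.5, 1.0, 0.5, 0.0, 0.5])
```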


(a) Raw data (b) First possible fold (c) Second possible fold

Fig. 8: The dataset and the two possible folding behaviors of the model for the paperfold case study.

The stochastic model (Fig. 9a) is the usual fully supervised Bayesian regression. We assume each measure in z has a small uncertainty ε_i ∼ N(0, σ_z^2), with σ_z being a small positive standard deviation that one can set depending on the experiment. By default, we assume that σ_z = 0.1.

The functional model (Fig. 9b) is a two-branch, two-layer feedforward neural network. The first branch of the first layer has no non-linearity afterwards, while the second branch is followed by a ReLU non-linearity. The second layer just takes the weighted sum of both branches and thus has no bias. This model reduces to a function of the form:

z = \gamma_1 (\alpha_1 x + \alpha_2 y + \alpha_3) + \gamma_2\, \mathrm{ReLU}(\beta_1 x + \beta_2 y + \beta_3). (14)

Here, α represents the parameters of the first branch of the first layer, β the parameters of the second branch of the first layer, and γ the parameters of the second layer. This simple model represents a planar function with a single fold along the line of equation β_1 x + β_2 y + β_3 = 0.
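A direct transcription of Equation 14 as a PyTorch module could look as follows; the class name and the random initialisation are our own choices, not the repository's.

```python
import torch
import torch.nn as nn

class Paperfold(nn.Module):
    """Two-branch, two-layer model of Eq. (14)."""

    def __init__(self):
        super().__init__()
        self.alpha = nn.Parameter(torch.randn(3))  # first branch: alpha1, alpha2, alpha3
        self.beta = nn.Parameter(torch.randn(3))   # second branch: beta1, beta2, beta3
        self.gamma = nn.Parameter(torch.randn(2))  # second layer: gamma1, gamma2

    def forward(self, x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
        linear = self.alpha[0] * x + self.alpha[1] * y + self.alpha[2]
        folded = torch.relu(self.beta[0] * x + self.beta[1] * y + self.beta[2])
        return self.gamma[0] * linear + self.gamma[1] * folded
```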

This model always has two, and only two, non-equivalent ways of fitting the proposed dataset (see Fig. 8). The model also exhibits some weight-space symmetry (i.e., the half-planes fitted by the first and second branches of the first layer can be swapped, resulting in a different parametrization but an equivalent fit of the data) and scaling symmetry (between α and β on one side and γ on the other).

To sample the posterior, we first have to choose a prior for the parameters α, β and γ of the BNN. The following procedure works for any choice of prior p(α, β), p(γ) where the probability distribution of the parameters of the first layer is independent of the probability distribution of the parameters of the second layer. For simplicity, and if not specified otherwise, we assume a normal prior with constant diagonal covariance:

\begin{pmatrix} \alpha \\ \beta \\ \gamma \end{pmatrix} \sim \mathcal{N}(0, \sigma^2 I), (15)

[Figure: (a) graphical model relating x, y, the data D and z; (b) network diagram with parameters α1, α2, α3, β1, β2, β3, γ1, γ2 and the ReLU branch.]
Fig. 9: Stochastic (a) and functional (b) model for the paperfold case study.

with σ a positive standard deviation. By default, we assume we do not know much about the model and set σ = 5 for this case study.
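Since the experiments rely on Pyro for MCMC, the full stochastic model (the prior of Equation 15 with σ = 5 and Gaussian noise with σ_z = 0.1) can be sketched as a Pyro model. The function name and the exact parametrization below are illustrative, not the repository's own code.

```python
import torch
import pyro
import pyro.distributions as dist

def paperfold_model(x, y, z=None, sigma_prior=5.0, sigma_z=0.1):
    """Eq. (14) with the prior of Eq. (15) and Gaussian noise on z."""
    alpha = pyro.sample("alpha", dist.Normal(torch.zeros(3), sigma_prior).to_event(1))
    beta = pyro.sample("beta", dist.Normal(torch.zeros(3), sigma_prior).to_event(1))
    gamma = pyro.sample("gamma", dist.Normal(torch.zeros(2), sigma_prior).to_event(1))
    mean = (gamma[0] * (alpha[0] * x + alpha[1] * y + alpha[2])
            + gamma[1] * torch.relu(beta[0] * x + beta[1] * y + beta[2]))
    with pyro.plate("data", x.shape[0]):
        pyro.sample("obs", dist.Normal(mean, sigma_z), obs=z)
    return mean
```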

A. Comparison of training methods

We implemented four different samplers for the paperfold example (Fig. 10):
• The NUTS sampler, whose implementation is provided by the Pyro Python library. This is a state-of-the-art general purpose MCMC sampler, which we used mainly as a way of generating samples from the true posterior, as a basis of comparison with the other methods, since this MCMC sampler is much slower than the other methods we consider (Fig. 11).

• A variational inference model based on a Gaussian approximation of the posterior.

• An ensemble-based approach, and
• An ensemble of variational inference based models.

1) MCMC: The MCMC sampler leads to results which should match samples from the exact posterior very well. In Fig. 10, one can get an appreciation of the complexity of the exact posterior, even for such a small and simple model. Each row and column represent one variable, with the graph at each intersection showing the samples projected onto the 2-dimensional plane spanned by those two variables. The samples form perpendicular and diagonal structures, which probably correspond to the two non-equivalent ways of fitting the data. The symmetric lines are caused by the weight-space symmetry. The hyperbolas correspond to equivalent parametrizations of the network under scaling symmetry. Finally, the few outliers that are visible correspond to a series of small modes in the posterior around the parametrizations fitting the constant function f(x, y) = 0.5.

When looking at the actual aggregate results (Fig. 12a), one can see that the average prediction of the BNN fits the data with a regular function similar to a second-degree polynomial. The uncertainty, as measured by the standard deviation a posteriori, is low along the lines of the square where the data points lie and increases linearly with distance, creating a diamond shape. Sampling the predicted value for z along a series of lines in the (x, y) plane (Fig. 13a) shows that both folds appear as clearly distinct from one another, but a bit of uncertainty remains around the exact position of these folds.
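Assuming the paperfold_model sketch above and the dataset of Equation 13 stored as tensors, running NUTS with Pyro could look as follows; the numbers of warm-up steps and samples are illustrative, not those used for the figures.

```python
import torch
from pyro.infer import MCMC, NUTS

# x, y, z as defined from Eq. (13); paperfold_model as sketched above
x_t, y_t, z_t = (torch.as_tensor(a, dtype=torch.float32) for a in (x, y, z))

kernel = NUTS(paperfold_model)
mcmc = MCMC(kernel, num_samples=2000, warmup_steps=500)
mcmc.run(x_t, y_t, z_t)
posterior_samples = mcmc.get_samples()   # dict with keys "alpha", "beta", "gamma"
```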

2) Variational inference: As a variational inference approximation to the posterior, we used a normal distribution with a learnable mean and a diagonal covariance matrix.


Fig. 10: Pairplot of the samples from the posterior using four different approaches.

[Figure: bar charts over MCMC, VI, Ensembles and VI+Ensembles; (a) training time in seconds (log scale), (b) time per sample in seconds (log scale).]
Fig. 11: Comparison between training time (a) and inference time (b) for the paperfold example with standard algorithms for the different training approaches.

It is pretty clear that for this specific example (which, it should be noted, has been designed to generate such a specific problem), this method leads to a very poor approximation of the actual posterior, since it can only fit a unimodal distribution (Fig. 10). In terms of the marginal, only one of the two possible folds (Fig. 13b) has been fitted. This translates into a mean estimate for z, as well as an uncertainty level and distribution, well off the actual posterior (Fig. 12b). The side of the fold the ReLU branch is fitting is also visible, as its uncertainty is higher than on the other side of the fold. This highlights the major limitation of simple variational inference methods for BNNs: since the underlying models are complex and the data can be fitted in many non-equivalent manners, simple variational posteriors lack the expressive power to provide a good estimate of the actual BNN uncertainty. It is still important to keep in mind that this example has been designed specifically to make those problems apparent and that, in practice, mean-field Gaussian and similar variational inference approximations can still lead to good estimations of the actual uncertainty (an example is the sparse measure example in Appendix II). Variational inference is also orders of magnitude faster than MCMC (Fig. 11).
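Such a mean-field Gaussian approximation can be obtained, for instance, with Pyro's AutoDiagonalNormal autoguide on the paperfold_model sketch above; the learning rate and the number of steps below are arbitrary choices, not the settings used in the experiments.

```python
import pyro
from pyro.infer import SVI, Trace_ELBO
from pyro.infer.autoguide import AutoDiagonalNormal
from pyro.optim import Adam

pyro.clear_param_store()
guide = AutoDiagonalNormal(paperfold_model)        # learnable mean + diagonal covariance
svi = SVI(paperfold_model, guide, Adam({"lr": 1e-2}), loss=Trace_ELBO())

for step in range(5000):
    svi.step(x_t, y_t, z_t)                        # one gradient step on the ELBO
```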

3) Ensembling: The most straightforward way to perform ensemble learning with an artificial neural network architecture is just to restart the learning procedure multiple times and each time add the final model to the ensemble. Using this strategy to learn the posterior for the paperfold example already gives quite good results, even better than naive variational inference, as the samples can belong to different modes of the posterior (Fig. 10).


[Figure: four pairs of heatmaps over the (x, y) plane, showing the mean and the standard deviation of z for each inference approach.]
Fig. 12: Mean and standard deviation of the marginal distribution of z knowing D, x and y for the paperfold model using MCMC (a), variational inference (b), ensembling (c) and variational inference + ensembling (d).

Fig. 13: Predicted z values across four different lines for the MCMC (a), variational inference (b), ensembling (c) and variational inference + ensembling (d) versions of the paperfold BNN.

The marginal mean and standard deviation a posteriori are clearly different from the ones obtained via MCMC, but the diamond shape can be recognized in both predictions (Fig. 12c). The training is slightly longer than with basic variational inference, as it has to be run multiple times. However, the time per sample when computing the marginal is extremely low (Fig. 11). The main drawback of this approach is that it provides no estimation of the local uncertainty around the different modes of the posterior, as shown by Fig. 13c.
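A minimal sketch of this strategy, reusing the Paperfold module and the tensors from the earlier sketches, trains each member to a maximum a posteriori estimate from a fresh random initialisation; the optimiser settings and the ensemble size are illustrative.

```python
import torch

def train_map(model, x, y, z, sigma_z=0.1, sigma_prior=5.0, steps=2000):
    """Maximise the (unnormalised) log-posterior of one Paperfold network."""
    opt = torch.optim.Adam(model.parameters(), lr=1e-2)
    for _ in range(steps):
        opt.zero_grad()
        nll = ((model(x, y) - z) ** 2).sum() / (2 * sigma_z ** 2)       # Gaussian likelihood
        reg = sum((p ** 2).sum() for p in model.parameters()) / (2 * sigma_prior ** 2)  # prior
        (nll + reg).backward()
        opt.step()
    return model

# Restart the training from scratch for each member and keep every final model
ensemble = [train_map(Paperfold(), x_t, y_t, z_t) for _ in range(20)]
with torch.no_grad():
    predictions = torch.stack([m(x_t, y_t) for m in ensemble])          # shape (20, N)
mean, std = predictions.mean(dim=0), predictions.std(dim=0)
```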

4) Variational Inference + Ensembling: Combining the benefits of simple variational inference and ensembling can be done in a straightforward manner by learning an ensemble of variational approximations. The resulting distribution is a Gaussian mixture, which can approximate the shape of multiple modes of the posterior. While for this example it still has a hard time matching the complex shapes in the exact posterior (Fig. 10), the resulting marginal for z is a very good match to the one generated via an MCMC sampler (Figs. 12 and 13). In addition, the time per sample when computing the marginal is still very similar to the single-Gaussian variational approximation (Fig. 11). This shows the huge benefit of ensembling as a tool for approximate Bayesian methods in deep learning.
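Reusing the sketches above, an ensemble of variational approximations can be obtained by repeating the SVI fit from independent initialisations and pooling the predictive samples of each member, which amounts to sampling from a mixture of Gaussians. The counts below are illustrative and the structure of the loop is our own, not the repository's.

```python
import torch
import pyro
from pyro.infer import SVI, Trace_ELBO, Predictive
from pyro.infer.autoguide import AutoDiagonalNormal
from pyro.optim import Adam

all_samples = []
for _ in range(10):                              # one mean-field Gaussian per member
    pyro.clear_param_store()                     # start each member from scratch
    guide = AutoDiagonalNormal(paperfold_model)
    svi = SVI(paperfold_model, guide, Adam({"lr": 1e-2}), loss=Trace_ELBO())
    for _ in range(5000):
        svi.step(x_t, y_t, z_t)
    # Draw predictive samples from this member before moving to the next one
    predictive = Predictive(paperfold_model, guide=guide, num_samples=200)
    all_samples.append(predictive(x_t, y_t)["obs"])

# Pooling the members' samples approximates a Gaussian-mixture posterior predictive
samples = torch.cat(all_samples, dim=0)
```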

REFERENCES

[1] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.

[2] D. Ciregan, U. Meier, and J. Schmidhuber, "Multi-column deep neural networks for image classification," in 2012 IEEE Conference on Computer Vision and Pattern Recognition, 2012, pp. 3642–3649.

[3] M. Belkin, P. Niyogi, and V. Sindhwani, "Manifold regularization: A geometric framework for learning from labeled and unlabeled examples," J. Mach. Learn. Res., vol. 7, pp. 2399–2434, Dec. 2006.