Sampling-free Uncertainty Estimation in Gated Recurrent Units with Applications to Normative Modeling in Neuroimaging

Seong Jae Hwang*1 [email protected]
Ronak R. Mehta*1 [email protected]
Hyunwoo J. Kim*5 [email protected]
Sterling C. Johnson3,4,6 [email protected]
Vikas Singh2,1 [email protected]

Abstract

There has recently been a concerted effort to derive mechanisms in vision and machine learning systems to offer uncertainty estimates of the predictions they make. Clearly, there are benefits to a system that is not only accurate but also has a sense for when it is not. Existing proposals center around Bayesian interpretations of modern deep architectures; these are effective but can often be computationally demanding. We show how classical ideas in the literature on exponential families on probabilistic networks provide an excellent starting point to derive uncertainty estimates in Gated Recurrent Units (GRU). Our proposal directly quantifies uncertainty deterministically, without the need for costly sampling-based estimation. While uncertainty is quite useful by itself in computer vision and machine learning, we also demonstrate that it can play a key role in enabling statistical analysis with deep networks in neuroimaging studies with normative modeling methods. To our knowledge, this is the first result describing sampling-free uncertainty estimation for powerful sequential models such as GRUs.

1 INTRODUCTION

Recurrent Neural Networks (RNNs) have achieved state-of-the-art performance in various sequence prediction tasks such as machine translation (Wu et al., 2016; Jozefowicz et al., 2016), speech recognition (Hinton et al., 2015; Amodei et al., 2016), language models (Cho et al., 2014) as well as medical applications (Jagannatha and Yu, 2016; Esteban et al., 2016). For sequences with long term dependencies, popular variants of RNN such as Long-Short Term Memory (Gers et al., 1999) and Gated Recurrent Unit (Chung et al., 2014) have shown remarkable effectiveness in dealing with the vanishing gradients problem and have been successfully deployed in a number of applications.

*Corresponding Authors; 1 Dept. of Computer Sciences, Univ. of Wisconsin-Madison; 2 Dept. of Biostatistics and Medical Informatics, Univ. of Wisconsin-Madison; 3 Dept. of Medicine, Univ. of Wisconsin-Madison; 4 William S. Middleton VA Hospital, Madison; 5 Dept. of Computer Science and Engineering, Korea University; 6 Wisconsin Alzheimer's Disease Research Center, Madison; Appendix is available at http://pages.cs.wisc.edu/~sjh/

Point estimates, confidence and consequences. Despite the impressive predictive power of RNN models, their predictions rely on "point estimates" of the parameters. The confidence score can often be overestimated due to overfitting (Fortunato et al., 2017), especially on datasets with insufficient sample sizes. More importantly, in practice, without acknowledging the level of uncertainty about the prediction, the model cannot be entirely trusted in mission critical applications. Unexpected performance variations with no sensible way of anticipating this possibility may also be a limitation in terms of regulatory compliance. When a decision made by a model could result in dangerous outcomes in real-life tasks, such as an autonomous vehicle not detecting a pedestrian, a missed disease prediction due to artifacts in a medical image, or radiation therapy misplanning (Lambert et al., 2011), knowing how 'certain' the model is about its decision offers a chance to look for alternatives, such as alerting the driver to take over or recommending a different diagnostic test, and thereby prevent undesirable outcomes from erroneous decisions.

Uncertainty. When operating with predictions involving data and some model, there are mainly two sources of unpredictability. First, there may be uncertainty that arises from an imperfect dataset or observations — aleatoric uncertainty. Second, the lack of certainty resulting from the model itself (i.e., model parameters) is called epistemic uncertainty (Der Kiureghian and Ditlevsen, 2009).


Figure 1: Image sequence prediction with uncertainty. Given the first 10 frames of an input sequence (left), our model SP-GRU makes the Output Prediction and the pixel-level Model Uncertainty Map, where bright regions indicate high uncertainty. SP-GRU estimates the uncertainty deterministically without sampling model parameters.

Aleatoric uncertainty comes externally from the observations, such as noise and other factors that cannot typically be inferred systematically. Algorithms instead attempt to calculate the epistemic uncertainty resulting from the model itself. This is often also referred to as model uncertainty (Kendall and Gal, 2017).

Related work on uncertainty in neural networks. The importance of estimating the uncertainty aspect of neural networks (NN) has been acknowledged in the literature. Several early ideas investigated a suite of schemes related to Bayesian neural networks (BNN): Monte Carlo (MC) sampling (MacKay, 1992a), variational inference (Hinton and Van Camp, 1993) and Laplace approximation (MacKay, 1992b). More recent works have focused on efficiently approximating posterior distributions to infer predictive uncertainty. For instance, scalable forms of variational inference approaches (Graves, 2011) suggest estimating the evidence lower bound (ELBO) via Monte Carlo estimation to efficiently approximate the marginal likelihood of the weights. Similarly, several proposals have extended the variational Bayes approach to perform probabilistic backpropagation with assumed density filtering (Hernandez-Lobato and Adams, 2015), explicitly update the weights of the NN in terms of the distribution parameters (i.e., expectation) (Blundell et al., 2015), or apply stochastic gradient Langevin dynamics (Welling and Teh, 2011) at large scales. These methods, however, theoretically rely on the correctness of the prior distribution, which has been shown to be crucial for reasonable predictive uncertainties (Rasmussen and Quinonero-Candela, 2005), and on the strength or validity of the assumptions (e.g., mean-field independence) made for computational benefits. An interesting and different perspective on BNN uncertainty based on Monte Carlo dropout was proposed by Gal et al. (Gal and Ghahramani, 2016), wherein the authors approximate the predictive uncertainty by using dropout (Srivastava et al., 2014) at prediction time. This approach can be interpreted as an ensemble method where the predictions based on "multiple networks" with different dropout structures (Lakshminarayanan et al., 2017) yield estimates for uncertainty.

However, because a fixed dropout rate is used independently of the data, uncertainty estimation on the network parameters (i.e., weights) is inherently compromised: the fixed dropout rates are already imposed on the weights by the algorithm itself. In summary, while the literature is still in a nascent stage, a number of researchers are studying ways in which uncertainty estimates can be derived for deep architectures, similar to those from traditional statistical analysis, for various applications (Ribeiro et al., 2018; Sedlmeier et al., 2019).

Other gaps in our knowledge. While the above methods focus on predictive uncertainty, most strategies do not explicitly attempt to estimate the uncertainty of all intermediate representations of the network such as neurons, weights, biases and so on. Such information is understandably less attractive in traditional applications, where our interest mainly lies in the prediction made by the final output layer. However, RNN-type sequential NNs often utilize not only the last layer of neurons but also directly operate on the intermediate neurons in making a sequence of predictions (Mikolov et al., 2010). Several Bayesian RNNs have been proposed (Lakshminarayanan et al., 2017; Fortunato et al., 2017) but are based on the BNN models described above. Their deployment is not always feasible under practical time constraints for real-life tasks, especially with high dimensional inputs. Also, stochastic RNN models interleaving stochastic and deterministic layers (Fraccaro et al., 2016) and stochastic state models for reinforcement learning (Gregor and Besse, 2018) have been proposed, but they do not explicitly estimate the uncertainty of intermediate representations. Further, empirically more powerful variants of RNNs such as LSTMs or GRUs have not been explicitly studied in the literature in the context of uncertainty.

Contributions. Here, our goal is to enable uncertainty estimation on more powerful sequential neural networks, namely gated recurrent units (GRU), while addressing the issues discussed above in BNNs. To our knowledge, few (if any) other works offer this capability.


We propose a probabilistic GRU, where all network parameters follow exponential family distributions. We call this framework the SP-GRU; it operates directly on these parameters and is inspired in part by an interesting result for non-sequential data (Wang et al., 2016). Our SP-GRU directly offers the following properties: (i) the operations within each cell of the GRU proceed only with respect to the natural parameters, deterministically; thus, the overall procedure is completely sampling-free, which is especially appealing for sequential datasets with small sample sizes; (ii) because the weights, biases, and all intermediate neurons of SP-GRU can be expressed in terms of a distribution, their uncertainty estimates can be directly inferred from the network itself; (iii) we focus on some well-known exponential family distributions (e.g., Gaussian, Gamma) with nice characteristics that can be appropriately chosen, with minimal modifications to the operations, depending on the application of interest; (iv) we show how SP-GRU can be used on neuroimaging data for detecting early disease progression in an asymptomatic Alzheimer's disease cohort.

2 RECURRENT NEURAL NETWORKS AND EXPONENTIAL FAMILIES

Recurrent Neural Networks. The Gated Recurrent Unit (GRU) and the Long-Short Term Memory (LSTM) are popular variants of RNN where the network parameters are shared across time steps. While both deal with exploding/vanishing gradient issues using cell structures of similar forms, the GRU does not represent the cell state and hidden state separately. Specifically, its updates take the following form (the order of operations is (1) Reset Gate and Update Gate, (2) State Candidate, and (3) Cell State):

Reset Gate: r_t = σ(U_r x_t + W_r h_{t−1} + b_r)

Update Gate: z_t = σ(U_z x_t + W_z h_{t−1} + b_z)

State Candidate: h̃_t = tanh(U_h x_t + W_h (r_t ⊙ h_{t−1}) + b_h)

Cell State: h_t = (1 − z_t) ⊙ h̃_t + z_t ⊙ h_{t−1}

where U_{r,z,h}, W_{r,z,h} and b_{r,z,h} are the weights and biases for their corresponding updates, and ⊙ denotes the Hadamard (element-wise) product. Typical implementations of both GRUs and LSTMs include an output layer outside of the cell to produce the desired outputs. However, they do not naturally admit more than point estimates of hidden states and outputs.
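For concreteness, the following is a minimal NumPy sketch of a single point-estimate GRU step implementing the updates above; the parameter names (U_r, W_r, b_r, etc.) and the dictionary layout are illustrative rather than taken from any particular implementation.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, p):
    """One point-estimate GRU step; p holds the weight matrices and bias vectors."""
    r = sigmoid(p["U_r"] @ x_t + p["W_r"] @ h_prev + p["b_r"])             # reset gate
    z = sigmoid(p["U_z"] @ x_t + p["W_z"] @ h_prev + p["b_z"])             # update gate
    h_cand = np.tanh(p["U_h"] @ x_t + p["W_h"] @ (r * h_prev) + p["b_h"])  # state candidate
    return (1.0 - z) * h_cand + z * h_prev                                  # new cell state

The probabilistic construction in Section 3 replaces each of these point operations with moment-matched counterparts over distribution parameters.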

Exponential Families in Networks. In statistics, the properties of distributions within exponential families have been very well studied.

Definition 1. Let x ∈ X be a random variable with probability density/mass function (pdf/pmf) f_X. Then f_X is an exponential family distribution if

f_X(x | η) = h(x) exp(η^T T(x) − A(η))   (1)

with natural parameters η, base measure h(x), and sufficient statistics T(x). The constant A(η) (log-partition function) ensures that the distribution normalizes to 1.


Figure 2: A single exponential family neuron. W^l is learned, and the output is a sample generated from the exponential family defined by g_l(W^l a^l), i.e., a^{l+1} ~ ExpFam(g_l(W^l a^l)).

Common distributions (e.g., Gaussian, Bernoulli, Gamma) can be written in this unified 'natural form' with specific definitions of h(x), T(x) and A(η) (e.g., the Gaussian distribution with η = (α, β), T(x) = (x, x²) and h(x) = 1/√(2π)).

Two key properties of this family of distributions have led to their widespread use: (1) their ability to summarize arbitrary amounts of data x ~ f_X through only their sufficient statistics T(x), and (2) their ability to be efficiently estimated, either directly through a closed form maximum likelihood estimator or via a convex function with convex constraints.
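As a small worked example of the natural form, the sketch below converts between a Gaussian's (mean, variance) and its natural parameters and evaluates the log-density in the form of Eq. (1). The function names are illustrative and are only meant to make the notation concrete.

import numpy as np

def gaussian_to_natural(mu, var):
    """Map a Gaussian's (mean, variance) to natural parameters (eta1, eta2)."""
    return mu / var, -0.5 / var

def natural_to_gaussian(eta1, eta2):
    """Inverse map back to (mean, variance); this is the g(.,.) used in the text."""
    var = -0.5 / eta2
    return eta1 * var, var

def gaussian_logpdf_natural(x, eta1, eta2):
    """log f(x | eta) = log h(x) + eta^T T(x) - A(eta), with T(x) = (x, x^2)."""
    log_h = -0.5 * np.log(2.0 * np.pi)
    log_partition = -eta1 ** 2 / (4.0 * eta2) - 0.5 * np.log(-2.0 * eta2)
    return log_h + eta1 * x + eta2 * x ** 2 - log_partition

# sanity check against the usual (mean, variance) parameterization
mu, var = 1.5, 0.4
eta1, eta2 = gaussian_to_natural(mu, var)
assert np.allclose(natural_to_gaussian(eta1, eta2), (mu, var))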

Deep Exponential Families (DEFs) (Ranganath et al., 2015) explicitly model the output of any given layer as a random variable, sampled from an exponential family whose natural parameters are given by the linear product of the previous layer's output and a learnable weight matrix (see Fig. 2). While this formulation leads directly to distributions over hidden states and model outputs, it does not learn distributions over the model parameters. Computational feasibility is also neglected: the variational inference procedure used for learning these DEFs requires Monte Carlo sampling at each hidden state many times for every input sample (the cost of running just the text experiments was $40K, as stated by the authors). This also becomes a concern in many biomedical applications (e.g., medical imaging) where the model size grows proportionally to the dimensionality of the data, which often ranges from thousands to millions.

3 SAMPLING-FREE PROBABILISTIC NETWORKS

We now describe a probabilistic network fully operating on a set of natural parameters of exponential family distributions in a sampling-free manner. Inspired by a result from a few years back (Wang et al., 2016), the learning process, similar to traditional NNs, is deterministic yet still captures the probabilistic aspect of the output and the network itself, purely as a byproduct of typical NN procedures (i.e., backpropagation).



Figure 3: Linear Moment Matching (LMM) and Nonlinear Moment Matching (NMM) are performed at the weight/bias sums and activations, respectively.

Unlike the probabilistic networks mentioned before, our GRU performs forward propagation as a series of deterministic linear and nonlinear transformations on the distributions of weights and biases. Throughout the entire process, all operations only involve distribution parameters while maintaining the desired distributions after every transformation. We focus on exponential family distributions with at most two natural parameters: Gaussian, Gamma and Poisson.

3.1 LINEAR TRANSFORMATIONS

We describe the linear transformation on the input vector x with a matrix W of weights and a vector b of biases in terms of their natural parameters. We first apply the mean-field assumption on each of the weights and biases based on their individual distribution parameters α and β as

p(W | W_α, W_β) = ∏_{i,j} p(W(i,j) | W_α(i,j), W_β(i,j))

where {W_α, W_β} and {b_α, b_β} are the model parameters. Thus, analogous to the linear transformation o = Wa + b in ordinary neural networks on the previous layer output (or an input) a with W and b, our network operates purely on (α, β) to compute {o_α, o_β}.

After each linear transformation, it is necessary to preserve the 'distribution property' of the outputs (i.e., o_α and o_β still define the same distribution) throughout the forward propagation so that the intermediate nodes and the network itself can be naturally interpreted in terms of their distributions. Thus, we cannot simply mimic the typical linear transformation on a_β and compute o_β = W_β a_β + b_β if we want o_β to preserve the distribution (Wang et al., 2016).

We perform a second order moment matching on the mean and variance of the distributions. The mean m and variance s can easily be computed with an appropriate function g(·,·) which maps g : (α, β) → (m, s) for each exponential family distribution of interest (e.g., g(α, β) = (−(α+1)/β, (α+1)/β²) for a Gamma distribution). Thus, we compute the (m, s) counterparts of all the (α, β)-based components (i.e., (o_m, o_s) = g(o_α, o_β)).

Using the linear output before the activation function, we can now apply Linear Moment Matching (LMM) on (1) the mean a_m, following the standard linearity of expectations, and (2) the variance a_s, as follows:

o_m = W_m a_m + b_m
o_s = W_s a_s + b_s + (W_m ⊙ W_m) a_s + W_s (a_m ⊙ a_m)

where ⊙ is the Hadamard product. Then, we invert back to (o_α, o_β) = g⁻¹(o_m, o_s). For exponential family distributions involving at most two natural parameters, matching the first two moments is sufficient.
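A minimal sketch of LMM, assuming the mean-field factorization above, could look as follows; shapes, names, and the toy values are illustrative.

import numpy as np

def linear_moment_matching(a_m, a_s, W_m, W_s, b_m, b_s):
    """Propagate mean/variance through o = W a + b under the mean-field assumption."""
    o_m = W_m @ a_m + b_m
    o_s = W_s @ a_s + b_s + (W_m * W_m) @ a_s + W_s @ (a_m * a_m)
    return o_m, o_s

# toy usage: a 3-dim input mapped to a 2-dim pre-activation
rng = np.random.default_rng(0)
a_m, a_s = rng.normal(size=3), rng.uniform(0.01, 0.1, size=3)
W_m, W_s = rng.normal(size=(2, 3)), rng.uniform(0.01, 0.1, size=(2, 3))
b_m, b_s = rng.normal(size=2), rng.uniform(0.01, 0.1, size=2)
o_m, o_s = linear_moment_matching(a_m, a_s, W_m, W_s, b_m, b_s)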

3.2 NONLINEAR TRANSFORMATIONS

The next key step in NNs is the element-wise nonlinear transformation, where we want to apply a nonlinear function f(·) to the linear transformation output o parametrized by η = (o_α, o_β). This is equivalent to a general random variable transformation: given the probability density function (pdf) p_O of O, the pdf of A = f(O) is p_A(a) = p_O(f⁻¹(a)) |d f⁻¹(a)/da|.

However, well-known nonlinear functions f(·) such as sigmoids and hyperbolic tangents cannot directly be utilized on (o_α, o_β) because the resulting a = f(o) may not be from the same exponential family distribution. Thus, we perform another second order moment matching in terms of the mean o_m and variance o_s via Nonlinear Moment Matching (NMM). Ideally, we need to marginalize over the distribution of o given (o_α, o_β) to compute

a_m = ∫ f(o) p_O(o | o_α, o_β) do

and the corresponding variance

a_s = ∫ f(o)² p_O(o | o_α, o_β) do − a_m²

which we map back to (a_α, a_β) with an appropriate bijective mapping function g(·,·). However, as the dimension of o grows, the computational burden of these integrals becomes far more demanding. The closed form approximations described below can efficiently compute the mean and variance of the activation outputs a_m and a_s (Wang et al., 2016). We show these approximations for sigmoids σ(x) and hyperbolic tangents tanh(x) for a Gaussian distribution, as these become the critical components of our probabilistic GRU. Here, we use the fact that σ(x) ≈ Φ(ζx), where Φ(·) is the probit function and ζ = √(π/8) is a constant. Then, we can approximate the sigmoid moments a_m and a_s as

a_m ≈ σ_m(o_m, o_s) = σ( o_m / (1 + ζ² o_s)^{1/2} )
a_s ≈ σ_s(o_m, o_s) = σ( ν(o_m + ω) / (1 + ζ² ν² o_s)^{1/2} ) − a_m²

where ν = 4 − 2√2 and ω = −log(√2 + 1). The hyperbolic tangent moments can be derived from tanh(x) = 2σ(2x) − 1.


Table 1: SP-GRU operations in mean and variance. ⊙ and [A]² denote the Hadamard product and A ⊙ A of a matrix/vector A, respectively. Note the Cell State does not involve nonlinear operations. See Fig. 4 for an illustration of the cell structure.

Reset Gate
  Linear: o^t_{r,m} = U_{r,m} x^t_m + W_{r,m} h^{t−1}_m + b_{r,m}
          o^t_{r,s} = U_{r,s} x^t_s + W_{r,s} h^{t−1}_s + b_{r,s} + [U_{r,m}]² x^t_s + U_{r,s} [x^t_m]² + [W_{r,m}]² h^{t−1}_s + W_{r,s} [h^{t−1}_m]²
  Nonlinear: r^t_m = σ_m(o^t_{r,m}, o^t_{r,s}),  r^t_s = σ_s(o^t_{r,m}, o^t_{r,s})

Update Gate
  Linear: o^t_{z,m} = U_{z,m} x^t_m + W_{z,m} h^{t−1}_m + b_{z,m}
          o^t_{z,s} = U_{z,s} x^t_s + W_{z,s} h^{t−1}_s + b_{z,s} + [U_{z,m}]² x^t_s + U_{z,s} [x^t_m]² + [W_{z,m}]² h^{t−1}_s + W_{z,s} [h^{t−1}_m]²
  Nonlinear: z^t_m = σ_m(o^t_{z,m}, o^t_{z,s}),  z^t_s = σ_s(o^t_{z,m}, o^t_{z,s})

State Candidate
  Linear: o^t_{h̃,m} = U_{h̃,m} x^t_m + W_{h̃,m} h^{t−1}_m + b_{h̃,m}
          o^t_{h̃,s} = U_{h̃,s} x^t_s + W_{h̃,s} h^{t−1}_s + b_{h̃,s} + [U_{h̃,m}]² x^t_s + U_{h̃,s} [x^t_m]² + [W_{h̃,m}]² h^{t−1}_s + W_{h̃,s} [h^{t−1}_m]²
  Nonlinear: h̃^t_m = tanh_m(o^t_{h̃,m}, o^t_{h̃,s}),  h̃^t_s = tanh_s(o^t_{h̃,m}, o^t_{h̃,s})

Cell State (no nonlinear step needed)
  h^t_m = (1 − z^t_m) ⊙ h̃^t_m + z^t_m ⊙ h^{t−1}_m
  h^t_s = [1 − z^t_s]² ⊙ h̃^t_s + [z^t_s]² ⊙ h^{t−1}_s


Figure 4: SP-GRU cell structure. Solid lines/boxes and red dotted lines/boxes correspond to operations and variables for the mean m and variance s, respectively. Circles are element-wise operators.

Note that other common exponential family distributions do not have obvious ways to make such straightforward approximations. Thus, we use an 'activation-like' mapping f(x) = a − b exp(−γ d(x)), where d(x) is an arbitrary activation of choice and a, b, γ > 0 are appropriate constants. Nonlinear transformations of Gamma and Poisson distributions can then be formulated in closed form as well (e.g., a = b = γ = 1 is a good choice).

3.3 SAMPLING-FREE PROBABILISTIC GRU

Based on the probabilistic formulations described above, we present the Sampling-free Probabilistic GRU (SP-GRU). The internal architecture is shown in Fig. 4. Here, we focus on adapting the GRU to the sampling-free probabilistic formulation. We express all the variables related to the GRU in Table 1 in terms of their parameters η = (α, β). For instance, W_r is now expressed only in terms of its parameters W_{r,α} and W_{r,β} (i.e., two weight matrices). We assume that all of the variables are factorized. Because the GRU consists of a series of operations with linear and nonlinear transformations, we can update each gate by the transformations defined in Table 1.

Assuming that the desired exponential family distribution provides an invertible parameter mapping function g(·,·), we first transform all of the natural parameter variables to means and variances. Then, given an input sequence x = {x¹_m, x¹_s}, ..., {x^T_m, x^T_s}, we perform linear/nonlinear transformations with respect to means and variances for each GRU operation (Fig. 4 and Table 1).

The cell state computation does not involve a nonlinear transformation. For an output layer on the hidden states to compute the desired estimate y, a typical layer can be defined in a similar manner to obtain both y_m and y_s. In the experiments that follow, we add another such layer to compute the mean and variance of the sequence of predictions y = {y¹_m, y¹_s}, {y²_m, y²_s}, ..., {y^T_m, y^T_s}.

Extensibility remarks. We note that despite the simplicity of the SP-GRU cell structure shown in Fig. 4, our exponential family adaptation is not limited to the GRU. For instance, the above formulation can be extended to other variants of RNNs such as LSTMs popularly used in medical applications (Jagannatha and Yu, 2016; Santeramo et al., 2018), flow-based models (Dinh et al., 2016, 2014), and invertible neural networks (Ardizzone et al., 2019).



Figure 5: Predictions and uncertainties (sub-frames {11, 15, 20} out of full predicted frames {11, ..., 20}) from testing varying deviations from trained trajectories (first of four rows, blue). Top: angle (colors match the paths in Fig. 6). Bottom: speed (colors match the paths in Appendix B.2). Right: [sum of pixel-level variances / frames] using SP-GRU and MC-GRU.

4 EXPERIMENTS

We first perform unsupervised learning to predict image sequences from the moving MNIST dataset (Srivastava et al., 2015) for intuitive quantitative/qualitative evaluations. Second, we apply our model to a unique neuroimaging dataset consisting of brain imaging acquisitions from individuals at risk for developing Alzheimer's disease. Models were trained on an NVIDIA GeForce GTX 1080 Ti GPU in TensorFlow with ADAM, an initial learning rate of 0.05, and decay parameters β1 = 0.9, β2 = 0.999. We use the Gaussian distribution for all setups, with the KL divergence between the final output distribution N(o_m, diag(o_s)) and the target mini-batch distribution N(y_m, diag(y_s)) as the error, where y_m and y_s are the ground truth values of the mini-batch samples and their variances (w.r.t. the current mini-batch), respectively. This allows the model to learn both the means and variances.
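For reference, the KL divergence between two diagonal Gaussians has a simple closed form; the sketch below is a plain NumPy version of such an error term. Which argument is the prediction and which the target (i.e., the KL direction) is an assumption of this sketch, since the text only states that a KL-based error between the two distributions is used.

import numpy as np

def kl_diag_gaussians(m0, s0, m1, s1, eps=1e-8):
    """KL( N(m0, diag(s0)) || N(m1, diag(s1)) ) for variance vectors s0, s1."""
    s0, s1 = s0 + eps, s1 + eps  # guard against zero variances
    return 0.5 * np.sum(np.log(s1 / s0) + (s0 + (m0 - m1) ** 2) / s1 - 1.0)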

4.1 UNSUPERVISED SEQUENCE LEARNING OF MOVING MNIST

Goal. For pixel-level tasks, prediction quality can be understood through the uncertainty estimate, i.e., the estimated model variance of each pixel. In these experiments, we ask the following questions qualitatively and quantitatively: (1) Given a visually 'good looking' sequence prediction, how can we tell that its trajectory is correct? (2) If it is, can we derive a degree of uncertainty on its prediction?

Setup. The moving MNIST dataset consists of digits moving (randomly or controlled) in a 64×64 image over 20 frames. We split sequences into two halves (first 10 and second 10 frames). Then, we encode the first 10 frames to learn a hidden representation (size 1024) and predict the second 10 frames.

Figure 6: Controlled angle trajectories over 20 frames.

Controlled Paths. We first train our SP-GRU and a Monte Carlo dropout GRU (MC-GRU) (Gal and Ghahramani, 2016) with the same number of parameters until they have similar test errors (independent of uncertainty) on simple one-digit MNIST sequences moving in a straight line (blue line in Fig. 6). We then construct three sets of 100 'unfamiliar' samples, where each set consists of sequences deviating from the training sequence path (blue path in Fig. 6, with angle θ = 20° and speed v = 5.0% of width per frame) with varying angles (25°, 30°, and 35° paths in Fig. 6) and speeds (5.5%, 6.0%, and 6.5% of width per frame). See Appendix B for details.

Results. For 'unfamiliar' angles and speeds, the predictions in Fig. 5 look visually sensible, but they do not actually follow the ground truth paths (e.g., the prediction for 35° still follows the 20° path). We can quantify this directly by the [sum of pixel-level variances / frames] as shown on the right of Fig. 5.



Figure 7: SP-GRU predictor results. Top 3 rows: 2 moving digits (top: ground truth, middle: mean prediction, bottom: uncertainty estimate). Bottom 3 rows: 3 moving digits, which are out of domain (i.e., not seen in training).

While we cannot evaluate the relative difference here, because 'ground truth uncertainty' is not available for a true absolute comparison, we observe that the uncertainty increases as the angle/speed deviation increases for both SP-GRU and MC-GRU.

Computation Speed. From a practical perspective, uncertainty inference should not sacrifice computational speed, e.g., real-time safety of an autonomous vehicle. With respect to this crucial aspect, SP-GRU greatly benefits from its sampling-free procedure: each epoch (30 sequences) takes ~3 seconds, while MC-GRU with a Monte Carlo sampling rate of 50 requires ~40 seconds (> 10 times SP-GRU) despite their comparable qualitative and quantitative performance. The MC sampling rate for these methods cannot simply be decreased: uncertainty will be underestimated. With SP-GRU, we compute this model uncertainty in closed form, without the need for any heavy lifting from large sample analysis.

Random Paths. To demonstrate that SP-GRU does not sacrifice base predictive power, we evaluate SP-GRU on the same setup as (Srivastava et al., 2015) (2 randomly moving digits).

Results. An example two-digit prediction is illustrated in Fig. 7 (top 3 rows), which shows quantifiable variance outputs as demonstrated in the controlled-paths examples. We note that the mean prediction (middle row of the top 3 rows in Fig. 7) is also accurate, comparing our method to previous work in Table 2. SP-GRU with a basic predictor network setup performs comparably or better than other methods that do not provide model uncertainty.

Table 2: Average cross entropy test loss per image per frame on Moving MNIST.

Model                    Test Loss
Srivastava et al. 2015   341.2
Xingjian et al. 2015     367.1
Brabandere et al. 2016   285.2
Ghosh et al. 2016        241.8
SP-GRU (Ours)            277.1

In these works, model performance often benefits from specific network structures: encoder-predictor composite models (Srivastava et al., 2015), generative adversarial networks (Kulharia et al., 2016), and external weight filters (De Brabandere et al., 2016). Further, more advanced models (Cricri et al., 2016) have achieved better results with larger, more sophisticated pipelines. Extending SP-GRU to such setups is a reasonable modification, providing model uncertainty without sacrificing performance.

We briefly evaluate how well SP-GRU performs on out-of-domain samples (Fig. 7, bottom 3 rows). Models deployed in real-world settings may not realistically be able to determine if a sample is far from their training distributions. However, with our specific modeling of uncertainty, we would expect that images or sequences distant from the training data will exhibit high variance. We construct sequences of 3 moving digits. Here, future reconstruction is generally quite poor. As has been observed in previous work (Srivastava et al., 2015), the model attempts to hallucinate two digits. Our model is aware of this issue: the variance for a large number of pixels is extremely high, even if the digits overlap.

Other Methods. The Deep Markov Model (DMM) (Krishnan et al., 2017) is a recently introduced variant of structured variational autoencoders that naturally gives rise to a probabilistic interpretation of predictions from deep temporal models. However, upon applying this model to Moving MNIST, we were unable to obtain any reasonable prediction across a range of hidden dimension sizes and trajectory complexities, even with significant training time (days vs. hours for SP-GRU). Shown in Fig. 8 are results using a hidden dimension size of 1024 (equal to our setup). We note that the experimental setups described in (Krishnan et al., 2017) are small in dimension and complexity compared to Moving MNIST, and it may be the case that additional technical development with DMMs may lead to promising and comparable uncertainty results.

4.2 NORMATIVE MODELING IN PRECLINICAL NEUROIMAGING DATA

Goal. In a preclinical cohort of individuals at risk for developing Alzheimer's disease (AD), effect sizes are small and statistical signal is often weak among those who will and will not go on to develop AD.



Figure 8: Deep Markov Model results. Compared to our results on a single digit (Fig. 5), the mean and variance cannot be estimated well by the DMM on Moving MNIST.

Even with high-dimensional brain imaging data, it is often the case that specific imaging modalities do not lead to significant group differences. Early detection of risk factors associated with the eventual development of AD is of critical importance in facilitating the prevention of onset, and identifying individuals who subtly deviate from expected decline is a required step in that direction. We aim to identify out-of-domain samples via normative modeling (Marquand et al., 2016): given that we have an SP-GRU model trained on a preclinical cohort, can we predict with confidence those individuals who are at risk?

Data. Imaging data from 139 individuals was derived from two distinct modalities: positron emission tomography (PET) and diffusion-weighted MRI (DWI) (Chua et al., 2008). PET imaging is used to determine mean amyloid-plaque burden (11C Pittsburgh Compound B (PiB) radiotracer), known to be strongly associated with AD pathology and often preceding observable cognitive decline (Johnson et al., 2014). An individual is deemed at risk if the average amyloid burden within specific regions (eight bilateral) is greater than 1.12 (Johnson et al., 2014). DWI captures the diffusion of water through a specific voxel in a brain image; the mean diffusion of water through a tract within the brain is a measure of connectivity strength. For each individual, 1761 unique brain connectivities derived from the IIT atlas (Varentsova et al., 2014) are computed from each DWI. Additionally, we have a neuropsychological test score for each individual, the Rey Auditory Verbal Learning Test (RAVLT) (Rosenberg et al., 1984), known to be correlated with both amyloid load and structural connectivity. Since our data is cross-sectional, we use RAVLT as our "temporal" analog of cognitive decline.

Model Setup. To generate our sequential training data, we first place all individuals into 8 bins based on their RAVLT scores (i.e., 8 evenly spaced intervals between RAVLT_max and RAVLT_min). This gives us the sample means and variances of each connectivity in each bin. Then, we generate samples of 1761 connectivities across 8 bins (timepoints) by independently sampling each connectivity in each bin from a normal distribution with the corresponding sample mean and variance. Each sample sequence (1761 variables at each of the 8 timepoints) thus simulates how connectivity may evolve over decreasing RAVLT (modeling potential AD progression) within our preclinical population. See Appendix C for details. We train SP-GRU on the generated samples to predict t = 5, 6, 7, 8 given t = 1, 2, 3, 4.
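A sketch of this sequence-generation step is shown below; variable names and shapes are illustrative, the ordering of bins along the "time" axis is left to the caller (the paper orders timepoints by decreasing RAVLT), and the sketch assumes every bin contains at least one subject.

import numpy as np

def generate_sequences(ravlt, conn, n_bins=8, n_samples=100, rng=None):
    """Build synthetic progression sequences from cross-sectional data.

    ravlt: (n_subjects,) RAVLT scores; conn: (n_subjects, 1761) connectivities.
    Bins subjects into n_bins evenly spaced RAVLT intervals, then samples each
    connectivity at each 'timepoint' (bin) from N(bin mean, bin variance)."""
    rng = np.random.default_rng() if rng is None else rng
    edges = np.linspace(ravlt.min(), ravlt.max(), n_bins + 1)
    bin_idx = np.clip(np.digitize(ravlt, edges) - 1, 0, n_bins - 1)
    means = np.stack([conn[bin_idx == b].mean(axis=0) for b in range(n_bins)])
    stds = np.stack([conn[bin_idx == b].std(axis=0) for b in range(n_bins)])
    # returns shape (n_samples, n_bins, 1761): each sequence mimics connectivity
    # evolving over the RAVLT bins within the preclinical population
    return rng.normal(means, stds, size=(n_samples, n_bins, conn.shape[1]))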

Evaluation. We follow existing work in identifying at-risk individuals; refer to Fig. 9 for the full pipeline. First, after we train our SP-GRU predictor, we generate N = 100 new test sequences and predict t = 5, 6, 7, 8 given t = 1, 2, 3, 4 (Fig. 9 (1)-(2)). Thus, for subject i, time t and connectivity k, we obtain a mean response ŷ_itk and an expected level of variation σ_itk. Note that we also have the true response y_itk with a bin-level variance of σ_ntk. Then, we compute a normative probability map (NPM) per timepoint for each subject and connectivity (Ziegler et al., 2014). We compute Z-scores across timepoints, connectivities, and subjects as

z_itk = (y_itk − ŷ_itk) / √(σ²_itk + σ²_ntk)

(Fig. 9 (3)). Applying the procedure described in (Marquand et al., 2016), we compute subject-level empirical distributions of all connectivities per timepoint. Then the robust mean of the top 5% of absolute statistics defines the extreme value statistic (EVS) describing that subject (Fig. 9 (4)). Collecting across subjects, we fit a generalized extreme value distribution (GED) per timepoint (Fig. 9 (5)).
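The NPM and EVS computations can be written compactly. In the sketch below, the "robust mean" of the top 5% is approximated by a plain mean (an assumption; the paper does not specify the robust estimator), and array shapes are (subjects, timepoints, connectivities).

import numpy as np

def npm_z_scores(y_true, y_pred_mean, y_pred_var, bin_var):
    """Normative probability map statistics: element-wise z-scores for each
    subject, timepoint, and connectivity (arrays of shape (N, T, K))."""
    return (y_true - y_pred_mean) / np.sqrt(y_pred_var + bin_var)

def extreme_value_statistic(z, top_frac=0.05):
    """Per subject and timepoint, the mean of the largest top_frac of |z| over
    connectivities; this is the EVS summarizing each NPM."""
    abs_z = np.abs(z)
    k = max(1, int(np.ceil(top_frac * abs_z.shape[-1])))
    top = np.sort(abs_z, axis=-1)[..., -k:]
    return top.mean(axis=-1)  # shape (N, T)

A generalized extreme value distribution can then be fit to the EVS values per timepoint (for example with scipy.stats.genextreme.fit) to obtain the confidence intervals used for outlier detection, though the exact fitting routine used by the authors is not specified.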

Results. We aim to identify those sequences which correspond to individuals deviating from the norm defined by our estimated GED. Using PET mean amyloid burden, we can separate our cohort into two distinct groups, one considered 'cognitively healthy' and the other 'at risk'. Sampling 100 sequences for each group using the binning above, we then apply the EVS procedure (i.e., compute the EVS following (1)-(4) in Fig. 9 with the same SP-GRU). We use these EVS to identify sequences within those groups which significantly deviate from the overall population (Fig. 9 (6)-(7)). With an α = 0.01 cutoff (with Bonferroni correction), we identify 9 outlier sequences in the cognitively healthy group and 19 in the at-risk group. While further scientific analysis is necessary, these results suggest that larger absolute fluctuations in DWI connectivity may be a good indicator of disease risk as measured by amyloid burden. This sets a promising direction in preclinical AD research, since brain connectivity is one of the early indicators of AD progression (Greicius et al., 2009; Kim et al., 2015, 2019; Hwang et al., 2019), characterizing the overall integrity of the brain (see Appendix C for additional details and discussion).

Remarks. Although this process is feasible with any model providing expected variance, an ideal model needs to possess the following three traits:



Figure 9: Normative modeling pipeline for preclinical AD. (1) Given a set of test inputs (t = 1, 2, 3, 4), (2) use the pretrained SP-GRU to make mean and variance predictions for each connectivity and t = 5, 6, 7, 8. (3) Compute the NPM for each prediction, and (4) derive the EVS for each sample i and t. (5) Fit a GED and construct confidence intervals based on the N EVS for each t. (6) Given a new sample, derive its EVS following (1)-(4), and (7) check the confidence intervals from (5) to determine heterogeneity.

(i) Strong modeling ability to capture the subtle underlying abnormality of biomarkers in the early AD stage. (ii) Sequential modeling of the longitudinal progression of biomarkers, which is often more advantageous due to the variability among cross-sectional samples. (iii) Accurate and practical uncertainty estimation of every variable of interest. Despite the availability of successful recurrent neural network models, they only satisfy (i) and (ii) by construction. Similarly, non-RNN models that satisfy (ii) and (iii) may lack predictive power compared to popular RNNs. Here, we take direct advantage of SP-GRU possessing all three traits in determining the statistic used for detection, overcoming the subtle signal in preclinical longitudinal settings which would otherwise be unidentifiable. Also, SP-GRU requires much less prediction time (~1 second) compared to MC-GRU (~11 seconds, 50 sampling rate) for 100 sequences.

5 CONCLUSION

In this work, we show how uncertainty estimates for a powerful class of sequential models, GRUs, can be derived without compromising either predictive power or computation speed using our SP-GRU. Complementary to the developing body of work on Bayesian perspectives on deep learning, we show how a mix of old and new ideas enables deriving these uncertainty estimates while remaining easily extensible to other sequential models. Competitive results are first shown on a standard dataset used for sequential models, with uncertainty offered as a natural byproduct. We then demonstrate a direct application of SP-GRU to normative modeling of a preclinical Alzheimer's disease cohort for outlier detection, yielding results consistent with findings in the field. The code is available at https://github.com/vsingh-group.

6 ACKNOWLEDGMENTS

Research supported in part by NIH (R01AG040396, R01AG021155, R01AG027161, P50AG033514, R01AG059312, R01EB022883, R01AG062336), the Center for Predictive and Computational Phenotyping (U54AI117924), NSF CAREER Award (1252725), and a predoctoral fellowship to RRM via T32LM012413.


References

Dario Amodei, Sundaram Ananthanarayanan, Rishita Anubhai, Jingliang Bai, Eric Battenberg, Carl Case, Jared Casper, Bryan Catanzaro, Qiang Cheng, Guoliang Chen, et al. Deep Speech 2: End-to-end speech recognition in English and Mandarin. In ICML, 2016.
Lynton Ardizzone, Jakob Kruse, Sebastian Wirkert, et al. Analyzing inverse problems with invertible neural networks. In ICLR, 2019.
Charles Blundell, Julien Cornebise, Koray Kavukcuoglu, and Daan Wierstra. Weight uncertainty in neural networks. In International Conference on Machine Learning, pages 1613–1622, 2015.
Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2014.
Terence C Chua, Wei Wen, Melissa J Slavin, and Perminder S Sachdev. Diffusion tensor imaging in mild cognitive impairment and Alzheimer's disease: a review. Current Opinion in Neurology, 21(1):83–92, 2008.
Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555, 2014.
Francesco Cricri, Mikko Honkala, Xingyang Ni, Emre Aksu, and Moncef Gabbouj. Video ladder networks. arXiv preprint arXiv:1612.01756, 2016.
Bert De Brabandere, Xu Jia, Tinne Tuytelaars, and Luc Van Gool. Dynamic filter networks. In NIPS, 2016.
Armen Der Kiureghian and Ove Ditlevsen. Aleatory or epistemic? Does it matter? Structural Safety, 31(2):105–112, 2009.
Laurent Dinh, David Krueger, and Yoshua Bengio. NICE: Non-linear independent components estimation. arXiv preprint arXiv:1410.8516, 2014.
Laurent Dinh, Jascha Sohl-Dickstein, and Samy Bengio. Density estimation using Real NVP. In ICLR, 2016.
Cristobal Esteban, Oliver Staeck, Stephan Baier, Yinchong Yang, and Volker Tresp. Predicting clinical events by combining static and dynamic information using recurrent neural networks. In ICHI, 2016.
Meire Fortunato, Charles Blundell, and Oriol Vinyals. Bayesian recurrent neural networks. arXiv preprint arXiv:1704.02798, 2017.
Marco Fraccaro, Søren Kaae Sønderby, Ulrich Paquet, and Ole Winther. Sequential neural models with stochastic layers. In Advances in Neural Information Processing Systems, pages 2199–2207, 2016.
Yarin Gal and Zoubin Ghahramani. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In ICML, 2016.
Felix A Gers, Jurgen Schmidhuber, and Fred Cummins. Learning to forget: Continual prediction with LSTM. 1999.
Alex Graves. Practical variational inference for neural networks. In NIPS, 2011.
Karol Gregor and Frederic Besse. Temporal difference variational auto-encoder. arXiv preprint arXiv:1806.03107, 2018.
Michael D Greicius, Kaustubh Supekar, Vinod Menon, and Robert F Dougherty. Resting-state functional connectivity reflects structural connectivity in the default mode network. Cerebral Cortex, 19(1):72–78, 2009.
Jose Miguel Hernandez-Lobato and Ryan Adams. Probabilistic backpropagation for scalable learning of Bayesian neural networks. In ICML, 2015.
Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.
Geoffrey E Hinton and Drew Van Camp. Keeping the neural networks simple by minimizing the description length of the weights. In Computational Learning Theory, 1993.
Seong Jae Hwang, Nagesh Adluru, Won Hwa Kim, Sterling C Johnson, Barbara B Bendlin, and Vikas Singh. Associations between positron emission tomography amyloid pathology and diffusion tensor imaging brain connectivity in pre-clinical Alzheimer's disease. Brain Connectivity, 9:162–173, 2019.
Abhyuday N Jagannatha and Hong Yu. Bidirectional RNN for medical event detection in electronic health records. In Association for Computational Linguistics, 2016.
Sterling C Johnson, Bradley T Christian, Ozioma C Okonkwo, Jennifer M Oh, Sandra Harding, Guofan Xu, Ansel T Hillmer, Dustin W Wooten, Dhanabalan Murali, Todd E Barnhart, Lance T Hall, Annie M Racine, William E Klunk, Chester A Mathis, Howard A Rowley, Bruce P Hermann, N. Maritza Dowling, Sanjay Asthana, and Mark A Sager. Amyloid burden and neural function in people at risk for Alzheimer's disease. Neurobiology of Aging, 35(3):576–584, 2014.
Rafal Jozefowicz, Oriol Vinyals, Mike Schuster, Noam Shazeer, and Yonghui Wu. Exploring the limits of language modeling. arXiv preprint arXiv:1602.02410, 2016.
Alex Kendall and Yarin Gal. What uncertainties do we need in Bayesian deep learning for computer vision? In NIPS, 2017.
Won Hwa Kim, Nagesh Adluru, Moo K Chung, Ozioma C Okonkwo, Sterling C Johnson, Barbara B Bendlin, and Vikas Singh. Multi-resolution statistical analysis of brain connectivity graphs in preclinical Alzheimer's disease. NeuroImage, 118:103–117, 2015.
Won Hwa Kim, Annie M Racine, Nagesh Adluru, Seong Jae Hwang, Kaj Blennow, Henrik Zetterberg, Cynthia M Carlsson, Sanjay Asthana, Rebecca L Koscik, Sterling C Johnson, et al. Cerebrospinal fluid biomarkers of neurofibrillary tangles and synaptic dysfunction are associated with longitudinal decline in white matter connectivity: A multi-resolution graph analysis. NeuroImage: Clinical, 21, 2019.
Rahul G Krishnan, Uri Shalit, and David Sontag. Structured inference networks for nonlinear state space models. In AAAI, pages 2101–2109, 2017.
Viveka Kulharia, Arnab Ghosh, Amitabha Mukerjee, Vinay Namboodiri, and Mohit Bansal. Contextual RNN-GANs for abstract reasoning diagram generation. In AAAI, 2016.
Balaji Lakshminarayanan, Alexander Pritzel, and Charles Blundell. Simple and scalable predictive uncertainty estimation using deep ensembles. In NIPS, 2017.
Jonathan Lambert, Peter B Greer, Fred Menk, Jackie Patterson, Joel Parker, Kara Dahl, Sanjiv Gupta, Anne Capp, Chris Wratten, Colin Tang, et al. MRI-guided prostate radiation therapy planning: Investigation of dosimetric accuracy of MRI-based dose planning. Radiotherapy and Oncology, 98(3):330–334, 2011.
David JC MacKay. Bayesian methods for adaptive models. PhD thesis, California Institute of Technology, 1992a.
David JC MacKay. A practical Bayesian framework for backpropagation networks. Neural Computation, 4(3):448–472, 1992b.
Andre F Marquand, Iead Rezek, Jan Buitelaar, and Christian F Beckmann. Understanding heterogeneity in clinical cohorts using normative models: beyond case-control studies. Biological Psychiatry, 80(7):552–561, 2016.
Tomas Mikolov, Martin Karafiat, Lukas Burget, Jan Cernocky, and Sanjeev Khudanpur. Recurrent neural network based language model. In Interspeech, volume 2, page 3, 2010.
Rajesh Ranganath, Linpeng Tang, Laurent Charlin, and David Blei. Deep exponential families. In Artificial Intelligence and Statistics, pages 762–771, 2015.
Carl Edward Rasmussen and Joaquin Quinonero-Candela. Healing the relevance vector machine through augmentation. In ICML, 2005.
Fabio De Sousa Ribeiro, Francesco Caliva, Mark Swainson, Kjartan Gudmundsson, Georgios Leontidis, and Stefanos Kollias. Deep Bayesian uncertainty estimation for adaptation and self-annotation of food packaging images. arXiv preprint arXiv:1812.01681, 2018.
Samuel J Rosenberg, Joseph J Ryan, and Aurelio Prifitera. Rey Auditory-Verbal Learning Test performance of patients with and without memory impairment. Journal of Clinical Psychology, 40(3):785–787, 1984.
Ruggiero Santeramo, Samuel Withey, and Giovanni Montana. Longitudinal detection of radiological abnormalities with time-modulated LSTM. In Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support, pages 326–333, 2018.
Andreas Sedlmeier, Thomas Gabor, Thomy Phan, Lenz Belzner, and Claudia Linnhoff-Popien. Uncertainty-based out-of-distribution detection in deep reinforcement learning. arXiv preprint arXiv:1901.02219, 2019.
Nitish Srivastava, Geoffrey E Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. JMLR, 15(1):1929–1958, 2014.
Nitish Srivastava, Elman Mansimov, and Ruslan Salakhudinov. Unsupervised learning of video representations using LSTMs. In ICML, 2015.
Anna Varentsova, Shengwei Zhang, and Konstantinos Arfanakis. Development of a high angular resolution diffusion imaging human brain template. NeuroImage, 91:177–186, 2014.
Hao Wang, Xingjian Shi, and Dit-Yan Yeung. Natural-parameter networks: A class of probabilistic neural networks. In NIPS, 2016.
Max Welling and Yee W Teh. Bayesian learning via stochastic gradient Langevin dynamics. In ICML, 2011.
Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. Google's neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144, 2016.
Gabriel Ziegler, Gerard R Ridgway, Robert Dahnke, Christian Gaser, and the Alzheimer's Disease Neuroimaging Initiative. Individualized Gaussian process-based prediction and detection of local and global gray matter abnormalities in elderly subjects. NeuroImage, 97:333–348, 2014.