TRANSCRIPT
Hamiltonian Monte Carlo
for Scalable Deep Learning
Isaac Robson
Department of Statistics and Operations Research,
University of North Carolina at Chapel Hill
BIOS 740
May 4, 2018
Preface
Markov Chain Monte Carlo (MCMC) techniques are powerful algorithms for fitting probabilistic models
Variations such as Gibbs samplers work well for some high-dimensional situations, but have issues scaling to today’s challenges and model architectures
Hamiltonian Monte Carlo (HMC) is a more proposal-efficient variant of MCMC that is a promising catalyst for innovation in deep learning and probabilistic graphical models
Isaac Robson (UNC) HMC for Scalable Deep Learning May 4, 2018 02/24
Outline
Review Metropolis-Hastings
Introduction to Hamiltonian Monte Carlo (HMC)
Brief review of neural networks and fitting methods
Discussion of Stochastic Gradient HMC (T. Chen et al., 2014)
Introduction to Hamiltonian Monte Carlo
Review: Metropolis-Hastings 1/3
Metropolis et al., 1953, and Hastings, 1970
The original Metropolis et al. algorithm can be used to compute integrals over a distribution, e.g. the normalization of a Bayesian posterior
$J = \int f(x)\,P(x)\,dx = E_P[f]$
Originally developed for statistical mechanics, more specifically for calculating the potential of 2D spheres (particles) in a square with 'fast electronic computing machines'
Size N = 224 particles, time = 16 hours (on prevailing machines)
The advantage is that it depends only on the ratio $P(x')/P(x)$ of the probability distribution evaluated at two points, $x'$ and $x$
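As a quick illustration, $J$ can be approximated by averaging $f$ over draws from $P$. This Python sketch assumes a standard-normal $P$ and $f(x) = x^2$ (so the true value is 1); these choices are illustrative, not from the slides:

```python
import random

# Estimate J = E_P[f] by averaging f over draws from P.
# Assumed example: P = N(0, 1) and f(x) = x^2, so J = Var(x) = 1.
def mc_estimate(f, sampler, n=100_000):
    return sum(f(sampler()) for _ in range(n)) / n

random.seed(0)
est = mc_estimate(lambda x: x * x, lambda: random.gauss(0.0, 1.0))
```

With 100,000 draws the estimate lands close to 1, with Monte Carlo error shrinking at rate $1/\sqrt{n}$.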
Metropolis et al., 1953
Review: Metropolis-Hastings 2/3
We can use this ratio to accept or reject moving from a point $x$ to a randomly generated point $x'$ with acceptance ratio $P(x')/P(x)$
This allows us to 'sample' by accumulating a running random-walk (Markov chain) list of correlated draws from the target distribution under a symmetric proposal scheme, which we can then use for estimation
Hastings extended this to permit (but not require) an asymmetric proposal scheme, which speeds the process and improves mixing
$A(x' \mid x) = \min\!\left[1,\; \frac{P(x')}{P(x)} \cdot \frac{g(x \mid x')}{g(x' \mid x)}\right]$
Regardless, we also accumulate a ‘burn-in’ period of bad initial samples we have to ignore (this slows convergence, as do correlated samples)
Review: Metropolis-Hastings 3/3
We have to remember that Metropolis-Hastings has a few restrictions: a Markov chain won't converge to a target distribution $P(x)$ unless it converges to a stationary distribution $\pi(x) = P(x)$
If 𝜋 𝑥 is not unique, we can also get multiple answers! (This is bad)
So we require detailed balance, $P(x' \mid x)\,P(x) = P(x \mid x')\,P(x')$, i.e. reversibility
Additionally, the proposal is symmetric when $g(x \mid x') / g(x' \mid x) = 1$, e.g. a Gaussian proposal
These are called ‘random-walk’ algorithms
When $P(x') \ge P(x)$, they move to the higher-density point with certainty, else with acceptance ratio $P(x')/P(x)$
Note a proposal with higher variance typically yields a lower acceptance ratio
Finally, remember Gibbs sampling, useful for certain high-dimensional situations, is a special case of Metropolis-Hastings using proposals conditioned on values of other dimensions
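The random-walk scheme above can be sketched in a few lines of Python; the standard-normal target, step size, and chain length here are illustrative assumptions, not from the slides:

```python
import math
import random

def metropolis(log_p, x0, step=1.0, n=50_000, burn_in=5_000):
    """Random-walk Metropolis with a symmetric Gaussian proposal.

    Only the ratio P(x')/P(x) is needed, so log_p may omit the
    normalizing constant; the first burn_in draws are discarded.
    """
    x, samples = x0, []
    for i in range(n):
        x_new = x + random.gauss(0.0, step)
        # Symmetric proposal: accept with probability min(1, P(x')/P(x)).
        if math.log(random.random()) < log_p(x_new) - log_p(x):
            x = x_new
        if i >= burn_in:
            samples.append(x)
    return samples

random.seed(1)
draws = metropolis(lambda x: -0.5 * x * x, x0=0.0)  # unnormalized N(0, 1)
mean = sum(draws) / len(draws)
var = sum((d - mean) ** 2 for d in draws) / len(draws)
```

Note that working in log space avoids under/overflow, and the burn-in and correlated draws mentioned above are visible directly in the returned chain.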
Hamiltonian Monte Carlo 1/5
Duane et al., 1987, Neal, 2012, and Betancourt, 2017
Duane proposed Hybrid Monte Carlo to more efficiently compute integrals in lattice field theory
‘Hybrid’ was due to the fact that it infused Hamiltonian equations of motion to generate a candidate point 𝑥′ instead of just RNG
As Neal describes, this allows us to 'push' the candidate points further out with momentum, because the dynamics:
Are reversible (necessary for convergence to unique target distribution)
Preserve the Hamiltonian (so we can still use momentum)
Preserve volume (which makes acceptance probabilities solvable)
Hamiltonian Monte Carlo 2/5
A Hamiltonian is an energy function of the form
$H(q, p) = U(q) + K(p)$
Hamilton’s equations govern the change of this system over time
$\frac{dq_i}{dt} = \frac{\partial H}{\partial p_i} = [M^{-1} p]_i, \qquad \frac{dp_i}{dt} = -\frac{\partial H}{\partial q_i} = -\frac{\partial U}{\partial q_i}$
$J_H = \begin{pmatrix} \mathbf{0}_{d \times d} & \mathbf{I}_{d \times d} \\ -\mathbf{I}_{d \times d} & \mathbf{0}_{d \times d} \end{pmatrix}$
We can set $U(q) = -\log P(q) + C$ and $K(p) = p^\top M^{-1} p\,/\,2$, where $P$ is the target density, $C$ is a constant, and $M$ is a PSD 'mass matrix' that determines our momentum (kinetic energy)
[Figure: potential energy as a function of position; kinetic energy as a function of momentum]
Hamiltonian Monte Carlo 3/5
In a Bayesian setting, we take the target for $q$ to be the prior $\pi(q)$ times the likelihood given data $D$, $L(q \mid D)$, so the potential energy is
$U(q) = -\log[\pi(q)\,L(q \mid D)]$
If we choose a Gaussian proposal (Metropolis), we set kinetic energy
$K(p) = \sum_{i=1}^{d} \frac{p_i^2}{2 m_i}$
We then generate 𝑞′, 𝑝′ via Hamiltonian dynamics and use the difference in energy levels as our acceptance ratio in the MH algorithm
$A(q', p' \mid q, p) = \min\!\left[1,\; \exp\!\left(-H(q', p') + H(q, p)\right)\right]$
Hamiltonian Monte Carlo 4/5
Converting the proposal and acceptance steps to this energy form is convoluted; however, we can now use Hamiltonian dynamics to walk farther without sacrificing the acceptance ratio
The classic method of solving Hamilton's differential equations is Euler's method, which traverses a small distance ε > 0 for $L$ steps
$p_i(t + \varepsilon) = p_i(t) - \varepsilon \frac{\partial U}{\partial q_i}[q(t)], \qquad q_i(t + \varepsilon) = q_i(t) + \varepsilon \frac{p_i(t)}{m_i}$
We can also employ the more efficient ‘leapfrog’ technique to quickly propose a candidate
$p_i(t + \varepsilon/2) = p_i(t) - \frac{\varepsilon}{2} \frac{\partial U}{\partial q_i}[q(t)]$
$q_i(t + \varepsilon) = q_i(t) + \varepsilon \frac{p_i(t + \varepsilon/2)}{m_i}$
$p_i(t + \varepsilon) = p_i(t + \varepsilon/2) - \frac{\varepsilon}{2} \frac{\partial U}{\partial q_i}[q(t + \varepsilon)]$
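To see why leapfrog is preferred, one can compare the energy drift of the two integrators on a toy Hamiltonian. The harmonic oscillator (unit mass), step size, and step count below are illustrative choices, not from the slides:

```python
# Harmonic oscillator: U(q) = q^2/2, K(p) = p^2/2, so H should stay at 0.5.
def euler(q, p, eps, steps):
    for _ in range(steps):
        p_new = p - eps * q        # both updates use the current state
        q_new = q + eps * p
        q, p = q_new, p_new
    return q, p

def leapfrog(q, p, eps, steps):
    p -= 0.5 * eps * q             # initial half step for momentum
    for i in range(steps):
        q += eps * p               # full step for position
        if i < steps - 1:
            p -= eps * q           # full step for momentum
    p -= 0.5 * eps * q             # final half step for momentum
    return q, p

H = lambda q, p: 0.5 * (q * q + p * p)
q0, p0 = 1.0, 0.0                  # H(q0, p0) = 0.5
eq, ep = euler(q0, p0, 0.1, 1000)
lq, lp = leapfrog(q0, p0, 0.1, 1000)
euler_drift = abs(H(eq, ep) - 0.5)
leap_drift = abs(H(lq, lp) - 0.5)
```

Euler's method inflates the energy exponentially over long trajectories, while leapfrog keeps $H$ within a small bounded oscillation, which is exactly what preserves the acceptance ratio.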
Hamiltonian Monte Carlo 5/5
The HMC algorithm adds two steps to MH:
Sample the momentum $p$ (typically symmetric, Gaussian)
Compute $L$ leapfrog steps of size ε to find a new $(q', p')$
Betancourt explains that we can sample a momentum to easily change energy levels, and then use Hamiltonian dynamics to traverse our $q$-space (state space)
We no longer have to wait for a random walk to explore slowly, as we can easily find samples well-distributed across our posterior with high acceptance ratios (same energy levels)
However, as T. Chen et al., 2014 describe, we still have to compute the gradient of our potential at every step, which can be costly, especially in high dimensions
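Putting the pieces together, one full HMC transition can be sketched for a one-dimensional standard-normal target. The step size, path length, and unit mass are illustrative assumptions, not from the slides:

```python
import math
import random

def hmc_step(q, U, grad_U, eps=0.2, L=8):
    """One HMC transition: sample a momentum, run L leapfrog steps,
    then accept or reject on the change in total energy H = U + K."""
    p = random.gauss(0.0, 1.0)          # K(p) = p^2 / 2 with unit mass
    q_new, p_new = q, p
    p_new -= 0.5 * eps * grad_U(q_new)  # initial half step for momentum
    for i in range(L):
        q_new += eps * p_new            # full step for position
        if i < L - 1:
            p_new -= eps * grad_U(q_new)
    p_new -= 0.5 * eps * grad_U(q_new)  # final half step for momentum
    h_old = U(q) + 0.5 * p * p
    h_new = U(q_new) + 0.5 * p_new * p_new
    # A = min(1, exp(-H(q', p') + H(q, p)))
    if math.log(random.random()) < h_old - h_new:
        return q_new
    return q

random.seed(2)
U = lambda q: 0.5 * q * q       # U(q) = -log P(q) + C for P = N(0, 1)
grad_U = lambda q: q
q, draws = 0.0, []
for _ in range(20_000):
    q = hmc_step(q, U, grad_U)
    draws.append(q)
mean = sum(draws) / len(draws)
var = sum((d - mean) ** 2 for d in draws) / len(draws)
```

Because leapfrog nearly conserves $H$, almost every proposal is accepted even though each jump moves far across the state space; the gradient call inside the loop is the per-step cost the next section attacks.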
Graphic from Betancourt, 2017
Neural Networks
Neural Networks 1/4
[Artificial] neural networks (or nets) are popular connectionist models for ‘learning’ a function approximation
A descendant of 'Hebbian' learning, after Hebb's neuropsychology work in the 1940s
Popularized today thanks to parallelization and convex optimization
Universal function approximator (in theory)
Typically use function composition and the chain rule alongside vectorization to efficiently optimize a loss function by altering the 'weights' (function) of each node
Requires immense computational power, especially when the functions being composed are probabilistic (such as in a Bayesian Neural Net (BNN))
Fitting neural nets is an active area of research, with contributions from the perspectives of both optimization and sampling
Feedforward Neural Net, Wikimedia Commons
Neural Networks 2/4
LeCun et al., 1998, Robbins and Monro, 1951
As LeCun et al. detail, backward propagation of errors (backprop) is a powerful method for neural net optimization (it does not use sampling); it updates a layer of weights $W$ at time $t$ given error matrix $E$
$W_{t+1} = W_t - \eta \frac{\partial E}{\partial W}$
Note this is merely gradient descent, which in recent years has been upgraded with many bells and whistles
One such whistle is stochastic gradient descent (SGD), an algorithm that evolved following the stochastic approximation methods introduced by Robbins and Monro, 1951 (GO TAR HEELS!)
Neural Networks 3/4
Calculating a gradient is costly, but as LeCun et al. details, stochastic gradient descent is much faster, and comes in both online and the smoother minibatch variations
The primary idea is to update using only the error at one point, 𝐸*
$W_{t+1} = W_t - \eta \frac{\partial E^*}{\partial W}$
The gradient at one point is a stochastic estimate of the full gradient at the current weights $W_t$, hence the name stochastic gradient descent
The speedup is feasible due to shared information across observations, and because SGD still converges as the 'learning rate' $\eta$ is decreased; this also holds for minibatch variants of SGD, which compute gradients for a handful of points instead of a single one
Neural Networks 4/4
Rumelhart et al., 1986
A popular bell to complement SGD’s whistle is the addition of a momentum term to the update step. We more or less smooth our update steps with an exponential decay factor α
$W_{t+1} = W_t - \eta \frac{\partial E^*}{\partial W} + \alpha\,\Delta W_t, \qquad \Delta W_t = W_t - W_{t-1}$
This may seem familiar if you recall the ‘momentum’ term that exists in Hamiltonian Monte Carlo (cue imaginary dramatic sound effect)
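The momentum update can be sketched on a toy problem; the quadratic loss, noise level, and hyperparameters below are illustrative assumptions, not from the slides:

```python
import random

def sgd_momentum(grad, w0, eta=0.05, alpha=0.9, steps=500):
    """W_{t+1} = W_t - eta * dE*/dW + alpha * dW_t, where dW_t is the
    previous update; alpha exponentially smooths the update steps."""
    w, dw = w0, 0.0
    for _ in range(steps):
        dw = -eta * grad(w) + alpha * dw
        w += dw
    return w

# Noisy gradient of E(w) = (w - 3)^2 / 2, mimicking single-point errors E*.
random.seed(3)
noisy_grad = lambda w: (w - 3.0) + random.gauss(0.0, 0.1)
w_hat = sgd_momentum(noisy_grad, w0=0.0)
```

The running update `dw` is exactly the 'momentum' being foreshadowed: it accumulates a velocity across steps rather than restarting from each noisy gradient.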
Stochastic Gradient HMC (T. Chen et al., 2014)
Stochastic Gradient HMC 1/4
As mentioned before, backprop uses the powerful stochastic gradient descent method and extensions to fit gradient-based neural networks
Unfortunately, many of these neural nets lack inferentiability (support for statistical inference about their parameters)
One solution (other than proving P = NP or solving AGI) is to use Bayesian Neural Networks, which exist as a class of probabilistic graphical models, and can be fitted with sampling or similar methods
BNNs still perform many of the surreal feats that other neural nets accomplish
However, even with Gibbs samplers and HMC, sampling in high dimensions is quite slow to converge…
for now...
Stochastic Gradient HMC 2/4
Welling et al., 2011, T. Chen et al., 2014, and C. Chen et al., 2016
In HMC, instead of calculating the gradient of our potential energy $U(q)$ over the whole dataset $D$, what if we selected some minibatch $\tilde{D} \subset D$ to use for our estimate in the 'leapfrogging' method?
$\nabla \tilde{U} \approx \nabla U + N(\mathbf{0}, \Sigma)$ for noise $\Sigma$
$p_i(t + \varepsilon/2) = p_i(t) - \frac{\varepsilon}{2} \frac{\partial \tilde{U}}{\partial q_i}[q(t)]$
Unfortunately, this 'naïve' stochastic gradient HMC (SGHMC) injects noise into Hamilton's equations, which drives the MH acceptance ratio down to inefficient levels
Stochastic Gradient HMC 3/4
T. Chen et al. suggests fixing naïve SGHMC by adding a friction term, as proposed by Welling et al., borrowing once again from physics in the form of Langevin dynamics (vectorized form, omitting leapfrog notation)
$\nabla \tilde{U} = \nabla U + N(\mathbf{0}, 2B), \qquad B = \frac{\varepsilon}{2}\Sigma$
$q_{t+1} = q_t + M^{-1} p_t$
$p_{t+1} = p_t - \varepsilon\,\nabla \tilde{U}(q_{t+1}) - B M^{-1} p_t$
Note that B is a PSD function of 𝑞𝑡+1, but Chen also shows certain constant choices of B converge (and are far more practical)
Welling et al. also laments that Bayesian methods have been “left-behind” in recent machine learning advances due to “[MCMC] requiring computations over the whole dataset”
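A rough one-dimensional sketch of this friction-corrected update for a standard-normal target follows. The gradient-noise level is assumed known here so the friction can be set to balance it (the actual algorithm of T. Chen et al. estimates it), the step size includes the position update for a working discretization, and all constants are illustrative:

```python
import math
import random

def sghmc(n=100_000, eps=0.05, sigma=0.5, seed=4):
    """SGHMC sketch for U(q) = q^2/2 (so grad U(q) = q), M = I.

    The noisy gradient mimics a minibatch estimate; the friction C and
    injected noise are balanced (fluctuation-dissipation), so no MH
    correction step is applied."""
    random.seed(seed)
    V = sigma * sigma              # variance of the gradient noise
    B_hat = 0.5 * eps * V          # noise-term estimate, B = (eps/2) * Sigma
    C = B_hat + 1.0                # friction; the extra constant keeps the chain stable
    q, p = 0.0, random.gauss(0.0, 1.0)
    draws = []
    for _ in range(n):
        q += eps * p                                   # position update
        grad_noisy = q + random.gauss(0.0, sigma)      # minibatch-style gradient
        p += (-eps * grad_noisy - eps * C * p
              + random.gauss(0.0, math.sqrt(2.0 * eps * (C - B_hat))))
        draws.append(q)
    return draws

draws = sghmc()
mean = sum(draws) / len(draws)
var = sum((d - mean) ** 2 for d in draws) / len(draws)
```

Despite every gradient being noisy, the friction term dissipates exactly the energy the noise injects, so the chain still concentrates on the correct target up to discretization error.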
Stochastic Gradient HMC 4/4
The end result of SGHMC is an efficient sampling algorithm that also permits computing gradients on a minibatch in a Bayesian setting
T. Chen et al. then show that, in the deterministic setting, SGHMC performs analogously to SGD with momentum, as the momentum components are related
C. Chen et al. (BEAT DOOK!) further elaborate that many Bayesian MCMC sampling algorithms are analogs of stochastic optimization algorithms, suggesting that symbiotic discoveries and extensions across the two fields are possible, as demonstrated by their Stochastic AnNealing Thermostats with Adaptive momentum (Santa) algorithm, which incorporates recent advances from both domains
Conclusions
HMC is a promising variant of MCMC sampling algorithms with applications in Bayesian models
SGHMC offers more scalability in deep learning and several other settings, with the added benefit of inferentiability in Bayesian neural nets.
Future work and collaborations between the sampling and optimization communities are promising
Bibliography (by date)
Robbins, H. and Monro, S. (1951). A stochastic approximation method. The Annals of Mathematical Statistics, pages 400–407.
Metropolis, N., Rosenbluth, A. W., Rosenbluth, M. N., Teller, A. H. and Teller, E. (1953). Equation of State Calculations by Fast Computing Machines. The journal of chemical physics 21 1087.
Hastings, W. K. (1970). Monte Carlo Sampling Methods Using Markov Chains and Their Applications. Biometrika 57 97–109.
Rumelhart, David E., Hinton, Geoffrey E., and Williams, Ronald J (1986). Learning representations by backpropagating errors. Nature, 323:533–536.
Duane, S., Kennedy, A. D., Pendleton, B. J. and Roweth, D. (1987). Hybrid Monte Carlo. Physics Letters B 195 216 - 222.
LeCun, Y. A., Bottou, L., Orr, G. B., and Müller, K.-R. (1998). Efficient backprop. In Neural Networks: Tricks of the Trade, pages 9–48. Springer.
Welling, M. and Teh, Y. W. (2011). Bayesian learning via stochastic gradient Langevin dynamics. In Proceedings of the 28th International Conference on Machine Learning (ICML), pp. 681–688.
Neal, R. M. (2012). MCMC using Hamiltonian dynamics. ArXiv e-prints, arXiv:1206.1901
Chen, T., Fox, E. B., and Guestrin, C. (2014). Stochastic gradient Hamiltonian Monte Carlo. arXiv preprint arXiv:1402.4102v2.
Chen, C., Carlson, D., Gan, Z., Li, C., and Carin, L. (2015). Bridging the gap between stochastic gradient MCMC and stochastic optimization. arXiv:1512.07962.
Betancourt, M. (2017). A conceptual introduction to Hamiltonian Monte Carlo. arXiv preprint arXiv:1701.02434.