TRANSCRIPT
Hamiltonian Monte Carlo
for Scalable Deep Learning
Isaac Robson
Department of Statistics and Operations Research,
University of North Carolina at Chapel Hill
BIOS 740
May 4, 2018
Preface
Markov Chain Monte Carlo (MCMC) techniques are powerful algorithms for fitting probabilistic models
Variations such as Gibbs samplers work well for some high-dimensional situations, but have issues scaling to today’s challenges and model architectures
Hamiltonian Monte Carlo (HMC) is a more proposal-efficient variant of MCMC that is a promising catalyst for innovation in deep learning and probabilistic graphical models
Isaac Robson (UNC) HMC for Scalable Deep Learning May 4, 2018 02/24
Outline
Review Metropolis-Hastings
Introduction to Hamiltonian Monte Carlo (HMC)
Brief review of neural networks and fitting methods
Discussion of Stochastic Gradient HMC (T. Chen et al., 2014)
Introduction to Hamiltonian Monte Carlo
Review: Metropolis-Hastings 1/3
Metropolis et al., 1953, and Hastings, 1970
The original Metropolis et al. algorithm can be used to compute integrals over a distribution, e.g. the normalization of a Bayesian posterior
$J = \int f(x)\,P(x)\,dx = E_P[f]$
Originally developed for statistical mechanics, more specifically for calculating the potential of 2D spheres (particles) in a square with 'fast electronic computing machines'
Size N = 224 particles, time = 16 hours (on prevailing machines)
The advantage is that it depends only on the ratio $P(x')/P(x)$ of the probability distribution evaluated at two points, $x'$ and $x$
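As a quick illustration, $J$ can be approximated by averaging $f$ over draws from $P$. This Python sketch assumes a standard-normal $P$ and $f(x) = x^2$ (so the true value is 1); these choices are illustrative, not from the slides:

```python
import random

# Estimate J = E_P[f] by averaging f over draws from P.
# Assumed example: P = N(0, 1) and f(x) = x^2, so J = Var(x) = 1.
def mc_estimate(f, sampler, n=100_000):
    return sum(f(sampler()) for _ in range(n)) / n

random.seed(0)
est = mc_estimate(lambda x: x * x, lambda: random.gauss(0.0, 1.0))
```

With 100,000 draws the estimate lands close to 1, with Monte Carlo error shrinking at rate $1/\sqrt{n}$.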
Metropolis et al., 1953
Review: Metropolis-Hastings 2/3
We can use this ratio to accept or reject moving from a point $x$ to a randomly generated point $x'$ with acceptance ratio $P(x')/P(x)$
This allows us to 'sample' by accumulating a running random-walk (Markov chain) list of correlated draws from the target distribution under a symmetric proposal scheme, which we can then use for estimation
Hastings extended this to permit (but not require) an asymmetric proposal scheme, which speeds the process and improves mixing
$A(x' \mid x) = \min\!\left[1,\; \frac{P(x')}{P(x)} \cdot \frac{g(x \mid x')}{g(x' \mid x)}\right]$
Regardless, we also accumulate a ‘burn-in’ period of bad initial samples we have to ignore (this slows convergence, as do correlated samples)
Review: Metropolis-Hastings 3/3
We have to remember that Metropolis-Hastings has a few restrictions: a Markov chain won't converge to a target distribution $P(x)$ unless it converges to a stationary distribution $\pi(x) = P(x)$
If 𝜋 𝑥 is not unique, we can also get multiple answers! (This is bad)
So we require detailed balance, $P(x' \mid x)\,P(x) = P(x \mid x')\,P(x')$, i.e. reversibility
Additionally, the proposal is symmetric when $g(x \mid x') / g(x' \mid x) = 1$, e.g. a Gaussian proposal
These are called ‘random-walk’ algorithms
When $P(x') \ge P(x)$, they move to the higher-density point with certainty, else with acceptance ratio $P(x')/P(x)$
Note a proposal with higher variance typically yields a lower acceptance ratio
Finally, remember Gibbs sampling, useful for certain high-dimensional situations, is a special case of Metropolis-Hastings using proposals conditioned on values of other dimensions
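The random-walk scheme above can be sketched in a few lines of Python; the standard-normal target, step size, and chain length here are illustrative assumptions, not from the slides:

```python
import math
import random

def metropolis(log_p, x0, step=1.0, n=50_000, burn_in=5_000):
    """Random-walk Metropolis with a symmetric Gaussian proposal.

    Only the ratio P(x')/P(x) is needed, so log_p may omit the
    normalizing constant; the first burn_in draws are discarded.
    """
    x, samples = x0, []
    for i in range(n):
        x_new = x + random.gauss(0.0, step)
        # Symmetric proposal: accept with probability min(1, P(x')/P(x)).
        if math.log(random.random()) < log_p(x_new) - log_p(x):
            x = x_new
        if i >= burn_in:
            samples.append(x)
    return samples

random.seed(1)
draws = metropolis(lambda x: -0.5 * x * x, x0=0.0)  # unnormalized N(0, 1)
mean = sum(draws) / len(draws)
var = sum((d - mean) ** 2 for d in draws) / len(draws)
```

Note that working in log space avoids under/overflow, and the burn-in and correlated draws mentioned above are visible directly in the returned chain.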
Hamiltonian Monte Carlo 1/5
Duane et al., 1987, Neal, 2012, and Betancourt, 2017
Duane proposed Hybrid Monte Carlo to more efficiently compute integrals in lattice field theory
‘Hybrid’ was due to the fact that it infused Hamiltonian equations of motion to generate a candidate point 𝑥′ instead of just RNG
As Neal describes, this allows us to 'push' the candidate points further out with momentum, because the dynamics:
Are reversible (necessary for convergence to unique target distribution)
Preserve the Hamiltonian (so we can still use momentum)
Preserve volume (which makes acceptance probabilities solvable)
Hamiltonian Monte Carlo 2/5
A Hamiltonian is an energy function of the form
$H(q, p) = U(q) + K(p)$
Hamilton’s equations govern the change of this system over time
$\frac{dq_i}{dt} = \frac{\partial H}{\partial p_i} = [M^{-1} p]_i, \qquad \frac{dp_i}{dt} = -\frac{\partial H}{\partial q_i} = -\frac{\partial U}{\partial q_i}$
$J_H = \begin{pmatrix} \mathbf{0}_{d \times d} & \mathbf{I}_{d \times d} \\ -\mathbf{I}_{d \times d} & \mathbf{0}_{d \times d} \end{pmatrix}$
We can set $U(q) = -\log P(q) + C$ and $K(p) = p^\top M^{-1} p\,/\,2$, where $P$ is the target density, $C$ is a constant, and $M$ is a PSD 'mass matrix' that determines our momentum (kinetic energy)
[Figure: potential energy as a function of position; kinetic energy as a function of momentum]
Hamiltonian Monte Carlo 3/5
In a Bayesian setting, we take the target for $q$ to be the prior $\pi(q)$ times the likelihood given data $D$, $L(q \mid D)$, so the potential energy is
$U(q) = -\log[\pi(q)\,L(q \mid D)]$
If we choose a Gaussian proposal (Metropolis), we set kinetic energy
$K(p) = \sum_{i=1}^{d} \frac{p_i^2}{2 m_i}$
We then generate 𝑞′, 𝑝′ via Hamiltonian dynamics and use the difference in energy levels as our acceptance ratio in the MH algorithm
$A(q', p' \mid q, p) = \min\!\left[1,\; \exp\!\left(-H(q', p') + H(q, p)\right)\right]$
Hamiltonian Monte Carlo 4/5
Converting the proposal and acceptance steps to this energy form is convoluted; however, we can now use Hamiltonian dynamics to walk farther without sacrificing the acceptance ratio
The classic method of solving Hamilton's differential equations is Euler's method, which traverses a small distance ε > 0 for $L$ steps
$p_i(t + \varepsilon) = p_i(t) - \varepsilon \frac{\partial U}{\partial q_i}[q(t)], \qquad q_i(t + \varepsilon) = q_i(t) + \varepsilon \frac{p_i(t)}{m_i}$
We can also employ the more efficient ‘leapfrog’ technique to quickly propose a candidate
$p_i(t + \varepsilon/2) = p_i(t) - \frac{\varepsilon}{2} \frac{\partial U}{\partial q_i}[q(t)]$
$q_i(t + \varepsilon) = q_i(t) + \varepsilon \frac{p_i(t + \varepsilon/2)}{m_i}$
$p_i(t + \varepsilon) = p_i(t + \varepsilon/2) - \frac{\varepsilon}{2} \frac{\partial U}{\partial q_i}[q(t + \varepsilon)]$
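To see why leapfrog is preferred, one can compare the energy drift of the two integrators on a toy Hamiltonian. The harmonic oscillator (unit mass), step size, and step count below are illustrative choices, not from the slides:

```python
# Harmonic oscillator: U(q) = q^2/2, K(p) = p^2/2, so H should stay at 0.5.
def euler(q, p, eps, steps):
    for _ in range(steps):
        p_new = p - eps * q        # both updates use the current state
        q_new = q + eps * p
        q, p = q_new, p_new
    return q, p

def leapfrog(q, p, eps, steps):
    p -= 0.5 * eps * q             # initial half step for momentum
    for i in range(steps):
        q += eps * p               # full step for position
        if i < steps - 1:
            p -= eps * q           # full step for momentum
    p -= 0.5 * eps * q             # final half step for momentum
    return q, p

H = lambda q, p: 0.5 * (q * q + p * p)
q0, p0 = 1.0, 0.0                  # H(q0, p0) = 0.5
eq, ep = euler(q0, p0, 0.1, 1000)
lq, lp = leapfrog(q0, p0, 0.1, 1000)
euler_drift = abs(H(eq, ep) - 0.5)
leap_drift = abs(H(lq, lp) - 0.5)
```

Euler's method inflates the energy exponentially over long trajectories, while leapfrog keeps $H$ within a small bounded oscillation, which is exactly what preserves the acceptance ratio.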
Hamiltonian Monte Carlo 5/5
The HMC algorithm adds two steps to MH:
Sample the momentum $p$ (typically symmetric, Gaussian)
Compute $L$ leapfrog steps of size ε to find a new $(q', p')$
Betancourt explains that we can sample a momentum to easily change energy levels, and then use Hamiltonian dynamics to traverse our $q$-space (state space)
We no longer have to wait for a random walk to explore slowly, as we can easily find samples well-distributed across our posterior with high acceptance ratios (same energy levels)
However, as T. Chen et al., 2014 describe, we still have to compute the gradient of our potential at every step, which can be costly, especially in high dimensions
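Putting the pieces together, one full HMC transition can be sketched for a one-dimensional standard-normal target. The step size, path length, and unit mass are illustrative assumptions, not from the slides:

```python
import math
import random

def hmc_step(q, U, grad_U, eps=0.2, L=8):
    """One HMC transition: sample a momentum, run L leapfrog steps,
    then accept or reject on the change in total energy H = U + K."""
    p = random.gauss(0.0, 1.0)          # K(p) = p^2 / 2 with unit mass
    q_new, p_new = q, p
    p_new -= 0.5 * eps * grad_U(q_new)  # initial half step for momentum
    for i in range(L):
        q_new += eps * p_new            # full step for position
        if i < L - 1:
            p_new -= eps * grad_U(q_new)
    p_new -= 0.5 * eps * grad_U(q_new)  # final half step for momentum
    h_old = U(q) + 0.5 * p * p
    h_new = U(q_new) + 0.5 * p_new * p_new
    # A = min(1, exp(-H(q', p') + H(q, p)))
    if math.log(random.random()) < h_old - h_new:
        return q_new
    return q

random.seed(2)
U = lambda q: 0.5 * q * q       # U(q) = -log P(q) + C for P = N(0, 1)
grad_U = lambda q: q
q, draws = 0.0, []
for _ in range(20_000):
    q = hmc_step(q, U, grad_U)
    draws.append(q)
mean = sum(draws) / len(draws)
var = sum((d - mean) ** 2 for d in draws) / len(draws)
```

Because leapfrog nearly conserves $H$, almost every proposal is accepted even though each jump moves far across the state space; the gradient call inside the loop is the per-step cost the next section attacks.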
Graphic from Betancourt, 2017
Neural Networks
Neural Networks 1/4
[Artificial] neural networks (or nets) are popular connectionist models for ‘learning’ a function approximation
A descendant of 'Hebbian' learning, after Hebb's neuropsychology work in the 1940s
Popularized today thanks to parallelization and convex optimization
Universal function approximator (in theory)
Typically use function composition and the chain rule alongside vectorization to efficiently optimize a loss function by altering the 'weights' (function) of each node
Requires immense computational power, especially when the functions being composed are probabilistic (such as in a Bayesian Neural Net (BNN))
Fitting neural nets is an active area of research, with contributions from the perspectives of both optimization and sampling
Feedforward Neural Net, Wikimedia Commons
Neural Networks 2/4
LeCun et al., 1998, Robbins and Monro, 1951
As LeCun et al. detail, backward propagation of errors (backprop) is a powerful method for neural net optimization (it does not use sampling); it updates a layer of weights $W$ at time $t$ given error matrix $E$
$W_{t+1} = W_t - \eta \frac{\partial E}{\partial W}$
Note this is merely gradient descent, which in recent years has been upgraded with many bells and whistles
One such whistle is stochastic gradient descent (SGD), an algorithm that evolved following the stochastic approximation methods introduced by Robbins and Monro, 1951 (GO TAR HEELS!)
Neural Networks 3/4
Calculating a gradient is costly, but as LeCun et al. details, stochastic gradient descent is much faster, and comes in both online and the smoother minibatch variations
The primary idea is to update using only the error at one point, 𝐸*
$W_{t+1} = W_t - \eta \frac{\partial E^*}{\partial W}$
The gradient at one point is a stochastic estimate of the full gradient at the current weights $W_t$, hence the name stochastic gradient descent
The speedup is feasible due to shared information across observations, and because SGD still converges as the 'learning rate' $\eta$ is decreased; this also holds for minibatch variants of SGD, which compute gradients for a handful of points instead of a single one
Neural Networks 4/4
Rumelhart et al., 1986
A popular bell to complement SGD’s whistle is the addition of a momentum term to the update step. We more or less smooth our update steps with an exponential decay factor α
$W_{t+1} = W_t - \eta \frac{\partial E^*}{\partial W} + \alpha\,\Delta W_t, \qquad \Delta W_t = W_t - W_{t-1}$
This may seem familiar if you recall the ‘momentum’ term that exists in Hamiltonian Monte Carlo (cue imaginary dramatic sound effect)
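The momentum update can be sketched on a toy problem; the quadratic loss, noise level, and hyperparameters below are illustrative assumptions, not from the slides:

```python
import random

def sgd_momentum(grad, w0, eta=0.05, alpha=0.9, steps=500):
    """W_{t+1} = W_t - eta * dE*/dW + alpha * dW_t, where dW_t is the
    previous update; alpha exponentially smooths the update steps."""
    w, dw = w0, 0.0
    for _ in range(steps):
        dw = -eta * grad(w) + alpha * dw
        w += dw
    return w

# Noisy gradient of E(w) = (w - 3)^2 / 2, mimicking single-point errors E*.
random.seed(3)
noisy_grad = lambda w: (w - 3.0) + random.gauss(0.0, 0.1)
w_hat = sgd_momentum(noisy_grad, w0=0.0)
```

The running update `dw` is exactly the 'momentum' being foreshadowed: it accumulates a velocity across steps rather than restarting from each noisy gradient.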
Stochastic Gradient HMC (T. Chen et al., 2014)
Stochastic Gradient HMC 1/4
As mentioned before, backprop uses the powerful stochastic gradient descent method and extensions to fit gradient-based neural networks
Unfortunately, many of these neural nets lack inferentiability (support for statistical inference about their parameters)
One solution (other than proving P = NP or solving AGI) is to use Bayesian Neural Networks, which exist as a class of probabilistic graphical models, and can be fitted with sampling or similar methods
BNNs still perform many of the surreal feats that other neural nets accomplish
However, even with Gibbs samplers and HMC, sampling in high dimensions is quite slow to converge…
for now...
Stochastic Gradient HMC 2/4
Welling et al., 2011, T. Chen et al., 2014, and C. Chen et al., 2016
In HMC, instead of calculating the gradient of our potential energy $U(q)$ over the whole dataset $D$, what if we selected some minibatch $\tilde{D} \subset D$ to use for our estimate in the 'leapfrogging' method?
$\nabla \tilde{U} \approx \nabla U + N(\mathbf{0}, \Sigma)$ for noise $\Sigma$
$p_i(t + \varepsilon/2) = p_i(t) - \frac{\varepsilon}{2} \frac{\partial \tilde{U}}{\partial q_i}[q(t)]$
Unfortunately, this 'naïve' stochastic gradient HMC (SGHMC) injects noise into Hamilton's equations, which drives the MH acceptance ratio down to inefficient levels
Stochastic Gradient HMC 3/4
T. Chen et al. suggests fixing naïve SGHMC by adding a friction term, as proposed by Welling et al., borrowing once again from physics in the form of Langevin dynamics (vectorized form, omitting leapfrog notation)
$\nabla \tilde{U} = \nabla U + N(\mathbf{0}, 2B), \qquad B = \frac{\varepsilon}{2}\Sigma$
$q_{t+1} = q_t + M^{-1} p_t$
$p_{t+1} = p_t - \varepsilon\,\nabla \tilde{U}(q_{t+1}) - B M^{-1} p_t$
Note that B is a PSD function of 𝑞𝑡+1, but Chen also shows certain constant choices of B converge (and are far more practical)
Welling et al. also laments that Bayesian methods have been “left-behind” in recent machine learning advances due to “[MCMC] requiring computations over the whole dataset”
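A rough one-dimensional sketch of this friction-corrected update for a standard-normal target follows. The gradient-noise level is assumed known here so the friction can be set to balance it (the actual algorithm of T. Chen et al. estimates it), the step size includes the position update for a working discretization, and all constants are illustrative:

```python
import math
import random

def sghmc(n=100_000, eps=0.05, sigma=0.5, seed=4):
    """SGHMC sketch for U(q) = q^2/2 (so grad U(q) = q), M = I.

    The noisy gradient mimics a minibatch estimate; the friction C and
    injected noise are balanced (fluctuation-dissipation), so no MH
    correction step is applied."""
    random.seed(seed)
    V = sigma * sigma              # variance of the gradient noise
    B_hat = 0.5 * eps * V          # noise-term estimate, B = (eps/2) * Sigma
    C = B_hat + 1.0                # friction; the extra constant keeps the chain stable
    q, p = 0.0, random.gauss(0.0, 1.0)
    draws = []
    for _ in range(n):
        q += eps * p                                   # position update
        grad_noisy = q + random.gauss(0.0, sigma)      # minibatch-style gradient
        p += (-eps * grad_noisy - eps * C * p
              + random.gauss(0.0, math.sqrt(2.0 * eps * (C - B_hat))))
        draws.append(q)
    return draws

draws = sghmc()
mean = sum(draws) / len(draws)
var = sum((d - mean) ** 2 for d in draws) / len(draws)
```

Despite every gradient being noisy, the friction term dissipates exactly the energy the noise injects, so the chain still concentrates on the correct target up to discretization error.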
Stochastic Gradient HMC 4/4
The end result of SGHMC is an efficient sampling algorithm that also permits computing gradients on a minibatch in a Bayesian setting
T. Chen et al. then show that, in the deterministic setting, SGHMC performs analogously to SGD with momentum, as the momentum components are related
C. Chen et al. (BEAT DOOK!) further elaborate that many Bayesian MCMC sampling algorithms are analogs of stochastic optimization algorithms, suggesting that symbiotic discoveries and extensions across the two fields are possible, as demonstrated by their Stochastic AnNealing Thermostats with Adaptive momentum (Santa) algorithm, which incorporates recent advances from both domains
Conclusions
HMC is a promising variant of MCMC sampling algorithms with applications in Bayesian models
SGHMC offers more scalability in deep learning and several other settings, with the added benefit of inferentiability in Bayesian neural nets.
Future work and collaborations between the sampling and optimization communities are promising
Bibliography (by date)
Robbins, H. and Monro, S. (1951). A stochastic approximation method. The Annals of Mathematical Statistics, pages 400–407.
Metropolis, N., Rosenbluth, A. W., Rosenbluth, M. N., Teller, A. H. and Teller, E. (1953). Equation of State Calculations by Fast Computing Machines. The journal of chemical physics 21 1087.
Hastings, W. K. (1970). Monte Carlo Sampling Methods Using Markov Chains and Their Applications. Biometrika 57 97–109.
Rumelhart, David E., Hinton, Geoffrey E., and Williams, Ronald J (1986). Learning representations by backpropagating errors. Nature, 323:533–536.
Duane, S., Kennedy, A. D., Pendleton, B. J. and Roweth, D. (1987). Hybrid Monte Carlo. Physics Letters B 195 216 - 222.
LeCun, Y. A., Bottou, L., Orr, G. B., and Müller, K.-R. (1998). Efficient backprop. In Neural Networks: Tricks of the Trade, pages 9–48. Springer.
Welling, M. and Teh, Y. W. (2011). Bayesian learning via stochastic gradient Langevin dynamics. In Proceedings of the 28th International Conference on Machine Learning (ICML), pp. 681–688.
Neal, R. M. (2012). MCMC using Hamiltonian dynamics. ArXiv e-prints, arXiv:1206.1901
Chen, T., Fox, E. B., and Guestrin, C. (2014). Stochastic gradient Hamiltonian Monte Carlo. arXiv preprint arXiv:1402.4102v2.
Chen, C., Carlson, D., Gan, Z., Li, C., and Carin, L. (2015). Bridging the gap between stochastic gradient MCMC and stochastic optimization. arXiv:1512.07962.
Betancourt, M. (2017). A conceptual introduction to Hamiltonian Monte Carlo. arXiv preprint arXiv:1701.02434.