
Deep Learning and Optimization Methods

Stefan Kühn

Join me on XING

Data Science Meetup Hamburg - July 27th, 2017


Contents

1 Training Deep Networks

2 Training and Learning

3 The Toolbox of Optimization Methods

4 Takeaways


Deep Learning

Neural Networks - Universal Approximation Theorem
A 1-hidden-layer feed-forward neural net with a finite number of parameters can approximate any continuous function on compact subsets of R^n.
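To make the statement concrete, here is a minimal sketch (not from the slides; names and sizes are illustrative) of the function class the theorem talks about: a single hidden layer of sigmoid units followed by a linear readout.

```python
# Function class of the universal approximation theorem:
#   f(x) = sum_j c_j * sigma(w_j . x + b_j)
# with a finite number of hidden units. Weights here are random, purely illustrative.
import numpy as np

def one_hidden_layer_net(x, W, b, c):
    """Evaluate f(x) = c . sigma(W x + b) for a single input vector x."""
    sigma = lambda z: 1.0 / (1.0 + np.exp(-z))   # sigmoid activation
    return c @ sigma(W @ x + b)

rng = np.random.default_rng(0)
n, hidden = 3, 50                                # input dimension, number of hidden units
W = rng.normal(size=(hidden, n))
b = rng.normal(size=hidden)
c = rng.normal(size=hidden)
print(one_hidden_layer_net(rng.normal(size=n), W, b, c))
```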

Questions: Why do we need deep learning at all?
- it is a theoretical result
- approximation by piecewise constant functions (not what you might want for classification/regression)

Why are deep nets harder to train than shallow nets?
- More parameters to be learned by training?
- More hyperparameters to be set before training?
- Numerical issues?

disclaimer — ideas stolen from Martens, Sutskever, Bengio et al. and many more —


Example: RNNs

Recurrent Neural Nets
Extremely powerful for modeling sequential data, e.g. time series, but extremely hard to train (somewhat less hard for LSTMs/GRUs).

Main Advantages:
- Qualitatively: Flexible and rich model class
- Practically: Gradients easily computed by backpropagation through time (BPTT)

Main Problems:
- Qualitatively: Learning long-term dependencies
- Practically: Gradient-based methods struggle when the separation between input and target output is large
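A toy numerical sketch of this practical problem (my illustration, not from the slides; a plain tanh RNN with random weights): backpropagation through time multiplies one Jacobian per time step, so the gradient connecting a distant input to the target either shrinks towards zero or blows up, depending on the recurrent weights.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden, T = 20, 100
for scale in (0.5, 1.5):                       # contracting vs. expanding recurrent weights
    W_hh = scale * rng.normal(size=(hidden, hidden)) / np.sqrt(hidden)
    # forward pass of a driven tanh RNN, storing the hidden states
    hs, h = [], np.zeros(hidden)
    for t in range(T):
        h = np.tanh(W_hh @ h + rng.normal(size=hidden))
        hs.append(h)
    # backward pass: multiply the gradient by one Jacobian (transposed) per time step
    grad = np.ones(hidden)                     # gradient arriving at the last hidden state
    for h in reversed(hs):
        grad = W_hh.T @ ((1.0 - h**2) * grad)  # Jacobian of h_t w.r.t. h_{t-1}, transposed
    print(f"scale={scale}: gradient norm after {T} steps = {np.linalg.norm(grad):.2e}")
```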


Example: RNNs

Recurrent Neural Nets
Highly volatile relationship between parameters and hidden states.

Indicators:
- Vanishing/exploding gradients
- Internal Covariate Shift

Remedies:
- ReLU
- 'Careful' initialization
- Small stepsizes
- (Recurrent) Batch Normalization


Example: RNNs

Recurrent Neural Nets and LSTM
Hochreiter/Schmidhuber proposed a change of the RNN architecture by adding Long Short-Term Memory (LSTM) units.

Vanishing/exploding gradients? Fixed linear dynamics, no longer problematic.

Any questions open?
- Gradient-based training works better with LSTMs
- LSTMs can compensate for one deficiency of gradient-based learning, but is this the only one?

Most problems are related to specific numerical issues.
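A minimal sketch of one LSTM step (standard textbook equations, not the speaker's code; parameter shapes are illustrative). The point is the additive cell-state update c_t = f_t * c_{t-1} + i_t * g_t, the "fixed linear dynamics" that keeps the gradient path through the cell state from vanishing.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h, c, W, U, b):
    """One LSTM step; W, U, b hold the stacked gate parameters (4*d rows)."""
    d = h.shape[0]
    z = W @ x + U @ h + b
    i, f, o, g = (sigmoid(z[:d]), sigmoid(z[d:2*d]),
                  sigmoid(z[2*d:3*d]), np.tanh(z[3*d:]))
    c_new = f * c + i * g            # additive cell-state update
    h_new = o * np.tanh(c_new)
    return h_new, c_new

rng = np.random.default_rng(0)
d, n_in = 8, 4
W, U, b = rng.normal(size=(4*d, n_in)), rng.normal(size=(4*d, d)), np.zeros(4*d)
h, c = np.zeros(d), np.zeros(d)
h, c = lstm_step(rng.normal(size=n_in), h, c, W, U, b)
```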


2 Training and Learning


Trade-offs between Optimization and Learning

Computational complexity becomes the limiting factor when one envisions large amounts of training data. [Bottou, Bousquet]

Underlying Idea
Approximate optimization algorithms might be sufficient for learning purposes. [Bottou, Bousquet]

Implications:
- Small-scale: Trade-off between approximation error and estimation error
- Large-scale: Computational complexity dominates

Long story short:
The best optimization methods might not be the best learning methods!


Empirical results

Empirical evidence for SGD being a better learner than optimizer.

RCV1, text classification, see e.g. Bottou, Stochastic Gradient Descent Tricks


3 The Toolbox of Optimization Methods


Gradient Descent

Minimize a given function f:
min f(x), x ∈ R^n

Direction of Steepest Descent, the negative gradient:

d = −∇f(x)

Update in step k

x_{k+1} = x_k − α∇f(x_k)

Properties:
- always a descent direction, no test needed
- locally optimal, globally convergent
- works with inexact line search, e.g. Armijo's rule (see the sketch below)
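A minimal sketch (illustrative, not the speaker's code) of gradient descent with Armijo backtracking: shrink the stepsize until a sufficient-decrease condition holds.

```python
import numpy as np

def gradient_descent_armijo(f, grad, x0, alpha0=1.0, beta=0.5, c=1e-4,
                            max_iter=100, tol=1e-8):
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:
            break
        d = -g                                          # steepest-descent direction
        alpha = alpha0
        # Armijo's rule: shrink alpha until sufficient decrease holds
        while f(x + alpha * d) > f(x) + c * alpha * (g @ d):
            alpha *= beta
        x = x + alpha * d
    return x

# usage on a simple quadratic f(x) = 0.5 x^T A x
A = np.diag([1.0, 10.0])
f = lambda x: 0.5 * x @ A @ x
grad = lambda x: A @ x
print(gradient_descent_armijo(f, grad, x0=[1.0, 1.0]))
```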


Stochastic Gradient Descent

Setting

f(x) := ∑_i f_i(x), ∇f(x) := ∑_i ∇f_i(x), i = 1, . . . , m, where m is the number of training examples

Choose i and update in step k:

x_{k+1} = x_k − α∇f_i(x_k)
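A minimal sketch of this update (illustrative, not the speaker's code): pick a training example i at random and step along −∇f_i.

```python
import numpy as np

def sgd(grad_i, x0, m, alpha=0.01, epochs=10, seed=0):
    """grad_i(x, i) returns the gradient of the i-th per-example loss f_i."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    for _ in range(epochs * m):
        i = rng.integers(m)                 # choose i
        x = x - alpha * grad_i(x, i)        # x_{k+1} = x_k - alpha * grad f_i(x_k)
    return x

# usage: least squares, f_i(x) = 0.5 * (a_i . x - b_i)^2
rng = np.random.default_rng(1)
A, b = rng.normal(size=(100, 3)), rng.normal(size=100)
grad_i = lambda x, i: (A[i] @ x - b[i]) * A[i]
print(sgd(grad_i, x0=np.zeros(3), m=100))
```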


Shortcomings of Gradient Descent

- local: only local information used
- especially: no curvature information used
- greedy: prefers high curvature directions
- scale invariant: no

James Martens, Deep learning via Hessian-free optimization


Momentum

Update in step k

z_{k+1} = βz_k + ∇f(x_k)

x_{k+1} = x_k − αz_{k+1}

Properties for a quadratic convex objective:
- the condition number κ effectively improves by a square root
- stepsizes can be twice as long
- order of convergence (√κ − 1)/(√κ + 1) instead of (κ − 1)/(κ + 1)
- can diverge if β is not properly chosen/adapted

Gabriel Goh, Why momentum really works
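A minimal sketch of exactly this update (heavy-ball momentum, as in Goh's article; the quadratic test problem and constants are illustrative).

```python
import numpy as np

def momentum_descent(grad, x0, alpha=0.01, beta=0.9, iters=300):
    x = np.asarray(x0, dtype=float)
    z = np.zeros_like(x)
    for _ in range(iters):
        z = beta * z + grad(x)      # z_{k+1} = beta z_k + grad f(x_k)
        x = x - alpha * z           # x_{k+1} = x_k - alpha z_{k+1}
    return x

# usage on an ill-conditioned quadratic, where momentum helps most;
# note that too large alpha or beta makes the iteration diverge
A = np.diag([1.0, 100.0])
print(momentum_descent(lambda x: A @ x, x0=[1.0, 1.0]))
```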


Momentum

D E M O


Adam

Properties:
- combines several clever tricks (from Momentum, RMSprop, AdaGrad)
- has some similarities to Trust Region methods
- empirically proven, best in class (personal opinion)

Kingma, Ba: Adam - A Method for Stochastic Optimization
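A minimal sketch of the Adam update as described in Kingma & Ba (exponential moving averages of the gradient and its square, with bias correction); the hyperparameter defaults follow the paper, the test problem is illustrative.

```python
import numpy as np

def adam(grad, x0, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8, iters=1000):
    x = np.asarray(x0, dtype=float)
    m = np.zeros_like(x)            # first moment (mean of gradients)
    v = np.zeros_like(x)            # second moment (uncentered variance)
    for t in range(1, iters + 1):
        g = grad(x)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g**2
        m_hat = m / (1 - beta1**t)  # bias correction
        v_hat = v / (1 - beta2**t)
        x = x - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return x

# usage on the same ill-conditioned quadratic as above
A = np.diag([1.0, 100.0])
print(adam(lambda x: A @ x, x0=[1.0, 1.0], alpha=0.01, iters=2000))
```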


SGD, Momentum and more

D E M O


L-BFGS and Nonlinear CG

Observations so far:
- The better the method, the more parameters to tune.
- All better methods try to incorporate curvature information.
- Why not do so directly?

L-BFGS
Quasi-Newton method that builds an approximation of the (inverse) Hessian and scales the gradient accordingly.

Nonlinear CG
Informally speaking, Nonlinear CG tries to solve a quadratic approximation of the function.

No surprise: They also work with minibatches.
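Below is a hedged sketch (assuming SciPy's L-BFGS-B routine; it mimics the minibatch setup described by Ng et al. rather than any code from this talk): run a few L-BFGS iterations per minibatch and warm-start the next batch from the current iterate.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
A, b = rng.normal(size=(1000, 5)), rng.normal(size=1000)   # toy least-squares data

def batch_loss_and_grad(x, idx):
    """Loss and gradient on one minibatch given by the index array idx."""
    r = A[idx] @ x - b[idx]
    return 0.5 * r @ r / len(idx), A[idx].T @ r / len(idx)

x = np.zeros(5)
for epoch in range(5):
    for idx in np.array_split(rng.permutation(len(b)), 10):   # 10 minibatches
        res = minimize(batch_loss_and_grad, x, args=(idx,), jac=True,
                       method="L-BFGS-B", options={"maxiter": 3})
        x = res.x                                              # warm start next batch
print(x)
```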


Empirical results

Empirical evidence for better optimizers being better learners.

MNIST, handwritten digit recognition, from Ng et al., On Optimization Methods for Deep Learning


Truncated Newton: Hessian-Free Optimization

Main ideas:
- Approximate not the Hessian H itself, but the matrix-vector product Hd.
- Use finite differences instead of the exact Hessian (see the sketch below).
- Use damping.
- Use the linear CG method for solving the quadratic approximation.
- Use a clever mini-batch strategy for large data sets.
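A minimal sketch of the first two ingredients (my illustration, not Martens' code): a Hessian-vector product via finite differences of the gradient, Hd ≈ (∇f(x + εd) − ∇f(x))/ε, fed into plain linear CG to approximately solve the quadratic model; damping and minibatching are omitted.

```python
import numpy as np

def hvp_fd(grad, x, d, eps=1e-5):
    """Finite-difference Hessian-vector product H d at x."""
    return (grad(x + eps * d) - grad(x)) / eps

def cg_solve(matvec, rhs, iters=10):
    """Plain linear conjugate gradients for matvec(p) = rhs."""
    p = np.zeros_like(rhs)
    r = rhs - matvec(p)
    d = r.copy()
    for _ in range(iters):
        if r @ r < 1e-12:                    # residual already negligible
            break
        Hd = matvec(d)
        alpha = (r @ r) / (d @ Hd)
        p, r_new = p + alpha * d, r - alpha * Hd
        beta = (r_new @ r_new) / (r @ r)
        d, r = r_new + beta * d, r_new
    return p

# usage: one truncated-Newton step on a quadratic f(x) = 0.5 x^T A x
A = np.diag([1.0, 10.0, 100.0])
grad = lambda x: A @ x
x = np.array([1.0, 1.0, 1.0])
step = cg_solve(lambda d: hvp_fd(grad, x, d), -grad(x), iters=10)
print(x + step)
```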


Empirical test on pathological problems

Main results:
- The addition problem is known to be effectively impossible for gradient descent; HF solved it.
- Basic RNN cells are used, no specialized architectures (LSTMs etc.).

(Martens/Sutskever (2011), Hochreiter/Schmidhuber (1997))


4 Takeaways


Summary

In the long run, the biggest bottleneck will be the sequential parts of an algorithm. That's why the number of iterations needs to be small. SGD and its successors tend to need many more iterations, and they cannot benefit as much from higher parallelism (GPUs).

But whatever you do/prefer/choose:
- At least use successors of SGD: Momentum, Adam etc.
- Look for generic approaches instead of more and more specialized and manually finetuned solutions.
- Key aspects:
  - Initialization
  - Adaptive choice of stepsizes/momentum/...
  - Scaling of the gradient


Resources

- Overview of Gradient Descent methods
- Why momentum really works (Gabriel Goh)
- Adam - A Method for Stochastic Optimization (Kingma, Ba)
- Andrew Ng et al. about L-BFGS and CG outperforming SGD
- Lecture slides: Neural Networks for Machine Learning (Hinton et al.)
- On the importance of initialization and momentum in deep learning
- Data-Science-Blog: Summary article in preparation (Stefan Kühn)
- The Neural Network Zoo

