
Page 1: Deep Learning and Optimization Methods

Deep Learning and Optimization Methods

Stefan Kühn

Join me on XING

Data Science Meetup Hamburg - July 27th, 2017


Page 2: Deep Learning and Optimization Methods

Contents

1 Training Deep Networks

2 Training and Learning

3 The Toolbox of Optimization Methods

4 Takeaways


Page 4: Deep Learning and Optimization Methods

Deep Learning

Neural Networks - Universal Approximation Theorem
A 1-hidden-layer feed-forward neural net with a finite number of parameters can approximate any continuous function on compact subsets of R^n.

Questions:
Why do we need deep learning at all?
- theoretical result
- approximation by piecewise constant functions (not what you might want for classification/regression)

Why are deep nets harder to train than shallow nets?
- More parameters to be learned by training?
- More hyperparameters to be set before training?
- Numerical issues?

Disclaimer: ideas stolen from Martens, Sutskever, Bengio et al. and many more.


Page 5: Deep Learning and Optimization Methods

Example: RNNs

Recurrent Neural Nets
Extremely powerful for modeling sequential data, e.g. time series, but extremely hard to train (somewhat less hard for LSTMs/GRUs).

Main Advantages:
- Qualitatively: flexible and rich model class
- Practically: gradients easily computed by backpropagation through time (BPTT)

Main Problems:
- Qualitatively: learning long-term dependencies
- Practically: gradient-based methods struggle when the separation between input and target output is large


Page 6: Deep Learning and Optimization Methods

Example: RNNs

Recurrent Neural Nets
Highly volatile relationship between parameters and hidden states.

Indicators:
- Vanishing/exploding gradients
- Internal covariate shift

Remedies:
- ReLU
- 'Careful' initialization
- Small stepsizes
- (Recurrent) Batch Normalization
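
The vanishing/exploding gradient indicator can be made concrete with a minimal NumPy sketch: in a purely linear recurrence the backpropagated gradient is multiplied repeatedly by the recurrent weight matrix, so its norm shrinks or grows geometrically with the spectral radius. Matrix sizes and scaling factors below are illustrative assumptions, not from the slides.

import numpy as np

rng = np.random.default_rng(0)
n, T = 50, 100                      # hidden size, sequence length

def gradient_norm_through_time(scale):
    # Random recurrent weight matrix, rescaled to a chosen spectral radius.
    W = rng.standard_normal((n, n)) / np.sqrt(n)
    W *= scale / max(abs(np.linalg.eigvals(W)))
    g = np.ones(n)                  # gradient arriving at the last time step
    for _ in range(T):              # backpropagate through T linear steps
        g = W.T @ g
    return np.linalg.norm(g)

print(gradient_norm_through_time(0.9))   # shrinks towards 0: vanishing
print(gradient_norm_through_time(1.1))   # blows up: exploding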


Page 7: Deep Learning and Optimization Methods

Example: RNNs

Recurrent Neural Nets and LSTM
Hochreiter/Schmidhuber proposed a change of the RNN architecture by adding Long Short-Term Memory units.

Vanishing/exploding gradients? Fixed linear dynamics, no longer problematic.

Any open questions?
- Gradient-based training works better with LSTMs.
- LSTMs can compensate for one deficiency of gradient-based learning, but is this the only one?

Most problems are related to specific numerical issues.


Page 8: Deep Learning and Optimization Methods

1 Training Deep Networks

2 Training and Learning

3 The Toolbox of Optimization Methods

4 Takeaways


Page 9: Deep Learning and Optimization Methods

Trade-offs between Optimization and Learning

Computational complexity becomes the limiting factor when one envisions large amounts of training data. [Bottou, Bousquet]

Underlying Idea
Approximate optimization algorithms might be sufficient for learning purposes. [Bottou, Bousquet]

Implications:
- Small-scale: trade-off between approximation error and estimation error
- Large-scale: computational complexity dominates

Long story short: The best optimization methods might not be the best learning methods!


Page 10: Deep Learning and Optimization Methods

Empirical results

Empirical evidence for SGD being a better learner than optimizer.

RCV1, text classification, see e.g. Bottou, Stochastic Gradient Descent Tricks


Page 11: Deep Learning and Optimization Methods

1 Training Deep Networks

2 Training and Learning

3 The Toolbox of Optimization Methods

4 Takeaways


Page 12: Deep Learning and Optimization Methods

Gradient Descent

Minimize a given function f:

min f(x), x ∈ R^n

Direction of Steepest Descent, the negative gradient:

d = −∇f (x)

Update in step k

x_{k+1} = x_k − α ∇f(x_k)

Properties:
- always a descent direction, no test needed
- locally optimal, globally convergent
- works with inexact line search, e.g. Armijo's rule (see the sketch below)
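
As a concrete illustration of the update x_{k+1} = x_k − α∇f(x_k) combined with an inexact (Armijo) line search, a minimal NumPy sketch on a simple quadratic; the test function and constants are illustrative choices, not from the slides.

import numpy as np

def gradient_descent(f, grad, x0, iters=100, alpha0=1.0, c=1e-4, rho=0.5):
    x = x0.astype(float)
    for _ in range(iters):
        g = grad(x)
        d = -g                                  # steepest descent direction
        alpha = alpha0
        # Armijo rule: backtrack until sufficient decrease is achieved.
        while f(x + alpha * d) > f(x) + c * alpha * g @ d:
            alpha *= rho
        x = x + alpha * d
    return x

# Example: minimize a badly scaled quadratic.
A = np.diag([1.0, 10.0])
f = lambda x: 0.5 * x @ A @ x
grad = lambda x: A @ x
print(gradient_descent(f, grad, np.array([5.0, 5.0])))   # approaches the origin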


Page 13: Deep Learning and Optimization Methods

Stochastic Gradient Descent

Setting

f(x) := Σ_i f_i(x),   ∇f(x) := Σ_i ∇f_i(x),   i = 1, . . . , m (number of training examples)

Choose i and update in step k

x_{k+1} = x_k − α ∇f_i(x_k)
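
A minimal NumPy sketch of this update, sampling one example index per step; the least-squares objective and all names below are illustrative, not from the slides.

import numpy as np

def sgd(grad_i, m, x0, alpha=0.01, epochs=10, seed=0):
    # grad_i(i, x) returns the gradient of f_i at x, for i = 0..m-1.
    rng = np.random.default_rng(seed)
    x = x0.astype(float)
    for _ in range(epochs):
        for i in rng.permutation(m):        # visit each example once per epoch
            x = x - alpha * grad_i(i, x)    # x_{k+1} = x_k - alpha * grad f_i(x_k)
    return x

# Example: least squares, f_i(x) = 0.5 * (a_i @ x - b_i)^2
rng = np.random.default_rng(1)
A, b = rng.standard_normal((100, 3)), rng.standard_normal(100)
grad_i = lambda i, x: (A[i] @ x - b[i]) * A[i]
print(sgd(grad_i, m=100, x0=np.zeros(3)))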


Page 14: Deep Learning and Optimization Methods

Shortcomings of Gradient Descent

- local: only local information used (especially: no curvature information used)
- greedy: prefers high-curvature directions
- scale invariant: no

James Martens, Deep learning via Hessian-free optimization


Page 15: Deep Learning and Optimization Methods

Momentum

Update in step k

z_{k+1} = β z_k + ∇f(x_k)

x_{k+1} = x_k − α z_{k+1}

Properties for a quadratic convex objective:
- the condition number κ of the problem effectively improves to its square root √κ
- stepsizes can be twice as long
- rate of convergence (√κ − 1)/(√κ + 1) instead of (κ − 1)/(κ + 1)
- can diverge if β is not properly chosen/adapted

Gabriel Goh, Why momentum really works
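
The recursion above translates directly into code; a minimal NumPy sketch, where the test problem and hyperparameters are illustrative choices, not from the slides.

import numpy as np

def momentum(grad, x0, alpha=0.01, beta=0.9, iters=500):
    x = x0.astype(float)
    z = np.zeros_like(x)
    for _ in range(iters):
        z = beta * z + grad(x)      # z_{k+1} = beta * z_k + grad f(x_k)
        x = x - alpha * z           # x_{k+1} = x_k - alpha * z_{k+1}
    return x

# Example: ill-conditioned quadratic, where momentum helps most.
A = np.diag([1.0, 100.0])
grad = lambda x: A @ x
print(momentum(grad, np.array([1.0, 1.0])))   # approaches the minimizer at the origin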


Page 16: Deep Learning and Optimization Methods

Momentum

D E M O


Page 17: Deep Learning and Optimization Methods

Adam

Properties:
- combines several clever tricks (from Momentum, RMSprop, AdaGrad)
- has some similarities to Trust Region methods
- empirically proven, best in class (personal opinion)

Kingma, Ba: Adam - A Method for Stochastic Optimization
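
For reference, a minimal NumPy sketch of the update from Kingma & Ba's paper, with bias-corrected first and second moment estimates; hyperparameter defaults follow the paper, the test objective is illustrative.

import numpy as np

def adam(grad, x0, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8, iters=1000):
    x = x0.astype(float)
    m = np.zeros_like(x)            # first moment estimate
    v = np.zeros_like(x)            # second moment estimate
    for t in range(1, iters + 1):
        g = grad(x)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** t)    # bias correction
        v_hat = v / (1 - beta2 ** t)
        x = x - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return x

grad = lambda x: np.array([2 * x[0], 20 * x[1]])   # gradient of x0^2 + 10*x1^2
print(adam(grad, np.array([1.0, 1.0])))            # moves towards the origin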


Page 18: Deep Learning and Optimization Methods

SGD, Momentum and more

D E M O


Page 19: Deep Learning and Optimization Methods

L-BFGS and Nonlinear CG

Observations so far:
- The better the method, the more parameters to tune.
- All better methods try to incorporate curvature information.
Why not do so directly?

L-BFGS
Quasi-Newton method, builds an approximation of the (inverse) Hessian and scales the gradient accordingly.

Nonlinear CG
Informally speaking, Nonlinear CG tries to solve a quadratic approximation of the function.

No surprise: They also work with minibatches.
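
In practice one rarely implements L-BFGS by hand; a minimal sketch using SciPy's L-BFGS-B implementation on a standard test function (the Rosenbrock function here is an illustrative choice, not from the slides).

import numpy as np
from scipy.optimize import minimize

# Rosenbrock function and its gradient as a stand-in objective.
def f(x):
    return 100.0 * (x[1] - x[0] ** 2) ** 2 + (1.0 - x[0]) ** 2

def grad(x):
    return np.array([
        -400.0 * x[0] * (x[1] - x[0] ** 2) - 2.0 * (1.0 - x[0]),
        200.0 * (x[1] - x[0] ** 2),
    ])

res = minimize(f, x0=np.array([-1.2, 1.0]), jac=grad, method="L-BFGS-B")
print(res.x)        # close to the minimizer (1, 1)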


Page 20: Deep Learning and Optimization Methods

Empirical results

Empirical evidence for better optimizers being better learners.

MNIST, handwritten digit recognition, from Ng et al., On Optimization Methods for Deep Learning


Page 21: Deep Learning and Optimization Methods

Truncated Newton: Hessian-Free Optimization

Main ideas:
- Approximate not the Hessian H, but the matrix-vector product Hd.
- Use finite differences instead of the exact Hessian.
- Use damping.
- Use the linear CG method for solving the quadratic approximation.
- Use a clever mini-batch strategy for large data sets.
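
A minimal NumPy sketch of the core trick on a quadratic test problem: the product Hd is approximated by finite differences of gradients and fed into linear CG to compute a Newton-like step. Damping and the mini-batch strategy are omitted, and all names are illustrative, not from the slides.

import numpy as np

def hessian_vector_product(grad, x, d, eps=1e-6):
    # Hd ~ (grad f(x + eps*d) - grad f(x)) / eps : finite differences, no explicit Hessian.
    return (grad(x + eps * d) - grad(x)) / eps

def conjugate_gradient(matvec, b, iters=50, tol=1e-10):
    # Solve A p = b using only matrix-vector products with A.
    p = np.zeros_like(b)
    r = b.copy()                    # residual b - A p
    d = r.copy()
    for _ in range(iters):
        Ad = matvec(d)
        alpha = (r @ r) / (d @ Ad)
        p = p + alpha * d
        r_new = r - alpha * Ad
        if np.linalg.norm(r_new) < tol:
            break
        d = r_new + ((r_new @ r_new) / (r @ r)) * d
        r = r_new
    return p

# Newton-like step for a quadratic test problem: solve H p = -grad f(x).
A = np.diag([1.0, 10.0, 100.0])
grad = lambda x: A @ x
x = np.array([1.0, 1.0, 1.0])
step = conjugate_gradient(lambda d: hessian_vector_product(grad, x, d), -grad(x))
print(x + step)                     # close to the minimizer at the origin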


Page 22: Deep Learning and Optimization Methods

Empirical test on pathological problems

Main results:
- The addition problem is known to be effectively impossible for gradient descent; HF solved it.
- Basic RNN cells are used, no specialized architectures (LSTMs etc.).

(Martens/Sutskever (2011), Hochreiter/Schmidhuber (1997))


Page 23: Deep Learning and Optimization Methods

1 Training Deep Networks

2 Training and Learning

3 The Toolbox of Optimization Methods

4 Takeaways


Page 24: Deep Learning and Optimization Methods

Summary

In the long run, the biggest bottleneck will be the sequential parts of an algorithm. That's why the number of iterations needs to be small. SGD and its successors tend to need many more iterations, and they cannot benefit as much from higher parallelism (GPUs).

But whatever you do/prefer/choose:
- At least use successors of SGD: Momentum, Adam etc.
- Look for generic approaches instead of more and more specialized and manually fine-tuned solutions.

Key aspects:
- Initialization
- Adaptive choice of stepsizes/momentum/. . .
- Scaling of the gradient


Page 25: Deep Learning and Optimization Methods

Resources

- Overview of Gradient Descent methods
- Why momentum really works
- Adam - A Method for Stochastic Optimization
- Andrew Ng et al. about L-BFGS and CG outperforming SGD
- Lecture Slides: Neural Networks for Machine Learning - Hinton et al.
- On the importance of initialization and momentum in deep learning
- Data-Science-Blog: Summary article in preparation (Stefan Kühn)
- The Neural Network Zoo

