
Page 1: Deep Learning and Optimization Methods

Deep Learning and Optimization Methods

Stefan Kühn

Join me on XING

Data Science Meetup Hamburg - July 27th, 2017


Page 2: Deep Learning and Optimization Methods

Contents

1 Training Deep Networks

2 Training and Learning

3 The Toolbox of Optimization Methods

4 Takeaways


Page 4: Deep Learning and Optimization Methods

Deep Learning

Neural Networks - Universal Approximation Theorem
A 1-hidden-layer feed-forward neural net with a finite number of parameters can approximate any continuous function on compact subsets of R^n.

Questions:
Why do we need deep learning at all?
- theoretical result
- approximation by piecewise constant functions (not what you might want for classification/regression)

Why are deep nets harder to train than shallow nets?
- More parameters to be learned by training?
- More hyperparameters to be set before training?
- Numerical issues?

Disclaimer: ideas stolen from Martens, Sutskever, Bengio et al. and many more.


Page 5: Deep Learning and Optimization Methods

Example: RNNs

Recurrent Neural Nets
Extremely powerful for modeling sequential data, e.g. time series, but extremely hard to train (somewhat less hard for LSTMs/GRUs).

Main Advantages:
- Qualitatively: flexible and rich model class
- Practically: gradients easily computed by backpropagation through time (BPTT)

Main Problems:
- Qualitatively: learning long-term dependencies
- Practically: gradient-based methods struggle when the separation between input and target output is large


Page 6: Deep Learning and Optimization Methods

Example: RNNs

Recurrent Neural Nets
Highly volatile relationship between parameters and hidden states.

Indicators:
- Vanishing/exploding gradients
- Internal covariate shift

Remedies:
- ReLU
- 'Careful' initialization
- Small stepsizes
- (Recurrent) Batch Normalization
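
The vanishing/exploding gradient indicator can be made concrete with a minimal NumPy sketch: in a purely linear recurrence the backpropagated gradient is multiplied repeatedly by the recurrent weight matrix, so its norm shrinks or grows geometrically with the spectral radius. Matrix sizes and scaling factors below are illustrative assumptions, not from the slides.

import numpy as np

rng = np.random.default_rng(0)
n, T = 50, 100                      # hidden size, sequence length

def gradient_norm_through_time(scale):
    # Random recurrent weight matrix, rescaled to a chosen spectral radius.
    W = rng.standard_normal((n, n)) / np.sqrt(n)
    W *= scale / max(abs(np.linalg.eigvals(W)))
    g = np.ones(n)                  # gradient arriving at the last time step
    for _ in range(T):              # backpropagate through T linear steps
        g = W.T @ g
    return np.linalg.norm(g)

print(gradient_norm_through_time(0.9))   # shrinks towards 0: vanishing
print(gradient_norm_through_time(1.1))   # blows up: exploding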


Page 7: Deep Learning and Optimization Methods

Example: RNNs

Recurrent Neural Nets and LSTM
Hochreiter/Schmidhuber proposed a change of the RNN architecture by adding Long Short-Term Memory units.

Vanishing/exploding gradients? Fixed linear dynamics, no longer problematic.

Any open questions?
- Gradient-based training works better with LSTMs.
- LSTMs can compensate for one deficiency of gradient-based learning, but is this the only one?

Most problems are related to specific numerical issues.


Page 8: Deep Learning and Optimization Methods

1 Training Deep Networks

2 Training and Learning

3 The Toolbox of Optimization Methods

4 Takeaways


Page 9: Deep Learning and Optimization Methods

Trade-offs between Optimization and Learning

Computational complexity becomes the limiting factor when one envisions large amounts of training data. [Bottou, Bousquet]

Underlying Idea
Approximate optimization algorithms might be sufficient for learning purposes. [Bottou, Bousquet]

Implications:
- Small-scale: trade-off between approximation error and estimation error
- Large-scale: computational complexity dominates

Long story short: The best optimization methods might not be the best learning methods!


Page 10: Deep Learning and Optimization Methods

Empirical results

Empirical evidence for SGD being a better learner than optimizer.

RCV1, text classification, see e.g. Bottou, Stochastic Gradient Descent Tricks


Page 11: Deep Learning and Optimization Methods

1 Training Deep Networks

2 Training and Learning

3 The Toolbox of Optimization Methods

4 Takeaways


Page 12: Deep Learning and Optimization Methods

Gradient Descent

Minimize a given function f:

min f(x), x ∈ R^n

Direction of Steepest Descent, the negative gradient:

d = −∇f (x)

Update in step k

x_{k+1} = x_k − α ∇f(x_k)

Properties:
- always a descent direction, no test needed
- locally optimal, globally convergent
- works with inexact line search, e.g. Armijo's rule (see the sketch below)
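
As a concrete illustration of the update x_{k+1} = x_k − α∇f(x_k) combined with an inexact (Armijo) line search, a minimal NumPy sketch on a simple quadratic; the test function and constants are illustrative choices, not from the slides.

import numpy as np

def gradient_descent(f, grad, x0, iters=100, alpha0=1.0, c=1e-4, rho=0.5):
    x = x0.astype(float)
    for _ in range(iters):
        g = grad(x)
        d = -g                                  # steepest descent direction
        alpha = alpha0
        # Armijo rule: backtrack until sufficient decrease is achieved.
        while f(x + alpha * d) > f(x) + c * alpha * g @ d:
            alpha *= rho
        x = x + alpha * d
    return x

# Example: minimize a badly scaled quadratic.
A = np.diag([1.0, 10.0])
f = lambda x: 0.5 * x @ A @ x
grad = lambda x: A @ x
print(gradient_descent(f, grad, np.array([5.0, 5.0])))   # approaches the origin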


Page 13: Deep Learning and Optimization Methods

Stochastic Gradient Descent

Setting

f(x) := Σ_i f_i(x),   ∇f(x) := Σ_i ∇f_i(x),   i = 1, . . . , m (number of training examples)

Choose i and update in step k

x_{k+1} = x_k − α ∇f_i(x_k)
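
A minimal NumPy sketch of this update, sampling one example index per step; the least-squares objective and all names below are illustrative, not from the slides.

import numpy as np

def sgd(grad_i, m, x0, alpha=0.01, epochs=10, seed=0):
    # grad_i(i, x) returns the gradient of f_i at x, for i = 0..m-1.
    rng = np.random.default_rng(seed)
    x = x0.astype(float)
    for _ in range(epochs):
        for i in rng.permutation(m):        # visit each example once per epoch
            x = x - alpha * grad_i(i, x)    # x_{k+1} = x_k - alpha * grad f_i(x_k)
    return x

# Example: least squares, f_i(x) = 0.5 * (a_i @ x - b_i)^2
rng = np.random.default_rng(1)
A, b = rng.standard_normal((100, 3)), rng.standard_normal(100)
grad_i = lambda i, x: (A[i] @ x - b[i]) * A[i]
print(sgd(grad_i, m=100, x0=np.zeros(3)))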


Page 14: Deep Learning and Optimization Methods

Shortcomings of Gradient Descent

- local: only local information used (especially: no curvature information used)
- greedy: prefers high-curvature directions
- scale invariant: no

James Martens, Deep learning via Hessian-free optimization


Page 15: Deep Learning and Optimization Methods

Momentum

Update in step k

z_{k+1} = β z_k + ∇f(x_k)

x_{k+1} = x_k − α z_{k+1}

Properties for a quadratic convex objective:
- the condition number κ of the problem effectively improves to its square root √κ
- stepsizes can be twice as long
- rate of convergence (√κ − 1)/(√κ + 1) instead of (κ − 1)/(κ + 1)
- can diverge if β is not properly chosen/adapted

Gabriel Goh, Why momentum really works
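
The recursion above translates directly into code; a minimal NumPy sketch, where the test problem and hyperparameters are illustrative choices, not from the slides.

import numpy as np

def momentum(grad, x0, alpha=0.01, beta=0.9, iters=500):
    x = x0.astype(float)
    z = np.zeros_like(x)
    for _ in range(iters):
        z = beta * z + grad(x)      # z_{k+1} = beta * z_k + grad f(x_k)
        x = x - alpha * z           # x_{k+1} = x_k - alpha * z_{k+1}
    return x

# Example: ill-conditioned quadratic, where momentum helps most.
A = np.diag([1.0, 100.0])
grad = lambda x: A @ x
print(momentum(grad, np.array([1.0, 1.0])))   # approaches the minimizer at the origin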


Page 16: Deep Learning and Optimization Methods

Momentum

D E M O


Page 17: Deep Learning and Optimization Methods

Adam

Properties:
- combines several clever tricks (from Momentum, RMSprop, AdaGrad)
- has some similarities to Trust Region methods
- empirically proven, best in class (personal opinion)

Kingma, Ba: Adam - A Method for Stochastic Optimization
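
For reference, a minimal NumPy sketch of the update from Kingma & Ba's paper, with bias-corrected first and second moment estimates; hyperparameter defaults follow the paper, the test objective is illustrative.

import numpy as np

def adam(grad, x0, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8, iters=1000):
    x = x0.astype(float)
    m = np.zeros_like(x)            # first moment estimate
    v = np.zeros_like(x)            # second moment estimate
    for t in range(1, iters + 1):
        g = grad(x)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** t)    # bias correction
        v_hat = v / (1 - beta2 ** t)
        x = x - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return x

grad = lambda x: np.array([2 * x[0], 20 * x[1]])   # gradient of x0^2 + 10*x1^2
print(adam(grad, np.array([1.0, 1.0])))            # moves towards the origin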


Page 18: Deep Learning and Optimization Methods

SGD, Momentum and more

D E M O


Page 19: Deep Learning and Optimization Methods

L-BFGS and Nonlinear CG

Observations so far:
- The better the method, the more parameters to tune.
- All better methods try to incorporate curvature information.
Why not do so directly?

L-BFGS
Quasi-Newton method, builds an approximation of the (inverse) Hessian and scales the gradient accordingly.

Nonlinear CG
Informally speaking, Nonlinear CG tries to solve a quadratic approximation of the function.

No surprise: They also work with minibatches.
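
In practice one rarely implements L-BFGS by hand; a minimal sketch using SciPy's L-BFGS-B implementation on a standard test function (the Rosenbrock function here is an illustrative choice, not from the slides).

import numpy as np
from scipy.optimize import minimize

# Rosenbrock function and its gradient as a stand-in objective.
def f(x):
    return 100.0 * (x[1] - x[0] ** 2) ** 2 + (1.0 - x[0]) ** 2

def grad(x):
    return np.array([
        -400.0 * x[0] * (x[1] - x[0] ** 2) - 2.0 * (1.0 - x[0]),
        200.0 * (x[1] - x[0] ** 2),
    ])

res = minimize(f, x0=np.array([-1.2, 1.0]), jac=grad, method="L-BFGS-B")
print(res.x)        # close to the minimizer (1, 1)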


Page 20: Deep Learning and Optimization Methods

Empirical results

Empirical evidence for better optimizers being better learners.

MNIST, handwritten digit recognition, from Ng et al., On Optimization Methods for Deep Learning


Page 21: Deep Learning and Optimization Methods

Truncated Newton: Hessian-Free Optimization

Main ideas:
- Approximate not the Hessian H, but the matrix-vector product Hd.
- Use finite differences instead of the exact Hessian.
- Use damping.
- Use the linear CG method for solving the quadratic approximation.
- Use a clever mini-batch strategy for large data sets.
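
A minimal NumPy sketch of the core trick on a quadratic test problem: the product Hd is approximated by finite differences of gradients and fed into linear CG to compute a Newton-like step. Damping and the mini-batch strategy are omitted, and all names are illustrative, not from the slides.

import numpy as np

def hessian_vector_product(grad, x, d, eps=1e-6):
    # Hd ~ (grad f(x + eps*d) - grad f(x)) / eps : finite differences, no explicit Hessian.
    return (grad(x + eps * d) - grad(x)) / eps

def conjugate_gradient(matvec, b, iters=50, tol=1e-10):
    # Solve A p = b using only matrix-vector products with A.
    p = np.zeros_like(b)
    r = b.copy()                    # residual b - A p
    d = r.copy()
    for _ in range(iters):
        Ad = matvec(d)
        alpha = (r @ r) / (d @ Ad)
        p = p + alpha * d
        r_new = r - alpha * Ad
        if np.linalg.norm(r_new) < tol:
            break
        d = r_new + ((r_new @ r_new) / (r @ r)) * d
        r = r_new
    return p

# Newton-like step for a quadratic test problem: solve H p = -grad f(x).
A = np.diag([1.0, 10.0, 100.0])
grad = lambda x: A @ x
x = np.array([1.0, 1.0, 1.0])
step = conjugate_gradient(lambda d: hessian_vector_product(grad, x, d), -grad(x))
print(x + step)                     # close to the minimizer at the origin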


Page 22: Deep Learning and Optimization Methods

Empirical test on pathological problems

Main results:
- The addition problem is known to be effectively impossible for gradient descent; HF solved it.
- Basic RNN cells are used, no specialized architectures (LSTMs etc.).

(Martens/Sutskever (2011), Hochreiter/Schmidhuber (1997))


Page 23: Deep Learning and Optimization Methods

1 Training Deep Networks

2 Training and Learning

3 The Toolbox of Optimization Methods

4 Takeaways


Page 24: Deep Learning and Optimization Methods

Summary

In the long run, the biggest bottleneck will be the sequential parts of an algorithm. That's why the number of iterations needs to be small. SGD and its successors tend to need many more iterations, and they cannot benefit as much from higher parallelism (GPUs).

But whatever you do/prefer/choose:
- At least use successors of SGD: Momentum, Adam etc.
- Look for generic approaches instead of more and more specialized and manually fine-tuned solutions.

Key aspects:
- Initialization
- Adaptive choice of stepsizes/momentum/. . .
- Scaling of the gradient


Page 25: Deep Learning and Optimization Methods

Resources

- Overview of Gradient Descent methods
- Why momentum really works
- Adam - A Method for Stochastic Optimization
- Andrew Ng et al. about L-BFGS and CG outperforming SGD
- Lecture Slides: Neural Networks for Machine Learning - Hinton et al.
- On the importance of initialization and momentum in deep learning
- Data-Science-Blog: Summary article in preparation (Stefan Kühn)
- The Neural Network Zoo

