
Deep Learning and Optimization Methods

Stefan Kühn

Join me on XING

Data Science Meetup Hamburg - July 27th, 2017


Contents

1 Training Deep Networks

2 Training and Learning

3 The Toolbox of Optimization Methods

4 Takeaways


Deep Learning

Neural Networks - Universal Approximation Theorem
A 1-hidden-layer feed-forward neural net with a finite number of parameters can approximate any continuous function on compact subsets of R^n.
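To make the statement concrete, here is a minimal sketch (not from the slides; names and sizes are illustrative) of the function class the theorem talks about: a single hidden layer of sigmoid units followed by a linear readout.

```python
# Function class of the universal approximation theorem:
#   f(x) = sum_j c_j * sigma(w_j . x + b_j)
# with a finite number of hidden units. Weights here are random, purely illustrative.
import numpy as np

def one_hidden_layer_net(x, W, b, c):
    """Evaluate f(x) = c . sigma(W x + b) for a single input vector x."""
    sigma = lambda z: 1.0 / (1.0 + np.exp(-z))   # sigmoid activation
    return c @ sigma(W @ x + b)

rng = np.random.default_rng(0)
n, hidden = 3, 50                                # input dimension, number of hidden units
W = rng.normal(size=(hidden, n))
b = rng.normal(size=hidden)
c = rng.normal(size=hidden)
print(one_hidden_layer_net(rng.normal(size=n), W, b, c))
```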

Questions: Why do we need deep learning at all?
- it is a theoretical result
- approximation by piecewise constant functions (not what you might want for classification/regression)

Why are deep nets harder to train than shallow nets?
- More parameters to be learned by training?
- More hyperparameters to be set before training?
- Numerical issues?

disclaimer — ideas stolen from Martens, Sutskever, Bengio et al. and many more —


Example: RNNs

Recurrent Neural Nets
Extremely powerful for modeling sequential data, e.g. time series, but extremely hard to train (somewhat less hard for LSTMs/GRUs).

Main Advantages:
- Qualitatively: Flexible and rich model class
- Practically: Gradients easily computed by backpropagation through time (BPTT)

Main Problems:
- Qualitatively: Learning long-term dependencies
- Practically: Gradient-based methods struggle when the separation between input and target output is large
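A toy numerical sketch of this practical problem (my illustration, not from the slides; a plain tanh RNN with random weights): backpropagation through time multiplies one Jacobian per time step, so the gradient connecting a distant input to the target either shrinks towards zero or blows up, depending on the recurrent weights.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden, T = 20, 100
for scale in (0.5, 1.5):                       # contracting vs. expanding recurrent weights
    W_hh = scale * rng.normal(size=(hidden, hidden)) / np.sqrt(hidden)
    # forward pass of a driven tanh RNN, storing the hidden states
    hs, h = [], np.zeros(hidden)
    for t in range(T):
        h = np.tanh(W_hh @ h + rng.normal(size=hidden))
        hs.append(h)
    # backward pass: multiply the gradient by one Jacobian (transposed) per time step
    grad = np.ones(hidden)                     # gradient arriving at the last hidden state
    for h in reversed(hs):
        grad = W_hh.T @ ((1.0 - h**2) * grad)  # Jacobian of h_t w.r.t. h_{t-1}, transposed
    print(f"scale={scale}: gradient norm after {T} steps = {np.linalg.norm(grad):.2e}")
```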


Example: RNNs

Recurrent Neural Nets
Highly volatile relationship between parameters and hidden states.

Indicators:
- Vanishing/exploding gradients
- Internal Covariate Shift

Remedies:
- ReLU
- 'Careful' initialization
- Small stepsizes
- (Recurrent) Batch Normalization


Example: RNNs

Recurrent Neural Nets and LSTM
Hochreiter/Schmidhuber proposed a change of the RNN architecture by adding Long Short-Term Memory (LSTM) units.

Vanishing/exploding gradients? Fixed linear dynamics, no longer problematic.

Any questions open?
- Gradient-based training works better with LSTMs
- LSTMs can compensate for one deficiency of gradient-based learning, but is this the only one?

Most problems are related to specific numerical issues.
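A minimal sketch of one LSTM step (standard textbook equations, not the speaker's code; parameter shapes are illustrative). The point is the additive cell-state update c_t = f_t * c_{t-1} + i_t * g_t, the "fixed linear dynamics" that keeps the gradient path through the cell state from vanishing.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h, c, W, U, b):
    """One LSTM step; W, U, b hold the stacked gate parameters (4*d rows)."""
    d = h.shape[0]
    z = W @ x + U @ h + b
    i, f, o, g = (sigmoid(z[:d]), sigmoid(z[d:2*d]),
                  sigmoid(z[2*d:3*d]), np.tanh(z[3*d:]))
    c_new = f * c + i * g            # additive cell-state update
    h_new = o * np.tanh(c_new)
    return h_new, c_new

rng = np.random.default_rng(0)
d, n_in = 8, 4
W, U, b = rng.normal(size=(4*d, n_in)), rng.normal(size=(4*d, d)), np.zeros(4*d)
h, c = np.zeros(d), np.zeros(d)
h, c = lstm_step(rng.normal(size=n_in), h, c, W, U, b)
```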


2 Training and Learning


Trade-offs between Optimization and Learning

Computational complexity becomes the limiting factor when one envisions large amounts of training data. [Bottou, Bousquet]

Underlying Idea
Approximate optimization algorithms might be sufficient for learning purposes. [Bottou, Bousquet]

Implications:
- Small-scale: Trade-off between approximation error and estimation error
- Large-scale: Computational complexity dominates

Long story short:
The best optimization methods might not be the best learning methods!


Empirical results

Empirical evidence for SGD being a better learner than optimizer.

RCV1, text classification, see e.g. Bottou, Stochastic Gradient Descent Tricks


3 The Toolbox of Optimization Methods


Gradient Descent

Minimize a given function f:
min f(x), x ∈ R^n

Direction of Steepest Descent, the negative gradient:

d = −∇f(x)

Update in step k

x_{k+1} = x_k − α∇f(x_k)

Properties:
- always a descent direction, no test needed
- locally optimal, globally convergent
- works with inexact line search, e.g. Armijo's rule (see the sketch below)
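A minimal sketch (illustrative, not the speaker's code) of gradient descent with Armijo backtracking: shrink the stepsize until a sufficient-decrease condition holds.

```python
import numpy as np

def gradient_descent_armijo(f, grad, x0, alpha0=1.0, beta=0.5, c=1e-4,
                            max_iter=100, tol=1e-8):
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:
            break
        d = -g                                          # steepest-descent direction
        alpha = alpha0
        # Armijo's rule: shrink alpha until sufficient decrease holds
        while f(x + alpha * d) > f(x) + c * alpha * (g @ d):
            alpha *= beta
        x = x + alpha * d
    return x

# usage on a simple quadratic f(x) = 0.5 x^T A x
A = np.diag([1.0, 10.0])
f = lambda x: 0.5 * x @ A @ x
grad = lambda x: A @ x
print(gradient_descent_armijo(f, grad, x0=[1.0, 1.0]))
```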


Stochastic Gradient Descent

Setting

f(x) := ∑_i f_i(x), ∇f(x) := ∑_i ∇f_i(x), i = 1, . . . , m, where m is the number of training examples

Choose i and update in step k:

x_{k+1} = x_k − α∇f_i(x_k)
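A minimal sketch of this update (illustrative, not the speaker's code): pick a training example i at random and step along −∇f_i.

```python
import numpy as np

def sgd(grad_i, x0, m, alpha=0.01, epochs=10, seed=0):
    """grad_i(x, i) returns the gradient of the i-th per-example loss f_i."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    for _ in range(epochs * m):
        i = rng.integers(m)                 # choose i
        x = x - alpha * grad_i(x, i)        # x_{k+1} = x_k - alpha * grad f_i(x_k)
    return x

# usage: least squares, f_i(x) = 0.5 * (a_i . x - b_i)^2
rng = np.random.default_rng(1)
A, b = rng.normal(size=(100, 3)), rng.normal(size=100)
grad_i = lambda x, i: (A[i] @ x - b[i]) * A[i]
print(sgd(grad_i, x0=np.zeros(3), m=100))
```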


Shortcomings of Gradient Descent

- local: only local information used
- especially: no curvature information used
- greedy: prefers high curvature directions
- scale invariant: no

James Martens, Deep learning via Hessian-free optimization


Momentum

Update in step k

z_{k+1} = βz_k + ∇f(x_k)

x_{k+1} = x_k − αz_{k+1}

Properties for a quadratic convex objective:
- the condition number κ effectively improves by a square root
- stepsizes can be twice as long
- order of convergence (√κ − 1)/(√κ + 1) instead of (κ − 1)/(κ + 1)
- can diverge if β is not properly chosen/adapted

Gabriel Goh, Why momentum really works
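A minimal sketch of exactly this update (heavy-ball momentum, as in Goh's article; the quadratic test problem and constants are illustrative).

```python
import numpy as np

def momentum_descent(grad, x0, alpha=0.01, beta=0.9, iters=300):
    x = np.asarray(x0, dtype=float)
    z = np.zeros_like(x)
    for _ in range(iters):
        z = beta * z + grad(x)      # z_{k+1} = beta z_k + grad f(x_k)
        x = x - alpha * z           # x_{k+1} = x_k - alpha z_{k+1}
    return x

# usage on an ill-conditioned quadratic, where momentum helps most;
# note that too large alpha or beta makes the iteration diverge
A = np.diag([1.0, 100.0])
print(momentum_descent(lambda x: A @ x, x0=[1.0, 1.0]))
```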


Momentum

D E M O


Adam

Properties:
- combines several clever tricks (from Momentum, RMSprop, AdaGrad)
- has some similarities to Trust Region methods
- empirically proven, best in class (personal opinion)

Kingma, Ba: Adam - A Method for Stochastic Optimization
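A minimal sketch of the Adam update as described in Kingma & Ba (exponential moving averages of the gradient and its square, with bias correction); the hyperparameter defaults follow the paper, the test problem is illustrative.

```python
import numpy as np

def adam(grad, x0, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8, iters=1000):
    x = np.asarray(x0, dtype=float)
    m = np.zeros_like(x)            # first moment (mean of gradients)
    v = np.zeros_like(x)            # second moment (uncentered variance)
    for t in range(1, iters + 1):
        g = grad(x)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g**2
        m_hat = m / (1 - beta1**t)  # bias correction
        v_hat = v / (1 - beta2**t)
        x = x - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return x

# usage on the same ill-conditioned quadratic as above
A = np.diag([1.0, 100.0])
print(adam(lambda x: A @ x, x0=[1.0, 1.0], alpha=0.01, iters=2000))
```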


SGD, Momentum and more

D E M O


L-BFGS and Nonlinear CG

Observations so far:
- The better the method, the more parameters to tune.
- All better methods try to incorporate curvature information.
- Why not do so directly?

L-BFGS
Quasi-Newton method that builds an approximation of the (inverse) Hessian and scales the gradient accordingly.

Nonlinear CG
Informally speaking, Nonlinear CG tries to solve a quadratic approximation of the function.

No surprise: They also work with minibatches.
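Below is a hedged sketch (assuming SciPy's L-BFGS-B routine; it mimics the minibatch setup described by Ng et al. rather than any code from this talk): run a few L-BFGS iterations per minibatch and warm-start the next batch from the current iterate.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
A, b = rng.normal(size=(1000, 5)), rng.normal(size=1000)   # toy least-squares data

def batch_loss_and_grad(x, idx):
    """Loss and gradient on one minibatch given by the index array idx."""
    r = A[idx] @ x - b[idx]
    return 0.5 * r @ r / len(idx), A[idx].T @ r / len(idx)

x = np.zeros(5)
for epoch in range(5):
    for idx in np.array_split(rng.permutation(len(b)), 10):   # 10 minibatches
        res = minimize(batch_loss_and_grad, x, args=(idx,), jac=True,
                       method="L-BFGS-B", options={"maxiter": 3})
        x = res.x                                              # warm start next batch
print(x)
```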


Empirical results

Empirical evidence for better optimizers being better learners.

MNIST, handwritten digit recognition, from Ng et al., On Optimization Methods for Deep Learning


Truncated Newton: Hessian-Free Optimization

Main ideas:
- Approximate not the Hessian H itself, but the matrix-vector product Hd.
- Use finite differences instead of the exact Hessian (see the sketch below).
- Use damping.
- Use the linear CG method for solving the quadratic approximation.
- Use a clever mini-batch strategy for large data sets.
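A minimal sketch of the first two ingredients (my illustration, not Martens' code): a Hessian-vector product via finite differences of the gradient, Hd ≈ (∇f(x + εd) − ∇f(x))/ε, fed into plain linear CG to approximately solve the quadratic model; damping and minibatching are omitted.

```python
import numpy as np

def hvp_fd(grad, x, d, eps=1e-5):
    """Finite-difference Hessian-vector product H d at x."""
    return (grad(x + eps * d) - grad(x)) / eps

def cg_solve(matvec, rhs, iters=10):
    """Plain linear conjugate gradients for matvec(p) = rhs."""
    p = np.zeros_like(rhs)
    r = rhs - matvec(p)
    d = r.copy()
    for _ in range(iters):
        if r @ r < 1e-12:                    # residual already negligible
            break
        Hd = matvec(d)
        alpha = (r @ r) / (d @ Hd)
        p, r_new = p + alpha * d, r - alpha * Hd
        beta = (r_new @ r_new) / (r @ r)
        d, r = r_new + beta * d, r_new
    return p

# usage: one truncated-Newton step on a quadratic f(x) = 0.5 x^T A x
A = np.diag([1.0, 10.0, 100.0])
grad = lambda x: A @ x
x = np.array([1.0, 1.0, 1.0])
step = cg_solve(lambda d: hvp_fd(grad, x, d), -grad(x), iters=10)
print(x + step)
```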


Empirical test on pathological problems

Main results:
- The addition problem is known to be effectively impossible for gradient descent; HF solved it.
- Basic RNN cells are used, no specialized architectures (LSTMs etc.).

(Martens/Sutskever (2011), Hochreiter/Schmidhuber (1997))


4 Takeaways


Summary

In the long run, the biggest bottleneck will be the sequential parts of an algorithm. That's why the number of iterations needs to be small. SGD and its successors tend to need many more iterations, and they cannot benefit as much from higher parallelism (GPUs).

But whatever you do/prefer/choose:
- At least use successors of SGD: Momentum, Adam etc.
- Look for generic approaches instead of more and more specialized and manually finetuned solutions.
- Key aspects:
  - Initialization
  - Adaptive choice of stepsizes/momentum/...
  - Scaling of the gradient


Resources

- Overview of Gradient Descent methods
- Why momentum really works (Gabriel Goh)
- Adam - A Method for Stochastic Optimization (Kingma, Ba)
- Andrew Ng et al. about L-BFGS and CG outperforming SGD
- Lecture slides: Neural Networks for Machine Learning (Hinton et al.)
- On the importance of initialization and momentum in deep learning
- Data-Science-Blog: Summary article in preparation (Stefan Kühn)
- The Neural Network Zoo

