


YellowFin and the Art of Momentum Tuning
Jian Zhang¹, Ioannis Mitliagkas²
¹Stanford University, ²MILA, University of Montreal

Adaptive Optimization
Hyperparameter tuning is a big cost of deep learning.

Momentum: a key hyperparameter of SGD and its variants.

Adaptive methods, e.g. Adam [1], don’t tune momentum.

YellowFin optimizer
• Based on the robustness properties of momentum.

• Auto-tuning of momentum and learning rate in SGD.

• Closed-loop momentum control for async. training.

Experiments
ResNet and LSTM: YellowFin runs with no tuning.

Adam, momentum SGD, etc. are tuned on learning-rate grids.

YellowFin can outperform the tuned state of the art on training and validation metrics.

Facebook convolutional seq-to-seq model, IWSLT 2014 German-English translation: YellowFin outperforms the hand-tuned default optimizer (results in the table below).

Extension: Closed-loop YellowFin
Asynchronous distributed training: fast, no synchronization barrier.

However, asynchrony induces additional momentum μ̄ on top of the algorithmic momentum μ.

Can we auto-match the total momentum μ_T to the YF-tuned target μ*?

Closed-loop momentum control

The closed-loop mechanism improves YellowFin in asynchronous training.

Momentum operation
• Momentum SGD step: x_{t+1} = x_t − α ∇f(x_t) + μ (x_t − x_{t−1}) (see the sketch below and the display equations further down).
• In a 1-D quadratic with ∇f(x_t) = h(x_t) (x_t − x*), the update has a matrix form with transition matrix A_t (given below).
• Robust region: (1 − √μ)² ≤ α·h(x_t) ≤ (1 + √μ)², given h_min ≤ h(x_t) ≤ h_max.

Robustness of Momentum
• Linear rate robust to curvature variance (middle figure panel).
• Linear rate robust to a range of learning rates (left figure panel).
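As a concrete illustration of the momentum SGD step above, here is a minimal NumPy sketch on a toy 1-D quadratic. This is illustrative only, not the authors' implementation; the curvature, starting point, and hyperparameter values are made up for the example.

```python
# A minimal, illustrative implementation of the momentum SGD step
#   x_{t+1} = x_t - alpha * grad_f(x_t) + mu * (x_t - x_{t-1})
# on a toy 1-D quadratic f(x) = 0.5 * h * (x - x_star)^2, so grad_f(x) = h * (x - x_star).
import numpy as np

def momentum_sgd(grad_f, x0, alpha, mu, steps=100):
    """Run momentum SGD from x0 and return all iterates."""
    x_prev, x = x0, x0
    iterates = [x0]
    for _ in range(steps):
        x_next = x - alpha * grad_f(x) + mu * (x - x_prev)
        x_prev, x = x, x_next
        iterates.append(x_next)
    return np.array(iterates)

# Illustrative values: curvature h = 2, optimum x_star = 0; alpha*h = 0.2 and mu = 0.5
# lie inside the robust region (1 - sqrt(mu))^2 <= alpha*h <= (1 + sqrt(mu))^2.
h, x_star = 2.0, 0.0
grad_f = lambda x: h * (x - x_star)
xs = momentum_sgd(grad_f, x0=5.0, alpha=0.1, mu=0.5)
print(xs[-1])  # converges toward x_star at a linear rate ~ sqrt(mu)
```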

YellowFin
Noisy quadratic model
• Model the stochastic setting with gradient variance C.
• Local quadratic approximation [3]: 1-D case. In the robust region, the distance to the optimum is approximated by
  E(x_t − x*)² ≈ μ^t (x_0 − x*)² + (1 − μ^t) · α²C / (1 − μ).

Greedy tuning strategy
• Solve for the learning rate and momentum with a closed-form solution (a simplified sketch follows).
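The closed-form rule itself is not reproduced on the poster, so the sketch below is a simplified, hypothetical stand-in built from the two principles: keep the whole curvature range [h_min, h_max] in the robust region, and minimize the expression above at t = 1, which reduces to μD² + α²C, with α set at the lower end of the robust-region interval. A coarse grid scan replaces the paper's closed-form solution, and the inputs h_min, h_max, C (gradient variance), and D (distance to the optimum) are assumed to be estimated elsewhere.

```python
# Simplified, hypothetical YellowFin-style tuner (not the paper's exact closed form).
# Constraint (Principle I): mu >= ((sqrt(kappa) - 1) / (sqrt(kappa) + 1))^2, kappa = h_max / h_min,
# which keeps all curvatures in [h_min, h_max] inside the robust region when
# alpha = (1 - sqrt(mu))^2 / h_min.  Objective (Principle II, t = 1): mu * D^2 + alpha^2 * C.
import numpy as np

def tune_momentum_and_lr(h_min, h_max, C, D, grid=1000):
    kappa = h_max / h_min
    mu_low = ((np.sqrt(kappa) - 1.0) / (np.sqrt(kappa) + 1.0)) ** 2
    best = None
    for mu in np.linspace(mu_low, 0.999, grid):        # grid scan in place of the closed form
        alpha = (1.0 - np.sqrt(mu)) ** 2 / h_min
        one_step = mu * D ** 2 + alpha ** 2 * C
        if best is None or one_step < best[0]:
            best = (one_step, mu, alpha)
    return best[1], best[2]

# Illustrative numbers only.
mu, alpha = tune_momentum_and_lr(h_min=1.0, h_max=100.0, C=0.1, D=1.0)
print(f"mu = {mu:.3f}, alpha = {alpha:.4f}")
```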

Closed-loop momentum control (diagram):
• Effective total momentum: μ_T = μ + μ̄, where μ is the algorithmic momentum and μ̄ is the asynchrony-induced momentum.
• Feedback control toward the YF target value μ*:  μ ← μ + γ · (μ* − μ_T)  (a minimal sketch follows).
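A minimal sketch of the feedback rule above. How the total momentum μ_T is measured on the running system is not shown on the poster, so the measured value, the gain γ, and the numbers below are all placeholders.

```python
# Sketch of the closed-loop momentum controller: nudge the algorithmic momentum mu
# so that the measured total momentum mu_T tracks the YellowFin target mu_star.
def closed_loop_momentum_step(mu, mu_star, mu_total_measured, gamma=0.01):
    """One feedback step: mu <- mu + gamma * (mu_star - mu_T), clipped to [0, 1)."""
    mu = mu + gamma * (mu_star - mu_total_measured)
    return min(max(mu, 0.0), 0.999)

# Example: asynchrony adds extra momentum, so the measured total exceeds the target
# and the controller lowers the algorithmic momentum in response.
mu, mu_star = 0.9, 0.9
mu_total_measured = 0.95   # placeholder for a value measured during training
for _ in range(5):
    mu = closed_loop_momentum_step(mu, mu_star, mu_total_measured)
print(mu)                  # drifts below 0.9 to compensate for the induced momentum
```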

GitHub for PyTorch: https://github.com/JianGoForIt/YellowFin_Pytorch
GitHub for TensorFlow: https://github.com/JianGoForIt/YellowFin

Principle I: Stay in the robust region.

Principle II: Minimize the expected distance to the optimum, E(x_t − x*)², after one step (t = 1).

Momentum on a 1-D quadratic (display equations):
• Update: x_{t+1} = x_t − α ∇f(x_t) + μ (x_t − x_{t−1}), with ∇f(x_t) = h(x_t) (x_t − x*); h is the curvature in quadratics.
• Matrix form: (x_{t+1} − x*, x_t − x*)ᵀ = A_t (x_t − x*, x_{t−1} − x*)ᵀ, where
  A_t = [ 1 − α h(x_t) + μ   −μ ]
        [ 1                    0 ]
• Robust region: given h_min ≤ h(x_t) ≤ h_max, choose α so that (1 − √μ)² ≤ α h(x_t) ≤ (1 + √μ)² for every such curvature, i.e. (1 − √μ)²/h_min ≤ α ≤ (1 + √μ)²/h_max.
• Then the spectral radius is ρ(A_t) = √μ, giving linear convergence [2] (checked numerically in the sketch below).
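To make the robust-region statement concrete, this sketch (illustrative, not from the paper) builds the 1-D transition matrix A_t and checks that its spectral radius equals √μ exactly when αh lies in [(1 − √μ)², (1 + √μ)²]; all numeric values are arbitrary examples.

```python
# Check the spectral radius of A_t = [[1 - alpha*h + mu, -mu], [1, 0]]:
# it equals sqrt(mu) whenever (1 - sqrt(mu))^2 <= alpha*h <= (1 + sqrt(mu))^2.
import numpy as np

def spectral_radius(alpha, h, mu):
    A = np.array([[1.0 - alpha * h + mu, -mu],
                  [1.0, 0.0]])
    return float(np.max(np.abs(np.linalg.eigvals(A))))

mu = 0.5
lo, hi = (1 - np.sqrt(mu)) ** 2, (1 + np.sqrt(mu)) ** 2   # robust region for alpha * h
for alpha_h in [0.5 * lo, lo, 1.0, hi, 2.0 * hi]:          # values inside and outside it
    rho = spectral_radius(alpha=alpha_h, h=1.0, mu=mu)     # take h = 1 so alpha = alpha_h
    inside = lo <= alpha_h <= hi
    print(f"alpha*h = {alpha_h:5.3f}  inside robust region: {inside}  rho(A_t) = {rho:.3f}")
# Inside the region rho(A_t) = sqrt(mu) ~ 0.707; outside it exceeds sqrt(mu).
```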

1. [Kingma et al., 2015]

3. [Schaul et al., 2013]

[Figures: training curves, metric vs. iterations]
• CIFAR100 ResNet 164 (training loss): Adam, YellowFin, Closed-loop YellowFin.
• 2-layer PTB LSTM (validation perplexity): Momentum SGD, Adam, YellowFin.
• 2-layer TinyShakespeare Char-RNN (training loss): Momentum SGD, Adam, YellowFin.
• ResNet 110 CIFAR10 (training loss): Momentum SGD, Adam, YellowFin.
• 3-layer WSJ LSTM (validation F1): Momentum SGD, Adam, YellowFin, Adagrad, Vanilla SGD.

IWSLT 2014 German-English translation (convolutional seq-to-seq):

Optimizer                    Val. loss   Val. BLEU@4
Default Nesterov momentum    2.86        30.75
YellowFin                    2.75        31.59

2. Not guaranteed for non-quadratics.

[Mitliagkas et al., 2016]

“Send us your bug reports!”