![Page 1: Increase the Batch Size - GitHub Pages...Increase the Batch Size”, arXiv:1711.00489 Samuel L Smith*, Pieter-Jan Kindermans* and Quoc V. Le Approximate Bayesian Inference”, arXiv:1704.04289](https://reader033.vdocument.in/reader033/viewer/2022050403/5f80ecee6516ec5b49295393/html5/thumbnails/1.jpg)
Don’t Decay the Learning Rate,Increase the Batch Size
Samuel L. Smith, Pieter-Jan Kindermans, Quoc V. LeDecember 9th 2017
Google Brainslsmith@
![Page 2: Increase the Batch Size - GitHub Pages...Increase the Batch Size”, arXiv:1711.00489 Samuel L Smith*, Pieter-Jan Kindermans* and Quoc V. Le Approximate Bayesian Inference”, arXiv:1704.04289](https://reader033.vdocument.in/reader033/viewer/2022050403/5f80ecee6516ec5b49295393/html5/thumbnails/2.jpg)
Proprietary + Confidential
Three related questions:
● What properties control generalization?
● How should we tune SGD hyper-parameters?
● Can we train efficiently with large batches? (> 50,000 examples)
![Page 3: Increase the Batch Size - GitHub Pages...Increase the Batch Size”, arXiv:1711.00489 Samuel L Smith*, Pieter-Jan Kindermans* and Quoc V. Le Approximate Bayesian Inference”, arXiv:1704.04289](https://reader033.vdocument.in/reader033/viewer/2022050403/5f80ecee6516ec5b49295393/html5/thumbnails/3.jpg)
Proprietary + Confidential
Small batches out-generalize large batches(at constant learning rate)
As observed by:
“On Large Batch Training…”, Keskar et al. (2017)
![Page 4: Increase the Batch Size - GitHub Pages...Increase the Batch Size”, arXiv:1711.00489 Samuel L Smith*, Pieter-Jan Kindermans* and Quoc V. Le Approximate Bayesian Inference”, arXiv:1704.04289](https://reader033.vdocument.in/reader033/viewer/2022050403/5f80ecee6516ec5b49295393/html5/thumbnails/4.jpg)
Proprietary + Confidential
Which minimum is best?
![Page 5: Increase the Batch Size - GitHub Pages...Increase the Batch Size”, arXiv:1711.00489 Samuel L Smith*, Pieter-Jan Kindermans* and Quoc V. Le Approximate Bayesian Inference”, arXiv:1704.04289](https://reader033.vdocument.in/reader033/viewer/2022050403/5f80ecee6516ec5b49295393/html5/thumbnails/5.jpg)
Proprietary + Confidential
Bayesian model comparison
![Page 6: Increase the Batch Size - GitHub Pages...Increase the Batch Size”, arXiv:1711.00489 Samuel L Smith*, Pieter-Jan Kindermans* and Quoc V. Le Approximate Bayesian Inference”, arXiv:1704.04289](https://reader033.vdocument.in/reader033/viewer/2022050403/5f80ecee6516ec5b49295393/html5/thumbnails/6.jpg)
Proprietary + Confidential
Bayesian model comparison
![Page 7: Increase the Batch Size - GitHub Pages...Increase the Batch Size”, arXiv:1711.00489 Samuel L Smith*, Pieter-Jan Kindermans* and Quoc V. Le Approximate Bayesian Inference”, arXiv:1704.04289](https://reader033.vdocument.in/reader033/viewer/2022050403/5f80ecee6516ec5b49295393/html5/thumbnails/7.jpg)
Proprietary + Confidential
Bayesian model comparison
Probability ratio of two competing models
![Page 8: Increase the Batch Size - GitHub Pages...Increase the Batch Size”, arXiv:1711.00489 Samuel L Smith*, Pieter-Jan Kindermans* and Quoc V. Le Approximate Bayesian Inference”, arXiv:1704.04289](https://reader033.vdocument.in/reader033/viewer/2022050403/5f80ecee6516ec5b49295393/html5/thumbnails/8.jpg)
Proprietary + Confidential
Bayesian model comparison
Probability ratio of two competing models
Prior probability ratio of the models. Usually 1.
![Page 9: Increase the Batch Size - GitHub Pages...Increase the Batch Size”, arXiv:1711.00489 Samuel L Smith*, Pieter-Jan Kindermans* and Quoc V. Le Approximate Bayesian Inference”, arXiv:1704.04289](https://reader033.vdocument.in/reader033/viewer/2022050403/5f80ecee6516ec5b49295393/html5/thumbnails/9.jpg)
Proprietary + Confidential
Bayesian model comparison
Probability ratio of two competing models
Prior probability ratio of the models. Usually 1.
The evidence ratio!
![Page 10: Increase the Batch Size - GitHub Pages...Increase the Batch Size”, arXiv:1711.00489 Samuel L Smith*, Pieter-Jan Kindermans* and Quoc V. Le Approximate Bayesian Inference”, arXiv:1704.04289](https://reader033.vdocument.in/reader033/viewer/2022050403/5f80ecee6516ec5b49295393/html5/thumbnails/10.jpg)
Proprietary + Confidential
The Bayesian evidence (Gaussian approximation)
λi is the ith Hessian eigenvalue
λ is the L2 regularization parameter
![Page 11: Increase the Batch Size - GitHub Pages...Increase the Batch Size”, arXiv:1711.00489 Samuel L Smith*, Pieter-Jan Kindermans* and Quoc V. Le Approximate Bayesian Inference”, arXiv:1704.04289](https://reader033.vdocument.in/reader033/viewer/2022050403/5f80ecee6516ec5b49295393/html5/thumbnails/11.jpg)
Proprietary + Confidential
The Bayesian evidence (Gaussian approximation)
λi is the ith Hessian eigenvalue
λ is the L2 regularization parameter
Evidence for a minimum
![Page 12: Increase the Batch Size - GitHub Pages...Increase the Batch Size”, arXiv:1711.00489 Samuel L Smith*, Pieter-Jan Kindermans* and Quoc V. Le Approximate Bayesian Inference”, arXiv:1704.04289](https://reader033.vdocument.in/reader033/viewer/2022050403/5f80ecee6516ec5b49295393/html5/thumbnails/12.jpg)
Proprietary + Confidential
The Bayesian evidence (Gaussian approximation)
λi is the ith Hessian eigenvalue
λ is the L2 regularization parameter
Evidence for a minimum Depth of the
minimum
![Page 13: Increase the Batch Size - GitHub Pages...Increase the Batch Size”, arXiv:1711.00489 Samuel L Smith*, Pieter-Jan Kindermans* and Quoc V. Le Approximate Bayesian Inference”, arXiv:1704.04289](https://reader033.vdocument.in/reader033/viewer/2022050403/5f80ecee6516ec5b49295393/html5/thumbnails/13.jpg)
Proprietary + Confidential
The Bayesian evidence (Gaussian approximation)
λi is the ith Hessian eigenvalue
λ is the L2 regularization parameter
Evidence for a minimum Depth of the
minimumWidth of the minimum
![Page 14: Increase the Batch Size - GitHub Pages...Increase the Batch Size”, arXiv:1711.00489 Samuel L Smith*, Pieter-Jan Kindermans* and Quoc V. Le Approximate Bayesian Inference”, arXiv:1704.04289](https://reader033.vdocument.in/reader033/viewer/2022050403/5f80ecee6516ec5b49295393/html5/thumbnails/14.jpg)
Proprietary + Confidential
The Bayesian evidence (Gaussian approximation)
λi is the ith Hessian eigenvalue
λ is the L2 regularization parameter
Width of the minimum
Invariant to changes in model parameterization(sharp minima can’t generalize!)
![Page 15: Increase the Batch Size - GitHub Pages...Increase the Batch Size”, arXiv:1711.00489 Samuel L Smith*, Pieter-Jan Kindermans* and Quoc V. Le Approximate Bayesian Inference”, arXiv:1704.04289](https://reader033.vdocument.in/reader033/viewer/2022050403/5f80ecee6516ec5b49295393/html5/thumbnails/15.jpg)
Proprietary + Confidential
Which minimum is best?
![Page 16: Increase the Batch Size - GitHub Pages...Increase the Batch Size”, arXiv:1711.00489 Samuel L Smith*, Pieter-Jan Kindermans* and Quoc V. Le Approximate Bayesian Inference”, arXiv:1704.04289](https://reader033.vdocument.in/reader033/viewer/2022050403/5f80ecee6516ec5b49295393/html5/thumbnails/16.jpg)
Proprietary + Confidential
Which minimum is best?
Generalization is a weighted combination of:
1) Depth2) Width
![Page 17: Increase the Batch Size - GitHub Pages...Increase the Batch Size”, arXiv:1711.00489 Samuel L Smith*, Pieter-Jan Kindermans* and Quoc V. Le Approximate Bayesian Inference”, arXiv:1704.04289](https://reader033.vdocument.in/reader033/viewer/2022050403/5f80ecee6516ec5b49295393/html5/thumbnails/17.jpg)
Proprietary + Confidential
Which minimum is best?
The SGD should not minimize the cost function
It should maximize the evidence
![Page 18: Increase the Batch Size - GitHub Pages...Increase the Batch Size”, arXiv:1711.00489 Samuel L Smith*, Pieter-Jan Kindermans* and Quoc V. Le Approximate Bayesian Inference”, arXiv:1704.04289](https://reader033.vdocument.in/reader033/viewer/2022050403/5f80ecee6516ec5b49295393/html5/thumbnails/18.jpg)
Proprietary + Confidential
The SGD gradient update
NoiseTrue gradient
![Page 19: Increase the Batch Size - GitHub Pages...Increase the Batch Size”, arXiv:1711.00489 Samuel L Smith*, Pieter-Jan Kindermans* and Quoc V. Le Approximate Bayesian Inference”, arXiv:1704.04289](https://reader033.vdocument.in/reader033/viewer/2022050403/5f80ecee6516ec5b49295393/html5/thumbnails/19.jpg)
Proprietary + Confidential
The SGD gradient update
![Page 20: Increase the Batch Size - GitHub Pages...Increase the Batch Size”, arXiv:1711.00489 Samuel L Smith*, Pieter-Jan Kindermans* and Quoc V. Le Approximate Bayesian Inference”, arXiv:1704.04289](https://reader033.vdocument.in/reader033/viewer/2022050403/5f80ecee6516ec5b49295393/html5/thumbnails/20.jpg)
Proprietary + Confidential
The SGD gradient update
![Page 21: Increase the Batch Size - GitHub Pages...Increase the Batch Size”, arXiv:1711.00489 Samuel L Smith*, Pieter-Jan Kindermans* and Quoc V. Le Approximate Bayesian Inference”, arXiv:1704.04289](https://reader033.vdocument.in/reader033/viewer/2022050403/5f80ecee6516ec5b49295393/html5/thumbnails/21.jpg)
Proprietary + Confidential
The SGD gradient update
Batch size
![Page 22: Increase the Batch Size - GitHub Pages...Increase the Batch Size”, arXiv:1711.00489 Samuel L Smith*, Pieter-Jan Kindermans* and Quoc V. Le Approximate Bayesian Inference”, arXiv:1704.04289](https://reader033.vdocument.in/reader033/viewer/2022050403/5f80ecee6516ec5b49295393/html5/thumbnails/22.jpg)
Proprietary + Confidential
How to choose the batch size? (at constant learning rate)
![Page 23: Increase the Batch Size - GitHub Pages...Increase the Batch Size”, arXiv:1711.00489 Samuel L Smith*, Pieter-Jan Kindermans* and Quoc V. Le Approximate Bayesian Inference”, arXiv:1704.04289](https://reader033.vdocument.in/reader033/viewer/2022050403/5f80ecee6516ec5b49295393/html5/thumbnails/23.jpg)
Proprietary + Confidential
How to choose the batch size? (at constant learning rate)
Too little noise(big batches)
Too much noise(small batches)
Just right!
![Page 24: Increase the Batch Size - GitHub Pages...Increase the Batch Size”, arXiv:1711.00489 Samuel L Smith*, Pieter-Jan Kindermans* and Quoc V. Le Approximate Bayesian Inference”, arXiv:1704.04289](https://reader033.vdocument.in/reader033/viewer/2022050403/5f80ecee6516ec5b49295393/html5/thumbnails/24.jpg)
Proprietary + Confidential
How to choose the batch size? (at constant learning rate)
Just right!
Too little noise(big batches)
Too much noise(small batches)
There should be an optimum batch size
![Page 25: Increase the Batch Size - GitHub Pages...Increase the Batch Size”, arXiv:1711.00489 Samuel L Smith*, Pieter-Jan Kindermans* and Quoc V. Le Approximate Bayesian Inference”, arXiv:1704.04289](https://reader033.vdocument.in/reader033/viewer/2022050403/5f80ecee6516ec5b49295393/html5/thumbnails/25.jpg)
Proprietary + Confidential
How to choose the batch size? (at constant learning rate)
![Page 26: Increase the Batch Size - GitHub Pages...Increase the Batch Size”, arXiv:1711.00489 Samuel L Smith*, Pieter-Jan Kindermans* and Quoc V. Le Approximate Bayesian Inference”, arXiv:1704.04289](https://reader033.vdocument.in/reader033/viewer/2022050403/5f80ecee6516ec5b49295393/html5/thumbnails/26.jpg)
Proprietary + Confidential
How to choose the batch size? (at constant learning rate) As
predicted!
![Page 27: Increase the Batch Size - GitHub Pages...Increase the Batch Size”, arXiv:1711.00489 Samuel L Smith*, Pieter-Jan Kindermans* and Quoc V. Le Approximate Bayesian Inference”, arXiv:1704.04289](https://reader033.vdocument.in/reader033/viewer/2022050403/5f80ecee6516ec5b49295393/html5/thumbnails/27.jpg)
Proprietary + Confidential
Defining the SGD “noise scale”
![Page 28: Increase the Batch Size - GitHub Pages...Increase the Batch Size”, arXiv:1711.00489 Samuel L Smith*, Pieter-Jan Kindermans* and Quoc V. Le Approximate Bayesian Inference”, arXiv:1704.04289](https://reader033.vdocument.in/reader033/viewer/2022050403/5f80ecee6516ec5b49295393/html5/thumbnails/28.jpg)
Proprietary + Confidential
Defining the SGD “noise scale” SGD integrates an underlying stochastic differential equation
![Page 29: Increase the Batch Size - GitHub Pages...Increase the Batch Size”, arXiv:1711.00489 Samuel L Smith*, Pieter-Jan Kindermans* and Quoc V. Le Approximate Bayesian Inference”, arXiv:1704.04289](https://reader033.vdocument.in/reader033/viewer/2022050403/5f80ecee6516ec5b49295393/html5/thumbnails/29.jpg)
Proprietary + Confidential
Defining the SGD “noise scale” SGD integrates an underlying stochastic differential equation
“Noise scale”
![Page 30: Increase the Batch Size - GitHub Pages...Increase the Batch Size”, arXiv:1711.00489 Samuel L Smith*, Pieter-Jan Kindermans* and Quoc V. Le Approximate Bayesian Inference”, arXiv:1704.04289](https://reader033.vdocument.in/reader033/viewer/2022050403/5f80ecee6516ec5b49295393/html5/thumbnails/30.jpg)
Proprietary + Confidential
Defining the SGD “noise scale” SGD integrates an underlying stochastic differential equation
After a little math:
![Page 31: Increase the Batch Size - GitHub Pages...Increase the Batch Size”, arXiv:1711.00489 Samuel L Smith*, Pieter-Jan Kindermans* and Quoc V. Le Approximate Bayesian Inference”, arXiv:1704.04289](https://reader033.vdocument.in/reader033/viewer/2022050403/5f80ecee6516ec5b49295393/html5/thumbnails/31.jpg)
Proprietary + Confidential
Defining the SGD “noise scale” SGD integrates an underlying stochastic differential equation
After a little math:
Prediction:
![Page 32: Increase the Batch Size - GitHub Pages...Increase the Batch Size”, arXiv:1711.00489 Samuel L Smith*, Pieter-Jan Kindermans* and Quoc V. Le Approximate Bayesian Inference”, arXiv:1704.04289](https://reader033.vdocument.in/reader033/viewer/2022050403/5f80ecee6516ec5b49295393/html5/thumbnails/32.jpg)
Proprietary + Confidential
![Page 33: Increase the Batch Size - GitHub Pages...Increase the Batch Size”, arXiv:1711.00489 Samuel L Smith*, Pieter-Jan Kindermans* and Quoc V. Le Approximate Bayesian Inference”, arXiv:1704.04289](https://reader033.vdocument.in/reader033/viewer/2022050403/5f80ecee6516ec5b49295393/html5/thumbnails/33.jpg)
Proprietary + Confidential
![Page 34: Increase the Batch Size - GitHub Pages...Increase the Batch Size”, arXiv:1711.00489 Samuel L Smith*, Pieter-Jan Kindermans* and Quoc V. Le Approximate Bayesian Inference”, arXiv:1704.04289](https://reader033.vdocument.in/reader033/viewer/2022050403/5f80ecee6516ec5b49295393/html5/thumbnails/34.jpg)
Proprietary + Confidential
Consequences
1) We can linearly scale batch size and learning rate■ “Accurate, Large Minibatch SGD: Training ImageNet in 1
Hour”, Goyal et al. (2017)
2) We expect training sets to grow over time■ Suggests batch sizes will rise
![Page 35: Increase the Batch Size - GitHub Pages...Increase the Batch Size”, arXiv:1711.00489 Samuel L Smith*, Pieter-Jan Kindermans* and Quoc V. Le Approximate Bayesian Inference”, arXiv:1704.04289](https://reader033.vdocument.in/reader033/viewer/2022050403/5f80ecee6516ec5b49295393/html5/thumbnails/35.jpg)
Proprietary + Confidential
What about momentum?
![Page 36: Increase the Batch Size - GitHub Pages...Increase the Batch Size”, arXiv:1711.00489 Samuel L Smith*, Pieter-Jan Kindermans* and Quoc V. Le Approximate Bayesian Inference”, arXiv:1704.04289](https://reader033.vdocument.in/reader033/viewer/2022050403/5f80ecee6516ec5b49295393/html5/thumbnails/36.jpg)
Proprietary + Confidential
What about momentum?
![Page 37: Increase the Batch Size - GitHub Pages...Increase the Batch Size”, arXiv:1711.00489 Samuel L Smith*, Pieter-Jan Kindermans* and Quoc V. Le Approximate Bayesian Inference”, arXiv:1704.04289](https://reader033.vdocument.in/reader033/viewer/2022050403/5f80ecee6516ec5b49295393/html5/thumbnails/37.jpg)
Proprietary + Confidential
What about momentum?
![Page 38: Increase the Batch Size - GitHub Pages...Increase the Batch Size”, arXiv:1711.00489 Samuel L Smith*, Pieter-Jan Kindermans* and Quoc V. Le Approximate Bayesian Inference”, arXiv:1704.04289](https://reader033.vdocument.in/reader033/viewer/2022050403/5f80ecee6516ec5b49295393/html5/thumbnails/38.jpg)
Proprietary + Confidential
![Page 39: Increase the Batch Size - GitHub Pages...Increase the Batch Size”, arXiv:1711.00489 Samuel L Smith*, Pieter-Jan Kindermans* and Quoc V. Le Approximate Bayesian Inference”, arXiv:1704.04289](https://reader033.vdocument.in/reader033/viewer/2022050403/5f80ecee6516ec5b49295393/html5/thumbnails/39.jpg)
Proprietary + Confidential
Decaying learning rate and increasing batch size are equivalent
![Page 40: Increase the Batch Size - GitHub Pages...Increase the Batch Size”, arXiv:1711.00489 Samuel L Smith*, Pieter-Jan Kindermans* and Quoc V. Le Approximate Bayesian Inference”, arXiv:1704.04289](https://reader033.vdocument.in/reader033/viewer/2022050403/5f80ecee6516ec5b49295393/html5/thumbnails/40.jpg)
Proprietary + Confidential
Decaying learning rate and increasing batch size are equivalent
![Page 41: Increase the Batch Size - GitHub Pages...Increase the Batch Size”, arXiv:1711.00489 Samuel L Smith*, Pieter-Jan Kindermans* and Quoc V. Le Approximate Bayesian Inference”, arXiv:1704.04289](https://reader033.vdocument.in/reader033/viewer/2022050403/5f80ecee6516ec5b49295393/html5/thumbnails/41.jpg)
Proprietary + Confidential
Decaying learning rate and increasing batch size are equivalent
We can choose any combination of ε and B with the same g.
(so long as ε isn’t too large)
![Page 42: Increase the Batch Size - GitHub Pages...Increase the Batch Size”, arXiv:1711.00489 Samuel L Smith*, Pieter-Jan Kindermans* and Quoc V. Le Approximate Bayesian Inference”, arXiv:1704.04289](https://reader033.vdocument.in/reader033/viewer/2022050403/5f80ecee6516ec5b49295393/html5/thumbnails/42.jpg)
Proprietary + Confidential
Three equivalent schedules:
Wide ResNet on CIFAR-10
![Page 43: Increase the Batch Size - GitHub Pages...Increase the Batch Size”, arXiv:1711.00489 Samuel L Smith*, Pieter-Jan Kindermans* and Quoc V. Le Approximate Bayesian Inference”, arXiv:1704.04289](https://reader033.vdocument.in/reader033/viewer/2022050403/5f80ecee6516ec5b49295393/html5/thumbnails/43.jpg)
Proprietary + Confidential
Training curves:
Ghost batch norm,Hoffer et al., 2017
![Page 44: Increase the Batch Size - GitHub Pages...Increase the Batch Size”, arXiv:1711.00489 Samuel L Smith*, Pieter-Jan Kindermans* and Quoc V. Le Approximate Bayesian Inference”, arXiv:1704.04289](https://reader033.vdocument.in/reader033/viewer/2022050403/5f80ecee6516ec5b49295393/html5/thumbnails/44.jpg)
Proprietary + Confidential
Training curves:
Computational cost constant
![Page 45: Increase the Batch Size - GitHub Pages...Increase the Batch Size”, arXiv:1711.00489 Samuel L Smith*, Pieter-Jan Kindermans* and Quoc V. Le Approximate Bayesian Inference”, arXiv:1704.04289](https://reader033.vdocument.in/reader033/viewer/2022050403/5f80ecee6516ec5b49295393/html5/thumbnails/45.jpg)
Proprietary + Confidential
Computational cost constantBut parallelizable
Training curves:
![Page 46: Increase the Batch Size - GitHub Pages...Increase the Batch Size”, arXiv:1711.00489 Samuel L Smith*, Pieter-Jan Kindermans* and Quoc V. Le Approximate Bayesian Inference”, arXiv:1704.04289](https://reader033.vdocument.in/reader033/viewer/2022050403/5f80ecee6516ec5b49295393/html5/thumbnails/46.jpg)
Proprietary + Confidential
Test curves:
MomentumNesterov
momentum
![Page 47: Increase the Batch Size - GitHub Pages...Increase the Batch Size”, arXiv:1711.00489 Samuel L Smith*, Pieter-Jan Kindermans* and Quoc V. Le Approximate Bayesian Inference”, arXiv:1704.04289](https://reader033.vdocument.in/reader033/viewer/2022050403/5f80ecee6516ec5b49295393/html5/thumbnails/47.jpg)
Proprietary + Confidential
Test curves:
Vanilla SGD Adam
![Page 48: Increase the Batch Size - GitHub Pages...Increase the Batch Size”, arXiv:1711.00489 Samuel L Smith*, Pieter-Jan Kindermans* and Quoc V. Le Approximate Bayesian Inference”, arXiv:1704.04289](https://reader033.vdocument.in/reader033/viewer/2022050403/5f80ecee6516ec5b49295393/html5/thumbnails/48.jpg)
Proprietary + Confidential
Towards large batch training:
![Page 49: Increase the Batch Size - GitHub Pages...Increase the Batch Size”, arXiv:1711.00489 Samuel L Smith*, Pieter-Jan Kindermans* and Quoc V. Le Approximate Bayesian Inference”, arXiv:1704.04289](https://reader033.vdocument.in/reader033/viewer/2022050403/5f80ecee6516ec5b49295393/html5/thumbnails/49.jpg)
Proprietary + Confidential
Towards large batch training:
![Page 50: Increase the Batch Size - GitHub Pages...Increase the Batch Size”, arXiv:1711.00489 Samuel L Smith*, Pieter-Jan Kindermans* and Quoc V. Le Approximate Bayesian Inference”, arXiv:1704.04289](https://reader033.vdocument.in/reader033/viewer/2022050403/5f80ecee6516ec5b49295393/html5/thumbnails/50.jpg)
Proprietary + Confidential
Towards large batch training:
![Page 51: Increase the Batch Size - GitHub Pages...Increase the Batch Size”, arXiv:1711.00489 Samuel L Smith*, Pieter-Jan Kindermans* and Quoc V. Le Approximate Bayesian Inference”, arXiv:1704.04289](https://reader033.vdocument.in/reader033/viewer/2022050403/5f80ecee6516ec5b49295393/html5/thumbnails/51.jpg)
Proprietary + Confidential
Towards large batch training:
![Page 52: Increase the Batch Size - GitHub Pages...Increase the Batch Size”, arXiv:1711.00489 Samuel L Smith*, Pieter-Jan Kindermans* and Quoc V. Le Approximate Bayesian Inference”, arXiv:1704.04289](https://reader033.vdocument.in/reader033/viewer/2022050403/5f80ecee6516ec5b49295393/html5/thumbnails/52.jpg)
Proprietary + Confidential
Towards large batch training:
Typical
speed-up10-100X
![Page 53: Increase the Batch Size - GitHub Pages...Increase the Batch Size”, arXiv:1711.00489 Samuel L Smith*, Pieter-Jan Kindermans* and Quoc V. Le Approximate Bayesian Inference”, arXiv:1704.04289](https://reader033.vdocument.in/reader033/viewer/2022050403/5f80ecee6516ec5b49295393/html5/thumbnails/53.jpg)
Proprietary + Confidential
Why does momentum scaling reduce test accuracy?
![Page 54: Increase the Batch Size - GitHub Pages...Increase the Batch Size”, arXiv:1711.00489 Samuel L Smith*, Pieter-Jan Kindermans* and Quoc V. Le Approximate Bayesian Inference”, arXiv:1704.04289](https://reader033.vdocument.in/reader033/viewer/2022050403/5f80ecee6516ec5b49295393/html5/thumbnails/54.jpg)
Proprietary + Confidential
“Accumulation” stores moving average of gradients
Why does momentum scaling reduce test accuracy?
![Page 55: Increase the Batch Size - GitHub Pages...Increase the Batch Size”, arXiv:1711.00489 Samuel L Smith*, Pieter-Jan Kindermans* and Quoc V. Le Approximate Bayesian Inference”, arXiv:1704.04289](https://reader033.vdocument.in/reader033/viewer/2022050403/5f80ecee6516ec5b49295393/html5/thumbnails/55.jpg)
Proprietary + Confidential
Larger momentum equals longer memory
Why does momentum scaling reduce test accuracy?
![Page 56: Increase the Batch Size - GitHub Pages...Increase the Batch Size”, arXiv:1711.00489 Samuel L Smith*, Pieter-Jan Kindermans* and Quoc V. Le Approximate Bayesian Inference”, arXiv:1704.04289](https://reader033.vdocument.in/reader033/viewer/2022050403/5f80ecee6516ec5b49295393/html5/thumbnails/56.jpg)
Proprietary + Confidential
Larger momentum equals longer memory
Why does momentum scaling reduce test accuracy?
The gradient changes too slowly as we explore the parameter space
![Page 57: Increase the Batch Size - GitHub Pages...Increase the Batch Size”, arXiv:1711.00489 Samuel L Smith*, Pieter-Jan Kindermans* and Quoc V. Le Approximate Bayesian Inference”, arXiv:1704.04289](https://reader033.vdocument.in/reader033/viewer/2022050403/5f80ecee6516ec5b49295393/html5/thumbnails/57.jpg)
Proprietary + Confidential
Training ImageNet in under 2500 updates!
Inception-Resnet-V2
Original implementation:~ 400,000 updates
“ImageNet in one hour”Goyal et al., 2017(learning rate scaling)~ 14,000 updates
![Page 58: Increase the Batch Size - GitHub Pages...Increase the Batch Size”, arXiv:1711.00489 Samuel L Smith*, Pieter-Jan Kindermans* and Quoc V. Le Approximate Bayesian Inference”, arXiv:1704.04289](https://reader033.vdocument.in/reader033/viewer/2022050403/5f80ecee6516ec5b49295393/html5/thumbnails/58.jpg)
Proprietary + Confidential
Training ImageNet in under 2500 updates!
79% accuracy in under 6000 updates
77% accuracy in under 2500 updates
Batches of 65,536 images
![Page 59: Increase the Batch Size - GitHub Pages...Increase the Batch Size”, arXiv:1711.00489 Samuel L Smith*, Pieter-Jan Kindermans* and Quoc V. Le Approximate Bayesian Inference”, arXiv:1704.04289](https://reader033.vdocument.in/reader033/viewer/2022050403/5f80ecee6516ec5b49295393/html5/thumbnails/59.jpg)
Proprietary + Confidential
● “A Bayesian Perspective on Generalization and Stochastic Gradient Descent”, arXiv:1710.06451 Samuel L Smith and Quoc V. Le
● “Don’t Decay the Learning Rate, Increase the Batch Size”, arXiv:1711.00489 Samuel L Smith*, Pieter-Jan Kindermans* and Quoc V. Le *Equal contribution
● “Stochastic Gradient Descent as Approximate Bayesian Inference”, arXiv:1704.04289 Stephan Mandt, Matthew D. Hoffman and David M. Blei
Thank You!
slsmith@pikinder@
qvl@