On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima

Nitish Shirish Keskar¹, Dheevatsa Mudigere², Jorge Nocedal¹, Mikhail Smelyanskiy², Ping Tak Peter Tang²

¹ Northwestern University
² Intel Corporation

ICLR, 2017. Presenter: Tianlu Wang

Outline

1 Introduction
   Batch Size of Stochastic Gradient Methods
2 Drawbacks of Large-Batch Methods
   Main Observation
   Numerical Results
   Parametric Plots
   Sharpness of Minima
3 Success of Small-Batch Methods
   Deterioration with Increasing Batch Size
   Warm-started Large-Batch Experiments
4 Summary


Batch Size of Stochastic Gradient Methods

Non-convex optimization in deep learning:
\[
\min_{x \in \mathbb{R}^n} f(x) := \frac{1}{M} \sum_{i=1}^{M} f_i(x)
\]
Stochastic gradient methods and their variants update $x$ using a mini-batch $B_k$ with $|B_k| \in \{32, 64, \dots, 512\}$.
Increasing the batch size to improve parallelism leads to a loss in generalization performance.
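A minimal sketch of a single mini-batch stochastic gradient step, to make the role of $|B_k|$ concrete; the names (grad_fi, lr, etc.) are illustrative assumptions, not code from the paper.

import numpy as np

def sgd_step(x, grad_fi, num_examples, batch_size, lr, rng):
    """One SGD step: x <- x - lr * (1/|B_k|) * sum_{i in B_k} grad f_i(x)."""
    batch = rng.choice(num_examples, size=batch_size, replace=False)  # sample the mini-batch B_k
    g = np.mean([grad_fi(x, i) for i in batch], axis=0)               # averaged mini-batch gradient
    return x - lr * g

# SB regime: batch_size in {32, ..., 512}; LB regime in the paper: roughly 10% of the training set.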


Main Observations

Large-batch (LB) methods tend to converge to sharp minimizers of the training function and tend to generalize less well.
Small-batch (SB) methods converge to flat minimizers and are able to escape the basins of attraction of sharp minimizers.

Sharp minimizer x̂: the function increases rapidly in a small neighborhood of x̂.
Flat minimizer x̄: the function varies slowly in a large neighborhood of x̄.
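A one-dimensional toy illustration (not from the paper, purely for intuition) of why sharpness matters for generalization:
\[
f_{\text{sharp}}(x) = 100\,(x - \hat{x})^2, \qquad f_{\text{flat}}(x) = \tfrac{1}{2}\,(x - \bar{x})^2 .
\]
If the testing loss is approximately the training loss shifted by a small amount $\delta$ in parameter space, the loss at the sharp minimizer rises by $100\,\delta^2$, while the loss at the flat minimizer rises by only $\tfrac{1}{2}\,\delta^2$.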


Numerical Results

Six multi-class classification networks are trained with mean cross-entropy loss and the ADAM optimizer. LB (large batch): 10% of the training data; SB (small batch): 256 data points.


Question

Is the generalization gap simply due to over-fitting or over-training? The training curves suggest it is not, so early stopping alone would not close the gap.


Parametric Plots

$x^*_s$ and $x^*_l$: solutions obtained by the SB and LB methods, respectively.

Plot $f(\alpha x^*_l + (1 - \alpha)x^*_s)$ as $\alpha$ varies, i.e. a one-dimensional slice of the loss along the line between the two solutions:
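A minimal sketch of how such a parametric plot could be computed; loss, x_s and x_l are assumed inputs (a loss-evaluation function and the two parameter vectors), not code from the paper.

import numpy as np

def parametric_plot(loss, x_s, x_l, num_points=25):
    """Evaluate loss(alpha * x_l + (1 - alpha) * x_s) along the SB-to-LB line."""
    alphas = np.linspace(-1.0, 2.0, num_points)  # the paper sweeps alpha beyond [0, 1]
    values = np.array([loss(a * x_l + (1.0 - a) * x_s) for a in alphas])
    return alphas, values

# alpha = 0 recovers the SB solution x_s, alpha = 1 recovers the LB solution x_l.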


Sharpness of Minima

Motivation: measure the sensitivity of the training function at a given local minimizer. To do so, explore a small neighborhood of the minimizer and compute the largest value that f can attain in that neighborhood.


Sharpness of Minima

Small neighborhood:
$p$: dimension of the explored manifold
$A$: an $n \times p$ matrix whose columns are randomly generated
$A^+$: the pseudo-inverse of $A$

Full space ($p = n$):
\[
\mathcal{C}_\varepsilon = \{ z \in \mathbb{R}^n : -\varepsilon(|x_i| + 1) \le z_i \le \varepsilon(|x_i| + 1), \ \forall i \in \{1, 2, \dots, n\} \}
\]
Random subspace:
\[
\mathcal{C}_\varepsilon = \{ z \in \mathbb{R}^p : -\varepsilon(|(A^+ x)_i| + 1) \le z_i \le \varepsilon(|(A^+ x)_i| + 1), \ \forall i \in \{1, 2, \dots, p\} \}
\]

Metric 2.1. Given $x \in \mathbb{R}^n$, $\varepsilon > 0$ and $A \in \mathbb{R}^{n \times p}$, the sharpness of $f$ at $x$ is
\[
\phi_{x,f}(\varepsilon, A) := \frac{\left( \max_{y \in \mathcal{C}_\varepsilon} f(x + Ay) \right) - f(x)}{1 + f(x)} \times 100 \qquad (1)
\]

$A$ can be the identity matrix $I_n$, in which case the maximization is over the full space.
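A minimal numerical sketch of Metric 2.1 that replaces the inner maximization with random search over the box (the paper solves it inexactly with a bound-constrained L-BFGS method); f is an assumed function that evaluates the training loss at a parameter vector.

import numpy as np

def sharpness(f, x, eps=1e-3, A=None, num_samples=1000, seed=0):
    """Approximate phi_{x,f}(eps, A) by random search over the box C_eps."""
    rng = np.random.default_rng(seed)
    if A is None:
        A = np.eye(x.size)                                  # full-space case: A = I_n
    bound = eps * (np.abs(np.linalg.pinv(A) @ x) + 1.0)     # box half-widths, one per column of A
    f_x = f(x)
    best = f_x
    for _ in range(num_samples):
        y = rng.uniform(-bound, bound)                      # draw y in C_eps
        best = max(best, f(x + A @ y))
    return 100.0 * (best - f_x) / (1.0 + f_x)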


Sharpness of Minima

Sharpness of minima in the full space ($A$ is the identity matrix $I_n$): the LB solutions are consistently much sharper than the SB solutions.


Deterioration with Increasing Batch Size

There exists a threshold batch size after which the quality of the model deteriorates.
Note: this threshold is approximately 15000 for network F2 and approximately 500 for network C1.


Warm-started Large-Batch Experiments

Train the network for 100 epochs with the SB method (batch size 256) and use the iterates from these 100 epochs as starting points (warm starts) for LB training.

The SB method needs some epochs of exploration before it settles near a flat minimizer; LB runs warm-started from a sufficiently explored SB iterate retain the good testing accuracy.
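A minimal sketch of this warm-start protocol; train_one_epoch, evaluate, model.copy() and the LB batch size are assumed placeholders, not code or values from the paper.

def warm_started_lb(model, train_one_epoch, evaluate,
                    sb_epochs=100, sb_batch=256, lb_batch=5000, lb_epochs=100):
    """Train SB once, snapshot after each epoch, then run LB training from each snapshot."""
    results = []
    for k in range(1, sb_epochs + 1):
        train_one_epoch(model, batch_size=sb_batch)        # continue the single SB run
        warm = model.copy()                                # k-th SB iterate used as a warm start
        for _ in range(lb_epochs):
            train_one_epoch(warm, batch_size=lb_batch)     # LB training from the warm start
        results.append((k, evaluate(warm)))                # testing accuracy as a function of k
    return results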


Summary

Numerical experiments support the view that convergence to sharp minimizers gives rise to the poor generalization of large-batch methods for deep learning.

SB methods have an exploration phase followed by convergence to a flat minimizer.

Attempts to remedy the problem:

Data augmentation
Conservative training
Adversarial training
Robust optimization
