On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima

Nitish Shirish Keskar¹, Dheevatsa Mudigere², Jorge Nocedal¹, Mikhail Smelyanskiy², Ping Tak Peter Tang²

¹ Northwestern University
² Intel Corporation

ICLR, 2017. Presenter: Tianlu Wang

Outline

1 Introduction
   Batch Size of Stochastic Gradient Methods
2 Drawbacks of Large-Batch Methods
   Main Observation
   Numerical Results
   Parametric Plots
   Sharpness of Minima
3 Success of Small-Batch Methods
   Deterioration with Increasing Batch Size
   Warm-started Large-Batch Experiments
4 Summary


Batch Size of Stochastic Gradient Methods

Non-convex optimization in deep learning:
\[
\min_{x \in \mathbb{R}^n} f(x) := \frac{1}{M} \sum_{i=1}^{M} f_i(x)
\]
Stochastic gradient methods and their variants update $x$ using a mini-batch $B_k$ with $|B_k| \in \{32, 64, \dots, 512\}$.
Increasing the batch size to improve parallelism leads to a loss in generalization performance.
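A minimal sketch of a single mini-batch stochastic gradient step, to make the role of $|B_k|$ concrete; the names (grad_fi, lr, etc.) are illustrative assumptions, not code from the paper.

import numpy as np

def sgd_step(x, grad_fi, num_examples, batch_size, lr, rng):
    """One SGD step: x <- x - lr * (1/|B_k|) * sum_{i in B_k} grad f_i(x)."""
    batch = rng.choice(num_examples, size=batch_size, replace=False)  # sample the mini-batch B_k
    g = np.mean([grad_fi(x, i) for i in batch], axis=0)               # averaged mini-batch gradient
    return x - lr * g

# SB regime: batch_size in {32, ..., 512}; LB regime in the paper: roughly 10% of the training set.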


Main Observations

Large-batch (LB) methods tend to converge to sharp minimizers of the training function and tend to generalize less well.
Small-batch (SB) methods converge to flat minimizers and are able to escape the basins of attraction of sharp minimizers.

Sharp minimizer x̂: the function increases rapidly in a small neighborhood of x̂.
Flat minimizer x̄: the function varies slowly in a large neighborhood of x̄.
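A one-dimensional toy illustration (not from the paper, purely for intuition) of why sharpness matters for generalization:
\[
f_{\text{sharp}}(x) = 100\,(x - \hat{x})^2, \qquad f_{\text{flat}}(x) = \tfrac{1}{2}\,(x - \bar{x})^2 .
\]
If the testing loss is approximately the training loss shifted by a small amount $\delta$ in parameter space, the loss at the sharp minimizer rises by $100\,\delta^2$, while the loss at the flat minimizer rises by only $\tfrac{1}{2}\,\delta^2$.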


Numerical Results

Six multi-class classification networks are trained with mean cross-entropy loss and the ADAM optimizer. LB (large batch): 10% of the training data; SB (small batch): 256 data points.


Question

Is the generalization gap simply due to over-fitting or over-training? The training curves suggest it is not, so early stopping alone would not close the gap.


Parametric Plots

$x^*_s$ and $x^*_l$: solutions obtained by the SB and LB methods, respectively.

Plot $f(\alpha x^*_l + (1 - \alpha)x^*_s)$ as $\alpha$ varies, i.e. a one-dimensional slice of the loss along the line between the two solutions:
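A minimal sketch of how such a parametric plot could be computed; loss, x_s and x_l are assumed inputs (a loss-evaluation function and the two parameter vectors), not code from the paper.

import numpy as np

def parametric_plot(loss, x_s, x_l, num_points=25):
    """Evaluate loss(alpha * x_l + (1 - alpha) * x_s) along the SB-to-LB line."""
    alphas = np.linspace(-1.0, 2.0, num_points)  # the paper sweeps alpha beyond [0, 1]
    values = np.array([loss(a * x_l + (1.0 - a) * x_s) for a in alphas])
    return alphas, values

# alpha = 0 recovers the SB solution x_s, alpha = 1 recovers the LB solution x_l.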


Sharpness of Minima

Motivation: measure the sensitivity of the training function at a given local minimizer. To do so, explore a small neighborhood of the minimizer and compute the largest value that f can attain in that neighborhood.


Sharpness of Minima

Small neighborhood:
$p$: dimension of the explored manifold
$A$: an $n \times p$ matrix whose columns are randomly generated
$A^+$: the pseudo-inverse of $A$

Full space ($p = n$):
\[
\mathcal{C}_\varepsilon = \{ z \in \mathbb{R}^n : -\varepsilon(|x_i| + 1) \le z_i \le \varepsilon(|x_i| + 1), \ \forall i \in \{1, 2, \dots, n\} \}
\]
Random subspace:
\[
\mathcal{C}_\varepsilon = \{ z \in \mathbb{R}^p : -\varepsilon(|(A^+ x)_i| + 1) \le z_i \le \varepsilon(|(A^+ x)_i| + 1), \ \forall i \in \{1, 2, \dots, p\} \}
\]

Metric 2.1. Given $x \in \mathbb{R}^n$, $\varepsilon > 0$ and $A \in \mathbb{R}^{n \times p}$, the sharpness of $f$ at $x$ is
\[
\phi_{x,f}(\varepsilon, A) := \frac{\left( \max_{y \in \mathcal{C}_\varepsilon} f(x + Ay) \right) - f(x)}{1 + f(x)} \times 100 \qquad (1)
\]

$A$ can be the identity matrix $I_n$, in which case the maximization is over the full space.
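A minimal numerical sketch of Metric 2.1 that replaces the inner maximization with random search over the box (the paper solves it inexactly with a bound-constrained L-BFGS method); f is an assumed function that evaluates the training loss at a parameter vector.

import numpy as np

def sharpness(f, x, eps=1e-3, A=None, num_samples=1000, seed=0):
    """Approximate phi_{x,f}(eps, A) by random search over the box C_eps."""
    rng = np.random.default_rng(seed)
    if A is None:
        A = np.eye(x.size)                                  # full-space case: A = I_n
    bound = eps * (np.abs(np.linalg.pinv(A) @ x) + 1.0)     # box half-widths, one per column of A
    f_x = f(x)
    best = f_x
    for _ in range(num_samples):
        y = rng.uniform(-bound, bound)                      # draw y in C_eps
        best = max(best, f(x + A @ y))
    return 100.0 * (best - f_x) / (1.0 + f_x)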


Sharpness of Minima

Sharpness of minima in the full space ($A$ is the identity matrix $I_n$): the LB solutions are consistently much sharper than the SB solutions.


Deterioration with Increasing Batch Size

There exists a threshold batch size after which the quality of the model deteriorates.
Note: this threshold is approximately 15000 for network F2 and approximately 500 for network C1.


Warm-started Large-Batch Experiments

Train the network for 100 epochs with the SB method (batch size 256) and use the iterates from these 100 epochs as starting points (warm starts) for LB training.

The SB method needs some epochs of exploration before it settles near a flat minimizer; LB runs warm-started from a sufficiently explored SB iterate retain the good testing accuracy.
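A minimal sketch of this warm-start protocol; train_one_epoch, evaluate, model.copy() and the LB batch size are assumed placeholders, not code or values from the paper.

def warm_started_lb(model, train_one_epoch, evaluate,
                    sb_epochs=100, sb_batch=256, lb_batch=5000, lb_epochs=100):
    """Train SB once, snapshot after each epoch, then run LB training from each snapshot."""
    results = []
    for k in range(1, sb_epochs + 1):
        train_one_epoch(model, batch_size=sb_batch)        # continue the single SB run
        warm = model.copy()                                # k-th SB iterate used as a warm start
        for _ in range(lb_epochs):
            train_one_epoch(warm, batch_size=lb_batch)     # LB training from the warm start
        results.append((k, evaluate(warm)))                # testing accuracy as a function of k
    return results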


Summary

Numerical experiments support the view that convergence to sharp minimizers gives rise to the poor generalization of large-batch methods for deep learning.

SB methods have an exploration phase followed by convergence to a flat minimizer.

Attempts to remedy the problem:

Data augmentation
Conservative training
Adversarial training
Robust optimization
