
Page 1

Algorithmic Intelligence Lab

EE807: Recent Advances in Deep Learning, Lecture 2

Slide made by Insu Han and Jongheon Jeong, KAIST EE

Stochastic Gradient Descent

Page 2

Table of Contents

1. Introduction
   • Empirical risk minimization (ERM)

2. Gradient Descent Methods
   • Gradient descent (GD)
   • Stochastic gradient descent (SGD)

3. Momentum and Adaptive Learning Rate Methods
   • Momentum methods
   • Learning rate scheduling
   • Adaptive learning rate methods (AdaGrad, RMSProp, Adam)

4. Changing Batch Size
   • Increasing the batch size without learning rate decay

5. Summary

Page 3

Empirical Risk Minimization (ERM)

• Given a training set D = {(x_1, y_1), …, (x_n, y_n)}

• Prediction function f(x; w), parameterized by w

• Empirical risk minimization: find a parameter w that minimizes the loss function

  L(w) = (1/n) Σ_{i=1}^{n} ℓ(f(x_i; w), y_i)

  where ℓ is a loss function, e.g., MSE or cross-entropy.

• For example, f can be a neural network whose parameters w are its weights and biases.

Next, how to solve ERM?
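As a concrete illustration of the ERM objective above, here is a minimal NumPy sketch that evaluates the empirical risk of a prediction function on a training set; the toy data, the linear form of f, and the choice of MSE as the loss ℓ are illustrative assumptions, not taken from the slides.

```python
import numpy as np

# Toy training set {(x_i, y_i)}, i = 1..n (illustrative assumption)
rng = np.random.default_rng(0)
n, d = 100, 5
X = rng.normal(size=(n, d))
y = rng.normal(size=n)

def predict(X, w):
    """Prediction function f(x; w); here a simple linear model (assumption)."""
    return X @ w

def empirical_risk(w, X, y):
    """Empirical risk: the average loss over the n training examples (MSE here)."""
    return np.mean((predict(X, w) - y) ** 2)

print(empirical_risk(np.zeros(d), X, y))
```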

Page 4

Gradient Descent (GD)

• Gradient descent (GD) updates the parameters iteratively by taking a gradient step:

  θ_{t+1} = θ_t − η ∇L(θ_t)

  where θ are the parameters, η is the learning rate, and L is the loss function.

• (+) Converges to the global (local) minimum for convex (non-convex) problems.
• (−) Not efficient with respect to computation time and memory space for huge n.
  • For example, the ImageNet dataset has n = 1,281,167 training images;
    1.2M 256×256 RGB images ≈ 236 GB of memory.

Next, efficient GD
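A minimal sketch of the full-batch GD update θ_{t+1} = θ_t − η ∇L(θ_t) on a toy linear-regression risk; the closed-form gradient, the learning rate, and the step count are illustrative assumptions. Note that every iteration touches all n examples, which is exactly the cost issue raised above.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 5
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true + 0.1 * rng.normal(size=n)

def full_gradient(w):
    """Gradient of the MSE empirical risk over the entire training set."""
    return (2.0 / n) * X.T @ (X @ w - y)

eta = 0.1                      # learning rate (illustrative)
w = np.zeros(d)
for t in range(200):           # each step costs a full pass over all n examples
    w = w - eta * full_gradient(w)
print(np.mean((X @ w - y) ** 2))
```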

Page 5

Stochastic Gradient Descent (SGD)

• Stochastic gradient descent (SGD) uses a few samples (a minibatch B) to approximate the GD update:

  θ_{t+1} = θ_t − η · (1/|B|) Σ_{i∈B} ∇ℓ(f(x_i; θ_t), y_i)

• In practice, common minibatch sizes are 32 / 64 / 128.

• Main practical challenges and current solutions:
  1. SGD can be too noisy and might be unstable → momentum
  2. Hard to find a good learning rate → adaptive learning rates

*source: https://lovesnowbest.site/2018/02/16/Improving-Deep-Neural-Networks-Assignment-2/

Next, momentum
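A minimal sketch of minibatch SGD on a toy linear-regression problem; the minibatch size of 32 matches one of the common values quoted above, while the data, learning rate, and step count are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, batch = 10000, 5, 32     # minibatch size 32 (one of the common choices)
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true + 0.1 * rng.normal(size=n)

eta = 0.05                     # learning rate (illustrative)
w = np.zeros(d)
for t in range(2000):
    idx = rng.choice(n, size=batch, replace=False)          # sample a random minibatch
    g = (2.0 / batch) * X[idx].T @ (X[idx] @ w - y[idx])    # noisy estimate of the full gradient
    w = w - eta * g
print(np.mean((X @ w - y) ** 2))
```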

Page 6

Momentum Methods

1. Momentum gradient descent
• Add decaying previous gradients (momentum):

  v_{t+1} = μ v_t − η ∇L(θ_t),   θ_{t+1} = θ_t + v_{t+1}

  where μ is the momentum (preservation ratio).

• Equivalent to a weighted sum that keeps the fraction μ of the previous update.
• (+) Momentum reduces the oscillation and accelerates the convergence.

[Figure: SGD vs. SGD+momentum trajectories — momentum acts as friction against the vertical fluctuation and as acceleration toward the minimum]
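A minimal sketch of the momentum update v_{t+1} = μ v_t − η ∇L(θ_t), θ_{t+1} = θ_t + v_{t+1}; the ill-conditioned quadratic loss and the values μ = 0.9, η = 0.01 are illustrative assumptions.

```python
import numpy as np

def grad(theta):
    """Gradient of an ill-conditioned quadratic loss L(θ) = 0.5*(θ₁² + 25·θ₂²) (illustrative)."""
    return np.array([1.0, 25.0]) * theta

mu, eta = 0.9, 0.01            # momentum (preservation ratio) and learning rate
theta = np.array([2.0, 2.0])
v = np.zeros(2)
for t in range(300):
    g = grad(theta)
    v = mu * v - eta * g       # decaying sum of previous gradients
    theta = theta + v
print(theta)                   # approaches the minimum at the origin
```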

Page 7

Momentum Methods: Nesterov's Momentum

1. Momentum gradient descent
• Add decaying previous gradients (momentum).

• (−) Momentum can fail to converge even for simple convex optimization problems.
• Nesterov's accelerated gradient (NAG) [Nesterov'1983] uses the gradient at the approximate future position (a "look-ahead" gradient), i.e.,

  v_{t+1} = μ v_t − η ∇L(θ_t + μ v_t),   θ_{t+1} = θ_t + v_{t+1}

  where μ is the momentum (preservation ratio).

Page 8

Momentum Methods: Nesterov's Momentum

1. Momentum gradient descent
• Add decaying previous gradients (momentum).

• Nesterov's accelerated gradient (NAG) [Nesterov'1983] uses the gradient at the approximate future position, i.e., evaluated at θ_t + μ v_t, where μ is the momentum (preservation ratio).

• Quiz: fill in the pseudocode of Nesterov's accelerated gradient (one possible sketch follows below).

[Figure: trajectories of SGD, SGD+momentum, and NAG]
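One possible answer to the quiz, as a minimal sketch: NAG evaluates the gradient at the "look-ahead" point θ_t + μ v_t instead of at θ_t. The toy quadratic loss and the hyperparameter values are illustrative assumptions.

```python
import numpy as np

def grad(theta):
    """Gradient of an illustrative quadratic loss (a stand-in for ∇L)."""
    return np.array([1.0, 25.0]) * theta

mu, eta = 0.9, 0.01            # momentum (preservation ratio) and learning rate
theta = np.array([2.0, 2.0])
v = np.zeros(2)
for t in range(300):
    g = grad(theta + mu * v)   # "look-ahead" gradient at the approximate future position
    v = mu * v - eta * g
    theta = theta + v
print(theta)
```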

Page 9

Adaptive Learning Rate Methods

2. Learning rate scheduling
• The learning rate is critical for minimizing the loss!
  • Too high → may skip over narrow valleys, can diverge
  • Too low  → may fall into local minima, slow convergence

*source: http://cs231n.github.io/neural-networks-3/

Next, learning rate scheduling

Page 10

Adaptive Learning Rate Methods: Learning rate annealing

2. Learning rate scheduling: decay methods
• A naive choice is a constant learning rate.
• Common learning rate schedules include time-based, exponential, and step decay
  (step decay is the most popular in practice).
• "Step decay" decreases the learning rate by a factor every few epochs.
  • Typically, it is initialized to 0.01 and dropped by half every 10 epochs.

[Figure: step decay and exponential decay learning rate schedules, and the corresponding accuracy curves]

*source: https://towardsdatascience.com/
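The three schedules can be coded in a few lines. The exact functional forms are not shown on this page, so the parameterizations below are the usual ones and should be treated as assumptions; the initial rate 0.01 and halving every 10 epochs follow the slide, while the decay rate k is illustrative.

```python
import math

eta0 = 0.01                    # initial learning rate (value quoted on the slide)
k = 0.05                       # decay rate for time-based / exponential decay (assumed)
drop, every = 0.5, 10          # step decay: halve every 10 epochs (as on the slide)

def time_based(epoch):
    return eta0 / (1.0 + k * epoch)

def exponential(epoch):
    return eta0 * math.exp(-k * epoch)

def step(epoch):
    return eta0 * drop ** (epoch // every)

for epoch in (0, 10, 20, 30):
    print(epoch, time_based(epoch), exponential(epoch), step(epoch))
```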

Page 11

Adaptive Learning Rate Methods: Learning rate annealing

2. Learning rate scheduling: cycling method
• [Smith'2015] proposed the cyclical learning rate (triangular).
• Why a "cycling" learning rate?
  • Sometimes, increasing the learning rate is helpful to escape saddle points.
  • It can be combined with exponential decay or periodic decay.

[Figure: cyclical (triangular) learning rate schedule]

*source: https://github.com/bckenstler/CLR
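A minimal sketch of the triangular cyclical schedule in the spirit of [Smith'2015] and the CLR repository linked above; the base_lr, max_lr, and step_size values are illustrative assumptions.

```python
def triangular_lr(it, base_lr=0.001, max_lr=0.006, step_size=2000):
    """Cyclical (triangular) learning rate: rises linearly from base_lr to max_lr
    over step_size iterations, then falls back to base_lr, and repeats."""
    cycle = it // (2 * step_size)
    x = abs(it / step_size - 2 * cycle - 1)   # position within the current cycle, in [0, 1]
    return base_lr + (max_lr - base_lr) * max(0.0, 1.0 - x)

for it in (0, 1000, 2000, 3000, 4000):
    print(it, triangular_lr(it))
```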

Page 12

Adaptive Learning Rate Methods: Learning rate annealing

2. Learning rate scheduling: cycling method
• [Loshchilov'2017] use cosine cycling and restart to the maximum learning rate at each cycle.
• Why "cosine"?
  • It decays slowly during the first half of the cycle and drops quickly during the rest.
  • (+) Can climb down and up the loss surface, and thus can traverse several local minima.
  • (+) Same as restarting at good points with the initial learning rate.

*source: Loshchilov et al., SGDR: Stochastic Gradient Descent with Warm Restarts. ICLR 2017
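A minimal sketch of the cosine annealing schedule used within each SGDR cycle, η_t = η_min + ½(η_max − η_min)(1 + cos(π·T_cur/T_i)); the cycle length and the η range are illustrative, and the fixed cycle length is a simplification (the paper also lets the cycle length grow after each restart).

```python
import math

def sgdr_lr(epoch, eta_min=0.0, eta_max=0.05, T_i=50):
    """Cosine annealing with warm restarts: within each cycle of T_i epochs the learning
    rate follows a half cosine from eta_max down to eta_min, then restarts at eta_max."""
    t_cur = epoch % T_i                        # epochs since the last (warm) restart
    return eta_min + 0.5 * (eta_max - eta_min) * (1 + math.cos(math.pi * t_cur / T_i))

for epoch in (0, 25, 49, 50, 75):
    print(epoch, sgdr_lr(epoch))
```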

Page 13

Adaptive Learning Rate Methods: Learning rate annealing

2. Learning rate scheduling: cycling method
• [Loshchilov'2017] also proposed warm restarts for the cyclical learning rate.
  • *Warm restart: restart frequently in the early iterations.
• (+) It helps to escape saddle points, since optimization is more likely to get stuck in the early iterations.

[Figure: comparison of step decay, cycling with no restart, and cycling with (warm) restart]

But there is no perfect learning rate schedule! It depends on the specific task.

*source: Loshchilov et al., SGDR: Stochastic Gradient Descent with Warm Restarts. ICLR 2017

Next, adaptive learning rate

Page 14

Adaptive Learning Rate Methods: AdaGrad, RMSProp

3. Adaptively changing the learning rate (AdaGrad, RMSProp)
• AdaGrad [Duchi'11] downscales the learning rate by the magnitude of the previous gradients:

  G_t = G_{t−1} + g_t ⊙ g_t   (sum of all previous squared gradients)
  θ_{t+1} = θ_t − η · g_t / (√G_t + ε)

  • (−) The learning rate strictly decreases and becomes too small after many iterations.

• RMSProp [Tieleman'12] instead uses a moving average of the squared gradients:

  G_t = ρ G_{t−1} + (1 − ρ) g_t ⊙ g_t   (ρ: preservation ratio)

• Other variants also exist, e.g., Adadelta [Zeiler'2012].
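A minimal sketch of both updates side by side on a toy quadratic loss; the learning rates, ε, ρ, and the loss itself are illustrative assumptions. The only difference between the two is how G_t is accumulated: AdaGrad sums all squared gradients, RMSProp keeps a moving average.

```python
import numpy as np

def grad(theta):
    """Gradient of an illustrative quadratic loss (a stand-in for the minibatch gradient g_t)."""
    return np.array([1.0, 25.0]) * theta

eps = 1e-8
rho = 0.9                                     # preservation ratio for RMSProp

theta_ada, G_ada = np.array([2.0, 2.0]), np.zeros(2)
theta_rms, G_rms = np.array([2.0, 2.0]), np.zeros(2)

for t in range(500):
    g = grad(theta_ada)
    G_ada += g * g                            # AdaGrad: sum of all previous squared gradients
    theta_ada -= 0.5 * g / (np.sqrt(G_ada) + eps)

    g = grad(theta_rms)
    G_rms = rho * G_rms + (1 - rho) * g * g   # RMSProp: moving average of squared gradients
    theta_rms -= 0.01 * g / (np.sqrt(G_rms) + eps)

print(theta_ada, theta_rms)
```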

Page 15

Adaptive Learning Rate Methods

• Visualization of the algorithms:

[Figures: optimization behavior of the algorithms at a saddle point and at a local optimum]

• The adaptive learning-rate methods, i.e., Adadelta and RMSProp, are the most suitable and provide the best convergence in these scenarios.

*source: animations from Alec Radford's blog

Next, momentum + adaptive learning rate

Page 16

Adaptive Learning Rate Methods: ADAM

3. Combination of momentum and adaptive learning rate
• Adam (ADAptive Moment estimation) [Kingma'2015]:

  m_t = β₁ m_{t−1} + (1 − β₁) g_t          (momentum: running average of gradients)
  v_t = β₂ v_{t−1} + (1 − β₂) g_t ⊙ g_t    (running average of squared gradients)
  θ_{t+1} = θ_t − η · m̂_t / (√v̂_t + ε),  with bias-corrected m̂_t = m_t / (1 − β₁^t), v̂_t = v_t / (1 − β₂^t)

• Can be seen as a momentum + RMSProp update.
• Other variants exist, e.g., Adamax [Kingma'14], Nadam [Dozat'16].

*source: Kingma and Ba. Adam: A method for stochastic optimization. ICLR 2015
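A minimal sketch of the Adam update with its bias correction on a toy quadratic loss; β₁ = 0.9, β₂ = 0.999, and ε = 1e−8 are the defaults suggested in the paper, while the learning rate, loss, and step count are illustrative assumptions.

```python
import numpy as np

def grad(theta):
    """Gradient of an illustrative quadratic loss (a stand-in for the minibatch gradient g_t)."""
    return np.array([1.0, 25.0]) * theta

eta, beta1, beta2, eps = 0.1, 0.9, 0.999, 1e-8
theta = np.array([2.0, 2.0])
m = np.zeros(2)                               # first moment: momentum-like average of gradients
v = np.zeros(2)                               # second moment: average of squared gradients

for t in range(1, 501):
    g = grad(theta)
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g * g
    m_hat = m / (1 - beta1 ** t)              # bias correction for the zero initialization
    v_hat = v / (1 - beta2 ** t)
    theta = theta - eta * m_hat / (np.sqrt(v_hat) + eps)
print(theta)
```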

Page 17

Decaying the Learning Rate = Increasing the Batch Size

• In practice, SGD+momentum and Adam work well in many applications.

• But scheduling the learning rate is still critical! (it should be decayed appropriately)

• [Smith'2017] shows that decaying the learning rate = increasing the batch size (see the sketch below).
  • (+) A large batch size allows fewer parameter updates, leading to parallelism!

*source: Smith et al., "Don't Decay the Learning Rate, Increase the Batch Size.", ICLR 2018
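A minimal sketch of the idea in [Smith'2017]: wherever a schedule would divide the learning rate by a factor, the batch size is instead multiplied by that factor (until memory runs out), keeping the SGD noise scale roughly constant. The epochs, factor, and batch-size cap below are illustrative assumptions, not values from the paper.

```python
eta, batch = 0.1, 128                 # initial learning rate and batch size (illustrative)
decay_epochs, factor = (30, 60, 80), 5
max_batch = 8192                      # hardware/memory limit (assumed)

for epoch in range(90):
    if epoch in decay_epochs:
        if batch * factor <= max_batch:
            batch *= factor           # increase the batch size instead of decaying eta
        else:
            eta /= factor             # fall back to learning rate decay once the batch is capped
    # ... one training epoch with (eta, batch) would run here ...
print(eta, batch)
```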

Page 18

Summary

• SGD has been used as an essential algorithm for deep learning, together with back-propagation.

• Momentum methods improve the performance of gradient descent algorithms.
  • Nesterov's momentum

• Annealing the learning rate is critical for reducing the training loss.
  • Exponential, harmonic, and cyclic decay methods
  • Adaptive learning rate methods (RMSProp, AdaGrad, AdaDelta, Adam, etc.)

• In practice, SGD+momentum shows successful results, outperforming Adam!
  • For example, in NLP (Huang et al., 2017) or machine translation (Wu et al., 2016)

Page 19

• [Nesterov’1983]Nesterov.AmethodofsolvingaconvexprogrammingproblemwithconvergencerateO(1/k^2).1983link:http://mpawankumar.info/teaching/cdt-big-data/nesterov83.pdf

• [Duchi etal2011],“Adaptivesubgradient methodsforonlinelearningandstochasticoptimization”,JMLR2011link:http://www.jmlr.org/papers/volume12/duchi11a/duchi11a.pdf

• [Tieleman’2012]GeoffHinton’sLecture6eofCourseraClasslink:http://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf

• [Zeiler’2012]Zeiler,M.D.(2012).ADADELTA:AnAdaptiveLearningRateMethodlink:https://arxiv.org/pdf/1212.5701.pdf

• [Smith’2015]Smith,LeslieN."Cyclicallearningratesfortrainingneuralnetworks.”link:https://arxiv.org/pdf/1506.01186.pdf

• [Kingma andBa.,2015]Kingma andBa.Adam:Amethodforstochasticoptimization.ICLR2015link:https://arxiv.org/pdf/1412.6980.pdf

• [Dozat’2016]Dozat,T.(2016).IncorporatingNesterov MomentumintoAdam.ICLRWorkshop,link:http://cs229.stanford.edu/proj2015/054_report.pdf

• [Smithetal.,2017]Smith,SamuelL.,Pieter-JanKindermans andQuocV.Le.Don'tDecaytheLearningRate,IncreasetheBatchSize.ICLR2017.link:https://openreview.net/pdf?id=B1Yy1BxCZ

• [Loshchilov etal.,2017]Loshchilov,I.,&Hutter,F.(2017).SGDR:StochasticGradientDescentwithWarmRestarts.ICLR2017.link:https://arxiv.org/pdf/1608.03983.pdf

References

19