
Page 1

Algorithmic Intelligence Lab

EE807: Recent Advances in Deep Learning, Lecture 2

Slide made by Insu Han and Jongheon Jeong, KAIST EE

Stochastic Gradient Descent

Page 2

Table of Contents

1. Introduction
   • Empirical risk minimization (ERM)

2. Gradient Descent Methods
   • Gradient descent (GD)
   • Stochastic gradient descent (SGD)

3. Momentum and Adaptive Learning Rate Methods
   • Momentum methods
   • Learning rate scheduling
   • Adaptive learning rate methods (AdaGrad, RMSProp, Adam)

4. Changing Batch Size
   • Increasing the batch size without learning rate decay

5. Summary

Page 3

Empirical Risk Minimization (ERM)

• Given a training set D = {(x_1, y_1), …, (x_n, y_n)}

• Prediction function f(x; w), parameterized by w

• Empirical risk minimization: find a parameter w that minimizes the loss function

  L(w) = (1/n) Σ_{i=1}^{n} ℓ(f(x_i; w), y_i)

  where ℓ is a loss function, e.g., MSE or cross-entropy.

• For example, f can be a neural network whose parameters w are its weights and biases.

Next, how to solve ERM?
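As a concrete illustration of the ERM objective above, here is a minimal NumPy sketch that evaluates the empirical risk of a prediction function on a training set; the toy data, the linear form of f, and the choice of MSE as the loss ℓ are illustrative assumptions, not taken from the slides.

```python
import numpy as np

# Toy training set {(x_i, y_i)}, i = 1..n (illustrative assumption)
rng = np.random.default_rng(0)
n, d = 100, 5
X = rng.normal(size=(n, d))
y = rng.normal(size=n)

def predict(X, w):
    """Prediction function f(x; w); here a simple linear model (assumption)."""
    return X @ w

def empirical_risk(w, X, y):
    """Empirical risk: the average loss over the n training examples (MSE here)."""
    return np.mean((predict(X, w) - y) ** 2)

print(empirical_risk(np.zeros(d), X, y))
```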

Page 4

Gradient Descent (GD)

• Gradient descent (GD) updates the parameters iteratively by taking a gradient step:

  θ_{t+1} = θ_t − η ∇L(θ_t)

  where θ are the parameters, η is the learning rate, and L is the loss function.

• (+) Converges to the global (local) minimum for convex (non-convex) problems.
• (−) Not efficient with respect to computation time and memory space for huge n.
  • For example, the ImageNet dataset has n = 1,281,167 training images;
    1.2M 256×256 RGB images ≈ 236 GB of memory.

Next, efficient GD
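A minimal sketch of the full-batch GD update θ_{t+1} = θ_t − η ∇L(θ_t) on a toy linear-regression risk; the closed-form gradient, the learning rate, and the step count are illustrative assumptions. Note that every iteration touches all n examples, which is exactly the cost issue raised above.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 5
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true + 0.1 * rng.normal(size=n)

def full_gradient(w):
    """Gradient of the MSE empirical risk over the entire training set."""
    return (2.0 / n) * X.T @ (X @ w - y)

eta = 0.1                      # learning rate (illustrative)
w = np.zeros(d)
for t in range(200):           # each step costs a full pass over all n examples
    w = w - eta * full_gradient(w)
print(np.mean((X @ w - y) ** 2))
```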

Page 5

Stochastic Gradient Descent (SGD)

• Stochastic gradient descent (SGD) uses a few samples (a minibatch B) to approximate the GD update:

  θ_{t+1} = θ_t − η · (1/|B|) Σ_{i∈B} ∇ℓ(f(x_i; θ_t), y_i)

• In practice, common minibatch sizes are 32 / 64 / 128.

• Main practical challenges and current solutions:
  1. SGD can be too noisy and might be unstable → momentum
  2. Hard to find a good learning rate → adaptive learning rates

*source: https://lovesnowbest.site/2018/02/16/Improving-Deep-Neural-Networks-Assignment-2/

Next, momentum
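A minimal sketch of minibatch SGD on a toy linear-regression problem; the minibatch size of 32 matches one of the common values quoted above, while the data, learning rate, and step count are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, batch = 10000, 5, 32     # minibatch size 32 (one of the common choices)
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true + 0.1 * rng.normal(size=n)

eta = 0.05                     # learning rate (illustrative)
w = np.zeros(d)
for t in range(2000):
    idx = rng.choice(n, size=batch, replace=False)          # sample a random minibatch
    g = (2.0 / batch) * X[idx].T @ (X[idx] @ w - y[idx])    # noisy estimate of the full gradient
    w = w - eta * g
print(np.mean((X @ w - y) ** 2))
```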

Page 6

Momentum Methods

1. Momentum gradient descent
• Add decaying previous gradients (momentum):

  v_{t+1} = μ v_t − η ∇L(θ_t),   θ_{t+1} = θ_t + v_{t+1}

  where μ is the momentum (preservation ratio).

• Equivalent to a weighted sum that keeps the fraction μ of the previous update.
• (+) Momentum reduces the oscillation and accelerates the convergence.

[Figure: SGD vs. SGD+momentum trajectories — momentum acts as friction against the vertical fluctuation and as acceleration toward the minimum]
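A minimal sketch of the momentum update v_{t+1} = μ v_t − η ∇L(θ_t), θ_{t+1} = θ_t + v_{t+1}; the ill-conditioned quadratic loss and the values μ = 0.9, η = 0.01 are illustrative assumptions.

```python
import numpy as np

def grad(theta):
    """Gradient of an ill-conditioned quadratic loss L(θ) = 0.5*(θ₁² + 25·θ₂²) (illustrative)."""
    return np.array([1.0, 25.0]) * theta

mu, eta = 0.9, 0.01            # momentum (preservation ratio) and learning rate
theta = np.array([2.0, 2.0])
v = np.zeros(2)
for t in range(300):
    g = grad(theta)
    v = mu * v - eta * g       # decaying sum of previous gradients
    theta = theta + v
print(theta)                   # approaches the minimum at the origin
```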

Page 7

Momentum Methods: Nesterov's Momentum

1. Momentum gradient descent
• Add decaying previous gradients (momentum).

• (−) Momentum can fail to converge even for simple convex optimization problems.
• Nesterov's accelerated gradient (NAG) [Nesterov'1983] uses the gradient at the approximate future position (a "look-ahead" gradient), i.e.,

  v_{t+1} = μ v_t − η ∇L(θ_t + μ v_t),   θ_{t+1} = θ_t + v_{t+1}

  where μ is the momentum (preservation ratio).

Page 8

Momentum Methods: Nesterov's Momentum

1. Momentum gradient descent
• Add decaying previous gradients (momentum).

• Nesterov's accelerated gradient (NAG) [Nesterov'1983] uses the gradient at the approximate future position, i.e., evaluated at θ_t + μ v_t, where μ is the momentum (preservation ratio).

• Quiz: fill in the pseudocode of Nesterov's accelerated gradient (one possible sketch follows below).

[Figure: trajectories of SGD, SGD+momentum, and NAG]
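One possible answer to the quiz, as a minimal sketch: NAG evaluates the gradient at the "look-ahead" point θ_t + μ v_t instead of at θ_t. The toy quadratic loss and the hyperparameter values are illustrative assumptions.

```python
import numpy as np

def grad(theta):
    """Gradient of an illustrative quadratic loss (a stand-in for ∇L)."""
    return np.array([1.0, 25.0]) * theta

mu, eta = 0.9, 0.01            # momentum (preservation ratio) and learning rate
theta = np.array([2.0, 2.0])
v = np.zeros(2)
for t in range(300):
    g = grad(theta + mu * v)   # "look-ahead" gradient at the approximate future position
    v = mu * v - eta * g
    theta = theta + v
print(theta)
```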

Page 9

Adaptive Learning Rate Methods

2. Learning rate scheduling
• The learning rate is critical for minimizing the loss!
  • Too high → may skip over narrow valleys, can diverge
  • Too low  → may fall into local minima, slow convergence

*source: http://cs231n.github.io/neural-networks-3/

Next, learning rate scheduling

Page 10

Adaptive Learning Rate Methods: Learning rate annealing

2. Learning rate scheduling: decay methods
• A naive choice is a constant learning rate.
• Common learning rate schedules include time-based, exponential, and step decay
  (step decay is the most popular in practice).
• "Step decay" decreases the learning rate by a factor every few epochs.
  • Typically, it is initialized to 0.01 and dropped by half every 10 epochs.

[Figure: step decay and exponential decay learning rate schedules, and the corresponding accuracy curves]

*source: https://towardsdatascience.com/
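The three schedules can be coded in a few lines. The exact functional forms are not shown on this page, so the parameterizations below are the usual ones and should be treated as assumptions; the initial rate 0.01 and halving every 10 epochs follow the slide, while the decay rate k is illustrative.

```python
import math

eta0 = 0.01                    # initial learning rate (value quoted on the slide)
k = 0.05                       # decay rate for time-based / exponential decay (assumed)
drop, every = 0.5, 10          # step decay: halve every 10 epochs (as on the slide)

def time_based(epoch):
    return eta0 / (1.0 + k * epoch)

def exponential(epoch):
    return eta0 * math.exp(-k * epoch)

def step(epoch):
    return eta0 * drop ** (epoch // every)

for epoch in (0, 10, 20, 30):
    print(epoch, time_based(epoch), exponential(epoch), step(epoch))
```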

Page 11

Adaptive Learning Rate Methods: Learning rate annealing

2. Learning rate scheduling: cycling method
• [Smith'2015] proposed the cyclical learning rate (triangular).
• Why a "cycling" learning rate?
  • Sometimes, increasing the learning rate is helpful to escape saddle points.
  • It can be combined with exponential decay or periodic decay.

[Figure: cyclical (triangular) learning rate schedule]

*source: https://github.com/bckenstler/CLR
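A minimal sketch of the triangular cyclical schedule in the spirit of [Smith'2015] and the CLR repository linked above; the base_lr, max_lr, and step_size values are illustrative assumptions.

```python
def triangular_lr(it, base_lr=0.001, max_lr=0.006, step_size=2000):
    """Cyclical (triangular) learning rate: rises linearly from base_lr to max_lr
    over step_size iterations, then falls back to base_lr, and repeats."""
    cycle = it // (2 * step_size)
    x = abs(it / step_size - 2 * cycle - 1)   # position within the current cycle, in [0, 1]
    return base_lr + (max_lr - base_lr) * max(0.0, 1.0 - x)

for it in (0, 1000, 2000, 3000, 4000):
    print(it, triangular_lr(it))
```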

Page 12

Adaptive Learning Rate Methods: Learning rate annealing

2. Learning rate scheduling: cycling method
• [Loshchilov'2017] use cosine cycling and restart to the maximum learning rate at each cycle.
• Why "cosine"?
  • It decays slowly during the first half of the cycle and drops quickly during the rest.
  • (+) Can climb down and up the loss surface, and thus can traverse several local minima.
  • (+) Same as restarting at good points with the initial learning rate.

*source: Loshchilov et al., SGDR: Stochastic Gradient Descent with Warm Restarts. ICLR 2017
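A minimal sketch of the cosine annealing schedule used within each SGDR cycle, η_t = η_min + ½(η_max − η_min)(1 + cos(π·T_cur/T_i)); the cycle length and the η range are illustrative, and the fixed cycle length is a simplification (the paper also lets the cycle length grow after each restart).

```python
import math

def sgdr_lr(epoch, eta_min=0.0, eta_max=0.05, T_i=50):
    """Cosine annealing with warm restarts: within each cycle of T_i epochs the learning
    rate follows a half cosine from eta_max down to eta_min, then restarts at eta_max."""
    t_cur = epoch % T_i                        # epochs since the last (warm) restart
    return eta_min + 0.5 * (eta_max - eta_min) * (1 + math.cos(math.pi * t_cur / T_i))

for epoch in (0, 25, 49, 50, 75):
    print(epoch, sgdr_lr(epoch))
```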

Page 13

Adaptive Learning Rate Methods: Learning rate annealing

2. Learning rate scheduling: cycling method
• [Loshchilov'2017] also proposed warm restarts for the cyclical learning rate.
  • *Warm restart: restart frequently in the early iterations.
• (+) It helps to escape saddle points, since optimization is more likely to get stuck in the early iterations.

[Figure: comparison of step decay, cycling with no restart, and cycling with (warm) restart]

But there is no perfect learning rate schedule! It depends on the specific task.

*source: Loshchilov et al., SGDR: Stochastic Gradient Descent with Warm Restarts. ICLR 2017

Next, adaptive learning rate

Page 14

Adaptive Learning Rate Methods: AdaGrad, RMSProp

3. Adaptively changing the learning rate (AdaGrad, RMSProp)
• AdaGrad [Duchi'11] downscales the learning rate by the magnitude of the previous gradients:

  G_t = G_{t−1} + g_t ⊙ g_t   (sum of all previous squared gradients)
  θ_{t+1} = θ_t − η · g_t / (√G_t + ε)

  • (−) The learning rate strictly decreases and becomes too small after many iterations.

• RMSProp [Tieleman'12] instead uses a moving average of the squared gradients:

  G_t = ρ G_{t−1} + (1 − ρ) g_t ⊙ g_t   (ρ: preservation ratio)

• Other variants also exist, e.g., Adadelta [Zeiler'2012].
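A minimal sketch of both updates side by side on a toy quadratic loss; the learning rates, ε, ρ, and the loss itself are illustrative assumptions. The only difference between the two is how G_t is accumulated: AdaGrad sums all squared gradients, RMSProp keeps a moving average.

```python
import numpy as np

def grad(theta):
    """Gradient of an illustrative quadratic loss (a stand-in for the minibatch gradient g_t)."""
    return np.array([1.0, 25.0]) * theta

eps = 1e-8
rho = 0.9                                     # preservation ratio for RMSProp

theta_ada, G_ada = np.array([2.0, 2.0]), np.zeros(2)
theta_rms, G_rms = np.array([2.0, 2.0]), np.zeros(2)

for t in range(500):
    g = grad(theta_ada)
    G_ada += g * g                            # AdaGrad: sum of all previous squared gradients
    theta_ada -= 0.5 * g / (np.sqrt(G_ada) + eps)

    g = grad(theta_rms)
    G_rms = rho * G_rms + (1 - rho) * g * g   # RMSProp: moving average of squared gradients
    theta_rms -= 0.01 * g / (np.sqrt(G_rms) + eps)

print(theta_ada, theta_rms)
```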

Page 15

Adaptive Learning Rate Methods

• Visualization of the algorithms:

[Figures: optimization behavior of the algorithms at a saddle point and at a local optimum]

• The adaptive learning-rate methods, i.e., Adadelta and RMSProp, are the most suitable and provide the best convergence in these scenarios.

*source: animations from Alec Radford's blog

Next, momentum + adaptive learning rate

Page 16

Adaptive Learning Rate Methods: ADAM

3. Combination of momentum and adaptive learning rate
• Adam (ADAptive Moment estimation) [Kingma'2015]:

  m_t = β₁ m_{t−1} + (1 − β₁) g_t          (momentum: running average of gradients)
  v_t = β₂ v_{t−1} + (1 − β₂) g_t ⊙ g_t    (running average of squared gradients)
  θ_{t+1} = θ_t − η · m̂_t / (√v̂_t + ε),  with bias-corrected m̂_t = m_t / (1 − β₁^t), v̂_t = v_t / (1 − β₂^t)

• Can be seen as a momentum + RMSProp update.
• Other variants exist, e.g., Adamax [Kingma'14], Nadam [Dozat'16].

*source: Kingma and Ba. Adam: A method for stochastic optimization. ICLR 2015
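A minimal sketch of the Adam update with its bias correction on a toy quadratic loss; β₁ = 0.9, β₂ = 0.999, and ε = 1e−8 are the defaults suggested in the paper, while the learning rate, loss, and step count are illustrative assumptions.

```python
import numpy as np

def grad(theta):
    """Gradient of an illustrative quadratic loss (a stand-in for the minibatch gradient g_t)."""
    return np.array([1.0, 25.0]) * theta

eta, beta1, beta2, eps = 0.1, 0.9, 0.999, 1e-8
theta = np.array([2.0, 2.0])
m = np.zeros(2)                               # first moment: momentum-like average of gradients
v = np.zeros(2)                               # second moment: average of squared gradients

for t in range(1, 501):
    g = grad(theta)
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g * g
    m_hat = m / (1 - beta1 ** t)              # bias correction for the zero initialization
    v_hat = v / (1 - beta2 ** t)
    theta = theta - eta * m_hat / (np.sqrt(v_hat) + eps)
print(theta)
```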

Page 17

Decaying the Learning Rate = Increasing the Batch Size

• In practice, SGD+momentum and Adam work well in many applications.

• But scheduling the learning rate is still critical! (it should be decayed appropriately)

• [Smith'2017] shows that decaying the learning rate = increasing the batch size (see the sketch below).
  • (+) A large batch size allows fewer parameter updates, leading to parallelism!

*source: Smith et al., "Don't Decay the Learning Rate, Increase the Batch Size.", ICLR 2018
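A minimal sketch of the idea in [Smith'2017]: wherever a schedule would divide the learning rate by a factor, the batch size is instead multiplied by that factor (until memory runs out), keeping the SGD noise scale roughly constant. The epochs, factor, and batch-size cap below are illustrative assumptions, not values from the paper.

```python
eta, batch = 0.1, 128                 # initial learning rate and batch size (illustrative)
decay_epochs, factor = (30, 60, 80), 5
max_batch = 8192                      # hardware/memory limit (assumed)

for epoch in range(90):
    if epoch in decay_epochs:
        if batch * factor <= max_batch:
            batch *= factor           # increase the batch size instead of decaying eta
        else:
            eta /= factor             # fall back to learning rate decay once the batch is capped
    # ... one training epoch with (eta, batch) would run here ...
print(eta, batch)
```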

Page 18

Summary

• SGD has been used as an essential algorithm for deep learning, together with back-propagation.

• Momentum methods improve the performance of gradient descent algorithms.
  • Nesterov's momentum

• Annealing the learning rate is critical for reducing the training loss.
  • Exponential, harmonic, and cyclic decay methods
  • Adaptive learning rate methods (RMSProp, AdaGrad, AdaDelta, Adam, etc.)

• In practice, SGD+momentum shows successful results, outperforming Adam!
  • For example, in NLP (Huang et al., 2017) or machine translation (Wu et al., 2016)

Page 19

• [Nesterov’1983]Nesterov.AmethodofsolvingaconvexprogrammingproblemwithconvergencerateO(1/k^2).1983link:http://mpawankumar.info/teaching/cdt-big-data/nesterov83.pdf

• [Duchi etal2011],“Adaptivesubgradient methodsforonlinelearningandstochasticoptimization”,JMLR2011link:http://www.jmlr.org/papers/volume12/duchi11a/duchi11a.pdf

• [Tieleman’2012]GeoffHinton’sLecture6eofCourseraClasslink:http://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf

• [Zeiler’2012]Zeiler,M.D.(2012).ADADELTA:AnAdaptiveLearningRateMethodlink:https://arxiv.org/pdf/1212.5701.pdf

• [Smith’2015]Smith,LeslieN."Cyclicallearningratesfortrainingneuralnetworks.”link:https://arxiv.org/pdf/1506.01186.pdf

• [Kingma andBa.,2015]Kingma andBa.Adam:Amethodforstochasticoptimization.ICLR2015link:https://arxiv.org/pdf/1412.6980.pdf

• [Dozat’2016]Dozat,T.(2016).IncorporatingNesterov MomentumintoAdam.ICLRWorkshop,link:http://cs229.stanford.edu/proj2015/054_report.pdf

• [Smithetal.,2017]Smith,SamuelL.,Pieter-JanKindermans andQuocV.Le.Don'tDecaytheLearningRate,IncreasetheBatchSize.ICLR2017.link:https://openreview.net/pdf?id=B1Yy1BxCZ

• [Loshchilov etal.,2017]Loshchilov,I.,&Hutter,F.(2017).SGDR:StochasticGradientDescentwithWarmRestarts.ICLR2017.link:https://arxiv.org/pdf/1608.03983.pdf

References

19