
Stochastic Gradient Descent Algorithm & Its Tuning

Mohamed, Qadri

SUMMARY

This paper discusses optimization algorithms used for big data applications. We start by explaining the gradient descent algorithm and its limitations. Later we delve into the stochastic gradient descent algorithm and explore methods to improve it by adjusting learning rates.

GRADIENT DESCENT

Gradient descent is a first-order optimization algorithm. To find a local minimum of a function using gradient descent, one takes steps proportional to the negative of the gradient (or of the approximate gradient) of the function at the current point. If instead one takes steps proportional to the positive of the gradient, one approaches a local maximum of that function; the procedure is then known as gradient ascent.

Gradient descent is also known as steepest descent, or the method of steepest descent. Gradient descent should not be confused with the method of steepest descent for approximating integrals.

Using the Gradient Descent (GD) optimization algorithm, the weights are updated incrementally after each epoch (= pass over the training dataset). The cost function J(⋅), the sum of squared errors (SSE), can be written as:

J(w) = ½ Σi (y(i) − ŷ(i))²

The magnitude and direction of the weight update is computed by taking a step in the opposite direction of the cost gradient,

Δwj = −η ∂J/∂wj,

where η is the learning rate. The weights are then updated after each epoch via the following update rule:

w := w + Δw

where Δw is a vector that contains the weight updates of each weight coefficient w, which are computed as follows:

Δwj = −η ∂J/∂wj = η Σi (y(i) − ŷ(i)) x(i)j

Essentially, we can picture GD optimization as a hiker (the weight coefficient) who wants to climb down a mountain (cost function) into a valley (cost minimum); each step is determined by the steepness of the slope (gradient) and the leg length of the hiker (learning rate). For a cost function with only a single weight coefficient, the picture is simply a hiker descending a one-dimensional error curve.
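To make the update rule concrete, here is a minimal NumPy sketch of GD on a one-feature linear model; the data, learning rate, and epoch count are illustrative choices, not values from this paper:

import numpy as np

X = np.array([0.0, 1.0, 2.0, 3.0])   # toy inputs
y = np.array([1.0, 3.0, 5.0, 7.0])   # toy targets (y = 2x + 1)
w, b, eta = 0.0, 0.0, 0.05           # weight, bias, learning rate

for epoch in range(200):
    y_hat = w * X + b                     # model output
    dw = -np.sum((y - y_hat) * X)         # ∂J/∂w for J = ½ Σ (y − ŷ)²
    db = -np.sum(y - y_hat)               # ∂J/∂b
    w, b = w - eta * dw, b - eta * db     # w := w + Δw with Δw = −η ∂J/∂w

print(w, b)  # approaches w ≈ 2, b ≈ 1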

GRADIENT DESCENT VARIANTS

There are three variants of gradient descent, which differ in how much data we use to compute the gradient of the objective function. Depending on the amount of data, we make a trade-off between the accuracy of the parameter update and the time it takes to perform an update.


Batch Gradient Descent

Vanilla gradient descent, aka batch gradient descent, computes the gradient of the cost function w.r.t. the parameters θ for the entire training dataset:

θ = θ − η ⋅ ∇θJ(θ)

As we need to calculate the gradients for the whole dataset to perform just one update, batch gradient descent can be very slow and is intractable for datasets that don't fit in memory. Batch gradient descent also doesn't allow us to update our model online, i.e. with new examples on-the-fly.

In code, batch gradient descent looks something like this:

for i in range(nb_epochs):
    params_grad = evaluate_gradient(loss_function, data, params)
    params = params - learning_rate * params_grad

For a pre-defined number of epochs, we first compute the gradient vector params_grad of the loss function for the whole dataset w.r.t. our parameter vector params. We then update our parameters in the direction of the gradients, with the learning rate determining how big of an update we perform. Batch gradient descent is guaranteed to converge to the global minimum for convex error surfaces and to a local minimum for non-convex surfaces.

Stochastic Gradient Descent

Stochastic gradient descent (SGD), in contrast, performs a parameter update for each training example x(i) and label y(i):

θ = θ − η ⋅ ∇θJ(θ; x(i); y(i))

Batch gradient descent performs redundant computations for large datasets, as it recomputes gradients for similar examples before each parameter update. SGD does away with this redundancy by performing one update at a time. It is therefore usually much faster and can also be used to learn online. SGD performs frequent updates with a high variance that cause the objective function to fluctuate heavily.

While batch gradient descent converges to the minimum of the basin the parameters are placed in, SGD's fluctuation, on the one hand, enables it to jump to new and potentially better local minima. On the other hand, this ultimately complicates convergence to the exact minimum, as SGD will keep overshooting. However, it has been shown that when we slowly decrease the learning rate, SGD shows the same convergence behavior as batch gradient descent, almost certainly converging to a local or the global minimum for non-convex and convex optimization respectively.

Its code fragment simply adds a loop over the training examples and evaluates the gradient w.r.t. each example:

for i in range(nb_epochs):
    np.random.shuffle(data)
    for example in data:
        params_grad = evaluate_gradient(loss_function, example, params)
        params = params - learning_rate * params_grad

In both gradient descent (GD) and stochastic gradient descent (SGD), you update a set of parameters in an iterative manner to minimize an error function.

While in GD you have to run through ALL the samples in your training set to do a single update for a parameter in a particular iteration, in SGD you use ONLY ONE training sample from your training set to do the update for a parameter in a particular iteration.

Thus, if the number of training samples is large, in fact very large, then using gradient descent may take too long, because in every iteration in which you are updating the values of the parameters, you are running through the complete training set. On the other hand, using SGD will be faster, because you use only one training sample and it starts improving itself right away from the first sample.

SGD often converges much faster compared to GD, but the error function is not as well minimized as in the case of GD. In most cases, the close approximation that you get in SGD for the parameter values is enough, because the parameters reach the optimal values and keep oscillating there.

There are several different flavors of SGD, which can all be seen throughout the literature. Let's take a look at the three most common variants:

A)
• randomly shuffle samples in the training set
• for one or more epochs, or until the approx. cost minimum is reached:
• for each training sample i: compute gradients and perform weight updates

B)
• for one or more epochs, or until the approx. cost minimum is reached:
• randomly shuffle samples in the training set
• for each training sample i: compute gradients and perform weight updates

C)
• for iterations t, or until the approx. cost minimum is reached:
• draw a random sample from the training set
• compute gradients and perform weight updates

In scenario A, we shuffle the training set only one time in the beginning, whereas in scenario B we shuffle the training set after each epoch to prevent repeating update cycles. In both scenario A and scenario B, each training sample is only used once per epoch to update the model weights. In scenario C, we draw the training samples randomly with replacement from the training set. If the number of iterations t is equal to the number of training samples, we learn the model based on a bootstrap sample of the training set.

In the gradient descent method, one computes the direction that decreases the objective function the most (in the case of minimization problems). But sometimes this can be quite costly. In most machine learning settings, for example, the objective function is the cumulative sum of the error over the training examples, and the training set might be very large, so computing the actual gradient would be computationally expensive.

In the stochastic gradient (descent) method, we compute an estimate or approximation to this direction. The simplest way is to just look at one training example (or a subset of training examples) and compute the direction to move based only on this approximation. It is called stochastic because the approximate direction computed at every step can be thought of as a random variable of a stochastic process. This is mainly used in showing the convergence of this algorithm.

Recent theoretical results, however, show that the runtime to get to some desired optimization accuracy does not increase as the training set size increases.

Stochastic gradient descent is sensitive to feature scaling, so it is highly recommended to scale your data. For example, scale each attribute of the input vector X to [0, 1] or [-1, +1], or standardize it to have mean 0 and variance 1. Note that the same scaling must be applied to the test vector to obtain meaningful results.
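A minimal sketch of this scaling step with scikit-learn's StandardScaler; the arrays here are toy data for illustration:

import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1., 10.], [2., 20.], [3., 30.]])  # toy training data
X_test = np.array([[2., 25.]])                          # toy test data

scaler = StandardScaler()                  # standardize to mean 0, variance 1
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)   # apply the SAME scaling to test data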

Empirically, we found that SGD converges after observing approximately 10^6 training samples. Thus, a reasonable first guess for the number of iterations is n_iter = np.ceil(10**6 / n), where n is the size of the training set.

If you apply SGD to features extracted using PCA, we found that it is often wise to scale the feature values by some constant c such that the average L2 norm of the training data equals one.

We found that averaged SGD works best with a larger number of features and a higher eta0.

Here, the term "stochastic" comes from the fact that the gradient based on a single training sample is a "stochastic approximation" of the "true" cost gradient. Due to its stochastic nature, the path towards the global cost minimum is not "direct" as in GD, but may go "zig-zag" if we are visualizing the cost surface in a 2D space. However, it has been shown that SGD almost surely converges to the global cost minimum if the cost function is convex (or pseudo-convex).

There might be many reasons, but one reason why SGD is preferred in machine learning is that it helps the algorithm skip some local minima. Though this is not a theoretically sound reason in my opinion, the optimal points computed using SGD are often empirically better than those obtained with the GD method. SGD is just one type of online learning algorithm. There are many other online learning algorithms that may not depend on gradients (e.g., the perceptron algorithm, Bayesian inference, etc.).

Mini-Batch Gradient Descent

Mini-batch gradient descent finally takes the best of both worlds and performs an update for every mini-batch of n training examples:

θ = θ − η ⋅ ∇θJ(θ; x(i:i+n); y(i:i+n))

This way, it a) reduces the variance of the parameter updates, which can lead to more stable convergence; and b) can make use of the highly optimized matrix operations common to state-of-the-art deep learning libraries, which make computing the gradient w.r.t. a mini-batch very efficient. Common mini-batch sizes range between 50 and 256, but can vary for different applications. Mini-batch gradient descent is typically the algorithm of choice when training a neural network, and the term SGD is usually employed also when mini-batches are used. Note: in the modifications of SGD in the rest of this paper, we leave out the parameters x(i:i+n); y(i:i+n) for simplicity.
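In the style of the earlier fragments (evaluate_gradient, data, and params are the same placeholders as before), mini-batch SGD might look like this:

def get_batches(data, batch_size=50):
    # Yield successive mini-batches from the (already shuffled) data.
    for start in range(0, len(data), batch_size):
        yield data[start:start + batch_size]

for i in range(nb_epochs):
    np.random.shuffle(data)
    for batch in get_batches(data, batch_size=50):
        params_grad = evaluate_gradient(loss_function, batch, params)
        params = params - learning_rate * params_grad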

Challenges

Vanilla mini-batch gradient descent, however, does not guarantee good convergence, and it poses a few challenges that need to be addressed:

Choosing a proper learning rate can be difficult. A learning rate that is too small leads to painfully slow convergence, while a learning rate that is too large can hinder convergence and cause the loss function to fluctuate around the minimum or even to diverge.

Learning rate schedules try to adjust the learning rate during training by e.g. annealing, i.e. reducing the learning rate according to a pre-defined schedule or when the change in objective between epochs falls below a threshold. These schedules and thresholds, however, have to be defined in advance and are thus unable to adapt to a dataset's characteristics.
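For illustration, a simple pre-defined step-decay schedule; the halving factor and the 10-epoch interval are arbitrary choices, not values from this paper:

def step_decay(initial_lr, epoch, drop=0.5, epochs_per_drop=10):
    # Halve the learning rate every `epochs_per_drop` epochs.
    return initial_lr * (drop ** (epoch // epochs_per_drop))

for epoch in range(nb_epochs):
    learning_rate = step_decay(0.1, epoch)
    # ... perform the parameter updates for this epoch with learning_rate ...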

Additionally, the same learning rate applies to all parameter updates. If our data is sparse and our features have very different frequencies, we might not want to update all of them to the same extent, but rather perform a larger update for rarely occurring features.

Another key challenge of minimizing the highly non-convex error functions common for neural networks is avoiding getting trapped in their numerous suboptimal local minima. Some scientists argue that the difficulty arises in fact not from local minima but from saddle points, i.e. points where one dimension slopes up and another slopes down. These saddle points are usually surrounded by a plateau of the same error, which makes it notoriously hard for SGD to escape, as the gradient is close to zero in all dimensions.

There are some algorithms that are widely used by the deep learning community to deal with the aforementioned challenges. Below are some of them.

Momentum

SGD has trouble navigating ravines, i.e. areas where the surface curves much more steeply in one dimension than in another, which are common around local optima. In these scenarios, SGD oscillates across the slopes of the ravine while only making hesitant progress along the bottom towards the local optimum, as in Image 2.


Image 2: SGD without momentum

Image 3: SGD with momentum

Momentum is a method that helps accelerate SGD in the relevant direction and dampens oscillations, as can be seen in Image 3. It does this by adding a fraction γ of the update vector of the past time step to the current update vector:

vt = γ vt−1 + η ∇θJ(θ)
θ = θ − vt

Essentially, when using momentum, we push a ball down a hill. The ball accumulates momentum as it rolls downhill, becoming faster and faster on the way (until it reaches its terminal velocity, if there is air resistance, i.e. γ < 1). The same thing happens to our parameter updates: the momentum term increases for dimensions whose gradients point in the same directions and reduces updates for dimensions whose gradients change directions. As a result, we gain faster convergence and reduced oscillation.
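A minimal NumPy-style sketch of these two equations; grad(params) is a placeholder for the gradient computation (like evaluate_gradient earlier), and γ = 0.9 is a common choice:

gamma = 0.9                        # momentum coefficient γ
v = np.zeros_like(params)          # past update vector v_{t-1}
for t in range(nb_steps):
    g = grad(params)               # ∇θJ(θ)
    v = gamma * v + learning_rate * g
    params = params - v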

Nesterov Accelerated Gradient

However, a ball that rolls down a hill, blindly following the slope, is highly unsatisfactory. We'd like to have a smarter ball, a ball that has a notion of where it is going so that it knows to slow down before the hill slopes up again.

Nesterov accelerated gradient (NAG) is a way to give our momentum term this kind of prescience. We know that we will use our momentum term γ vt−1 to move the parameters θ. Computing θ − γ vt−1 thus gives us an approximation of the next position of the parameters (the gradient is missing for the full update), a rough idea of where our parameters are going to be. We can now effectively look ahead by calculating the gradient not w.r.t. our current parameters θ but w.r.t. the approximate future position of our parameters:

vt = γ vt−1 + η ∇θJ(θ − γ vt−1)
θ = θ − vt

Again, we set the momentum term γ to a value of around 0.9. While momentum first computes the current gradient (small blue vector in Image 4) and then takes a big jump in the direction of the updated accumulated gradient (big blue vector), NAG first makes a big jump in the direction of the previous accumulated gradient (brown vector), measures the gradient, and then makes a correction (green vector). This anticipatory update prevents us from going too fast and results in increased responsiveness, which has significantly increased the performance of RNNs on a number of tasks.

Image 4: Momentum update (blue vectors) vs. NAG update (brown and green vectors)
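Under the same placeholder assumptions as the momentum sketch above, the look-ahead version changes only where the gradient is evaluated:

v = np.zeros_like(params)
for t in range(nb_steps):
    g = grad(params - gamma * v)   # gradient at the approximate future position
    v = gamma * v + learning_rate * g
    params = params - v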

Adagrad

Adagrad is an algorithm for gradient-based optimization that does just this: it adapts the learning rate to the parameters, performing larger updates for infrequent and smaller updates for frequent parameters. For this reason, it is well-suited for dealing with sparse data. Adagrad has been found to greatly improve the robustness of SGD, and it was used for training large-scale neural nets at Google, which -- among other things -- learned to recognize cats in YouTube videos.

Previously, we performed an update for all parameters θ at once, as every parameter θi used the same learning rate η. Adagrad instead uses a different learning rate for every parameter θi at every time step t, so we show Adagrad's per-parameter update. For brevity, we set gt,i to be the gradient of the objective function w.r.t. the parameter θi at time step t:

gt,i = ∇θJ(θt,i)

Adagrad then scales the learning rate for each parameter by the accumulated squared gradients:

θt+1,i = θt,i − (η / √(Gt,ii + ε)) ⋅ gt,i

where Gt,ii is the sum of the squares of the past gradients w.r.t. θi, and ε is a smoothing term that avoids division by zero.
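A sketch of this rule in NumPy, applied element-wise to the whole parameter vector (grad is the same placeholder as before):

eps = 1e-8                         # smoothing term to avoid division by zero
G = np.zeros_like(params)          # running sum of squared gradients
for t in range(nb_steps):
    g = grad(params)
    G = G + g ** 2
    params = params - learning_rate * g / (np.sqrt(G) + eps)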

One of Adagrad's main benefits is that it eliminates the need to manually tune the learning rate. Most implementations use a default value of 0.01 and leave it at that.

Adagrad's main weakness is its accumulation of the squared gradients in the denominator: since every added term is positive, the accumulated sum keeps growing during training. This in turn causes the learning rate to shrink and eventually become infinitesimally small, at which point the algorithm is no longer able to acquire additional knowledge. The following algorithms aim to resolve this flaw.

Adadelta

Adadelta is an extension of Adagrad that seeks to reduce its aggressive, monotonically decreasing learning rate. Instead of accumulating all past squared gradients, Adadelta restricts the window of accumulated past gradients to some fixed size w.

Instead of inefficiently storing w previous squared gradients, the sum of gradients is recursively defined as a decaying average of all past squared gradients. The running average E[g²]t at time step t then depends (as a fraction γ, similarly to the momentum term) only on the previous average and the current gradient:

E[g²]t = γ E[g²]t−1 + (1 − γ) g²t

RMSprop

RMSprop is an unpublished, adaptive learning rate method proposed by Geoff Hinton.

RMSprop and Adadelta were both developed independently around the same time, stemming from the need to resolve Adagrad's radically diminishing learning rates. RMSprop is in fact identical to the first update vector of Adadelta given above:

E[g²]t = 0.9 E[g²]t−1 + 0.1 g²t
θt+1 = θt − (η / √(E[g²]t + ε)) ⋅ gt

RMSprop as well divides the learning rate by an exponentially decaying average of squared gradients. Hinton suggests γ be set to 0.9, while a good default value for the learning rate η is 0.001.
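A sketch of this update with Hinton's suggested γ = 0.9 (grad remains a placeholder for the gradient computation):

gamma, eps = 0.9, 1e-8
Eg2 = np.zeros_like(params)        # decaying average E[g²]
for t in range(nb_steps):
    g = grad(params)
    Eg2 = gamma * Eg2 + (1 - gamma) * g ** 2
    params = params - learning_rate * g / (np.sqrt(Eg2) + eps)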

Adam

Adaptive Moment Estimation (Adam) is another method that computes adaptive learning rates for each parameter. In addition to storing an exponentially decaying average of past squared gradients vt like Adadelta and RMSprop, Adam also keeps an exponentially decaying average of past gradients mt, similar to momentum:

mt = β1 mt−1 + (1 − β1) gt

vt = β2 vt−1 + (1 − β2) g²t

mt and vt are estimates of the first moment (the mean) and the second moment (the uncentered variance) of the gradients respectively, hence the name of the method. As mt and vt are initialized as vectors of 0's, the authors of Adam observe that they are biased towards zero, especially during the initial time steps, and especially when the decay rates are small (i.e. β1 and β2 are close to 1). They counteract these biases by computing bias-corrected estimates,

m̂t = mt / (1 − β1^t)
v̂t = vt / (1 − β2^t),

which are then used in the update rule:

θt+1 = θt − (η / (√v̂t + ε)) ⋅ m̂t
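A sketch combining the moment estimates with the bias corrections above; the hyperparameter values are the commonly cited defaults, and grad is the same placeholder as in the earlier fragments:

m = np.zeros_like(params)
v = np.zeros_like(params)
beta1, beta2, eps = 0.9, 0.999, 1e-8
for t in range(1, nb_steps + 1):
    g = grad(params)
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g ** 2
    m_hat = m / (1 - beta1 ** t)   # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)   # bias-corrected second moment
    params = params - learning_rate * m_hat / (np.sqrt(v_hat) + eps)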

ADDITIONAL STRATEGIES FOR OPTIMIZING SGD

Finally, we introduce additional strategies that can be used alongside any of the previously mentioned algorithms to further improve the performance of SGD.

Shuffling and Curriculum Learning

Generally, we want to avoid providing the training examples in a meaningful order to our model, as this may bias the optimization algorithm. Consequently, it is often a good idea to shuffle the training data after every epoch.

On the other hand, for some cases where we aim to solve progressively harder problems, supplying the training examples in a meaningful order may actually lead to improved performance and better convergence. The method for establishing this meaningful order is called curriculum learning.

Batch Normalization

To facilitate learning, we typically normalize the initial values of our parameters by initializing them with zero mean and unit variance. As training progresses and we update parameters to different extents, we lose this normalization, which slows down training and amplifies changes as the network becomes deeper.

Batch normalization re-establishes these normalizations for every mini-batch, and changes are back-propagated through the operation as well. By making normalization part of the model architecture, we are able to use higher learning rates and pay less attention to the initialization parameters. Batch normalization additionally acts as a regularizer, reducing (and sometimes even eliminating) the need for dropout.

Early Stopping

You should always monitor the error on a validation set during training and stop (with some patience) if your validation error does not improve enough.
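A sketch of such a monitor, where train_one_epoch and evaluate are hypothetical helpers and a patience of 5 epochs is an arbitrary choice:

best_val_error, patience, wait = float("inf"), 5, 0
for epoch in range(nb_epochs):
    train_one_epoch(model)                 # hypothetical: one pass of SGD updates
    val_error = evaluate(model, val_data)  # hypothetical: error on the validation set
    if val_error < best_val_error:
        best_val_error, wait = val_error, 0
    else:
        wait += 1
        if wait >= patience:               # no improvement for `patience` epochs
            break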

Gradient Noise

Adding noise makes networks more robust to poor initialization and helps training particularly deep and complex networks. It is suspected that the added noise gives the model more chances to escape and find new local minima, which are more frequent for deeper models.
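One scheme from the literature adds Gaussian noise to each gradient and anneals its variance over time, e.g. σ²t = η / (1 + t)^γ. A minimal sketch, where eta_noise and gamma_noise are the noise hyperparameters (distinct from the learning rate and the momentum term above) and grad is the same placeholder as before:

eta_noise, gamma_noise = 0.01, 0.55   # illustrative noise hyperparameters
for t in range(nb_steps):
    g = grad(params)
    sigma2 = eta_noise / (1 + t) ** gamma_noise   # annealed noise variance
    g = g + np.random.normal(0.0, np.sqrt(sigma2), size=g.shape)
    params = params - learning_rate * g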

We have now investigated the algorithms that are most commonly used for optimizing SGD: momentum, Nesterov accelerated gradient, Adagrad, Adadelta, RMSprop, and Adam. Finally, we've considered other strategies to improve SGD, such as shuffling and curriculum learning, batch normalization, and early stopping.

SGD has been successfully applied to the large-scale and sparse machine learning problems often encountered in text classification and natural language processing. Given that the data is sparse, the scikit-learn classifiers discussed below easily scale to problems with more than 10^5 training examples and more than 10^5 features.

Advantages of Stochastic Gradient Descent

• Efficiency.
• Ease of implementation (lots of opportunities for code tuning).

Disadvantages of Stochastic Gradient Descent

• SGD requires a number of hyperparameters, such as the regularization parameter and the number of iterations.

• SGD is sensitive to feature scaling.

• The class SGDClassifier in sklearn implements a plain stochastic gradient descent learning routine which supports different loss functions and penalties for classification.

• The class SGDRegressor implements a plain stochastic gradient descent learning routine which supports different loss functions and penalties to fit linear regression models. SGDRegressor is well suited for regression problems with a large number of training samples (> 10,000); for other problems we recommend Ridge, Lasso, or ElasticNet.
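For example, a minimal SGDClassifier usage; the data here is a toy illustration, not from this paper:

from sklearn.linear_model import SGDClassifier

X = [[0., 0.], [1., 1.]]   # two toy training samples
y = [0, 1]                 # their class labels
clf = SGDClassifier(loss="hinge", penalty="l2", max_iter=1000)
clf.fit(X, y)
print(clf.predict([[2., 2.]]))  # classify a new sample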

The major advantage of SGD is its efficiency, which is basically linear in the number of training examples. If X is a matrix of size (n, p), training has a cost of O(k · n · p̄), where k is the number of iterations (epochs) and p̄ is the average number of non-zero attributes per sample.

Application

SGD algorithms can therefore be used in scenarios where it is expensive to iterate over the entire dataset several times. The algorithm is ideal for streaming analytics, where we can discard the data after processing it. We can use modified versions of SGD, such as mini-batch GD, with adjusted learning rates to get better results, as seen above. The algorithm can also be used to optimize the cost functions of several classification and regression techniques.

SGD in a real-time scenario learns continuously as and when the data comes in, somewhat like reinforcement learning using feedback. An example of this can be a real-time system where it is judged, on the basis of some parameters, whether a customer is genuine or not. Based on historical data, the first set of parameters is learnt. As a new customer comes in with a new set of features, it falls into one of the two groups: genuine or not genuine. Based on this completed classification, the prior parameters of the system are updated using the features that the customer brought in. Over time the system becomes much smarter than what it started with. Apple's Siri and Microsoft's Cortana assistants are also examples of real-time learning, correcting themselves on the go.


Conclusion

In conclusion, we saw that to manage large-scale data we can use variants of gradient descent: batch gradient descent, stochastic gradient descent (SGD), and mini-batch gradient descent, in order of improvement over the previous one. Each of these differs in how much data we use to compute the gradient of the objective function. Given the ubiquity of large-scale data solutions and the availability of low-cost commodity clusters, distributing SGD to speed it up further is an obvious choice. SGD by itself is inherently sequential: step by step, we progress further towards the minimum. Running it provides good convergence but can be slow, particularly on large datasets. In contrast, running SGD asynchronously is faster, but suboptimal communication between workers can lead to poor convergence. Additionally, we can also parallelize SGD on one machine without the need for a large computing cluster.

Depending on the amount of data, we make a trade-off between the accuracy of the parameter update and the time it takes to perform an update. To overcome the challenge of choosing the learning rate for parameter updates, we looked at various methods that address this problem: momentum, Nesterov accelerated gradient, Adagrad, Adadelta, RMSprop, and Adam. In summary, RMSprop is an extension of Adagrad that deals with its radically diminishing learning rates. It is identical to Adadelta, except that Adadelta uses the RMS of parameter updates in the numerator of its update rule. Adam, finally, adds bias-correction and momentum to RMSprop. RMSprop, Adadelta, and Adam are thus very similar algorithms that do well in similar circumstances. Its bias-correction helps Adam slightly outperform RMSprop towards the end of optimization as gradients become sparser. As such, Adam might be the best overall choice.

Interestingly, many recent papers use vanilla SGD without momentum and a simple learning rate annealing schedule. As has been shown, SGD usually manages to find a minimum, but it might take significantly longer than with some of the optimizers, is much more reliant on a robust initialization and annealing schedule, and may get stuck in saddle points rather than local minima. Consequently, if you care about fast convergence and train a deep or complex neural network, you should choose one of the adaptive learning rate methods.
