Asynchronous Methods for Deep Reinforcement Learning
Paper by Volodymyr Mnih, Adrià Puigdomènech Badia, Mehdi Mirza, Alex Graves, Timothy P. Lillicrap, Tim Harley, David Silver, Koray Kavukcuoglu
Presented by: Pihel Saatmann
Reinforcement learning
• State – a "snapshot" of the environment
• Action – leads to a new state and sometimes a reward
• Reward – time-delayed, sparse
• Policy – the rules for choosing an action
So far
• It was thought that online RL algorithms with deep neural networks are unstable.
• The problems: correlated and non-stationary input data.
• To counter these problems, data can be stored in an experience replay memory.
• This uses more memory and computational power.
• Deep RL methods have required specialized hardware (GPUs) or massive distributed architectures.
Q-learning
• At each time step t, the agent receives a state s_t and selects an action a_t according to its policy π. The agent then receives the next state s_{t+1} and a scalar reward r_t.
• The goal is to maximize the expected return from each state s_t.
• The Q function estimates the value of an action.
• Each time the agent takes an action, the Q value is updated (see the sketch below).
• Off-policy method – updating the Q function does not depend on the policy.
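As a reminder of the update these slides build on, here is a minimal tabular sketch of the Q-learning rule (the paper approximates Q with a neural network; the table, sizes, and hyperparameters below are illustrative assumptions):

```python
import numpy as np

n_states, n_actions = 16, 4
Q = np.zeros((n_states, n_actions))      # tabular stand-in for Q(s, a; Θ)
alpha, gamma = 0.1, 0.99                 # learning rate, discount factor

def q_update(s, a, r, s_next, done):
    # Off-policy target: bootstrap from the greedy action in s_next,
    # regardless of which policy actually produced the transition.
    target = r if done else r + gamma * Q[s_next].max()
    Q[s, a] += alpha * (target - Q[s, a])
```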
Asynchronous RL framework
• Instead of experience replay, they asynchronously execute multiple agents in parallel on multiple instances of the environment.
• Parallel actor-learners have a stabilizing effect on training.
• Runs on a single machine with a standard multi-core CPU (see the sketch below).
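A minimal sketch of the single-machine setup, assuming hypothetical helpers make_model() (builds the shared network) and actor_learner(shared_model, env) (one worker's training loop; a fuller version is sketched on a later slide). With Python threads the model parameters are naturally shared, so each worker's updates are immediately visible to the others:

```python
import threading

def train_async(make_model, make_env, actor_learner, n_threads=16):
    shared_model = make_model()          # one set of parameters for all workers
    threads = [
        threading.Thread(target=actor_learner,
                         args=(shared_model, make_env()))
        for _ in range(n_threads)        # each worker gets its own environment
    ]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return shared_model
```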
Asynchronous RL framework II
• Asynchronous variants of four standard RL algorithms:
  • 1-step Q-learning
  • n-step Q-learning
  • 1-step Sarsa
  • Advantage actor-critic (A3C)
1-step Q-learning
• A neural network is used to approximate the Q(s, a; Θ) function.
• The parameters (weights) Θ are learned by iteratively minimizing a sequence of loss functions (see the code sketch below), where the i-th loss function is defined as:
L_i(Θ_i) = E[(r + γ max_{a′} Q(s′, a′; Θ_{i−1}) − Q(s, a; Θ_i))²]
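A sketch of this loss in code, assuming q_net holds the current parameters Θ_i and target_net the older, frozen parameters Θ_{i−1}, both mapping a batch of states to per-action Q-values (tensor shapes and names are assumptions):

```python
import torch
import torch.nn.functional as F

def q_learning_loss(q_net, target_net, s, a, r, s_next, done, gamma=0.99):
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)   # Q(s, a; Θ_i)
    with torch.no_grad():                                  # no gradient through the target
        best_next = target_net(s_next).max(dim=1).values   # max_a′ Q(s′, a′; Θ_{i−1})
        target = r + gamma * (1.0 - done) * best_next      # done masks terminal states
    return F.mse_loss(q_sa, target)                        # squared TD error
```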
Async 1-step Q-learning
• Each thread has its own copy of the environment.
• At each step, it computes a gradient of the Q-learning loss.
• Gradients are accumulated over multiple time steps before being applied (see the sketch below).
• A shared and slowly changing target network is used.
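A sketch of one actor-learner thread (Algorithm 1 in the paper), reusing the q_learning_loss above and assuming a Gym-style env, a shared optimizer opt, and a hypothetical ε-greedy helper select_action; batching and tensor conversion are omitted for brevity:

```python
def actor_learner(q_net, target_net, env, opt, select_action,
                  I_update=5, I_target=10_000, T_max=1_000_000):
    t, s = 0, env.reset()
    loss_acc = 0.0
    while t < T_max:
        a = select_action(q_net, s)        # thread-specific exploration ε
        s_next, r, done, _ = env.step(a)
        # Accumulate the loss (and hence its gradient) over several steps.
        loss_acc = loss_acc + q_learning_loss(q_net, target_net,
                                              s, a, r, s_next, done)
        s = env.reset() if done else s_next
        t += 1
        if done or t % I_update == 0:
            opt.zero_grad()
            loss_acc.backward()            # gradient w.r.t. the shared parameters
            opt.step()                     # lock-free, Hogwild-style update
            loss_acc = 0.0
        if t % I_target == 0:              # slowly changing target network
            target_net.load_state_dict(q_net.state_dict())
```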
Asynchronous 1-step Sarsa
• Same as 1-step Q-learning, but uses a different target value (see the sketch below):
target = r + γ Q(s′, a′; Θ⁻), where a′ is the action the policy actually selects in s′
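Under the same tensor conventions as before, a sketch of the Sarsa target: it bootstraps from the next action a_next that the policy actually selected, rather than from the greedy max over actions:

```python
import torch

def sarsa_target(target_net, r, s_next, a_next, done, gamma=0.99):
    with torch.no_grad():
        # Q(s′, a′; Θ⁻) for the action a′ the policy actually took in s′
        q_next = target_net(s_next).gather(1, a_next.unsqueeze(1)).squeeze(1)
    return r + gamma * (1.0 - done) * q_next   # on-policy bootstrap value
```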
Asynchronous n-step Q-learning
• A potentially faster way to propagate rewards.
• Uses a 'forward view' – selects actions using its policy for up to n steps into the future.
• Receives up to t_max rewards since the last update.
• Total accumulated return: R_t = r_t + γ r_{t+1} + … + γ^{n−1} r_{t+n−1} + γⁿ max_a Q(s_{t+n}, a; Θ⁻)
• The value function is updated after every t_max actions or after a terminal state.
• Each update uses the longest possible n-step return (see the sketch below).
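A sketch of the n-step backup over one segment of up to t_max steps: walking the rewards backwards, every visited state receives the longest n-step return available, bootstrapped from the target network unless the segment ended at a terminal state (s_last is assumed to be a batch of one state):

```python
import torch

def n_step_targets(rewards, s_last, done, target_net, gamma=0.99):
    with torch.no_grad():
        R = 0.0 if done else target_net(s_last).max(dim=1).values.item()
    targets = []
    for r in reversed(rewards):   # R_t = r_t + γ R_{t+1}
        R = r + gamma * R
        targets.append(R)
    targets.reverse()             # one target per visited state, oldest first
    return targets
```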
Asynchronous advantage actor-critic
• On-policy method – maintains a policy and an estimated value function.
• Uses the 'forward view'.
• Receives up to t_max rewards since the last update.
• The policy and value functions are updated after every t_max actions or after a terminal state.
• Each update uses the longest possible n-step return (see the sketch below).
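A sketch of one A3C update over a segment, assuming model(states) returns (action_logits, state_values) and returns holds the n-step returns computed as on the previous slide; the 0.5 value-loss weight is an illustrative assumption, and the entropy bonus the paper adds is omitted:

```python
import torch.nn.functional as F

def a3c_loss(model, states, actions, returns):
    logits, values = model(states)
    log_probs = F.log_softmax(logits, dim=1)
    log_pi_a = log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)
    advantage = returns - values.squeeze(1)                 # R_t − V(s_t)
    policy_loss = -(log_pi_a * advantage.detach()).mean()   # actor: policy gradient
    value_loss = advantage.pow(2).mean()                    # critic: fit V to R_t
    return policy_loss + 0.5 * value_loss
```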
Performance evaluation
• Four different platforms:
  • Atari 2600 – various games
  • TORCS – a 3D car racing simulator
  • MuJoCo – a physics simulator for continuous motor control (A3C only)
  • Labyrinth – finding rewards in randomly generated 3D mazes (A3C only)
Atari 2600 games
• All four methods can successfully train neural network controllers.
• The asynchronous methods are mostly faster than DQN (Deep Q-Network).
• Advantage actor-critic was the best.
A3C on 57 Atari games
TORCS Car Racing Simulator
• Evaluated only with the A3C algorithm.
• The agent had to drive a race car using only raw pixels as input.
• During training, the agent was rewarded for maintaining high velocity along the center of the racetrack.
https://youtu.be/0xo1Ldx3L5Q
MuJoCo Physics Simulator
• Evaluated only with the A3C algorithm.
• Rigid-body physics with contact dynamics.
• Continuous actions.
• On all problems, A3C found good solutions in less than 24 hours of training (typically a few hours).
https://youtu.be/0xo1Ldx3L5Q
Labyrinth
• The agent was placed in a random maze and had 60 s to collect points.
  • Apples – 1 point
  • Portals – 10 points; respawned the apples and the agent in random locations
• Visual input only.
• The agent learned a reasonably good general strategy for exploring random mazes.
https://youtu.be/nMR5mjCFZCw
Scalability
• The framework scales well with the number of parallel workers.
• It even shows superlinear speedups for some methods.
Robustness and stability
• Models were trained on five games using 50 different learning rates and random initializations.
• Every game and algorithm combination had a range of learning rates for which all random initializations achieved good scores.
• Stability is indicated by virtually no scores of 0 in the regions with good learning rates.
To summarize
• Asynchronous variants of four standard reinforcement learning algorithms (1-step Q-learning, n-step Q-learning, 1-step Sarsa, A3C).
• Able to train neural network controllers on a variety of domains in a stable manner.
• Using parallel actor-learners to update a shared model stabilized the learning process (an alternative to experience replay).
• On Atari games, advantage actor-critic (A3C) surpassed the current state of the art in half the training time.
• Superlinear speedup when increasing the thread count for 1-step methods.