[Slide 1]
FeUdal Networks for Hierarchical Reinforcement Learning
by Artem Bachysnkyi, Computational Neuroscience Seminar
University of Tartu, 3 May 2017
[Slide 2]
Reinforcement learning

The basic reinforcement learning model consists of the following components (a standard formalization is sketched after the list):
• a set of environment and agent states S
• a set of actions A of the agent
• policies for transitioning from states to actions
• rules that determine the scalar immediate reward of a transition
• rules that describe what the agent observes
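These components are usually formalized as a Markov decision process; the tuple below is the textbook formulation, not taken from the slide (the observation rules extend it to a partially observable MDP):

```latex
% Standard MDP formalization of the components listed above; not from the slide.
\[
(\mathcal{S}, \mathcal{A}, P, r, \gamma), \qquad
P(s' \mid s, a), \quad r(s, a) \in \mathbb{R}, \quad \gamma \in (0, 1),
\qquad \pi(a \mid s)\ \text{the agent's policy.}
\]
```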
[Slide 3]
ATARI games
[Slide 4]
Standard approach

• use an action-repeat heuristic, where each action translates into several consecutive actions in the environment
• not applicable in non-Markovian environments that require memory
• can't learn from a weak reward signal
[Slide 5]
Feudal reinforcement learning: intuition

• levels of hierarchy within an agent communicate via explicit goals
• goals can be generated in a top-down fashion
• goal setting can be decoupled from goal achievement
[Slide 6]
Manager-Worker model

Manager:
• sets goals at a lower temporal resolution

Worker:
• operates at a higher temporal resolution
• produces primitive actions
• follows the goals, driven by an intrinsic reward
[Slide 7]
Main proposals

• a consistent, end-to-end differentiable model
• an approximate transition policy gradient update for training the Manager
• use of goals that are directional rather than absolute
• a dilated LSTM for the Manager's RNN design
[Slide 8]
FuN model description
[Slide 9]
FuN model description

h^M, h^W – internal states of the Manager and Worker RNNs
U_t – Worker's output
φ – maps g_t into w_t
π – vector of probabilities over primitive actions
s_t – latent state representation
g_t – goal vector
x_t – observation from the environment
z_t – shared intermediate representation
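The equations tying these symbols together were images on the slide; the forward pass below is reconstructed from the FeUdal Networks paper (Vezhnevets et al., 2017), so the exact notation may differ from the slide:

```latex
% FuN forward pass, reconstructed from the paper; c is the Manager's horizon,
% phi is a linear projection with no bias, superscripts M/W denote Manager/Worker.
\begin{align*}
z_t &= f^{\mathrm{percept}}(x_t) \\
s_t &= f^{\mathrm{Mspace}}(z_t) \\
h^{M}_t,\ \hat{g}_t &= f^{\mathrm{Mrnn}}(s_t, h^{M}_{t-1}), \qquad g_t = \hat{g}_t / \lVert \hat{g}_t \rVert \\
w_t &= \phi\Big( \sum_{i=t-c}^{t} g_i \Big) \\
h^{W}_t,\ U_t &= f^{\mathrm{Wrnn}}(z_t, h^{W}_{t-1}) \\
\pi_t &= \mathrm{SoftMax}(U_t\, w_t)
\end{align*}
```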
[Slide 10]
Learning

Learning steps (a minimal loop sketch follows the list):
1. receive an observation from the environment
2. select an action from a finite set
3. the environment responds with a new observation and a scalar reward
4. the process continues until the terminal state is reached
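A minimal sketch of this loop, assuming a Gym-like environment API; `env`, `agent`, and their methods are placeholders, not the paper's code:

```python
# Minimal sketch of the interaction loop described above.
# `env` and `agent` are hypothetical objects with a Gym-like step/reset interface.
def run_episode(env, agent, max_steps=10_000):
    observation = env.reset()                 # 1. receive an observation
    total_reward = 0.0
    for _ in range(max_steps):
        action = agent.act(observation)       # 2. select an action from a finite set
        observation, reward, done = env.step(action)  # 3. new observation + scalar reward
        agent.observe(observation, reward, done)      # store the transition for learning
        total_reward += reward
        if done:                              # 4. stop at the terminal state
            break
    return total_reward
```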
[Slide 11]
Learning

Bad idea:
train the feudal network end-to-end using a policy gradient algorithm operating on the actions taken by the Worker

Good idea:
independently train the Manager to predict advantageous directions in state space and intrinsically reward the Worker for following these directions
[Slide 12]
The agent's goal

Maximize the discounted return (shown on the slide as an equation image; a standard form is given below).
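The standard discounted return, as used in the FuN paper:

```latex
% Discounted return; gamma in (0,1) is the discount factor, r the environment reward.
\[
R_t = \sum_{k=0}^{\infty} \gamma^{k}\, r_{t+k+1}, \qquad \gamma \in (0, 1).
\]
```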
The agent's behaviour is defined by its action-selection policy π. FuN produces a distribution over possible actions.
[Slide 13]
Manager's update rule

The update rule was shown as an equation image; it involves:
– a value function estimate from the internal critic
– cosine similarity
– an advantage function
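Reconstructed from the FuN paper (the transition policy gradient), so notation may differ slightly from the slide:

```latex
% Manager's update: push the goal g_t toward latent-space directions with high advantage.
\begin{align*}
\nabla g_t &= A^{M}_t\, \nabla_\theta\, d_{\cos}\!\big(s_{t+c} - s_t,\; g_t(\theta)\big) \\
A^{M}_t &= R_t - V^{M}_t(x_t, \theta) \quad \text{(advantage from the internal critic)} \\
d_{\cos}(\alpha, \beta) &= \alpha^{\top}\beta \,/\, \big(\lVert\alpha\rVert\,\lVert\beta\rVert\big) \quad \text{(cosine similarity)}
\end{align*}
```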
[Slide 14]
Worker's intrinsic reward

The reward was shown as an equation image, where c denotes the horizon.
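Reconstructed from the FuN paper: the intrinsic reward measures how well the last c transitions in latent space followed the Manager's goal directions:

```latex
% Worker's intrinsic reward over the horizon c.
\[
r^{I}_t = \frac{1}{c} \sum_{i=1}^{c} d_{\cos}\!\big(s_t - s_{t-i},\; g_{t-i}\big)
\]
```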
[Slide 15]
The Worker's policy

The Worker is trained with advantage actor-critic; the update and the advantage function were shown as equation images.
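Reconstructed from the FuN paper: the intrinsic return is mixed into the Worker's advantage with a coefficient α, so notation may differ slightly from the slide:

```latex
% Worker's advantage actor-critic update; V^D is the Worker's internal critic,
% alpha weighs the intrinsic return R^I against the environment return R.
\begin{align*}
\nabla \pi_t &= A^{D}_t\, \nabla_\theta \log \pi(a_t \mid x_t; \theta) \\
A^{D}_t &= R_t + \alpha R^{I}_t - V^{D}_t(x_t; \theta)
\end{align*}
```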
[Slide 16]
Architecture details

f^percept – convolutional neural network (a code sketch follows the list):
1. 16 8x8 filters with stride 4
2. 32 4x4 filters with stride 2
3. a fully connected layer with 256 hidden units
* each layer is followed by a rectified non-linearity

f^Mspace – another fully connected layer
f^Wrnn – standard LSTM
f^Mrnn – dilated LSTM
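A sketch of f^percept with the layer sizes above, assuming a PyTorch implementation and the usual 4-frame 84x84 Atari input; the framework and input-shape choices are assumptions, not from the slide:

```python
import torch
import torch.nn as nn

# Sketch of f_percept as described on the slide (PyTorch and a 4x84x84 Atari
# input are assumptions; the layer sizes follow the slide).
class Percept(nn.Module):
    def __init__(self, in_channels: int = 4, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, 16, kernel_size=8, stride=4),  # 16 8x8 filters, stride 4
            nn.ReLU(),                                            # rectified non-linearity
            nn.Conv2d(16, 32, kernel_size=4, stride=2),           # 32 4x4 filters, stride 2
            nn.ReLU(),
            nn.Flatten(),
            nn.LazyLinear(hidden),                                # fully connected, 256 units
            nn.ReLU(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)  # z_t: shared intermediate representation
```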
[Slide 17]
FuN model description
[Slide 18]
Dilated LSTM

The state of the network consists of r separate groups of sub-states.

At time t, the index t mod r indicates which group of cores is updated.

At each time step only the corresponding part of the state is updated, and the output is pooled across the previous c outputs. This allows the r groups of cores inside the dLSTM to preserve their memories for long periods.

* In the experiments r = 10.
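A minimal sketch of the dilated LSTM idea, assuming PyTorch; the mean pooling over the groups' latest outputs and all sizes are assumptions (the paper pools over the previous c outputs):

```python
import torch
import torch.nn as nn

# Sketch of a dilated LSTM with r groups of sub-states: at time t only group t % r
# is advanced by the shared LSTM cell, and the output is pooled over the groups'
# latest outputs. Pooling choice and sizes are assumptions; the paper uses r = 10.
class DilatedLSTM(nn.Module):
    def __init__(self, input_size: int, hidden_size: int, r: int = 10):
        super().__init__()
        self.r = r
        self.hidden_size = hidden_size
        self.cell = nn.LSTMCell(input_size, hidden_size)

    def init_state(self, batch_size: int):
        # one (h, c) pair per group of cores
        return [(torch.zeros(batch_size, self.hidden_size),
                 torch.zeros(batch_size, self.hidden_size)) for _ in range(self.r)]

    def forward(self, x: torch.Tensor, state: list, t: int):
        idx = t % self.r                # which group of cores is updated at time t
        h, c = self.cell(x, state[idx])  # update only that group's sub-state
        state = list(state)
        state[idx] = (h, c)
        # pool the most recent output of every group so long-range memories persist
        output = torch.stack([h_i for (h_i, _) in state], dim=0).mean(dim=0)
        return output, state
```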
[Slide 19]
Experiments: ATARI
[Slide 20]
Experiments: Montezuma's Revenge
https://www.youtube.com/watch?v=_zbg9rs5QZY
[Slide 21]
Experiments: Montezuma's Revenge
[Slide 22]
Experiments: Non-match and T-maze
[Slide 23]
Experiments: Water maze
[Slide 24]
Experiments: transition policy gradient
[Slide 25]
Experiments: temporal resolution
[Slide 26]
Experiments: dilated LSTM agent baseline