Hierarchical Reinforcement Learning
Ersin Basaran
19/03/2005
Outline
Reinforcement Learning
  RL Agent
  Policy
Hierarchical Reinforcement Learning
  The Need
  Sub-Goal Detection
  State Clusters
  Border States
  Continuous State and/or Action Spaces
  Options
  Macro Q-Learning with Parallel Option Discovery
Experimental Results
Reinforcement Learning
The agent observes the state and takes an action according to the policy.
The policy is a function from the state space to the action space.
The policy can be deterministic or non-deterministic.
State and action spaces can be discrete, continuous, or hybrid.
RL Agent
No model of the environment.
The agent observes state s, takes action a, and moves to state s', observing reward r.
The agent tries to maximize the total expected reward (return); a minimal agent-loop sketch follows the diagram.
Finite state machine model: s --(a, r)--> s'
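As a concrete illustration, here is a minimal sketch of such a model-free agent loop using one-step Q-learning. The environment interface (env.reset, env.step, env.actions) and all hyperparameters are illustrative assumptions, not from the slides:

```python
import random
from collections import defaultdict

def q_learning(env, episodes=500, alpha=0.1, gamma=0.99, eps=0.1):
    """Minimal model-free agent loop (sketch). Assumes
    env.reset() -> s and env.step(a) -> (s_next, r, done)."""
    Q = defaultdict(float)  # Q[(s, a)], implicitly zero-initialised
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # epsilon-greedy: explore occasionally, otherwise exploit
            if random.random() < eps:
                a = random.choice(env.actions)
            else:
                a = max(env.actions, key=lambda a2: Q[(s, a2)])
            s_next, r, done = env.step(a)
            # one-step Q-learning backup toward r + gamma * max_a' Q(s', a')
            target = r + gamma * max(Q[(s_next, a2)] for a2 in env.actions)
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s = s_next
    return Q
```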
Policy
In a flat RL model, the policy is a map from each state to a primitive action.
Under the optimal policy, the action taken by the agent yields the highest expected return at each step.
The policy can be kept in tabular form for small state and action spaces.
Function approximators can be used for large or continuous state or action spaces (a sketch follows).
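For the function-approximation case, a minimal sketch of a linear Q-function and the greedy policy it induces; the feature map phi is an assumed placeholder:

```python
import numpy as np

def q_value(w, phi, s, a):
    """Linear Q-function approximation: Q(s, a) ~ w . phi(s, a).
    `phi(s, a)` is an assumed feature map returning a vector."""
    return float(np.dot(w, phi(s, a)))

def greedy_action(w, phi, s, actions):
    """Greedy policy induced by the approximate Q-function."""
    return max(actions, key=lambda a: q_value(w, phi, s, a))
```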
The Need for Hierarchical RL
Increased performance.
Applying RL to problems with large action and/or state spaces becomes feasible.
Detection of sub-goals helps the agent define abstract actions over the primitive actions.
Sub-goals and abstract actions can be used in different tasks in the same domain; knowledge is transferred between tasks.
The agent's policy can be translated into natural language.
Sub-goal Detection
A sub-goal can be a single state, a subset of the state space, or a constraint on the state space.
Reaching a sub-goal should help the agent reach the main goal (obtain the highest return).
Sub-goals must be discovered by the agent autonomously.
State Clusters
The states in a cluster are strongly connected to each other.
The number of state transitions between clusters is small.
The states at the two ends of a transition between two different clusters are sub-goal candidates (see the sketch after this list).
Clusters can be hierarchical: different clusters can be contained in the same cluster at a higher level.
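A minimal sketch of extracting sub-goal candidates from cluster-crossing transitions. The cluster assignment cluster_of is assumed to be produced by some graph-clustering method, which is not specified here:

```python
from collections import Counter

def subgoal_candidates(transitions, cluster_of):
    """Given observed (s, s_next) pairs and a cluster assignment
    cluster_of[s], return the states on cluster-crossing transitions."""
    crossing = Counter()
    candidates = set()
    for s, s_next in transitions:
        if cluster_of[s] != cluster_of[s_next]:
            crossing[(cluster_of[s], cluster_of[s_next])] += 1
            candidates.update((s, s_next))   # both ends are candidates
    return candidates, crossing
```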
Border States
Some actions cannot be applied in some states; these states are defined as border states.
Border states are assumed to form a transition sequence: the agent can travel along the border states by taking some actions.
Each end of this transition sequence is a candidate sub-goal, assuming the agent has sufficiently explored the environment.
Border State Detection
For discrete action and state spaces:
F(s): the set of states that can be reached from state s in one time unit
G(s): the actions that cause no state transition when applied at state s
H(s): the actions that move the agent to a different state when applied at state s
(A sketch for building these sets from experience follows.)
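A minimal sketch of building these sets from observed experience, assuming discrete states and actions and a list of (s, a, s_next) triples:

```python
from collections import defaultdict

def build_FGH(experience):
    """Build the F, G, H sets from observed (s, a, s_next) triples.
    Assumes discrete states/actions and sufficient exploration."""
    F = defaultdict(set)   # F[s]: states reachable from s in one step
    G = defaultdict(set)   # G[s]: actions that leave the state unchanged
    H = defaultdict(set)   # H[s]: actions that change the state
    for s, a, s_next in experience:
        if s_next == s:
            G[s].add(a)
        else:
            F[s].add(s_next)
            H[s].add(a)
    return F, G, H
```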
Border State Detection
Detect the longest state sequence s_0, s_1, s_2, ..., s_{k-1}, s_k which satisfies the following constraints (a search sketch follows):
  s_i ∈ F(s_{i+1}) or s_{i+1} ∈ F(s_i), for 0 ≤ i < k
  G(s_i) ∩ G(s_{i+1}) ≠ ∅, for 0 < i < k-1
  H(s_0) ∩ G(s_1) ≠ ∅
  H(s_k) ∩ G(s_{k-1}) ≠ ∅
s_0 and s_k are candidate sub-goals.
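One way to realize this search, as a greedy sketch: assume F, G, H are the dicts built in the previous sketch, read the pairwise constraint as requiring a shared blocked action between consecutive border states, and test the chain's endpoints for a free action that is blocked at the neighboring state:

```python
def find_border_sequence(F, G, H):
    """Greedy sketch: grow a chain of adjacent states sharing blocked
    actions, then test the endpoints as sub-goal candidates.
    F, G, H are the defaultdicts returned by build_FGH above."""
    def linked(a, b):
        # consecutive border states: adjacent (via F) and sharing a blocked action
        return (a in F[b] or b in F[a]) and (G[a] & G[b])

    states = [s for s in F if G[s]]          # only states with a blocked action
    best = []
    for start in states:
        chain, used = [start], {start}
        while True:
            ext = [s for s in states if s not in used and linked(chain[-1], s)]
            if not ext:
                break
            chain.append(ext[0])             # greedy: take the first extension
            used.add(ext[0])
        if len(chain) > len(best):
            best = chain
    if len(best) < 2:
        return best, []
    candidates = [s for s, nbr in ((best[0], best[1]), (best[-1], best[-2]))
                  if H[s] & G[nbr]]          # endpoint test: a free action blocked next door
    return best, candidates
```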
Border States in Continuous State and Action Spaces
The environment is assumed to be bounded.
State and action vectors can include both continuous and discrete dimensions.
The derivative of the state vector with respect to the action vector can be used.
Border state regions must have small derivatives for some action vectors.
A large change in these derivatives indicates a border state region (a finite-difference sketch follows).
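A sketch of estimating that derivative by finite differences; step(s, a) is an assumed one-step simulator, and the thresholding of "small" derivatives is left to the caller:

```python
import numpy as np

def state_derivative(step, s, a, eps=1e-3):
    """Finite-difference estimate of ds'/da at (s, a).
    `step(s, a) -> s_next` is an assumed environment simulator."""
    s = np.asarray(s, dtype=float)
    a = np.asarray(a, dtype=float)
    base = np.asarray(step(s, a), dtype=float)
    jac = np.zeros((base.size, a.size))
    for j in range(a.size):
        da = np.zeros_like(a)
        da[j] = eps
        jac[:, j] = (np.asarray(step(s, a + da), dtype=float) - base) / eps
    return jac  # small column norms suggest a blocked (border) direction
```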
Options
An option is a policy.
It can be local (defined on a subset of the state space) or global.
The option policy can use primitive actions or other options.
It is hierarchical.
Options are used to reach sub-goals (a minimal data-structure sketch follows).
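A minimal sketch of an option as a data structure, following the standard options framework; the field names are illustrative, not from the slides:

```python
from dataclasses import dataclass
from typing import Any, Callable, Set

@dataclass(eq=False)  # identity hash, so options can key Q-tables
class Option:
    """Sketch of an option (initiation set, policy, termination)."""
    initiation_set: Set[Any]              # I_o: states where the option may be invoked
    policy: Callable[[Any], Any]          # pi_o: state -> primitive action (or another option)
    termination: Callable[[Any], float]   # beta_o(s): probability of terminating in s
```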
Macro Q-Learning with Parallel Option Discovery
The agent starts with no sub-goals and no options.
It detects the sub-goals and learns the option policies and the main policy simultaneously.
Options are created and removed from the model according to the sub-goal detection algorithm.
When a possible sub-goal is detected, a new option is added to the model to hold the policy for reaching this sub-goal.
All option policies are updated in parallel.
The agent generates an internal reward when a sub-goal is reached.
Macro Q-Learning with Parallel Option Discovery
An option is defined by the tuple O = (π_o, β_o, I_o, Q_o, r_o): the option policy π_o, the termination condition β_o, the initiation set I_o, the Q-values Q_o for the option, and the internal reward signal r_o associated with the option.
The intra-option learning method is used (a sketch of the update follows).
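A sketch of the intra-option Q-learning update (Sutton, Precup, and Singh): every option whose policy agrees with the executed primitive action is updated from the same experience, which is what lets all option policies learn in parallel. The Option objects are assumed to follow the earlier sketch:

```python
def intra_option_update(Q, options, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """One intra-option Q-learning step. Q is a dict keyed by (state, option);
    `options` is a list of Option objects with .policy(s) and .termination(s).
    For learning an option's own policy, r may include its internal reward r_o."""
    v_max = max(Q.get((s_next, o2), 0.0) for o2 in options)
    for o in options:
        if o.policy(s) != a:      # option would not have chosen this action
            continue
        beta = o.termination(s_next)
        # U(s', o): continue with o, or terminate and act greedily over options
        u = (1.0 - beta) * Q.get((s_next, o), 0.0) + beta * v_max
        q = Q.get((s, o), 0.0)
        Q[(s, o)] = q + alpha * (r + gamma * u - q)
```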
Experiments
Flat RL
Hierarchical RL
Options in HRL
Questions and Suggestions!