Structured Exploration for Reinforcement Learning
Nicholas K. Jong
Department of Computer Sciences, The University of Texas at Austin
December 1, 2010 / PhD Final Defense
Nicholas K. Jong Structured Exploration for Reinforcement Learning
Outline
1 Introduction
2 Exploration and Approximation
3 Exploration and Hierarchy
4 Conclusion
Outline
1 Introduction
   The Reinforcement Learning Problem
   Reinforcement Learning Methods
   Thesis Focus
2 Exploration and Approximation
3 Exploration and Hierarchy
4 Conclusion
One Solution to Many Problems
Reality: Many tasks mean many engineering problems
Dream: Opportunity for a single general learning algorithm
Potential Payoffs
   Reduce engineering costs
   Solve problems beyond our current abilities
   Achieve solutions robust to uncertainty
One Formalism for Many Problems
Agent
   Observes state s ∈ S
   Chooses action a ∈ A
   For arbitrary S and A
[Diagram: the agent sends action a to the environment, which returns reward r and state s]
Environment
   Generates reward r ∈ R with expected value R(s, a)
   Generates next state s′ ∈ S with probability P(s, a, s′)
   Using unknown reward and transition functions R and P
Goal: Find a policy π : S → A that maximizes future rewards
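The formalism above can be sketched as a minimal interface. This is an illustrative reconstruction, not code from the talk; the class and chain domain are assumptions.

```python
import random

class MDP:
    """Minimal MDP interface matching the slide's formalism.

    The environment hides R and P; the agent only observes samples (r, s')."""

    def __init__(self, states, actions, P, R):
        self.states = states      # finite S, for illustration
        self.actions = actions    # finite A
        self._P = P               # P[s][a] -> list of (prob, s')
        self._R = R               # R[s][a] -> expected reward

    def step(self, s, a):
        """Generate reward r with mean R(s, a) and next state s' ~ P(s, a, .)."""
        r = self._R[s][a]
        probs, nexts = zip(*self._P[s][a])
        s2 = random.choices(nexts, weights=probs)[0]
        return r, s2

# A two-state chain: action 1 moves right toward the absorbing goal state 1.
mdp = MDP(states=[0, 1], actions=[0, 1],
          P={0: {0: [(1.0, 0)], 1: [(1.0, 1)]},
             1: {0: [(1.0, 1)], 1: [(1.0, 1)]}},
          R={0: {0: -1.0, 1: -1.0}, 1: {0: 0.0, 1: 0.0}})
r, s2 = mdp.step(0, 1)
```

The agent's problem is then to choose actions without ever seeing `_P` or `_R` directly.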
Example: A Resource Gathering Simulation
[Figure: unit-square grid world with resources A, B, C, D and puddle danger zones]
Simulated Robot’s Task
   Gather each of n resources
   Navigate around danger zones
n + 2 State Variables
   Boolean flag for each resource: A, B, . . .
   x and y coordinates
n + 4 Actions
   north, south, east, west change x and y
   pickupA sets flag A if near resource A, etc.
   Actions cost −1 generally but up to −40 in “puddles”
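A toy rendition of this domain's step function; the puddle geometry, step size, pickup radius, and resource location are assumptions for illustration, not the dissertation's exact values.

```python
# Illustrative sketch of the resource-gathering domain.
STEP = 0.05

def in_puddle(x, y):
    return 0.3 <= x <= 0.7 and 0.4 <= y <= 0.6  # assumed danger zone

def step(state, action, resources={'A': (0.9, 0.1)}):
    """State is (x, y, flags); movement actions shift x, y; pickup sets flags."""
    x, y, flags = state
    moves = {'north': (0, STEP), 'south': (0, -STEP),
             'east': (STEP, 0), 'west': (-STEP, 0)}
    if action in moves:
        dx, dy = moves[action]
        x = min(1.0, max(0.0, x + dx))
        y = min(1.0, max(0.0, y + dy))
    elif action.startswith('pickup'):
        name = action[len('pickup'):]
        rx, ry = resources[name]
        if abs(x - rx) < 0.1 and abs(y - ry) < 0.1:
            flags = flags | {name}
    reward = -40.0 if in_puddle(x, y) else -1.0  # costly in puddles
    return (x, y, flags), reward

state, reward = step((0.5, 0.0, frozenset()), 'east')
```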
Evaluating Policies with Value Functions
The Bellman Equation
   State value depends on policy and action values
   Long-term value equals present value plus future value.
Vπ(s) = Qπ(s, π(s))
Qπ(s, a) = R(s, a) + γ ∑s′ P(s, a, s′) Vπ(s′)
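These two equations transcribe directly into iterative policy evaluation. A sketch on a tiny two-state chain (γ and the chain are illustrative assumptions):

```python
# Iterative policy evaluation: repeatedly apply the Bellman equations
# V(s) = Q(s, pi(s)) and Q(s, a) = R(s, a) + gamma * sum_s' P(s, a, s') V(s').
gamma = 0.9
S = [0, 1]
A = [0, 1]
R = {(0, 0): -1.0, (0, 1): -1.0, (1, 0): 0.0, (1, 1): 0.0}
P = {(0, 0): {0: 1.0}, (0, 1): {1: 1.0},   # P[(s, a)] -> {s': prob}
     (1, 0): {1: 1.0}, (1, 1): {1: 1.0}}
pi = {0: 1, 1: 0}  # move right from state 0, stay at the goal

V = {s: 0.0 for s in S}
for _ in range(1000):
    Q = {(s, a): R[s, a] + gamma * sum(p * V[s2] for s2, p in P[s, a].items())
         for s in S for a in A}
    V = {s: Q[s, pi[s]] for s in S}
```

Here the fixed point is V(0) = −1 and V(1) = 0: one step of cost −1 reaches the zero-reward absorbing goal.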
In matrix form, with the policy acting as a selection operator:
Vπ = πQπ
Qπ = R + γPVπ
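Under a fixed policy these equations collapse to a linear system, Vπ = Rπ + γPπVπ, which can be solved exactly as Vπ = (I − γPπ)⁻¹Rπ. A sketch on a two-state chain (the numbers are illustrative assumptions):

```python
gamma = 0.9
# Under a fixed policy pi, R_pi is a vector and P_pi a stochastic matrix.
R_pi = [-1.0, 0.0]                  # expected reward in each state
P_pi = [[0.0, 1.0], [0.0, 1.0]]    # row s: distribution over next states

# Solve (I - gamma * P_pi) V = R_pi by Gaussian elimination.
n = len(R_pi)
A = [[(1.0 if i == j else 0.0) - gamma * P_pi[i][j] for j in range(n)]
     for i in range(n)]
b = list(R_pi)
for col in range(n):                # forward elimination
    piv = A[col][col]
    for row in range(col + 1, n):
        f = A[row][col] / piv
        for j in range(col, n):
            A[row][j] -= f * A[col][j]
        b[row] -= f * b[col]
V = [0.0] * n
for row in reversed(range(n)):      # back substitution
    V[row] = (b[row] - sum(A[row][j] * V[j]
                           for j in range(row + 1, n))) / A[row][row]
```

The direct solve gives the same answer as iterating the backup, without needing convergence in the limit.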
Example: An Optimal Value Function
[Figure: grid world with resources A, B, C, D]
Some policy achieves maximal V∗
Planning algorithms compute V∗ from R, P
But RL algorithms don’t know R and P
[Surface plots: V∗(x, y, {C}), V∗(x, y, {C,D}), and V∗(x, y, {D}), with values ranging from −70 to 0]
Standard Approach: Learn the Value Function
Temporal Difference Learning (Sutton and Barto, 1998)
   Estimate Vπ directly from data.
   Given each piece of data 〈s, a, r, s′〉:
      r + γV̂π(s′) is an estimate of Vπ(s).
      Update V̂π(s) towards this estimate.
      Improve π.
Converges to the optimal policy in the limit, given appropriate data.
In practice, converges very slowly!
Most RL research focuses on ways to compute value functions more efficiently from data.
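The TD update described above, as a minimal sketch (the step size, discount, and chain are assumed values):

```python
# TD(0): after each transition (s, a, r, s'), nudge V_hat(s) toward
# the sample estimate r + gamma * V_hat(s').
gamma, alpha = 0.9, 0.1
V_hat = {0: 0.0, 1: 0.0}

def td_update(s, r, s2):
    target = r + gamma * V_hat[s2]          # bootstrapped estimate of V(s)
    V_hat[s] += alpha * (target - V_hat[s]) # move part-way toward it

# Transitions drawn from a fixed policy on a two-state chain:
for _ in range(2000):
    td_update(0, -1.0, 1)   # state 0 -> goal, reward -1
    td_update(1, 0.0, 1)    # goal is absorbing with reward 0
```

The small step size α is exactly why convergence is slow: each sample moves the estimate only a fraction of the way.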
[Diagram: data → value function estimation → Q → action selection → π]
Scaling to Real-World Problems
Theory: Eventual convergence to optimal behavior
Practice: Too slow for interesting problems
Branches of RL Research
   Function Approximation
   Hierarchical RL
   Relational RL
   Inverse RL
   Etc.
[Venn diagram: Hierarchical Decomposition and Function Approximation as overlapping areas within Reinforcement Learning]
Exploration and Exploitation
Exploitation
   How to estimate Q∗ from data
   Focus of most RL research
Exploration
   How to gather better data
   Emphasized by model-based RL
   Focus of this thesis
[Diagram: model-based RL loop — data → model estimation → R, P → planning → Q → action selection → π → environment]
Thesis Contributions
[Venn diagram: Model-Based Exploration, Hierarchical Decomposition, and Function Approximation within Reinforcement Learning; pairwise overlaps labeled Hierarchical Exploration and Approximate Exploration; their common intersection labeled Structured Exploration]
Merging Branches of RL
   Previously studied in isolation
   Demonstration of synergies
Efficient exploration in continuous state spaces
Efficient exploration given hierarchical knowledge
Framework for combining algorithmic ideas
Publicly available implementation of final agent
Outline
1 Introduction
2 Exploration and Approximation
   Model-Based Exploration
   Generalization in Large State Spaces
   The Fitted R-MAX Algorithm
3 Exploration and Hierarchy
4 Conclusion
Model-Based Reinforcement Learning
Indirection Permits Simplicity
   R, P predict only one time step
   R, P involve only one action at a time
   Direct training data permits supervised learning
Uncertainty Guides Exploration
   Use model of known states to reach the unknown
   First polynomial-time sample-complexity bounds (Kearns and Singh, 1998; Kakade, 2003)
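The "direct training data permits supervised learning" point: each transition 〈s, a, r, s′〉 is a labeled example for R and P, so in a finite state space the maximum-likelihood model is just counts. A sketch (the transition data are made up):

```python
from collections import defaultdict

counts = defaultdict(lambda: defaultdict(int))   # (s, a) -> {s': n}
reward_sum = defaultdict(float)                  # (s, a) -> total reward seen
visits = defaultdict(int)                        # (s, a) -> visit count

def record(s, a, r, s2):
    counts[(s, a)][s2] += 1
    reward_sum[(s, a)] += r
    visits[(s, a)] += 1

def R_hat(s, a):
    """MLE of the expected reward: the sample mean."""
    return reward_sum[(s, a)] / visits[(s, a)]

def P_hat(s, a, s2):
    """MLE of the transition probability: the empirical frequency."""
    return counts[(s, a)][s2] / visits[(s, a)]

record(0, 1, -1.0, 1)
record(0, 1, -1.0, 1)
record(0, 1, -1.0, 0)
```

A planner then treats `R_hat` and `P_hat` as if they were the true model (the certainty-equivalence approach).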
Simple and Efficient Learning with R-max (Moore and Atkeson, 1993; Brafman and Tennenholtz, 2002)
Maximum-Likelihood Estimation
   Straightforward in finite state spaces
   Unreliable with small sample sizes
[Diagram: with few samples the model reports reward rmax and a deterministic transition to a fictitious state of value Vmax; with enough data it reports the MLE rewards and transition probabilities]
Small sample sizes: Use optimistic model
Given enough data: Use MLE model
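The R-max rule above, as a sketch: below a visit threshold m, report an optimistic model (reward rmax, absorbing fictitious state); past it, report the MLE. The threshold, rmax, and data here are assumed values for illustration.

```python
from collections import defaultdict

m, rmax = 3, 0.0      # known-ness threshold and optimistic reward (assumed)
visits = defaultdict(int)
reward_sum = defaultdict(float)
next_counts = defaultdict(lambda: defaultdict(int))
ABSORB = 'absorbing'  # fictitious maximal-value state

def record(s, a, r, s2):
    visits[(s, a)] += 1
    reward_sum[(s, a)] += r
    next_counts[(s, a)][s2] += 1

def model(s, a):
    """Return (R(s, a), {s': P(s, a, s')}) for the planner."""
    if visits[(s, a)] < m:                     # unknown: optimism
        return rmax, {ABSORB: 1.0}
    n = visits[(s, a)]                         # known: MLE
    return (reward_sum[(s, a)] / n,
            {s2: c / n for s2, c in next_counts[(s, a)].items()})

record(0, 1, -1.0, 1); record(0, 1, -1.0, 1); record(0, 1, -1.0, 1)
```

Planning with this model makes unknown state–actions look maximally attractive, so the greedy policy itself drives the agent to explore them.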
Challenges for Model-Based Reinforcement Learning
Computational Complexity
   MDP planning can be expensive...
   But CPU cycles are cheaper than data
Representational Complexity
   State distributions are harder to represent than scalar values...
   But simple approximations may suffice
Exhaustive Exploration
   Exploring every unknown state seems unnecessary...
   But intuitive domain knowledge can constrain exploration
Function Approximation
Problem: Exact representation of V requires a parameter for each state. Many environments have infinitely many states!
Key Idea: Represent Vπ using a small number of parameters.
Examples
   The weights of a neural network
   Coefficients of some basis functions: Vπ = ∑i wiπ φi
Generalization of values
   Changing Vπ(s) changes one or more parameters.
   Each parameter influences the value of several states.
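The basis-function example, V(s) = ∑i wi φi(s), as code. The radial basis features over a one-dimensional state are an assumption chosen to make the generalization effect visible.

```python
import math

# Linear value-function approximation: V(s) = sum_i w[i] * phi_i(s).
centers = [0.0, 0.5, 1.0]            # assumed RBF centers over s in [0, 1]

def features(s, width=0.25):
    """Gaussian bump per center: nearby states get similar feature vectors."""
    return [math.exp(-((s - c) / width) ** 2) for c in centers]

def V(s, w):
    return sum(wi * fi for wi, fi in zip(w, features(s)))

# Generalization: raising one weight raises the value of all nearby states.
w = [0.0, 1.0, 0.0]
```

With only the middle weight set, V(0.5) = 1 and V(0.45) is nearly as large, while V(0.0) stays near zero: three parameters cover a continuum of states.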
Fitted Value Iteration (Gordon, 1995)
Averagers
   Parameterize Vπ with values Vπ(X) on a finite sample X ⊂ S
   Vπ(s) is a weighted average ∑x∈X φ(s, x) Vπ(x)
Discrete Planning in Continuous State Spaces
   Approximate planning with an exact MDP
   Exact planning with an approximate MDP
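The averager idea can be sketched end-to-end: values live only on a finite grid X, V(s) is read off as a weighted average of grid values, and backups run on the grid points. The hat-function weights, grid spacing, and one-dimensional dynamics are assumptions for illustration.

```python
# Fitted value iteration with an averager: maintain values only on a
# finite grid X, and interpolate V(s) from grid values elsewhere.
gamma = 0.9
X = [i / 10 for i in range(11)]      # sample states X subset of S = [0, 1]

def phi(s, x, width=0.1):
    """Hat-function averager weight: nonzero only for nearby grid points."""
    return max(0.0, 1.0 - abs(s - x) / width)

def V(s, values):
    w = [phi(s, x) for x in X]
    total = sum(w)
    return sum(wi * v for wi, v in zip(w, values)) / total

def step_model(s):                   # assumed dynamics: drift toward 1.0
    s2 = min(1.0, s + 0.1)
    r = 0.0 if s2 >= 1.0 else -1.0
    return r, s2

values = [0.0] * len(X)
for _ in range(100):                 # exact backups on the grid points only
    new = []
    for x in X:
        r, s2 = step_model(x)
        new.append(r if s2 >= 1.0 else r + gamma * V(s2, values))
    values = new
```

Because averager weights are nonnegative and sum to one, each backup is a non-expansion, which is what makes this scheme stable where arbitrary function approximators can diverge.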
Model Approximation (Jong and Stone, 2007b)
Approximate (s, a) using instances i = 〈si, ai, ri, s′i〉
Ψ(sa, i): model averager weighting (s, a) against (si, ai)
DP(si, s′): empirical effect of applying the transition observed at i to s
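A sketch of this instance-based model: weight stored instances 〈si, ai, ri, s′i〉 against the query (s, a) with a kernel playing the role of Ψ, and build outcomes by translating each instance's observed displacement s′i − si to the query state. The one-dimensional state, Gaussian kernel, and sample data are assumptions.

```python
import math

instances = []   # stored experience: (s_i, a_i, r_i, s2_i)

def psi(s, a, inst, width=0.2):
    """Kernel weight of an instance for query (s, a); assumed Gaussian."""
    si, ai, _, _ = inst
    if ai != a:
        return 0.0                   # only average over the same action
    return math.exp(-((s - si) / width) ** 2)

def predicted_model(s, a):
    """Return (R_hat(s, a), list of (prob, s')) built from relative effects."""
    weights = [psi(s, a, i) for i in instances]
    total = sum(weights)
    r_hat = sum(w * i[2] for w, i in zip(weights, instances)) / total
    # Translate each instance's observed displacement to the query state.
    outcomes = [(w / total, s + (i[3] - i[0]))
                for w, i in zip(weights, instances) if w > 0]
    return r_hat, outcomes

instances += [(0.50, 'east', -1.0, 0.55),
              (0.52, 'east', -1.0, 0.57),
              (0.90, 'east', -1.0, 0.95)]
r_hat, outcomes = predicted_model(0.51, 'east')
```

Here every stored transition moved the state by +0.05, so every predicted outcome for the query 0.51 lands near 0.56, with probabilities given by the normalized kernel weights.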
Fitted R-MAX (Jong and Stone, 2007a)

Model approximation, R-MAX exploration, value approximation

[Architecture diagram: Data → Model Estimation (Ψ over instances; unknown state–actions routed to V_max and P_term) → R, P → Planning (value averager Φ) → Q → Action Selection → π]
An Instance of Fitted R-MAX

Model Averager
Ψ(sa, si) ∝ K(s, si) δ(a, ai)
K(s, s′) = exp(−d(s, s′)²/b²)  (Gaussian radial basis function)

R-MAX Exploration
sa is known if it has sufficient weight: Σ_{i | ai = a} K(s, si) ≥ m

Value Averager
Interpolation over a uniform grid
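The "known" test on this slide — enough kernel weight among instances of the same action — can be sketched as follows. The bandwidth b, threshold m, and toy data are illustrative assumptions.

```python
import math

def kernel(s, s2, b=0.1):
    # Gaussian radial basis function K(s, s') = exp(-d(s, s')^2 / b^2)
    return math.exp(-((s - s2) ** 2) / b ** 2)

def is_known(s, a, instances, m=1.5, b=0.1):
    """sa counts as known once the instances of action a
    contribute at least m total kernel weight at s."""
    total = sum(kernel(s, s_i, b) for (s_i, a_i) in instances if a_i == a)
    return total >= m

data = [(0.50, 'east'), (0.52, 'east'), (0.51, 'east'), (0.90, 'east')]
print(is_known(0.50, 'east', data))   # → True  (dense data near 0.50)
print(is_known(0.90, 'east', data))   # → False (a single nearby instance)
```

With a narrow bandwidth, the lone instance near 0.90 cannot make that region known on its own, so the planner keeps treating it optimistically.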
Benchmark Performance

[Plot: reward per episode over 1000 episodes for Fitted R-MAX, XAI, Least-Squares Policy Iteration, R-MAX, and Neural Fitted Q Iteration]

For n = 1 resource, almost equivalent to the benchmark domain "Puddle World"
Can compare against performance data from the NIPS RL Benchmarking Workshop (2005)
State-of-the-art algorithms implemented and tuned by other researchers
Fitted Value Functions for Puddle World

[Figure: policy and value function with 250 instances, over the unit square; values range from −70 to 0]
[Further figures: policy and value function with 500, 750, 1000, 1500, 2000, 3000, 4000, and 5000 instances]
Generalization and Exploration

Inductive Bias
Model-free: similar states have similar values
Model-based: similar states have similar dynamics

Model Generalization
Is the effect of sa known or unknown?
Less generalization leads to more exploration.

Value Generalization
How good is my policy π?
Less generalization leads to more computation.
Hierarchical Decomposition · The R-MAXQ and Fitted R-MAXQ Algorithms · The Utility of Hierarchy

Outline

1 Introduction
2 Exploration and Approximation
3 Exploration and Hierarchy
  Hierarchical Decomposition
  The R-MAXQ and Fitted R-MAXQ Algorithms
  The Utility of Hierarchy
4 Conclusion
The Appeal of Hierarchy

Realistic Problems
Many states and many actions...
But also deep structure:
  Multiple levels of abstraction
  Local dependencies

Structured Learning and Planning
Don't write all programs in assembly!
Reason above the level of primitive actions.

[Hierarchy diagram: Drive to Campus → Reach Highway → Turn Left → ...]
Hierarchy in Reinforcement Learning

[Roadmap: Reinforcement Learning branches into Model-Based Exploration, Hierarchical Decomposition, and Function Approximation; these combine into Hierarchical Exploration and Approximate Exploration, and finally into Structured Exploration]
Hierarchy in Reinforcement Learning

Options (Sutton, Precup, and Singh, 1999)
Partial policies as macros. An option o comprises:
  An initiation set I^o ⊂ S
  An option policy π^o : S → A
  A termination function T^o : S → [0, 1]

MAXQ (Dietterich, 2000)
A hierarchy of RL problems. A task o comprises:
  A set of subtasks A^o
  A set of terminal states T^o
  A goal reward function G^o : T^o → ℝ
MAXQ Value Function Decomposition

High-Level Rewards are Low-Level Values
Separate Q^o into components Q^o_a by action
Compute R^o_a = V^a recursively
Learn C^o_a := γ P^o_a V^o directly

[Diagram: inside Q^Drive, the component Q^Drive_Reach combines the subtask value V^Reach with the completion value C^Drive_Reach]

Expanding the decomposition through the hierarchy:
V^Drive(s) = V^Reach(s) + C^Drive_Reach(s)
           = V^Turn(s) + C^Reach_Turn(s) + C^Drive_Reach(s)
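The decomposition bottoms out in the recursion Q^o(s, a) = V^a(s) + C^o_a(s). A minimal sketch at a fixed state, using the hierarchy from the slide with invented completion values:

```python
# Hypothetical hierarchy: Drive -> Reach -> Turn, with Turn primitive.
# All numeric values are illustrative, evaluated at one fixed state.
V_primitive = {'Turn': -2.0}                     # value of the primitive subtask
C = {('Drive', 'Reach'): -5.0,                   # value after Reach completes, inside Drive
     ('Reach', 'Turn'): -3.0}                    # value after Turn completes, inside Reach
policy = {'Drive': 'Reach', 'Reach': 'Turn'}     # subtask each task's policy selects

def V(task):
    """MAXQ recursion: V^o = V^a + C^o_a for the chosen subtask a."""
    if task in V_primitive:
        return V_primitive[task]
    a = policy[task]
    return V(a) + C[(task, a)]

print(V('Drive'))   # V^Turn + C^Reach_Turn + C^Drive_Reach = -2 - 3 - 5 = -10.0
```

Unrolling the recursion reproduces the sum of one primitive value plus one completion value per level, exactly as in the expansion above.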
Hierarchical Model Decomposition (Jong and Stone, 2008)

High-Level Successors are Low-Level Terminals
P^Drive_Reach = Ω^Reach
Ω^o(s, s′): discounted probability that executing o in s terminates at s′
Ω^o(·, s′) is a value function!

[Diagram: inside Q^Drive, the transition model P^Drive_Reach is supplied by the termination predictor Ω^Reach]
The R-MAXQ Algorithm (Jong and Stone, 2008)

[Architecture diagram: Data → Model Estimation → R, P → Planning → Q → Action Selection → π; at composite tasks, Model Assembly concatenates subtask V^a, Ω^a into R, P, and Policy Evaluation produces V^o, Ω^o]

Primitive Tasks
Learn primitive models from data
Splice in R-MAX optimistic exploration
Result: V^a and Ω^a

Composite Tasks
Concatenate subtask V^a and Ω^a into R^o and P^o
Plan π^o using MAXQ goal rewards
Evaluate π^o without goal rewards
Result: V^o and Ω^o
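Splicing R-MAX optimism into a primitive model can be sketched in the tabular case. The visit threshold m, the value V_max, and the toy counts are illustrative assumptions; the thesis applies the same idea through the kernel-weighted averager.

```python
# R-MAX splice for a tabular primitive model: an under-visited (s, a)
# is modeled as jumping to an absorbing state worth V_max, which makes
# the planner seek it out.
V_MAX = 0.0          # optimistic value for a negative-cost task
m = 3                # visits required before (s, a) counts as known

def primitive_model(counts, reward_sums, s, a):
    """Return (estimated one-step reward, known?) for primitive a at s."""
    n = counts.get((s, a), 0)
    if n < m:
        return V_MAX, False                  # unknown: report the optimistic value
    return reward_sums[(s, a)] / n, True     # known: empirical mean reward

counts = {('s0', 'north'): 5, ('s0', 'east'): 1}
rewards = {('s0', 'north'): -5.0, ('s0', 'east'): -40.0}
print(primitive_model(counts, rewards, 's0', 'north'))  # (-1.0, True)
print(primitive_model(counts, rewards, 's0', 'east'))   # (0.0, False)
```

Because the optimism lives in the primitive models, every composite task that reuses them inherits the same drive to explore.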
The Fitted R-MAXQ Algorithm (Jong and Stone, 2009)

[Architecture diagram: data R_D, P_D feed Model Approximation (Ψ^a) and Optimistic Exploration (U^a) to give primitive V^a, Ω^a; Value Approximation (Φ^o) assembles these into task models R^o, P^o, from which Prediction and Planning (with T^o and G^o) produce V^o, Ω^o, and π^o]

Primitive Action Models
Define V^a = U V_max + (I − U) Ψ^a R_D
Define Ω^a = (I − U) Ψ^a P_D

Prediction
Solve V^o = π^o (R^o + γ P^o (I − T^o) V^o)
Solve Ω^o = π^o (P^o T^o + γ P^o (I − T^o) Ω^o)

Planning
Optimize Ṽ^o = T^o G^o + (I − T^o) π^o (R^o + γ P^o Ṽ^o)

Value Approximation
Define R^o[sa] = V^a[s]
Define P^o[sa, x] = Ω^a[s, s′] Φ^o[s′, x]
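The two Prediction equations have the same shape: with π^o already folded into R^o and P^o, each is a linear solve against I − γP^o(I − T^o). A small numeric sketch — the three-state task, rewards, and transitions are invented for illustration:

```python
import numpy as np

gamma = 0.9
R = np.array([-1.0, -1.0, 0.0])          # one-step reward under pi^o
P = np.array([[0.0, 1.0, 0.0],            # transition matrix under pi^o
              [0.0, 0.0, 1.0],
              [0.0, 0.0, 1.0]])
T = np.diag([0.0, 0.0, 1.0])              # state 2 terminates the task

# Both fixed points share the operator I - gamma * P (I - T).
A = np.eye(3) - gamma * P @ (np.eye(3) - T)
V = np.linalg.solve(A, R)                 # V = R + gamma P (I - T) V
Omega = np.linalg.solve(A, P @ T)         # Omega = P T + gamma P (I - T) Omega

print(V)            # discounted reward to termination from each state
print(Omega[:, 2])  # discounted probability of terminating at state 2
```

From state 0 the task needs two steps, so V[0] = −1 + 0.9·(−1) = −1.9 and Ω[0, 2] = 0.9²⁻¹·1 = 0.9: the "discounted termination probability" is exactly a value function, as the slides claim.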
The Software Architecture

Algorithm
1 Execute π^Root hierarchically
2 Update data: R_D and P_D
3 Propagate changes to π^Root
4 Repeat

Averagers
Φ: interpolation over a uniform grid
Ψ: radial basis functions

Optimizations
Memoization and dynamic programming
Prioritized sweeping
Sparse representations
Cover trees for online nearest neighbors
Example: A Resource Gathering Simulation

[Figure: unit-square map with resources A, B, C, D and danger zones]

Simulated Robot's Task
Gather each of n resources
Navigate around danger zones

n + 2 State Variables
A Boolean flag for each resource: A, B, ...
x and y coordinates

n + 4 Actions
north, south, east, and west change x and y
pickupA sets flag A if near resource A, etc.
Actions cost −1 generally, but up to −40 in "puddles"
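The state and action spaces above can be sketched as a minimal environment. The resource coordinates, step size, pickup radius, and uniform −1 cost (omitting the puddle penalty) are illustrative assumptions.

```python
# Minimal resource-gathering sketch: state = (x, y, flags) on the unit square.
RESOURCES = {'A': (0.2, 0.8), 'B': (0.8, 0.8)}   # n = 2 hypothetical locations
STEP, NEAR = 0.05, 0.1

def act(state, action):
    """Apply one of the n + 4 actions; returns (next_state, reward)."""
    x, y, flags = state
    if action in ('north', 'south', 'east', 'west'):
        x += {'east': STEP, 'west': -STEP}.get(action, 0.0)
        y += {'north': STEP, 'south': -STEP}.get(action, 0.0)
        x, y = min(1.0, max(0.0, x)), min(1.0, max(0.0, y))
    elif action.startswith('pickup'):             # e.g. 'pickupA'
        r = action[len('pickup'):]
        rx, ry = RESOURCES[r]
        if abs(x - rx) <= NEAR and abs(y - ry) <= NEAR:
            flags = flags | {r}                   # set the resource's flag
    return (x, y, frozenset(flags)), -1.0         # unit cost per action

s = (0.2, 0.75, frozenset())
s, _ = act(s, 'north')       # step to (0.2, 0.8)
s, _ = act(s, 'pickupA')     # within range of A, so flag A is set
print(s)
```

The Boolean flags are exactly what makes the flat state space grow as 2^n, and what the GatherA-style subtasks later factor apart.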
The Utility of Hierarchy and Model Generalization

[Plots: reward per episode over 40 episodes and cumulative reward over 100 episodes, comparing Fitted R-MAXQ, Fitted R-MAX, R-MAXQ, and R-MAX]

Model generalization allows Fitted R-MAX to outperform R-MAX.
Hierarchical decomposition allows R-MAXQ to outperform R-MAX.
These two ideas synergize in Fitted R-MAXQ!
-
Reinforcement Learning
Model-Based Exploration
Hierarchical Decomposition
Function Approximation
Hierarchical Exploration
Approximate Exploration
Structured Exploration
IntroductionExploration and Approximation
Exploration and HierarchyConclusion
Hierarchical DecompositionThe R-MAXQ and Fitted R-MAXQ AlgorithmsThe Utility of Hierarchy
The Utility of Hierarchy and Model Generalization
-4000
-3500
-3000
-2500
-2000
-1500
-1000
-500
0
0 5 10 15 20 25 30 35 40
Rew
ard
per
epis
ode
Episode
R-MAXQR-MAX
-30000
-25000
-20000
-15000
-10000
-5000
0
0 20 40 60 80 100
Cum
ulat
ive
rew
ard
Episode
R-MAXQR-MAX
Model generalization allows Fitted R-MAX to outperform R-MAX.
Hierarchical decomposition allows R-MAXQ to outperform R-MAX.
These two ideas synergize in Fitted R-MAXQ!
Nicholas K. Jong Structured Exploration for Reinforcement Learning
-
Reinforcement Learning
Model-Based Exploration
Hierarchical Decomposition
Function Approximation
Hierarchical Exploration
Approximate Exploration
Structured Exploration
IntroductionExploration and Approximation
Exploration and HierarchyConclusion
Hierarchical DecompositionThe R-MAXQ and Fitted R-MAXQ AlgorithmsThe Utility of Hierarchy
The Utility of Hierarchy and Model Generalization
[Left plot: reward per episode (-4000 to 0) over episodes 0-40 for Fitted R-MAXQ, Fitted R-MAX, R-MAXQ, and R-MAX. Right plot: cumulative reward (-30000 to 0) over episodes 0-100 for the same four learners.]
Model generalization allows Fitted R-MAX to outperform R-MAX.
Hierarchical decomposition allows R-MAXQ to outperform R-MAX.
These two ideas synergize in Fitted R-MAXQ!
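All four learners compared above share R-MAX's exploration mechanism: insufficiently visited state-action pairs are modeled optimistically, so planning steers the agent toward them. A minimal tabular sketch of that idea (the threshold `M` and the reward bookkeeping are illustrative, not the thesis implementation):

```python
from collections import defaultdict

R_MAX = 0.0  # upper bound on one-step reward (0 for these step-cost tasks)
M = 5        # visits before a state-action pair counts as "known"

class RMaxModel:
    """Tabular R-MAX style model: an unknown (s, a) pair is assigned the
    optimistic reward R_MAX, so planning drives the agent to visit it
    until enough experience makes it known."""

    def __init__(self):
        self.counts = defaultdict(int)        # (s, a) -> visit count
        self.reward_sum = defaultdict(float)  # (s, a) -> summed rewards

    def update(self, s, a, r):
        self.counts[(s, a)] += 1
        self.reward_sum[(s, a)] += r

    def known(self, s, a):
        return self.counts[(s, a)] >= M

    def reward(self, s, a):
        if not self.known(s, a):
            return R_MAX  # optimism in the face of uncertainty
        return self.reward_sum[(s, a)] / self.counts[(s, a)]
```

Fitted R-MAX generalizes the counts across nearby states; R-MAXQ applies the same test at every level of a task hierarchy.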
-
Task Hierarchies as Domain Knowledge
[Three task-hierarchy diagrams:
Flat: Root → north, south, east, west, pickupA, ...
Shallow: Root → GatherA, ... ; GatherA → north, south, east, west, pickupA
Deep: Root → GatherA, ... ; GatherA → Navigate(Location1), pickupA ; Navigate → north, south, east, west]
The flat hierarchy only knows model averagers.
The shallow hierarchy also knows that gathering each resource is independent.
The deep hierarchy also knows the set of resource locations (but must still associate each resource with a location).
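The three hierarchies diagrammed above can be written down as nested task definitions. A hypothetical encoding (the `task` helper is invented for illustration; MAXQ-style hierarchies also attach termination predicates and pseudo-rewards, omitted here):

```python
# Primitive navigation actions of the resource-gathering domain.
NAVIGATION = ["north", "south", "east", "west"]

def task(name, children):
    """A composite task: a name plus the subtasks or primitive actions
    its local policy is allowed to invoke."""
    return {"name": name, "children": children}

# Flat: Root invokes every primitive directly.
flat = task("Root", NAVIGATION + ["pickupA"])  # ... plus pickupB..pickupD

# Shallow: one self-contained Gather subtask per resource.
shallow = task("Root", [
    task("GatherA", NAVIGATION + ["pickupA"]),
    # ... GatherB, GatherC, GatherD
])

# Deep: Gather subtasks share a Navigate subtask over known locations.
deep = task("Root", [
    task("GatherA", [
        task("Navigate", NAVIGATION),  # parameterized by Location1, ...
        "pickupA",
    ]),
    # ... GatherB, GatherC, GatherD
])
```

Each level of nesting encodes one more piece of domain knowledge: the shallow tree separates the resources; the deep tree additionally fixes the navigation targets.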
-
Soft Inductive Bias in Hierarchy
[Map of the unit square marking resource locations A-D, the agent's state s, and an unexplored puddle state x, shown with the shallow hierarchy: Root → GatherA, ... ; GatherA → north, south, east, west, pickupA]
From s, explore the unknown state in the puddle or exploit the known solution?
Flat hierarchy: optimism about the unknown effects of pickupD at x outweighs the value of the known solution, V_explore(s) > V_exploit(s).
Shallow hierarchy: the value of pickupD at x is less than the value of the known solution in the context of GatherD, V_explore_GatherD(s) < V_exploit_GatherD(s).
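The reversal comes from horizons: in the flat task the optimistic bonus at the unknown state accrues over an unbounded horizon, while inside a terminating subtask like GatherD it is capped at the subtask's end. A toy calculation of this effect (the step counts and discount are made-up numbers, not the thesis experiment):

```python
GAMMA = 0.95  # discount factor (illustrative)

def discounted_cost(steps):
    """Discounted return of `steps` steps at reward -1 each."""
    return -(1 - GAMMA**steps) / (1 - GAMMA)

# Flat task: detour 10 steps to the unknown state x, where the optimistic
# model promises the best possible reward (0 here) forever after.
explore_flat = discounted_cost(10)
# Known solution: finish the whole task in 20 more steps.
exploit_flat = discounted_cost(20)
assert explore_flat > exploit_flat  # flat agent prefers to explore

# Inside GatherD the subtask terminates at pickup, so optimism is capped
# at the subtask's end: the detour simply costs 5 extra steps.
explore_gather = discounted_cost(10)
exploit_gather = discounted_cost(5)
assert explore_gather < exploit_gather  # subtask keeps the known solution
```

The bias is soft: the subtask still explores x when no cheaper route to its goal is known, but it stops paying for exploration that cannot help it terminate.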
-
Hierarchical Constraint and Reformulation
Hierarchies can find embedded structure.
Hierarchies can constrain policies and therefore exploration.
Deep hierarchy: pickup actions are only possible before or after Navigate tasks.
[Deep hierarchy diagram: Root → GatherA, ... ; GatherA → Navigate(Location1), pickupA ; Navigate → north, south, east, west, shown with a map of the unit square marking resource locations A-D]
-
university-logo
Future Work / Summary
Outline
1 Introduction
2 Exploration and Approximation
3 Exploration and Hierarchy
4 Conclusion: Future Work, Summary
-
university-logo
Discovering Abstractions
[Example task hierarchy: Drive to Campus → Reach Highway → Turn Left → ...]
We can now use task hierarchies to efficiently explore continuous environments. Can we discover composite tasks automatically?
What makes a good subtask?
Other research: "bottleneck states"
My conjecture: "sets of relevant features" (Jong and Stone, 2005)
-
university-logo
Prior Distributions Over Inductive Biases
No Free Lunch: No algorithm can learn or discover efficiently in all possible worlds!
Bayesian Reinforcement Learning:
Begin with a prior distribution over environments
Plan over "belief states"
Update the belief distribution given data
Key question: What is the right prior distribution?
Conjecture: Distributions over task hierarchies
Goal: Efficient approximation of the optimal Bayesian solution
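For tabular dynamics the belief update above has a simple conjugate form: a Dirichlet prior over each row of the transition function stays Dirichlet after counting observed transitions. A minimal sketch (the uniform prior of one pseudo-count and the tiny state space are illustrative choices):

```python
from collections import defaultdict

N_STATES = 3  # toy state space

class DirichletBelief:
    """Belief over each transition distribution P(s, a, .) as a Dirichlet:
    observing a transition adds one pseudo-count, which is exactly the
    conjugate posterior update."""

    def __init__(self, prior=1.0):
        self.alpha = defaultdict(lambda: [prior] * N_STATES)

    def update(self, s, a, s_next):
        self.alpha[(s, a)][s_next] += 1.0

    def mean(self, s, a):
        """Posterior mean transition distribution for (s, a)."""
        row = self.alpha[(s, a)]
        total = sum(row)
        return [count / total for count in row]
```

The open question on the slide is what plays the role of this prior when the hypothesis space is task hierarchies rather than tabular models.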
-
university-logo
Natural Knowledge Representations
Model-free methods learn a monolithic value function.
Models are a natural form of domain knowledge.
Models are modular: piecewise independent.
Don't Reinvent the Wheel:
Exploit a known reward function
Exploit the known dynamics of some actions
Exploit the known dynamics of some state variables
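Modularity makes this concrete: a model can mix hand-specified dynamics for some state variables with learned dynamics for the rest, so only the unknown pieces need data. A hypothetical sketch (the clock variable and mean-shift learner are invented for illustration):

```python
def known_clock(state):
    """Known dynamics for one state variable: time always advances."""
    return state["t"] + 1

class LearnedShift:
    """Learned dynamics for another variable: predict by the empirical
    mean change observed so far."""

    def __init__(self):
        self.changes = []

    def update(self, before, after):
        self.changes.append(after - before)

    def predict(self, x):
        if not self.changes:
            return x
        return x + sum(self.changes) / len(self.changes)

def predict_next(state, learned):
    """Compose the known and learned components into one model."""
    return {"t": known_clock(state), "x": learned.predict(state["x"])}
```

A monolithic value function offers no such seam: prior knowledge about one action or variable cannot be plugged in without retraining everything.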
-
university-logo
Connections to Other Fields of Arti