Structured Exploration for Reinforcement Learning
Nicholas K. Jong, Department of Computer Sciences, The University of Texas at Austin
December 1, 2010 / PhD Final Defense


TRANSCRIPT

  • Structured Exploration for Reinforcement Learning

    Nicholas K. Jong
    Department of Computer Sciences, The University of Texas at Austin
    December 1, 2010 / PhD Final Defense

  • Outline

    1 Introduction
    2 Exploration and Approximation
    3 Exploration and Hierarchy
    4 Conclusion

  • Outline

    1 Introduction
      The Reinforcement Learning Problem
      Reinforcement Learning Methods
      Thesis Focus
    2 Exploration and Approximation
    3 Exploration and Hierarchy
    4 Conclusion

  • One Solution to Many Problems

    Reality: Many tasks mean many engineering problems
    Dream: Opportunity for a single general learning algorithm

    Potential Payoffs
      Reduce engineering costs
      Solve problems beyond our current abilities
      Achieve solutions robust to uncertainty

  • One Formalism for Many Problems

    Agent
      Observes state s ∈ S
      Chooses action a ∈ A
      For arbitrary S and A

    Environment
      Generates reward r ∈ ℝ with expected value R(s, a)
      Generates next state s′ ∈ S with probability P(s, a, s′)
      Using unknown reward and transition functions R and P

    [Diagram: the agent sends action a to the environment; the environment returns reward r and next state s]

    Goal: Find a policy π : S → A that maximizes future rewards
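
    A minimal sketch of the interaction loop described above, in Python. The Environment interface (reset/step returning r, s′, done) and the policy callable are illustrative assumptions, not code from the thesis.

      def run_episode(env, policy, max_steps=100):
          """Roll out one episode: the agent observes s, chooses a = policy(s),
          and the environment returns a reward r and a next state s'."""
          total_reward = 0.0
          s = env.reset()                      # initial state s in S
          for _ in range(max_steps):
              a = policy(s)                    # agent chooses a in A
              r, s_next, done = env.step(a)    # environment samples r ~ R(s,a), s' ~ P(s,a,.)
              total_reward += r
              s = s_next
              if done:
                  break
          return total_reward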

  • Example: A Resource Gathering Simulation

    [Figure: unit-square map with resources A, B, C, D and danger zones]

    Simulated Robot's Task
      Gather each of n resources
      Navigate around danger zones

    n + 2 State Variables
      Boolean flag for each resource: A, B, ...
      x and y coordinates

    n + 4 Actions
      north, south, east, west change x and y
      pickupA sets flag A if near resource A, etc.
      Actions cost −1 generally but up to −40 in "puddles"

  • Evaluating Policies with Value Functions

    The Bellman Equation
      State value depends on policy and action values
      Long-term value equals present value plus future value.

      Vπ(s) = Qπ(s, π(s))
      Qπ(s, a) = R(s, a) + γ Σs′ P(s, a, s′) Vπ(s′)

    In vector form:

      Vπ = π Qπ
      Qπ = R + γ P Vπ

    [Diagram: Qπ is assembled from the reward function R and the transition function P composed with Vπ]
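
    For a finite state space with known R and P, the Bellman equation above yields a simple iterative policy-evaluation procedure. A sketch, assuming R[s][a] is a number and P[s][a] maps next states to probabilities (illustrative data structures, not from the thesis):

      def evaluate_policy(states, policy, R, P, gamma=0.99, tol=1e-8):
          """Iterate V(s) <- R(s, pi(s)) + gamma * sum_{s'} P(s, pi(s), s') V(s')
          until convergence (the Bellman equation for a fixed policy)."""
          V = {s: 0.0 for s in states}
          while True:
              delta = 0.0
              for s in states:
                  a = policy[s]
                  new_v = R[s][a] + gamma * sum(p * V[s2] for s2, p in P[s][a].items())
                  delta = max(delta, abs(new_v - V[s]))
                  V[s] = new_v
              if delta < tol:
                  return V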

  • Example: An Optimal Value Function

    Some policy achieves the maximal value function V∗
    Planning algorithms compute V∗ from R, P
    But RL algorithms don't know R and P

    [Figure: the resource-gathering map with A, B, C, D, alongside surface plots of V∗(x, y, {C}), V∗(x, y, {C, D}), and V∗(x, y, {D}) over the unit square; values range from roughly −70 to 0]
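
    "Planning algorithms compute V∗ from R, P" refers to dynamic-programming methods such as value iteration. A sketch under the same illustrative R/P dictionaries as above; the slide does not commit to a particular planner.

      def value_iteration(states, actions, R, P, gamma=0.99, tol=1e-8):
          """Compute V*(s) = max_a [ R(s,a) + gamma * sum_{s'} P(s,a,s') V*(s') ]."""
          V = {s: 0.0 for s in states}
          while True:
              delta = 0.0
              for s in states:
                  best = max(R[s][a] + gamma * sum(p * V[s2] for s2, p in P[s][a].items())
                             for a in actions)
                  delta = max(delta, abs(best - V[s]))
                  V[s] = best
              if delta < tol:
                  break
          # Greedy policy with respect to V*
          policy = {s: max(actions, key=lambda a: R[s][a] +
                           gamma * sum(p * V[s2] for s2, p in P[s][a].items()))
                    for s in states}
          return V, policy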

  • Standard Approach: Learn the Value Function

    Temporal Difference Learning (Sutton and Barto, 1998)
      Estimate Vπ directly from data.
      Given each piece of data ⟨s, a, r, s′⟩:
        r + γ V̂π(s′) is an estimate of Vπ(s).
        Update V̂π(s) towards this estimate.
        Improve π.

    Converges to the optimal policy in the limit, given appropriate data.
    In practice, converges very slowly!

    Most RL research focuses on ways to compute value functions more efficiently from data.

    [Diagram: data feeds value function estimation, which produces Q and π for action selection]
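
    The update described on this slide is, concretely, a TD(0) step. A minimal sketch assuming a tabular estimate stored in a dict; the learning rate alpha is an assumption, not a quantity from the slide.

      def td0_update(V_hat, s, r, s_next, gamma=0.99, alpha=0.1):
          """One TD(0) step: move V_hat(s) toward the sample target r + gamma * V_hat(s')."""
          v_s = V_hat.get(s, 0.0)
          target = r + gamma * V_hat.get(s_next, 0.0)
          V_hat[s] = v_s + alpha * (target - v_s)
          return V_hat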

  • Scaling to Real-World Problems

    Theory: Eventual convergence to optimal behavior
    Practice: Too slow for interesting problems

    Branches of RL Research
      Function Approximation
      Hierarchical RL
      Relational RL
      Inverse RL
      Etc.

    [Diagram: Reinforcement Learning overlapping with Hierarchical Decomposition and Function Approximation]

  • Exploration and Exploitation

    Exploitation
      How to estimate Q∗ from data
      Focus of most RL research

    Exploration
      How to gather better data
      Emphasized by model-based RL
      Focus of this thesis

    [Diagram: the model-based loop, with data feeding model estimation (R, P), planning producing Q and π, and action selection interacting with the environment]

  • Thesis Contributions

    Merging Branches of RL
      Previously studied in isolation
      Demonstration of synergies

    Efficient exploration in continuous state spaces
    Efficient exploration given hierarchical knowledge
    Framework for combining algorithmic ideas
    Publicly available implementation of final agent

    [Diagram: Reinforcement Learning combined with Model-Based Exploration, Hierarchical Decomposition, and Function Approximation, yielding Approximate Exploration, Hierarchical Exploration, and, together, Structured Exploration]

  • Outline

    1 Introduction
    2 Exploration and Approximation
      Model-Based Exploration
      Generalization in Large State Spaces
      The Fitted R-MAX Algorithm
    3 Exploration and Hierarchy
    4 Conclusion

  • Model-Based Reinforcement Learning

    [Diagram: data feeds model estimation (R, P); planning produces Q and π for action selection]

    Indirection Permits Simplicity
      R, P predict only one time step
      R, P involve only one action at a time
      Direct training data permits supervised learning

    Uncertainty Guides Exploration
      Use model of known states to reach the unknown
      First polynomial-time sample-complexity bounds (Kearns and Singh, 1998; Kakade, 2003)
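
    The "model estimation" box above amounts to supervised estimation of R and P from one-step data. A sketch of maximum-likelihood estimation from transition counts, assuming hashable states and actions; MLEModel is an illustrative name, not the thesis code.

      from collections import defaultdict

      class MLEModel:
          """Maximum-likelihood estimates of R(s,a) and P(s,a,s') from one-step data."""
          def __init__(self):
              self.counts = defaultdict(lambda: defaultdict(int))   # (s,a) -> {s': n}
              self.reward_sum = defaultdict(float)                   # (s,a) -> total reward
              self.visits = defaultdict(int)                         # (s,a) -> n

          def update(self, s, a, r, s_next):
              self.counts[(s, a)][s_next] += 1
              self.reward_sum[(s, a)] += r
              self.visits[(s, a)] += 1

          def R(self, s, a):
              n = self.visits[(s, a)]
              return self.reward_sum[(s, a)] / n if n else 0.0

          def P(self, s, a):
              n = self.visits[(s, a)]
              return {s2: c / n for s2, c in self.counts[(s, a)].items()} if n else {}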

  • Simple and Efficient Learning with R-max
    (Moore and Atkeson, 1993; Brafman and Tennenholtz, 2002)

    Maximum-Likelihood Estimation
      Straightforward in finite state spaces
      Unreliable with small sample sizes

    Small sample sizes: use an optimistic model
      An unknown state-action is assigned reward rmax and routed to a state of value Vmax
    Given enough data: use the MLE model

    [Diagrams: example R and P estimates for one state-action, contrasting the optimistic model (reward rmax, all probability on the maximal-value outcome) with the maximum-likelihood model built from the observed transitions]
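
    A minimal sketch of the two branches described above, reusing the MLEModel sketch from the model-based slide. The threshold m, the reward bound r_max, and the fictitious absorbing state are the standard R-max ingredients; the names here are illustrative.

      RMAX_STATE = "__absorbing__"   # fictitious state whose value is fixed at Vmax

      def rmax_model(mle, s, a, m, r_max):
          """Return (R(s,a), P(s,a,.)) for planning.

          Known (at least m samples): use the maximum-likelihood estimates.
          Unknown: optimistically assume reward r_max and a transition to an
          absorbing state whose value the planner fixes at Vmax,
          e.g. r_max / (1 - gamma)."""
          if mle.visits[(s, a)] >= m:
              return mle.R(s, a), mle.P(s, a)
          return r_max, {RMAX_STATE: 1.0}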

  • Challenges for Model-Based Reinforcement Learning

    Computational Complexity
      MDP planning can be expensive...
      But CPU cycles are cheaper than data

    Representational Complexity
      State distributions are harder to represent than scalar values...
      But simple approximations may suffice

    Exhaustive Exploration
      Exploring every unknown state seems unnecessary...
      But intuitive domain knowledge can constrain exploration

  • Function Approximation

    Problem: Exact representation of V requires a parameter for each state. Many environments have infinitely many states!

    Key Idea
      Represent Vπ using a small number of parameters.

    Examples
      The weights of a neural network
      Coefficients of some basis functions: Vπ = Σi wπi φi

    Generalization of values
      Changing Vπ(s) changes one or more parameters.
      Each parameter influences the value of several states.
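
    The basis-function representation Vπ = Σi wπi φi is ordinary linear function approximation. A tiny sketch, with the features φi left abstract as callables:

      def v_approx(weights, features, s):
          """Linear value estimate V(s) = sum_i w_i * phi_i(s)."""
          return sum(w * phi(s) for w, phi in zip(weights, features))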

  • Fitted Value Iteration
    (Gordon, 1995)

    Averagers
      Parameterize Vπ with values Vπ(X) on a finite sample X ⊂ S
      Vπ(s) is a weighted average: Vπ(s) = Σx∈X φ(s, x) Vπ(x)

    Discrete Planning in Continuous State Spaces
      Approximate planning with an exact MDP
      Exact planning with an approximate MDP

    [Diagram: the Bellman backup through R and P, composed with the averager Φ over the sample points x]
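
    A sketch of fitted value iteration with an averager: values are stored only on the sample X and read off elsewhere through the weights φ(s, x). R, P_sample, and phi below are illustrative callables; Gordon's averager condition (non-negative weights summing to one) is what keeps the backup a contraction.

      def fitted_value_iteration(X, actions, R, P_sample, phi, gamma=0.99, sweeps=200):
          """Fitted VI with an averager (Gordon, 1995).

          P_sample(x, a) -> list of (s_next, prob); phi(s, x) -> averager weight."""
          V = {x: 0.0 for x in X}

          def value(s):
              return sum(phi(s, x) * V[x] for x in X)   # weights are >= 0 and sum to 1

          for _ in range(sweeps):
              V = {x: max(R(x, a) + gamma * sum(p * value(s2) for s2, p in P_sample(x, a))
                          for a in actions)
                   for x in X}
          return V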

  • Model Approximation
    (Jong and Stone, 2007b)

    Approximate sa using instances i = ⟨si, ai, ri, s′i⟩
      Ψ(sa, i): model averager weighting sa against si ai
      DP(si, s′): empirical effect of applying the transition observed at i to s

    [Diagram: the predicted transition for sa is a Ψ-weighted mixture (e.g., weights 0.44, 0.38, 0.18) of the relative transitions observed at nearby instances i1, i2, i3]
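
    One way to read the Ψ/DP construction: the predicted outcome for sa is a kernel-weighted mixture of "apply the relative effect observed at instance i to s". A sketch for continuous vector states, assuming a simple vector-difference effect model (s′ ≈ s + (s′i − si)); predict_transition is a hypothetical name, and the thesis's actual weighting appears on the "An Instance of Fitted R-MAX" slide below.

      def predict_transition(s, a, instances, kernel):
          """Approximate the model at sa from instances i = (s_i, a_i, r_i, s'_i):
          instances with a_i == a get weight Psi proportional to kernel(s, s_i)
          and predict the relative effect s + (s'_i - s_i)."""
          weighted = [(kernel(s, si), tuple(x + (y - z) for x, y, z in zip(s, s2i, si)), ri)
                      for (si, ai, ri, s2i) in instances if ai == a]
          total = sum(w for w, _, _ in weighted)
          if total == 0.0:
              return None                                            # no relevant data: leave sa unknown
          outcomes = [(w / total, s_pred) for w, s_pred, _ in weighted]   # Psi-normalized mixture
          reward = sum(w / total * ri for w, _, ri in weighted)           # Psi-weighted reward estimate
          return reward, outcomes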

  • Fitted R-MAX
    (Jong and Stone, 2007a)

    Model approximation, R-MAX exploration, value approximation

    [Diagram: the model-based loop with model estimation expanded; R and P are built from the instance averager Ψ composed with the empirical effects DR and DP, state-actions flagged as unknown receive Vmax through a terminal transition Pterm, and planning passes predicted next states through the value averager Φ over the sample states]

  • An Instance of Fitted R-MAX

    Model Averager
      ψ(sa, si ai) ∝ K(s, si) δ(a, ai)
      K(s, s′) = exp(−d(s, s′)² / b²), a radial basis function

    R-MAX Exploration
      sa known if it has sufficient weight: Σ{i | ai = a} K(s, si) ≥ m

    Value Averager
      Interpolation over a uniform grid
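
    The averager and known-state test on this slide translate directly. A sketch assuming Euclidean distance for d, with the bandwidth b and the threshold m left as free parameters:

      import math

      def rbf_kernel(s, s2, b):
          """K(s, s') = exp(-d(s, s')^2 / b^2), with d taken to be Euclidean distance."""
          d2 = sum((x - y) ** 2 for x, y in zip(s, s2))
          return math.exp(-d2 / (b * b))

      def is_known(s, a, instances, b, m):
          """A state-action sa counts as known once the kernel weight of the
          instances with matching action reaches the threshold m."""
          return sum(rbf_kernel(s, si, b) for (si, ai, _, _) in instances if ai == a) >= m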

  • Benchmark Performance

    [Figure: reward per episode over 1000 episodes, comparing Fitted R-MAX, XAI, Least Squares Policy Iteration, R-MAX, and Neural Fitted Q Iteration]

    For n = 1 resource, the task is almost equivalent to the benchmark domain "Puddleworld"
    Can compare against performance data from the NIPS RL Benchmarking Workshop (2005)
    State-of-the-art algorithms implemented and tuned by other researchers

  • Fitted Value Functions for PuddleWorld

    [Figures: learned policies and value functions over the unit square with 250, 500, 750, 1000, 1500, 2000, 3000, 4000, and 5000 instances; value scales range from roughly −70 to 0]

  • Generalization and Exploration

    Inductive Bias
      Model-Free: Similar states have similar values
      Model-Based: Similar states have similar dynamics

    Model Generalization
      Is the effect of sa known or unknown?
      Less generalization leads to more exploration

    Value Generalization
      How good is my policy π?
      Less generalization leads to more computation

  • Outline

    1 Introduction
    2 Exploration and Approximation
    3 Exploration and Hierarchy
      Hierarchical Decomposition
      The R-MAXQ and Fitted R-MAXQ Algorithms
      The Utility of Hierarchy
    4 Conclusion

  • The Appeal of Hierarchy

    Realistic Problems
      Many states and many actions...
      But also deep structure
      Multiple levels of abstraction
      Local dependencies

    Structured Learning and Planning
      Don't write all programs in assembly!
      Reason above the level of primitive actions.

    [Diagram: task hierarchy, with Drive to Campus decomposing into Reach Highway and other subtasks, and Reach Highway decomposing into Turn Left and others]

  • Hierarchy in Reinforcement Learning

    Options (Sutton, Precup, and Singh, 1999)
      Partial policies as macros
      An option o comprises:
        An initiation set I^o ⊂ S
        An option policy π^o : S → A
        A termination function T^o : S → [0, 1]

    MAXQ (Dietterich, 2000)
      A hierarchy of RL problems
      A task o comprises:
        A set of subtasks A^o
        A set of terminal states T^o
        A goal reward function G^o : T^o → ℝ

    [Diagram: task hierarchy, Drive to Campus, Reach Highway, Turn Left, ...]
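
    The two formalisms on this slide are different bundles of components. A minimal sketch of each as a Python dataclass; all field types are chosen for illustration only.

      from dataclasses import dataclass
      from typing import Any, Callable, List, Set

      State, Action = Any, Any

      @dataclass
      class Option:
          """Sutton, Precup, and Singh (1999): a partial policy used as a macro-action."""
          initiation_set: Set[State]                 # I^o, subset of S
          policy: Callable[[State], Action]          # pi^o : S -> A
          termination: Callable[[State], float]      # T^o : S -> [0, 1]

      @dataclass
      class MaxQTask:
          """Dietterich (2000): one node in a hierarchy of RL problems."""
          subtasks: List["MaxQTask"]                 # A^o (child tasks or primitive actions)
          terminal_states: Set[State]                # T^o
          goal_reward: Callable[[State], float]      # G^o : T^o -> R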

  • Reinforcement Learning

    Model-Based Exploration

    Hierarchical Decomposition

    Function Approximation

    Hierarchical Exploration

    Approximate Exploration

    Structured Exploration

    IntroductionExploration and Approximation

    Exploration and HierarchyConclusion

    Hierarchical DecompositionThe R-MAXQ and Fitted R-MAXQ AlgorithmsThe Utility of Hierarchy

    MAXQ Value Function Decomposition

    High-Level Rewards are Low-Level Values

    Separate Qo into components Qoa by actionCompute Roa = V o recursivelyLearn Coa := γPoa V o directly

    [Diagram: Q^Drive(s, a) is computed through a switch on the action a; the branch for a = Reach Highway combines R^Drive_Reach = V^Reach with the completion term C^Drive_Reach = γ P^Drive_Reach V^Drive. The running hierarchy is Drive to Campus → Reach Highway → Turn Left.]

    Expanding the decomposition down the hierarchy:
    V^Drive(s) = Q^Drive_Reach(s)
               = V^Reach(s) + C^Drive_Reach(s)
               = Q^Reach_Turn(s) + C^Drive_Reach(s)
               = V^Turn(s) + C^Reach_Turn(s) + C^Drive_Reach(s)
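    A minimal sketch of how this decomposition can be evaluated, assuming the learned components V^a and C^o_a are stored in plain dictionaries keyed by task and state (hypothetical names, not the thesis code):

    def decomposed_q(V, C, task, child, state):
        """Q^o_a(s) = V^a(s) + C^o_a(s): the value of completing child a, plus the
        completion value of finishing task o afterwards."""
        return V[child][state] + C[(task, child)][state]

    def decomposed_v(V, C, policy, primitives, task, state):
        """Expand V^o(s) recursively down the hierarchy along each task's chosen child."""
        if task in primitives:
            return V[task][state]            # for a primitive action, V^a(s) is its expected reward
        child = policy[task][state]          # a = π^o(s)
        return decomposed_v(V, C, policy, primitives, child, state) + C[(task, child)][state]

    # For the driving example this expansion gives
    # V["Drive"][s] = V["Turn"][s] + C[("Reach", "Turn")][s] + C[("Drive", "Reach")][s].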



  • Hierarchical Model Decomposition (Jong and Stone, 2008)

    High-Level Successors are Low-Level Terminals

    P^Drive_Reach = Ω^Reach

    Ω^o(s, s′): the discounted probability that executing o in s terminates at s′
    Ω^o(·, s′) is a value function!

    [Diagram: the Q^Drive computation as before, with the abstract transition model P^Drive_Reach supplied by Ω^Reach.]
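    Because Ω^o(·, s′) behaves like a value function, it can be computed by the same kind of fixed-point iteration used for policy evaluation. Below is a tabular sketch under assumed array shapes and an assumed discount; the thesis works with learned, approximated models rather than a known P.

    import numpy as np

    def termination_model(P, policy, terminal, gamma=0.99, sweeps=1000):
        """Compute Ω^o(s, s'): the discounted probability that executing the task or option
        from s terminates at s'.

        P:        [A, S, S] one-step transition probabilities P(s, a, s')
        policy:   [S]       integer array, the option's action in each state (π^o)
        terminal: [S]       boolean array, True where the option terminates (T^o)
        """
        S = P.shape[1]
        omega = np.zeros((S, S))
        for _ in range(sweeps):                  # fixed-point iteration, as in policy evaluation
            new = np.zeros_like(omega)
            for s in range(S):
                if terminal[s]:
                    new[s, s] = 1.0              # convention: a terminal state terminates in place
                    continue
                p = P[policy[s], s]              # next-state distribution under the option policy
                # mass that terminates on the next step, plus discounted mass that continues
                new[s] = gamma * (p * terminal + (p * ~terminal) @ omega)
            omega = new
        return omega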


  • The R-MAXQ Algorithm (Jong and Stone, 2008)

    [Architecture diagram: Data, Model Estimation (R, P), Planning (Q, π), Action Selection, Policy Evaluation (V^a, Ω^a), Model Assembly (V^o, Ω^o).]

    Primitive Tasks
    Learn primitive models from data.
    Splice in R-MAX optimistic exploration.
    Result: V^a and Ω^a

    Composite Tasks
    Concatenate subtask V^a and Ω^a into R^o and P^o.
    Plan π^o using MAXQ goal rewards.
    Evaluate π^o without goal rewards.
    Result: V^o and Ω^o
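    The primitive-task models are ordinary R-MAX models. A minimal tabular sketch (the threshold m and the bound v_max are assumed parameter names; the fitted version on the next slide replaces the tables with averagers):

    import numpy as np

    def rmax_primitive_model(transition_counts, reward_sums, m=5, v_max=0.0):
        """Empirical reward and transition estimates wherever (s, a) has at least m samples,
        and an optimistic self-loop worth v_max per step everywhere else, so planning is
        drawn toward insufficiently explored state-action pairs.

        transition_counts: int array [S, A, S] of observed s, a -> s' counts
        reward_sums:       float array [S, A] of summed observed rewards
        """
        S, A, _ = transition_counts.shape
        n = transition_counts.sum(axis=2)
        R = np.zeros((S, A))
        P = np.zeros((S, A, S))
        for s in range(S):
            for a in range(A):
                if n[s, a] >= m:                                # "known": use the empirical model
                    R[s, a] = reward_sums[s, a] / n[s, a]
                    P[s, a] = transition_counts[s, a] / n[s, a]
                else:                                           # "unknown": optimistic model
                    R[s, a] = v_max
                    P[s, a, s] = 1.0
        return R, P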



  • The Fitted R-MAXQ Algorithm (Jong and Stone, 2009)

    [Dataflow diagram: the data (R_D, P_D) feed model approximation (Ψ^a) and optimistic exploration (U^a) to give the primitive V^a and Ω^a; value approximation (Φ^o) assembles R^o and P^o for each composite task o, and prediction and planning (with goal rewards G^o and terminal set T^o) produce V^o, Ω^o, and π^o.]

    Primitive Action Models
    Define V^a = U V_max + (I − U) Ψ^a R_D
    Define Ω^a = (I − U) Ψ^a P_D

    Prediction
    Solve V^o = π^o(R^o + γ P^o (I − T^o) V^o)
    Solve Ω^o = π^o(P^o T^o + γ P^o (I − T^o) Ω^o)

    Planning
    Optimize Ṽ^o = T^o G^o + (I − T^o) π^o(R^o + γ P^o Ṽ^o)

    Value Approximation
    Define R^o[sa] = V^a[s]
    Define P^o[sa, x] = Ω^a[s, s′] Φ^o[s′, x]
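    The primitive-model definitions have a direct matrix reading. A sketch under assumed shapes (X approximation points, D observed transitions for action a); the averager weights Ψ^a and the diagonal "unknown-ness" matrix U would come from the model-approximation and optimistic-exploration components:

    import numpy as np

    def fitted_primitive_model(Psi_a, U, R_D, P_D, v_max=0.0):
        """V^a = U v_max + (I - U) Psi^a R_D  and  Omega^a = (I - U) Psi^a P_D.

        Psi_a: [X, D] averager weights from approximation points to observed transitions of a
        U:     [X, X] diagonal; U[x, x] near 1 where x is still unknown for a, near 0 where known
        R_D:   [D]    observed one-step rewards
        P_D:   [D, X] observed successor states, already projected onto approximation points
        """
        X = Psi_a.shape[0]
        I = np.eye(X)
        V_a = U @ (v_max * np.ones(X)) + (I - U) @ (Psi_a @ R_D)   # optimistic where unknown
        Omega_a = (I - U) @ (Psi_a @ P_D)                          # unknown rows keep no successor mass
        return V_a, Omega_a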

  • The Software Architecture

    Algorithm
    1 Execute π^Root hierarchically
    2 Update data: R_D and P_D
    3 Propagate changes to π^Root
    4 Repeat

    Averagers
    Φ: interpolation over a uniform grid
    Ψ: radial basis functions

    Optimizations
    Memoization and dynamic programming
    Prioritized sweeping
    Sparse representations
    Cover trees for online nearest neighbors
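    For concreteness, a minimal radial-basis-function averager of the kind Ψ describes (a sketch with an assumed Gaussian kernel and bandwidth, not the thesis implementation); Φ would instead interpolate over a uniform grid:

    import numpy as np

    def rbf_averager_weights(queries, centers, bandwidth=0.1):
        """Return a [Q, C] row-stochastic weight matrix: each query point is predicted as a
        convex combination of the stored centers, so fitted values stay within observed bounds."""
        d2 = ((queries[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)  # squared distances
        w = np.exp(-d2 / (2.0 * bandwidth ** 2))
        return w / w.sum(axis=1, keepdims=True)

    # Example: weights of two query points over three stored 2-D centers; each row sums to 1.
    centers = np.array([[0.1, 0.1], [0.5, 0.5], [0.9, 0.9]])
    queries = np.array([[0.2, 0.2], [0.8, 0.7]])
    weights = rbf_averager_weights(queries, centers)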

  • Example: A Resource Gathering Simulation

    [Figure: unit-square map with resources A, B, C, D and puddle danger zones.]

    Simulated Robot's Task
    Gather each of n resources.
    Navigate around danger zones.

    n + 2 State Variables
    Boolean flag for each resource: A, B, . . .
    x and y coordinates

    n + 4 Actions
    north, south, east, west change x and y.
    pickupA sets flag A if near resource A, etc.
    Actions cost −1 generally, but up to −40 in "puddles".
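    A minimal sketch of the simulation with assumed geometry (step size, pickup radius, rectangular puddles, and a flat puddle penalty; in the real domain the penalty varies up to −40). Pickup actions are indexed here rather than lettered:

    import numpy as np

    class ResourceGathering:
        """State: (x, y, flag_0, ..., flag_{n-1}).  Actions: north/south/east/west and pickup_i."""

        def __init__(self, resources, puddles, step=0.05, puddle_cost=-40.0):
            self.resources = [np.array(r) for r in resources]  # resource locations in the unit square
            self.puddles = puddles                             # list of ((x0, y0), (x1, y1)) boxes
            self.step = step
            self.puddle_cost = puddle_cost
            self.reset()

        def reset(self):
            self.xy = np.array([0.0, 0.0])
            self.flags = [False] * len(self.resources)
            return (*self.xy, *self.flags)

        def _in_puddle(self, xy):
            return any(x0 <= xy[0] <= x1 and y0 <= xy[1] <= y1
                       for (x0, y0), (x1, y1) in self.puddles)

        def step_action(self, action):
            moves = {"north": (0, 1), "south": (0, -1), "east": (1, 0), "west": (-1, 0)}
            if action in moves:                                # movement actions change x and y
                self.xy = np.clip(self.xy + self.step * np.array(moves[action]), 0.0, 1.0)
            else:                                              # e.g. "pickup0" sets flag 0 if nearby
                i = int(action.replace("pickup", ""))
                if np.linalg.norm(self.xy - self.resources[i]) < 0.1:
                    self.flags[i] = True
            reward = self.puddle_cost if self._in_puddle(self.xy) else -1.0
            done = all(self.flags)
            return (*self.xy, *self.flags), reward, done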

  • The Utility of Hierarchy and Model Generalization

    [Plots: reward per episode (episodes 0 to 40) and cumulative reward (episodes 0 to 100) on the resource-gathering task, comparing R-MAX, Fitted R-MAX, R-MAXQ, and Fitted R-MAXQ.]

    Model generalization allows Fitted R-MAX to outperform R-MAX.
    Hierarchical decomposition allows R-MAXQ to outperform R-MAX.
    These two ideas synergize in Fitted R-MAXQ!


  • Task Hierarchies as Domain Knowledge

    [Diagrams of three hierarchies:
    Flat: Root chooses directly among the primitives north, south, east, west, pickupA, ...
    Shallow: Root chooses among GatherA, GatherB, ...; each Gather task uses north, south, east, west, and its pickup action.
    Deep: each Gather task chooses among its pickup action and Navigate subtasks for the known locations; Navigate uses north, south, east, west.]

    The flat hierarchy only knows the model averagers.
    The shallow hierarchy also knows that gathering each resource is independent.
    The deep hierarchy also knows the set of resource locations (but must still associate each resource with a location).
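    One way to see what each hierarchy asserts is to write it down as a task-to-children mapping (a purely illustrative encoding; names such as Navigate1 are assumptions):

    primitives = ["north", "south", "east", "west"]

    # Flat: the root chooses directly among all primitive actions.
    flat = {"Root": primitives + ["pickupA", "pickupB", "pickupC", "pickupD"]}

    # Shallow: one Gather task per resource, each solved independently of the others.
    shallow = {"Root": ["GatherA", "GatherB", "GatherC", "GatherD"]}
    for resource in "ABCD":
        shallow["Gather" + resource] = primitives + ["pickup" + resource]

    # Deep: Gather tasks choose among Navigate-to-location subtasks and their pickup action,
    # so the agent must still learn which location holds which resource.
    locations = ["Navigate1", "Navigate2", "Navigate3", "Navigate4"]
    deep = {"Root": ["GatherA", "GatherB", "GatherC", "GatherD"]}
    for resource in "ABCD":
        deep["Gather" + resource] = locations + ["pickup" + resource]
    for loc in locations:
        deep[loc] = list(primitives)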

  • Soft Inductive Bias in Hierarchy

    [Figure: the resource-gathering map with the agent's current state s and an unexplored state x inside a puddle, shown alongside the shallow hierarchy.]

    From s, explore the unknown state in the puddle or exploit the known solution?

    Flat Hierarchy
    Optimism about the unknown effects of pickupD at x outweighs the value of the known solution, so the policy that detours to x looks more valuable at s than the known policy.

    Shallow Hierarchy
    In the context of GatherD, the value of pickupD at x is less than the value of the known solution, so the known policy looks more valuable at s.


  • Hierarchical Constraint and Reformulation

    Hierarchies can find embedded structure.
    Hierarchies can constrain policies and therefore exploration.

    Deep Hierarchy
    pickup actions are only possible before or after Navigate tasks.

    [Figure: the deep hierarchy alongside the resource-gathering map with resources A, B, C, D.]


  • Outline

    1 Introduction

    2 Exploration and Approximation

    3 Exploration and Hierarchy

    4 Conclusion
      Future Work
      Summary

  • Discovering Abstractions

    [Diagram: driving task hierarchy: Drive to Campus → Reach Highway → Turn Left → ...]

    We can now use task hierarchies to efficiently explore continuous environments.
    Can we discover composite tasks automatically?

    What makes a good subtask?
    Other research: “bottleneck states”
    My conjecture: “sets of relevant features” (Jong and Stone, 2005)

  • Prior Distributions Over Inductive Biases

    No Free Lunch
    No algorithm can learn or discover efficiently in all possible worlds!

    Bayesian Reinforcement Learning
    Begin with a prior distribution over environments.
    Plan over “belief states”.
    Update the belief distribution given data.

    Key question: What is the right prior distribution?
    Conjecture: Distributions over task hierarchies
    Goal: Efficient approximation of the optimal Bayesian solution

  • Natural Knowledge Representations

    Model-free methods learn a monolithic value function.
    Models are a natural form of domain knowledge.
    Models are modular: piecewise independent.

    Don't Reinvent the Wheel
    Exploit a known reward function.
    Exploit the known dynamics of some actions.
    Exploit the known dynamics of some state variables.


    Connections to Other Fields of Arti