Structured Exploration for Reinforcement Learning
Nicholas K. Jong, Department of Computer Sciences, The University of Texas at Austin
December 1, 2010 / PhD Final Defense


TRANSCRIPT

  • Structured Exploration for Reinforcement Learning

    Nicholas K. Jong
    Department of Computer Sciences, The University of Texas at Austin
    December 1, 2010 / PhD Final Defense

  • Outline

    1 Introduction
    2 Exploration and Approximation
    3 Exploration and Hierarchy
    4 Conclusion

  • Outline

    1 Introduction
      The Reinforcement Learning Problem
      Reinforcement Learning Methods
      Thesis Focus
    2 Exploration and Approximation
    3 Exploration and Hierarchy
    4 Conclusion

  • One Solution to Many Problems

    Reality: Many tasks mean many engineering problems
    Dream: Opportunity for a single general learning algorithm

    Potential Payoffs
      Reduce engineering costs
      Solve problems beyond our current abilities
      Achieve solutions robust to uncertainty

  • One Formalism for Many Problems

    Agent
      Observes state s ∈ S
      Chooses action a ∈ A
      For arbitrary S and A

    Environment
      Generates reward r ∈ ℝ with expected value R(s, a)
      Generates next state s′ ∈ S with probability P(s, a, s′)
      Using unknown reward and transition functions R and P

    [Diagram: the agent sends action a to the environment; the environment returns reward r and next state s]

    Goal: Find a policy π : S → A that maximizes future rewards
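
    A minimal sketch of the interaction loop described above, in Python. The Environment interface (reset/step returning r, s′, done) and the policy callable are illustrative assumptions, not code from the thesis.

      def run_episode(env, policy, max_steps=100):
          """Roll out one episode: the agent observes s, chooses a = policy(s),
          and the environment returns a reward r and a next state s'."""
          total_reward = 0.0
          s = env.reset()                      # initial state s in S
          for _ in range(max_steps):
              a = policy(s)                    # agent chooses a in A
              r, s_next, done = env.step(a)    # environment samples r ~ R(s,a), s' ~ P(s,a,.)
              total_reward += r
              s = s_next
              if done:
                  break
          return total_reward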

  • Example: A Resource Gathering Simulation

    [Figure: unit-square map with resources A, B, C, D and danger zones]

    Simulated Robot's Task
      Gather each of n resources
      Navigate around danger zones

    n + 2 State Variables
      Boolean flag for each resource: A, B, ...
      x and y coordinates

    n + 4 Actions
      north, south, east, west change x and y
      pickupA sets flag A if near resource A, etc.
      Actions cost −1 generally but up to −40 in "puddles"

  • Evaluating Policies with Value Functions

    The Bellman Equation
      State value depends on policy and action values
      Long-term value equals present value plus future value.

      Vπ(s) = Qπ(s, π(s))
      Qπ(s, a) = R(s, a) + γ Σs′ P(s, a, s′) Vπ(s′)

    In vector form:

      Vπ = π Qπ
      Qπ = R + γ P Vπ

    [Diagram: Qπ is assembled from the reward function R and the transition function P composed with Vπ]
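
    For a finite state space with known R and P, the Bellman equation above yields a simple iterative policy-evaluation procedure. A sketch, assuming R[s][a] is a number and P[s][a] maps next states to probabilities (illustrative data structures, not from the thesis):

      def evaluate_policy(states, policy, R, P, gamma=0.99, tol=1e-8):
          """Iterate V(s) <- R(s, pi(s)) + gamma * sum_{s'} P(s, pi(s), s') V(s')
          until convergence (the Bellman equation for a fixed policy)."""
          V = {s: 0.0 for s in states}
          while True:
              delta = 0.0
              for s in states:
                  a = policy[s]
                  new_v = R[s][a] + gamma * sum(p * V[s2] for s2, p in P[s][a].items())
                  delta = max(delta, abs(new_v - V[s]))
                  V[s] = new_v
              if delta < tol:
                  return V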

  • Example: An Optimal Value Function

    Some policy achieves the maximal value function V∗
    Planning algorithms compute V∗ from R, P
    But RL algorithms don't know R and P

    [Figure: the resource-gathering map with A, B, C, D, alongside surface plots of V∗(x, y, {C}), V∗(x, y, {C, D}), and V∗(x, y, {D}) over the unit square; values range from roughly −70 to 0]
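
    "Planning algorithms compute V∗ from R, P" refers to dynamic-programming methods such as value iteration. A sketch under the same illustrative R/P dictionaries as above; the slide does not commit to a particular planner.

      def value_iteration(states, actions, R, P, gamma=0.99, tol=1e-8):
          """Compute V*(s) = max_a [ R(s,a) + gamma * sum_{s'} P(s,a,s') V*(s') ]."""
          V = {s: 0.0 for s in states}
          while True:
              delta = 0.0
              for s in states:
                  best = max(R[s][a] + gamma * sum(p * V[s2] for s2, p in P[s][a].items())
                             for a in actions)
                  delta = max(delta, abs(best - V[s]))
                  V[s] = best
              if delta < tol:
                  break
          # Greedy policy with respect to V*
          policy = {s: max(actions, key=lambda a: R[s][a] +
                           gamma * sum(p * V[s2] for s2, p in P[s][a].items()))
                    for s in states}
          return V, policy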

  • Standard Approach: Learn the Value Function

    Temporal Difference Learning (Sutton and Barto, 1998)
      Estimate Vπ directly from data.
      Given each piece of data ⟨s, a, r, s′⟩:
        r + γ V̂π(s′) is an estimate of Vπ(s).
        Update V̂π(s) towards this estimate.
        Improve π.

    Converges to the optimal policy in the limit, given appropriate data.
    In practice, converges very slowly!

    Most RL research focuses on ways to compute value functions more efficiently from data.

    [Diagram: data feeds value function estimation, which produces Q and π for action selection]
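
    The update described on this slide is, concretely, a TD(0) step. A minimal sketch assuming a tabular estimate stored in a dict; the learning rate alpha is an assumption, not a quantity from the slide.

      def td0_update(V_hat, s, r, s_next, gamma=0.99, alpha=0.1):
          """One TD(0) step: move V_hat(s) toward the sample target r + gamma * V_hat(s')."""
          v_s = V_hat.get(s, 0.0)
          target = r + gamma * V_hat.get(s_next, 0.0)
          V_hat[s] = v_s + alpha * (target - v_s)
          return V_hat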

  • Scaling to Real-World Problems

    Theory: Eventual convergence to optimal behavior
    Practice: Too slow for interesting problems

    Branches of RL Research
      Function Approximation
      Hierarchical RL
      Relational RL
      Inverse RL
      Etc.

    [Diagram: Reinforcement Learning overlapping with Hierarchical Decomposition and Function Approximation]

  • Exploration and Exploitation

    Exploitation
      How to estimate Q∗ from data
      Focus of most RL research

    Exploration
      How to gather better data
      Emphasized by model-based RL
      Focus of this thesis

    [Diagram: the model-based loop, with data feeding model estimation (R, P), planning producing Q and π, and action selection interacting with the environment]

  • Thesis Contributions

    Merging Branches of RL
      Previously studied in isolation
      Demonstration of synergies

    Efficient exploration in continuous state spaces
    Efficient exploration given hierarchical knowledge
    Framework for combining algorithmic ideas
    Publicly available implementation of final agent

    [Diagram: Reinforcement Learning combined with Model-Based Exploration, Hierarchical Decomposition, and Function Approximation, yielding Approximate Exploration, Hierarchical Exploration, and, together, Structured Exploration]

  • Outline

    1 Introduction
    2 Exploration and Approximation
      Model-Based Exploration
      Generalization in Large State Spaces
      The Fitted R-MAX Algorithm
    3 Exploration and Hierarchy
    4 Conclusion

  • Model-Based Reinforcement Learning

    [Diagram: data feeds model estimation (R, P); planning produces Q and π for action selection]

    Indirection Permits Simplicity
      R, P predict only one time step
      R, P involve only one action at a time
      Direct training data permits supervised learning

    Uncertainty Guides Exploration
      Use model of known states to reach the unknown
      First polynomial-time sample-complexity bounds (Kearns and Singh, 1998; Kakade, 2003)
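
    The "model estimation" box above amounts to supervised estimation of R and P from one-step data. A sketch of maximum-likelihood estimation from transition counts, assuming hashable states and actions; MLEModel is an illustrative name, not the thesis code.

      from collections import defaultdict

      class MLEModel:
          """Maximum-likelihood estimates of R(s,a) and P(s,a,s') from one-step data."""
          def __init__(self):
              self.counts = defaultdict(lambda: defaultdict(int))   # (s,a) -> {s': n}
              self.reward_sum = defaultdict(float)                   # (s,a) -> total reward
              self.visits = defaultdict(int)                         # (s,a) -> n

          def update(self, s, a, r, s_next):
              self.counts[(s, a)][s_next] += 1
              self.reward_sum[(s, a)] += r
              self.visits[(s, a)] += 1

          def R(self, s, a):
              n = self.visits[(s, a)]
              return self.reward_sum[(s, a)] / n if n else 0.0

          def P(self, s, a):
              n = self.visits[(s, a)]
              return {s2: c / n for s2, c in self.counts[(s, a)].items()} if n else {}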

  • Simple and Efficient Learning with R-max
    (Moore and Atkeson, 1993; Brafman and Tennenholtz, 2002)

    Maximum-Likelihood Estimation
      Straightforward in finite state spaces
      Unreliable with small sample sizes

    Small sample sizes: use an optimistic model
      An unknown state-action is assigned reward rmax and routed to a state of value Vmax
    Given enough data: use the MLE model

    [Diagrams: example R and P estimates for one state-action, contrasting the optimistic model (reward rmax, all probability on the maximal-value outcome) with the maximum-likelihood model built from the observed transitions]
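
    A minimal sketch of the two branches described above, reusing the MLEModel sketch from the model-based slide. The threshold m, the reward bound r_max, and the fictitious absorbing state are the standard R-max ingredients; the names here are illustrative.

      RMAX_STATE = "__absorbing__"   # fictitious state whose value is fixed at Vmax

      def rmax_model(mle, s, a, m, r_max):
          """Return (R(s,a), P(s,a,.)) for planning.

          Known (at least m samples): use the maximum-likelihood estimates.
          Unknown: optimistically assume reward r_max and a transition to an
          absorbing state whose value the planner fixes at Vmax,
          e.g. r_max / (1 - gamma)."""
          if mle.visits[(s, a)] >= m:
              return mle.R(s, a), mle.P(s, a)
          return r_max, {RMAX_STATE: 1.0}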

  • Challenges for Model-Based Reinforcement Learning

    Computational Complexity
      MDP planning can be expensive...
      But CPU cycles are cheaper than data

    Representational Complexity
      State distributions are harder to represent than scalar values...
      But simple approximations may suffice

    Exhaustive Exploration
      Exploring every unknown state seems unnecessary...
      But intuitive domain knowledge can constrain exploration

  • Function Approximation

    Problem: Exact representation of V requires a parameter for each state. Many environments have infinitely many states!

    Key Idea
      Represent Vπ using a small number of parameters.

    Examples
      The weights of a neural network
      Coefficients of some basis functions: Vπ = Σi wπi φi

    Generalization of values
      Changing Vπ(s) changes one or more parameters.
      Each parameter influences the value of several states.
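
    The basis-function representation Vπ = Σi wπi φi is ordinary linear function approximation. A tiny sketch, with the features φi left abstract as callables:

      def v_approx(weights, features, s):
          """Linear value estimate V(s) = sum_i w_i * phi_i(s)."""
          return sum(w * phi(s) for w, phi in zip(weights, features))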

  • Fitted Value Iteration
    (Gordon, 1995)

    Averagers
      Parameterize Vπ with values Vπ(X) on a finite sample X ⊂ S
      Vπ(s) is a weighted average: Vπ(s) = Σx∈X φ(s, x) Vπ(x)

    Discrete Planning in Continuous State Spaces
      Approximate planning with an exact MDP
      Exact planning with an approximate MDP

    [Diagram: the Bellman backup through R and P, composed with the averager Φ over the sample points x]
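
    A sketch of fitted value iteration with an averager: values are stored only on the sample X and read off elsewhere through the weights φ(s, x). R, P_sample, and phi below are illustrative callables; Gordon's averager condition (non-negative weights summing to one) is what keeps the backup a contraction.

      def fitted_value_iteration(X, actions, R, P_sample, phi, gamma=0.99, sweeps=200):
          """Fitted VI with an averager (Gordon, 1995).

          P_sample(x, a) -> list of (s_next, prob); phi(s, x) -> averager weight."""
          V = {x: 0.0 for x in X}

          def value(s):
              return sum(phi(s, x) * V[x] for x in X)   # weights are >= 0 and sum to 1

          for _ in range(sweeps):
              V = {x: max(R(x, a) + gamma * sum(p * value(s2) for s2, p in P_sample(x, a))
                          for a in actions)
                   for x in X}
          return V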

  • Model Approximation
    (Jong and Stone, 2007b)

    Approximate sa using instances i = ⟨si, ai, ri, s′i⟩
      Ψ(sa, i): model averager weighting sa against si ai
      DP(si, s′): empirical effect of applying the transition observed at i to s

    [Diagram: the predicted transition for sa is a Ψ-weighted mixture (e.g., weights 0.44, 0.38, 0.18) of the relative transitions observed at nearby instances i1, i2, i3]
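
    One way to read the Ψ/DP construction: the predicted outcome for sa is a kernel-weighted mixture of "apply the relative effect observed at instance i to s". A sketch for continuous vector states, assuming a simple vector-difference effect model (s′ ≈ s + (s′i − si)); predict_transition is a hypothetical name, and the thesis's actual weighting appears on the "An Instance of Fitted R-MAX" slide below.

      def predict_transition(s, a, instances, kernel):
          """Approximate the model at sa from instances i = (s_i, a_i, r_i, s'_i):
          instances with a_i == a get weight Psi proportional to kernel(s, s_i)
          and predict the relative effect s + (s'_i - s_i)."""
          weighted = [(kernel(s, si), tuple(x + (y - z) for x, y, z in zip(s, s2i, si)), ri)
                      for (si, ai, ri, s2i) in instances if ai == a]
          total = sum(w for w, _, _ in weighted)
          if total == 0.0:
              return None                                            # no relevant data: leave sa unknown
          outcomes = [(w / total, s_pred) for w, s_pred, _ in weighted]   # Psi-normalized mixture
          reward = sum(w / total * ri for w, _, ri in weighted)           # Psi-weighted reward estimate
          return reward, outcomes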

  • Fitted R-MAX
    (Jong and Stone, 2007a)

    Model approximation, R-MAX exploration, value approximation

    [Diagram: the model-based loop with model estimation expanded; R and P are built from the instance averager Ψ composed with the empirical effects DR and DP, state-actions flagged as unknown receive Vmax through a terminal transition Pterm, and planning passes predicted next states through the value averager Φ over the sample states]

  • An Instance of Fitted R-MAX

    Model Averager
      ψ(sa, si ai) ∝ K(s, si) δ(a, ai)
      K(s, s′) = exp(−d(s, s′)² / b²), a radial basis function

    R-MAX Exploration
      sa known if it has sufficient weight: Σ{i | ai = a} K(s, si) ≥ m

    Value Averager
      Interpolation over a uniform grid
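
    The averager and known-state test on this slide translate directly. A sketch assuming Euclidean distance for d, with the bandwidth b and the threshold m left as free parameters:

      import math

      def rbf_kernel(s, s2, b):
          """K(s, s') = exp(-d(s, s')^2 / b^2), with d taken to be Euclidean distance."""
          d2 = sum((x - y) ** 2 for x, y in zip(s, s2))
          return math.exp(-d2 / (b * b))

      def is_known(s, a, instances, b, m):
          """A state-action sa counts as known once the kernel weight of the
          instances with matching action reaches the threshold m."""
          return sum(rbf_kernel(s, si, b) for (si, ai, _, _) in instances if ai == a) >= m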

  • Benchmark Performance

    [Figure: reward per episode over 1000 episodes, comparing Fitted R-MAX, XAI, Least Squares Policy Iteration, R-MAX, and Neural Fitted Q Iteration]

    For n = 1 resource, the task is almost equivalent to the benchmark domain "Puddleworld"
    Can compare against performance data from the NIPS RL Benchmarking Workshop (2005)
    State-of-the-art algorithms implemented and tuned by other researchers

  • Fitted Value Functions for PuddleWorld

    [Figures: learned policies and value functions over the unit square with 250, 500, 750, 1000, 1500, 2000, 3000, 4000, and 5000 instances; value scales range from roughly −70 to 0]

  • Generalization and Exploration

    Inductive Bias
      Model-Free: Similar states have similar values
      Model-Based: Similar states have similar dynamics

    Model Generalization
      Is the effect of sa known or unknown?
      Less generalization leads to more exploration

    Value Generalization
      How good is my policy π?
      Less generalization leads to more computation

  • Outline

    1 Introduction
    2 Exploration and Approximation
    3 Exploration and Hierarchy
      Hierarchical Decomposition
      The R-MAXQ and Fitted R-MAXQ Algorithms
      The Utility of Hierarchy
    4 Conclusion

  • The Appeal of Hierarchy

    Realistic Problems
      Many states and many actions...
      But also deep structure
      Multiple levels of abstraction
      Local dependencies

    Structured Learning and Planning
      Don't write all programs in assembly!
      Reason above the level of primitive actions.

    [Diagram: task hierarchy, with Drive to Campus decomposing into Reach Highway and other subtasks, and Reach Highway decomposing into Turn Left and others]

  • Hierarchy in Reinforcement Learning

    Options (Sutton, Precup, and Singh, 1999)
      Partial policies as macros
      An option o comprises:
        An initiation set I^o ⊂ S
        An option policy π^o : S → A
        A termination function T^o : S → [0, 1]

    MAXQ (Dietterich, 2000)
      A hierarchy of RL problems
      A task o comprises:
        A set of subtasks A^o
        A set of terminal states T^o
        A goal reward function G^o : T^o → ℝ

    [Diagram: task hierarchy, Drive to Campus, Reach Highway, Turn Left, ...]
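
    The two formalisms on this slide are different bundles of components. A minimal sketch of each as a Python dataclass; all field types are chosen for illustration only.

      from dataclasses import dataclass
      from typing import Any, Callable, List, Set

      State, Action = Any, Any

      @dataclass
      class Option:
          """Sutton, Precup, and Singh (1999): a partial policy used as a macro-action."""
          initiation_set: Set[State]                 # I^o, subset of S
          policy: Callable[[State], Action]          # pi^o : S -> A
          termination: Callable[[State], float]      # T^o : S -> [0, 1]

      @dataclass
      class MaxQTask:
          """Dietterich (2000): one node in a hierarchy of RL problems."""
          subtasks: List["MaxQTask"]                 # A^o (child tasks or primitive actions)
          terminal_states: Set[State]                # T^o
          goal_reward: Callable[[State], float]      # G^o : T^o -> R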

  • Reinforcement Learning

    Model-Based Exploration

    Hierarchical Decomposition

    Function Approximation

    Hierarchical Exploration

    Approximate Exploration

    Structured Exploration

    IntroductionExploration and Approximation

    Exploration and HierarchyConclusion

    Hierarchical DecompositionThe R-MAXQ and Fitted R-MAXQ AlgorithmsThe Utility of Hierarchy

    MAXQ Value Function Decomposition

    High-Level Rewards are Low-Level Values

    Separate Qo into components Qoa by actionCompute Roa = V o recursivelyLearn Coa := γPoa V o directly

    [Diagram: Q^Drive(s, a) is computed through a switch on the action a; the branch for a = Reach Highway combines R^Drive_Reach = V^Reach with the completion term C^Drive_Reach = γ P^Drive_Reach V^Drive. The running hierarchy is Drive to Campus → Reach Highway → Turn Left.]

    Expanding the decomposition down the hierarchy:
    V^Drive(s) = Q^Drive_Reach(s)
               = V^Reach(s) + C^Drive_Reach(s)
               = Q^Reach_Turn(s) + C^Drive_Reach(s)
               = V^Turn(s) + C^Reach_Turn(s) + C^Drive_Reach(s)
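    A minimal sketch of how this decomposition can be evaluated, assuming the learned components V^a and C^o_a are stored in plain dictionaries keyed by task and state (hypothetical names, not the thesis code):

    def decomposed_q(V, C, task, child, state):
        """Q^o_a(s) = V^a(s) + C^o_a(s): the value of completing child a, plus the
        completion value of finishing task o afterwards."""
        return V[child][state] + C[(task, child)][state]

    def decomposed_v(V, C, policy, primitives, task, state):
        """Expand V^o(s) recursively down the hierarchy along each task's chosen child."""
        if task in primitives:
            return V[task][state]            # for a primitive action, V^a(s) is its expected reward
        child = policy[task][state]          # a = π^o(s)
        return decomposed_v(V, C, policy, primitives, child, state) + C[(task, child)][state]

    # For the driving example this expansion gives
    # V["Drive"][s] = V["Turn"][s] + C[("Reach", "Turn")][s] + C[("Drive", "Reach")][s].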



  • Hierarchical Model Decomposition (Jong and Stone, 2008)

    High-Level Successors are Low-Level Terminals

    P^Drive_Reach = Ω^Reach

    Ω^o(s, s′): the discounted probability that executing o in s terminates at s′
    Ω^o(·, s′) is a value function!

    [Diagram: the Q^Drive computation as before, with the abstract transition model P^Drive_Reach supplied by Ω^Reach.]
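    Because Ω^o(·, s′) behaves like a value function, it can be computed by the same kind of fixed-point iteration used for policy evaluation. Below is a tabular sketch under assumed array shapes and an assumed discount; the thesis works with learned, approximated models rather than a known P.

    import numpy as np

    def termination_model(P, policy, terminal, gamma=0.99, sweeps=1000):
        """Compute Ω^o(s, s'): the discounted probability that executing the task or option
        from s terminates at s'.

        P:        [A, S, S] one-step transition probabilities P(s, a, s')
        policy:   [S]       integer array, the option's action in each state (π^o)
        terminal: [S]       boolean array, True where the option terminates (T^o)
        """
        S = P.shape[1]
        omega = np.zeros((S, S))
        for _ in range(sweeps):                  # fixed-point iteration, as in policy evaluation
            new = np.zeros_like(omega)
            for s in range(S):
                if terminal[s]:
                    new[s, s] = 1.0              # convention: a terminal state terminates in place
                    continue
                p = P[policy[s], s]              # next-state distribution under the option policy
                # mass that terminates on the next step, plus discounted mass that continues
                new[s] = gamma * (p * terminal + (p * ~terminal) @ omega)
            omega = new
        return omega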


  • The R-MAXQ Algorithm (Jong and Stone, 2008)

    [Architecture diagram: Data, Model Estimation (R, P), Planning (Q, π), Action Selection, Policy Evaluation (V^a, Ω^a), Model Assembly (V^o, Ω^o).]

    Primitive Tasks
    Learn primitive models from data.
    Splice in R-MAX optimistic exploration.
    Result: V^a and Ω^a

    Composite Tasks
    Concatenate subtask V^a and Ω^a into R^o and P^o.
    Plan π^o using MAXQ goal rewards.
    Evaluate π^o without goal rewards.
    Result: V^o and Ω^o
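    The primitive-task models are ordinary R-MAX models. A minimal tabular sketch (the threshold m and the bound v_max are assumed parameter names; the fitted version on the next slide replaces the tables with averagers):

    import numpy as np

    def rmax_primitive_model(transition_counts, reward_sums, m=5, v_max=0.0):
        """Empirical reward and transition estimates wherever (s, a) has at least m samples,
        and an optimistic self-loop worth v_max per step everywhere else, so planning is
        drawn toward insufficiently explored state-action pairs.

        transition_counts: int array [S, A, S] of observed s, a -> s' counts
        reward_sums:       float array [S, A] of summed observed rewards
        """
        S, A, _ = transition_counts.shape
        n = transition_counts.sum(axis=2)
        R = np.zeros((S, A))
        P = np.zeros((S, A, S))
        for s in range(S):
            for a in range(A):
                if n[s, a] >= m:                                # "known": use the empirical model
                    R[s, a] = reward_sums[s, a] / n[s, a]
                    P[s, a] = transition_counts[s, a] / n[s, a]
                else:                                           # "unknown": optimistic model
                    R[s, a] = v_max
                    P[s, a, s] = 1.0
        return R, P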



  • The Fitted R-MAXQ Algorithm (Jong and Stone, 2009)

    [Dataflow diagram: the data (R_D, P_D) feed model approximation (Ψ^a) and optimistic exploration (U^a) to give the primitive V^a and Ω^a; value approximation (Φ^o) assembles R^o and P^o for each composite task o, and prediction and planning (with goal rewards G^o and terminal set T^o) produce V^o, Ω^o, and π^o.]

    Primitive Action Models
    Define V^a = U V_max + (I − U) Ψ^a R_D
    Define Ω^a = (I − U) Ψ^a P_D

    Prediction
    Solve V^o = π^o(R^o + γ P^o (I − T^o) V^o)
    Solve Ω^o = π^o(P^o T^o + γ P^o (I − T^o) Ω^o)

    Planning
    Optimize Ṽ^o = T^o G^o + (I − T^o) π^o(R^o + γ P^o Ṽ^o)

    Value Approximation
    Define R^o[sa] = V^a[s]
    Define P^o[sa, x] = Ω^a[s, s′] Φ^o[s′, x]
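    The primitive-model definitions have a direct matrix reading. A sketch under assumed shapes (X approximation points, D observed transitions for action a); the averager weights Ψ^a and the diagonal "unknown-ness" matrix U would come from the model-approximation and optimistic-exploration components:

    import numpy as np

    def fitted_primitive_model(Psi_a, U, R_D, P_D, v_max=0.0):
        """V^a = U v_max + (I - U) Psi^a R_D  and  Omega^a = (I - U) Psi^a P_D.

        Psi_a: [X, D] averager weights from approximation points to observed transitions of a
        U:     [X, X] diagonal; U[x, x] near 1 where x is still unknown for a, near 0 where known
        R_D:   [D]    observed one-step rewards
        P_D:   [D, X] observed successor states, already projected onto approximation points
        """
        X = Psi_a.shape[0]
        I = np.eye(X)
        V_a = U @ (v_max * np.ones(X)) + (I - U) @ (Psi_a @ R_D)   # optimistic where unknown
        Omega_a = (I - U) @ (Psi_a @ P_D)                          # unknown rows keep no successor mass
        return V_a, Omega_a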

  • The Software Architecture

    Algorithm
    1 Execute π^Root hierarchically
    2 Update data: R_D and P_D
    3 Propagate changes to π^Root
    4 Repeat

    Averagers
    Φ: interpolation over a uniform grid
    Ψ: radial basis functions

    Optimizations
    Memoization and dynamic programming
    Prioritized sweeping
    Sparse representations
    Cover trees for online nearest neighbors
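    For concreteness, a minimal radial-basis-function averager of the kind Ψ describes (a sketch with an assumed Gaussian kernel and bandwidth, not the thesis implementation); Φ would instead interpolate over a uniform grid:

    import numpy as np

    def rbf_averager_weights(queries, centers, bandwidth=0.1):
        """Return a [Q, C] row-stochastic weight matrix: each query point is predicted as a
        convex combination of the stored centers, so fitted values stay within observed bounds."""
        d2 = ((queries[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)  # squared distances
        w = np.exp(-d2 / (2.0 * bandwidth ** 2))
        return w / w.sum(axis=1, keepdims=True)

    # Example: weights of two query points over three stored 2-D centers; each row sums to 1.
    centers = np.array([[0.1, 0.1], [0.5, 0.5], [0.9, 0.9]])
    queries = np.array([[0.2, 0.2], [0.8, 0.7]])
    weights = rbf_averager_weights(queries, centers)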

  • Example: A Resource Gathering Simulation

    [Figure: unit-square map with resources A, B, C, D and puddle danger zones.]

    Simulated Robot's Task
    Gather each of n resources.
    Navigate around danger zones.

    n + 2 State Variables
    Boolean flag for each resource: A, B, . . .
    x and y coordinates

    n + 4 Actions
    north, south, east, west change x and y.
    pickupA sets flag A if near resource A, etc.
    Actions cost −1 generally, but up to −40 in "puddles".
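    A minimal sketch of the simulation with assumed geometry (step size, pickup radius, rectangular puddles, and a flat puddle penalty; in the real domain the penalty varies up to −40). Pickup actions are indexed here rather than lettered:

    import numpy as np

    class ResourceGathering:
        """State: (x, y, flag_0, ..., flag_{n-1}).  Actions: north/south/east/west and pickup_i."""

        def __init__(self, resources, puddles, step=0.05, puddle_cost=-40.0):
            self.resources = [np.array(r) for r in resources]  # resource locations in the unit square
            self.puddles = puddles                             # list of ((x0, y0), (x1, y1)) boxes
            self.step = step
            self.puddle_cost = puddle_cost
            self.reset()

        def reset(self):
            self.xy = np.array([0.0, 0.0])
            self.flags = [False] * len(self.resources)
            return (*self.xy, *self.flags)

        def _in_puddle(self, xy):
            return any(x0 <= xy[0] <= x1 and y0 <= xy[1] <= y1
                       for (x0, y0), (x1, y1) in self.puddles)

        def step_action(self, action):
            moves = {"north": (0, 1), "south": (0, -1), "east": (1, 0), "west": (-1, 0)}
            if action in moves:                                # movement actions change x and y
                self.xy = np.clip(self.xy + self.step * np.array(moves[action]), 0.0, 1.0)
            else:                                              # e.g. "pickup0" sets flag 0 if nearby
                i = int(action.replace("pickup", ""))
                if np.linalg.norm(self.xy - self.resources[i]) < 0.1:
                    self.flags[i] = True
            reward = self.puddle_cost if self._in_puddle(self.xy) else -1.0
            done = all(self.flags)
            return (*self.xy, *self.flags), reward, done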

  • The Utility of Hierarchy and Model Generalization

    [Plots: reward per episode (episodes 0 to 40) and cumulative reward (episodes 0 to 100) on the resource-gathering task, comparing R-MAX, Fitted R-MAX, R-MAXQ, and Fitted R-MAXQ.]

    Model generalization allows Fitted R-MAX to outperform R-MAX.
    Hierarchical decomposition allows R-MAXQ to outperform R-MAX.
    These two ideas synergize in Fitted R-MAXQ!


  • Task Hierarchies as Domain Knowledge

    [Diagrams of three hierarchies:
    Flat: Root chooses directly among the primitives north, south, east, west, pickupA, ...
    Shallow: Root chooses among GatherA, GatherB, ...; each Gather task uses north, south, east, west, and its pickup action.
    Deep: each Gather task chooses among its pickup action and Navigate subtasks for the known locations; Navigate uses north, south, east, west.]

    The flat hierarchy only knows the model averagers.
    The shallow hierarchy also knows that gathering each resource is independent.
    The deep hierarchy also knows the set of resource locations (but must still associate each resource with a location).
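    One way to see what each hierarchy asserts is to write it down as a task-to-children mapping (a purely illustrative encoding; names such as Navigate1 are assumptions):

    primitives = ["north", "south", "east", "west"]

    # Flat: the root chooses directly among all primitive actions.
    flat = {"Root": primitives + ["pickupA", "pickupB", "pickupC", "pickupD"]}

    # Shallow: one Gather task per resource, each solved independently of the others.
    shallow = {"Root": ["GatherA", "GatherB", "GatherC", "GatherD"]}
    for resource in "ABCD":
        shallow["Gather" + resource] = primitives + ["pickup" + resource]

    # Deep: Gather tasks choose among Navigate-to-location subtasks and their pickup action,
    # so the agent must still learn which location holds which resource.
    locations = ["Navigate1", "Navigate2", "Navigate3", "Navigate4"]
    deep = {"Root": ["GatherA", "GatherB", "GatherC", "GatherD"]}
    for resource in "ABCD":
        deep["Gather" + resource] = locations + ["pickup" + resource]
    for loc in locations:
        deep[loc] = list(primitives)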

  • Soft Inductive Bias in Hierarchy

    [Figure: the resource-gathering map with the agent's current state s and an unexplored state x inside a puddle, shown alongside the shallow hierarchy.]

    From s, explore the unknown state in the puddle or exploit the known solution?

    Flat Hierarchy
    Optimism about the unknown effects of pickupD at x outweighs the value of the known solution, so the policy that detours to x looks more valuable at s than the known policy.

    Shallow Hierarchy
    In the context of GatherD, the value of pickupD at x is less than the value of the known solution, so the known policy looks more valuable at s.


  • Hierarchical Constraint and Reformulation

    Hierarchies can find embedded structure.
    Hierarchies can constrain policies and therefore exploration.

    Deep Hierarchy
    pickup actions are only possible before or after Navigate tasks.

    [Figure: the deep hierarchy alongside the resource-gathering map with resources A, B, C, D.]


  • Outline

    1 Introduction

    2 Exploration and Approximation

    3 Exploration and Hierarchy

    4 Conclusion
      Future Work
      Summary

  • Discovering Abstractions

    [Diagram: driving task hierarchy: Drive to Campus → Reach Highway → Turn Left → ...]

    We can now use task hierarchies to efficiently explore continuous environments.
    Can we discover composite tasks automatically?

    What makes a good subtask?
    Other research: “bottleneck states”
    My conjecture: “sets of relevant features” (Jong and Stone, 2005)

  • Prior Distributions Over Inductive Biases

    No Free Lunch
    No algorithm can learn or discover efficiently in all possible worlds!

    Bayesian Reinforcement Learning
    Begin with a prior distribution over environments.
    Plan over “belief states”.
    Update the belief distribution given data.

    Key question: What is the right prior distribution?
    Conjecture: Distributions over task hierarchies
    Goal: Efficient approximation of the optimal Bayesian solution

  • Natural Knowledge Representations

    Model-free methods learn a monolithic value function.
    Models are a natural form of domain knowledge.
    Models are modular: piecewise independent.

    Don't Reinvent the Wheel
    Exploit a known reward function.
    Exploit the known dynamics of some actions.
    Exploit the known dynamics of some state variables.


    Connections to Other Fields of Arti