Automatic Induction of MAXQ Hierarchies


Page 1: Automatic Induction of MAXQ Hierarchies

Automatic Induction of MAXQ Hierarchies

Neville Mehta
Michael Wynkoop
Soumya Ray
Prasad Tadepalli
Tom Dietterich

School of EECS, Oregon State University

Funded by DARPA Transfer Learning Program

Page 2: Automatic Induction of MAXQ Hierarchies

Hierarchical Reinforcement Learning

Exploits domain structure to facilitate learning:
- Policy constraints
- State abstraction

Paradigms: Options, HAMs, MAXQ.

The MAXQ task hierarchy is a directed acyclic graph of subtasks whose leaves are the primitive MDP actions.

Traditionally, the task structure is provided as prior knowledge to the learning agent.

Page 3: Automatic Induction of MAXQ Hierarchies

Model Representation

Dynamic Bayesian Networks (DBNs) for the transition and reward models.

The conditional probabilities and reward values are represented symbolically as decision trees.
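To make the representation concrete, here is a minimal sketch, not the authors' code, of such a factored action model: one decision tree per next-state variable plus one tree for the reward. All class and attribute names are illustrative assumptions.

```python
# Sketch of a factored (DBN) action model whose conditional probabilities
# are stored as decision trees. Names here are illustrative assumptions.
from dataclasses import dataclass
from typing import Dict, Optional

State = Dict[str, int]  # variable name -> current value

@dataclass
class TreeNode:
    """Internal nodes test a state variable; leaves hold a distribution."""
    test_var: Optional[str] = None                  # None at a leaf
    children: Optional[Dict[int, "TreeNode"]] = None
    leaf_dist: Optional[Dict[int, float]] = None    # next value -> probability

    def query(self, state: State) -> Dict[int, float]:
        node = self
        while node.test_var is not None:            # descend on tested values
            node = node.children[state[node.test_var]]
        return node.leaf_dist

@dataclass
class DBNActionModel:
    """Per-action model: one tree per next-state variable, one for reward."""
    var_trees: Dict[str, TreeNode]
    reward_tree: TreeNode

    def next_value_dist(self, var: str, state: State) -> Dict[int, float]:
        return self.var_trees[var].query(state)
```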

Page 4: Automatic Induction of MAXQ Hierarchies

Goal: Learn Task Hierarchies

Avoid the significant manual engineering of task decomposition, which requires a deep understanding of the purpose and function of subroutines, as in computer science.

Frameworks for learning exit-option hierarchies:
- HEXQ: determines exit states through random exploration
- VISA: determines exit states by analyzing DBN action models

Page 5: Automatic Induction of MAXQ Hierarchies

Focused Creation of Subtasks

HEXQ and VISA create a separate subtask for each possible exit state, which can generate a large number of subtasks.

Claim: defining good subtasks requires maximizing state abstraction while identifying "useful" subgoals.

Our approach: selectively define subtasks with single abstract exit states.

Page 6: Automatic Induction of MAXQ Hierarchies

Transfer Learning Scenario

Working hypotheses:
- MAXQ value-function learning is much quicker than non-hierarchical (flat) Q-learning.
- Hierarchical structure is more amenable to transfer from source tasks to the target than value functions are.

Transfer scenario:
1. Solve a "source problem" (no CPU time limit).
2. Learn DBN models.
3. Learn the MAXQ hierarchy.
4. Solve a "target problem" under the assumption that the same hierarchical structure applies (we will relax this constraint in future work).

Page 7: Automatic Induction of MAXQ Hierarchies

MaxNode State Abstraction

[Figure: a two-slice DBN with nodes X_t, Y_t, and A_t at time t and X_{t+1}, Y_{t+1}, and R_{t+1} at time t+1; Y influences the dynamics but not the reward.]

Y is irrelevant within this action: it affects the dynamics but not the reward function.

In HEXQ, VISA, and our work, we assume there is only one terminal abstract state, hence no pseudo-reward is needed.

As a side effect, this enables "funnel" abstractions in parent tasks.
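A small sketch of the irrelevance test this implies, assuming a hypothetical `parents` accessor that returns the time-t parents of a time-(t+1) node in the action's DBN: a variable can be abstracted away if the reward node is unreachable from it, even transitively through the dynamics.

```python
# Sketch: Y is irrelevant to the action's value if it cannot influence the
# reward node R, directly or through the dynamics. `parents` is an assumed
# accessor over the DBN, not an API from the paper.
def value_irrelevant(y: str, parents) -> bool:
    relevant = set(parents("R"))      # variables the reward tree tests
    frontier = list(relevant)
    while frontier:                   # close under the one-step dynamics
        v = frontier.pop()
        for p in parents(v):
            if p not in relevant:
                relevant.add(p)
                frontier.append(p)
    return y not in relevant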

Page 8: Automatic Induction of MAXQ Hierarchies

Our Approach: AI-MAXQ

1. Learn DBN action models via random exploration (other work).
2. Apply Q-learning to solve the source problem.
3. Generate a good trajectory from the learned Q function.
4. Analyze the trajectory to produce a causally annotated trajectory (CAT) (this talk).
5. Analyze the CAT to define the MAXQ hierarchy (this talk).
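Step 3 is just a greedy rollout of the learned Q function; a minimal sketch with hypothetical `env` and `Q` interfaces:

```python
# Sketch of demonstration extraction: act greedily with respect to a
# learned tabular Q function. `env` and `Q` are assumed interfaces,
# not the authors' code.
def greedy_trajectory(env, Q, max_steps=10_000):
    state = env.reset()
    trajectory = []
    for _ in range(max_steps):
        # pick the action with the highest learned Q value in this state
        action = max(env.actions(state), key=lambda a: Q[state, a])
        trajectory.append((state, action))
        state, done = env.step(action)
        if done:
            break
    return trajectory
```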

Page 9: Automatic Induction of MAXQ Hierarchies

Wargus Resource-Gathering Domain

[Figure: screenshot of the Wargus resource-gathering domain.]

Page 10: Automatic Induction of MAXQ Hierarchies

Causally Annotated Trajectory (CAT)

[Figure: the CAT for the Wargus trajectory Start → Goto → MG → Goto → Dep → Goto → CW → Goto → Dep → End, with arcs labeled by the linking variables a.l, a.r, a.*, reg.*, req.gold, and req.wood.]

A variable v is relevant to an action if the DBN for that action tests or changes that variable (this includes both the variable nodes and the reward nodes).

Create an arc from action A to action B labeled with variable v iff v is relevant to both A and B but not to any intermediate action.
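Under these two definitions, CAT construction reduces to linking consecutive occurrences of each variable's relevant actions. A sketch, assuming each action model exposes hypothetical `tested_vars` and `changed_vars` sets, with Start and End treated as actions relevant to every variable:

```python
# Sketch of CAT construction from a trajectory and DBN action models.
# `tested_vars`/`changed_vars` are assumed attributes of each model.
from typing import Dict, List, Set, Tuple

def is_relevant(model, v: str) -> bool:
    """v is relevant to an action if its DBN tests or changes v."""
    return v in model.tested_vars or v in model.changed_vars

def build_cat(actions: List[str], models: Dict[str, object],
              variables: Set[str]) -> List[Tuple[int, int, str]]:
    arcs = []
    for v in variables:
        # trajectory positions whose action is relevant to v, in order
        hits = [i for i, a in enumerate(actions) if is_relevant(models[a], v)]
        # an arc joins consecutive hits: no intermediate action touches v
        arcs.extend((i, j, v) for i, j in zip(hits, hits[1:]))
    return arcs
```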

Page 11: Automatic Induction of MAXQ Hierarchies

CAT Scan

[Figure: the Wargus CAT from the previous slide.]

An action is absorbed regressively into a trajectory segment as long as:
- It does not have an effect beyond the trajectory segment, preventing exogenous effects.
- It does not increase the state abstraction.
(See the sketch of this test below.)
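A sketch of that absorption test, with hypothetical argument names: `segment` holds the trajectory indices already grouped into the candidate subtask, and `arcs` are the CAT arcs built above.

```python
# Sketch of the regressive absorption test used during the CAT scan.
def can_absorb(idx: int, segment: set, arcs, action_vars: set,
               abstraction: set) -> bool:
    # 1. No effect beyond the segment: an outgoing arc that leaves the
    #    segment would make the absorbed action's effect exogenous.
    if any(i == idx and j not in segment for i, j, _ in arcs):
        return False
    # 2. No growth in the state abstraction: the action's relevant
    #    variables must already be covered by the subtask's abstraction.
    return action_vars <= abstraction
```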

Page 12: Automatic Induction of MAXQ Hierarchies

CAT Scan

[Figure: the Wargus CAT, with the scan beginning to group actions into segments.]

Page 13: Automatic Induction of MAXQ Hierarchies

CAT Scan

[Figure: the Wargus CAT, with the whole trajectory grouped under a Root task.]

Page 14: Automatic Induction of MAXQ Hierarchies

CAT Scan

[Figure: the Wargus CAT, segmented under Root into Harvest Gold and Harvest Wood subtasks.]

Page 15: Automatic Induction of MAXQ Hierarchies

Induced Wargus Hierarchy

[Figure: the induced task hierarchy, reconstructed here as an indented tree.]

Root
  Harvest Gold
    Get Gold
      GGoto(goldmine)
      Mine Gold
    Put Gold
      GGoto(townhall)
      GDeposit
  Harvest Wood
    Get Wood
      WGoto(forest)
      Chop Wood
    Put Wood
      WGoto(townhall)
      WDeposit

All four Goto variants invoke the shared Goto(loc) subtask.

Page 16: Automatic Induction of MAXQ Hierarchies

Induced Abstraction & Termination

Task Name       | State Abstraction                         | Termination Condition
----------------|-------------------------------------------|----------------------
Root            | req.gold, req.wood                        | req.gold = 1 && req.wood = 1
Harvest Gold    | req.gold, agent.resource, region.townhall | req.gold = 1
Get Gold        | agent.resource, region.goldmine           | agent.resource = gold
Put Gold        | req.gold, agent.resource, region.townhall | agent.resource = 0
GGoto(goldmine) | agent.x, agent.y                          | agent.resource = 0 && region.goldmine = 1
GGoto(townhall) | agent.x, agent.y                          | req.gold = 0 && agent.resource = gold && region.townhall = 1
Harvest Wood    | req.wood, agent.resource, region.townhall | req.wood = 1
Get Wood        | agent.resource, region.forest             | agent.resource = wood
Put Wood        | req.wood, agent.resource, region.townhall | agent.resource = 0
WGoto(forest)   | agent.x, agent.y                          | agent.resource = 0 && region.forest = 1
WGoto(townhall) | agent.x, agent.y                          | req.wood = 0 && agent.resource = wood && region.townhall = 1
Mine Gold       | agent.resource, region.goldmine           | NA
Chop Wood       | agent.resource, region.forest             | NA
GDeposit        | req.gold, agent.resource, region.townhall | NA
WDeposit        | req.wood, agent.resource, region.townhall | NA
Goto(loc)       | agent.x, agent.y                          | NA

Note that because each subtask has a unique terminal state, Result Distribution Irrelevance applies.
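Purely as an illustration, the induced tasks could be encoded as follows; the `Task` container is a hypothetical convenience class, and only two rows of the table are spelled out.

```python
# Illustrative encoding of induced tasks (abstractions and terminations
# taken from the table above; the Task class itself is an assumption).
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional

State = Dict[str, object]

@dataclass
class Task:
    name: str
    abstraction: List[str]                           # visible variables
    termination: Optional[Callable[[State], bool]]   # None for primitives
    children: List["Task"] = field(default_factory=list)

get_gold = Task("Get Gold",
                ["agent.resource", "region.goldmine"],
                lambda s: s["agent.resource"] == "gold")
root = Task("Root",
            ["req.gold", "req.wood"],
            lambda s: s["req.gold"] == 1 and s["req.wood"] == 1,
            children=[get_gold])   # plus the remaining subtasks
```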

Page 17: Automatic Induction of MAXQ Hierarchies

Claims

The resulting hierarchy is unique: it does not depend on the order in which goals and trajectory sequences are analyzed.

All state abstractions are safe:
- There exists a hierarchical policy within the induced hierarchy that will reproduce the observed trajectory.
- This extends MAXQ Node Irrelevance to the induced structure.

The learned hierarchical structure is "locally optimal": no local change in the trajectory segmentation can improve the state abstractions (a very weak guarantee).

Page 18: Automatic Induction of MAXQ Hierarchies

Experimental Setup

1. Randomly generate pairs of source-target resource-gathering maps in Wargus.
2. Learn the optimal policy in source.
3. Induce the task hierarchy from a single (near-)optimal trajectory.
4. Transfer this hierarchical structure to the MAXQ value-function learner for target.
5. Compare to direct Q-learning and to MAXQ learning on a manually engineered hierarchy within target.

Page 19: Automatic Induction of MAXQ Hierarchies

Hand-Built Wargus Hierarchy

[Figure: the hand-built hierarchy, reconstructed here as an indented tree.]

Root
  Get Gold
    Goto(loc)
    Mine Gold
  Get Wood
    Goto(loc)
    Chop Wood
  GWDeposit
    Goto(loc)
    Deposit

Page 20: Automatic Induction of MAXQ Hierarchies

Hand-Built Abstractions & Terminations

Task Name    | State Abstraction                                   | Termination Condition
-------------|-----------------------------------------------------|----------------------
Root         | req.gold, req.wood, agent.resource                  | req.gold = 1 && req.wood = 1
Harvest Gold | agent.resource, region.goldmine                     | agent.resource ≠ 0
Harvest Wood | agent.resource, region.forest                       | agent.resource ≠ 0
GWDeposit    | req.gold, req.wood, agent.resource, region.townhall | agent.resource = 0
Mine Gold    | region.goldmine                                     | NA
Chop Wood    | region.forest                                       | NA
Deposit      | req.gold, req.wood, agent.resource, region.townhall | NA
Goto(loc)    | agent.x, agent.y                                    | NA

Page 21: Automatic Induction of MAXQ Hierarchies

Results: WargusResults: WargusWargus domain: 7 reps

-1000

0

1000

2000

3000

4000

5000

6000

7000

8000

0 10 20 30 40 50 60 70 80 90 100Episode

To

tal

Du

rati

on

Induced (MAXQ)

Hand-engineered (MAXQ)

No transfer (Q)

Page 22: Automatic Induction of MAXQ Hierarchies

Need for Demonstrations

VISA only uses DBNs for causal information, which is globally applicable across the state space without focusing on the pertinent subspace.

Problems:
- Global variable coupling might prevent concise abstraction.
- Exit states can grow exponentially: one for each path in the decision-tree encoding.

A modified bitflip domain exposes these shortcomings.

Page 23: Automatic Induction of MAXQ Hierarchies

Modified Bitflip Domain

State space: bits b_0, ..., b_{n-1}.

Action space:
- Flip(i), 0 ≤ i < n-1:
  If b_0 ∧ ... ∧ b_{i-1} = 1 then b_i ← ~b_i;
  else b_0 ← 0, ..., b_i ← 0.
- Flip(n-1):
  If parity(b_0, ..., b_{n-2}) holds and b_{n-2} = 1, then b_{n-1} ← ~b_{n-1};
  else b_0 ← 0, ..., b_{n-1} ← 0.
  The required parity is even if n-1 is even, odd otherwise.

Reward: -1 for all actions.
Terminal/goal state: b_0 ∧ ... ∧ b_{n-1} = 1.
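A runnable sketch of the domain as specified above, assuming (as the induced hierarchies suggest) that Flip(0) is always permitted because its guard conjunction is empty:

```python
# Minimal implementation of the modified bitflip domain described above.
class Bitflip:
    def __init__(self, n: int):
        self.n = n
        self.bits = [0] * n

    def parity_ok(self) -> bool:
        # Required parity of b_0..b_{n-2}: even if n-1 is even, odd otherwise.
        ones = sum(self.bits[: self.n - 1])
        return ones % 2 == (0 if (self.n - 1) % 2 == 0 else 1)

    def step(self, i: int) -> int:
        """Apply Flip(i) and return the reward (-1 for every action)."""
        n = self.n
        if i < n - 1:
            if all(self.bits[:i]):        # guard: b_0 .. b_{i-1} all set
                self.bits[i] ^= 1
            else:                         # failed guard resets b_0 .. b_i
                self.bits[: i + 1] = [0] * (i + 1)
        else:                             # Flip(n-1): parity guard
            if self.parity_ok() and self.bits[n - 2] == 1:
                self.bits[n - 1] ^= 1
            else:
                self.bits = [0] * n
        return -1

    def is_goal(self) -> bool:
        return all(self.bits)
```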

Page 24: Automatic Induction of MAXQ Hierarchies

Modified Bitflip Domain

Example transitions (n = 7):

1 1 1 0 0 0 0  --Flip(3)-->  1 1 1 1 0 0 0   (guard b_0 b_1 b_2 = 1 holds, so b_3 is set)
1 1 1 1 0 0 0  --Flip(1)-->  1 0 1 1 0 0 0   (guard b_0 = 1 holds, so b_1 is toggled)
1 0 1 1 0 0 0  --Flip(4)-->  0 0 0 0 0 0 0   (guard b_0 ... b_3 fails, so b_0 ... b_4 reset)

Page 25: Automatic Induction of MAXQ Hierarchies

VISA’s Causal GraphVISA’s Causal Graph

Variables grouped into two strongly connected Variables grouped into two strongly connected components (dashed ellipses)components (dashed ellipses)

Both components affect the reward nodeBoth components affect the reward node

b0 b1

Flip(1) Flip(2)bn-2 bn-1

Flip(n-1)b2

Flip(2)

R

Flip(n-1)

Flip(n-1)Flip(3)

Flip(n-1)Flip(3)Flip(2)

Flip(n-2)

Flip(n-2)

Flip(n-2)

Page 26: Automatic Induction of MAXQ Hierarchies

VISA Task Hierarchy

[Figure: VISA's task hierarchy. Root invokes Flip(n-1) and exit options over Flip(0), Flip(1), ..., conditioned on Parity(b_0, ..., b_{n-2}) and b_{n-2} = 1, yielding 2^{n-3} exit options.]

Page 27: Automatic Induction of MAXQ Hierarchies

Bitflip CAT

[Figure: the CAT for the bitflip trajectory Start → Flip(0) → Flip(1) → ... → Flip(n-2) → Flip(n-1) → End, with arcs labeled by the variables b_0, b_1, ..., b_{n-2}, b_{n-1}, and b_0,...,b_{n-1} linking the actions.]

Page 28: Automatic Induction of MAXQ Hierarchies

Induced MAXQ Task Hierarchy

[Figure: the induced hierarchy. Root invokes Flip(n-1) and a subtask with goal b_0 ... b_{n-2} = 1; that subtask invokes Flip(n-2) and a subtask with goal b_0 ... b_{n-3} = 1, and so on, down through a subtask with goal b_0 b_1 = 1 that invokes Flip(1) and Flip(0).]

Page 29: Automatic Induction of MAXQ Hierarchies

Results: Bitflip

[Figure: Bitflip domain, 7 bits, 20 reps. Total duration (y-axis) against episode number 0-100 (x-axis) for Q-learning, MAXQ, and VISA.]

Page 30: Automatic Induction of MAXQ Hierarchies

Conclusion

Causality analysis is the key to our approach: it enables us to find concise subtask definitions from a demonstration, and the CAT scan is easy to perform.

We need to extend the method to learn from multiple demonstrations (e.g., disjunctive goals).