1 hierarchical lookahead agents: a preliminary report bhaskara marthi stuart russell jason wolfe

1

Hierarchical Lookahead Agents:A Preliminary Report

Bhaskara Marthi

Stuart Russell

Jason Wolfe

2

High-level actions

• For our purposes, a high-level action (HLA)• Has a set of possible immediate refinements into sequences of

actions, primitive or high-level• Each refinement may have a precondition on its use

• Many of the actions we do are high-level• Go to work• Write a NIPS paper

• This definition generalizes others we’ve seen• Options don’t provide a choice of refinements• MAXQ tasks can’t represent sequencing constraints

3

Abstract Lookahead

• k-step lookahead >> 1-step lookahead• e.g., chess

• k-step lookahead no use if steps are too small• e.g., first k characters of a NIPS paper • this is one small part of a human life, =

~20,000,000,000,000 primitive actions

• Abstract plans with HLAs are shorter• Much shorter plans for given goals => exponential savings• Can look ahead much further into future

• Requires a model (i.e., transition and reward fns) for HLAs • No suitable models to be found in HRL or HTN literature• We proposed angelic semantics [MRW 07], which we extend here

Kc3

QuickTime™ and aTIFF (Uncompressed) decompressor

are needed to see this picture.


are needed to see this picture.QuickTime™ and a

TIFF (Uncompressed) decompressorare needed to see this picture.







Rxa1

Qa4 c2 c2 Kxa1

‘a’ ‘b’

‘a’ ‘b’ ‘a’ ‘b’

a b

aa ab ba bb

4

s0

t

S

s0

t

S

s0

t

S

h2h1 h1h2

h2

h1h2 is a solution

Angelic semantics for HLAs • Models HLAs in deterministic domains without rewards• Central idea is reachable set of an HLA from some state

• When extended to sequences of actions, allows proving that a plan can or cannot possibly reach the goal

• May seem related to nondeterminism• But the nondeterminism is angelic: the “uncertainty” will be resolved by

the agent, not an adversary or nature

a4 a1 a3 a2

Goal

State space

5

Contributions

• Extended angelic semantics to account for rewards • Developed novel algorithms that do lookahead

• Hierarchical Forward Search• An optimal, offline algorithm• Can prove optimality of plans entirely at high level

• Hierarchical Real-Time Search• An online algorithm• Requires only bounded computation per time step• Significantly outperforms previous ``flat’’ lookahead algorithms

• Both algorithms require three inputs:• a deterministic MDP model• an action hierarchy (HLAs)• models for HLAs (based on angelic semantics)

6

S

Deterministic MDPs• A deterministic MDP consists of:

• State space S

• Initial state s0 2 S, (w.l.o.g.) single terminal state t S

• Primitive action set A• Transition function f : S x A ! S• Reward function r : S x A ! R

• Undiscounted, assume no non-negative-reward cycles

Transition & reward functions for action a1

S

s0

t

-5

3

-8-1

-4

s0

t

7

Example – Warehouse World

• Has similarities to blocks and taxi domains, but more choices and constraints• Gripper must stay in bounds• Can’t pass through blocks• Can only turn around at top row

• All actions have reward -1• Goal: have C on T4

• Can’t just move directly• Final plan has 22 steps

T1 T2 T3 T4

A B

C

T1 T2 T3 T4

A B

C

T1 T2 T3 T4

A B

C

T1 T2 T3 T4

A B

C

T1 T2 T3 T4

A B

C

T1 T2 T3 T4

A B

C

T1 T2 T3 T4

A B

C

T1 T2 T3 T4

A B

C

T1 T2 T3 T4

A B

C

T1 T2 T3 T4

A B

C

T1 T2 T3 T4

A B

C

T1 T2 T3 T4

A B

C

T1 T2 T3 T4

A B

C

T1 T2 T3 T4

A B

C

T1 T2 T3 T4

A B

C

T1 T2 T3 T4

A B

C

T1 T2 T3 T4

A B

C

T1 T2 T3 T4

A B C

T1 T2 T3 T4

A B C

Left, Down, Pickup,Up, Turn, Down, Putdown, Right, Right, Down, Pickup, Left, Put, Up, Left, Pickup, Up, Turn, Right, Down, Down, Putdown

T1 T2 T3 T4

A B

C

T1 T2 T3 T4

A B

C

T1 T2 T3 T4

A B

C

T1 T2 T3 T4

A B

C

8

Some HLAs for warehouse world

[Move(B,C) Act]

• Each HLA has a set of immediate refinements into sequences of actions

• All plans of interest are refinements of the special HLA Act

• HLAs define a context free grammar over primitive plans (Act is the start symbol)

[Act]

…

[Get(B) PutOn(C)]

[GetL(B)] [GetR(B)]

…

…

… …

9

Abstract Lookahead Trees• ALTs are primary data structures

• Represent a set of potential plans• Edges are (possibly abstract) actions

• Basic operation: refine a plan (replace with all refs at some HLA)• Nodes represent potential situations

(pairs of upper & lower valuations)

ActMove(C,A) Act

Act

Move(A,C)

Move(B,C) Act

Act

Move(C,B)

Get(C

)

PutOn(A)Act

Move(C,B)

5

2

3

00

1

2

22

34

13

11

32

10

S

s0

t

37

---

--

-2

--

--9

- 4

37

---

--

-5

--

--9

- 2

Modeling HLAs

• An HLA is fully characterized by primitive MDP + hierarchy• But without abstraction, we lose benefits of hierarchy

• Extension of idea from “Angelic Semantics for HLAs” [MRW ‘07]:• Valuation of HLA h from state s:

• For each s’, best reward of any prim. ref. of h that takes s to s’

• Exact description of h = valuation of h from each s• But this description has no compact, efficient rep. in general

S

s0

t

h 37

---

--

--

- 2

--

- -

5

-1

1

4

27

3

11

S

s0

t

S

s0

t

Upper and lower valuations

• Instead, use approximate valuations • Upper valuations never underestimate best achievable reward• Lower valuations never overestimate best achievable reward• We choose a simple form: reachable set + reward bound on set

Exact

- - -

- - -- --3

7 5 -2 4 6

8Upper

1Lower

12

Representing descriptions: NCSTRIPS

• Assume S = set of truth assignments to propositions• Descriptions specify propositions (possibly) added/deleted by HLA

• Also include a bound on rewards

• An efficient algorithm exists to progress valuations (represented as DNF formulae + numeric bound) through descriptions

Navigate(xt,yt) (Pre: At(xs,ys))

Upper: -At(xs,ys), +At(xt,yt), -FacingRight,

+FacingRight, +R(-|xs-xt| - |ys-yt|)

Lower: IF (Free(xt,yt) x Free(x,ymax)):

-At(xs,ys), +At(xt,yt), -FacingRight,

+FacingRight, +R(-|xs-xt| -ys-yt+1) ELSE: nil

~~

st

~~

st

st

x

13

Back to ALT nodes

• Initially, we know we are at s0 with reward 0

• Each ALT node is created from parent & descriptions of action from parent• Valuations give provable bounds on reachable states & rewards

• Lower valuation after Act usually vacuous, upper valuation gives admissible heuristic to t

S

s0

t

S

s0

t

-5

-3Move(A,C) Act

S

s0

t

-9

0

0

14

S

s0

t

S

s0

t

S

s0

t

PruningPrune whenever a node’s lower valuationdominates another node’s upper valuation

-5

0-9-7

-5-7

15

Hierarchical Forward Search

• Construct the initial ALT• Loop

• Select the plan with highest upper reward• If this plan is primitive, return it• Otherwise, refine one of its HLAs

• Related to A*

16

Intuitive Picture

s0 tAct

highest-level

primitive

17

Intuitive Picture

s0 tAct

highest-level

primitive

18

Intuitive Picture

s0 tAct

highest-level

primitive

19

Intuitive Picture

s0 tAct

highest-level

primitive

20

Intuitive Picture

tAct

s0

highest-level

primitive

21

Intuitive Picture

tAct

s0

highest-level

primitive

22

Intuitive Picture

ts0

highest-level

primitive

23

Intuitive Picture

ts0

highest-level

primitive

24

Analysis

• Optimality follows from guarantees offered by upper descriptions

• Accurate upper descriptions --> more directed search • Accurate lower descriptions --> more pruning

25

An online version?

• Hierarchical Forward Search plans all the way to t before starting acting• Can’t act in real time• Not really useful for RL• In stochastic case, must plan for all contingencies

• How can we choose a good first action with bounded computation?

ts0

26

Hierarchical Real-Time Search

• Loop until at t• Run Hierarchical Forward Search for (at most) k steps• Choose a plan that

• begins with a primitive action • has total upper reward ≥ that of the best unrefined plan• has minimal upper reward up to but not including Act

• Do the first action of this plan in the world• Do a Bellman backup for the current state (prevents looping)

• Related to LRTA*

27

Preliminary Results

Instance 1 (small) Instance 2 (larger)

-4000

-3000

-2000

-1000

0

0 1000 2000 3000 4000 5000 6000 7000 8000

# ALT refinements per step

Total reward

FlatHierarchical

-40

-30

-20

-10

0

0 100 200 300 400 500 600 700 800

# ALT refinements per step

Total reward

FlatHierarchical

28

Discussion

• We see a definite benefit to high-level lookahead• Still some subtleties still to be worked out

• Each HLA description has different biases, making direct comparisons between plans difficult

• Current scheme is overly optimistic

• Other directions for future work:• Extensions to probabilitic/partially obs. environments• Better meta-level control• Inferring/learning HLA models• Integration into RL algorithms• Abstract-level DP• …

Questions?

1 hierarchical lookahead agents: a preliminary report bhaskara marthi stuart russell jason wolfe

Documents

hlas models hlas

state space s initial

s deterministic mdps

step lookahead

highlevel actions

s reward function r

sequences of actions

putdown slide