CS-424 Gregory Dudek Today’s Lecture • Reinforcement learning: further thoughts. • Planning


Page 1: CS-424 Gregory Dudek Today’s Lecture Reinforcement learning: further thoughts. Planning

CS-424 Gregory Dudek

Today’s Lecture

• Reinforcement learning: further thoughts.

• Planning

Page 2:

Transition networks
• How to determine strategies (a policy) in a problem defined by a transition network. It was:
– Deterministic or stochastic
– Markovian (exhibited the Markov property).
– Fully observable (RN: accessible): we can directly observe (determine) exactly what state we are in during the update process.

• Computing the optimal policy is a Markov Decision Problem (MDP).

If we don’t know the current state for sure, but can only infer it (probabilistically), then we have a Partially Observable system.

Partially Observable Markov Decision Problem (POMDP).

– How hard is it to compute the optimal policy?

Page 3:

Specific details on reinforcement
Simplest model:

Given we know all transition probabilities, and the immediate (short-term) reward R(i) associated with each state i.

We can compute the value function U() by solving a linear system

U(i) = R(i) + Σ_j M(i,j) U(j)

This approach is referred to as adaptive dynamic programming.

In contrast,

• Sampling and TD methods update this system intermittently based on partial information.

(Note we have omitted the less-effective LMS algorithm in the textbook.)
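The linear system above can be solved directly. A minimal sketch in Python/NumPy, using a made-up 3-state chain (the transition matrix M and rewards R are illustrative values, not from the lecture):

```python
import numpy as np

# Hypothetical chain: state 0 -> 1 -> 2, with state 2 terminal.
M = np.array([[0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0],
              [0.0, 0.0, 0.0]])   # terminal state: no outgoing transitions
R = np.array([-1.0, -1.0, 10.0])  # immediate reward R(i) for each state i

# U(i) = R(i) + sum_j M(i,j) U(j)   <=>   (I - M) U = R
U = np.linalg.solve(np.eye(3) - M, R)
print(U)  # [ 8.  9. 10.]
```

With a terminal state (a zero row in M), I - M is invertible; for a general stochastic M a discount factor would be needed to keep the system solvable.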

Page 4:

Types of learners
• 2 classes w.r.t. reinforcement learning:

– Passive learners: you just update the state transition/reward info for the states you are taken to, but do not control the sequence of states visited.

• A backgammon learner that merely observes another part of the system playing. A kid watching its parents.

– Active learners: the learner actively modifies the sequence of states visited in order to (presumably) acquire information.

Page 5:

Exploration versus Exploitation
• Fundamental tradeoff.
• We want to maximize return:
– Should we do what we know is best, based on incomplete information?
– Or should we seek information about unknown things, even though this may not lead to rewards?
• Plenty of intuitive relevance.
• How do we combine these two processes?
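One standard (not lecture-specific) way to combine the two is an epsilon-greedy rule: exploit the best-known action most of the time, explore a random one with small probability. A sketch, with invented value estimates:

```python
import random

def epsilon_greedy(q_values, epsilon=0.1):
    """With probability epsilon pick a random action (explore);
    otherwise pick the action with the highest estimate (exploit)."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

# With epsilon = 0 we always exploit: index of the largest estimate.
print(epsilon_greedy([1.0, 5.0, 2.0], epsilon=0.0))  # 1
```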

Page 6:

Planning: general approach
• Use a (restrictive) formal language to describe problems and goals.
– Why restrictive? More precision and fewer states to search.

• Have a goal state specification and an initial state.

• Use a special-purpose planner to search for a solution.

Page 7:

Basic formalism
• Basic logical formalism derived from STRIPS.
• State variables determine what actions can or should be taken: in this context they are conditions
– Shoe_untied()
– Door_open(MC)
• An operator (remember those?) is now a triple:
– Preconditions
– Additions
– Deletions
Additions and deletions together are called the effects of an operator. Seen in the context of search.
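The triple can be represented directly as data. A sketch in Python; the predicate strings are the slide's examples, everything else is illustrative:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Operator:
    """A STRIPS-style operator: (preconditions, additions, deletions)."""
    name: str
    preconditions: frozenset
    additions: frozenset
    deletions: frozenset

    def applicable(self, state: frozenset) -> bool:
        # The operator can fire only if all preconditions hold in the state.
        return self.preconditions <= state

tie_shoes = Operator(
    name="Tie_shoes",
    preconditions=frozenset({"Shoe_untied()"}),
    additions=frozenset({"Shoe_tied()"}),
    deletions=frozenset({"Shoe_untied()"}),
)

state = frozenset({"Shoe_untied()", "Door_open(MC)"})
print(tie_shoes.applicable(state))  # True
```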

Page 8:

A plan is
• 4 components:

– A set of steps defined by a sequence of operator applications

– A set of constraints on the ordering of these steps. (Not necessarily a total ordering.)

– A set of variable binding constraints: set of things various operators can apply to.

– Set of causal links that specify what effects one action achieves that are needed by another.

Page 9:

Going forwards
• All state variables are true or false, but some may not be defined at a certain point in our state progression.
• A planner based on this is a progression planner.

Idea: In a state S, we can apply operator X = (P, A, D), leading to a new state T:

T = f_X(S) = (S - D) ∪ A
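With Python sets the progression update is one line; a sketch (state and operator contents are illustrative):

```python
# Progression: apply operator X = (P, A, D) in state S, giving T = (S - D) | A.
def progress(state, pre, add, delete):
    assert pre <= state, "preconditions not satisfied"
    return (state - delete) | add

S = {"Shoe_untied()", "At(Home)"}                            # illustrative state
X = ({"Shoe_untied()"}, {"Shoe_tied()"}, {"Shoe_untied()"})  # (P, A, D)
T = progress(S, *X)
print(sorted(T))  # ['At(Home)', 'Shoe_tied()']
```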

Page 10:

Constancy
• Important caveat.
• When we go from one state to another, we assume that the only changes were those that resulted explicitly from the Additions and Deletions.

Given this assumption, the operator X computes the strongest provable postconditions.

In reality, even more might be deleted.

Page 11:

Aside: FOL with time

• One approach is a variation of first-order logic called situation calculus [McCarthy].
– Events take place at specific times.
– Some predicates are fluents and only apply for certain ranges in time.
– A situation is a temporal interval over which all the predicates remain fixed.
– Reference: read RN Sec 7.6 or DAA Ch. 6.

Page 12:

Going backwards
• Remember backwards chaining?
• Start at the goal G.
• Assume the deletions aren't there for some operator X.
– Why?
• Can chain backwards by adding what would have been deleted and removing what would have been added:

S = f_X^(-1)(G) = (G - A) ∪ D

Maybe we added too much (with D), or deleted too little?
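The inverse step can be sketched the same way as progression, with the same illustrative operator:

```python
# Regression: chain backwards through X = (P, A, D) from goal G,
# giving S = (G - A) | D.
def regress(goal, pre, add, delete):
    return (goal - add) | delete

G = {"Shoe_tied()", "At(Home)"}                              # illustrative goal
X = ({"Shoe_untied()"}, {"Shoe_tied()"}, {"Shoe_untied()"})  # (P, A, D)
S = regress(G, *X)
print(sorted(S))  # ['At(Home)', 'Shoe_untied()']
```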

Page 13:

Means/ends analysis
• How can we get from initial to final?
– Assume the states and operators are given.
– What's the right path? How do we measure distance?
• Means/ends analysis assumes we simply reduce the number of things that make our current state different from our goal.
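That difference count is easy to compute with sets; the states below reuse the STRIPS shopping example from the next slide:

```python
# Means/ends "distance": number of goal conditions not yet satisfied.
current = {"At(office)", "Have(Cash)", "Have(Uncooked-kernels)"}
goal = {"At(Home)", "Have(Video)", "Have(Cooked-Popcorn)"}

differences = goal - current   # goal conditions still missing
print(len(differences))  # 3
```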

Page 14:

STRIPS
• STRIPS is an old planning language: STanford Research Institute Problem Solver.
– Less expressive than situation calculus
– Initial state:
At(office) & NOT(Have(Video)) & Have(Cash) & Have(Uncooked-kernels)
– Goal state:
At(Home) & Have(Video) & Have(Cooked-Popcorn)

Page 15:

Schemas
• Basic operators assume a complete specification of the state in which they are applied.
• This can be tedious.
– An operator schema is a "generic" operator that has variables in it.
• Related to axiom schemas.
• Related to unification in logic (e.g. Prolog).

E.g.

Tie_shoes(h), Tie_necktie(h), Tie_boat_rope(h), Tie_straightjacket(h)

might all be abstracted by Tie_object(X,h)
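Instantiating a schema just binds its variables to constants, as in unification. A minimal sketch (the helper name is invented):

```python
# An operator schema is a template with variables; binding variables to
# constants yields a concrete operator name.
def instantiate(schema, params, bindings):
    # Substitute each parameter if a binding exists, else leave the variable.
    args = [bindings.get(p, p) for p in params]
    return f"{schema}({', '.join(args)})"

# Bind X = shoes in the schema Tie_object(X, h); h stays free.
print(instantiate("Tie_object", ["X", "h"], {"X": "shoes"}))  # Tie_object(shoes, h)
```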

Page 16:

Least Commitment Planning
• When we formulate a plan intuitively, we often think of doing things in a specific sequence even when the sequencing is arbitrary.
– This may not be wise.
• This can lead to re-shuffling actions... which is undesirable.

Generate plans such that we have sets of applicable actions, but we don't order the actions unless something (a condition) demands it.
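A partially ordered plan can be kept as a set of steps plus only the required ordering constraints; any topological order of that set is then a valid execution sequence. A sketch using Python's standard graphlib (steps and constraints are invented):

```python
from graphlib import TopologicalSorter  # Python 3.9+

steps = {"A", "B", "C"}
constraints = {("A", "B")}   # only constraint: A must come before B

ts = TopologicalSorter()
for s in steps:
    ts.add(s)                # register every step
for before, after in constraints:
    ts.add(after, before)    # 'after' depends on 'before'

# One valid linearization; C may appear anywhere since it is unconstrained.
order = list(ts.static_order())
print(order)
```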

Page 17:

Partially ordered plan

[Figure: a partially ordered plan whose steps A through G are only partially ordered with respect to one another.]

Page 18:

Terminology
• Constraints on sequencing, requirements for operators, links relating operators, conflicts between operators in a given plan.

For a plan:
• Sound
– Plan steps obey constraints on sequencing.
– Successful.
• Systematic
– Doesn't "waste" effort.
• Complete
– Generates a plan if one exists.
• Still may not terminate (cf. the halting problem).
• Plan refinement
– Improvement of an existing plan to make it better meet the constraints.

Page 19:

Links & Conflicts

[Figure: two producer-consumer links; a "clobberer" step interferes with one of them.]

A conflict involves a link, & a step that messes it up.

Page 20:

Refinement
• Fix conflicts by creating a new plan from an old one.
– Keep old structures (links, producers, consumers, constraints) but add new constraints.
• If there are conflicts, resolve them by adding constraints: move a clobberer before or after the link it's hitting
– (if you can).
• If there are no conflicts, satisfy an unfulfilled requirement.

Page 21:

Applications of planning
• Planning for Shakey the robot

– Climb boxes
– Push things
– Move around

• Blocks world
– Moving blocks
– Piling them onto one another
– Clearing the tops of chosen blocks

• Really doing this suggested we need vision!

Page 22:

Configuration Space Planning

Page 23:

Issues