CS-424 Gregory Dudek
Today’s Lecture
• Reinforcement learning: further thoughts.
• Planning
Transition networks
• How do we determine strategies (a policy) in a problem defined by a transition network? It was:
– Deterministic or stochastic
– Markovian (exhibited the Markov property).
– Fully observable (RN: accessible): we can directly observe (determine) exactly what state we are in during the update process.
• Computing the optimal policy is a Markov Decision Problem (MDP).
If we don’t know the current state for sure, but can only infer it (probabilistically), then we have a partially observable system, and the problem is a Partially Observable Markov Decision Problem (POMDP).
– How hard is it to compute the optimal policy?
Specific details on reinforcement
Simplest model:
Assume we know all transition probabilities M(i,j) (from state i to state j), and the immediate (short-term) reward R(i) associated with each state i.
We can compute the value function U() by solving a linear system
U(i) = R(i) + Σ_j M(i,j) U(j)
This approach is referred to as adaptive dynamic programming.
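A minimal sketch of solving this linear system directly, written as (I - M) U = R. The 3-state chain, its rewards, and the function name solve_values are hypothetical, chosen only for illustration; state 2 is treated as terminal so that the system has a unique solution.

```python
def solve_values(R, M):
    """Solve U = R + M U, i.e. (I - M) U = R, by Gauss-Jordan elimination."""
    n = len(R)
    # Augmented matrix [I - M | R].
    A = [[(1.0 if i == j else 0.0) - M[i][j] for j in range(n)] + [R[i]]
         for i in range(n)]
    for col in range(n):
        # Partial pivoting: use the row with the largest entry in this column.
        pivot = max(range(col, n), key=lambda r: abs(A[r][col]))
        if abs(A[pivot][col]) < 1e-12:
            raise ValueError("I - M is singular; no unique value function")
        A[col], A[pivot] = A[pivot], A[col]
        # Normalize the pivot row, then clear this column in every other row.
        p = A[col][col]
        A[col] = [x / p for x in A[col]]
        for r in range(n):
            if r != col:
                f = A[r][col]
                A[r] = [x - f * y for x, y in zip(A[r], A[col])]
    return [A[i][n] for i in range(n)]


# Hypothetical chain: state 2 is terminal (no outgoing transitions).
R = [-1.0, -1.0, 10.0]
M = [
    [0.5, 0.5, 0.0],  # from state 0: stay, or move to state 1
    [0.0, 0.5, 0.5],  # from state 1: stay, or move to state 2
    [0.0, 0.0, 0.0],  # terminal state
]
U = solve_values(R, M)
```

Here U(2) = R(2) = 10, and substituting back gives U(1) = 8 and U(0) = 6.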
In contrast,
• Sampling and TD methods update this system intermittently based on partial information.
(Note we have omitted the less-effective LMS algorithm in the textbook.)
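For contrast, a minimal sketch of a temporal-difference update, which adjusts U(i) a little after each observed transition i -> j instead of solving the whole system. The learning rate alpha, the state names, and the rewards are assumptions made for this example.

```python
def td_update(U, i, j, reward_i, alpha=0.1):
    """One TD update after observing a transition from state i to state j."""
    U[i] += alpha * (reward_i + U[j] - U[i])
    return U


# Hypothetical deterministic trajectory a -> b -> goal, replayed many times.
U = {"a": 0.0, "b": 0.0, "goal": 10.0}
rewards = {"a": -1.0, "b": -1.0}
for _ in range(200):
    td_update(U, "a", "b", rewards["a"])
    td_update(U, "b", "goal", rewards["b"])
```

With these numbers the estimates approach U(b) = 9 and U(a) = 8, the same values the linear system would give for this chain.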
Types of learners
• 2 classes with respect to reinforcement learning:
– Passive learners: you just update the state transition/reward info for the states you are taken to, but do not control the sequence of states visited.
• A backgammon learner that merely observes another part of the system playing. A kid watching its parents.
– Active learners: the learner actively modifies the sequence of states visited in order to (presumably) acquire information.
Exploration versus Exploitation
• Fundamental tradeoff.
• We want to maximize return:
– Should we do what we know is best, based on incomplete information?
– Or should we seek information about unknown things, although this may not lead to rewards?
• Plenty of intuitive relevance.
• How do we combine these two processes?
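One common way to combine the two, not taken from these slides, is an epsilon-greedy rule: explore with a small probability epsilon, otherwise exploit the best current estimate. The sketch below assumes a fixed list of hypothetical value estimates, one per action.

```python
import random


def epsilon_greedy(estimates, epsilon, rng):
    """Pick an action index: explore with probability epsilon, else exploit."""
    if rng.random() < epsilon:
        return rng.randrange(len(estimates))  # explore: any action
    # exploit: the action with the highest current estimate
    return max(range(len(estimates)), key=lambda a: estimates[a])


rng = random.Random(0)
estimates = [0.2, 0.8, 0.5]  # hypothetical value estimates per action
greedy_choice = epsilon_greedy(estimates, 0.0, rng)  # epsilon = 0: pure exploitation
choices = [epsilon_greedy(estimates, 0.3, rng) for _ in range(1000)]
```

With epsilon = 0.3 most choices are still action 1, but the other actions keep being sampled occasionally.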
Planning: general approach
• Use a (restrictive) formal language to describe problems and goals.
– Why restrictive? More precision and fewer states to search.
• Have a goal state specification and an initial state.
• Use a special-purpose planner to search for a solution.
Basic formalism
• Basic logical formalism derived from STRIPS.
• State variables determine what actions can or should be taken: in this context they are conditions
– Shoe_untied()
– Door_open(MC)
• An operator (remember those? Seen in the context of search) is now a triple:
– Preconditions
– Additions
– Deletions
Additions and deletions together are called the effects of an operator.
A plan is
• 4 components
– A set of steps defined by a sequence of operator applications
– A set of constraints on the ordering of these steps. (Not necessarily a total ordering.)
– A set of variable binding constraints: set of things various operators can apply to.
– Set of causal links that specify what effects one action achieves that are needed by another.
Going forwards
• All state variables are true or false, but some may not be defined at a certain point on our state progression.
A planner based on this is a progression planner.
Idea: in a state S where its preconditions hold, we can apply operator X = (P, A, D). This leads to a new state T:
T = f_X(S) = (S - D) ∪ A
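A minimal sketch of the progression step, with a state represented as a set of true conditions. The Operator class, the Tie_shoe operator, and the Shoe_tied() condition are hypothetical names introduced for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Operator:
    name: str
    preconditions: frozenset
    additions: frozenset
    deletions: frozenset


def progress(op, state):
    """Apply op in state: T = (S - D) | A, provided the preconditions hold."""
    if not op.preconditions <= state:
        raise ValueError(f"{op.name} is not applicable in this state")
    return (state - op.deletions) | op.additions


tie = Operator(
    name="Tie_shoe",
    preconditions=frozenset({"Shoe_untied()"}),
    additions=frozenset({"Shoe_tied()"}),  # hypothetical condition
    deletions=frozenset({"Shoe_untied()"}),
)
S = frozenset({"Shoe_untied()", "Door_open(MC)"})
T = progress(tie, S)
```

Door_open(MC) carries over unchanged because it appears in neither the additions nor the deletions.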
Constancy
• Important caveat
• When we go from one state to another, we assume that the only changes were those that resulted explicitly from the Additions and Deletions.
Given this assumption, the operator X computes the strongest provable postconditions.
In reality, even more might be deleted.
Aside: FOL with time
• One approach is a variation of first-order logic called situation calculus [McCarthy].
– Events take place at specific times.
– Some predicates are fluents and only apply for certain ranges in time.
– A situation is a temporal interval over which all the predicates remain fixed.
– Reference: read RN Sec 7.6 or DAA Ch. 6.
Going backwards
• Remember backwards chaining?
• Start at the goal G.
• Assuming the deletions aren’t there for some operator X
– Why?
• Can chain backwards by adding what would have been deleted and removing what would have been added
S = f_X^-1(G) = (G - A) ∪ D
Maybe we added too much (with D), or deleted too little?
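A minimal sketch of this backward step exactly as written on the slide, including the assumption that none of the operator's deletions appear in G. The function name regress and the shoe conditions are hypothetical. Many textbook formulations add the operator's preconditions rather than its deletions; that variant is not shown here.

```python
def regress(goal, additions, deletions):
    """Slide formula: S = (G - A) | D, assuming no deletion appears in G."""
    if deletions & goal:
        raise ValueError("operator deletes part of the goal; cannot regress")
    return (goal - additions) | deletions


# Hypothetical operator Tie_shoe: adds Shoe_tied(), deletes Shoe_untied().
A = frozenset({"Shoe_tied()"})
D = frozenset({"Shoe_untied()"})
G = frozenset({"Shoe_tied()", "Door_open(MC)"})
S = regress(G, A, D)
```

For this example S contains Shoe_untied() and Door_open(MC), which is the state from which applying the operator forwards reaches G.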
Means/ends analysis
• How can we get from initial to final?
– Assume the states and operators are given.
– What’s the right path? How do we measure distance?
• Means/ends analysis assumes we simply reduce the number of things that make our current state different from our goal.
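A minimal sketch of that idea: measure distance as the number of goal conditions not yet true, and greedily pick the applicable operator whose result has the smallest distance. The operators and conditions below are hypothetical, and a real planner would need search to recover from greedy mistakes.

```python
def difference(state, goal):
    """Number of goal conditions that are not yet true in state."""
    return len(goal - state)


def best_operator(state, goal, operators):
    """Greedy step: the applicable operator that leaves the fewest differences."""
    candidates = []
    for name, pre, add, delete in operators:
        if pre <= state:  # operator is applicable
            result = (state - delete) | add
            candidates.append((difference(result, goal), name, result))
    if not candidates:
        return None
    dist, name, result = min(candidates, key=lambda c: c[0])
    return name, result, dist


state = frozenset({"a"})
goal = frozenset({"a", "b", "c"})
operators = [
    ("make_b", frozenset({"a"}), frozenset({"b"}), frozenset()),
    ("make_c_lose_a", frozenset({"a"}), frozenset({"c"}), frozenset({"a"})),
]
choice = best_operator(state, goal, operators)
```

Here make_b is chosen: it leaves one difference (c), while make_c_lose_a would leave two (a and b).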
STRIPS
• STRIPS is an old planning language
• STanford Research Institute Problem Solver.
– Less expressive than situation calculus
– Initial state:
At(office) & NOT(Have(Video)) & Have(Cash) & Have(Uncooked-kernels)
– Goal state:
At(Home) & Have(Video) & Have(Cooked-Popcorn)
Schemas
• Basic operators assume a complete specification of the state in which they are applied.
• This can be tedious
– An operator schema is a “generic” operator that has variables in it
• Related to axiom schemas
• Related to unification in logic (e.g. Prolog)
E.g.
Tie_shoes(h), Tie_necktie(h), Tie_boat_rope(h), Tie_straightjacket(h)
might all be abstracted by Tie_object(X,h)
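A minimal sketch of instantiating such a schema by substituting variable bindings. The precondition, addition, and deletion templates (Untied, Tied, Free) are hypothetical, since the slides give only the operator names, and h is treated as just another variable.

```python
def instantiate(schema, bindings):
    """Substitute variable bindings into every condition template."""
    name, preconds, adds, dels = schema

    def fill(templates):
        return frozenset(t.format(**bindings) for t in templates)

    return (name.format(**bindings), fill(preconds), fill(adds), fill(dels))


# Hypothetical schema: variables are written as {X} and {h}.
tie_object = (
    "Tie_object({X},{h})",
    ["Untied({X})", "Free({h})"],  # preconditions
    ["Tied({X})"],                 # additions
    ["Untied({X})"],               # deletions
)
op = instantiate(tie_object, {"X": "shoes", "h": "h1"})
```

The same schema with X bound to necktie or boat_rope yields the other concrete operators.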
Least Commitment Planning
• When we formulate a plan intuitively, we often think of doing things in a specific sequence even when the sequencing is arbitrary.
– This may not be wise.
• This can lead to re-shuffling actions... which is undesirable.
Generate plans such that we have sets of applicable actions, but we don’t order the actions unless there is something (conditions) that demands it.
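A minimal sketch of what leaving actions unordered buys: a plan holds only the ordering constraints that are actually required, and any total order consistent with them is an acceptable execution. The steps and the single constraint are hypothetical, and brute-force enumeration is only practical for tiny plans.

```python
from itertools import permutations


def linearizations(steps, before):
    """All total orders of steps consistent with the (a, b) 'a before b' pairs."""
    result = []
    for order in permutations(steps):
        pos = {s: k for k, s in enumerate(order)}
        if all(pos[a] < pos[b] for a, b in before):
            result.append(order)
    return result


steps = ["socks", "shoes", "shirt"]
before = [("socks", "shoes")]  # the only ordering the conditions demand
orders = linearizations(steps, before)
```

Of the six possible sequences, three respect the one constraint; the planner never has to commit to where shirt goes.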
Partially ordered plan
[Figure: a partially ordered plan over steps A, B, C, D, E, F, G]
Terminology
• Constraints on sequencing, requirements for operators, links relating operators, conflicts between operators in a given plan.
For a plan:
• Sound
– Plan steps obey constraints on sequencing
– Successful
• Systematic
– Doesn’t “waste” effort
• Complete
– Generates a plan if one exists.
• Still may not terminate (cf. Halting problem)
• Plan refinement
– Improvement of an existing plan to make it better meet the constraints
Links & Conflicts
[Figure: a link from a Producer to a Consumer, and the same link with a Clobberer step]
A conflict involves a link, & a step that messes it up.
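A minimal sketch of detecting such conflicts. It assumes a link is a (producer, condition, consumer) triple, that each step lists the conditions it deletes, and that ordering constraints are (a, b) pairs meaning a comes before b; all names are hypothetical.

```python
def precedes(a, b, ordering):
    """True if a is forced before b by the ordering pairs (transitively)."""
    frontier, seen = [a], {a}
    while frontier:
        x = frontier.pop()
        for u, v in ordering:
            if u == x and v not in seen:
                if v == b:
                    return True
                seen.add(v)
                frontier.append(v)
    return False


def conflicts(links, deletes, ordering):
    """Return (step, link) pairs where step may clobber the link's condition."""
    found = []
    for link in links:
        producer, cond, consumer = link
        for step, deleted in deletes.items():
            if step in (producer, consumer) or cond not in deleted:
                continue
            # Harmless only if forced before the producer or after the consumer.
            if precedes(step, producer, ordering) or precedes(consumer, step, ordering):
                continue
            found.append((step, link))
    return found


links = [("P", "q", "C")]  # P produces q for C
deletes = {"P": set(), "C": set(), "K": {"q"}}  # K deletes q
ordering = {("P", "C")}
threats = conflicts(links, deletes, ordering)
safe = conflicts(links, deletes, ordering | {("C", "K")})
```

K is reported as a clobberer while it is unordered with respect to the link, and stops being one once it is forced after the consumer.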
Refinement
Fix conflicts by creating a new plan from an old one.
– Keep old structures (links, producers, consumers, constraints) but add new constraints
• If there are conflicts, resolve them by adding constraints: move a clobberer before or after the link it’s hitting.
– (if you can).
• If there are no conflicts, satisfy an unfulfilled requirement.
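A minimal sketch of the conflict-resolution step: keep every existing constraint and try adding one that puts the clobberer before the producer or after the consumer, rejecting any choice that would contradict the existing ordering. The representation (ordering as a set of (a, b) pairs, links as triples) and all names are assumptions made for this example.

```python
def reachable(a, b, ordering):
    """True if a is forced before b by the ordering pairs (transitively)."""
    frontier, seen = [a], {a}
    while frontier:
        x = frontier.pop()
        for u, v in ordering:
            if u == x and v not in seen:
                if v == b:
                    return True
                seen.add(v)
                frontier.append(v)
    return False


def resolve(step, link, ordering):
    """Refined orderings that move step before or after the link, if consistent."""
    producer, _cond, consumer = link
    options = []
    for new in ((step, producer), (consumer, step)):
        # Adding 'u before v' is consistent only if v is not already before u.
        if not reachable(new[1], new[0], ordering):
            options.append(ordering | {new})
    return options


link = ("P", "q", "C")
ordering = frozenset({("P", "C"), ("P", "K")})  # K is already after P
options = resolve("K", link, ordering)
```

Because K is already constrained to follow P, it cannot be moved before the link; the only refinement offered places K after the consumer C. An empty result corresponds to the "(if you can)" case.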
Applications of planning
• Planning for Shakey the robot
– Climb boxes
– Push things
– Move around
• Blocks world
– Moving blocks
– Piling them onto one another
– Clearing the tops of chosen blocks
• Really doing this suggested we need vision!
Configuration Space Planning
Issues