A Hybridized Planner for Stochastic Domains
Mausam and Daniel S. WeldUniversity of Washington, Seattle
Piergiorgio BertoliITC-IRST, Trento
Planning under Uncertainty
(ICAPS’03 Workshop)
Qualitative (disjunctive) uncertainty: which real problem can you solve?
Quantitative (probabilistic) uncertainty: which real problem can you model?
The Quantitative View
Markov Decision Process:
- models uncertainty with probabilistic outcomes
- general decision-theoretic framework
- algorithms are slow
Do we need the full power of decision theory? Is an unconverged partial policy any good?
The Qualitative View
Conditional planning:
- models uncertainty as a logical disjunction of outcomes
- exploits classical planning techniques: FAST
- ignores probabilities: poor solutions
How bad are purely qualitative solutions? Can we improve the qualitative policies?
HybPlan: A Hybridized Planner
- combines probabilistic + disjunctive planners
- produces good solutions in intermediate times
- anytime: makes effective use of resources
- bounds termination with a quality guarantee
From the quantitative view: completes the partial probabilistic policy by using qualitative policies in some states.
From the qualitative view: improves the qualitative policies in the more important regions.
Outline
- Motivation
- Planning with Probabilistic Uncertainty (RTDP)
- Planning with Disjunctive Uncertainty (MBP)
- Hybridizing RTDP and MBP (HybPlan)
- Experiments
- Conclusions and Future Work
Markov Decision Process
<S, A, Pr, C, s0, G>
- S : a set of states
- A : a set of actions
- Pr : probabilistic transition model
- C : cost model
- s0 : start state
- G : a set of goals
Find a policy (π : S → A) that minimizes the expected cost to reach a goal, for an indefinite horizon, in a fully observable Markov decision process. The optimal cost function J* yields an optimal policy.
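The tuple above can be written down as a tiny data structure. A minimal sketch in Python; the class and field names are illustrative, not from the authors' implementation:

```python
# A minimal container for the MDP tuple <S, A, Pr, C, s0, G>.
from dataclasses import dataclass

@dataclass
class MDP:
    states: set    # S
    actions: dict  # A: maps state s -> list of applicable actions
    prob: dict     # Pr: maps (s, a) -> {successor: probability}
    cost: dict     # C: maps (s, a) -> immediate cost
    s0: object     # start state
    goals: set     # G

    def is_goal(self, s):
        return s in self.goals
```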
Example
[Figure: a grid-world MDP with start state s0 and a goal. Some actions lead to a longer path; some go in the wrong direction but the goal is still reachable; from some states all paths are dead-ends.]

Optimal State Costs
[Figure: the same grid annotated with the optimal cost J* of each state.]

Optimal Policy
[Figure: the same grid annotated with the optimal action at each state.]
Real Time Dynamic Programming
(Barto et al. '95; Bonet & Geffner '03)
- Bellman backup: create a better approximation to the cost function at state s
- Trial = simulate the greedy policy and update the visited states
- Repeat trials until the cost function converges
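The backup and trial steps above can be sketched in Python, assuming the MDP is given as plain dictionaries. All names are illustrative; real RTDP implementations also seed J with an admissible heuristic and test convergence:

```python
import random

def bellman_backup(actions, prob, cost, goals, J, s):
    """J(s) <- min_a [ C(s,a) + sum_s' Pr(s'|s,a) * J(s') ]; returns the greedy action."""
    if s in goals:
        J[s] = 0.0
        return None
    best_a, best_q = None, float("inf")
    for a in actions[s]:
        q = cost[(s, a)] + sum(p * J.get(s2, 0.0)
                               for s2, p in prob[(s, a)].items())
        if q < best_q:
            best_a, best_q = a, q
    J[s] = best_q
    return best_a

def rtdp_trial(actions, prob, cost, goals, s0, J, max_steps=1000):
    """One trial: simulate the greedy policy from s0, backing up every visited state."""
    s = s0
    for _ in range(max_steps):
        if s in goals:
            break
        a = bellman_backup(actions, prob, cost, goals, J, s)
        succ = prob[(s, a)]   # sample a successor according to Pr(.|s,a)
        s = random.choices(list(succ), weights=list(succ.values()))[0]
```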
Planning with Disjunctive Uncertainty
<S, A, T, s0, G>
- S : a set of states
- A : a set of actions
- T : disjunctive transition model
- s0 : the start state
- G : a set of goals
Find a strong-cyclic policy (π : S → A) that guarantees reaching a goal, for an indefinite horizon, in a fully observable planning problem.
Model Based Planner (Bertoli et al.)
- States, transitions, etc. are represented logically
- Uncertainty: multiple possible successor states
- Planning algorithm: iteratively removes "bad" states (bad = states that don't reach anywhere, or that reach other bad states)

MBP Policy
[Figure: MBP's strong-cyclic policy on the example grid; a sub-optimal solution.]
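The "remove bad states" iteration can be sketched as an explicit-state fixed point. This is a toy version only; MBP itself works symbolically, on BDD-represented state sets:

```python
def strong_cyclic_states(states, actions, trans, goals):
    """States from which a strong-cyclic policy exists (toy explicit-state sketch).
    trans[(s, a)] is the set of possible successors (disjunctive outcomes)."""
    good = set(states)
    while True:
        # states in `good` that can reach the goal while staying inside `good`
        reach = set(goals)
        grew = True
        while grew:
            grew = False
            for s in good - reach:
                if any(trans[(s, a)] <= good and trans[(s, a)] & reach
                       for a in actions.get(s, [])):
                    reach.add(s)
                    grew = True
        if reach == good:       # no more "bad" states to remove
            return good
        good = reach            # drop the bad states and re-check
```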
Outline
- Motivation
- Planning with Probabilistic Uncertainty (RTDP)
- Planning with Disjunctive Uncertainty (MBP)
- Hybridizing RTDP and MBP (HybPlan)
- Experiments
- Conclusions and Future Work
HybPlan Top Level Code
0. run MBP to find a solution to the goal
1. run RTDP for some time
2. compute the partial greedy policy (rtdp)
3. compute the hybridized policy (hyb):
   - hyb(s) = rtdp(s) if visited(s) > threshold
   - hyb(s) = mbp(s) otherwise
4. clean hyb by removing (a) dead-ends and (b) probability-1 cycles
5. evaluate hyb
6. save the best policy obtained so far
Repeat until (1) resources are exhausted or (2) a satisfactory policy is found.
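The loop above can be sketched as follows, with every component (the RTDP runner, greedy-policy extraction, visit counts, cleaning, evaluation) abstracted into a hypothetical callable or table; none of these names come from the authors' code:

```python
def hybplan(run_rtdp_for_a_while, greedy_policy, mbp, visited,
            clean, evaluate, threshold=5, max_rounds=100, target_cost=None):
    """Sketch of HybPlan's top-level loop (steps 1-6)."""
    best, best_cost = None, float("inf")
    for _ in range(max_rounds):
        run_rtdp_for_a_while()                      # 1. run RTDP for some time
        rtdp = greedy_policy()                      # 2. partial greedy policy
        hyb = dict(mbp)                             # 3. default to MBP's action,
        hyb.update({s: a for s, a in rtdp.items()   #    but prefer RTDP's at
                    if visited(s) > threshold})     #    well-visited states
        clean(hyb)                                  # 4. remove dead-ends, prob-1 cycles
        cost = evaluate(hyb)                        # 5. evaluate hyb
        if cost < best_cost:                        # 6. keep the best policy so far
            best, best_cost = hyb, cost
        if target_cost is not None and best_cost <= target_cost:
            break                                   # satisfactory policy found
    return best, best_cost
```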
First RTDP Trial (step 1: run RTDP for some time)
[Figure: the example grid with every state-cost estimate initialized to 0.]
Bellman Backup (step 1: run RTDP for some time)
[Figure: the first backup, at a state s with two equally likely successors under action North.]
Q1(s,N) = 1 + 0.5 × 0 + 0.5 × 0 = 1
Q1(s,S) = Q1(s,W) = Q1(s,E) = 1
J1(s) = 1
Let the greedy action be North.
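The backup on this slide, checked numerically; the 0.5/0.5 successor probabilities and zero initial estimates are the slide's values, and the successor names are placeholders:

```python
# Q1(s, N) = C(s, N) + sum over successors of Pr(s'|s,N) * J(s')
cost_north = 1.0
successors = {"s_ne": (0.5, 0.0),   # (probability, current estimate J(s'))
              "s_nw": (0.5, 0.0)}
q_north = cost_north + sum(p * j for p, j in successors.values())
assert q_north == 1.0   # Q1(s, N) = 1 + 0.5*0 + 0.5*0 = 1
```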
Simulation of Greedy Action (step 1: run RTDP for some time)
[Figure: the greedy action North is simulated and the trial moves to a sampled successor.]

Continuing First Trial
[Figure: further backups along the trial update the cost estimates of the visited states.]

Finishing First Trial
[Figure: the trial reaches the goal.]

Cost Function after First Trial
[Figure: the grid with the cost estimates (0s, 1s, and 2s) left by the first trial.]
Partial Greedy Policy (step 2: compute the greedy policy rtdp)
[Figure: greedy actions at the states visited so far.]
Construct Hybridized Policy with MBP (step 3: compute the hybridized policy hyb; threshold = 0)
[Figure: rtdp's actions at the visited states combined with mbp's actions elsewhere.]
Evaluate Hybridized Policy (steps 5 and 6: evaluate hyb and store it)
[Figure: after the first trial, evaluating hyb gives J(hyb) = 5.]
Second Trial (step 1: run RTDP for some time)
[Figure: a second trial updates further state-cost estimates.]
Partial Greedy Policy
[Figure: the partial greedy policy after the second trial.]

Absence of MBP Policy
The MBP policy doesn't exist at some states: there is no path from them to the goal.
[Figure: such dead-end states are removed when cleaning hyb.]
Third Trial (step 1: run RTDP for some time)
[Figure: a third trial updates further state-cost estimates.]
Partial Greedy Policy
[Figure: the partial greedy policy after the third trial.]

Probability-1 Cycles (step 4: clean hyb)
[Figure: the hybridized policy contains a cycle that is followed with probability 1.]
repeat
    find a state s in the cycle
    hyb(s) = mbp(s)
until the cycle is broken
[Figure: states in the cycle are switched to their MBP actions, one at a time, until the cycle is broken.]
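The repair loop can be sketched as follows. Here `trapped_states` finds the states that can never reach a goal when following hyb (i.e. states on probability-1 cycles), and the loop assumes MBP's policy is proper, so switching to it eventually breaks every cycle; all names are illustrative:

```python
def trapped_states(hyb, successors, goals):
    """States assigned by hyb that can never reach a goal under hyb.
    successors[(s, a)] = states reachable with positive probability."""
    can_reach = set(goals)
    grew = True
    while grew:
        grew = False
        for s, a in hyb.items():
            if s not in can_reach and successors[(s, a)] & can_reach:
                can_reach.add(s)
                grew = True
    return set(hyb) - can_reach

def break_cycles(hyb, mbp, successors, goals):
    """Switch one cycle state at a time to MBP's action until no cycle remains
    (assumes mbp is a proper policy, so this terminates)."""
    trapped = trapped_states(hyb, successors, goals)
    while trapped:
        s = next(s for s in trapped if hyb[s] != mbp[s])  # a still-fixable state
        hyb[s] = mbp[s]
        trapped = trapped_states(hyb, successors, goals)
    return hyb
```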
Error Bound
[Figure: after the first trial, evaluating hyb gives J(hyb) = 5.]
J*(s0) ≤ 5, since the evaluated cost of hyb is an upper bound on the optimal cost.
J*(s0) ≥ 1, since RTDP's current estimate is an admissible lower bound.
⇒ Error(hyb) = 5 - 1 = 4
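The bound is just the gap between an upper and a lower bound on J*(s0); numerically, with the example's values:

```python
# Error(hyb) = (upper bound on J*(s0)) - (lower bound on J*(s0))
upper = 5.0   # J(hyb) after the first trial: cost of an actual proper policy
lower = 1.0   # RTDP's current admissible estimate of J(s0)
error_bound = upper - lower
assert error_bound == 4.0
```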
Termination
- when a policy of the required error bound is found
- when the planning time is exhausted
- when the available memory is exhausted

Properties
- outputs a proper policy
- anytime algorithm (once MBP terminates)
- HybPlan = RTDP, if infinite resources are available
- HybPlan = MBP, if resources are extremely limited
- HybPlan is better than both, otherwise
Outline
- Motivation
- Planning with Probabilistic Uncertainty (RTDP)
- Planning with Disjunctive Uncertainty (MBP)
- Hybridizing RTDP and MBP (HybPlan)
- Experiments: anytime properties, scalability
- Conclusions and Future Work
Domains
- NASA Rover domain
- Factory domain
- Elevator domain
Anytime Properties
[Plots comparing the anytime behavior of HybPlan and RTDP.]
Scalability

Problem  Time before memory exhausts  J(rtdp)  J(mbp)  J(hyb)
Rov5     ~1100 sec                    55.36    67.04   48.16
Rov2     ~800 sec                     ∞        65.22   49.91
Mach9    ~1500 sec                    143.95   66.50   48.49
Mach6    ~300 sec                     ∞        71.56   71.56
Elev14   ~10000 sec                   ∞        46.49   44.48
Elev15   ~10000 sec                   ∞        233.07  87.46

(∞: RTDP fails to produce a proper policy before memory runs out.)
Conclusions
- First algorithm that integrates disjunctive and probabilistic planners.
- Experiments show that HybPlan:
  - is anytime
  - scales better than RTDP
  - produces better-quality solutions than MBP
  - can interleave planning and execution
Hybridized Planning: A General Notion
Hybridize other pairs of planners:
- an optimal (or close-to-optimal) planner
- a sub-optimal but fast planner
to yield a planner that produces a good-quality solution in intermediate running times.

Examples:
- POMDP: RTDP/PBVI with POND/MBP/BBSP
- Oversubscription planning: A* with greedy solutions
- Concurrent MDP: sampled RTDP with single-action RTDP