
Probabilistic Planning Background

Zongzhang Zhang

Multi-Agent Systems Labs., School of Computer Science, University of Science and Technology of China

August 13, 2009

Question

What's the relationship between Simulation 2D and POMDP?
Simulation 2D's server can be considered as a POMDP model.
Partially observable Markov decision processes (POMDPs) provide a principled mathematical framework for planning under uncertainty in both action effects and state observability.
Existing POMDP algorithms can't solve problems as complex as Simulation 2D.

Outline

1 Deterministic Planning

2 Uncertainty in State Transitions

3 Partial Observability

4 Belief MDP Value Function Structure

5 Prior Research on POMDPs
  Exact value iteration
  Point-based value iteration
  Heuristic search
  Decentralized POMDPs

6 References

Deterministic Planning

Brief introduction to Probabilistic Planning

Introduction
Probabilistic planning describes a set of techniques that an intelligent agent can use to choose actions in the face of uncertainty about its environment and the results of its actions.
Probabilistic means that the techniques model uncertainty using probability theory and select actions in order to maximize expected utility.
Planning means the techniques can reason about long-term goals that require multiple actions to accomplish, often taking advantage of feedback from the environment to reduce uncertainty and help select actions on the fly.

Deterministic planning problem

Indoor navigation task

Consider an indoor navigation task in which a robot must navigate to a goal position at the end of a hallway without colliding with the walls or other obstacles. To start with, assume that the robot has a complete map of the hallway and that it knows at all times both its own position and the position of the goal. Furthermore, assume that to a first approximation we can model the robot's position as taking on discrete values in a map grid.

Example 1
Figure 1 shows an example map for such a task. The planning problem is to choose a sequence of one-step motion actions that will cause the robot at the west side of the map to reach the goal at the east side without colliding with walls or obstacles. One such sequence is east, east, north, east, east, south. This is a classical discrete deterministic planning problem.

Figure 1: A simple indoor navigation problem

General form of deterministic planning problems
In general, this class of problem has world states drawn from a finite set S and actions available to the agent drawn from a finite set A. State changes occur in discrete steps according to a transition function T : S × A → S. The world starts in some known state s_0 ∈ S.
Under a deterministic world model it is natural to represent the agent's policy as a sequence of actions π = a_0, a_1, a_2, .... The corresponding world states the agent will reach can be calculated using the transition function: s_1 = T(s_0, a_0), s_2 = T(s_1, a_1), etc.
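To make the deterministic model concrete, here is a minimal sketch (not from the slides; the tiny three-cell map and its state names are hypothetical) showing the transition function as a lookup table and a plan as a fixed sequence of actions:

```python
# Minimal sketch of a deterministic planning problem as table lookup.
# The three-cell "hallway" and its names are hypothetical, not the map of Figure 1.
T = {
    ("west", "east"): "middle",
    ("middle", "east"): "goal",
    ("goal", "east"): "goal",   # the goal cell is absorbing
}

def execute(s0, plan):
    """Return the state sequence s_0, s_1, ... produced by a fixed action sequence."""
    states = [s0]
    for a in plan:
        states.append(T[(states[-1], a)])
    return states

print(execute("west", ["east", "east"]))  # ['west', 'middle', 'goal']
```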

Definition: finite horizon
Sometimes there is a pre-specified maximum policy length; this is called a finite horizon problem, and the maximum number of actions is called the horizon length, denoted h.

Definition: infinite horizon
Infinite-horizon problems have no pre-specified maximum sequence length, in which case we write h = ∞.

Definition: goal states
There may also be a set G ⊆ S of absorbing goal states, also called terminal states, such that the action sequence ends when a terminal state is reached, regardless of the horizon.

Definition: preference level
There are many ways to formulate the planning problem. One might wish to find any action sequence that achieves a goal state, or the shortest such action sequence, or the lowest cost sequence (if actions have different costs). More generally, any of these preferences can be captured by specifying a utility function U that maps the sequence of actions and corresponding states to a numeric score, such that larger scores indicate preferred plans.

preference level = U(h, π, s_0) = U(s_0, a_0, s_1, a_1, ..., s_{h−1}, a_{h−1})   (1)

Then optimal control means selecting a policy π that maximizes U(h, π, s_0).

Weighted sum of immediate reward values
We focus on a restricted class of problems for which the utility function can be expressed as a weighted sum of immediate reward values. Immediate rewards are specified by a function R : S × A → R and

U(h, π, s_0) := ∑_{t=0}^{h−1} γ^t R(s_t, a_t),   (2)

where γ ∈ (0, 1] is called the discount factor. Multiplying the reward at step t by γ^t serves to place more weight on the earlier rewards in the sequence. If γ < 1 we say the problem is discounted; if γ = 1 all time steps have equal weight and the problem is undiscounted. Discounting is sometimes motivated by real world effects such as interest rates, but most often it is used as a convenient way of ensuring that the reward sum converges in infinite-horizon problems.
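Equation (2) can be evaluated directly once a reward sequence is fixed. The short sketch below (with made-up per-step rewards) just accumulates γ^t R(s_t, a_t):

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**t * r_t over a finite reward sequence, as in Equation (2)."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

# Hypothetical rewards: three -1 step costs followed by a +10 goal reward.
print(discounted_return([-1, -1, -1, 10], gamma=0.95))  # about 5.72
print(discounted_return([-1, -1, -1, 10], gamma=1.0))   # undiscounted: 7
```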

Uncertainty in State Transitions

Example 2
A robot's actions frequently have unexpected outcomes. For example, a robot attempting to move using dead reckoning will often suffer from position errors due to slippage. We can add an (exaggerated) version of this error to the original navigation domain by assuming that each motion action has only a 50% chance of achieving its desired effect, and leaves the robot in the same cell the other 50% of the time. If the robot's move is blocked by a wall or other obstacle, it fails 100% of the time.

Example 2 (cont'd)
Figure 2 shows this kind of uncertainty graphically. The result of applying the east action can no longer be predicted with certainty; instead, there are two possible outcomes with associated probabilities. However, for the time being we assume that after the action is completed the robot learns with certainty which outcome actually occurred. (Perhaps it periodically receives accurate position information from radio beacons in the hallway.)

Figure 2: Navigation problem with uncertainty in state transitions

General form of Uncertainty in State Transitions
In general, this model requires a new type of transition function T : S × A → Π(S) that maps a state/action pair to a probability distribution over possible next states. T(s, a, s′) denotes the probability of transitioning from s to s′ when action a is applied. Note that when T is specified in this form the system is Markovian, meaning that its future behavior is conditionally independent of the history of states and actions given the current state s_t. For this reason the model is called a Markov decision process (MDP).
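With uncertainty, each (s, a) pair maps to a distribution over successor states rather than a single state. The sketch below uses a hypothetical two-cell encoding of the 50% slippage example and samples a successor:

```python
import random

# T[(s, a)] is a distribution over next states; probabilities sum to 1.
# The 0.5/0.5 split mirrors the slippage example on a hypothetical two-cell map.
T = {
    ("cell0", "east"): {"cell1": 0.5, "cell0": 0.5},
    ("cell1", "east"): {"cell1": 1.0},               # move blocked by a wall: always fails
}

def sample_next_state(s, a):
    """Sample s' according to T(s, a, .)."""
    dist = T[(s, a)]
    next_states, probs = zip(*dist.items())
    return random.choices(next_states, weights=probs, k=1)[0]

print(sample_next_state("cell0", "east"))  # 'cell1' or 'cell0', each with probability 0.5
```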

Graphical model for an MDP
Figure 3 shows the graphical model for an MDP. Square nodes in a graphical model represent variables that are controlled by the agent. Circular nodes represent uncontrolled dependent variables. Directed edges in the graph are used to indicate dependence relationships between variables. Note the chain structure of the MDP graphical model, which derives from the Markovian structure.

Figure 3: Graphical model for an MDP

Preference level
According to the decision-theoretic definition of rationality, in the presence of uncertainty the agent should choose the policy that maximizes expected utility

preference level = E[U(h, π, s_0)] = E[∑_{t=0}^{h−1} γ^t R(s_t, a_t)].   (3)

Partial Observability

Introduction
In the last section we rather unrealistically assumed that the navigating robot received perfectly accurate position information after every time step. In this section, we relax that assumption. In particular, suppose the robot cannot tell how far along the hallway it has traveled. Instead, it has a noisy obstacle sensor that nominally returns an obstacle reading when the map cell to the east of the robot is blocked and a clear reading otherwise, but gives an incorrect reading 10% of the time.

Example 3
As Figure 4 shows, now we have the additional complication of the noisy observation. Taken together there are four possible results, labeled condition A (success, obstacle observation), B (success, clear observation), C (failure, clear observation), and D (failure, obstacle observation).

Figure 4: (left) Fully expanded view of state transition. (right) Information available to the robot.

Example 3 (cont'd)
Unfortunately, because the robot receives only the observation, it cannot distinguish between conditions A and D, nor between conditions B and C. As a result, its available information is better represented by the transition diagram on the right. Conditions A and D have been combined into condition AD, and B and C have been combined into BC.
Looking more closely at condition AD, we see that the robot cannot infer its position with certainty, since condition A and condition D have the robot in different positions. Instead, the likelihoods of possible positions of the robot are marked on the map; the values correspond to the relative likelihood of condition A and condition D.

Partially observable Markov decision process
This type of model is called a partially observable Markov decision process (POMDP). In general, a POMDP is an MDP extended to include an observation model. Rather than receiving complete state information s_t after each step of execution, in a POMDP model the agent is assumed to receive a noisy observation o_t.
POMDPs model observations probabilistically. The set of possible observations is a discrete set O, and each observation carries information about the preceding action and current state according to the noisy observation function O : A × S → Π(O), defined such that

O(a, s′, o) := Pr(O_{t+1} = o | a_t = a, s_{t+1} = s′)   (4)

Belief
The agent is also assumed to have a probability distribution b_0 ∈ Π(S) describing the initial state of the system, such that b_0(s) = Pr(s_0 = s). In general, we describe a probability distribution over S as a belief, and denote the space of beliefs with B = Π(S). Beliefs can be thought of as length-|S| vectors, and because the entries of a belief vector must sum to 1, B is a simplex of dimension |S|−1 embedded in R^|S|.

Graphical model for a POMDP
Figure 5 shows the graphical model for a POMDP. The chain structure of the state sequence from the MDP is retained, but the agent no longer has direct access to the state information at each time step. Instead, it can infer only uncertain state information from the history of observations.

Figure 5: Graphical model for a POMDP

Bayesian updating

At each time step, the agent can use Bayesian reasoning to generate an updated belief. Let b_a^o denote the agent's updated belief at the next time step after taking action a and receiving observation o. That is,

b_a^o(s′) := Pr(s_{t+1} = s′ | b_t = b, a_t = a, o_{t+1} = o)   (5)

The updated belief can be calculated as follows:

b_a^o(s′) = Pr(s′ | b, a, o) = Pr(s′, o | b, a) / Pr(o | b, a)
          = [Pr(o | a, s′) ∑_s Pr(s′ | s, a) Pr(s | b)] / [∑_{s̄} Pr(o | a, s̄) ∑_s Pr(s̄ | s, a) Pr(s | b)]
          = [O(a, s′, o) ∑_s T(s, a, s′) b(s)] / [∑_{s̄} O(a, s̄, o) ∑_s T(s, a, s̄) b(s)]   (6)
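Equation (6) translates almost line for line into code: propagate the belief through T, weight by the observation likelihood, and renormalize. The sketch below assumes T and O are stored as dictionaries keyed by (s, a, s') and (a, s', o); this encoding is an illustration, not something fixed by the slides:

```python
def belief_update(b, a, o, T, O, states):
    """Bayesian belief update b_a^o per Equation (6).

    b: dict mapping state -> probability
    T: dict mapping (s, a, s') -> transition probability
    O: dict mapping (a, s', o) -> observation probability
    """
    unnormalized = {}
    for s_next in states:
        # Pr(o | a, s') * sum_s T(s, a, s') b(s)
        predicted = sum(T.get((s, a, s_next), 0.0) * b[s] for s in states)
        unnormalized[s_next] = O.get((a, s_next, o), 0.0) * predicted
    norm = sum(unnormalized.values())  # Pr(o | b, a), the denominator of Equation (6)
    if norm == 0.0:
        raise ValueError("observation has zero probability under this belief and action")
    return {s: p / norm for s, p in unnormalized.items()}
```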

Policy
When planning in an MDP model, the agent's policy could be conditioned on a history of states and actions. In a POMDP model, the agent has knowledge only of actions and observations, so a deterministic policy with horizon h has the form

a_t = π(h, a_0, o_1, a_1, o_2, a_2, ..., a_{t−1}, o_t).   (7)

With an MDP, the history could be safely discarded given the current state because the system was Markovian. With a POMDP, the agent does not have access to the current state, but it can use the current belief as a sufficient statistic for the history. Thus, given the current belief, the agent gains no advantage from conditioning on history, and it can restrict its reasoning to policies in the form

a_t = π(h, b_t).   (8)

Transform the POMDP into a belief MDP
This change of variables suggests transforming the POMDP into a belief MDP, as follows:

The POMDP belief simplex B plays the role of the MDP state set.

The action set, horizon, and discount factor of the POMDP are used unchanged for the belief MDP.

The transition function T̃ of the belief MDP gives the probability of transitioning from one belief to another according to the observation model and belief update rule

T̃(b, a, b′) := Pr(b′ | b, a) = ∑_o Pr(b′ | b, a, o) Pr(o | b, a)
             = ∑_o δ(b′, b_a^o) ∑_{s,s′} O(a, s′, o) T(s, a, s′) b(s)   (9)

where δ(x, y) = 1 if x = y and 0 otherwise.

Transform the POMDP into a belief MDP (cont’d)

The reward function R̃ gives the expected immediate reward from applying an action in a given belief

R̃(b, a) := E_{b_t=b}[R(s_t, a)] = ∑_s R(s, a) b(s)   (10)

Belief MDP
Note that the state set of the belief MDP is the POMDP belief simplex, which is uncountably infinite. Although our earlier discussion of MDPs assumed that the state space was finite, the same theoretical results go through with an infinite state set.
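Equation (10) is just an expectation of the immediate reward under the current belief; a short sketch (with R stored as a dict keyed by (state, action), the same hypothetical encoding used in the earlier sketches):

```python
def belief_reward(b, a, R):
    """Expected immediate reward R~(b, a) = sum_s R(s, a) b(s), as in Equation (10)."""
    return sum(R[(s, a)] * p for s, p in b.items())
```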

Belief MDP Value Function Structure

Piecewise-linear and convex
The value functions of belief MDP policies have a special structure that makes it possible to generalize value iteration. For any finite horizon h, the optimal value function V*_h is piecewise-linear and convex (PWLC).
Formally, let π be any policy whose actions are not conditioned on the initial belief (such as a policy tree) and let α be a length-|S| vector such that α(s) is the expected value of following π starting from state s:

α(s) = E_{π, s_0=s}[∑_{t=0}^{h−1} γ^t R(s_t, a_t)]   (11)

Piecewise-linear and convex (cont'd)
Then the value of executing π starting from a belief b is

J^π(b) = E_{π, b_0=b}[∑_{t=0}^{h−1} γ^t R(s_t, a_t)]
       = ∑_s b(s) E_{π, s_0=s}[∑_{t=0}^{h−1} γ^t R(s_t, a_t)]
       = ∑_s b(s) α(s) = αᵀ b   (12)

Abusing notation, we also write

α(b) = αᵀ b   (13)

Notation
When the max operator is applied to a set of α vectors, each vector is interpreted in this second sense, as a function over the belief simplex. In other words, if

V = max{α_1, ..., α_n},   (14)

then

V(b) := max_{α_i} α_iᵀ b.   (15)

Example 4: TIGER problem
In the TIGER problem, you stand before two doors. Behind one door is a tiger and behind the other is a pot of gold, but you do not know which is which. Thus there are two equally likely states: the tiger is located either behind the left or right door (the tiger-left or tiger-right state). You may try to learn more using the listen action, which provides information about which door the tiger is lurking behind, either the noise-left or noise-right observation; the observation has an 85% chance of accurately indicating the location of the tiger. Alternately, you may choose to open one of the two doors using the open-left or open-right actions, at which point the game ends and you will either happily receive the gold (reward +10) or be eaten (reward −100). The parameters of the POMDP are summarized in Table 1; the discount factor γ = 0.95.

Figure 6: the TIGER Problem

TIGER problem (cont’d)

                              s = tiger-left    s = tiger-right
b_0(s)                        0.5               0.5
R(s, open-left)               -100              10
R(s, open-right)              10                -100
R(s, listen)                  -1                -1
O(listen, s, noise-left)      0.85              0.15
O(listen, s, noise-right)     0.15              0.85

Table 1: TIGER problem parameters
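To make the numbers in Table 1 concrete, the sketch below encodes the TIGER parameters and performs one Bayesian update after a listen action that returns noise-left (Equation (6) simplifies here because listening never moves the tiger):

```python
states = ["tiger-left", "tiger-right"]
b0 = {"tiger-left": 0.5, "tiger-right": 0.5}

O = {  # O[(a, s', o)] for the listen action, from Table 1
    ("listen", "tiger-left", "noise-left"): 0.85,
    ("listen", "tiger-left", "noise-right"): 0.15,
    ("listen", "tiger-right", "noise-left"): 0.15,
    ("listen", "tiger-right", "noise-right"): 0.85,
}
R = {
    ("tiger-left", "open-left"): -100, ("tiger-right", "open-left"): 10,
    ("tiger-left", "open-right"): 10,  ("tiger-right", "open-right"): -100,
    ("tiger-left", "listen"): -1,      ("tiger-right", "listen"): -1,
}

# One belief update after hearing noise-left; listen leaves the state unchanged,
# so the update reduces to weighting by the observation likelihood and renormalizing.
o = "noise-left"
unnorm = {s: O[("listen", s, o)] * b0[s] for s in states}
b1 = {s: p / sum(unnorm.values()) for s, p in unnorm.items()}
print(b1)  # {'tiger-left': 0.85, 'tiger-right': 0.15}

# Expected immediate reward of opening the right door under the new belief (Equation 10):
print(sum(R[(s, "open-right")] * b1[s] for s in states))  # -6.5
```

Even after one favorable observation the expected value of opening a door is still negative, which is why policies that open immediately fare poorly from the uniform initial belief.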

Policy trees
Figure 7 shows some example two-step policy trees for the TIGER POMDP and the corresponding α vectors. The belief b varies along the x axis of the plot, from 100% certainty of tiger-left at the extreme left to 100% certainty of tiger-right at the extreme right. Each line in the plot relates to one of the policy trees as follows:

Figure 7: Example policy trees and corresponding α vectors for the TIGER POMDP

Policy A
Policy A: You select listen, then select open-right if noise-left is heard (and otherwise listen again). This is a good policy in the tiger-left state, since the most likely course of events is that you will hear noise-left and then perform the open-right action, receiving the pot of gold. Policy A performs poorly in the tiger-right state; it usually causes you to receive a small penalty for listening twice, and in the worst case you get a false noise-left reading and are eaten by the tiger. This is reflected in the policy A value function, which is at its highest in the tiger-left state and slopes down as the probability of being in the tiger-right state increases.

Policy B
Policy B: This policy is symmetrical with policy A; it works best when there is a high probability of being in the tiger-right state.

Policy C
Policy C: You listen, then open whichever door you hear a noise behind. This policy is clearly a bad idea regardless of what you believe, since it preferentially opens the door where the tiger is located! This is reflected in its uniformly poor value function.

Max-planes representation

Let T_h = {π_1, ..., π_n} be the set of all policy trees of depth h as before, and let Γ_h = {α_1, ..., α_n} be the corresponding set of α vectors. The optimal value for any particular belief b is the largest value assigned to b by any policy tree, meaning

V*_h(b) = max_{π ∈ T_h} J^π(b) = max Γ_h := max_{α ∈ Γ_h} αᵀ b   (16)

In other words, for any finite h, the function V*_h can be written as the maximum of a finite number of linear functions. We call this the max-planes representation. The existence of this representation means that V*_h is a PWLC function.
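A max-planes value function is just a set of α vectors, and evaluating it at a belief is a maximization over dot products, as in this sketch (the two α vectors are made-up numbers, not the TIGER vectors of Figure 7):

```python
def value(b, alpha_vectors):
    """Evaluate a max-planes value function: V(b) = max over alpha of alpha . b (Equation 16)."""
    return max(sum(alpha[s] * p for s, p in b.items()) for alpha in alpha_vectors)

# Hypothetical alpha vectors for a two-state problem, indexed by state name.
Gamma = [
    {"s1": 10.0, "s2": -50.0},
    {"s1": -50.0, "s2": 10.0},
]
print(value({"s1": 0.5, "s2": 0.5}, Gamma))  # -20.0
print(value({"s1": 0.9, "s2": 0.1}, Gamma))  # 4.0
```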

Optimal value functions
Figure 8 shows optimal value functions V*_h for a typical two-state POMDP, for several values of h. For each value of h, the thin lines are individual α vectors, each corresponding to a depth-h policy tree, and V*_h is the upper surface of all these α vectors, represented with a thicker line. The number of α vectors needed to represent V*_h tends to increase with h, and in the limit V*_∞ may contain a countably infinite number of α vectors, in which case it is still convex but no longer piecewise-linear.

Figure 8: Optimal value functions at different time horizons for a typical two-state POMDP

Pruning

Note that the value function for policy C in Figure 7 is everywhere dominated by policies A and B. It turns out that in a typical POMDP most policy trees from T_h are, like policy C, sub-optimal for every belief. This means that the corresponding α vectors are not part of the upper surface of V*_h. These dominated α vectors can be pruned from the set Γ_h without affecting the value function. Pruning dominated vectors can exponentially reduce the size of the value function representation; thus it plays an important role in practical POMDP value iteration algorithms.
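The cheapest pruning test, sketched below, removes α vectors that are pointwise dominated by some other vector in the set; exact algorithms additionally use a linear program to detect vectors that are dominated over the whole simplex without being pointwise dominated:

```python
def prune_pointwise_dominated(alpha_vectors, states):
    """Remove alpha vectors dominated at every state by another vector in the set.

    Exact duplicates are kept; this is only the cheap pointwise check, not the
    full linear-programming prune used by exact value iteration algorithms.
    """
    kept = []
    for i, alpha in enumerate(alpha_vectors):
        dominated = any(
            j != i
            and all(other[s] >= alpha[s] for s in states)
            and any(other[s] > alpha[s] for s in states)
            for j, other in enumerate(alpha_vectors)
        )
        if not dominated:
            kept.append(alpha)
    return kept
```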

V* for an example three-state POMDP
So far we have graphed value functions for POMDPs with just two states. POMDPs with more states have the same value function structure, but since the value function is defined over a higher-dimensional belief simplex, it is harder to visualize. Figure 9 provides some geometrical intuition by showing the upper surface V*_h for an example three-state POMDP. The hatched area below the surface represents the domain over which V*_h is defined.

Figure 9: V* for an example three-state POMDP

Prior Research on POMDPs

Exact value iteration

Many early POMDP solution algorithms performed exact value iteration using piecewise linear convex value function representations (Sondik, 1971; Monahan, 1982; Cheng, 1988; White, 1991). These algorithms had high computational complexity both in theory and in practice; they were largely impractical given the computing hardware available at the time.

The process of computing the Bellman update H used by value iteration can be broken up into (1) choosing a set of important beliefs, and (2) at each selected belief, performing a local "point-based" update that calculates the optimal α vector for that belief from HV (Cheng, 1988). The Witness algorithm (Littman, 1996) uses this idea to compute an exact Bellman update using a series of local updates at points selected using linear programs. Witness is more tractable than earlier exact VI techniques because it generates far fewer dominated α vectors.

The Incremental Pruning algorithm (Cassandra et al., 1997) achieved similar performance improvement. Incremental Pruning breaks up calculation of the Bellman update into a series of batch operations and interleaves these operations with pruning steps so that dominated α vectors are pruned earlier in the calculation.
Larger POMDPs require approximation techniques discussed below, including more compact approximate value function representations and point-based value iteration techniques, which break up the global Bellman update into smaller focused updates.

Websites about POMDP exact value iteration algorithms:

1 http://www.pomdp.org/pomdp/index.shtml

Point-based value iteration

The Witness algorithm computes an exact Bellman update using a series of point-based updates at carefully selected points. On the other hand, if we are willing to accept some approximation error in the Bellman update, we can get away with updating fewer beliefs and selecting them less carefully. This is the key idea behind point-based value iteration algorithms.
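The shared core operation is the point-based backup: at a chosen belief b, propagate b through every action and observation, pick the best existing α vector at each resulting belief, and assemble a new α vector from the Bellman equation. The sketch below uses the same hypothetical dictionary encodings of T, O, and R as the earlier examples:

```python
def point_based_backup(b, Gamma, states, actions, observations, T, O, R, gamma):
    """One point-based Bellman backup at belief b, returning a new alpha vector.

    Gamma is the current set of alpha vectors (dicts mapping state -> value).
    """
    best_alpha, best_value = None, float("-inf")
    for a in actions:
        # Start from the immediate reward vector for action a.
        new_alpha = {s: R[(s, a)] for s in states}
        for o in observations:
            # For each existing alpha, compute g(s) = sum_{s'} O(a,s',o) T(s,a,s') alpha(s').
            candidates = [
                {s: sum(O.get((a, sp, o), 0.0) * T.get((s, a, sp), 0.0) * alpha[sp]
                        for sp in states)
                 for s in states}
                for alpha in Gamma
            ]
            # Keep the candidate that scores best at the belief being backed up.
            g_best = max(candidates, key=lambda g: sum(g[s] * b[s] for s in states))
            for s in states:
                new_alpha[s] += gamma * g_best[s]
        value_at_b = sum(new_alpha[s] * b[s] for s in states)
        if value_at_b > best_value:
            best_alpha, best_value = new_alpha, value_at_b
    return best_alpha
```

The algorithms described next differ mainly in how they choose which beliefs to back up and how often.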

The first algorithms in this class, Cheng's Point-Based Dynamic Programming (Cheng, 1988) and Zhang's Point-Based Value Iteration (Zhang and Zhang, 2001), maintain their value function in the form of a set of α vectors and, associated with each vector, a witness belief where the vector dominates all other vectors. They interleave many (cheap) point-based updates at the witness points with occasional (very expensive) exact Bellman updates that generate new α vectors and witness points. The algorithms differ mainly in their process for selecting witness points.

Pineau's Point-Based Value Iteration (PBVI) (Pineau et al., 2003b, 2006) is perhaps the most widely used point-based algorithm. PBVI selects a finite set B̃ of beliefs and performs point-based updates on all points of B̃ in synchronous batches until a convergence condition is reached, then uses a heuristic to expand B̃ and continues the process. They show that each batch update approximates the Bellman update and repeated batch updates converge to an approximation of V* whose error is related to the sample spacing of B̃. In extensions of this work, they improved the heuristic for selecting new beliefs (Pineau and Gordon, 2005) and improved update efficiency by storing witness points in a metric tree (Pineau et al., 2003a).

The PERSEUS algorithm (Spaan and Vlassis, 2005) uses batch updates that are slightly weaker but considerably faster than those of PBVI. During each update, PERSEUS backs up randomly sampled points from B̃, stopping when all points of B̃ have higher values than they did according to the previous value function. Since a single α vector often improves the value of several points, the batch usually requires many fewer than |B̃| point-based updates.

Websites about POMDP point-based value iteration algorithms:

1 The Symbolic Perseus POMDP Solver (MATLAB/Java source code), written by Pascal Poupart: http://www.cs.uwaterloo.ca/˜ppoupart/index.html

2 The Perseus POMDP Solver (MATLAB source code), written by Matthijs Spaan: http://staff.science.uva.nl/˜mtjspaan/software/approx/

Heuristic search

Heuristic search played a role in many of the POMDP solution algorithms already discussed; this section is specifically concerned with heuristic search techniques for selecting POMDP beliefs to update.
The first example of this technique is the RTDP-BEL algorithm (Geffner and Bonet, 1998), based on the Real-Time Dynamic Programming (RTDP) algorithm for MDPs (Barto et al., 1995). RTDP-BEL is a straightforward extension of RTDP to POMDPs represented as belief-MDPs, using a grid-based value function approximation.

The same authors later developed improved algorithms to address some of RTDP's limitations. Labeled RTDP labels states once it knows from tracking residuals that they have approximately optimal values; labeled states can be skipped in later trials (Bonet and Geffner, 2003b). The Heuristic Dynamic Programming algorithm also labels finished states, but often more efficiently, and it fits better into a broad framework of find-and-revise algorithms that applies across many types of problems (Bonet and Geffner, 2003a). Bonet and Geffner applied these algorithms only to MDPs, but they can be generalized to POMDPs in the same fashion as RTDP.

The Bounded RTDP (BRTDP) algorithm (McMahan et al., 2005) is another MDP heuristic search algorithm based on RTDP. More recently, Shani et al. have modified a number of point-based value iteration algorithms to use prioritized asynchronous updates similar to the HSVI algorithm (Smith and Simmons, 2004), and have developed several novel heuristics for selecting good belief points (Shani et al., 2006, 2007; Virin et al., 2007).

Websites about POMDP heuristic search value iteration algorithms:

1 ZMDP Software for POMDP and MDP Planning, written by Trey Smith: http://www.cs.cmu.edu/˜trey/zmdp/

Decentralized POMDPs

Recently, there has been growing interest in multi-agent systems. The partially observable stochastic game (POSG) framework models planning for multiple agents with different information resources; the agents may or may not cooperate with each other (Hansen et al., 2004). A decentralized POMDP (DEC-POMDP) is a special type of POSG in which the agents are assumed to be purely cooperative, but still have limited ability to share information (Bernstein et al., 2002).

Unlike single-agent POMDPs, DEC-POMDPs can be used to decide when agents should communicate (Roth et al., 2006). Solving general DEC-POMDPs is known to be intractable, but approximate POMDP solution techniques such as BPI and PERSEUS have been adapted to the DEC-POMDP framework (Bernstein et al., 2005; Spaan et al., 2006).

References

Trey Smith. Probabilistic Planning for Robotic Exploration. PhD thesis, The Robotics Institute, Carnegie Mellon University, Pittsburgh, PA, July 2007.
