Artificial Intelligence
Beyond Plain MDPs**
Marc Toussaint
University of Stuttgart
Winter 2019/20
Motivation: We discussed dynamic programming and reinforcement learning in standard MDPs. There are some extensions which address relational state representations, multi-agent situations, partial observability, and continuous time. We address these briefly. Especially partial observability is a very important aspect, as any real-world decision process is necessarily partially observable, and it provides a potential bridge to think of perception as being fundamentally part of planning. The respective model is called POMDP (partially observable MDP). We'll discuss the basic approach to address such problems, namely belief space planning, and emphasize the hardness of such problems.
Beyond Plain MDPs** – – 1/34
Overview of common domain models
| model / state representation | prob. | rel. | multi | PO | cont. time |
|------------------------------|-------|------|-------|----|------------|
| plain deterministic          | -     | -    | -     | -  | -          |
| plain MDP                    | +     | -    | -     | -  | -          |
| PDDL (STRIPS rules)          | - (+) | +    | - (+) | -  | -          |
| NDRs                         | +     | +    | - (+) | -  | -          |
| relational MDP               | +     | +    | -     | -  | -          |
| POMDP                        | +     | -    | -     | +  | -          |
| DEC-POMDP                    | +     | -    | +     | +  | -          |
| Games                        | -     | +    | +     | -  | -          |
| stochastic diff. eqns. (SOC) | +     | -    | -     | +  | +          |

[columns: probabilistic, relational, multi-agent, partially observable, continuous time]

PDDL: Planning Domain Definition Language, STRIPS: STanford Research Institute Problem Solver, NDRs: Noisy Deictic Rules, MDP: Markov Decision Process, POMDP: Partially Observable MDP, DEC-POMDP: Decentralized POMDP, SOC: Stochastic Optimal Control
Relational State Representations
• Types of state representation:
  – Discrete, continuous, hybrid
  – Factored
  – Structured/relational
Relational representations of state
• The world is composed of objects; its state is described in terms of properties and relations of objects. Formally:
  – A set of constants (referring to objects)
  – A set of predicates (referring to object properties or relations)
  – A set of functions (mapping to constants)

• A (grounded) state can then be described by a conjunction of predicates (and functions). For example:
– Constants: C1, C2, P1, P2, SFO, JFK
– Predicates: At(·, ·), Cargo(·), Plane(·), Airport(·)
– A state description:
  At(C1, SFO) ∧ At(C2, JFK) ∧ At(P1, SFO) ∧ At(P2, JFK) ∧ Cargo(C1) ∧ Cargo(C2) ∧ Plane(P1) ∧ Plane(P2) ∧ Airport(JFK) ∧ Airport(SFO)
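Such a grounded state can be represented concretely as a set of ground atoms, here sketched in Python (the tuple encoding and the `holds` helper are illustrative choices, not part of the slides):

```python
# A grounded relational state as a set of ground atoms: (Predicate, *arguments).
state = {
    ("At", "C1", "SFO"), ("At", "C2", "JFK"),
    ("At", "P1", "SFO"), ("At", "P2", "JFK"),
    ("Cargo", "C1"), ("Cargo", "C2"),
    ("Plane", "P1"), ("Plane", "P2"),
    ("Airport", "JFK"), ("Airport", "SFO"),
}

def holds(state, *atom):
    """True iff the ground atom is part of the state (closed-world assumption)."""
    return atom in state
```

Under the closed-world assumption, any atom not in the set is false, e.g. `holds(state, "At", "C1", "JFK")` is False.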
PDDL
• Planning Domain Definition Language
  Developed for the 1998/2000 International Planning Competition (IPC)
(from Russell & Norvig)
• PDDL describes a deterministic mapping (s, a) ↦ s′, but
  – using a set of action schemas (rules) of the form
        ActionName(...): PRECONDITION → EFFECT
  – where the action arguments are variables and the preconditions and effects are conjunctions of predicates
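A minimal sketch of how such a schema is applied once it is grounded, under the usual STRIPS semantics (preconditions as a set of atoms, effects split into add and delete sets); the `Load`-style instance and all names below are hypothetical:

```python
def applicable(state, precond):
    """A grounded action is applicable iff all its precondition atoms hold."""
    return precond <= state

def apply_action(state, precond, add, delete):
    """Apply a grounded STRIPS action: check preconditions, then delete and add effects."""
    if not applicable(state, precond):
        raise ValueError("preconditions not satisfied")
    return (state - delete) | add

# A grounded instance of a hypothetical Load(C1, P1, SFO) schema:
state = {("At", "C1", "SFO"), ("At", "P1", "SFO"),
         ("Cargo", "C1"), ("Plane", "P1"), ("Airport", "SFO")}
precond = {("At", "C1", "SFO"), ("At", "P1", "SFO"),
           ("Cargo", "C1"), ("Plane", "P1")}
add = {("In", "C1", "P1")}
delete = {("At", "C1", "SFO")}

next_state = apply_action(state, precond, add, delete)
```

The successor state now contains In(C1, P1) and no longer contains At(C1, SFO).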
PDDL
• The state-of-the-art solvers are actually A* methods. But the heuristics!!
• Scale to huge domains
• Fast-Downward is great
Noisy Deictic Rules (NDRs)
• Noisy Deictic Rules (Pasula, Zettlemoyer, & Kaelbling, 2007)
• A probabilistic extension of “PDDL rules”:
• These rules define a transition probability

    P(s′|s, a)

  Namely, if (s, a) has a unique covering rule r, then

    P(s′|s, a) = P(s′|s, r) = ∑_{i=0}^{m_r} p_{r,i} P(s′|Ω_{r,i}, s)

  where P(s′|Ω_{r,i}, s) describes the deterministic state transition of the i-th outcome (see Lang & Toussaint, JAIR 2010).
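Assuming the covering rule has already been determined, sampling a successor state from this outcome mixture can be sketched as follows (the encoding of a rule as a list of `(p_i, effect_fn)` pairs is an illustrative simplification of NDRs):

```python
import random

def sample_transition(state, rule, rng=random):
    """Sample s' ~ P(s'|s, r): pick outcome i with probability p_i and apply
    its deterministic effect Omega_i (encoded here as a function s -> s')."""
    u = rng.random()
    acc = 0.0
    for p_i, omega_i in rule:      # rule: list of (p_i, Omega_i) pairs
        acc += p_i
        if u < acc:
            return omega_i(state)
    return state                   # numerical slack: keep the state unchanged
```

For example, a rule with outcomes [(0.7, add "clear(A)"), (0.3, no change)] returns the modified state in roughly 70% of samples.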
• While such rule-based domain models originated from classical AI research, the following were strongly influenced also by stochastics, decision theory, machine learning, etc.
Partially Observable MDPs
“Systems that take decisions based on available information”
• We assume the agent is in interaction with a domain:
  – The world is in a state s_t ∈ S
  – The agent senses observations y_t ∈ O
  – The agent decides on an action a_t ∈ A
  – The world transitions into a new state s_{t+1}
[Graphical model: world states s_0..s_3, actions a_0..a_3, observations y_0..y_3, and the agent]
• Generally, an agent maps the history to an action, h_t = (y_{0:t}, a_{0:t-1}) ↦ a_t
POMDPs
• Partial observability adds a totally new level of complexity!
• Basic alternative agent models:
  – The agent maps y_t ↦ a_t (stimulus-response mapping; non-optimal)
  – The agent stores all previous observations and maps
        f: (y_{0:t}, a_{0:t-1}) ↦ a_t
    f is called the agent function. This is the most general model, including the others as special cases.
  – The agent stores only the recent history and maps (y_{t-k:t}, a_{t-k:t-1}) ↦ a_t (crude, but may be a good heuristic)
  – The agent is some machine with its own internal state n_t, e.g., a computer, a finite state machine, a brain... The agent maps (n_{t-1}, y_t) ↦ n_t (internal state update) and n_t ↦ a_t
  – The agent maintains a full probability distribution (belief) b_t(s_t) over the state, maps (b_{t-1}, y_t) ↦ b_t (Bayesian belief update), and b_t ↦ a_t
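The last model, the Bayesian belief update (b_{t-1}, y_t) ↦ b_t, can be sketched for a discrete POMDP as b_t(s′) ∝ P(y_t|s′) ∑_s P(s′|s, a) b_{t-1}(s); the nested-dict encoding of the transition and observation models below is an illustrative assumption:

```python
def belief_update(b, a, y, P_trans, P_obs):
    """One Bayesian filter step: b'(s') ∝ P(y|s') * sum_s P(s'|s,a) b(s).
    b: dict state -> prob; P_trans[a][s][s'] and P_obs[s'][y] are nested dicts."""
    b_new = {}
    for s2 in P_obs:
        pred = sum(P_trans[a][s].get(s2, 0.0) * p for s, p in b.items())  # prediction
        b_new[s2] = P_obs[s2].get(y, 0.0) * pred                          # obs. weighting
    Z = sum(b_new.values())
    if Z == 0.0:
        raise ValueError("observation impossible under current belief")
    return {s2: p / Z for s2, p in b_new.items()}
```

Repeated application of this update is exactly the (b_{t-1}, y_t) ↦ b_t mapping the slide refers to.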
POMDP coupled to a state machine agent
[Graphical model: world states s_0..s_2 with rewards r_0..r_2, observations y_0..y_2, actions a_0..a_2, and internal agent states n_0..n_2]
http://www.darpa.mil/grandchallenge/index.asp
• The tiger problem: a typical POMDP example:
(from a POMDP tutorial)
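In the standard formulation of the tiger problem (Kaelbling, Littman & Cassandra), listening reports the correct door with probability 0.85; under that assumed parameter, the belief update after each "listen" action is a one-line application of Bayes' rule:

```python
def tiger_listen_update(b_left, heard_left, acc=0.85):
    """Belief update for the tiger problem after a 'listen' action.
    acc: probability that listening reports the correct door (assumed 0.85)."""
    p_obs_left = acc if heard_left else 1.0 - acc        # P(obs | tiger left)
    p_obs_right = 1.0 - acc if heard_left else acc       # P(obs | tiger right)
    num = p_obs_left * b_left
    return num / (num + p_obs_right * (1.0 - b_left))

b = 0.5
for _ in range(2):                 # hear the tiger on the left twice in a row
    b = tiger_listen_update(b, heard_left=True)
```

Starting from the uniform prior, two consistent observations push the belief b(tiger-left) to roughly 0.97; whether that is confident enough to open the right door depends on the reward parameters.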
Solution via Dynamic Programming in Belief Space

[Graphical models: the POMDP (states s_0..s_3 with actions a_0..a_2, observations y_0..y_3, rewards r_0..r_2) and the corresponding belief MDP (beliefs b_0..b_3)]
• Consider the belief as the state of the Markov(!) decision process! The value function is a function over the belief (δ ∼ belief update):

    V(b) = max_a [ R(b, a) + γ ∑_{y′} P(y′|a, b) ∫_{b′} δ(b′|y′, a, b) V(b′) ]

• Sondik 1971: V is piece-wise linear and convex: it can be described by m vectors (α_1, .., α_m), where each α_i = α_i(s) is a function over the discrete s:

    V(b) = max_i ∑_s α_i(s) b(s)

  Exact dynamic programming is possible, see Pineau et al., 2003
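Sondik's representation makes evaluating V(b) trivial once the α-vectors are given; a minimal sketch (the example α-vectors are made up for illustration):

```python
def pomdp_value(b, alphas):
    """Sondik's form V(b) = max_i sum_s alpha_i(s) b(s); b and each alpha_i
    are vectors over the discrete states."""
    return max(sum(a_s * b_s for a_s, b_s in zip(alpha, b)) for alpha in alphas)

# Two made-up alpha-vectors for a 2-state problem:
alphas = [(1.0, 0.0), (0.0, 1.0)]
```

Evaluating `pomdp_value` on a grid of beliefs makes the piece-wise linear, convex shape of V visible: each α-vector contributes one linear facet, and the max selects the upper envelope.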
Truly Optimal Policies
• The value function assigns a value (maximal achievable expectedreturn) to a state of knowledge
• Optimal policies optimally “navigate through belief space”
  – This automatically implies/combines “exploration” and “exploitation”
  – There is no need to explicitly address “exploration vs. exploitation” or decide for one against the other. Optimal policies will automatically do this.
• Computationally heavy: bt is a probability distribution, Vt a functionover probability distributions
• See appendix on bandits to make equations more explicit
Approximations & Heuristics
• Point-based Value Iteration (Pineau et al., 2003)
  – Compute V(b) only for a finite set of belief points

• Discard the idea of using the belief to “aggregate” history
  – The policy directly maps the history (window) to actions
  – Optimize finite state controllers (Meuleau et al. 1999, Toussaint et al. 2008)
Further reading
• Point-based value iteration: An anytime algorithm for POMDPs.Pineau, Gordon & Thrun, IJCAI 2003.
• The standard references on the “POMDP page”http://www.cassandra.org/pomdp/
• Bounded finite state controllers. Poupart & Boutilier, NIPS 2003.
• Hierarchical POMDP Controller Optimization by Likelihood Maximization. Toussaint, Charlin & Poupart, UAI 2008.
Decentralized POMDPs
• Finally going multi-agent!
(from Kumar et al., IJCAI 2011)
• This is a special type (simplification) of a general DEC-POMDP
• Generally, this level of description is very general, but NEXP-hard.
  Approximate methods can yield very good results, though.
Belief Space Planning for Bandits**
Back to the Bandits
• Can Dynamic Programming also be applied to the Bandit problem?
  We learnt UCB as the standard approach to address Bandits – but what would be the optimal policy?
Bandits recap
• Let a_t ∈ {1, .., n} be the choice of machine at time t
  Let y_t ∈ ℝ be the outcome with mean ⟨y_{a_t}⟩
  A policy or strategy maps all the history to a new choice:

      π: [(a_1, y_1), (a_2, y_2), ..., (a_{t-1}, y_{t-1})] ↦ a_t

• Problem: Find a policy π that

      max ⟨∑_{t=1}^T y_t⟩

  or

      max ⟨y_T⟩

• “Two effects” of choosing a machine:
  – You collect more data about the machine → knowledge
  – You collect reward
The Belief State
• “Knowledge” can be represented in two ways:
  – as the full history

        h_t = [(a_1, y_1), (a_2, y_2), ..., (a_{t-1}, y_{t-1})]

  – as the belief

        b_t(θ) = P(θ|h_t)

    where θ are the unknown parameters θ = (θ_1, .., θ_n) of all machines

• In the bandit case:
  – The belief factorizes: b_t(θ) = P(θ|h_t) = ∏_i b_t(θ_i|h_t)
  – e.g. for Gaussian bandits with constant noise, θ_i = μ_i:

        b_t(μ_i|h_t) = N(μ_i | y_i, s_i)

  – e.g. for binary bandits, θ_i = p_i, with prior Beta(p_i|α, β):

        b_t(p_i|h_t) = Beta(p_i | α + a_{i,t}, β + b_{i,t})
        a_{i,t} = ∑_{s=1}^{t-1} [a_s = i][y_s = 0],   b_{i,t} = ∑_{s=1}^{t-1} [a_s = i][y_s = 1]
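For the binary-bandit case, maintaining the factorized Beta belief amounts to counting outcomes per machine; a sketch following the slide's convention that a_{i,t} counts y = 0 outcomes and b_{i,t} counts y = 1 outcomes:

```python
def beta_posterior_params(history, n, alpha=1, beta=1):
    """Beta posterior parameters per machine from history [(a_1, y_1), ...]
    with y in {0, 1}; slide convention: first count tracks y=0, second y=1."""
    params = [[alpha, beta] for _ in range(n)]
    for a, y in history:
        params[a][0 if y == 0 else 1] += 1   # increment a_{i,t} or b_{i,t}
    return params
```

Because the belief factorizes over machines, each pull updates exactly one of the n Beta distributions, so the belief state is just 2n integers.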
The Belief MDP

• The process can be modelled as

  [Graphical model: unknown parameters θ generating the outcomes y_1, y_2, y_3 of actions a_1, a_2, a_3]

  or as Belief MDP

  [Graphical model: beliefs b_0..b_3 with actions a_1..a_3 and observations y_1..y_3]

    P(b′|y, a, b) = { 1 if b′ = b′_{[b,a,y]} ; 0 otherwise },    P(y|a, b) = ∫_{θ_a} b(θ_a) P(y|θ_a)

• The Belief MDP describes a different process: the interaction between the information available to the agent (b_t or h_t) and its actions, where the agent uses its current belief to anticipate observations, P(y|a, b).

• The belief (or history h_t) is all the information the agent has available; P(y|a, b) is the “best” possible anticipation of observations. If it acts optimally in the Belief MDP, it acts optimally in the original problem.

    Optimality in the Belief MDP ⇒ optimality in the original problem
Optimal policies via Dynamic Programming in Belief Space

• The Belief MDP:

  [Graphical model: beliefs b_0..b_3 with actions a_1..a_3 and observations y_1..y_3]

    P(b′|y, a, b) = { 1 if b′ = b′_{[b,a,y]} ; 0 otherwise },    P(y|a, b) = ∫_{θ_a} b(θ_a) P(y|θ_a)

• Belief Planning: Dynamic Programming on the value function

    ∀b:  V_{t-1}(b) = max_π ⟨∑_{t′=t}^T y_{t′}⟩
                    = max_{a_t} ∫_{y_t} P(y_t|a_t, b) [ y_t + V_t(b′_{[b,a_t,y_t]}) ]
V*_t(h) := max_π ∫_θ P(θ|h) V^{π,θ}_t(h)    (1)

V^π_t(b) := ∫_θ b(θ) V^{π,θ}_t(b)    (2)

V*_t(b) := max_π V^π_t(b) = max_π ∫_θ b(θ) V^{π,θ}_t(b)    (3)

  = max_π ∫_θ P(θ|b) [ R(π(b), b) + ∫_{b′} P(b′|b, π(b), θ) V^{π,θ}_{t+1}(b′) ]    (4)

  = max_a max_π ∫_θ P(θ|b) [ R(a, b) + ∫_{b′} P(b′|b, a, θ) V^{π,θ}_{t+1}(b′) ]    (5)

  = max_a [ R(a, b) + max_π ∫_θ ∫_{b′} P(θ|b) P(b′|b, a, θ) V^{π,θ}_{t+1}(b′) ]    (6)

Using Bayes' rule, the θ-dependent belief transition can be rewritten:

P(b′|b, a, θ) = ∫_y P(b′, y|b, a, θ)    (7)

  = ∫_y P(θ|b, a, b′, y) P(b′, y|b, a) / P(θ|b, a)    (8)

  = ∫_y b′(θ) P(b′, y|b, a) / b(θ)    (9)

Substituting this back:

V*_t(b) = max_a [ R(a, b) + max_π ∫_θ ∫_{b′} ∫_y b(θ) (b′(θ) P(b′, y|b, a) / b(θ)) V^{π,θ}_{t+1}(b′) ]    (10)

  = max_a [ R(a, b) + max_π ∫_{b′} ∫_y P(b′, y|b, a) ∫_θ b′(θ) V^{π,θ}_{t+1}(b′) ]    (11)

  = max_a [ R(a, b) + max_π ∫_y P(y|b, a) ∫_θ b′_{[b,a,y]}(θ) V^{π,θ}_{t+1}(b′_{[b,a,y]}) ]    (12)

  = max_a [ R(a, b) + max_π ∫_y P(y|b, a) V^π_{t+1}(b′_{[b,a,y]}) ]    (13)

  = max_a [ R(a, b) + ∫_y P(y|b, a) max_π V^π_{t+1}(b′_{[b,a,y]}) ]    (14)

  = max_a [ R(a, b) + ∫_y P(y|b, a) V*_{t+1}(b′_{[b,a,y]}) ]    (15)
Optimal policies
• The value function assigns a value (maximal achievable expectedreturn) to a state of knowledge
• While UCB approximates the value of an action by an optimistic estimate of immediate return, Belief Planning acknowledges that this really is a sequential decision problem that requires planning
• Optimal policies “navigate through belief space”
  – This automatically implies/combines “exploration” and “exploitation”
  – There is no need to explicitly address “exploration vs. exploitation” or decide for one against the other. Optimal policies will automatically do this.
• Computationally heavy: bt is a probability distribution, Vt a functionover probability distributions
• The term

    ∫_{y_t} P(y_t|a_t, b) [ y_t + V_t(b′_{[b,a_t,y_t]}) ]

  is related to the Gittins Index: it can be computed for each bandit separately.
Example exercise
• Consider 3 binary bandits for T = 10.
  – The belief is 3 Beta distributions Beta(p_i|α + a_i, β + b_i) → 6 integers
  – T = 10 → each integer ≤ 10
  – V_t(b_t) is a function over {0, .., 10}^6

• Given a prior α = β = 1,
  a) compute the optimal value function and policy for the final-reward and the average-reward problems,
  b) compare with the UCB policy.
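Part (a) for the total-reward objective ⟨∑ y_t⟩ can be sketched as a memoized recursion over the count representation of the belief (here `wins`/`losses` count y = 1 and y = 0 pulls per bandit; this is an illustrative solution sketch, not the official one):

```python
from functools import lru_cache

T = 10            # horizon
ALPHA = BETA = 1  # uniform prior Beta(1, 1)

@lru_cache(maxsize=None)
def V(t, counts):
    """Optimal expected total reward from time t on; the belief is given by
    per-bandit (wins, losses) counts, i.e. numbers of y=1 and y=0 pulls."""
    if t == T:
        return 0.0
    best = 0.0
    for i, (wins, losses) in enumerate(counts):
        p = (ALPHA + wins) / (ALPHA + BETA + wins + losses)   # posterior mean of p_i
        up = counts[:i] + ((wins + 1, losses),) + counts[i + 1:]
        down = counts[:i] + ((wins, losses + 1),) + counts[i + 1:]
        best = max(best, p * (1.0 + V(t + 1, up)) + (1 - p) * V(t + 1, down))
    return best

b0 = ((0, 0),) * 3   # prior belief: no observations yet
```

Only count-tuples reachable within 10 pulls occur (a few thousand states), so the recursion is cheap; the optimal policy is the argmax inside the loop, and the final-reward variant only changes the terminal case.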
• The concept of Belief Planning transfers to other uncertain domains: whenever decisions also influence the state of knowledge
  – Active Learning
  – Optimization
  – Reinforcement Learning (MDPs with unknown environment)
  – POMDPs

• Planning in Belief Space is fundamental
  – Describes optimal solutions to Bandits, POMDPs, RL, etc.
  – But computationally heavy
  – Silver’s MCTS for POMDPs annotates nodes with history and belief representatives
Relation to Stochastic Optimal Control**
Controlled System
• Time is continuous, t ∈ R
• The system state, actions and observations are continuous, x(t) ∈ ℝ^n, u(t) ∈ ℝ^d, y(t) ∈ ℝ^m
• A controlled system can be described as

  linear:
      ẋ = Ax + Bu
      y = Cx + Du
  with matrices A, B, C, D

  non-linear:
      ẋ = f(x, u)
      y = h(x, u)
  with functions f, h

• A typical “agent model” is a feedback regulator (stimulus-response)

      u = Ky
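A minimal simulation of the linear case with such a feedback regulator u = Ky (scalar system, Euler integration; all numbers are illustrative):

```python
def simulate(A, B, K, x0, dt=0.01, steps=1000):
    """Euler simulation of the scalar system x' = A x + B u with output
    y = x (i.e. C = 1, D = 0) and feedback regulator u = K y."""
    x = x0
    for _ in range(steps):
        u = K * x                    # stimulus-response feedback u = K y
        x = x + dt * (A * x + B * u) # Euler step of x' = A x + B u
    return x

# Unstable plant A = 1, stabilized by K = -2: closed loop x' = -x.
x_final = simulate(A=1.0, B=1.0, K=-2.0, x0=1.0)
```

With K = -2 the closed-loop dynamics ẋ = -x drive the state to zero; with K = 0 the open-loop system ẋ = x diverges, which illustrates why the regulator matters.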
Stochastic Control
• The differential equations become stochastic
    dx = f(x, u) dt + dξ_x
    dy = h(x, u) dt + dξ_y

  where dξ is a Wiener process with ⟨dξ_i, dξ_j⟩ = C_ij(x, u)
• This is the control theory analogue to POMDPs
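Such stochastic dynamics can be simulated with the Euler–Maruyama scheme, replacing dξ by Gaussian increments of variance C dt (here a scalar system with constant C = σ² and u fixed to 0; an illustrative sketch):

```python
import math
import random

def euler_maruyama(f, x0, T=5.0, dt=0.001, sigma=0.1, seed=0):
    """Simulate dx = f(x, u) dt + dxi with dxi ~ N(0, sigma^2 dt), u = 0."""
    rng = random.Random(seed)
    x = x0
    for _ in range(int(T / dt)):
        dxi = rng.gauss(0.0, sigma * math.sqrt(dt))  # Wiener increment
        x = x + f(x, 0.0) * dt + dxi
    return x

# Stable drift f(x, u) = -x: trajectories decay toward 0 and hover there.
x_T = euler_maruyama(lambda x, u: -x, x0=2.0)
```

Unlike the deterministic case, the state does not converge to a point but fluctuates around it with stationary standard deviation σ/√2, which is the continuous-time analogue of the stochastic transitions in an MDP.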