Motivation Bayesian Reinforcement Learning Experiments and Applications Conclusion
Bayesian Reinforcement Learning in Continuous POMDPs
Stéphane Ross1, Brahim Chaib-draa2 and Joelle Pineau1
1School of Computer Science, McGill University, Canada
2Department of Computer Science, Laval University, Canada
December 19th, 2007
Bayes-Adaptive POMDP Stéphane Ross1 , Brahim Chaib-draa2 and Joelle Pineau1 1 / 18
Motivation
How should robots make decisions when:
- the environment is partially observable and continuous,
- they have a poor model of their sensors and actuators,
- parts of the model have to be learned entirely during execution (e.g. users' preferences/behavior),
so as to maximize expected long-term rewards?
Typical examples: (figures omitted) [Rottmann]

Solution: Bayesian Reinforcement Learning!
Partially Observable Markov Decision Processes
POMDP: (S, A, T, R, Z, O, γ, b0)
- S: set of states
- A: set of actions
- T(s, a, s′) = Pr(s′ | s, a), the transition probabilities
- R(s, a) ∈ R, the immediate rewards
- Z: set of observations
- O(s′, a, z) = Pr(z | s′, a), the observation probabilities
- γ: discount factor
- b0: initial state distribution

Belief monitoring via Bayes' rule:
b_t(s′) = η O(s′, a_{t−1}, z_t) Σ_{s∈S} T(s, a_{t−1}, s′) b_{t−1}(s)

Value function:
V*(b) = max_{a∈A} [ R(b, a) + γ Σ_{z∈Z} Pr(z | b, a) V*(τ(b, a, z)) ]
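The belief update above can be sketched in a few lines. The tiny 2-state, 1-action model here (a static state and an 85%-accurate sensor) is an illustrative toy, not a model from the talk.

```python
import numpy as np

# A minimal sketch of belief monitoring by Bayes' rule for a discrete POMDP.
def belief_update(b, T, O, a, z):
    """b'(s') = eta * O(s',a,z) * sum_s T(s,a,s') * b(s)."""
    b_next = O[:, a, z] * (T[:, a, :].T @ b)  # unnormalized posterior
    return b_next / b_next.sum()              # eta is the normalizer

T = np.zeros((2, 1, 2)); T[0, 0, 0] = T[1, 0, 1] = 1.0   # state is static
O = np.zeros((2, 1, 2)); O[0, 0, 0] = O[1, 0, 1] = 0.85  # correct reading
O[0, 0, 1] = O[1, 0, 0] = 0.15                           # wrong reading

b0 = np.array([0.5, 0.5])
b1 = belief_update(b0, T, O, a=0, z=0)
print(b1)  # [0.85 0.15]
```

After one observation consistent with state 0, the belief shifts from (0.5, 0.5) to (0.85, 0.15), exactly the sensor's accuracy.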
Bayesian Reinforcement Learning
General idea:
- Define prior distributions over all unknown parameters.
- Maintain posteriors via Bayes' rule as experience is acquired.
- Plan with respect to the posterior distribution over models.

This allows us to:
- learn the system while efficiently performing the task,
- trade off exploration and exploitation optimally,
- account for model uncertainty during planning,
- include prior knowledge explicitly.
Bayesian RL in Finite MDPs
In Finite MDPs (T unknown): ([Dearden 99], [Duff 02], [Poupart 06])

To learn T: maintain counts φ^a_{ss′} of the number of times the transition s →(a) s′ is observed, starting from a prior φ0.

The counts define a Dirichlet prior/posterior over T.

Planning according to φ is an MDP problem itself:
- S′: physical state (s ∈ S) + information state (φ)
- T′(s, φ, a, s′, φ′) = Pr(s′, φ′ | s, φ, a)
  = Pr(s′ | s, φ, a) Pr(φ′ | φ, s, a, s′)
  = [φ^a_{ss′} / Σ_{s″∈S} φ^a_{ss″}] I(φ′, φ + δ^a_{ss′})

V*(s, φ) = max_{a∈A} [ R(s, a) + γ Σ_{s′∈S} (φ^a_{ss′} / Σ_{s″∈S} φ^a_{ss″}) V*(s′, φ + δ^a_{ss′}) ]
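The count machinery above is a few lines of code. The sizes and the uniform prior φ0 = 1 here are illustrative choices, not values from the talk.

```python
import numpy as np

# Sketch of the count-based Dirichlet posterior over T for a finite MDP.
n_states, n_actions = 3, 2
phi = np.ones((n_states, n_actions, n_states))   # phi[s, a, s'] counts

def observe(phi, s, a, s_next):
    """Posterior update: add delta^a_{ss'} to the counts."""
    phi = phi.copy()
    phi[s, a, s_next] += 1
    return phi

def mean_T(phi, s, a):
    """Posterior mean Pr(s' | s, phi, a) = phi^a_{ss'} / sum_{s''} phi^a_{ss''}."""
    return phi[s, a] / phi[s, a].sum()

phi = observe(phi, s=0, a=1, s_next=2)
print(mean_T(phi, 0, 1))   # posterior mean: [1/4, 1/4, 1/2]
```

One observed transition moves the posterior mean from uniform (1/3 each) toward the observed successor.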
Bayesian RL in Finite POMDPs
In Finite POMDPs (T, O unknown): ([Ross 07])

Let:
- φ^a_{ss′}: number of times s →(a) s′ is observed,
- ψ^a_{sz}: number of times z is observed in s after doing a.

Given the action-observation sequence, use Bayes' rule to maintain a belief over (s, φ, ψ).

⇒ Decision-making under partial observability of (s, φ, ψ) is a POMDP itself:
- S′: physical state (s ∈ S) + information state (φ, ψ)
- P′(s, φ, ψ, a, s′, φ′, ψ′, z) = Pr(s′, φ′, ψ′, z | s, φ, ψ, a)
  = Pr(s′ | s, φ, a) Pr(z | ψ, s′, a) Pr(φ′ | φ, s, a, s′) Pr(ψ′ | ψ, a, s′, z)
  = [φ^a_{ss′} / Σ_{s″∈S} φ^a_{ss″}] [ψ^a_{s′z} / Σ_{z′∈Z} ψ^a_{s′z′}] I(φ′, φ + δ^a_{ss′}) I(ψ′, ψ + δ^a_{s′z})
Example
Tiger domain with unknown sensor accuracy. Suppose the prior is ψ0 = (5, 3) and b0 = (0.5, 0.5).

The action-observation sequence is: { (Listen, l), (Listen, l), (Listen, l), (Right, -) }

b0: Pr(L, ⟨5, 3⟩) = 1/2, Pr(R, ⟨5, 3⟩) = 1/2
b1: Pr(L, ⟨6, 3⟩) = 5/8, Pr(R, ⟨5, 4⟩) = 3/8
b3: Pr(L, ⟨8, 3⟩) = 7/9, Pr(R, ⟨5, 6⟩) = 2/9
b4: Pr(L, ⟨8, 3⟩) = 7/18, Pr(L, ⟨5, 6⟩) = 2/18, Pr(R, ⟨8, 3⟩) = 7/18, Pr(R, ⟨5, 6⟩) = 2/18
(figure: posterior densities over the sensor accuracy for b0(L,·), b3(L,·), b3(R,·) and b4(L,·))
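The listening updates in this example can be recomputed exactly. A hypothesis ψ = (c, w) gives posterior-mean probability c/(c+w) of hearing the tiger's true side; the door-opening step is omitted here since it carries no information about the sensor.

```python
from fractions import Fraction

# Exact recomputation of the belief over (state, psi) after three (Listen, l) steps.
def listen_update(belief, heard_left):
    new = {}
    for (s, c, w), p in belief.items():
        if (s == 'L') == heard_left:              # observation is correct
            new[(s, c + 1, w)] = p * Fraction(c, c + w)
        else:                                     # observation is wrong
            new[(s, c, w + 1)] = p * Fraction(w, c + w)
    norm = sum(new.values())
    return {k: v / norm for k, v in new.items()}

b = {('L', 5, 3): Fraction(1, 2), ('R', 5, 3): Fraction(1, 2)}
for _ in range(3):                                # three (Listen, l) steps
    b = listen_update(b, heard_left=True)
print(b)  # {('L', 8, 3): Fraction(7, 9), ('R', 5, 6): Fraction(2, 9)}
```

This reproduces the b3 line above: 7/9 on (L, ⟨8, 3⟩) and 2/9 on (R, ⟨5, 6⟩).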
Continuous Domains
In robotics, continuous domains are common (continuous states, continuous actions, continuous observations).

We could discretize the problem and apply the current method, but:
- combinatorial explosion or poor precision,
- it can require lots of training data (every small cell must be visited).

Can we extend Bayesian RL to continuous domains?
Bayesian RL in Continuous Domains ?
We can no longer use counts (a Dirichlet distribution) to learn about the model.

Instead, we assume a parametric form for the transition and observation models.

For instance, in the Gaussian case: S ⊂ R^m, A ⊂ R^n, Z ⊂ R^p, with
s_{t+1} = gT(s_t, a_t, X_t)
z_{t+1} = gO(s_{t+1}, a_t, Y_t)
where X_t ∼ N(µ_X, Σ_X), Y_t ∼ N(µ_Y, Σ_Y), and gT, gO are arbitrary (possibly non-linear) functions.

We assume gT, gO are known, but that the parameters µ_X, Σ_X, µ_Y, Σ_Y are unknown.

The relevant statistics depend on the parametric form.
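One step of this Gaussian generative model can be sketched as follows. The concrete forms of gT (drift by the action) and gO (direct noisy reading) and all parameter values here are illustrative assumptions.

```python
import numpy as np

# Sampling one step of s_{t+1} = gT(s_t, a_t, X_t), z_{t+1} = gO(s_{t+1}, a_t, Y_t).
rng = np.random.default_rng(0)
mu_X, Sigma_X = np.array([0.1, 0.0]), 0.01 * np.eye(2)   # transition noise
mu_Y, Sigma_Y = np.zeros(2), 0.04 * np.eye(2)            # observation noise

def g_T(s, a, x):
    return s + a + x          # known function; possibly non-linear in general

def g_O(s_next, a, y):
    return s_next + y

s, a = np.zeros(2), np.array([0.5, 0.2])
x = rng.multivariate_normal(mu_X, Sigma_X)
y = rng.multivariate_normal(mu_Y, Sigma_Y)
s_next = g_T(s, a, x)         # s_{t+1}
z = g_O(s_next, a, y)         # z_{t+1}
print(s_next, z)
```

Only the noise parameters (µ, Σ) are unknown to the agent; gT and gO are given.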
Bayesian RL in Continuous Domains?
µ, Σ can be learned by maintaining the sample mean µ̂ and the sample covariance Σ̂.

These define a Normal-Wishart posterior over (µ, Σ):
µ | Σ = R ∼ N(µ̂, R/ν)
Σ⁻¹ ∼ Wishart(α, τ⁻¹)
where:
- ν: number of observations for µ̂
- α: degrees of freedom of Σ̂
- τ = αΣ̂

These can be updated easily after observing X = x:
µ̂′ = (νµ̂ + x) / (ν + 1)
ν′ = ν + 1
α′ = α + 1
τ′ = τ + [ν / (ν + 1)] (µ̂ − x)(µ̂ − x)ᵀ
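The update rules above translate directly into code; here they are applied to the navigation prior used later in the talk (ν = 10, α = 9).

```python
import numpy as np

# Normal-Wishart sufficient-statistic update after observing X = x.
def nw_update(mu_hat, nu, alpha, tau, x):
    """(mu_hat, nu, alpha, tau) -> (mu_hat', nu', alpha', tau')."""
    mu_new = (nu * mu_hat + x) / (nu + 1)
    d = (mu_hat - x).reshape(-1, 1)
    tau_new = tau + (nu / (nu + 1)) * (d @ d.T)
    return mu_new, nu + 1, alpha + 1, tau_new

mu_hat = np.array([1.0, 0.0])
nu, alpha = 10, 9
tau = 9 * np.array([[0.04, -0.01], [-0.01, 0.16]])

mu_hat, nu, alpha, tau = nw_update(mu_hat, nu, alpha, tau, np.array([0.8, 0.3]))
print(mu_hat, nu, alpha)   # mean shrinks toward the sample: ~[0.982, 0.027]
```

The posterior mean moves 1/(ν+1) of the way toward the new sample, and τ grows by a rank-one term.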
Bayesian RL in Continuous POMDP
Let's define:
- φ = (µ̂_X, ν_X, α_X, τ_X): the posterior over (µ_X, Σ_X)
- ψ = (µ̂_Y, ν_Y, α_Y, τ_Y): the posterior over (µ_Y, Σ_Y)
- U: the update function of φ, ψ, i.e. U(φ, x) = φ′ and U(ψ, y) = ψ′

Bayes-Adaptive Continuous POMDP: (S′, A′, Z′, P′, R′)
- S′ = S × R^{|X|+|X|²+2} × R^{|Y|+|Y|²+2}
- A′ = A
- Z′ = Z
- P′(s, φ, ψ, a, s′, φ′, ψ′, z) = I(gT(s, a, x), s′) I(gO(s′, a, y), z) I(φ′, U(φ, x)) I(ψ′, U(ψ, y)) f_{X|φ}(x) f_{Y|ψ}(y)
- R′(s, φ, ψ, a) = R(s, a)

where x = (ν_X + 1)µ̂′_X − ν_X µ̂_X and y = (ν_Y + 1)µ̂′_Y − ν_Y µ̂_Y.
Bayesian RL in Continuous POMDP
Monte Carlo belief monitoring (one extra assumption: gO(s, a, ·) is a one-to-one transformation of Y):

1 Sample (s, φ, ψ) ∼ b_t
2 Sample (µ_X, Σ_X) ∼ NW(φ)
3 Sample X ∼ N(µ_X, Σ_X)
4 Compute s′ = gT(s, a_t, X)
5 Find the unique Y s.t. z_{t+1} = gO(s′, a_t, Y)
6 Compute φ′ = U(φ, X), ψ′ = U(ψ, Y)
7 Sample (µ_Y, Σ_Y) ∼ NW(ψ)
8 Add f(Y | µ_Y, Σ_Y) to the particle b_{t+1}(s′, φ′, ψ′)
9 Repeat until K particles are in b_{t+1}
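The loop above can be sketched in one dimension, where the Normal-Wishart reduces to a Normal-Gamma posterior. The additive forms gT(s, a, X) = s + a + X and gO(s′, a, Y) = s′ + Y are illustrative choices that make step 5's inverse trivial (Y = z − s′); this is a simplification, not the paper's exact algorithm.

```python
import numpy as np

# 1-D, numpy-only sketch of Monte Carlo belief monitoring (steps 1-9 above).
rng = np.random.default_rng(1)

def update(stats, v):
    """1-D version of the sufficient-statistic update U."""
    mu_hat, nu, alpha, tau = stats
    return ((nu * mu_hat + v) / (nu + 1), nu + 1, alpha + 1,
            tau + nu / (nu + 1) * (mu_hat - v) ** 2)

def sample_ng(mu_hat, nu, alpha, tau):
    """Draw (mean, variance) from the 1-D Normal-Gamma posterior."""
    prec = rng.gamma(alpha / 2, 2 / tau)               # Sigma^{-1}
    return rng.normal(mu_hat, np.sqrt(1 / (prec * nu))), 1 / prec

def monitor(particles, a, z, K=50):
    new, weights = [], []
    for _ in range(K):
        s, phi, psi = particles[rng.integers(len(particles))]  # 1.
        mu_x, var_x = sample_ng(*phi)                          # 2.
        x = rng.normal(mu_x, np.sqrt(var_x))                   # 3.
        s_next = s + a + x                                     # 4. s' = gT(s, a, X)
        y = z - s_next                                         # 5. invert gO
        phi2, psi2 = update(phi, x), update(psi, y)            # 6.
        mu_y, var_y = sample_ng(*psi)                          # 7.
        w = (np.exp(-0.5 * (y - mu_y) ** 2 / var_y)
             / np.sqrt(2 * np.pi * var_y))                     # 8. f(Y | mu_Y, Sigma_Y)
        new.append((s_next, phi2, psi2))
        weights.append(w)
    return new, np.array(weights) / sum(weights)

prior = (0.0, 10, 9, 0.9)            # (mu_hat, nu, alpha, tau), illustrative
particles = [(0.0, prior, prior)]
particles, w = monitor(particles, a=0.5, z=0.6)
print(len(particles), w.sum())
```

Each particle carries its own posterior statistics (φ, ψ), so the filter tracks state and model uncertainty jointly.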
Bayesian RL in Continuous POMDP
Monte Carlo Online Planning (Receding Horizon Control):

(figure: lookahead search tree expanding the current belief b0 into successor beliefs b1, b2, b3, … over sampled actions a1, …, an and observations o1, …, on)
Experiments
Simple Robot Navigation Task:
- S: (x, y) position
- A: (v, θ) (velocity v ∈ [0, 1] and angle θ ∈ [0, 2π])
- Z: noisy (x, y) position

gT(s, a, X) = s + v [cos θ, −sin θ; sin θ, cos θ] X
gO(s′, a, Y) = s′ + Y
R(s, a) = I(||s − s_GOAL||₂ < 0.25)
γ = 0.85
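The transition function above rotates the noise sample X into the commanded heading and scales it by the velocity; a small sketch:

```python
import numpy as np

# gT(s, a, X) = s + v * R(theta) @ X for the navigation task.
def g_T(s, v, theta, x):
    rot = np.array([[np.cos(theta), -np.sin(theta)],
                    [np.sin(theta),  np.cos(theta)]])
    return s + v * (rot @ x)

# With x = (1, 0) and theta = pi/2, the displacement rotates onto the y-axis.
s_next = g_T(np.zeros(2), 1.0, np.pi / 2, np.array([1.0, 0.0]))
print(s_next)   # ~ [0, 1]
```

Since X has non-zero mean (see the next slide's parameters), the robot drifts in the commanded direction on average, with correlated noise around it.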
Robot Navigation Task
We choose the exact parameters:
µ_X = [0.8, 0.3]ᵀ, µ_Y = [0, 0]ᵀ
Σ_X = [0.04, −0.01; −0.01, 0.01], Σ_Y = [0.01, 0; 0, 0.01]

starting with a prior based on 10 "artificial" samples:
µ̂_X = [1, 0]ᵀ, µ̂_Y = [0, 0]ᵀ
Σ̂_X = [0.04, −0.01; −0.01, 0.16], Σ̂_Y = [0.16, 0; 0, 0.16]

such that φ0 = (µ̂_X, 10, 9, 9Σ̂_X), ψ0 = (µ̂_Y, 10, 9, 9Σ̂_Y).

Each time the robot reaches the goal, a new goal is chosen randomly at 5 distance units from the previous goal.
Robot Navigation Task

Average evolution of the return over time:

(figure: average return vs. training steps over 0-250 steps, comparing the Prior model, the Exact model, and Learning)
Robot Navigation Task
Average accuracy of the model over time :
(figure: WL1 vs. training steps over 0-250 steps)

Model accuracy is measured as follows:
WL1(b) = Σ_{(s,φ,ψ)} b(s, φ, ψ) [||µ_φ − µ_X||₁ + ||Σ_φ − Σ_X||₁ + ||µ_ψ − µ_Y||₁ + ||Σ_ψ − Σ_Y||₁]
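The WL1 measure is the belief-weighted L1 distance between each particle's posterior-mean parameters and the true ones. A sketch, restricted to the transition-noise (X) terms for brevity, with the particle weights playing the role of b:

```python
import numpy as np

# Belief-weighted L1 model error over particles (X-side terms only).
mu_X = np.array([0.8, 0.3])
Sigma_X = np.array([[0.04, -0.01], [-0.01, 0.01]])

def wl1(particles):
    """particles: list of (weight, mu_phi, Sigma_phi)."""
    return sum(w * (np.abs(mu - mu_X).sum() + np.abs(Sig - Sigma_X).sum())
               for w, mu, Sig in particles)

particles = [
    (0.5, mu_X.copy(), Sigma_X.copy()),            # exact particle: distance 0
    (0.5, np.array([1.0, 0.0]), Sigma_X.copy()),   # mu off by 0.2 + 0.3
]
print(wl1(particles))   # ~ 0.25
```

WL1 = 0 would mean the whole belief sits on the true parameters; the measure decreases as the posteriors concentrate around them.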
Conclusion
We presented a new framework for planning in partially observable and continuous domains with uncertain model parameters.

The optimal policy maximizes the long-term return given the prior over model parameters.

Monte Carlo methods can be used for more tractable approximate belief monitoring and planning.

Interesting future applications include human-computer interaction and robotics.