TRANSCRIPT
Safe Reinforcement Learning Philip S. Thomas
Stanford CS234: Reinforcement Learning, Guest Lecture
May 24, 2017
Lecture overview
• What makes a reinforcement learning algorithm safe?
• Notation
• Creating a safe reinforcement learning algorithm
  • Off-policy policy evaluation (OPE)
  • High-confidence off-policy policy evaluation (HCOPE)
  • Safe policy improvement (SPI)
• Empirical results
• Research directions
What does it mean for a reinforcement learning algorithm to be safe?
Changing the objective
[Figure: a gridworld with rewards of −50, 0, and +20 along two different paths, comparing Policy 1 and Policy 2]
Changing the objective
• Policy 1:
  • Reward = 0 with probability 0.999999
  • Reward = 10⁹ with probability 1 − 0.999999
  • Expected reward approximately 1000
• Policy 2:
  • Reward = 999 with probability 0.5
  • Reward = 1000 with probability 0.5
  • Expected reward 999.5
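The arithmetic above is easy to check numerically; a minimal sketch (the probabilities and rewards are taken from the slide), which also shows the enormous variance gap that motivates changing the objective:

```python
# Compare the two policies from the slide: nearly equal expected reward,
# wildly different risk profiles.
policy_1 = [(0.999999, 0.0), (1 - 0.999999, 1e9)]   # (probability, reward)
policy_2 = [(0.5, 999.0), (0.5, 1000.0)]

def mean(dist):
    return sum(p * r for p, r in dist)

def variance(dist):
    m = mean(dist)
    return sum(p * (r - m) ** 2 for p, r in dist)

print(mean(policy_1))      # ~1000: slightly higher expected reward
print(mean(policy_2))      # 999.5
print(variance(policy_1))  # enormous: almost all probability mass is on 0
print(variance(policy_2))  # tiny: 0.25
```

Maximizing expected reward alone prefers Policy 1, even though almost every run of Policy 1 earns nothing.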
Another notion of safety
Another notion of safety (Munos et al.)
Another notion of safety
The Problem
• If you apply an existing method, do you have confidence that it will work?
Reinforcement learning successes
A property of many real applications
• Deploying “bad” policies can be costly or dangerous.
Deploying bad policies can be costly
Deploying bad policies can be dangerous
What property should a safe algorithm have?
• Guaranteed to work on the first try
  • “I guarantee that with probability at least 1 − 𝛿, I will not change your policy to one that is worse than the current policy.”
• You get to choose 𝛿
• This guarantee is not contingent on the tuning of any hyperparameters
Lecture overview
• What makes a reinforcement learning algorithm safe?
• Notation
• Creating a safe reinforcement learning algorithm
  • Off-policy policy evaluation (OPE)
  • High-confidence off-policy policy evaluation (HCOPE)
  • Safe policy improvement (SPI)
• Empirical results
• Research directions
Notation
• Policy, 𝜋: 𝜋(𝑎|𝑠) = Pr(𝐴_𝑡 = 𝑎 | 𝑆_𝑡 = 𝑠)
• History: 𝐻 = (𝑠_1, 𝑎_1, 𝑟_1, 𝑠_2, 𝑎_2, 𝑟_2, …, 𝑠_𝐿, 𝑎_𝐿, 𝑟_𝐿)
• Historical data: 𝐷 = (𝐻_1, 𝐻_2, …, 𝐻_𝑛)
• Historical data comes from a behavior policy, 𝜋b
• Objective: 𝐽(𝜋) = 𝐄[∑_{𝑡=1}^{𝐿} 𝛾^𝑡 𝑅_𝑡 | 𝜋]
[Diagram: the agent sends an action, 𝑎, to the environment; the environment returns the next state, 𝑠, and a reward, 𝑟]
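Under this notation, the return of one history is just a discounted sum; a minimal sketch (representing a history as a list of (s, a, r) triples is an assumption about the data layout, not part of the lecture):

```python
# Discounted return of one history H = [(s1, a1, r1), ..., (sL, aL, rL)].
# J(pi) is the expectation of this quantity over histories generated by pi.
def discounted_return(history, gamma=0.9):
    # The lecture's objective discounts with gamma^t starting at t = 1.
    return sum(gamma ** t * r for t, (s, a, r) in enumerate(history, start=1))

H = [("s1", "a1", 1.0), ("s2", "a2", 0.0), ("s3", "a3", 10.0)]
print(discounted_return(H))  # 0.9*1 + 0.81*0 + 0.729*10 ≈ 8.19
```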
Safe reinforcement learning algorithm
• Reinforcement learning algorithm, 𝑎
• Historical data, 𝐷, which is a random variable
• Policy produced by the algorithm, 𝑎(𝐷), which is a random variable
• A safe reinforcement learning algorithm, 𝑎, satisfies:
Pr(𝐽(𝑎(𝐷)) ≥ 𝐽(𝜋b)) ≥ 1 − 𝛿
or, in general: Pr(𝐽(𝑎(𝐷)) ≥ 𝐽min) ≥ 1 − 𝛿
Lecture overview
• What makes a reinforcement learning algorithm safe?
• Notation
• Creating a safe reinforcement learning algorithm
  • Off-policy policy evaluation (OPE)
  • High-confidence off-policy policy evaluation (HCOPE)
  • Safe policy improvement (SPI)
• Empirical results
• Research directions
Creating a safe reinforcement learning algorithm
• Off-policy policy evaluation (OPE)
  • For any evaluation policy, 𝜋e, convert historical data, 𝐷, into 𝑛 independent and unbiased estimates of 𝐽(𝜋e)
• High-confidence off-policy policy evaluation (HCOPE)
  • Use a concentration inequality to convert the 𝑛 independent and unbiased estimates of 𝐽(𝜋e) into a 1 − 𝛿 confidence lower bound on 𝐽(𝜋e)
• Safe policy improvement (SPI)
  • Use the HCOPE method to create a safe reinforcement learning algorithm, 𝑎
Off-policy policy evaluation (OPE)
Input: historical data, 𝐷, and a proposed policy, 𝜋e
Output: an estimate of 𝐽(𝜋e)
Importance Sampling (Intuition)
• A history that is probable under the behavior policy, 𝜋b, may be improbable under the evaluation policy, 𝜋e, and vice versa
• On-policy Monte Carlo estimate: 𝐽(𝜋e) ≈ (1/𝑛) ∑_{𝑖=1}^{𝑛} ∑_{𝑡=1}^{𝐿} 𝛾^𝑡 𝑅_𝑡^𝑖
• Importance sampling estimate: 𝐽(𝜋e) ≈ (1/𝑛) ∑_{𝑖=1}^{𝑛} 𝑤_𝑖 ∑_{𝑡=1}^{𝐿} 𝛾^𝑡 𝑅_𝑡^𝑖, where
𝑤_𝑖 = Pr(𝐻_𝑖 | 𝜋e) / Pr(𝐻_𝑖 | 𝜋b) = ∏_{𝑡=1}^{𝐿} 𝜋e(𝑎_𝑡|𝑠_𝑡) / 𝜋b(𝑎_𝑡|𝑠_𝑡)
• Reminder:
  • History, 𝐻 = (𝑠_1, 𝑎_1, 𝑟_1, 𝑠_2, 𝑎_2, 𝑟_2, …, 𝑠_𝐿, 𝑎_𝐿, 𝑟_𝐿)
  • Objective, 𝐽(𝜋e) = 𝐄[∑_{𝑡=1}^{𝐿} 𝛾^𝑡 𝑅_𝑡 | 𝜋e]
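The importance sampling estimator above fits in a few lines; in this sketch, histories are assumed to be lists of (s, a, r) triples and policies are modeled as callables returning 𝜋(𝑎|𝑠) (both layout choices are assumptions, not part of the lecture):

```python
from math import prod

# Importance sampling (IS) estimate of J(pi_e) from histories collected
# under pi_b. Each history is [(s1, a1, r1), ..., (sL, aL, rL)].
def is_estimate(histories, pi_e, pi_b, gamma=1.0):
    total = 0.0
    for h in histories:
        # Importance weight w_i: product of action-probability ratios.
        w = prod(pi_e(a, s) / pi_b(a, s) for s, a, _ in h)
        ret = sum(gamma ** t * r for t, (_, _, r) in enumerate(h, start=1))
        total += w * ret
    return total / len(histories)

# Sanity check: if pi_e == pi_b, every weight is 1 and IS reduces to the
# on-policy Monte Carlo average of returns.
pi = lambda a, s: 0.5
hist = [[("s", 0, 1.0)], [("s", 1, 3.0)]]
print(is_estimate(hist, pi, pi))  # (1.0 + 3.0) / 2 = 2.0
```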
Importance sampling (History)
• Kahn, H., Marshall, A. W. (1953). Methods of reducing sample size in Monte Carlo computations. Journal of the Operations Research Society of America, 1(5):263–278
  • Let 𝑋 = 0 with probability 1 − 10⁻¹⁰ and 𝑋 = 10¹⁰ with probability 10⁻¹⁰
  • 𝐄[𝑋] = 1
  • A Monte Carlo estimate from 𝑛 ≪ 10¹⁰ samples of 𝑋 is almost always zero
  • Idea: sample 𝑋 from some other distribution and use importance sampling to “correct” the estimate
  • Can produce lower-variance estimates.
• Josiah Hanna et al., ICML 2017 (to appear).
Importance sampling (History, continued)
• Precup, D., Sutton, R. S., Singh, S. (2000). Eligibility traces for off-policy policy evaluation. In Proceedings of the 17th International Conference on Machine Learning, pp. 759–766. Morgan Kaufmann
Importance sampling (Proof)
• Estimate 𝐄_𝑝[𝑓(𝑋)] given a sample of 𝑋 ~ 𝑞
• Let 𝑃 = supp(𝑝), 𝑄 = supp(𝑞), and 𝐹 = supp(𝑓)
• Importance sampling estimate: (𝑝(𝑋)/𝑞(𝑋)) 𝑓(𝑋)
𝐄_𝑞[(𝑝(𝑋)/𝑞(𝑋)) 𝑓(𝑋)] = ∑_{𝑥∈𝑄} 𝑞(𝑥) (𝑝(𝑥)/𝑞(𝑥)) 𝑓(𝑥)
= ∑_{𝑥∈𝑃∩𝑄} 𝑝(𝑥) 𝑓(𝑥)
= ∑_{𝑥∈𝑃} 𝑝(𝑥) 𝑓(𝑥) − ∑_{𝑥∈𝑃∖𝑄} 𝑝(𝑥) 𝑓(𝑥)
Importance sampling (Proof)
• Assume 𝑃 ⊆ 𝑄 (the assumption can be relaxed to 𝑃 ∖ 𝑄 being disjoint from 𝐹, i.e., 𝑓 = 0 wherever 𝑝 > 0 but 𝑞 = 0)
• Importance sampling is an unbiased estimator of 𝐄_𝑝[𝑓(𝑋)]:
𝐄_𝑞[(𝑝(𝑋)/𝑞(𝑋)) 𝑓(𝑋)] = ∑_{𝑥∈𝑃} 𝑝(𝑥) 𝑓(𝑥) − ∑_{𝑥∈𝑃∖𝑄} 𝑝(𝑥) 𝑓(𝑥)
= ∑_{𝑥∈𝑃} 𝑝(𝑥) 𝑓(𝑥)    (the subtracted sum is empty since 𝑃 ⊆ 𝑄)
= 𝐄_𝑝[𝑓(𝑋)]
![Page 29: Safe Reinforcement Learning - Stanford University€¦ · Creating a safe reinforcement learning algorithm •Off-policy policy evaluation (OPE) •For any evaluation policy, e, Convert](https://reader036.vdocument.in/reader036/viewer/2022070714/5ed400698d46b66d22634051/html5/thumbnails/29.jpg)
Importance sampling (Proof)
• Assume 𝑓(𝑥) ≥ 0 for all 𝑥
• Importance sampling is a negative-bias estimator of 𝐄_𝑝[𝑓(𝑋)]:
𝐄_𝑞[(𝑝(𝑋)/𝑞(𝑋)) 𝑓(𝑋)] = ∑_{𝑥∈𝑃} 𝑝(𝑥) 𝑓(𝑥) − ∑_{𝑥∈𝑃∖𝑄} 𝑝(𝑥) 𝑓(𝑥)
≤ ∑_{𝑥∈𝑃} 𝑝(𝑥) 𝑓(𝑥)
= 𝐄_𝑝[𝑓(𝑋)]
Importance sampling (reminder)
IS(𝐷) = (1/𝑛) ∑_{𝑖=1}^{𝑛} (∏_{𝑡=1}^{𝐿} 𝜋e(𝑎_𝑡|𝑠_𝑡)/𝜋b(𝑎_𝑡|𝑠_𝑡)) ∑_{𝑡=1}^{𝐿} 𝛾^𝑡 𝑅_𝑡^𝑖
𝐄[IS(𝐷)] = 𝐽(𝜋e)
Creating a safe reinforcement learning algorithm
• Off-policy policy evaluation (OPE)
  • For any evaluation policy, 𝜋e, convert historical data, 𝐷, into 𝑛 independent and unbiased estimates of 𝐽(𝜋e)
• High-confidence off-policy policy evaluation (HCOPE)
  • Use a concentration inequality to convert the 𝑛 independent and unbiased estimates of 𝐽(𝜋e) into a 1 − 𝛿 confidence lower bound on 𝐽(𝜋e)
• Safe policy improvement (SPI)
  • Use the HCOPE method to create a safe reinforcement learning algorithm, 𝑎
High confidence off-policy policy evaluation (HCOPE)
Input: historical data, 𝐷, and a proposed policy, 𝜋e
Output: a 1 − 𝛿 confidence lower bound on 𝐽(𝜋e), which holds with probability at least 1 − 𝛿
Hoeffding’s inequality
• Let 𝑋_1, …, 𝑋_𝑛 be 𝑛 independent, identically distributed random variables such that 𝑋_𝑖 ∈ [0, 𝑏]
• Then with probability at least 1 − 𝛿:
𝐄[𝑋_𝑖] ≥ (1/𝑛) ∑_{𝑖=1}^{𝑛} 𝑋_𝑖 − 𝑏 √(ln(1/𝛿) / (2𝑛))
• Here, each 𝑋_𝑖 is an importance-weighted return: 𝑤_𝑖 ∑_{𝑡=1}^{𝐿} 𝛾^𝑡 𝑅_𝑡^𝑖
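The lower bound above is straightforward to compute; a minimal sketch (the sample values and the choices of 𝑏 and 𝛿 are illustrative):

```python
import math

# 1 - delta confidence lower bound on E[X_i] via Hoeffding's inequality,
# for i.i.d. samples xs with each x in [0, b].
def hoeffding_lower_bound(xs, b, delta):
    n = len(xs)
    return sum(xs) / n - b * math.sqrt(math.log(1 / delta) / (2 * n))

xs = [0.2, 0.4, 0.3, 0.5, 0.1] * 20          # 100 samples in [0, 1]
print(hoeffding_lower_bound(xs, b=1.0, delta=0.05))
# mean 0.3 minus a slack term of about 0.12

# The bound degrades linearly in b -- a preview of why raw importance
# weights, whose range b can be astronomically large, cause trouble.
print(hoeffding_lower_bound(xs, b=1000.0, delta=0.05))
```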
Creating a safe reinforcement learning algorithm
• Off-policy policy evaluation (OPE)
  • For any evaluation policy, 𝜋e, convert historical data, 𝐷, into 𝑛 independent and unbiased estimates of 𝐽(𝜋e)
• High-confidence off-policy policy evaluation (HCOPE)
  • Use a concentration inequality to convert the 𝑛 independent and unbiased estimates of 𝐽(𝜋e) into a 1 − 𝛿 confidence lower bound on 𝐽(𝜋e)
• Safe policy improvement (SPI)
  • Use the HCOPE method to create a safe reinforcement learning algorithm, 𝑎
Safe policy improvement (SPI)
Input: historical data, 𝐷
Output: a new policy, 𝜋, or No Solution Found (the guarantee holds with probability at least 1 − 𝛿)
Safe policy improvement (SPI)
• Split the historical data into a training set (20%) and a testing set (80%)
• Use the training set to choose a candidate policy, 𝜋
• Safety test: is the 1 − 𝛿 confidence lower bound on 𝐽(𝜋), computed on the testing set, larger than 𝐽(𝜋cur)?
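A sketch of the split-and-test procedure, assuming Hoeffding-style bounds; `safety_test` and the stand-in uniform "returns" are illustrative assumptions, not the lecture's actual implementation:

```python
import math, random

# Safe policy improvement (SPI) sketch: split the data, pick a candidate on
# the training set, then deploy it only if a high-confidence lower bound
# computed on the held-out testing set beats the current policy.
# candidate_returns stands in for the candidate policy's importance-weighted
# returns -- in the real algorithm these come from OPE on the testing set.
def safety_test(candidate_returns, j_current, b, delta):
    n = len(candidate_returns)
    lower = sum(candidate_returns) / n - b * math.sqrt(math.log(1 / delta) / (2 * n))
    return lower > j_current

random.seed(0)
data = [random.uniform(0.4, 0.6) for _ in range(1000)]  # stand-in returns in [0, 1]
train_set, test_set = data[:200], data[200:]            # 20% / 80% split

# Candidate selection (which would use train_set) is trivialized here; the
# point is that the testing set, untouched during selection, is what the
# 1 - delta guarantee rests on.
if safety_test(test_set, j_current=0.3, b=1.0, delta=0.05):
    print("deploy the candidate policy")
else:
    print("No Solution Found")
```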
Creating a safe reinforcement learning algorithm
• Off-policy policy evaluation (OPE)
  • For any evaluation policy, 𝜋e, convert historical data, 𝐷, into 𝑛 independent and unbiased estimates of 𝐽(𝜋e)
• High-confidence off-policy policy evaluation (HCOPE)
  • Use a concentration inequality to convert the 𝑛 independent and unbiased estimates of 𝐽(𝜋e) into a 1 − 𝛿 confidence lower bound on 𝐽(𝜋e)
• Safe policy improvement (SPI)
  • Use the HCOPE method to create a safe reinforcement learning algorithm, 𝑎
WON’T WORK
Off-policy policy evaluation (revisited)
• Importance sampling (IS):
IS(𝐷) = (1/𝑛) ∑_{𝑖=1}^{𝑛} (∏_{𝑡=1}^{𝐿} 𝜋e(𝑎_𝑡|𝑠_𝑡)/𝜋b(𝑎_𝑡|𝑠_𝑡)) ∑_{𝑡=1}^{𝐿} 𝛾^𝑡 𝑅_𝑡^𝑖
• Per-decision importance sampling (PDIS):
PDIS(𝐷) = ∑_{𝑡=1}^{𝐿} 𝛾^𝑡 (1/𝑛) ∑_{𝑖=1}^{𝑛} (∏_{𝜏=1}^{𝑡} 𝜋e(𝑎_𝜏|𝑠_𝜏)/𝜋b(𝑎_𝜏|𝑠_𝜏)) 𝑅_𝑡^𝑖
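PDIS weights each reward only by the action-probability ratios up to that time step; a sketch in the same style as the IS estimator (the history layout and policy signature are assumptions):

```python
# Per-decision importance sampling (PDIS): reward R_t is weighted by the
# running product of ratios only up to time t, which typically lowers
# variance relative to weighting every reward by the full-trajectory ratio.
def pdis_estimate(histories, pi_e, pi_b, gamma=1.0):
    total = 0.0
    for h in histories:
        w = 1.0
        for t, (s, a, r) in enumerate(h, start=1):
            w *= pi_e(a, s) / pi_b(a, s)   # running product up to time t
            total += gamma ** t * w * r
    return total / len(histories)

# Sanity check: with pi_e == pi_b every running weight is 1, so PDIS
# reduces to the average discounted return, just as IS does.
pi = lambda a, s: 0.5
hist = [[("s", 0, 1.0), ("s", 1, 2.0)]]
print(pdis_estimate(hist, pi, pi))  # 1.0 + 2.0 = 3.0
```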
Off-policy policy evaluation (revisited)
• Importance sampling (IS):
IS(𝐷) = (1/𝑛) ∑_{𝑖=1}^{𝑛} 𝑤_𝑖 ∑_{𝑡=1}^{𝐿} 𝛾^𝑡 𝑅_𝑡^𝑖
• Weighted importance sampling (WIS):
WIS(𝐷) = (1/∑_{𝑖=1}^{𝑛} 𝑤_𝑖) ∑_{𝑖=1}^{𝑛} 𝑤_𝑖 ∑_{𝑡=1}^{𝐿} 𝛾^𝑡 𝑅_𝑡^𝑖
Off-policy policy evaluation (revisited)
• Weighted importance sampling (WIS):
WIS(𝐷) = (1/∑_{𝑖=1}^{𝑛} 𝑤_𝑖) ∑_{𝑖=1}^{𝑛} 𝑤_𝑖 ∑_{𝑡=1}^{𝐿} 𝛾^𝑡 𝑅_𝑡^𝑖
• Not unbiased. When 𝑛 = 1, 𝐄[WIS(𝐷)] = 𝐽(𝜋b)
• Strongly consistent estimator of 𝐽(𝜋e)
  • i.e., Pr(lim_{𝑛→∞} WIS(𝐷) = 𝐽(𝜋e)) = 1
• If:
  • Finite horizon
  • One behavior policy, or bounded rewards
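WIS normalizes by the sum of the weights instead of by 𝑛, and the 𝑛 = 1 bias noted above falls out immediately, since the single weight cancels (history layout and policy signature are assumptions, as before):

```python
from math import prod

# Weighted importance sampling (WIS): normalize by the sum of the importance
# weights rather than by n. Biased, but strongly consistent and usually far
# lower variance than ordinary IS.
def wis_estimate(histories, pi_e, pi_b, gamma=1.0):
    weights, returns = [], []
    for h in histories:
        weights.append(prod(pi_e(a, s) / pi_b(a, s) for s, a, _ in h))
        returns.append(sum(gamma ** t * r for t, (_, _, r) in enumerate(h, start=1)))
    return sum(w * g for w, g in zip(weights, returns)) / sum(weights)

# With n = 1 the weight cancels: WIS returns the behavior policy's own
# observed return regardless of pi_e -- exactly the source of the bias.
pi_b = lambda a, s: 0.5
pi_e = lambda a, s: 0.9
hist = [[("s", 1, 4.0)]]
print(wis_estimate(hist, pi_e, pi_b))  # 4.0, independent of pi_e
```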
Off-policy policy evaluation (revisited)
• Weighted per-decision importance sampling
  • Also called consistent weighted per-decision importance sampling (CWPDIS)
• A fun exercise!
Control variates
• Given: 𝑋
• Estimate: 𝜇 = 𝐄[𝑋]
• Estimator: 𝜇̂ = 𝑋
• Unbiased: 𝐄[𝜇̂] = 𝐄[𝑋] = 𝜇
• Variance: Var(𝜇̂) = Var(𝑋)
Control variates
• Given: 𝑋, 𝑌, 𝐄[𝑌]
• Estimate: 𝜇 = 𝐄[𝑋]
• Estimator: 𝜇̂ = 𝑋 − 𝑌 + 𝐄[𝑌]
• Unbiased: 𝐄[𝜇̂] = 𝐄[𝑋 − 𝑌 + 𝐄[𝑌]] = 𝐄[𝑋] − 𝐄[𝑌] + 𝐄[𝑌] = 𝐄[𝑋] = 𝜇
• Variance: Var(𝜇̂) = Var(𝑋 − 𝑌 + 𝐄[𝑌]) = Var(𝑋 − 𝑌) = Var(𝑋) + Var(𝑌) − 2Cov(𝑋, 𝑌)
• Lower variance (than 𝜇̂ = 𝑋) if 2Cov(𝑋, 𝑌) > Var(𝑌)
• We call 𝑌 a control variate.
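A quick numerical illustration of the variance calculation above (the distributions chosen here are arbitrary stand-ins):

```python
import random, statistics

random.seed(1)

# Control-variate demo: estimate mu = E[X] where X = Z + noise, using the
# correlated variable Y = Z (whose mean we know, E[Y] = 0.5) as a control.
plain, controlled = [], []
for _ in range(10000):
    z = random.random()               # Y = z, with E[Y] = 0.5
    x = z + random.gauss(0, 0.1)      # X, so mu = E[X] = 0.5
    plain.append(x)                   # estimator mu_hat = X
    controlled.append(x - z + 0.5)    # estimator mu_hat = X - Y + E[Y]

# Both are unbiased for 0.5; the controlled estimator has far lower variance
# because 2 Cov(X, Y) = 2 Var(Z) > Var(Y) = Var(Z).
print(statistics.mean(plain), statistics.variance(plain))
print(statistics.mean(controlled), statistics.variance(controlled))
```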
Off-policy policy evaluation (revisited)
• Idea: add a control variate to importance sampling estimators
  • 𝑋 is the importance sampling estimator
  • 𝑌 is a control variate built from an approximate model of the MDP
  • 𝐄[𝑌] = 0 in this case
• PDISCV(𝐷) = PDIS(𝐷) − CV(𝐷)
• Called the doubly robust estimator (Jiang and Li, 2015)
  • Robust to 1) a poor approximate model, and 2) error in the estimates of 𝜋b
  • If the model is poor, the estimates are still unbiased
  • If the sampling policy is unknown but the model is good, the MSE will still be low
• DR(𝐷) = PDISCV(𝐷)
• Non-recursive and weighted forms, as well as the control variate view, are provided by Thomas and Brunskill (2016)
Off-policy policy evaluation (revisited)
DR(𝐷) = (1/𝑛) ∑_{𝑖=1}^{𝑛} ∑_{𝑡=0}^{∞} 𝛾^𝑡 (𝑤_𝑡^𝑖 (𝑅_𝑡^𝑖 − 𝑞^{𝜋e}(𝑆_𝑡^𝑖, 𝐴_𝑡^𝑖)) + 𝑤_{𝑡−1}^𝑖 𝑣^{𝜋e}(𝑆_𝑡^𝑖))
where 𝑤_𝑡^𝑖 = ∏_{𝜏=1}^{𝑡} 𝜋e(𝑎_𝜏|𝑠_𝜏)/𝜋b(𝑎_𝜏|𝑠_𝜏) is the per-decision importance weight (PDIS), with 𝑤_0^𝑖 = 1
• Recall: we want the control variate, 𝑌, to cancel with 𝑋: 𝑅 − 𝑞(𝑆, 𝐴) + 𝛾𝑣(𝑆′)
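A sketch of the doubly robust estimator; `q_hat` and `v_hat` are hypothetical names for the approximate model's action-value and state-value functions, and indexing runs from 𝑡 = 1 over the finite history to match the rest of the lecture's notation (both are assumptions):

```python
# Doubly robust (DR) estimator: per-decision importance sampling with the
# approximate model's q and v used as a control variate. q_hat(s, a) and
# v_hat(s) come from an approximate model of the MDP.
def dr_estimate(histories, pi_e, pi_b, q_hat, v_hat, gamma=1.0):
    total = 0.0
    for h in histories:
        w_prev = 1.0                   # w_{t-1}, with w_0 = 1
        for t, (s, a, r) in enumerate(h, start=1):
            w = w_prev * pi_e(a, s) / pi_b(a, s)
            total += gamma ** t * (w * (r - q_hat(s, a)) + w_prev * v_hat(s))
            w_prev = w
    return total / len(histories)

# Sanity check: with a zero model (q_hat = v_hat = 0) the control variate
# vanishes and DR collapses to PDIS.
pi = lambda a, s: 0.5
zero = lambda *args: 0.0
hist = [[("s", 0, 1.0), ("s", 1, 2.0)]]
print(dr_estimate(hist, pi, pi, zero, zero))  # 3.0, matching PDIS
```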
Empirical Results (Gridworld)
[Plot: mean squared error (0.001 to 10,000, log scale) vs. number of episodes, 𝑛 (2 to 2,000), for IS and AM. AM = approximate model; called the direct method by Dudík (2011) and an indirect method by Sutton and Barto (1998).]
Empirical Results (Gridworld)
[Plot: mean squared error vs. number of episodes, 𝑛 (2 to 2,000), for IS, PDIS, and AM]
Empirical Results (Gridworld)
[Plot: mean squared error vs. number of episodes, 𝑛 (2 to 2,000), for IS, PDIS, DR, and AM]
Empirical Results (Gridworld)
[Plot: mean squared error vs. number of episodes, 𝑛 (2 to 2,000), for IS, PDIS, WIS, CWPDIS, DR, and AM]
Empirical Results (Gridworld)
[Plot: mean squared error vs. number of episodes, 𝑛 (2 to 2,000), for IS, PDIS, WIS, CWPDIS, DR, AM, and WDR]
Off-policy policy evaluation (revisited)
• What if supp(𝜋e) ⊂ supp(𝜋b)?
  • There is a state-action pair, (𝑠, 𝑎), such that 𝜋e(𝑎|𝑠) = 0 but 𝜋b(𝑎|𝑠) ≠ 0
• If we see a history where (𝑠, 𝑎) occurs, what weight should we give it?
IS(𝐷) = (1/𝑛) ∑_{𝑖=1}^{𝑛} (∏_{𝑡=1}^{𝐿} 𝜋e(𝑎_𝑡|𝑠_𝑡)/𝜋b(𝑎_𝑡|𝑠_𝑡)) ∑_{𝑡=1}^{𝐿} 𝛾^𝑡 𝑅_𝑡^𝑖
Off-policy policy evaluation (revisited)
• What if there are zero samples (𝑛 = 0)?
  • The importance sampling estimate is undefined
• What if no samples are in supp(𝜋e) (or supp(𝑝) in general)?
  • Importance sampling says: the estimate is zero
  • Alternate approach: undefined
• The importance sampling estimator is unbiased if 𝑛 > 0
• The alternate approach is unbiased given that at least one sample is in the support of 𝑝
• The alternate approach is detailed in Importance Sampling with Unequal Support (Thomas and Brunskill, AAAI 2017)
Off-policy policy evaluation (revisited)
Off-policy policy evaluation (revisited)
• Thomas et al. Predictive Off-Policy Policy Evaluation for Nonstationary Decision Problems, with Applications to Digital Marketing (AAAI 2017)
Off-policy policy evaluation (revisited)
• Thomas and Brunskill. Data-Efficient Off-Policy Policy Evaluation for Reinforcement Learning (ICML 2016)
[Plot 1: mean squared error vs. number of episodes, 𝑛 (2 to 2,000), for IS, PDIS, WIS, CWPDIS, DR, AM, WDR, and MAGIC]
[Plot 2: mean squared error vs. number of episodes, 𝑛 (1 to 10,000), for IS, DR, AM, WDR, and MAGIC]
Creating a safe reinforcement learning algorithm
• Off-policy policy evaluation (OPE)
  • For any evaluation policy, 𝜋e, convert historical data, 𝐷, into 𝑛 independent and unbiased estimates of 𝐽(𝜋e)
• High-confidence off-policy policy evaluation (HCOPE)
  • Use a concentration inequality to convert the 𝑛 independent and unbiased estimates of 𝐽(𝜋e) into a 1 − 𝛿 confidence lower bound on 𝐽(𝜋e)
• Safe policy improvement (SPI)
  • Use the HCOPE method to create a safe reinforcement learning algorithm, 𝑎
High-confidence off-policy policy evaluation (revisited)
• Consider using IS + Hoeffding’s inequality for HCOPE on mountain car
• Natural Temporal Difference Learning, Dabney and Thomas, 2014
High-confidence off-policy policy evaluation (revisited)
• Using 100,000 trajectories
• The evaluation policy’s true performance is 0.19 ∈ [0, 1]
• We get a 95% confidence lower bound of: −5,831,000
What went wrong?
𝑤_𝑖 = ∏_{𝑡=1}^{𝐿} 𝜋e(𝑎_𝑡|𝑠_𝑡) / 𝜋b(𝑎_𝑡|𝑠_𝑡)
What went wrong?
𝐄[𝑋_𝑖] ≥ (1/𝑛) ∑_{𝑖=1}^{𝑛} 𝑋_𝑖 − 𝑏 √(ln(1/𝛿) / (2𝑛))
• 𝑏 ≈ 10^9.4
• Largest observed importance-weighted return: 316.
High-confidence off-policy policy evaluation (revisited)
• Removing the upper tail only decreases the expected value.
High-confidence off-policy policy evaluation (revisited)
• Thomas et al., High confidence off-policy evaluation, AAAI 2015
High-confidence off-policy policy evaluation (revisited)
![Page 64: Safe Reinforcement Learning - Stanford University€¦ · Creating a safe reinforcement learning algorithm •Off-policy policy evaluation (OPE) •For any evaluation policy, e, Convert](https://reader036.vdocument.in/reader036/viewer/2022070714/5ed400698d46b66d22634051/html5/thumbnails/64.jpg)
High-confidence off-policy policy evaluation (revisited) • Use 20% of the data to optimize 𝑐.
• Use 80% to compute the lower bound with the optimized 𝑐.
• Mountain car results:
|  | CUT | Chernoff-Hoeffding | Maurer | Anderson | Bubeck et al. |
| --- | --- | --- | --- | --- | --- |
| 95% confidence lower bound on the mean | 0.145 | −5,831,000 | −129,703 | 0.055 | −0.046 |
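The exact CUT inequality is more involved (see the AAAI 2015 paper); the sketch below only illustrates the underlying idea under simplifying assumptions: choose a clipping threshold 𝑐 on the 20% split, clip the 80% split at 𝑐, and apply Hoeffding with range [0, 𝑐]. Because clipping can only decrease the expected value, the result is still a valid lower bound on 𝐄[𝑋].

```python
import math

def truncated_lower_bound(samples, delta=0.05, train_frac=0.2):
    """Illustrative stand-in for the CUT bound (not the actual inequality):
    pick a clipping threshold c on a training split, clip the test split
    at c, and apply Hoeffding on [0, c].  A lower bound on E[min(X, c)]
    is also a lower bound on E[X]."""
    k = max(1, int(train_frac * len(samples)))
    train, test = samples[:k], samples[k:]
    c = max(train)              # crude threshold choice; the paper optimizes c
    clipped = [min(x, c) for x in test]
    n = len(clipped)
    return sum(clipped) / n - c * math.sqrt(math.log(1.0 / delta) / (2.0 * n))
```

Choosing 𝑐 on a separate split keeps it independent of the samples used in the bound, so validity is preserved.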
![Page 65: Safe Reinforcement Learning - Stanford University€¦ · Creating a safe reinforcement learning algorithm •Off-policy policy evaluation (OPE) •For any evaluation policy, e, Convert](https://reader036.vdocument.in/reader036/viewer/2022070714/5ed400698d46b66d22634051/html5/thumbnails/65.jpg)
High-confidence off-policy policy evaluation (revisited) • Digital Marketing:
![Page 66: Safe Reinforcement Learning - Stanford University€¦ · Creating a safe reinforcement learning algorithm •Off-policy policy evaluation (OPE) •For any evaluation policy, e, Convert](https://reader036.vdocument.in/reader036/viewer/2022070714/5ed400698d46b66d22634051/html5/thumbnails/66.jpg)
High-confidence off-policy policy evaluation (revisited) • Cognitive dissonance
𝐄[𝑋𝑖] ≥ (1/𝑛) ∑ᵢ₌₁ⁿ 𝑋𝑖 − 𝑏 √(ln(1/𝛿) / (2𝑛))
![Page 67: Safe Reinforcement Learning - Stanford University€¦ · Creating a safe reinforcement learning algorithm •Off-policy policy evaluation (OPE) •For any evaluation policy, e, Convert](https://reader036.vdocument.in/reader036/viewer/2022070714/5ed400698d46b66d22634051/html5/thumbnails/67.jpg)
High-confidence off-policy policy evaluation (revisited) • Student’s 𝑡-test
• Assumes that IS(𝐷) is normally distributed
• By the central limit theorem, it is approximately normal as 𝑛 → ∞
• Efron’s Bootstrap methods (e.g., BCa) • Also, without importance sampling: Hanna, Stone, and Niekum, AAMAS 2017
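As a rough sketch of this family of approximate bounds, here is the large-𝑛 version of the one-sided Student's 𝑡 lower bound, using the normal quantile in place of the 𝑡 quantile (a real implementation would use the 𝑡 distribution, and unlike the concentration inequalities above, this bound holds only approximately):

```python
import math
from statistics import NormalDist, mean, stdev

def normal_approx_lower_bound(x, delta=0.05):
    """Approximate 1 - delta lower bound on E[X]: sample mean minus the
    (1 - delta) normal quantile times the standard error.  Valid only to
    the extent that the sample mean is approximately normally distributed."""
    n = len(x)
    z = NormalDist().inv_cdf(1.0 - delta)
    return mean(x) - z * stdev(x) / math.sqrt(n)
```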
![Page 68: Safe Reinforcement Learning - Stanford University€¦ · Creating a safe reinforcement learning algorithm •Off-policy policy evaluation (OPE) •For any evaluation policy, e, Convert](https://reader036.vdocument.in/reader036/viewer/2022070714/5ed400698d46b66d22634051/html5/thumbnails/68.jpg)
High-confidence off-policy policy evaluation (revisited)
P. S. Thomas. Safe reinforcement learning (PhD Thesis, 2015)
![Page 69: Safe Reinforcement Learning - Stanford University€¦ · Creating a safe reinforcement learning algorithm •Off-policy policy evaluation (OPE) •For any evaluation policy, e, Convert](https://reader036.vdocument.in/reader036/viewer/2022070714/5ed400698d46b66d22634051/html5/thumbnails/69.jpg)
Creating a safe reinforcement learning algorithm • Off-policy policy evaluation (OPE)
• For any evaluation policy, 𝜋e, convert historical data, 𝐷, into 𝑛 independent and unbiased estimates of 𝐽(𝜋e)
• High-confidence off-policy policy evaluation (HCOPE) • Use a concentration inequality to convert the 𝑛 independent and unbiased estimates of 𝐽(𝜋e) into a 1 − 𝛿 confidence lower bound on 𝐽(𝜋e)
• Safe policy improvement (SPI) • Use the HCOPE method to create a safe reinforcement learning algorithm
![Page 70: Safe Reinforcement Learning - Stanford University€¦ · Creating a safe reinforcement learning algorithm •Off-policy policy evaluation (OPE) •For any evaluation policy, e, Convert](https://reader036.vdocument.in/reader036/viewer/2022070714/5ed400698d46b66d22634051/html5/thumbnails/70.jpg)
Safe policy improvement (revisited)
[Diagram: Historical Data is split into a Training Set (20%), used to find a Candidate Policy 𝜋, and a Testing Set (80%), used in a Safety Test.]
Is the 1 − 𝛿 confidence lower bound on 𝐽(𝜋) larger than 𝐽(𝜋cur)?
• Thomas et al., ICML 2015
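Putting the pieces together, the safety test described above can be sketched as follows. Here `propose` (the candidate-policy search) and `lower_bound` (an HCOPE routine) are hypothetical interfaces standing in for the components discussed in the lecture:

```python
def safe_policy_improvement(data, pi_cur, j_cur, propose, lower_bound, delta=0.05):
    """Skeleton of the safety test: search for a candidate policy on a 20%
    training split, then return it only if a 1 - delta confidence lower
    bound on its performance, computed on the 80% testing split, exceeds
    the current policy's performance.  Otherwise keep the current policy."""
    k = len(data) // 5                   # 20% training split
    train, test = data[:k], data[k:]
    pi_cand = propose(train)             # candidate-policy search (assumed given)
    if lower_bound(pi_cand, test, delta) > j_cur:
        return pi_cand                   # passed the safety test
    return pi_cur                        # failed: leave the policy unchanged
```

Because the lower bound is computed only on held-out data, the guarantee is not invalidated by the search for the candidate policy.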
![Page 71: Safe Reinforcement Learning - Stanford University€¦ · Creating a safe reinforcement learning algorithm •Off-policy policy evaluation (OPE) •For any evaluation policy, e, Convert](https://reader036.vdocument.in/reader036/viewer/2022070714/5ed400698d46b66d22634051/html5/thumbnails/71.jpg)
Lecture overview
• What makes a reinforcement learning algorithm safe?
• Notation
• Creating a safe reinforcement learning algorithm • Off-policy policy evaluation (OPE)
• High-confidence off-policy policy evaluation (HCOPE)
• Safe policy improvement (SPI)
• Empirical results
• Research directions
![Page 72: Safe Reinforcement Learning - Stanford University€¦ · Creating a safe reinforcement learning algorithm •Off-policy policy evaluation (OPE) •For any evaluation policy, e, Convert](https://reader036.vdocument.in/reader036/viewer/2022070714/5ed400698d46b66d22634051/html5/thumbnails/72.jpg)
Empirical Results
• What to look for: • Data efficiency
• Error rates
![Page 73: Safe Reinforcement Learning - Stanford University€¦ · Creating a safe reinforcement learning algorithm •Off-policy policy evaluation (OPE) •For any evaluation policy, e, Convert](https://reader036.vdocument.in/reader036/viewer/2022070714/5ed400698d46b66d22634051/html5/thumbnails/73.jpg)
Empirical Results: Mountain Car
![Page 74: Safe Reinforcement Learning - Stanford University€¦ · Creating a safe reinforcement learning algorithm •Off-policy policy evaluation (OPE) •For any evaluation policy, e, Convert](https://reader036.vdocument.in/reader036/viewer/2022070714/5ed400698d46b66d22634051/html5/thumbnails/74.jpg)
Empirical Results: Mountain Car
![Page 75: Safe Reinforcement Learning - Stanford University€¦ · Creating a safe reinforcement learning algorithm •Off-policy policy evaluation (OPE) •For any evaluation policy, e, Convert](https://reader036.vdocument.in/reader036/viewer/2022070714/5ed400698d46b66d22634051/html5/thumbnails/75.jpg)
Empirical Results: Mountain Car
![Page 76: Safe Reinforcement Learning - Stanford University€¦ · Creating a safe reinforcement learning algorithm •Off-policy policy evaluation (OPE) •For any evaluation policy, e, Convert](https://reader036.vdocument.in/reader036/viewer/2022070714/5ed400698d46b66d22634051/html5/thumbnails/76.jpg)
Empirical Results: Digital Marketing
Agent
Environment
Action, 𝑎
State, 𝑠 Reward, 𝑟
![Page 77: Safe Reinforcement Learning - Stanford University€¦ · Creating a safe reinforcement learning algorithm •Off-policy policy evaluation (OPE) •For any evaluation policy, e, Convert](https://reader036.vdocument.in/reader036/viewer/2022070714/5ed400698d46b66d22634051/html5/thumbnails/77.jpg)
Empirical Results: Digital Marketing
[Plot: expected normalized return vs. amount of data (n = 10,000; 30,000; 60,000; 100,000) for four method combinations: None + CUT, None + BCa, k-fold + CUT, k-fold + BCa; y-axis tick values include 0.002715 and 0.003832.]
![Page 78: Safe Reinforcement Learning - Stanford University€¦ · Creating a safe reinforcement learning algorithm •Off-policy policy evaluation (OPE) •For any evaluation policy, e, Convert](https://reader036.vdocument.in/reader036/viewer/2022070714/5ed400698d46b66d22634051/html5/thumbnails/78.jpg)
Empirical Results: Digital Marketing
![Page 79: Safe Reinforcement Learning - Stanford University€¦ · Creating a safe reinforcement learning algorithm •Off-policy policy evaluation (OPE) •For any evaluation policy, e, Convert](https://reader036.vdocument.in/reader036/viewer/2022070714/5ed400698d46b66d22634051/html5/thumbnails/79.jpg)
Empirical Results: Digital Marketing
![Page 80: Safe Reinforcement Learning - Stanford University€¦ · Creating a safe reinforcement learning algorithm •Off-policy policy evaluation (OPE) •For any evaluation policy, e, Convert](https://reader036.vdocument.in/reader036/viewer/2022070714/5ed400698d46b66d22634051/html5/thumbnails/80.jpg)
Example Results: Diabetes Treatment
Blood Glucose (sugar)
Eat Carbohydrates Release Insulin
![Page 81: Safe Reinforcement Learning - Stanford University€¦ · Creating a safe reinforcement learning algorithm •Off-policy policy evaluation (OPE) •For any evaluation policy, e, Convert](https://reader036.vdocument.in/reader036/viewer/2022070714/5ed400698d46b66d22634051/html5/thumbnails/81.jpg)
Example Results: Diabetes Treatment
Blood Glucose (sugar)
Eat Carbohydrates Release Insulin
Hyperglycemia
![Page 82: Safe Reinforcement Learning - Stanford University€¦ · Creating a safe reinforcement learning algorithm •Off-policy policy evaluation (OPE) •For any evaluation policy, e, Convert](https://reader036.vdocument.in/reader036/viewer/2022070714/5ed400698d46b66d22634051/html5/thumbnails/82.jpg)
Example Results: Diabetes Treatment
Blood Glucose (sugar)
Eat Carbohydrates Release Insulin
Hypoglycemia
Hyperglycemia
![Page 83: Safe Reinforcement Learning - Stanford University€¦ · Creating a safe reinforcement learning algorithm •Off-policy policy evaluation (OPE) •For any evaluation policy, e, Convert](https://reader036.vdocument.in/reader036/viewer/2022070714/5ed400698d46b66d22634051/html5/thumbnails/83.jpg)
Example Results: Diabetes Treatment

injection = (blood glucose − target blood glucose) / 𝐶𝐹 + meal size / 𝐶𝑅
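This standard bolus calculation is easy to express in code. Parameter names below are illustrative; 𝐶𝐹 is the patient's correction factor and 𝐶𝑅 the carbohydrate ratio, both patient-specific.

```python
def insulin_injection(blood_glucose, target_blood_glucose, meal_size, CF, CR):
    """Bolus calculation from the slide: a correction term for blood glucose
    above target, plus a term covering the carbohydrates in the meal."""
    return (blood_glucose - target_blood_glucose) / CF + meal_size / CR
```

Tuning 𝐶𝐹 and 𝐶𝑅 safely for each patient is precisely the policy-improvement problem these methods target.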
![Page 84: Safe Reinforcement Learning - Stanford University€¦ · Creating a safe reinforcement learning algorithm •Off-policy policy evaluation (OPE) •For any evaluation policy, e, Convert](https://reader036.vdocument.in/reader036/viewer/2022070714/5ed400698d46b66d22634051/html5/thumbnails/84.jpg)
Example Results: Diabetes Treatment
Intelligent Diabetes Management
![Page 85: Safe Reinforcement Learning - Stanford University€¦ · Creating a safe reinforcement learning algorithm •Off-policy policy evaluation (OPE) •For any evaluation policy, e, Convert](https://reader036.vdocument.in/reader036/viewer/2022070714/5ed400698d46b66d22634051/html5/thumbnails/85.jpg)
Example Results: Diabetes Treatment
[Plots: probability the policy changed, and probability the policy was made worse, as a function of the amount of data.]
![Page 86: Safe Reinforcement Learning - Stanford University€¦ · Creating a safe reinforcement learning algorithm •Off-policy policy evaluation (OPE) •For any evaluation policy, e, Convert](https://reader036.vdocument.in/reader036/viewer/2022070714/5ed400698d46b66d22634051/html5/thumbnails/86.jpg)
Future Directions
• How to deal with long horizons?
• How to deal with importance sampling being “unfair”?
• What to do when the behavior policy is not known?
• What to do when the behavior policy is deterministic?
![Page 87: Safe Reinforcement Learning - Stanford University€¦ · Creating a safe reinforcement learning algorithm •Off-policy policy evaluation (OPE) •For any evaluation policy, e, Convert](https://reader036.vdocument.in/reader036/viewer/2022070714/5ed400698d46b66d22634051/html5/thumbnails/87.jpg)
Summary
• Safe reinforcement learning
  • Risk-sensitive
  • Learning from demonstration
  • Asymptotic convergence even if data is off-policy
  • Guaranteed (with probability 1 − 𝛿) not to make the policy worse
• Designing a safe reinforcement learning algorithm: • Off-policy policy evaluation (OPE)
• IS, PDIS, WIS, WPDIS, DR, WDR, US, TSP, MAGIC
• High confidence off-policy policy evaluation (HCOPE) • Hoeffding, CUT inequality, Student’s 𝑡-test, BCa
• Safe policy improvement (SPI) • Selecting the candidate policy
![Page 88: Safe Reinforcement Learning - Stanford University€¦ · Creating a safe reinforcement learning algorithm •Off-policy policy evaluation (OPE) •For any evaluation policy, e, Convert](https://reader036.vdocument.in/reader036/viewer/2022070714/5ed400698d46b66d22634051/html5/thumbnails/88.jpg)
Takeaway
• Safe reinforcement learning is tractable! • Not just polynomial amounts of data, but practical amounts of data