Formulation
• Given:
  • $x \in X$: the learner's state representation.
  • $\alpha = f(x)$, $\alpha \in \{0,1\}^{n_{prop}}$: the learner's labeling function and task propositions.
  • $A$: the learner's set of available actions.
  • $P(\varphi)$: the task specification, as a belief over formulas with support $\{\varphi\}$.
• Expected output:
  • $\pi_{P(\varphi)}(x)$: a stochastic policy that best satisfies $P(\varphi)$.
Motivation
• Linear temporal logic (LTL) formulas are an expressive means of specifying non-Markovian tasks.
• Prior research relies on LTL-to-automaton compilation for planning. However, this is restricted to a single LTL formula.
• In many applications there is inherent uncertainty in the specifications [1], [2].
• In general, specifications are expressed as a belief $P(\varphi)$ over a support $\{\varphi\}$.
Question 1: What does satisfying a belief $P(\varphi)$ mean?
Question 2: How do we plan for a collection of LTL formulas $\{\varphi\}$?
Planning with Uncertain Specifications
Evaluation Criteria
RSS 2019 Workshop on Combining Learning with Reasoning
Ankit Shah, Julie Shah
{ajshah, arnoldj}@mit.edu
Visit our website for more!
http://interactive.mit.edu
[1] Shah, A., Kamath, P., Shah, J. A., & Li, S. (2018). Bayesian inference of temporal task specifications from demonstrations. In Advances in Neural Information Processing Systems (pp. 3804-3813).
[2] Kim, J., Banks, C. J., & Shah, J. A. (2017, February). Collaborative planning with encoding of users' high-level strategies. In Thirty-First AAAI Conference on Artificial Intelligence.
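Each criterion scores the propositions $\alpha$ observed during an execution:
• Most likely: $\mathbb{1}(\alpha \models \varphi^*)$, where $\varphi^* = \arg\max_{\varphi \in \{\varphi\}} P(\varphi)$. Satisfy only the most likely formula.
• Maximum coverage: $\frac{1}{|\{\varphi\}|} \sum_{\varphi \in \{\varphi\}} \mathbb{1}(\alpha \models \varphi)$. Satisfy the largest set of unique formulas.
• Minimum regret: $\sum_{\varphi \in \{\varphi\}} P(\varphi)\, \mathbb{1}(\alpha \models \varphi)$. Maximize satisfaction weighted by probability.
• Chance constrained: $\sum_{\varphi \in \varphi^{\delta}} P(\varphi)\, \mathbb{1}(\alpha \models \varphi)$. $\delta$ is the maximum failure probability.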
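The four criteria can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation. It assumes the belief is a dictionary from formula names to probabilities, that satisfaction of each formula by an execution is already known as a boolean, and that the indicator takes the values 1 and 0. The construction of the chance-constrained subset (the most likely formulas, added in descending order of probability until their total mass reaches 1 − δ) is also an assumption, since the poster only states that δ is the maximum failure probability. All function names are hypothetical.

```python
from typing import Dict


def most_likely(belief: Dict[str, float], sat: Dict[str, bool]) -> float:
    """Indicator of satisfying only the single most probable formula."""
    best = max(belief, key=belief.get)
    return 1.0 if sat[best] else 0.0


def max_coverage(belief: Dict[str, float], sat: Dict[str, bool]) -> float:
    """Fraction of formulas in the support that are satisfied."""
    return sum(1 for f in belief if sat[f]) / len(belief)


def min_regret(belief: Dict[str, float], sat: Dict[str, bool]) -> float:
    """Probability-weighted satisfaction over the whole support."""
    return sum(p for f, p in belief.items() if sat[f])


def chance_constrained(belief: Dict[str, float], sat: Dict[str, bool],
                       delta: float) -> float:
    """Probability-weighted satisfaction over a high-probability subset.

    ASSUMPTION: the subset is built greedily from the most likely formulas
    until its total probability is at least 1 - delta.
    """
    ranked = sorted(belief, key=belief.get, reverse=True)
    kept, mass = [], 0.0
    for f in ranked:
        if mass >= 1.0 - delta - 1e-12:  # small tolerance for float sums
            break
        kept.append(f)
        mass += belief[f]
    return sum(belief[f] for f in kept if sat[f])


# Belief values taken from the poster's three-formula example.
belief = {"phi1": 0.05, "phi2": 0.15, "phi3": 0.8}
# Hypothetical execution that satisfies phi2 and phi3 but not phi1.
sat = {"phi1": False, "phi2": True, "phi3": True}

scores = {
    "most_likely": most_likely(belief, sat),
    "max_coverage": max_coverage(belief, sat),
    "min_regret": min_regret(belief, sat),
    "chance_constrained": chance_constrained(belief, sat, delta=0.1),
}
```

Under these assumptions, with δ = 0.1 the greedy subset keeps phi3 and phi2 (total mass 0.95) and drops phi1, so an execution that violates only phi1 is not penalized by the chance-constrained score but is penalized by maximum coverage.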
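Automata/MDP Compilation
Example belief over three formulas:
• $P(\varphi_1) = 0.05$: $\mathbf{G}\neg T_0 \wedge \mathbf{F} W_3 \wedge (\neg W_3\, \mathbf{U}\, W_2) \wedge (\neg W_2\, \mathbf{U}\, W_1)$ (5 states)
• $P(\varphi_2) = 0.15$: $\mathbf{G}\neg T_0 \wedge \mathbf{F} W_1 \wedge \mathbf{F} W_2 \wedge \mathbf{F} W_3$ (9 states)
• $P(\varphi_3) = 0.8$: $\mathbf{G}\neg T_0 \wedge \mathbf{F} W_3$ (3 states)
Composite automaton $\mathcal{M}_{\{\varphi\}} = \langle \{\varphi'\}, \{0,1\}^{n_{prop}}, T_{\{\varphi\}}, R \rangle$:
• Naïve cross-product: 135 states (5 × 9 × 3).
• Minimal automaton: 11 states.
• Determines task success and reward.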
Environment MDP $\mathcal{M}_{env} = \langle X, A, T_{env} \rangle$:
• Determines available actions.
• Determines changes to environment state.
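Cross product: $\mathcal{M}_{\{\varphi\}} \times \mathcal{M}_{env} = \mathcal{M}_{spec}$
$\mathcal{M}_{spec} = \langle \{\varphi'\} \times X, A, T_{spec}, R \rangle$
$T_{spec}(\langle \varphi_1', x_1 \rangle, \langle \varphi_2', x_2 \rangle) = T_{\{\varphi\}}(\varphi_1', \varphi_2') \times T_{env}(x_1, x_2)$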
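The product transition can be illustrated with a toy example. The sketch below is an assumption-laden illustration rather than the authors' code. The environment (four cells on a line, with a threat `T0` at cell 0 and a waypoint `W3` at cell 3), its slip probability, and the hand-written three-state automaton for G¬T0 ∧ F W3 are all hypothetical. The sketch also makes explicit two dependencies that the poster's formula leaves implicit: the environment transition depends on the action, and the automaton transition is driven by the labels of the next environment state.

```python
from typing import Dict, Tuple

SpecState = Tuple[str, int]  # (automaton state, environment state)


def t_env(x: int, action: str) -> Dict[int, float]:
    """Hypothetical environment: move along cells 0..3, slipping (staying
    in place) with probability 0.1."""
    step = -1 if action == "left" else 1
    target = min(max(x + step, 0), 3)
    dist = {x: 0.1}
    dist[target] = dist.get(target, 0.0) + 0.9
    return dist


def label(x: int) -> Dict[str, bool]:
    """Hypothetical labeling function f(x): which propositions hold at x."""
    return {"T0": x == 0, "W3": x == 3}


def t_phi(q: str, alpha: Dict[str, bool]) -> str:
    """Hand-written deterministic automaton for G(not T0) and F(W3).

    States: 'pending' (W3 not yet seen), 'done' (W3 seen, T0 never seen),
    'failed' (T0 seen; absorbing).
    """
    if q == "failed" or alpha["T0"]:
        return "failed"
    if q == "done" or alpha["W3"]:
        return "done"
    return "pending"


def t_spec(state: SpecState, action: str) -> Dict[SpecState, float]:
    """Product transition: the automaton moves deterministically on the
    labels of the next environment state, so each successor inherits the
    environment's transition probability."""
    q, x = state
    out: Dict[SpecState, float] = {}
    for x2, p in t_env(x, action).items():
        q2 = t_phi(q, label(x2))
        out[(q2, x2)] = out.get((q2, x2), 0.0) + p
    return out


# Moving right from cell 2 reaches the waypoint unless the agent slips.
successors = t_spec(("pending", 2), "right")
```

Because the automaton component is deterministic here, the product state space is at most the number of automaton states times the number of environment states, and any tabular RL or planning method could be run on `t_spec` directly.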
Results and Discussion
• MDP compilation admits formulas of the Obligation class of temporal properties.
• Any RL algorithm can be used to solve the compiled MDP, but exploration vs. exploitation considerations are still important.
Future Work
• Algorithms to exploit the composition of $\mathcal{M}_{\{\varphi\}}$ and $\mathcal{M}_{env}$.
• Scaffolding of reward based on automaton structure.
• Allowing temporal properties like Recurrence, Persistence, and Reactivity.
[Figure panels: task executions under Most likely; Max coverage / Min regret; Chance constrained ($\delta = 0.1$)]
[Figure: classes of belief distributions]
The nature of task executions depends on:
• The nature of the distribution.
• The choice of evaluation criterion.
• The exploration strategy in the RL algorithm.