motivation - people.csail.mit.edupeople.csail.mit.edu/ajshah/publication/shah-2019... · motivation...

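Formulation
• Given:
  • $\boldsymbol{x} \in \boldsymbol{X}$: Learner's state representation.
  • $\boldsymbol{\alpha} = f(\boldsymbol{x})$; $\boldsymbol{\alpha} \in \{0,1\}^{n_{prop}}$: Learner's labeling function and task propositions.
  • $\boldsymbol{A}$: Learner's set of available actions.
  • $P(\varphi)$: The task specification as a belief over formulas with support $\{\varphi\}$.
• Expected output:
  • $\pi_{P(\varphi)}(\boldsymbol{x})$: A stochastic policy that best satisfies $P(\varphi)$.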

Motivation
• Linear temporal logic (LTL) formulas are an expressive means for specifying non-Markovian tasks.
• Prior research relies on LTL-to-automaton compilation for planning. However, this is restricted to a single LTL formula.
• In many applications there is inherent uncertainty in specifications [1], [2].
• In general, specifications are expressed as a belief $P(\varphi)$ over a support $\{\varphi\}$.

Question 1: What does satisfying a belief $P(\varphi)$ mean?

Question 2: How do we plan for a collection of LTL formulas $\{\varphi\}$?

Planning with Uncertain Specifications

Evaluation Criteria

RSS 2019 Workshop on Combining Learning with Reasoning

Ankit Shah, Julie Shah

{ajshah, arnoldj}@mit.edu

Visit our website for more!

http://interactive.mit.edu

[1] Shah, A., Kamath, P., Shah, J. A., & Li, S. (2018). Bayesian inference of temporal task specifications from demonstrations. In Advances in Neural Information Processing Systems (pp. 3804-3813).
[2] Kim, J., Banks, C. J., & Shah, J. A. (2017, February). Collaborative planning with encoding of users' high-level strategies. In Thirty-First AAAI Conference on Artificial Intelligence.

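• Most Likely: $\mathbb{1}(\boldsymbol{\alpha} \models \varphi^*)$, where $\varphi^* = \operatorname{argmax}_{\varphi \in \{\varphi\}} P(\varphi)$. Satisfy only the most likely formula.
• Maximum Coverage: $\frac{1}{|\{\varphi\}|} \sum_{\varphi \in \{\varphi\}} \mathbb{1}(\boldsymbol{\alpha} \models \varphi)$. Satisfy the largest set of unique formulas.
• Minimum Regret: $\sum_{\varphi \in \{\varphi\}} P(\varphi)\, \mathbb{1}(\boldsymbol{\alpha} \models \varphi)$. Maximize satisfaction weighted by probability.
• Chance Constrained: $\sum_{\varphi \in \varphi^{\delta}} P(\varphi)\, \mathbb{1}(\boldsymbol{\alpha} \models \varphi)$. $\delta$ is the maximum failure probability.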
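The four evaluation criteria can be compared on a small example. The following is a minimal Python sketch, not the authors' implementation. It uses the three-formula belief from this poster (0.05, 0.15, 0.8) and takes the set of satisfied formulas as given. The poster does not spell out how $\varphi^{\delta}$ is built; the sketch assumes it is the smallest set of most probable formulas whose total probability is at least $1 - \delta$. All function and variable names are hypothetical.

```python
from typing import Dict, Set

# Belief P(phi) over the three formulas shown on this poster.
BELIEF: Dict[str, float] = {"phi1": 0.05, "phi2": 0.15, "phi3": 0.8}


def most_likely(belief: Dict[str, float], satisfied: Set[str]) -> float:
    """Indicator of whether the single most probable formula is satisfied."""
    best = max(belief, key=belief.get)
    return 1.0 if best in satisfied else 0.0


def max_coverage(belief: Dict[str, float], satisfied: Set[str]) -> float:
    """Fraction of formulas in the support that are satisfied."""
    return sum(1 for f in belief if f in satisfied) / len(belief)


def min_regret(belief: Dict[str, float], satisfied: Set[str]) -> float:
    """Probability-weighted satisfaction over the whole support."""
    return sum(p for f, p in belief.items() if f in satisfied)


def chance_constrained(belief: Dict[str, float], satisfied: Set[str],
                       delta: float) -> float:
    """Probability-weighted satisfaction over phi^delta.

    ASSUMPTION (not stated on the poster): phi^delta is built greedily from
    the most probable formulas until their mass reaches 1 - delta.
    """
    ranked = sorted(belief, key=belief.get, reverse=True)
    subset, mass = [], 0.0
    for f in ranked:
        if mass >= 1.0 - delta - 1e-12:  # tolerance for float round-off
            break
        subset.append(f)
        mass += belief[f]
    return sum(belief[f] for f in subset if f in satisfied)


if __name__ == "__main__":
    satisfied = {"phi2", "phi3"}  # hypothetical outcome of one execution
    print(most_likely(BELIEF, satisfied))
    print(max_coverage(BELIEF, satisfied))
    print(min_regret(BELIEF, satisfied))
    print(chance_constrained(BELIEF, satisfied, delta=0.1))
```

Under this assumption and with $\delta = 0.1$, the chance-constrained criterion ignores $\varphi_1$ entirely, because $\varphi_3$ and $\varphi_2$ already account for 0.95 of the probability mass.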

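Automata/MDP Compilation
Example belief over three formulas:
• $P(\varphi_1) = 0.05$: $\varphi_1 = \mathbf{G}\neg T_0 \wedge \mathbf{F}W_3 \wedge (\neg W_3\,\mathbf{U}\,W_2) \wedge (\neg W_2\,\mathbf{U}\,W_1)$ (automaton with 5 states)
• $P(\varphi_2) = 0.15$: $\varphi_2 = \mathbf{G}\neg T_0 \wedge \mathbf{F}W_1 \wedge \mathbf{F}W_2 \wedge \mathbf{F}W_3$ (automaton with 9 states)
• $P(\varphi_3) = 0.8$: $\varphi_3 = \mathbf{G}\neg T_0 \wedge \mathbf{F}W_3$ (automaton with 3 states)

Composite automaton $\mathcal{M}_{\{\varphi\}} = \langle \{\langle\varphi'\rangle\}, \{0,1\}^{n_{prop}}, T_{\{\varphi\}}, R \rangle$
• Naïve cross-product: 135 states
• Minimal automaton: 11 states
• Determine task success and reward.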
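To make the three example formulas concrete, the following minimal Python sketch checks them against short label traces and reproduces the naïve cross-product size. It assumes a finite-trace reading of the temporal operators, which the poster does not specify, and it represents each time step as the set of propositions that are true. All names are hypothetical.

```python
from typing import List, Set

Trace = List[Set[str]]  # one set of true propositions per time step


def always_not(trace: Trace, p: str) -> bool:
    """G(not p): p never holds along the trace."""
    return all(p not in step for step in trace)


def eventually(trace: Trace, p: str) -> bool:
    """F p: p holds at some step of the trace."""
    return any(p in step for step in trace)


def not_until(trace: Trace, a: str, b: str) -> bool:
    """(not a) U b: b holds at some step, and a does not hold before it."""
    for step in trace:
        if b in step:
            return True
        if a in step:
            return False
    return False  # b never occurred (ASSUMPTION: finite-trace semantics)


def phi1(trace: Trace) -> bool:
    # G !T0 & F W3 & (!W3 U W2) & (!W2 U W1)
    return (always_not(trace, "T0") and eventually(trace, "W3")
            and not_until(trace, "W3", "W2")
            and not_until(trace, "W2", "W1"))


def phi2(trace: Trace) -> bool:
    # G !T0 & F W1 & F W2 & F W3
    return always_not(trace, "T0") and all(
        eventually(trace, w) for w in ("W1", "W2", "W3"))


def phi3(trace: Trace) -> bool:
    # G !T0 & F W3
    return always_not(trace, "T0") and eventually(trace, "W3")


# Naive cross-product of the three per-formula automata.
NAIVE_PRODUCT_STATES = 5 * 9 * 3

if __name__ == "__main__":
    ordered = [{"W1"}, {"W2"}, {"W3"}]
    reversed_order = [{"W3"}, {"W2"}, {"W1"}]
    for name, trace in (("ordered", ordered), ("reversed", reversed_order)):
        print(name, phi1(trace), phi2(trace), phi3(trace))
    print(NAIVE_PRODUCT_STATES)
```

Visiting $W_1$, $W_2$, $W_3$ in order satisfies all three formulas, while the reverse order satisfies only $\varphi_2$ and $\varphi_3$, which illustrates how the formulas in the support can disagree about the same execution.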

Environment MDP $\mathcal{M}_{env} = \langle \boldsymbol{X}, \boldsymbol{A}, T_{env} \rangle$:
• Determines available actions.
• Determines changes to environment state.

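Cross product $\mathcal{M}_{\{\varphi\}} \times \mathcal{M}_{env} = \mathcal{M}_{spec}$

$\mathcal{M}_{spec} = \langle \{\langle\varphi'\rangle\} \times \boldsymbol{X}, \boldsymbol{A}, T_{spec}, R \rangle$

$T_{spec}(\langle\langle\varphi_1'\rangle, x_1\rangle, \langle\langle\varphi_2'\rangle, x_2\rangle) = T_{\{\varphi\}}(\langle\varphi_1'\rangle, \langle\varphi_2'\rangle) \times T_{env}(x_1, x_2)$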
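The product transition function can be illustrated on a toy problem. The sketch below is a hypothetical two-state environment paired with a two-state automaton for a single "eventually $W_3$" requirement; it is not the poster's example. It assumes that the automaton moves deterministically according to the labels $f(x_2)$ of the next environment state, a dependence the formula above leaves implicit, and, like that formula, it leaves the action out. All names and probabilities are made up for illustration.

```python
from itertools import product
from typing import Dict, FrozenSet, Tuple

ENV_STATES = ["start", "W3"]
AUT_STATES = ["pending", "done"]

# Hypothetical environment transition probabilities T_env(x1, x2).
T_ENV: Dict[Tuple[str, str], float] = {
    ("start", "start"): 0.1,
    ("start", "W3"): 0.9,
    ("W3", "W3"): 1.0,
}


def t_env(x1: str, x2: str) -> float:
    return T_ENV.get((x1, x2), 0.0)


def label(x: str) -> FrozenSet[str]:
    """Labeling function f(x): the propositions true in environment state x."""
    return frozenset({"W3"}) if x == "W3" else frozenset()


def step_automaton(q: str, alpha: FrozenSet[str]) -> str:
    """Deterministic automaton update for 'eventually W3'."""
    if q == "pending" and "W3" in alpha:
        return "done"
    return q


def t_phi(q1: str, q2: str, alpha: FrozenSet[str]) -> float:
    # ASSUMPTION: the automaton transition is an indicator driven by labels.
    return 1.0 if step_automaton(q1, alpha) == q2 else 0.0


def t_spec(s1: Tuple[str, str], s2: Tuple[str, str]) -> float:
    """Product transition: automaton factor times environment factor."""
    (q1, x1), (q2, x2) = s1, s2
    return t_phi(q1, q2, label(x2)) * t_env(x1, x2)


if __name__ == "__main__":
    source = ("pending", "start")
    for target in product(AUT_STATES, ENV_STATES):
        print(source, "->", target, t_spec(source, target))
```

Because the automaton factor is an indicator in this sketch, the outgoing probabilities of each product state still sum to one, so the product remains a valid transition function.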

Results Discussion

• MDP compilation admits formulas of the Obligation class of temporal properties.

• Any RL algorithm can be used to solve the compiled MDP, but exploration vs. exploitation considerations are still important.

Future Work
• Algorithms to exploit the composition of $\mathcal{M}_{\{\varphi\}}$ and $\mathcal{M}_{env}$.
• Scaffolding of reward based on automaton structure.
• Allowing temporal properties like Recurrence, Persistence and Reactivity.

[Figure: task executions for different classes of belief distributions. Panels: Most Likely; Max coverage/Min regret; Chance constrained ($\delta = 0.1$).]

Nature of task executions depends on:
• Nature of distribution.
• Choice of evaluation criterion.
• Exploration strategy in RL algorithm.
