TRANSCRIPT
Lecture 1: Introduction to Reinforcement Learning 1
Emma Brunskill
CS234 Reinforcement Learning
Winter 2019
Footnote 1: Today the 3rd part of the lecture is based on David Silver’s introduction to RL slides.
Emma Brunskill (CS234 Reinforcement Learning), Lecture 1: Introduction to Reinforcement Learning, Winter 2019
Today’s Plan
Overview of reinforcement learning
Course logistics
Introduction to sequential decision making under uncertainty
Reinforcement Learning
Learn to make good sequences of decisions
Repeated Interactions with World
Learn to make good sequences of decisions
Reward for Sequence of Decisions
Learn to make good sequences of decisions
Don’t Know in Advance How World Works
Learn to make good sequences of decisions
Fundamental challenge in artificial intelligence and machine learning is learning to make good decisions under uncertainty
RL, Behavior & Intelligence
Figure: Example from Yael Niv
Childhood: primitive brain & eye, swims around, attaches to a rock
Adulthood: digests brain, sits
Suggests brain is helping guide decisions (no more decisions, no need for brain?)
Atari
Figure: DeepMind Nature, 2015
Robotics
Educational Games
Figure: RL used to optimize Refraction 1. Mandel, Liu, Brunskill, Popovic, AAMAS 2014.
Healthcare
Figure: Adaptive control of epileptiform excitability in an in vitro model of limbic seizures. Panuccio, Guez, Vincent, Avoli, Pineau
NLP, Vision, ...
Figure: Yeung, Russakovsky, Mori, Li 2016.
Reinforcement Learning Involves
Optimization
Delayed consequences
Exploration
Generalization
Optimization
Goal is to find an optimal way to make decisions
Yielding best outcomes
Or at least a very good strategy
Delayed Consequences
Decisions now can impact things much later...
Saving for retirement
Finding a key in Montezuma’s revenge
Introduces two challenges
1. When planning: decisions involve reasoning about not just immediate benefit of a decision but also its longer term ramifications
2. When learning: temporal credit assignment is hard (what caused later high or low rewards?)
Exploration
Learning about the world by making decisions
Agent as scientist
Learn to ride a bike by trying (and failing)
Finding a key in Montezuma’s revenge
Censored data
Only get a reward (label) for decision made
Don’t know what would have happened if we had taken red pill instead of blue pill (Matrix movie reference)
Decisions impact what we learn about
If we choose to go to Stanford instead of MIT, we will have different later experiences...
Policy is mapping from past experience to action
Why not just pre-program a policy?
Generalization
Policy is mapping from past experience to action
Why not just pre-program a policy?
Figure: DeepMind Nature, 2015
How many possible images are there? (256^(100×200))^3
Reinforcement Learning Involves
Optimization
Exploration
Generalization
Delayed consequences
AI Planning (vs RL)
Optimization
Generalization
Exploration
Delayed consequences
Computes good sequence of decisions
But given model of how decisions impact world
Supervised Machine Learning (vs RL)
Optimization
Generalization
Exploration
Delayed consequences
Learns from experience
But provided correct labels
Unsupervised Machine Learning (vs RL)
Optimization
Generalization
Exploration
Delayed consequences
Learns from experience
But no labels from world
Imitation Learning (vs RL)
Optimization
Generalization
Exploration
Delayed consequences
Learns from experience...of others
Assumes input demos of good policies
Imitation Learning
Figure: Abbeel, Coates and Ng helicopter team, Stanford
Imitation Learning
Reduces RL to supervised learning
Benefits
Great tools for supervised learning
Avoids exploration problem
With big data, lots of data about outcomes of decisions
Limitations
Can be expensive to capture
Limited by data collected
Imitation learning + RL promising!
How Do We Proceed?
Explore the world
Use experience to guide future decisions
Other Issues
Where do rewards come from?
And what happens if we get it wrong?
Robustness / Risk sensitivity
We are not alone...
Multi-agent RL
Today’s Plan
Overview of reinforcement learning
Course logistics
Introduction to sequential decision making under uncertainty
Basic Logistics
Instructor: Emma Brunskill
CAs: Ramtin Keramati (Head CA), Patrick Cho, Anchit Gupta, Bo Liu, Sudarshan Seshadri, Jay Whang, Yongshang Wu, Andrea Zanette
Time: Monday, Wednesday 11:30-12:50
Location: Gates B1
Additional information
Course webpage: http://cs234.stanford.edu
Schedule, Piazza, lecture slides
Prerequisites
Python proficiency
Basic probability and statistics
Multivariate calculus and linear algebra
Machine learning or AI (e.g. CS229, CS221)
Loss functions, derivatives, gradient descent should be familiar
Have heard of Markov decision processes and RL before in an AI or ML class
We will cover the basics, but quickly
End of Class Goals
Define the key features of reinforcement learning that distinguish it from AI and non-interactive machine learning (as assessed by the exam)
Given an application problem (e.g. from computer vision, robotics, etc), decide if it should be formulated as a RL problem; if yes, be able to define it formally (in terms of the state space, action space, dynamics and reward model), state what algorithm (from class) is best suited to addressing it, and justify your answer (as assessed by the project and the exam)
Implement (in code) common RL algorithms including a deep RL algorithm (as assessed by the homeworks)
Describe (list and define) multiple criteria for analyzing RL algorithms and evaluate algorithms on these metrics: e.g. regret, sample complexity, computational complexity, empirical performance, convergence, etc. (as assessed by homeworks and the exam)
Describe the exploration vs exploitation challenge and compare and contrast at least two approaches for addressing this challenge (in terms of performance, scalability, complexity of implementation, and theoretical guarantees) (as assessed by an assignment and the exam)
Grading
Assignment 1: 10%
Assignment 2: 20%
Assignment 3: 15%
Midterm: 25%
Quiz: 5%
Final Project: 25%
  Proposal: 1%
  Milestone: 3%
  Poster presentation: 5%
  Final Report: 16%
Communication
We believe students often learn an enormous amount from each otheras well as from us, the course staff.
We will use Piazza to facilitate discussion and peer learning
Please use Piazza for all questions related to lectures, homeworks,and projects
Grading
Late policy
6 free late days
See webpage for details on how many per assignment/project and penalties for using more
Collaboration: see webpage and reach out to us if you have any questions about what is considered permissible collaboration
Today’s Plan
Overview of reinforcement learning
Course logistics
Introduction to sequential decision making under uncertainty
Sequential Decision Making
Goal: Select actions to maximize total expected future reward
May require balancing immediate & long term rewards
May require strategic behavior to achieve high rewards
Example: Web Advertising
Goal: Select actions to maximize total expected future reward
May require balancing immediate & long term rewards
May require strategic behavior to achieve high rewards
Example: Robot Unloading Dishwasher
Goal: Select actions to maximize total expected future reward
May require balancing immediate & long term rewards
May require strategic behavior to achieve high rewards
Example: Blood Pressure Control
Goal: Select actions to maximize total expected future reward
May require balancing immediate & long term rewards
May require strategic behavior to achieve high rewards
Sequential Decision Process: Agent & the World (DiscreteTime)
Each time step t:
Agent takes an action a_t
World updates given action a_t, emits observation o_t and reward r_t
Agent receives observation o_t and reward r_t
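The loop above can be sketched in Python. This is a minimal sketch: the `World` class, its coin-flip dynamics, and the action/observation names are all hypothetical, chosen only to show the shape of the agent-world interaction at each time step t.

```python
import random

class World:
    """A toy world (hypothetical): rewards the 'forward' action, emits a random observation."""
    def step(self, action):
        observation = random.choice(["left_wall", "open_space"])
        reward = 1.0 if action == "forward" else 0.0
        return observation, reward

world = World()
history = []
for t in range(5):
    action = "forward"                     # agent takes an action a_t
    obs, reward = world.step(action)       # world updates, emits o_t and r_t
    history.append((action, obs, reward))  # agent receives o_t and r_t

print(len(history))  # 5 (action, observation, reward) tuples
```

Real RL agents differ only in how `action` is chosen: from a learned policy rather than a constant.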
History: Sequence of Past Observations, Actions &Rewards
History ht = (a1, o1, r1, . . . , at , ot , rt)
Agent chooses action based on history
State is information assumed to determine what happens next
Function of history: st = f(ht)
World State
This is true state of the world used to determine how world generatesnext observation and reward
Often hidden or unknown to agent
Even if known may contain information not needed by agent
Agent State: Agent’s Internal Representation
What the agent / algorithm uses to make decisions about how to act
Generally a function of the history: st = f (ht)
Could include meta information like state of algorithm (how many computations executed, etc) or decision process (how many decisions left until an episode ends)
Markov Assumption
Information state: sufficient statistic of history
State st is Markov if and only if:
p(st+1|st , at) = p(st+1|ht , at)
Future is independent of past given present
Why is Markov Assumption Popular?
Can always be satisfied
Setting state as history always Markov: st = ht
In practice often assume most recent observation is sufficient statistic of history: st = ot
State representation has big implications for:
Computational complexity
Data required
Resulting performance
Full Observability / Markov Decision Process (MDP)
Environment and world state st = ot
Partial Observability / Partially Observable MarkovDecision Process (POMDP)
Agent state is not the same as the world state
Agent constructs its own state, e.g.
Use history st = ht , or beliefs of world state, or RNN, ...
Partial Observability Examples
Poker player (only sees own cards)
Healthcare (don’t see all physiological processes)
Agent state is not the same as the world state
Agent constructs its own state, e.g.
Use history st = ht, or beliefs of world state, or RNN, ...
Types of Sequential Decision Processes: Bandits
Bandits: actions have no influence on next observations
No delayed rewards
Types of Sequential Decision Processes: MDPs andPOMDPs
Actions influence future observations
Credit assignment and strategic actions may be needed
Types of Sequential Decision Processes: How the WorldChanges
Deterministic: Given history and action, single observation & reward
Common assumption in robotics and controls
Stochastic: Given history and action, many potential observations & rewards
Common assumption for customers, patients, hard to model domains
RL Agent Components
Often include one or more of
Model: Agent’s representation of how the world changes in response to agent’s action
Policy: function mapping agent’s states to action
Value function: future rewards from being in a state and/or action when following a particular policy
Model
Agent’s representation of how the world changes in response to agent’s action
Transition / dynamics model predicts next agent state
p(st+1 = s ′|st = s, at = a)
Reward model predicts immediate reward
r(st = s, at = a) = E[rt |st = s, at = a]
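A tabular model of this kind can be sketched as two lookup tables. This is a sketch with hypothetical numbers for a small three-state chain, not a model from the lecture: `transition[(s, a)]` holds p(s'|s, a) and `reward[(s, a)]` holds r(s, a).

```python
# transition[(s, a)] maps next state s' -> p(s' | s, a)  (hypothetical values)
transition = {
    ("s1", "right"): {"s1": 0.5, "s2": 0.5},
    ("s2", "right"): {"s2": 0.5, "s3": 0.5},
    ("s3", "right"): {"s3": 1.0},
}
# reward[(s, a)] is the expected immediate reward r(s, a)  (hypothetical values)
reward = {("s1", "right"): 0.0, ("s2", "right"): 0.0, ("s3", "right"): 10.0}

# Sanity check: probabilities out of each (s, a) must sum to one
for probs in transition.values():
    assert abs(sum(probs.values()) - 1.0) < 1e-9
```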
Policy
Policy π determines how the agent chooses actions
π : S → A, mapping from states to actions
Deterministic policy: π(s) = a
Stochastic policy:
π(a|s) = Pr(at = a|st = s)
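The two kinds of policy can be sketched directly. A deterministic policy returns one action per state; a stochastic policy returns a distribution over actions, which the agent samples from. The state/action names here are illustrative, not from the lecture.

```python
import random

# Deterministic policy: one fixed action per state
def pi_det(state):
    return "right"

# Stochastic policy: a distribution over actions, pi(a|s)
def pi_stoch(state):
    return {"left": 0.3, "right": 0.7}

def sample_action(policy_dist):
    actions, probs = zip(*policy_dist.items())
    return random.choices(actions, weights=probs, k=1)[0]

assert pi_det("s1") == "right"
assert sample_action(pi_stoch("s1")) in {"left", "right"}
```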
Value
Value function Vπ: expected discounted sum of future rewards under a particular policy π
V π(st = s) = Eπ[rt + γrt+1 + γ2rt+2 + γ3rt+3 + · · · |st = s]
Discount factor γ weighs immediate vs future rewards
Can be used to quantify goodness/badness of states and actions
And decide how to act by comparing policies
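The discounted sum inside the expectation can be computed directly for any finite reward sequence; a minimal sketch:

```python
def discounted_return(rewards, gamma):
    """Sum of gamma^k * r_{t+k} over a finite reward sequence."""
    total = 0.0
    for k, r in enumerate(rewards):
        total += (gamma ** k) * r
    return total

# gamma = 0: only the immediate reward counts
assert discounted_return([1.0, 2.0, 3.0], gamma=0.0) == 1.0
# gamma = 1: all rewards count equally
assert discounted_return([1.0, 2.0, 3.0], gamma=1.0) == 6.0
# gamma = 0.5: 1 + 0.5*2 + 0.25*3
assert discounted_return([1.0, 2.0, 3.0], gamma=0.5) == 2.75
```

Vπ(s) is the expectation of this quantity over trajectories generated by following π from s.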
Example: Mars Rover Decision Process
s1  s2  s3  s4  s5  s6  s7
Figure: Mars rover image: NASA/JPL-Caltech
States: Location of rover (s1, . . . , s7)
Actions: Left or Right
Rewards:
+1 in state s1
+10 in state s7
0 in all other states
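The rover problem above can be written down directly; a sketch matching the slide's states, actions, and rewards:

```python
# Mars rover decision process, as specified on the slide
STATES = [f"s{i}" for i in range(1, 8)]   # s1, ..., s7
ACTIONS = ["left", "right"]

def reward(state):
    # +1 in s1, +10 in s7, 0 in all other states
    return {"s1": 1.0, "s7": 10.0}.get(state, 0.0)

assert reward("s1") == 1.0
assert reward("s7") == 10.0
assert reward("s4") == 0.0
```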
Example: Mars Rover Policy
s1  s2  s3  s4  s5  s6  s7
Policy represented by arrow
π(s1) = π(s2) = · · · = π(s7) = right
Example: Mars Rover Value Function
s1  s2  s3  s4  s5  s6  s7
Vπ(s1) = +1, Vπ(s2) = 0, Vπ(s3) = 0, Vπ(s4) = 0, Vπ(s5) = 0, Vπ(s6) = 0, Vπ(s7) = +10
Discount factor, γ = 0
π(s1) = π(s2) = · · · = π(s7) = right
Numbers show value V π(s) for this policy and this discount factor
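With γ = 0 the discounted sum collapses to the immediate reward, which is why the values above equal the per-state rewards; a sketch of that special case:

```python
def reward(state):
    # Rewards from the slide: +1 in s1, +10 in s7, 0 elsewhere
    return {"s1": 1.0, "s7": 10.0}.get(state, 0.0)

def value_gamma_zero(state):
    # With gamma = 0, V(s) = E[r_t | s]: no future terms survive
    return reward(state)

assert value_gamma_zero("s1") == 1.0
assert value_gamma_zero("s7") == 10.0
assert all(value_gamma_zero(f"s{i}") == 0.0 for i in range(2, 7))
```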
Example: Mars Rover Model
s1  s2  s3  s4  s5  s6  s7
r̂ = 0 for each of s1, . . . , s7
Agent can construct its own estimate of the world models (dynamics and reward)
In the above the numbers show the agent’s estimate of the reward model
Agent’s transition model
0.5 = P(s1|s1, right) = P(s2|s1, right) = · · ·
Model may be wrong
Types of RL Agents: What the Agent (Algorithm) Learns
Value-based
Explicit: Value function
Implicit: Policy (can derive a policy from value function)
Policy-based
Explicit: policy
No value function
Actor-Critic
Explicit: Policy
Explicit: Value function
Types of RL Agents
Model-based
Explicit: Model
May or may not have policy and/or value function
Model-free
Explicit: Value function and/or policy function
No model
RL Agents
Figure: Figure from David Silver RL course
Key Challenges in Learning to Make Sequences of Good Decisions
Planning (Agent’s internal computation)
Given model of how the world works
Dynamics and reward model
Algorithm computes how to act in order to maximize expected reward
With no interaction with real environment
Reinforcement learning
Agent doesn’t know how world works
Interacts with world to implicitly/explicitly learn how world works
Agent improves policy (may involve planning)
Planning Example
Solitaire: single player card game
Know all rules of game / perfect model
If take action a from state s
Can compute probability distribution over next state
Can compute potential score
Can plan ahead to decide on optimal action
E.g. dynamic programming, tree search, ...
Reinforcement Learning Example
Solitaire with no rule book
Learn directly by taking actions and seeing what happens
Try to find a good policy over time (that yields high reward)
Exploration and Exploitation
Agent only experiences what happens for the actions it tries
Mars rover trying to drive left learns the reward and next state for trying to drive left, but not for trying to drive right
Obvious! But leads to a dilemma
Exploration and Exploitation
Agent only experiences what happens for the actions it tries
How should an RL agent balance its actions?
Exploration: trying new things that might enable the agent to make better decisions in the future
Exploitation: choosing actions that are expected to yield good reward given past experience
Often there may be an exploration-exploitation tradeoff
May have to sacrifice reward in order to explore & learn about potentially better policy
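One standard way to balance this tradeoff (not named on the slide, but widely used) is an epsilon-greedy rule over estimated action values; a minimal sketch:

```python
import random

def epsilon_greedy(q_values, epsilon):
    """With probability epsilon explore (random action); else exploit (best known)."""
    if random.random() < epsilon:
        return random.choice(list(q_values))   # exploration
    return max(q_values, key=q_values.get)     # exploitation

q = {"left": 0.2, "right": 1.5}                # hypothetical value estimates
assert epsilon_greedy(q, epsilon=0.0) == "right"   # pure exploitation
assert epsilon_greedy(q, epsilon=1.0) in q         # pure exploration
```

Small epsilon exploits almost always; larger epsilon sacrifices more immediate reward to keep learning about alternatives.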
Exploration and Exploitation Examples
Movies
Exploitation: Watch a favorite movie you’ve seen before
Exploration: Watch a new movie
Advertising
Exploitation: Show most effective ad so far
Exploration: Show a different ad
Driving
Exploitation: Try fastest route given prior experience
Exploration: Try a different route
Evaluation and Control
Evaluation
Estimate/predict the expected rewards from following a given policy
Control
Optimization: find the best policy
Example: Mars Rover Policy Evaluation
s1  s2  s3  s4  s5  s6  s7
Policy represented by arrows
π(s1) = π(s2) = · · · = π(s7) = right
Discount factor, γ = 0
What is the value of this policy?
Example: Mars Rover Policy Control
s1  s2  s3  s4  s5  s6  s7
Discount factor, γ = 0
What is the policy that optimizes the expected discounted sum of rewards?
Course Outline
Markov decision processes & planning
Model-free policy evaluation
Model-free control
Value function approximation & Deep RL
Policy Search
Exploration
Advanced Topics
See website for more details
Summary
Overview of reinforcement learning
Course logistics
Introduction to sequential decision making under uncertainty