Robot Reinforcement Learning
Robot Decision Making and Planning
Robots need to make various decisions and construct different plans, for example:
• Decision making
• Task planning
• Motion planning (e.g., for robotic arms)
• Path planning (e.g., for mobile robotics)
Perspectives for considering robot planning methods:
• Reactive (one-time) decision making versus sequential planning
• Certain versus uncertain scenarios
• Observable versus partially observable space

We will focus on sequential decision making.
Decision Making/Planning Types

• Deterministic, fully observable
  • The agent knows exactly which state it will be in
  • Agent actions are executed as expected
• Dynamic, partially observable
  • Observations provide new information about the current state, with uncertainty
  • Robot actions may not be successfully executed
• Non-observable
  • The agent may have no idea where it is
Example: Vacuum World
Example: Vacuum World (observable)
(Also called reward, penalty, or utility)
Example: Vacuum World (non-observable)
Planning with Uncertainty

• How about in an uncertain scenario?
  • Uncertainty in action outcomes
  • Uncertainty in state of knowledge
  • Any combination of the two
Planning with Uncertainty

• Solutions based on decision trees
Planning with Uncertainty

• A utility (i.e., reward or cost) function associates a real-valued utility with each outcome (state or state-action pair)
• With utilities, we can compute and optimize expected utilities for planning under uncertainty
• For example, the expected utility of decision d in state s is defined as

  EU(d | s) = Σ_s′ P(s′ | s, d) U(s′)

• The principle of maximum expected utility states that the optimal decision under uncertainty is the one with the greatest expected utility
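As a concrete illustration of computing and maximizing expected utility, here is a minimal Python sketch; the transition model P, the utility table U, and the vacuum-style states are made-up assumptions for illustration, not taken from the slides:

```python
# Minimal sketch of maximum expected utility, assuming a small
# hypothetical transition model P(s' | s, d) and utility table U(s').
def expected_utility(d, s, P, U):
    """EU(d | s) = sum over s' of P(s' | s, d) * U(s')."""
    return sum(prob * U[s_next] for s_next, prob in P[(s, d)].items())

def best_decision(s, decisions, P, U):
    """Principle of maximum expected utility: pick the decision with highest EU."""
    return max(decisions, key=lambda d: expected_utility(d, s, P, U))

# Toy vacuum-world-style example (all numbers are made up):
P = {
    ("dirty", "suck"): {"clean": 0.9, "dirty": 0.1},
    ("dirty", "move"): {"dirty": 1.0},
}
U = {"clean": 10.0, "dirty": -1.0}
print(best_decision("dirty", ["suck", "move"], P, U))  # -> "suck" (EU 8.9 vs. -1.0)
```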
Reinforcement Learning for Planning

Two fundamental problems in sequential decision making:
• Reinforcement learning:
  • The environment is initially unknown
  • The agent interacts with the environment
  • The agent improves its policy
• Planning:
  • A model of the environment is known
  • The agent performs computations with its model (without any external interaction)
Reinforcement Learning
• Branches of Machine Learning
Characteristics of Reinforcement Learning
• What makes reinforcement learning different from other machine learning paradigms?
  • There is no supervisor, only a reward signal
  • Feedback is delayed, not instantaneous
  • Time really matters (sequential, non-i.i.d. data)
  • The agent's actions affect the subsequent data it receives
Applications of Reinforcement Learning
Reinforcement Learning
• Reinforcement learning is based on the reward hypothesis
• Definition (Reward Hypothesis): all goals can be described by the maximization of expected cumulative reward
  • A reward 𝑅𝑡 is a scalar feedback signal
  • It indicates how well the agent is doing at step 𝑡
  • The agent's job is to maximize cumulative reward
• Actions may have long-term consequences, so reward may be delayed
  • It may be better to sacrifice immediate reward to gain more long-term reward
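To make cumulative reward concrete, here is a small sketch of computing a discounted return; the reward sequences and the discount factor are illustrative values:

```python
# Sketch: discounted cumulative reward (return), G = r1 + g*r2 + g^2*r3 + ...
def discounted_return(rewards, gamma=0.9):
    g = 0.0
    for r in reversed(rewards):  # fold backwards: G_t = r_{t+1} + gamma * G_{t+1}
        g = r + gamma * g
    return g

# Sacrificing immediate reward can pay off long term (illustrative numbers):
print(discounted_return([0, 0, 10]))  # delayed big reward  -> 8.1
print(discounted_return([1, 1, 1]))   # steady small reward -> 2.71
```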
Agent and Environment
The RL slides that follow are from Dr. David Silver: http://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching.html
History and State

Environment State

Agent State

Information State

Fully Observable Environments

Markov Property

State Transition Matrix

Markov Process

Markov Process: Example

Markov Reward Process

Markov Reward Process: Example

Markov Decision Process

Markov Decision Process: Example

MDP and Reinforcement Learning
Components of an RL Agent
Policy (and greedy policy)
Components of an RL Agent
Model
Value Function
Maze Example
Maze Example: Policy
Maze Example: Value Function
Maze Example: Model
Model-Based and Model-Free RL
Model-based RL versus model-free RL
Q-Learning: Integrating Learning and Planning
• We're going to learn model-free RL (although knowing a model also works)
• We will focus on finding a way to estimate the value function directly
  • The value function does not need to explicitly model or represent the world
• The value function here is the Q-function
  • A recursive way to approximate a value function
• The process of estimating the Q-function is called Q-learning
  • Q-learning integrates learning and planning
Q-Learning Basics
• Given a sequence of states, actions, and rewards defined by an MDP:

  s₀, a₀, r₁, s₁, a₁, r₂, s₂, …

  we define a unit experience as (s, a, r, s′)
• At each state s, choose the action a that maximizes the Q-function Q(s, a)
• Q is the estimated value function
  • It tells us how good an action is given a state
• Q(s, a) = immediate reward for taking an action + best value (Q) from the resulting (future) states
Q-Learning Formal Definition
• The Q-function has a recursive definition:

  Q(s, a) = r + γ max_a′ Q(s′, a′)

  where r is the immediate reward and s′ is the resulting state
• Q-learning is about maintaining and updating the table of Q-values, called the Q-table
  • Only Q-values of the state-action pairs that are visited are updated
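A minimal sketch of a Q-table in Python, reflecting the points above; the states, actions, and values are hypothetical:

```python
# Q-table: maps (state, action) -> estimated value. Only Q-values of
# visited state-action pairs are ever stored or updated; everything
# else implicitly defaults to 0.
Q = {}

def greedy_action(state, actions):
    """Pick the action with the highest current Q-value in this state."""
    return max(actions, key=lambda a: Q.get((state, a), 0.0))

Q[("s0", "left")] = 1.0   # hypothetical learned values
Q[("s0", "right")] = 2.5
print(greedy_action("s0", ["left", "right"]))  # -> "right"
```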
Q-Learning Algorithm
• The Q-learning algorithm is also recursive:
• Consider the unit experience (s, a, r, s′); the basic update is

  Q(s, a) ← r + γ max_a′ Q(s′, a′)
Q-Learning Example
Q-Learning Example (new episode)
Q-Learning
• The Q-learning algorithm is also recursive:
• Consider the unit experience (s, a, r, s′); in the basic update

  Q(s, a) ← r + γ max_a′ Q(s′, a′)

  the max operation greedily selects the next action, and the assignment completely overwrites the old Q-value
Q-Learning: Temporal Difference Learning
• Temporal Difference (TD) algorithms enable the agent to incrementally update its Q-table through every single action it takes:

  NewEstimate ← OldEstimate + StepSize × (Target − OldEstimate)

  ▪ The value Target − OldEstimate is called the TD error.
  ▪ The step size, usually denoted by 𝛼, is also called the learning rate; it lies between 0 and 1, where 𝛼 = 1 completely overwrites the old Q-value.
Q-Learning: Temporal Difference Learning
• The algorithm of Q-learning becomes:

  Q(s, a) ← Q(s, a) + 𝛼 [r + γ max_a′ Q(s′, a′) − Q(s, a)]

• Q-learning is an off-policy learning algorithm, because it directly finds the optimal Q-value without any dependency on the policy being followed (due to the max operation).

Ref: Reinforcement Learning: An Introduction by Sutton and Barto, Section 6.8
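A minimal sketch of this update for one unit experience (s, a, r, s′); the Q-table structure follows the earlier sketch, and the states, actions, and hyperparameters are illustrative:

```python
Q = {}  # Q-table as in the earlier sketch: (state, action) -> value

def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """Off-policy TD update:
    Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    best_next = max(Q.get((s_next, a2), 0.0) for a2 in actions)
    old = Q.get((s, a), 0.0)
    Q[(s, a)] = old + alpha * (r + gamma * best_next - old)

q_learning_update(Q, "s0", "right", 1.0, "s1", ["left", "right"])
print(Q[("s0", "right")])  # 0.1 * (1.0 + 0.9*0.0 - 0.0) = 0.1
```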
Q-Learning Example
𝜺-greedy Policy for Action Selection
• The 𝜀-greedy policy is the most widely used (for both on- and off-policy learning) to choose an action given a state (a code sketch follows below):

𝜀-greedy policy:
1. Generate a random number 𝑟 ∈ [0, 1]
2. If 𝑟 > 𝜀, choose the action derived from the Q-values (the one with the maximum Q-value)
3. Else, choose a random action

• The value of 𝜀 determines the exploration-exploitation trade-off of the agent
  ▪ A larger 𝜀 results in more exploration and less exploitation
  ▪ As a rule of thumb, 𝜀 is often initialized to 0.9 and decreased over time
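A minimal sketch of this action-selection rule, following the slide's convention (𝑟 > 𝜀 exploits, otherwise explore, so a larger 𝜀 means more exploration); the Q-table structure is assumed from the earlier sketches:

```python
import random

def epsilon_greedy(Q, state, actions, epsilon):
    """Slide's convention: if r > epsilon exploit, else explore."""
    if random.random() > epsilon:
        return max(actions, key=lambda a: Q.get((state, a), 0.0))  # exploit
    return random.choice(actions)                                   # explore
```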
SARSA
• SARSA is an acronym for State-Action-Reward-State-Action: its unit experience is (s, a, r, s′, a′).
• SARSA is an on-policy TD learning algorithm, because it evaluates and improves the same policy that is being used to select actions.
SARSA
• The final algorithm of SARSA:

  Q(s, a) ← Q(s, a) + 𝛼 [r + γ Q(s′, a′) − Q(s, a)]

  where a′ is the action actually selected in s′ by the current policy

Ref: Reinforcement Learning: An Introduction by Sutton and Barto, Section 6.7
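For contrast with the Q-learning update above, here is a minimal SARSA sketch; note that it uses the action a′ actually chosen in s′ rather than a max over actions. The Q-table structure and hyperparameters are assumptions carried over from the earlier sketches:

```python
def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    """On-policy TD update: the target uses Q(s', a') for the action a'
    actually selected by the behavior policy, not a max over actions."""
    old = Q.get((s, a), 0.0)
    target = r + gamma * Q.get((s_next, a_next), 0.0)
    Q[(s, a)] = old + alpha * (target - old)
```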
SARSA vs. Q-Learning
RL Challenges: A Pole-Balancing Example
• Problem: pole balancing
  • Move the cart left/right to keep the pole balanced
• State representation
  • Position and velocity of the cart
  • Angle and angular velocity of the pole
  • These are all continuous variables
• Action representation
  • Move the cart left or right at a certain speed
  • Speed is also a continuous variable
• A straightforward solution (sketched below)
  • Coarse discretization of the state variables, e.g., position into 3 values: left, center, right
  • Coarse discretization of speed as well
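A minimal sketch of such a coarse discretization for one state variable; the bin edges are made-up values:

```python
import numpy as np

edges = np.array([-0.5, 0.5])   # illustrative cart-position thresholds
labels = ["left", "center", "right"]

def discretize(x):
    """Map a continuous value into one of 3 coarse bins."""
    return labels[int(np.digitize(x, edges))]

print(discretize(-1.2))  # left
print(discretize(0.1))   # center
print(discretize(2.0))   # right
```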
RL Challenges
• Rewards
  • Indicate what we want to accomplish
  • NOT how we want to accomplish it
• Robot in a maze
  • Episodic task
  • Big reward when out, −1 for each step
• Chess
  • GOOD: +1 for winning, −1 for losing
  • BAD: +0.25 for taking an opponent's piece
• It is hard to design rewards for general tasks
• An active research topic is reward learning
Deep Q-Learning
• Q-learning is a simple yet powerful algorithm
• However, Q-learning easily suffers from the curse of dimensionality
• When the number of states and actions becomes large, the Q-table becomes intractable:
  1. The amount of memory required to save and update the Q-table grows with the number of states and actions
  2. The amount of time required to explore each state to create the required Q-table becomes unrealistic
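A back-of-the-envelope count makes this concrete; the bin and action counts below are illustrative, not from the slides:

```python
# E.g., pole balancing: 4 continuous state variables, each discretized
# into 100 bins, with 10 discretized actions (all counts illustrative).
n_states = 100 ** 4          # 100,000,000 discrete states
n_actions = 10
print(n_states * n_actions)  # 1,000,000,000 Q-table entries
```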
Deep Q-Learning
• Idea of deep Q-learning: approximate the Q-table with a neural network
  ▪ The state is given as the input, and the Q-values of all possible actions are generated as the output.
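A minimal sketch of such a network, assuming PyTorch; the layer sizes and the 4-dimensional state / 2-action setup (as in pole balancing) are illustrative choices:

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Replaces the Q-table: state vector in, one Q-value per action out."""
    def __init__(self, state_dim=4, n_actions=2, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, state):
        return self.net(state)  # Q(s, a) for every action a

q_net = QNetwork()
state = torch.randn(1, 4)               # e.g., cart pos/vel, pole angle/vel
q_values = q_net(state)                 # shape: (1, 2)
action = q_values.argmax(dim=1).item()  # greedy action from network output
```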