![Page 1: PPO Reinforcement Learning for Dexterous In-Hand Manipulation · Main challenges for using Reinforcement Learning in Robotics –Differences between reality and simulation need to](https://reader033.vdocument.in/reader033/viewer/2022042220/5ec678a012b5ea1569197496/html5/thumbnails/1.jpg)
v19.11.18 | FLORIAN SCHMALZL| 6SCHMALZ@INFORMATIK
MIN FacultyDepartment of Informatics
University of Hamburg
Faculty of Mathematics, Informatics and Natural Science
Department of Informatics
Technical Aspects of Multimodal Systems
PPO Reinforcement Learning for Dexterous In-Hand Manipulation
![Page 2: PPO Reinforcement Learning for Dexterous In-Hand Manipulation · Main challenges for using Reinforcement Learning in Robotics –Differences between reality and simulation need to](https://reader033.vdocument.in/reader033/viewer/2022042220/5ec678a012b5ea1569197496/html5/thumbnails/2.jpg)
v19.11.18 | FLORIAN SCHMALZL| 6SCHMALZ@INFORMATIK
MIN FacultyDepartment of Informatics
PPO-RL IN ROBOTICS PAGE 1
Goals of this presentation
Introduction to Reinforcement Learning in Robotics
What is PPO and what are its advantages
Using PPO to improve Robot Dexterity
![Page 3: PPO Reinforcement Learning for Dexterous In-Hand Manipulation · Main challenges for using Reinforcement Learning in Robotics –Differences between reality and simulation need to](https://reader033.vdocument.in/reader033/viewer/2022042220/5ec678a012b5ea1569197496/html5/thumbnails/3.jpg)
v19.11.18 | FLORIAN SCHMALZL| 6SCHMALZ@INFORMATIK
MIN FacultyDepartment of Informatics
PPO-RL IN ROBOTICS PAGE 2
Reinforcement Learning - Introduction
Examples:• Deep Q learning ATARI games• AlphaGo• Berkeley robot stacking Legos• Physically-simulated quaruped movement
Source: [1]
![Page 4: PPO Reinforcement Learning for Dexterous In-Hand Manipulation · Main challenges for using Reinforcement Learning in Robotics –Differences between reality and simulation need to](https://reader033.vdocument.in/reader033/viewer/2022042220/5ec678a012b5ea1569197496/html5/thumbnails/4.jpg)
v19.11.18 | FLORIAN SCHMALZL| 6SCHMALZ@INFORMATIK
MIN FacultyDepartment of Informatics
PPO-RL IN ROBOTICS PAGE 3
Reinforcement Learning – Policy Gradients
Source: [1]
• On-policy vs Off-policy• Do action, act like gradient probability is 1.0, backpropagate• PG‘s do not require a label like in supervised learning
![Page 5: PPO Reinforcement Learning for Dexterous In-Hand Manipulation · Main challenges for using Reinforcement Learning in Robotics –Differences between reality and simulation need to](https://reader033.vdocument.in/reader033/viewer/2022042220/5ec678a012b5ea1569197496/html5/thumbnails/5.jpg)
v19.11.18 | FLORIAN SCHMALZL| 6SCHMALZ@INFORMATIK
MIN FacultyDepartment of Informatics
PPO-RL IN ROBOTICS PAGE 4
Reinforcement Learning – Problems
• Generated training data dependent on existing policy
• Does not scale easily to settings with difficult exploration (Robotics!)
• Difficult with high-dimensional actions
• Requires many samples & long training times
→ Higher complexity problems need more sophisticated approaches
![Page 6: PPO Reinforcement Learning for Dexterous In-Hand Manipulation · Main challenges for using Reinforcement Learning in Robotics –Differences between reality and simulation need to](https://reader033.vdocument.in/reader033/viewer/2022042220/5ec678a012b5ea1569197496/html5/thumbnails/6.jpg)
v19.11.18 | FLORIAN SCHMALZL| 6SCHMALZ@INFORMATIK
MIN FacultyDepartment of Informatics
PPO-RL IN ROBOTICS PAGE 5
Reinforcement Learning – The way to improvement
• Using all data so far available and computing the bestpolicy
• We always want the maximum discounted rewardwhen deciding what action to take
• Optimizing the loss function without overfitting
![Page 7: PPO Reinforcement Learning for Dexterous In-Hand Manipulation · Main challenges for using Reinforcement Learning in Robotics –Differences between reality and simulation need to](https://reader033.vdocument.in/reader033/viewer/2022042220/5ec678a012b5ea1569197496/html5/thumbnails/7.jpg)
v19.11.18 | FLORIAN SCHMALZL| 6SCHMALZ@INFORMATIK
MIN FacultyDepartment of Informatics
PPO-RL IN ROBOTICS PAGE 6
TRPO - Visualization
Source: [2]
![Page 8: PPO Reinforcement Learning for Dexterous In-Hand Manipulation · Main challenges for using Reinforcement Learning in Robotics –Differences between reality and simulation need to](https://reader033.vdocument.in/reader033/viewer/2022042220/5ec678a012b5ea1569197496/html5/thumbnails/8.jpg)
v19.11.18 | FLORIAN SCHMALZL| 6SCHMALZ@INFORMATIK
MIN FacultyDepartment of Informatics
PPO-RL IN ROBOTICS PAGE 7
Trust Region Policy Optimization (TRPO)
• Define a region in which we trust an optimized policy to not have changedcatastrophically
• Measuring the change between policies using KL-divergence
Source: [3]
![Page 9: PPO Reinforcement Learning for Dexterous In-Hand Manipulation · Main challenges for using Reinforcement Learning in Robotics –Differences between reality and simulation need to](https://reader033.vdocument.in/reader033/viewer/2022042220/5ec678a012b5ea1569197496/html5/thumbnails/9.jpg)
v19.11.18 | FLORIAN SCHMALZL| 6SCHMALZ@INFORMATIK
MIN FacultyDepartment of Informatics
PPO-RL IN ROBOTICS PAGE 8
Problems with TRPO
• Performs poor on tasks withCNN‘s and RNN‘s
• Implementation is difficult• Hard to use with
architecture with multiple outputs
Source: [4]
![Page 10: PPO Reinforcement Learning for Dexterous In-Hand Manipulation · Main challenges for using Reinforcement Learning in Robotics –Differences between reality and simulation need to](https://reader033.vdocument.in/reader033/viewer/2022042220/5ec678a012b5ea1569197496/html5/thumbnails/10.jpg)
v19.11.18 | FLORIAN SCHMALZL| 6SCHMALZ@INFORMATIK
MIN FacultyDepartment of Informatics
PPO-RL IN ROBOTICS PAGE 9
Proximal Policy Optimization
![Page 11: PPO Reinforcement Learning for Dexterous In-Hand Manipulation · Main challenges for using Reinforcement Learning in Robotics –Differences between reality and simulation need to](https://reader033.vdocument.in/reader033/viewer/2022042220/5ec678a012b5ea1569197496/html5/thumbnails/11.jpg)
v19.11.18 | FLORIAN SCHMALZL| 6SCHMALZ@INFORMATIK
MIN FacultyDepartment of Informatics
PPO-RL IN ROBOTICS PAGE 10
PPO – goals as set by the creators @OpenAI
• Simple implementation
• Sample efficient
• Easily tunable
![Page 12: PPO Reinforcement Learning for Dexterous In-Hand Manipulation · Main challenges for using Reinforcement Learning in Robotics –Differences between reality and simulation need to](https://reader033.vdocument.in/reader033/viewer/2022042220/5ec678a012b5ea1569197496/html5/thumbnails/12.jpg)
v19.11.18 | FLORIAN SCHMALZL| 6SCHMALZ@INFORMATIK
MIN FacultyDepartment of Informatics
PPO-RL IN ROBOTICS PAGE 11
PPO – In practice
• Training of two separate networks (with similar architecture):– Policy network, mapping observations to actions
– Value network, prediction the future reward from the current state
• Value network is only used in training and has access to moreinformation than the life version, such as:– Hand joint angles + velocities
– Object velocity
– Object + Target orientation
• Live system uses only the policy network
![Page 13: PPO Reinforcement Learning for Dexterous In-Hand Manipulation · Main challenges for using Reinforcement Learning in Robotics –Differences between reality and simulation need to](https://reader033.vdocument.in/reader033/viewer/2022042220/5ec678a012b5ea1569197496/html5/thumbnails/13.jpg)
v19.11.18 | FLORIAN SCHMALZL| 6SCHMALZ@INFORMATIK
MIN FacultyDepartment of Informatics
PPO-RL IN ROBOTICS PAGE 12
PPO – Clipping aka the „Magic“
Source: [5]
![Page 14: PPO Reinforcement Learning for Dexterous In-Hand Manipulation · Main challenges for using Reinforcement Learning in Robotics –Differences between reality and simulation need to](https://reader033.vdocument.in/reader033/viewer/2022042220/5ec678a012b5ea1569197496/html5/thumbnails/14.jpg)
v19.11.18 | FLORIAN SCHMALZL| 6SCHMALZ@INFORMATIK
MIN FacultyDepartment of Informatics
PPO-RL IN ROBOTICS PAGE 13
VIDEO
Source: [6]
![Page 15: PPO Reinforcement Learning for Dexterous In-Hand Manipulation · Main challenges for using Reinforcement Learning in Robotics –Differences between reality and simulation need to](https://reader033.vdocument.in/reader033/viewer/2022042220/5ec678a012b5ea1569197496/html5/thumbnails/15.jpg)
v19.11.18 | FLORIAN SCHMALZL| 6SCHMALZ@INFORMATIK
MIN FacultyDepartment of Informatics
PPO-RL IN ROBOTICS PAGE 14
Real-World vs Simulation
Source: [5]
![Page 16: PPO Reinforcement Learning for Dexterous In-Hand Manipulation · Main challenges for using Reinforcement Learning in Robotics –Differences between reality and simulation need to](https://reader033.vdocument.in/reader033/viewer/2022042220/5ec678a012b5ea1569197496/html5/thumbnails/16.jpg)
v19.11.18 | FLORIAN SCHMALZL| 6SCHMALZ@INFORMATIK
MIN FacultyDepartment of Informatics
PPO-RL IN ROBOTICS PAGE 15
Real-World
Source: [5]
• Wide range of tasks• Diverse environments• Fixed physics
→ extremely hard so simulate
![Page 17: PPO Reinforcement Learning for Dexterous In-Hand Manipulation · Main challenges for using Reinforcement Learning in Robotics –Differences between reality and simulation need to](https://reader033.vdocument.in/reader033/viewer/2022042220/5ec678a012b5ea1569197496/html5/thumbnails/17.jpg)
v19.11.18 | FLORIAN SCHMALZL| 6SCHMALZ@INFORMATIK
MIN FacultyDepartment of Informatics
PPO-RL IN ROBOTICS PAGE 16
How to make Simulations transferable
Source: [5]
Limiting control policy observations:• Simplification of pose
estimator• i.e. only fingertip pose, not
pressure
Randomizations:• Observation noise• Physics randomizations• Unmodeled effects
![Page 18: PPO Reinforcement Learning for Dexterous In-Hand Manipulation · Main challenges for using Reinforcement Learning in Robotics –Differences between reality and simulation need to](https://reader033.vdocument.in/reader033/viewer/2022042220/5ec678a012b5ea1569197496/html5/thumbnails/18.jpg)
v19.11.18 | FLORIAN SCHMALZL| 6SCHMALZ@INFORMATIK
MIN FacultyDepartment of Informatics
PPO-RL IN ROBOTICS PAGE 17
Examples of different simulation environments
Source: [5]
![Page 19: PPO Reinforcement Learning for Dexterous In-Hand Manipulation · Main challenges for using Reinforcement Learning in Robotics –Differences between reality and simulation need to](https://reader033.vdocument.in/reader033/viewer/2022042220/5ec678a012b5ea1569197496/html5/thumbnails/19.jpg)
v19.11.18 | FLORIAN SCHMALZL| 6SCHMALZ@INFORMATIK
MIN FacultyDepartment of Informatics
PPO-RL IN ROBOTICS PAGE 18
Complete System - Overview
Source: [5]
![Page 20: PPO Reinforcement Learning for Dexterous In-Hand Manipulation · Main challenges for using Reinforcement Learning in Robotics –Differences between reality and simulation need to](https://reader033.vdocument.in/reader033/viewer/2022042220/5ec678a012b5ea1569197496/html5/thumbnails/20.jpg)
v19.11.18 | FLORIAN SCHMALZL| 6SCHMALZ@INFORMATIK
MIN FacultyDepartment of Informatics
PPO-RL IN ROBOTICS PAGE 19
Complete System - Overview
Source: [5]
• 3 real cameras feed CNN‘s and predict object pose• Control policy takes that object pose & the fingertip locations• Control policy outputs actions
![Page 21: PPO Reinforcement Learning for Dexterous In-Hand Manipulation · Main challenges for using Reinforcement Learning in Robotics –Differences between reality and simulation need to](https://reader033.vdocument.in/reader033/viewer/2022042220/5ec678a012b5ea1569197496/html5/thumbnails/21.jpg)
v19.11.18 | FLORIAN SCHMALZL| 6SCHMALZ@INFORMATIK
MIN FacultyDepartment of Informatics
PPO-RL IN ROBOTICS PAGE 20
Conclusion
Main benefits while using Reinforcement Learning in Robotics:– No previous real life experience needed
– Robot agent does not have to collect data in the real world
– Simulations can make use of scaling and distributed infrastructure
Main challenges for using Reinforcement Learning in Robotics– Differences between reality and simulation need to be bridged
through randomizations
– Policies need to overcome optimization obstacles, i.e. through PPO
![Page 22: PPO Reinforcement Learning for Dexterous In-Hand Manipulation · Main challenges for using Reinforcement Learning in Robotics –Differences between reality and simulation need to](https://reader033.vdocument.in/reader033/viewer/2022042220/5ec678a012b5ea1569197496/html5/thumbnails/22.jpg)
v19.11.18 | FLORIAN SCHMALZL| 6SCHMALZ@INFORMATIK
MIN FacultyDepartment of Informatics
PPO-RL IN ROBOTICS PAGE 21
Outlook
Similar approaches to be tested
• KFAC
– Blockwise approximation fo FisherInformation matrix (FIM)
– Very fast in optimizing objectivesthrough momentum
• ACKTR
– Advantage actor-critic model
– Extremely high sample efficiencySource: [7]
![Page 23: PPO Reinforcement Learning for Dexterous In-Hand Manipulation · Main challenges for using Reinforcement Learning in Robotics –Differences between reality and simulation need to](https://reader033.vdocument.in/reader033/viewer/2022042220/5ec678a012b5ea1569197496/html5/thumbnails/23.jpg)
v19.11.18 | FLORIAN SCHMALZL| 6SCHMALZ@INFORMATIK
MIN FacultyDepartment of Informatics
PPO-RL IN ROBOTICS PAGE 22
Sources
[1] Andrej Karpathy (2016). Deep Reinforcement Learning: Pong from Pixels, http://karpathy.github.io/2016/05/31/rl/
[2] Photo from www.pexels.com, free for personal and commercial use under CC0 Licence https://www.pexels.com/photo/adventure-cliff-lookout-people-6763/ and paint3D by the author
[3] John Schulmann, Sergey Levine, Philipp Moritz, Michael Jordan, Peter Abeel(2015). Trust Region Policy Optimization, Volume 37: International Conference on Machine Learning, 7-9 July 2015, Lille, France
[4] GIF from https://giphy.com/, original link not functional, free for personal and non-commercial use as per https://giphy.com/terms
[5] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, Oleg Klimov(2017). Proximal Policy Optimization Algorithms. https://arxiv.org/pdf/1707.06347.pdf
![Page 24: PPO Reinforcement Learning for Dexterous In-Hand Manipulation · Main challenges for using Reinforcement Learning in Robotics –Differences between reality and simulation need to](https://reader033.vdocument.in/reader033/viewer/2022042220/5ec678a012b5ea1569197496/html5/thumbnails/24.jpg)
v19.11.18 | FLORIAN SCHMALZL| 6SCHMALZ@INFORMATIK
MIN FacultyDepartment of Informatics
PPO-RL IN ROBOTICS PAGE 23
Sources cont.
[6] Learning Dexterity (2018), as seen on the OpenAI Youtube channel: https://www.youtube.com/watch?v=jwSbzNHGflM
[7] John Schulman (2017). Advanced Policy Gradient Methods:
Natural Gradient, TRPO, and More, https://drive.google.com/file/d/0BxXI_RttTZAhMVhsNk5VSXU0U3c/
![Page 25: PPO Reinforcement Learning for Dexterous In-Hand Manipulation · Main challenges for using Reinforcement Learning in Robotics –Differences between reality and simulation need to](https://reader033.vdocument.in/reader033/viewer/2022042220/5ec678a012b5ea1569197496/html5/thumbnails/25.jpg)
v19.11.18 | FLORIAN SCHMALZL| 6SCHMALZ@INFORMATIK
MIN FacultyDepartment of Informatics
PPO-RL IN ROBOTICS PAGE 24
Addendum
• Policy and Value Network
• GAE
• Distributed Training Structure
• Effect of Memory
• Physics vs Simulation Results
• Values for different Physics Parameters
• Different Training Environments Compared
![Page 26: PPO Reinforcement Learning for Dexterous In-Hand Manipulation · Main challenges for using Reinforcement Learning in Robotics –Differences between reality and simulation need to](https://reader033.vdocument.in/reader033/viewer/2022042220/5ec678a012b5ea1569197496/html5/thumbnails/26.jpg)
v19.11.18 | FLORIAN SCHMALZL| 6SCHMALZ@INFORMATIK
MIN FacultyDepartment of Informatics
PPO-RL IN ROBOTICS PAGE 25
Policy and Value network
Source: [5]
![Page 27: PPO Reinforcement Learning for Dexterous In-Hand Manipulation · Main challenges for using Reinforcement Learning in Robotics –Differences between reality and simulation need to](https://reader033.vdocument.in/reader033/viewer/2022042220/5ec678a012b5ea1569197496/html5/thumbnails/27.jpg)
v19.11.18 | FLORIAN SCHMALZL| 6SCHMALZ@INFORMATIK
MIN FacultyDepartment of Informatics
PPO-RL IN ROBOTICS PAGE 26
GAE – Generalized Advantage Estimator
![Page 28: PPO Reinforcement Learning for Dexterous In-Hand Manipulation · Main challenges for using Reinforcement Learning in Robotics –Differences between reality and simulation need to](https://reader033.vdocument.in/reader033/viewer/2022042220/5ec678a012b5ea1569197496/html5/thumbnails/28.jpg)
v19.11.18 | FLORIAN SCHMALZL| 6SCHMALZ@INFORMATIK
MIN FacultyDepartment of Informatics
PPO-RL IN ROBOTICS PAGE 27
Distributed Training Structure
Source: [5]
![Page 29: PPO Reinforcement Learning for Dexterous In-Hand Manipulation · Main challenges for using Reinforcement Learning in Robotics –Differences between reality and simulation need to](https://reader033.vdocument.in/reader033/viewer/2022042220/5ec678a012b5ea1569197496/html5/thumbnails/29.jpg)
v19.11.18 | FLORIAN SCHMALZL| 6SCHMALZ@INFORMATIK
MIN FacultyDepartment of Informatics
PPO-RL IN ROBOTICS PAGE 28
Effect of Memory in Simulation
Source: [5]
![Page 30: PPO Reinforcement Learning for Dexterous In-Hand Manipulation · Main challenges for using Reinforcement Learning in Robotics –Differences between reality and simulation need to](https://reader033.vdocument.in/reader033/viewer/2022042220/5ec678a012b5ea1569197496/html5/thumbnails/30.jpg)
v19.11.18 | FLORIAN SCHMALZL| 6SCHMALZ@INFORMATIK
MIN FacultyDepartment of Informatics
PPO-RL IN ROBOTICS PAGE 29
Simulated vs Physical Results
Source: [5]
![Page 31: PPO Reinforcement Learning for Dexterous In-Hand Manipulation · Main challenges for using Reinforcement Learning in Robotics –Differences between reality and simulation need to](https://reader033.vdocument.in/reader033/viewer/2022042220/5ec678a012b5ea1569197496/html5/thumbnails/31.jpg)
v19.11.18 | FLORIAN SCHMALZL| 6SCHMALZ@INFORMATIK
MIN FacultyDepartment of Informatics
PPO-RL IN ROBOTICS PAGE 30
Ranges of physics parameter randomizations
Source: [5]
![Page 32: PPO Reinforcement Learning for Dexterous In-Hand Manipulation · Main challenges for using Reinforcement Learning in Robotics –Differences between reality and simulation need to](https://reader033.vdocument.in/reader033/viewer/2022042220/5ec678a012b5ea1569197496/html5/thumbnails/32.jpg)
v19.11.18 | FLORIAN SCHMALZL| 6SCHMALZ@INFORMATIK
MIN FacultyDepartment of Informatics
PPO-RL IN ROBOTICS PAGE 31
Different training environments compared
Source: [5]