TRANSCRIPT
![Page 1: PowerPoint 簡報speech.ee.ntu.edu.tw/~tlkagk/courses/MLDS_2018/Lecture/Reward (v3).pdf · Title PowerPoint 簡報 Author Hung-yi Lee Created Date 7/6/2018 11:52:40 AM](https://reader036.vdocument.in/reader036/viewer/2022071217/604cedda03b4b647f74a33b6/html5/thumbnails/1.jpg)
Sparse Reward
Hung-yi Lee
Sparse Reward: Reward Shaping
Reward Shaping
Take "Play": $r_{t+1} = 1$, $r_{t+100} = -100$
Take "Study": $r_{t+1} = -1$, $r_{t+100} = 100$
Shaped reward for "Study": $r_{t+1} = 1$
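To see why an agent picks "Play", and how shaping flips that choice, here is a minimal Python sketch. It assumes a discount factor $\gamma = 0.9$ (the slide does not give one) and treats the two rewards listed for each action as its only rewards.

```python
# Toy version of the "Play vs. Study" example.
# GAMMA is an assumed discount factor; the slide does not specify one.
GAMMA = 0.9


def discounted_return(rewards, gamma=GAMMA):
    """rewards maps k -> r_{t+k}; r_{t+1} is undiscounted, r_{t+k} gets gamma**(k-1)."""
    return sum(gamma ** (k - 1) * r for k, r in rewards.items())


play = {1: 1.0, 100: -100.0}
study = {1: -1.0, 100: 100.0}
study_shaped = {1: 1.0, 100: 100.0}  # shaping turns the immediate -1 into +1

for name, rewards in [("Play", play), ("Study", study), ("Study (shaped)", study_shaped)]:
    print(f"{name:15s} {discounted_return(rewards):+.4f}")
```

With $\gamma = 0.9$ the reward at step $t+100$ is scaled by $0.9^{99} \approx 3 \times 10^{-5}$, so the immediate reward dominates: without shaping "Play" scores higher, and with the shaped reward "Study" does.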
Reward Shaping
https://openreview.net/pdf?id=Hk3mPK5gg
Get a reward when getting closer to the goal
Needs domain knowledge
VizDoom: https://openreview.net/forum?id=Hk3mPK5gg&noteId=Hk3mPK5gg
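One way to implement "get a reward when getting closer" is to add a bonus for progress toward the goal. This is a minimal sketch, assuming the state exposes a position and the goal position is known; the function `shaped_reward` and its `scale` parameter are hypothetical.

```python
import math


def shaped_reward(env_reward, pos, next_pos, goal, scale=1.0):
    """Add a bonus equal to how much closer the agent moved to `goal`.

    Euclidean distance and `scale` are illustrative assumptions; choosing a
    sensible distance measure is exactly where the domain knowledge goes.
    """
    progress = math.dist(pos, goal) - math.dist(next_pos, goal)
    return env_reward + scale * progress


goal = (3.0, 0.0)
print(shaped_reward(0.0, (0.0, 0.0), (1.0, 0.0), goal))  # moved closer: 1.0
print(shaped_reward(0.0, (1.0, 0.0), (0.0, 0.0), goal))  # moved away: -1.0
```

Rewarding the change in distance rather than the distance itself means the bonuses along any path telescope to (start distance minus end distance), so an agent cannot collect extra shaping reward by moving back and forth.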
Curiosity https://arxiv.org/abs/1705.05363
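Without curiosity: given $s_1$, the actor takes $a_1$ and the environment returns reward $r_1$ and the next state $s_2$; the actor takes $a_2$, receives $r_2$, and the environment moves to $s_3$; and so on. The actor is updated to maximize
$$R(\tau) = \sum_{t=1}^{T} r_t$$
ICM = intrinsic curiosity module. At each step an ICM takes $s_t$, $a_t$, $s_{t+1}$ and outputs an intrinsic reward $r_t^i$ ($r_1^i$, $r_2^i$, …), and the actor is updated to maximize
$$R(\tau) = \sum_{t=1}^{T} \left(r_t + r_t^i\right)$$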
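The curiosity objective just adds the intrinsic reward to the environment reward at every step. A minimal sketch; the intrinsic values below are made up and stand in for whatever an ICM would output.

```python
def curiosity_return(extrinsic, intrinsic):
    """R(tau) = sum_t (r_t + r_t^i) for one trajectory of length T."""
    if len(extrinsic) != len(intrinsic):
        raise ValueError("need one intrinsic reward per step")
    return sum(r + r_i for r, r_i in zip(extrinsic, intrinsic))


# Sparse task: the environment pays nothing until the last step.
extrinsic = [0.0, 0.0, 0.0, 1.0]
intrinsic = [0.4, 0.3, 0.2, 0.1]  # made-up ICM outputs, for illustration only

print(curiosity_return(extrinsic, [0.0] * 4))  # environment reward only
print(curiosity_return(extrinsic, intrinsic))  # with curiosity
```

The point is that the early steps, which earn nothing from the environment, now carry a learning signal.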
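Intrinsic Curiosity Module ($r_t^i$)
Inputs: $s_t$, $a_t$, $s_{t+1}$
Network 1 takes $s_t$ and $a_t$ and predicts $\hat{s}_{t+1}$; $r_t^i$ is the difference between $\hat{s}_{t+1}$ and $s_{t+1}$.
Large reward if $s_{t+1}$ is hard to predict: this encourages taking risks (exploring).
However, some states are hard to predict but not important (e.g., leaves fluttering in the wind).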
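A minimal sketch of this first version of the module, assuming a linear "Network 1" on raw state vectors and a squared error as the "diff"; the class `ForwardModel` and its methods are hypothetical (the ICM paper uses neural networks). Training the model on a transition makes that transition stop paying curiosity reward.

```python
import random


class ForwardModel:
    """Hypothetical 'Network 1': a linear model predicting s_{t+1} from (s_t, a_t)."""

    def __init__(self, state_dim, action_dim, lr=0.05, seed=0):
        rng = random.Random(seed)
        self.in_dim = state_dim + action_dim
        self.w = [[rng.uniform(-0.1, 0.1) for _ in range(self.in_dim)]
                  for _ in range(state_dim)]
        self.lr = lr

    def predict(self, s, a):
        x = list(s) + list(a)
        return [sum(w_j * x_j for w_j, x_j in zip(row, x)) for row in self.w]

    def intrinsic_reward(self, s, a, s_next):
        # r_t^i = 0.5 * ||s_hat_{t+1} - s_{t+1}||^2: large when s_{t+1} is hard to predict
        pred = self.predict(s, a)
        return 0.5 * sum((p - t) ** 2 for p, t in zip(pred, s_next))

    def update(self, s, a, s_next):
        # one SGD step on the same squared error
        x = list(s) + list(a)
        pred = self.predict(s, a)
        for i, (p, t) in enumerate(zip(pred, s_next)):
            err = p - t
            for j in range(self.in_dim):
                self.w[i][j] -= self.lr * err * x[j]


model = ForwardModel(state_dim=2, action_dim=1)
s, a, s_next = [1.0, 0.0], [1.0], [0.0, 1.0]
before = model.intrinsic_reward(s, a, s_next)
for _ in range(50):
    model.update(s, a, s_next)
after = model.intrinsic_reward(s, a, s_next)
print(f"curiosity before training: {before:.4f}, after: {after:.6f}")
```

If `s_next` were drawn at random on every step, like the leaves, no amount of training would drive this error down, so the agent would keep being rewarded for watching unpredictable but unimportant things.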
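Intrinsic Curiosity Module ($r_t^i$)
Inputs: $s_t$, $a_t$, $s_{t+1}$
• A feature extractor maps $s_t$ and $s_{t+1}$ to $\phi(s_t)$ and $\phi(s_{t+1})$.
• Network 1 takes $\phi(s_t)$ and $a_t$ and predicts $\hat{\phi}(s_{t+1})$; $r_t^i$ is the difference between $\hat{\phi}(s_{t+1})$ and $\phi(s_{t+1})$.
• Network 2 takes $\phi(s_t)$ and $\phi(s_{t+1})$ and predicts $\hat{a}_t$; the feature extractor is trained so that $\hat{a}_t$ matches $a_t$ (small diff).
$\phi$ therefore keeps only the features that are useful for predicting actions, so unpredictable details that the action does not affect are filtered out.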
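To illustrate why measuring curiosity in feature space helps, here is a toy sketch in which the trained pieces are replaced by hand-written stand-ins: `phi`, `network1`, and `network2` are hypothetical functions that play the roles a trained ICM would learn, and the state is an assumed pair (agent position, leaf noise).

```python
import random


def env_step(state, action, rng):
    # toy state: (agent_x, leaf_noise); leaves move randomly whatever the agent does
    agent_x, _ = state
    return (agent_x + action, rng.uniform(-1.0, 1.0))


def phi(state):
    # stand-in for a feature extractor trained through Network 2:
    # only agent_x helps predict the action, so leaf_noise is dropped
    agent_x, _leaf_noise = state
    return [float(agent_x)]


def network1(features, action):
    # stand-in forward model in feature space: predicts phi(s_{t+1})
    return [features[0] + action]


def network2(features, next_features):
    # stand-in inverse model: predicts a_t from phi(s_t) and phi(s_{t+1})
    return next_features[0] - features[0]


def sq_diff(u, v):
    return sum((x - y) ** 2 for x, y in zip(u, v))


rng = random.Random(0)
s_t, a_t = (0, 0.3), 1
s_next = env_step(s_t, a_t, rng)

# curiosity on raw states: the position is predictable, the leaves are not (guess 0.0)
raw_reward = sq_diff([s_t[0] + a_t, 0.0], list(s_next))
# curiosity in feature space, as in the ICM
icm_reward = sq_diff(network1(phi(s_t), a_t), phi(s_next))
inverse_loss = (network2(phi(s_t), phi(s_next)) - a_t) ** 2
print(raw_reward, icm_reward, inverse_loss)
```

The raw-state reward is driven entirely by the leaf noise, while the feature-space reward is zero because $\phi$ discards what the action cannot influence.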
Reward from Auxiliary Task
https://arxiv.org/abs/1611.05397
Demo
Sparse Reward: Curriculum Learning
Curriculum Learning
• Start from simple training examples, then make them harder and harder.
Example: VizDoom
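A minimal sketch of an easy-to-hard schedule. The promotion rule (move to the next level once the success rate reaches a threshold) and the learner `ToyAgent`, which only improves on tasks slightly above its current skill, are hypothetical assumptions chosen to mimic the sparse-reward problem.

```python
class ToyAgent:
    """Hypothetical learner: improves only on tasks at most 1.0 above its skill."""

    def __init__(self):
        self.skill = 0.0

    def train_round(self, difficulty):
        gap = difficulty - self.skill
        if 0 < gap <= 1.0:  # hard enough to learn from, easy enough to get reward
            self.skill += 0.5
        return 1.0 if difficulty <= self.skill else 0.0  # success rate this round


def run_curriculum(agent, difficulties, threshold=0.8, max_rounds=100):
    """Train on difficulties in order; move on once success >= threshold."""
    level, schedule = 0, []
    for _ in range(max_rounds):
        success = agent.train_round(difficulties[level])
        schedule.append(difficulties[level])
        if success >= threshold:
            if level == len(difficulties) - 1:
                break
            level += 1
    return schedule


easy_to_hard = ToyAgent()
schedule = run_curriculum(easy_to_hard, [1.0, 2.0, 3.0, 4.0, 5.0])
hardest_only = ToyAgent()
run_curriculum(hardest_only, [5.0])
print(schedule, easy_to_hard.skill, hardest_only.skill)
```

The agent trained easy-to-hard reaches the hardest level, while the one thrown straight at it never gets a reward and never improves.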
Reverse Curriculum Generation
➢ Given a goal state $s_g$.
➢ Sample some states $s_1$ "close" to $s_g$.
➢ Start from states $s_1$; each trajectory has reward $R(s_1)$.
Reverse Curriculum Generation
➢ Delete the $s_1$ whose reward is too large (already learned) or too small (too difficult at this moment).
➢ Sample $s_2$ "close" to the remaining $s_1$, and start from $s_2$.
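The steps above can be sketched on a 1-D line with the goal at 0. Everything here is an illustrative assumption: `ToyPolicy` stands in for a learning agent and for the estimate of $R(s)$, the thresholds `r_min = 0.1` and `r_max = 0.9` are arbitrary, and in 1-D "sampling" nearby states simply takes every neighbour (a real implementation would sample, e.g. with short random walks).

```python
class ToyPolicy:
    """Hypothetical policy on a 1-D line: reliably reaches the goal from `radius` steps away."""

    def __init__(self):
        self.radius = 0

    def expected_reward(self, dist):
        # stand-in for R(s): average reward of trajectories started dist steps from s_g
        if dist <= self.radius:
            return 1.0  # already learned
        if dist == self.radius + 1:
            return 0.5  # right difficulty
        return 0.0      # too difficult for now

    def train(self, start_dists):
        # toy assumption: practising from a start state makes it reliable
        if start_dists:
            self.radius = max(self.radius, max(start_dists))


def reverse_curriculum(policy, goal, iterations, r_min=0.1, r_max=0.9):
    starts = [goal]
    for _ in range(iterations):
        # states "close" to the current starts: in 1-D, every neighbour
        candidates = {s + step for s in starts for step in (-1, 1)}
        # delete starts whose reward is too large (learned) or too small (too hard)
        kept = [s for s in candidates
                if r_min <= policy.expected_reward(abs(s - goal)) <= r_max]
        policy.train([abs(s - goal) for s in kept])
        if kept:
            starts = kept
    return starts


policy = ToyPolicy()
print(policy.expected_reward(5))  # far start, before training
starts = reverse_curriculum(policy, goal=0, iterations=5)
print(sorted(starts), policy.radius, policy.expected_reward(5))
```

Each round keeps only the frontier states, so the start distribution moves outward from $s_g$ one step at a time.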
Sparse Reward: Hierarchical Reinforcement Learning
Hierarchical RL
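➢ If the lower agent cannot achieve the goal, the upper agent gets a penalty.
➢ If an agent reaches the wrong goal, assume the original goal was the wrong one, i.e., treat the state it actually reached as the goal.
Example (purely fictional; it has nothing to do with the real situation):
University president (校長) → provides goal → Professor (教授) → provides subgoal → Graduate student (菸酒生, literally "tobacco-and-liquor student", a pun on 研究生, "graduate student") → action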
https://arxiv.org/abs/1805.08180
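A minimal 1-D sketch of the two rules above: the upper agent is penalized when the lower agent misses its subgoal, and the lower agent's experience is relabeled as if the state it actually reached had been the goal. All names are hypothetical, the 0/−1 sparse reward is an assumed convention, and this is not the implementation from the linked paper.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Transition:
    state: int
    action: int
    next_state: int
    goal: int
    reward: float


def goal_reward(next_state, goal):
    # assumed sparse convention: 0 on reaching the goal, -1 otherwise
    return 0.0 if next_state == goal else -1.0


def lower_agent(start, subgoal, max_steps):
    """Hypothetical lower agent: walks toward `subgoal` one unit per step, for at most max_steps."""
    s, traj = start, []
    for _ in range(max_steps):
        if s == subgoal:
            break
        a = 1 if subgoal > s else -1
        traj.append(Transition(s, a, s + a, subgoal, goal_reward(s + a, subgoal)))
        s += a
    return traj, s


def upper_penalty(subgoal, achieved, penalty=-1.0):
    # the upper agent is penalized when the lower agent cannot reach the subgoal
    return 0.0 if achieved == subgoal else penalty


def hindsight_relabel(traj):
    # pretend the state actually reached was the goal all along
    if not traj:
        return []
    achieved = traj[-1].next_state
    return [replace(t, goal=achieved, reward=goal_reward(t.next_state, achieved))
            for t in traj]


traj, achieved = lower_agent(start=0, subgoal=5, max_steps=3)
relabeled = hindsight_relabel(traj)
print(achieved, upper_penalty(5, achieved))
print([t.reward for t in traj], [t.reward for t in relabeled])
```

The original transitions never see a reward, but after relabeling the last step counts as a success, so the lower agent still learns something from a failed attempt.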
Acknowledgement
• Thanks to Dr. 芮祥麟 for catching spelling errors on the course webpage.