
Lecture 29: Machine Learning

Marvin Zhang, 08/10/2015

(Some images borrowed from CS 188.)


Announcements

• Project 3 composition revisions due Wednesday night.
• Project 4 due tonight (Monday).
• Office hours from 4-7pm today in the Woz.
• Homework 11 due tonight - just a survey!
• Project 4 contest due tomorrow (Tuesday) night.
  • Top 3 entries in each category get extra credit! Only one entry so far.
• Final on Thursday, 3-6pm in 2050 VLSB.
  • Review session tomorrow, 5-8pm in the Woz.
• Tuesday, Wednesday, and Thursday sections canceled.
  • Instead, TAs will lead topic-themed discussions Tuesday and Wednesday.
  • Details will be posted on Piazza.
  • Chris and Cale's 9:30-11am labs on Tuesday are NOT canceled.


What is Machine Learning?

• Natural Language Processing
• Computer Vision
• Robotics
• Game Playing
• and much more!
• What do these all have in common?


Machine Learning

What is machine learning?

• A subfield of computer science.
• The study of algorithms that analyze data to make decisions.
• Examples of decisions:
  • Is this email ham or spam?
  • How do I translate this sentence?
  • Will this user like this restaurant?


Machine Learning Example: Maps

K-means Clustering

• The data: restaurant locations
• The decision: which cluster does each belong to?

Called unsupervised learning, because no one tells it what the correct decision is.
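
The clustering step can be sketched in a few lines of Python. This is a minimal illustration of the idea, not the course's implementation; the names (points, k, k_means) and the fixed iteration count are assumptions.

    import random

    def k_means(points, k, iterations=10):
        """Cluster 2-D (x, y) points around k centers."""
        # Start from k randomly chosen points as the initial centers.
        centers = random.sample(points, k)
        for _ in range(iterations):
            # Assignment step: put each point in its closest center's cluster.
            clusters = [[] for _ in range(k)]
            for x, y in points:
                dists = [(x - cx) ** 2 + (y - cy) ** 2 for cx, cy in centers]
                clusters[dists.index(min(dists))].append((x, y))
            # Update step: move each center to the mean of its cluster.
            centers = [(sum(x for x, _ in c) / len(c),
                        sum(y for _, y in c) / len(c)) if c else centers[i]
                       for i, c in enumerate(clusters)]
        return centers, clusters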


Machine Learning Example: Maps

Linear Regression

• The data: user ratings
• The decision: what rating would the user give a new restaurant?

Called supervised learning, because some correct decisions are given.
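
A sketch of the idea in Python, using the closed-form least-squares fit for a single feature (illustrative only; the feature and helper names are assumptions, not the project's actual code):

    def fit_line(xs, ys):
        """Least-squares fit of ys = slope * xs + intercept."""
        n = len(xs)
        mean_x, mean_y = sum(xs) / n, sum(ys) / n
        slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) /
                 sum((x - mean_x) ** 2 for x in xs))
        intercept = mean_y - slope * mean_x
        # The returned predictor guesses the user's rating for a new value.
        return lambda x: slope * x + intercept

    predict = fit_line([1, 2, 3, 4], [2, 4, 6, 8])  # known (feature, rating) pairs
    predict(5)  # 10.0: the predicted rating for a new restaurant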


Outline

• So far, we've looked at two specific machine learning algorithms from two different domains.
• Today, we will focus on a subclass of problems in machine learning, known as reinforcement learning problems, and algorithms for these problems.


Reinforcement Learning

What is reinforcement learning?

• Concerned with learning behavior through experience.
• Two main components: the agent and the environment.
• The agent lives in and interacts with the environment, and through this experience learns a good pattern of behavior.

An Analogy

Suppose you go on a date with someone.

• In reinforcement learning terms, you are the agent.
• Everything else (the other person, the setting, etc.) is the environment.
• At the beginning of the date, you might not know how to act, so you try different things to see how the other person responds.

  You: "Do you like cats?"
  Someone: "Ew, no."
  You: "Oh… yeah, me neither."

• As the date goes on, you slowly figure out how you should behave based on what you've tried so far, and how it went.

  You: "So… do you like hamsters?"
  Someone: "Omg yes I love hamsters!"
  You: "Omg me too!!"

• If you're a good agent, you may even learn how to behave really, really well!

  (45 minutes of talking about hamsters later…)
  Someone: "Wow, you're a really great person."
  DATE: SUCCESS


RL Example: Gridworld

The Problem: How do we get to the goal (green) from the start (blue) as quickly as possible while avoiding the obstacles (red)?

What is the environment? What is the agent?


The RL Setting

The environment:
• States (s): Configuration of the agent and environment.
• Actions (a): What can the agent do in a state?
• Reward function (R): What reward does the agent get for each state?

The agent:
• Policy (𝜋): Given a state, what action will the agent take?
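
One way to picture these pieces in Python (a sketch with assumed names, not an interface from the lecture):

    # The environment supplies states, actions, and the reward function R.
    STATES = [(row, col) for row in range(3) for col in range(5)]  # e.g. grid squares
    ACTIONS = ['north', 'south', 'east', 'west', 'stay']

    def reward(state):
        """R(s): the reward for being in state s (chosen by the environment)."""
        return 0  # placeholder; Gridworld's choice is discussed below

    # The agent supplies the policy: a mapping from each state to an action.
    policy = {state: 'stay' for state in STATES}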


Gridworld Revisited

• The environment of Gridworld, in more detail:
  • States (s): What square is the agent in?
  • Actions (a): Go to an adjacent square, or stay put.
  • Reward function: ???

The Problem: How do we get to the goal from the start as quickly as possible while avoiding the obstacles?

In RL terminology: What is the optimal policy 𝜋* that maximizes my expected reward over time?

Gridworld Reward Function

One option: reward 1 for the goal square and 0 everywhere else (blank cells are obstacle squares):

  0 0 0 0 0
  0 0 0
  0 0 0 1

The option we use instead: reward -1 for every square except the goal, which is 0. This encodes "as quickly as possible," since every extra step costs the agent:

  -1 -1 -1 -1 -1
  -1 -1 -1
  -1 -1 -1 0
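
This second reward function is a one-liner in the sketch from before (GOAL's coordinates are assumed for illustration):

    GOAL = (2, 4)  # assumed position of the goal square

    def reward(state):
        """R(s): -1 for every step, 0 once we reach the goal."""
        return 0 if state == GOAL else -1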


Value Function

• Reward function: R(s) = reward of being in state s
• Value function: V(s) = value of being in state s
  • The value of s is the long-term expected reward starting from s.
• How do we determine where to go after s?
  • We use our policy 𝜋 to determine which actions to take.
  • So, the value function also depends on our policy.
• How do we determine the value of a state?
  • The value of a state is the reward of the state plus the value of the state we end up in next.


Value Function

V(s) = R(s) + V(s'), where s' is the state that following 𝜋 from s leads to.

• How do we solve this equation? Use recursion!
• What's our base case?
  • If we're at our goal, then there is no next state, so the value is just the reward: V(goal) = R(goal).
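
This translates directly into recursive Python (a sketch: next_state(s, a) is an assumed helper giving the square that action a leads to, and the policy is assumed to eventually reach the goal):

    def value(state, policy):
        """V(s) under the policy: reward now, plus the value of the next state."""
        if state == GOAL:
            return reward(state)  # base case: no next state
        return reward(state) + value(next_state(state, policy[state]), policy)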


Gridworld Value Function Example

• Arrows denote the policy 𝜋.
• What are the values of the states, assuming no random movements?

  -6 -5 -4 -3 -2
  -9 -5 -1
  -8 -7 -6 0

(Each value is R(s) = -1 plus the value of the next square along the arrows; the goal has value 0.)


Policy Evaluation

• What we just did was policy evaluation: determining the value of states given our policy 𝜋. Simply put, we figured out how good our policy is.
• But remember, what we are really interested in is the optimal policy 𝜋*! How do we find this?
• We need one more step: policy iteration.


Gridworld Policy Iteration Example

  -6 -5 -4 -3 -2
  -9 -5 -1
  -8 -7 -6 0

• Arrows denote the policy.
• Based on the value function, which action of the current policy should we change?


Policy Iteration

• Now that we know V(s), we improve our policy 𝜋 to a new policy 𝜋' as follows:
  • For every state s, 𝜋' picks the action that leads to the next state s' with the highest value.
• This is called policy iteration.
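
The improvement step, continuing the sketch (legal_actions(s) is another assumed helper listing the moves available in s; V is a dictionary of values from policy evaluation):

    def improve(policy, V):
        """In every state, pick the action leading to the highest-value neighbor."""
        return {s: max(legal_actions(s), key=lambda a: V[next_state(s, a)])
                for s in STATES}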

Page 146: Lecture29:MachineLearningcs61a/su15/slides/29.pdf · 2015. 8. 20. · the environment. • Atthebeginningofthedate,youmightnotknowhowto act,soyoutrydifferentthingstoseehowtheother

Optimal  Policy

Page 147: Lecture29:MachineLearningcs61a/su15/slides/29.pdf · 2015. 8. 20. · the environment. • Atthebeginningofthedate,youmightnotknowhowto act,soyoutrydifferentthingstoseehowtheother

Optimal  Policy• So,  to  find  the  optimal  policy:

Page 148: Lecture29:MachineLearningcs61a/su15/slides/29.pdf · 2015. 8. 20. · the environment. • Atthebeginningofthedate,youmightnotknowhowto act,soyoutrydifferentthingstoseehowtheother

Optimal  Policy• So,  to  find  the  optimal  policy:

• Initialize  some  policy  𝜋.

Page 149: Lecture29:MachineLearningcs61a/su15/slides/29.pdf · 2015. 8. 20. · the environment. • Atthebeginningofthedate,youmightnotknowhowto act,soyoutrydifferentthingstoseehowtheother

Optimal  Policy• So,  to  find  the  optimal  policy:

• Initialize  some  policy  𝜋.• Repeat:

Page 150: Lecture29:MachineLearningcs61a/su15/slides/29.pdf · 2015. 8. 20. · the environment. • Atthebeginningofthedate,youmightnotknowhowto act,soyoutrydifferentthingstoseehowtheother

Optimal  Policy• So,  to  find  the  optimal  policy:

• Initialize  some  policy  𝜋.• Repeat:

• Determine  V(s)  using  policy  evaluation.

Page 151: Lecture29:MachineLearningcs61a/su15/slides/29.pdf · 2015. 8. 20. · the environment. • Atthebeginningofthedate,youmightnotknowhowto act,soyoutrydifferentthingstoseehowtheother

Optimal  Policy• So,  to  find  the  optimal  policy:

• Initialize  some  policy  𝜋.• Repeat:

• Determine  V(s)  using  policy  evaluation.

Page 152: Lecture29:MachineLearningcs61a/su15/slides/29.pdf · 2015. 8. 20. · the environment. • Atthebeginningofthedate,youmightnotknowhowto act,soyoutrydifferentthingstoseehowtheother

Optimal  Policy• So,  to  find  the  optimal  policy:

• Initialize  some  policy  𝜋.• Repeat:

• Determine  V(s)  using  policy  evaluation.

• Find  a  better  policy  𝜋’  using  policy  iteration.

Page 153: Lecture29:MachineLearningcs61a/su15/slides/29.pdf · 2015. 8. 20. · the environment. • Atthebeginningofthedate,youmightnotknowhowto act,soyoutrydifferentthingstoseehowtheother

Optimal  Policy• So,  to  find  the  optimal  policy:

• Initialize  some  policy  𝜋.• Repeat:

• Determine  V(s)  using  policy  evaluation.

• Find  a  better  policy  𝜋’  using  policy  iteration.

Page 154: Lecture29:MachineLearningcs61a/su15/slides/29.pdf · 2015. 8. 20. · the environment. • Atthebeginningofthedate,youmightnotknowhowto act,soyoutrydifferentthingstoseehowtheother

Optimal  Policy• So,  to  find  the  optimal  policy:

• Initialize  some  policy  𝜋.• Repeat:

• Determine  V(s)  using  policy  evaluation.

• Find  a  better  policy  𝜋’  using  policy  iteration.

• If  𝜋  ==  𝜋’,  return  𝜋.

Page 155: Lecture29:MachineLearningcs61a/su15/slides/29.pdf · 2015. 8. 20. · the environment. • Atthebeginningofthedate,youmightnotknowhowto act,soyoutrydifferentthingstoseehowtheother

Optimal  Policy• So,  to  find  the  optimal  policy:

• Initialize  some  policy  𝜋.• Repeat:

• Determine  V(s)  using  policy  evaluation.

• Find  a  better  policy  𝜋’  using  policy  iteration.

• If  𝜋  ==  𝜋’,  return  𝜋.• Otherwise,  set  𝜋  equal  to  𝜋’.

Page 156: Lecture29:MachineLearningcs61a/su15/slides/29.pdf · 2015. 8. 20. · the environment. • Atthebeginningofthedate,youmightnotknowhowto act,soyoutrydifferentthingstoseehowtheother

Optimal  Policy• So,  to  find  the  optimal  policy:

• Initialize  some  policy  𝜋.• Repeat:

• Determine  V(s)  using  policy  evaluation.

• Find  a  better  policy  𝜋’  using  policy  iteration.

• If  𝜋  ==  𝜋’,  return  𝜋.• Otherwise,  set  𝜋  equal  to  𝜋’.

• We  can  prove  that  this  𝜋  we  return  is  optimal,  i.e.  𝜋  ==  𝜋*!We  won’t  do  the  math,  though.

Page 157: Lecture29:MachineLearningcs61a/su15/slides/29.pdf · 2015. 8. 20. · the environment. • Atthebeginningofthedate,youmightnotknowhowto act,soyoutrydifferentthingstoseehowtheother

Optimal Policy

• So, to find the optimal policy:
• Initialize some policy 𝜋.
• Repeat:
• Determine V(s) using policy evaluation.
• Find a better policy 𝜋’ using policy iteration.
• If 𝜋 == 𝜋’, return 𝜋.
• Otherwise, set 𝜋 equal to 𝜋’.

• We can prove that the 𝜋 we return is optimal, i.e. 𝜋 == 𝜋*! We won’t do the math, though.
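As a rough Python sketch of this loop, reusing improve_policy from the sketch above and assuming an evaluate_policy helper for the evaluation step (one possible version appears after the terminology recap below); R is the reward function:

def find_optimal_policy(states, actions, next_state, R, policy):
    """Alternate policy evaluation and policy iteration until the
    policy stops changing, then return it."""
    while True:
        V = evaluate_policy(states, next_state, R, policy)            # policy evaluation
        new_policy = improve_policy(states, actions, next_state, V)   # policy iteration
        if new_policy == policy:
            return policy
        policy = new_policy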

Terminology So Far

• Reward function (R): how good is a state? (short-term)
• Value function (V): how good is a state? (long-term)
• Measures long-term expected reward.
• Depends on the current policy 𝜋.
• Policy evaluation:
• Evaluating our current policy 𝜋 to get a value function V (one possible implementation is sketched below).
• Policy iteration:
• Using our value function V to get a better policy 𝜋’.

• Basically every RL algorithm is a combination of policy evaluation and policy iteration! Let’s take a closer look at two such algorithms.
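For concreteness, here is one common way the evaluation step can be carried out: start with all values at zero and repeatedly sweep over the states, setting each V(s) to the immediate reward plus the (discounted) value of the state the policy leads to, until the values stop changing. The discount factor gamma and the tolerance tol are assumptions made for this sketch; the lecture doesn't commit to a particular update rule.

def evaluate_policy(states, next_state, R, policy, gamma=0.9, tol=1e-6):
    """Estimate V(s) for a fixed policy by sweeping
    V(s) = R(s) + gamma * V(s') until the values settle."""
    V = {s: 0 for s in states}
    while True:
        biggest_change = 0
        for s in states:
            old = V[s]
            # The policy fixes the action, so there is no max here.
            V[s] = R(s) + gamma * V[next_state(s, policy[s])]
            biggest_change = max(biggest_change, abs(V[s] - old))
        if biggest_change < tol:
            return V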

Value Iteration

• Value iteration is an algorithm that combines the policy evaluation and policy iteration steps into one single step.
• Repeat:
• For all states s, determine V(s), and set V(s) to its maximum possible value.
• If V doesn’t change, return the policy 𝜋 that acts according to the maximum value of V.

• Again, we can show that this policy is optimal, i.e. 𝜋 == 𝜋*! Again, let’s not do the math.

(demo)
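A rough sketch of value iteration under the same illustrative assumptions as before (deterministic next_state, reward function R, discount gamma, and every state having at least one action):

def value_iteration(states, actions, next_state, R, gamma=0.9, tol=1e-6):
    """Evaluation and iteration in one step: sweep each V(s) to its
    maximum possible one-step value until V stops changing."""
    V = {s: 0 for s in states}
    while True:
        biggest_change = 0
        for s in states:
            old = V[s]
            # Set V(s) to its maximum possible value over all actions.
            V[s] = max(R(s) + gamma * V[next_state(s, a)] for a in actions(s))
            biggest_change = max(biggest_change, abs(V[s] - old))
        if biggest_change < tol:
            break
    # Return the policy that acts according to the maximum value of V.
    return {s: max(actions(s), key=lambda a: V[next_state(s, a)]) for s in states}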

Questions and Limitations

• What if there are way too many states and actions to try?
• We have to find a way to only look at a subset of states and actions, and we also need to reasonably approximate their values.
• What if we don’t know how the environment works?
• It’s reasonable to think that the agent doesn’t completely understand what the next possible states are from taking an action.
• In this case, we have to try different things in order to figure out our environment - this is called exploration.
• Sometimes, we also want to just keep doing what we know is good - this is called exploitation.
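One standard way to balance the two is epsilon-greedy action selection: with some small probability, explore a random action; otherwise, exploit the current policy. This particular rule is a common technique, not something the slides spell out:

import random

def choose_action(policy, s, actions, epsilon=0.1):
    """With probability epsilon, explore by trying a random action;
    otherwise, exploit the action the current policy recommends."""
    if random.random() < epsilon:
        return random.choice(actions(s))   # exploration
    return policy[s]                       # exploitation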

Rollout-based Policy Iteration

• In RBPI, instead of full policy evaluation, we run simulations, or rollouts, to approximate our value function.
• This addresses the limitations on the previous slide.
• The subset of states we look at is the set of states we encounter during our rollouts.
• We approximate the value of these states by measuring our total reward over the course of the rollout.
• We balance exploration and exploitation by sometimes randomly selecting our action.

(demo)
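A sketch of a single rollout, reusing choose_action from the previous sketch and assuming the environment exposes a step(s, a) function that returns the next state and the reward received (again, these interface names are illustrative):

def rollout(policy, start, actions, step, epsilon=0.1, horizon=100):
    """Simulate one episode from start, choosing actions
    epsilon-greedily; return the states visited and the total reward,
    which serves as our estimate of how good those states are."""
    s, total_reward, visited = start, 0, [start]
    for _ in range(horizon):
        a = choose_action(policy, s, actions, epsilon)
        s, reward = step(s, a)    # the environment tells us what happened
        visited.append(s)
        total_reward += reward
    return visited, total_reward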

A Parting Thought

• We’ve talked about three of the projects in this class, and how machine learning applies to them.

• What about the last one? How might machine learning apply to the Scheme project?