NIPS 2007 Workshop
Welcome!
Hierarchical organization of behavior
• Thank you for coming
• Apologies to the skiers…
• Why we will be strict about timing
• Why we want the workshop to be interactive
RL: Decision making
• Rewards/punishments may be delayed
• Outcomes may depend on sequence of actions
→ Credit assignment problem
Goal: maximize reward (minimize punishment)
RL in a nutshell: formalization
states - actions - transitions - rewards - policy - long-term values
Components of an RL task
Policy: π(S,a)
State values: V(S)
State-action values: Q(S,a)
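To make these components concrete, here is a minimal Python sketch of the toy task used in the figures that follow (states S1-S3, actions L/R, leaf rewards 4, 0, 2, 2); the dictionary encoding is an assumption, not code from the talk.

```python
# Toy decision tree from the slides: from S1, L leads to S2 and R to S3;
# the four leaf actions pay 4, 0, 2, 2. Encoding is an assumption.

states = ["S1", "S2", "S3"]
actions = ["L", "R"]

# Transitions: T[state][action] -> next state (None = episode ends)
T = {
    "S1": {"L": "S2", "R": "S3"},
    "S2": {"L": None, "R": None},
    "S3": {"L": None, "R": None},
}

# Rewards: R[state][action] -> immediate reward
R = {
    "S1": {"L": 0, "R": 0},
    "S2": {"L": 4, "R": 0},
    "S3": {"L": 2, "R": 2},
}

# A policy maps each state to action probabilities, e.g. uniformly random:
policy = {s: {a: 0.5 for a in actions} for s in states}
```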
[Figure: toy decision tree. From S1, action L leads to S2 and R to S3; the four leaf actions pay 4, 0, 2, 2]
RL in a nutshell: forward search
[Figure: forward search over the same tree: S1 -(L)-> S2, S1 -(R)-> S3; leaf values L = 4, R = 0 from S2 and L = 2, R = 2 from S3]
Model-based RL
• learn model through experience (cognitive map)
• choosing actions is hard
• goal-directed behavior; cortical
Model = T(ransitions) and R(ewards)
Trick #1: Long-term values are recursive
Q(S,a) = r(S,a) + V(S_next)
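This recursion is what a model-based agent exploits in forward search. A minimal sketch, reusing the T and R tables defined above (the function names are illustrative):

```python
# Model-based forward search: compute long-term values by recursing on
# Q(S,a) = r(S,a) + V(S_next), using the learned model (T, R).

def V(state):
    """Value of a state = value of its best action; terminal states are worth 0."""
    if state is None:
        return 0
    return max(Q(state, a) for a in ["L", "R"])

def Q(state, action):
    """Q(S,a) = immediate reward + value of the successor state."""
    return R[state][action] + V(T[state][action])

print(Q("S1", "L"))  # 0 + max(4, 0) = 4
print(Q("S1", "R"))  # 0 + max(2, 2) = 2
```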
RL in a nutshell: cached values
Model-free RL
temporal difference learning
Q(S,a) = r(S,a) + max_a′ Q(S′,a′)
TD learning: start with initial (wrong) Q(S,a)
PE = r(S,a) + max_a′ Q(S′,a′) − Q(S,a)
Q(S,a)_new = Q(S,a)_old + PE
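A sketch of this update loop on the toy task above. The learning rate ALPHA and the random exploration policy are added assumptions (the slide writes the update without a step size):

```python
import random

ALPHA = 0.1  # learning rate (assumption; not on the slide)
Qvals = {(s, a): 0.0 for s in ["S1", "S2", "S3"] for a in ["L", "R"]}

def max_q(state):
    """max_a' Q(S',a'); terminal states contribute nothing."""
    if state is None:
        return 0.0
    return max(Qvals[(state, a)] for a in ["L", "R"])

for episode in range(1000):
    s = "S1"
    while s is not None:
        a = random.choice(["L", "R"])            # explore at random
        s_next, r = T[s][a], R[s][a]             # one step of experience
        pe = r + max_q(s_next) - Qvals[(s, a)]   # prediction error
        Qvals[(s, a)] += ALPHA * pe              # Q_new = Q_old + alpha * PE
        s = s_next

# Converges to the cached values tabulated below: Q(S1,L)=4, Q(S1,R)=2, ...
```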
RL in a nutshell: cached values
Model-free RL
• choosing actions is easy (but needs lots of practice to learn)
• habitual behavior; basal ganglia
temporal difference learning
Trick #2: Can learn values without a model
Q(S1,L) = 4
Q(S1,R) = 2
Q(S2,L) = 4
Q(S2,R) = 0
Q(S3,L) = 2
Q(S3,R) = 2
RL in real-world tasks…
model-based vs. model-free learning and control
[Summary figure: the model-free cached Q-table (as above) side by side with the model-based search tree for the same task]
Scaling problem!
Hierarchical RL: What is it?
Real-world behavior is hierarchical
1. set water temp
2. get wet
3. shampoo
4. soap
5. turn off water
6. dry off
[Flowchart for 'set water temp': too cold -> add hot; too hot -> add cold; just right -> success; after each change, wait 5 sec and re-check]
simplified control, disambiguation, encapsulation
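As a sketch, the 'set water temp' step can be written as a closed-loop subroutine that runs until its subgoal is reached. The sensor/effector helpers (water_temperature, add_hot, add_cold) and the target band are hypothetical placeholders, not from the slides:

```python
import time

def set_water_temp(water_temperature, add_hot, add_cold, target=(38, 41)):
    """Run until the subgoal (temperature inside the target band) is met."""
    low, high = target  # hypothetical comfort band in Celsius
    while True:
        t = water_temperature()
        if t < low:
            add_hot()            # too cold -> add hot
        elif t > high:
            add_cold()           # too hot -> add cold
        else:
            return "success"     # just right -> the subroutine terminates
        time.sleep(5)            # after a change, wait 5 sec and re-check
```

Encapsulating the loop this way is the 'simplified control' point: the caller invokes one temporally extended action and never sees the inner steps.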
1. pour coffee
2. add sugar
3. add milk
4. stir
HRL: (in)formal framework
Termination condition = (sub)goal state
Option policy learning: via pseudo-reward (model-based or model-free)
options - skills - macros - temporally abstract actions
(Sutton, McGovern, Dietterich, Barto, Precup, Singh, Parr…)
Option: set water temperature
[Figure: the option's components: initiation set {S1, S2, S8, …}; option policy as per-state action probabilities (S1: 0.8/0.1/0.1, S2: 0.1/0.1/0.8, S3: 0/1/0); termination conditions S1 (0.1), S2 (0.1), S3 (0.9)]
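A sketch of this triple as a data structure, following the options framework. The termination probabilities mirror the figure; the action names and the column order of the policy table are assumptions:

```python
from dataclasses import dataclass

@dataclass
class Option:
    initiation_set: set   # states where the option may be invoked
    policy: dict          # state -> {action: probability}
    termination: dict     # state -> probability of terminating there

set_water_temperature = Option(
    initiation_set={"S1", "S2", "S8"},
    policy={   # action labels are assumptions; probabilities from the figure
        "S1": {"add_hot": 0.8, "add_cold": 0.1, "wait": 0.1},
        "S2": {"add_hot": 0.1, "add_cold": 0.1, "wait": 0.8},
        "S3": {"add_hot": 0.0, "add_cold": 1.0, "wait": 0.0},
    },
    termination={"S1": 0.1, "S2": 0.1, "S3": 0.9},
)
```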
HRL: a toy example
S: start, G: goal
Options: going to doors
Actions: primitive moves + 2 door options
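With door options in the action set, values can be learned at the level of whole subroutines. Here is a sketch of the standard SMDP-style Q-update for an option that ran k steps and collected discounted reward r_total; the names, signature, and discount GAMMA are assumptions rather than slide content:

```python
GAMMA = 0.95  # discount factor (assumption)
ALPHA = 0.1   # learning rate (assumption)

def smdp_q_update(Q, s, option, r_total, k, s_next, available):
    """Update Q[(s, option)] after the option, started in s, terminated in
    s_next after k primitive steps with cumulative discounted reward r_total."""
    backup = 0.0 if s_next is None else max(Q[(s_next, o)] for o in available)
    pe = r_total + (GAMMA ** k) * backup - Q[(s, option)]   # prediction error
    Q[(s, option)] += ALPHA * pe
```

Because one update covers many primitive steps, fewer updates are needed to propagate value through the maze, which is the 'faster learning' advantage on the next slide.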
Advantages of HRL
1. Faster learning (mitigates scaling problem)
RL: no longer ‘tabula rasa’
2. Transfer of knowledge from previous tasks (generalization, shaping)
Disadvantages (or: the cost) of HRL
1. Need ‘right’ options - how to learn them?
2. Suboptimal behavior (“negative transfer”; habits)
3. More complex learning/control structure
no free lunches…