NIPS 2007 Workshop
Welcome!
Hierarchical organization of behavior
• Thank you for coming
• Apologies to the skiers…
• Why we will be strict about timing
• Why we want the workshop to be interactive
RL: Decision making
• Rewards/punishments may be delayed
• Outcomes may depend on sequence of actions
→ Credit assignment problem
Goal: maximize reward (minimize punishment)
RL in a nutshell: formalization
states - actions - transitions - rewards - policy - long-term values
Components of an RL task
Policy: π(S,a)
State values: V(S)
State-action values: Q(S,a)
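To make these components concrete, here is a minimal Python sketch of the toy task used in the figures that follow (states S1-S3, actions L/R, leaf rewards 4, 0, 2, 2); the dictionary encoding is an assumption, not code from the talk.

```python
# Toy decision tree from the slides: from S1, L leads to S2 and R to S3;
# the four leaf actions pay 4, 0, 2, 2. Encoding is an assumption.

states = ["S1", "S2", "S3"]
actions = ["L", "R"]

# Transitions: T[state][action] -> next state (None = episode ends)
T = {
    "S1": {"L": "S2", "R": "S3"},
    "S2": {"L": None, "R": None},
    "S3": {"L": None, "R": None},
}

# Rewards: R[state][action] -> immediate reward
R = {
    "S1": {"L": 0, "R": 0},
    "S2": {"L": 4, "R": 0},
    "S3": {"L": 2, "R": 2},
}

# A policy maps each state to action probabilities, e.g. uniformly random:
policy = {s: {a: 0.5 for a in actions} for s in states}
```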
[Figure: toy decision tree. From S1, action L leads to S2 and R to S3; the four leaf actions pay 4, 0, 2, 2]
RL in a nutshell: forward search
[Figure: forward search over the same tree: S1 -(L)-> S2, S1 -(R)-> S3; leaf values L = 4, R = 0 from S2 and L = 2, R = 2 from S3]
Model-based RL
• learn model through experience (cognitive map)
• choosing actions is hard
• goal-directed behavior; cortical
Model = T(ransitions) and R(ewards)
Trick #1: Long-term values are recursive
Q(S,a) = r(S,a) + V(S_next)
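This recursion is what a model-based agent exploits in forward search. A minimal sketch, reusing the T and R tables defined above (the function names are illustrative):

```python
# Model-based forward search: compute long-term values by recursing on
# Q(S,a) = r(S,a) + V(S_next), using the learned model (T, R).

def V(state):
    """Value of a state = value of its best action; terminal states are worth 0."""
    if state is None:
        return 0
    return max(Q(state, a) for a in ["L", "R"])

def Q(state, action):
    """Q(S,a) = immediate reward + value of the successor state."""
    return R[state][action] + V(T[state][action])

print(Q("S1", "L"))  # 0 + max(4, 0) = 4
print(Q("S1", "R"))  # 0 + max(2, 2) = 2
```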
RL in a nutshell: cached values
Model-free RL
temporal difference learning
Q(S,a) = r(S,a) + max_a′ Q(S′,a′)
TD learning: start with initial (wrong) Q(S,a)
PE = r(S,a) + max_a′ Q(S′,a′) − Q(S,a)
Q(S,a)_new = Q(S,a)_old + PE
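A sketch of this update loop on the toy task above. The learning rate ALPHA and the random exploration policy are added assumptions (the slide writes the update without a step size):

```python
import random

ALPHA = 0.1  # learning rate (assumption; not on the slide)
Qvals = {(s, a): 0.0 for s in ["S1", "S2", "S3"] for a in ["L", "R"]}

def max_q(state):
    """max_a' Q(S',a'); terminal states contribute nothing."""
    if state is None:
        return 0.0
    return max(Qvals[(state, a)] for a in ["L", "R"])

for episode in range(1000):
    s = "S1"
    while s is not None:
        a = random.choice(["L", "R"])            # explore at random
        s_next, r = T[s][a], R[s][a]             # one step of experience
        pe = r + max_q(s_next) - Qvals[(s, a)]   # prediction error
        Qvals[(s, a)] += ALPHA * pe              # Q_new = Q_old + alpha * PE
        s = s_next

# Converges to the cached values tabulated below: Q(S1,L)=4, Q(S1,R)=2, ...
```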
RL in a nutshell: cached values
Model-free RL
• choosing actions is easy (but needs lots of practice to learn)
• habitual behavior; basal ganglia
temporal difference learning
Trick #2: Can learn values without a model
Q(S1,L) = 4
Q(S1,R) = 2
Q(S2,L) = 4
Q(S2,R) = 0
Q(S3,L) = 2
Q(S3,R) = 2
RL in real-world tasks…
model-based vs. model-free learning and control
[Summary figure: the model-free cached Q-table (as above) side by side with the model-based search tree for the same task]
Scaling problem!
Hierarchical RL: What is it?
Real-world behavior is hierarchical
1. set water temp
2. get wet
3. shampoo
4. soap
5. turn off water
6. dry off
[Flowchart for 'set water temp': too cold -> add hot; too hot -> add cold; just right -> success; after each change, wait 5 sec and re-check]
simplified control, disambiguation, encapsulation
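As a sketch, the 'set water temp' step can be written as a closed-loop subroutine that runs until its subgoal is reached. The sensor/effector helpers (water_temperature, add_hot, add_cold) and the target band are hypothetical placeholders, not from the slides:

```python
import time

def set_water_temp(water_temperature, add_hot, add_cold, target=(38, 41)):
    """Run until the subgoal (temperature inside the target band) is met."""
    low, high = target  # hypothetical comfort band in Celsius
    while True:
        t = water_temperature()
        if t < low:
            add_hot()            # too cold -> add hot
        elif t > high:
            add_cold()           # too hot -> add cold
        else:
            return "success"     # just right -> the subroutine terminates
        time.sleep(5)            # after a change, wait 5 sec and re-check
```

Encapsulating the loop this way is the 'simplified control' point: the caller invokes one temporally extended action and never sees the inner steps.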
1. pour coffee
2. add sugar
3. add milk
4. stir
HRL: (in)formal framework
Termination condition = (sub)goal state
Option policy learning: via pseudo-reward (model-based or model-free)
options - skills - macros - temporally abstract actions
(Sutton, McGovern, Dietterich, Barto, Precup, Singh, Parr…)
Option: set water temperature
[Figure: the option's components: initiation set {S1, S2, S8, …}; option policy as per-state action probabilities (S1: 0.8/0.1/0.1, S2: 0.1/0.1/0.8, S3: 0/1/0); termination conditions S1 (0.1), S2 (0.1), S3 (0.9)]
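A sketch of this triple as a data structure, following the options framework. The termination probabilities mirror the figure; the action names and the column order of the policy table are assumptions:

```python
from dataclasses import dataclass

@dataclass
class Option:
    initiation_set: set   # states where the option may be invoked
    policy: dict          # state -> {action: probability}
    termination: dict     # state -> probability of terminating there

set_water_temperature = Option(
    initiation_set={"S1", "S2", "S8"},
    policy={   # action labels are assumptions; probabilities from the figure
        "S1": {"add_hot": 0.8, "add_cold": 0.1, "wait": 0.1},
        "S2": {"add_hot": 0.1, "add_cold": 0.1, "wait": 0.8},
        "S3": {"add_hot": 0.0, "add_cold": 1.0, "wait": 0.0},
    },
    termination={"S1": 0.1, "S2": 0.1, "S3": 0.9},
)
```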
HRL: a toy example
S: start, G: goal
Options: going to doors
Actions: primitive moves + 2 door options
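With door options in the action set, values can be learned at the level of whole subroutines. Here is a sketch of the standard SMDP-style Q-update for an option that ran k steps and collected discounted reward r_total; the names, signature, and discount GAMMA are assumptions rather than slide content:

```python
GAMMA = 0.95  # discount factor (assumption)
ALPHA = 0.1   # learning rate (assumption)

def smdp_q_update(Q, s, option, r_total, k, s_next, available):
    """Update Q[(s, option)] after the option, started in s, terminated in
    s_next after k primitive steps with cumulative discounted reward r_total."""
    backup = 0.0 if s_next is None else max(Q[(s_next, o)] for o in available)
    pe = r_total + (GAMMA ** k) * backup - Q[(s, option)]   # prediction error
    Q[(s, option)] += ALPHA * pe
```

Because one update covers many primitive steps, fewer updates are needed to propagate value through the maze, which is the 'faster learning' advantage on the next slide.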
Advantages of HRL
1. Faster learning (mitigates scaling problem)
RL: no longer ‘tabula rasa’
2. Transfer of knowledge from previous tasks (generalization, shaping)
Disadvantages (or: the cost) of HRL
1. Need ‘right’ options - how to learn them?
2. Suboptimal behavior (“negative transfer”; habits)
3. More complex learning/control structure
no free lunches…