
Page 1: Policies and exploration and eligibility, oh my!


Page 2: Administrivia

•Reminder: R3 due Thurs

•Anybody not have a group?

•Reminder: final project

•Written report: due Thu, May 11, by noon

•Oral reports: Apr 27, May 2, May 4

•15 people registered ⇒ 5 pres/session ⇒ 15 min/pres

•Volunteers?

Page 3: River of time

•Last time:

•The Q function

•The Q-learning algorithm

•Q-learning in action

•Today:

•Notes on writing & presenting

•Action selection & exploration

•The off-policy property

•Use of experience; eligibility traces

•Radioactive breadcrumbs

Page 4: Final project writing FAQ

•Q: How formal a document should this be?

•A: Very formal. This should be as close in style to the papers we have read as possible.

•Pay attention to the sections that they have -- introduction, background, approach, experiments, etc.

•Try to establish a narrative -- “tell the story”

•As always, use correct grammar, spelling, etc.

Page 5: Final project writing FAQ

•Q: How long should the final report be?

•A: As long as necessary, but no longer.

•A’: I would guess that it would take ~10-15 pages to describe your work well.

Page 6: Final project writing FAQ

•Q: Any particular document format?

•A: I prefer:

•8.5”x11” paper

•1” margins

•12pt font

•double-spaced

•In LaTeX: \renewcommand{\baselinestretch}{1.6} (see the preamble sketch below)

•Stapled!
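For reference, a minimal LaTeX preamble matching these preferences might look like the sketch below; the geometry package is one common way to get the margins, and this exact preamble is a suggestion rather than a course requirement:

\documentclass[12pt,letterpaper]{article}   % 8.5"x11" paper, 12pt font
\usepackage[margin=1in]{geometry}           % 1" margins
\renewcommand{\baselinestretch}{1.6}        % roughly double-spaced

\begin{document}
% report text goes here
\end{document}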

Page 7: Final project writing FAQ

•Q: Any other tips?

•A: Yes:

•DON’T BE VAGUE -- be as specific and concrete as possible about what you did/what other people did/etc.

Page 8: The Q-learning algorithm

Algorithm: Q_learn
Inputs: State space S; Act. space A
Discount γ (0<=γ<1); Learning rate α (0<=α<1)
Outputs: Q
Repeat {
  s=get_current_world_state()
  a=pick_next_action(Q,s)
  (r,s’)=act_in_world(a)
  Q(s,a)=Q(s,a)+α*(r+γ*max_a’(Q(s’,a’))-Q(s,a))
} Until (bored)
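Below is a minimal Python sketch of this loop for a tabular problem. The env object (reset() and step(a) returning (next_state, reward, done), Gym-style), the actions list, and the ε-greedy pick_next_action are illustrative assumptions rather than anything specified on the slide:

import random
from collections import defaultdict

def q_learn(env, actions, alpha=0.1, gamma=0.9, epsilon=0.1, steps=10000):
    """Tabular Q-learning sketch; Q is a dict keyed by (state, action)."""
    Q = defaultdict(float)

    def pick_next_action(s):
        # epsilon-greedy w.r.t. the current Q estimates (see the exploration slides)
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(s, a)])

    s = env.reset()
    for _ in range(steps):
        a = pick_next_action(s)
        s2, r, done = env.step(a)                        # (r, s') = act_in_world(a)
        if done:
            target = r                                   # no bootstrap past a terminal state
        else:
            target = r + gamma * max(Q[(s2, a2)] for a2 in actions)
        Q[(s, a)] += alpha * (target - Q[(s, a)])        # the one-step Q backup
        s = env.reset() if done else s2
    return Q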

Page 9: Why does this work?

•Still... Why should that weighted avg be the right thing?

•Compare w/ Bellman eqn:
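For reference, the Bellman optimality equation takes a full expectation over the transition model T, while the Q-learning update uses a single sampled transition:

Q*(s,a) = Σ_s’ T(s,a,s’) * [ r(s,a,s’) + γ*max_a’ Q*(s’,a’) ]

Q(s,a) ← Q(s,a) + α*( r + γ*max_a’ Q(s’,a’) - Q(s,a) )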

Page 10: Why does this work?

•Still... Why should that weighted avg be the right thing?

•Compare w/ Bellman eqn...

•I.e., the update is based on a sample of the true distribution, T, rather than the full expectation that is used in the Bellman eqn/policy iteration alg

•First time the agent finds a rewarding state s_r, a fraction α of that reward will be propagated back by one step via the Q update to s_{r-1}, a state one step away from s_r

•Next time, the state two steps away from s_r will be updated, and so on...
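As a concrete instance of that backup, assume Q starts at zero everywhere and the only nonzero reward r arrives on the transition into s_r:

First such visit:  Q(s_{r-1},a) ← 0 + α*( r + γ*0 - 0 ) = α*r
A later visit:     Q(s_{r-2},a) ← 0 + α*( 0 + γ*(α*r) - 0 ) = α²*γ*r

So the value signal crawls backward roughly one state per pass, which is part of why convergence is slow.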

Page 11: Picking the action

•One critical step underspecified in Q learn alg:

•a=pick_next_action(Q,s)

•How should you pick an action at each step?

Page 12: Picking the action

•One critical step underspecified in Q learn alg:

•a=pick_next_action(Q,s)

•How should you pick an action at each step?

•Could pick greedily according to Q

•Might tend to keep doing the same thing and not explore at all. Need to force exploration.

Page 13: Picking the action

•One critical step underspecified in Q learn alg:

•a=pick_next_action(Q,s)

•How should you pick an action at each step?

•Could pick greedily according to Q

•Might tend to keep doing the same thing and not explore at all. Need to force exploration.

•Could pick an action at random

•Ignores everything you’ve learned about Q so far

•Would you still converge?

Page 14: Off-policy learning

•Exploit a critical property of the Q learn alg:

•Lemma (w/o proof): The Q learning algorithm will converge to the correct Q* independently of the policy being executed, so long as:

•Every (s,a) pair is visited infinitely often in the infinite limit

•α is chosen to be small enough (usually decayed)
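The "small enough (usually decayed)" condition on α is usually stated precisely as the Robbins-Monro conditions on the per-pair step sizes (the form used in standard convergence proofs such as Watkins & Dayan's):

Σ_t α_t(s,a) = ∞    and    Σ_t α_t(s,a)² < ∞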

Page 15: Off-policy learning

•I.e., Q learning doesn’t care what policy is being executed -- will still converge

•Called an off-policy method: the policy being learned can be diff than the policy being executed

•Off-policy property tells us: we’re free to pick any policy we like to explore, so long as we guarantee infinite visits to each (s,a) pair

•Might as well choose one that does (mostly) as well as we know how to do at each step

Page 16: “Almost greedy” exploring

•Can’t be just greedy w.r.t. Q (why?)

•Typical answers:

•ε-greedy: execute argmax_a{Q(s,a)} w/ prob (1-ε) and a random action w/ prob ε

•Boltzmann exploration: pick action a w/ prob:
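In its standard form, Boltzmann (softmax) exploration picks a with probability proportional to exp(Q(s,a)/τ), where the temperature τ controls how greedy the choice is:

P(a|s) = exp(Q(s,a)/τ) / Σ_a’ exp(Q(s,a’)/τ)

A small Python sketch of pick_next_action covering both strategies; the dict-of-(s,a) Q representation, the actions list, and the parameter defaults are illustrative assumptions:

import math
import random

def pick_next_action(Q, s, actions, epsilon=0.1, tau=None):
    """Epsilon-greedy by default; Boltzmann exploration when a temperature tau is given."""
    if tau is not None:
        # Boltzmann/softmax over the Q-values available at s
        weights = [math.exp(Q[(s, a)] / tau) for a in actions]
        return random.choices(actions, weights=weights, k=1)[0]
    if random.random() < epsilon:
        return random.choice(actions)                 # explore
    return max(actions, key=lambda a: Q[(s, a)])      # exploit greedily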

Page 17: The value of experience

•We observed that Q learning converges slooooooowly...

•Same is true of many other RL algs

•But we can do better (sometimes by orders of magnitude)

•What’re the biggest hurdles to Q convergence?

Page 18: The value of experience

•We observed that Q learning converges slooooooowly...

•Same is true of many other RL algs

•But we can do better (sometimes by orders of magnitude)

•What’re the biggest hurdles to Q convergence?

•Well, there are many

•Big one, though, is: poor use of experience

•Each timestep only changes one Q(s,a) value

•Takes many steps to “back up” experience very far

Page 19: That eligible state

•Basic problem: Every step, Q only does a one-step backup

•Forgot where it was before that

•No sense of the sequence of state/actions that got it where it is now

•Want to have a long-term memory of where the agent has been; update the Q values for all of them

Page 20: That eligible state

•Want to have a long-term memory of where the agent has been; update the Q values for all of them

•Idea called eligibility traces:

•Have a memory cell for each state/action pair

•Set memory when visit that state/action

•Each step, update all eligible states
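Concretely, in the accumulating-trace form (as in Sutton & Barto) the memory cell and the update look like this, where δ is the usual one-step TD error; the "radioactive breadcrumbs" slides later spell out the same steps:

e(s,a) ← λγ*e(s,a) + 1   for the pair just visited
e(s,a) ← λγ*e(s,a)       for every other pair
Q(s,a) ← Q(s,a) + α*δ*e(s,a)   for all (s,a)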

Page 21: Retrenching from Q

•Can integrate eligibility traces w/ Q-learning

•But it’s a bit of a pain

•Need to track when agent is “on policy” or “off policy”, etc.

•Good discussion in Sutton & Barto

Page 22: Retrenching from Q

•We’ll focus on a (slightly) simpler learning alg:

•SARSA learning

•V. similar to Q learning

•Strictly on policy: only learns about policy it’s actually executing

•E.g., learns Q^π (the value of the policy it’s actually executing) instead of Q*

Page 23: The Q-learning algorithm

Algorithm: Q_learn
Inputs: State space S; Act. space A
Discount γ (0<=γ<1); Learning rate α (0<=α<1)
Outputs: Q
Repeat {
  s=get_current_world_state()
  a=pick_next_action(Q,s)
  (r,s’)=act_in_world(a)
  Q(s,a)=Q(s,a)+α*(r+γ*max_a’(Q(s’,a’))-Q(s,a))
} Until (bored)

Page 24: SARSA-learning algorithm

Algorithm: SARSA_learn
Inputs: State space S; Act. space A
Discount γ (0<=γ<1); Learning rate α (0<=α<1)
Outputs: Q
s=get_current_world_state()
a=pick_next_action(Q,s)
Repeat {
  (r,s’)=act_in_world(a)
  a’=pick_next_action(Q,s’)
  Q(s,a)=Q(s,a)+α*(r+γ*Q(s’,a’)-Q(s,a))
  a=a’; s=s’;
} Until (bored)
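A Python sketch in the same style as the q_learn sketch above; the one substantive change is that the target bootstraps from the action actually picked for s’ rather than from the max (same assumed env interface and illustrative names as before):

import random
from collections import defaultdict

def sarsa_learn(env, actions, alpha=0.1, gamma=0.9, epsilon=0.1, steps=10000):
    """Tabular SARSA sketch; on-policy counterpart of the q_learn sketch above."""
    Q = defaultdict(float)

    def pick_next_action(s):
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(s, a)])

    s = env.reset()
    a = pick_next_action(s)
    for _ in range(steps):
        s2, r, done = env.step(a)
        a2 = pick_next_action(s2)                        # a' = pick_next_action(Q, s')
        target = r if done else r + gamma * Q[(s2, a2)]  # bootstraps from the action taken, not the max
        Q[(s, a)] += alpha * (target - Q[(s, a)])
        if done:
            s = env.reset()
            a = pick_next_action(s)
        else:
            s, a = s2, a2
    return Q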

Page 25: SARSA vs. Q

•SARSA and Q-learning very similar

•SARSA updates Q(s,a) for the policy it’s actually executing

•Lets the pick_next_action() function pick action to update

•Q updates Q(s,a) for greedy policy w.r.t. current Q

•Uses max_a to pick action to update

•might be diff than the action it executes at s’

•In practice: Q will learn the “true” π*, but SARSA will learn about what it’s actually doing

•Exploration can get Q-learning in trouble...
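Side by side, the only difference between the two updates is the backup target; a’ below is the action actually chosen at s’:

Q-learning: Q(s,a) ← Q(s,a) + α*( r + γ*max_a’ Q(s’,a’) - Q(s,a) )
SARSA:      Q(s,a) ← Q(s,a) + α*( r + γ*Q(s’,a’) - Q(s,a) )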

Page 26: Getting Q in trouble...

“Cliff walking” example (Sutton & Barto, Sec. 6.5)

Page 27: Getting Q in trouble...

“Cliff walking” example, cont’d (Sutton & Barto, Sec. 6.5): with ε-greedy exploration, Q-learning’s greedy route along the cliff edge means the agent occasionally steps off while exploring, while on-policy SARSA learns the longer, safer route and collects more reward during learning

Page 28: Radioactive breadcrumbs

•Can now define eligibility traces for SARSA

•In addition to Q(s,a) table, keep an e(s,a) table

•Records “eligibility” (real number) for each state/action pair

•At every step ((s,a,r,s’,a’) tuple):

•Increment e(s,a) for current (s,a) pair by 1

•Update all Q(s’’,a’’) vals in proportion to their e(s’’,a’’)

•Decay all e(s’’,a’’) by factor of λγ

•Leslie Kaelbling calls this the “radioactive breadcrumbs” form of RL

Page 29: SARSA(λ)-learning alg.

Algorithm: SARSA(λ)_learn
Inputs: S, A, γ (0<=γ<1), α (0<=α<1), λ (0<=λ<1)
Outputs: Q
e(s,a)=0 // for all s, a
s=get_curr_world_st(); a=pick_nxt_act(Q,s)
Repeat {
  (r,s’)=act_in_world(a)
  a’=pick_next_action(Q,s’)
  δ=r+γ*Q(s’,a’)-Q(s,a)
  e(s,a)+=1
  foreach (s’’,a’’) pair in (S×A) {
    Q(s’’,a’’)=Q(s’’,a’’)+α*e(s’’,a’’)*δ
    e(s’’,a’’)*=λγ
  }
  a=a’; s=s’;
} Until (bored)
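And a Python sketch of SARSA(λ) under the same assumed env interface; it loops over the pairs with nonzero trace rather than all of S×A (equivalent, since untouched pairs have e=0), and it clears traces at episode boundaries, a common convention that the slide's continuing formulation doesn't need:

import random
from collections import defaultdict

def sarsa_lambda_learn(env, actions, alpha=0.1, gamma=0.9, lam=0.9,
                       epsilon=0.1, steps=10000):
    """Tabular SARSA(lambda) sketch with accumulating eligibility traces."""
    Q = defaultdict(float)   # Q[(s, a)]
    e = defaultdict(float)   # eligibility trace e[(s, a)]

    def pick_next_action(s):
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(s, a)])

    s = env.reset()
    a = pick_next_action(s)
    for _ in range(steps):
        s2, r, done = env.step(a)
        a2 = pick_next_action(s2)
        delta = (r if done else r + gamma * Q[(s2, a2)]) - Q[(s, a)]
        e[(s, a)] += 1.0                  # drop a breadcrumb on the current pair
        for sa in list(e):                # update every eligible pair ...
            Q[sa] += alpha * e[sa] * delta
            e[sa] *= lam * gamma          # ... then let its breadcrumb decay
        if done:
            e.clear()                     # convention: traces don't cross episodes
            s = env.reset()
            a = pick_next_action(s)
        else:
            s, a = s2, a2
    return Q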