RL 2: It’s 2:00 AM. Do you know where your mouse is?

Page 1: RL 2: It’s 2:00 AM. Do you know where your mouse is?

Page 2: First up: Vote!

• Albuquerque Municipal Election today (Oct 4)

• Not all of you are eligible to vote, I know...

• ... But if you are, you should.

• Educate yourself first!

• Mayor

• City councilors

• Bonds (what will ABQ spend its money on?)

• Propositions (election finance, min wage, voter ID)

• Polls close at 7:00 PM today...

Page 3: Voting resources

• City of Albuquerque web site: www.cabq.gov

• League of Women Voters web site: http://www.lwvabc.org/elections/2005VG_English.html

Page 4: News o’ the day

• Wall Street Journal reports: “Microsoft Windows Officially Broken”

• In 2004, development of MS Longhorn (the successor to XP) bogged down

• Whole code base had to be scrapped & started afresh ⇒ Vista

• Point: not MS bashing (much)

• Importance of software process

• MS moved to a more agile process for Vista

• Test first

• Rigorous regression testing

• Better coding infrastructure

Page 5: Administrivia

• Grading: P1 rollout grading is finished

• I will send grade reports this afternoon & tomorrow morning

• Prof Lane out of town Oct 11

• Andree Jacobsen will cover

• Stefano Markidis out of town Oct 19

• Will announce new office hours presently

Page 6: Your place in History

• Last time:

• Q2

• Introduction to Reinforcement Learning (RL)

Page 7: Your place in History

• This time:

✓ P2M1 due

✓ Voting

✓ News

✓ Administrivia

✓ Q&A

• More on RL

• Design exercise: WorldSimulator and Terrains

Page 8: Recall: Mack & his maze

• Mack lives a hard life as a psychology test subject

• Has to run around mazes all day, finding food and avoiding electric shocks

• Needs to know how to find cheese quickly, while getting shocked as little as possible

• Q: How can Mack learn to find his way around?


Page 9: Reward over time

[Figure: diagram of a state space, states s1 through s11]

Page 10: Reward over time

[Figure: the same state diagram, with one trajectory through it highlighted]

V(s1) = R(s1) + R(s4) + R(s11) + R(s10) + ...

Page 11: Reward over time

[Figure: the same state diagram, with a different trajectory highlighted]

V(s1) = R(s1) + R(s2) + R(s6) + ...
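The pattern in these two equations is just “add up the rewards of the states you visit, in order.” As a minimal sketch (the reward values below are made up for illustration; the slides give no numbers), that computation looks like this in code:

    import java.util.List;
    import java.util.Map;

    public class TrajectoryValue {
        // Hypothetical reward table R(s); values are invented.
        static final Map<String, Double> R = Map.of(
                "s1", 0.0, "s2", -1.0, "s4", -1.0, "s6", 0.0,
                "s10", 10.0, "s11", 0.0);

        // Value of one concrete trajectory: the sum of R(s) over the
        // states visited, exactly as in the V(s1) equations above.
        static double value(List<String> trajectory) {
            double v = 0.0;
            for (String s : trajectory) {
                v += R.get(s);
            }
            return v;
        }

        public static void main(String[] args) {
            // Page 10's trajectory: s1 -> s4 -> s11 -> s10
            System.out.println(value(List.of("s1", "s4", "s11", "s10"))); // 9.0
        }
    }

Different trajectories from the same start state give different sums, which is why value will shortly be tied to a policy.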

Page 12: Where can you go?

• Definition: Complete set of all states the agent could be in is called the state space: S

• Could be discrete or continuous

• For Project 2: states are discrete

• Q: what is the state space for P2?

• Size of state space: |S|

• Q: How big is the state space for P2?

Page 13: Where can you go?

• Definition: Complete set of actions an agent could take is called the action space: A

• Again, discrete or continuous

• Again, P2: A is discrete

• Q: What is A for P2? Size?

• Again, size: |A| (a toy example of discrete S and A follows)
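To make “discrete state space / action space” concrete, here is a toy sketch in Java. These names (GridState, GridAction) are illustrative assumptions, not the actual Project 2 classes:

    // Toy discrete action space: |A| = 4.
    enum GridAction { NORTH, SOUTH, EAST, WEST }

    // Toy discrete state: the agent's (x, y) cell. For a w-by-h grid
    // this gives |S| = w * h, times any other variables the world tracks.
    record GridState(int x, int y) { }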

Page 14: What is it worth to you?

• Idea of “good” and “bad” places to go

• Quantified as “rewards”

• (This is where the term “reinforcement learning” comes from. Originated in psychology.)

• Formally: R : S → Reals

• R(s) == reward for getting to state s

• How good or bad it is to reach state s

• Larger (more positive) is better

• Agent “wants” to get more positive reward

Page 15: How does it happen?

• Dynamics of the agent are defined by the transition function

• T: S x A x S → [0,1]

• T(s,a,s’) == Pr[next state is s’ | current state is s, action a]

• Examples from P2?

Page 16: How does it happen?

• Dynamics of the agent are defined by the transition function

• T: S x A x S → [0,1]

• T(s,a,s’) == Pr[next state is s’ | current state is s, action a]

• Examples from P2?

• In practice: Don’t write T down explicitly. Encoded by WorldSimulator and Terrain/agent interactions. (A sketch of that idea follows.)
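One way to read “encoded, not written down”: the simulator acts as a generative model of T, sampling a next state rather than storing probabilities. A hedged sketch reusing the toy GridState/GridAction types from Page 13; the dynamics here are invented, and the real WorldSimulator’s design is exactly P2’s exercise:

    import java.util.Random;

    // Generative view of T(s, a, s'): instead of tabulating
    // Pr[s' | s, a], sample s' directly. Illustrative dynamics:
    // the action succeeds with probability 0.8, else the agent
    // stays put.
    class WorldSimulatorSketch {
        private final Random rng = new Random();

        GridState act(GridState s, GridAction a) {
            if (rng.nextDouble() < 0.8) {
                return switch (a) {
                    case NORTH -> new GridState(s.x(), s.y() + 1);
                    case SOUTH -> new GridState(s.x(), s.y() - 1);
                    case EAST  -> new GridState(s.x() + 1, s.y());
                    case WEST  -> new GridState(s.x() - 1, s.y());
                };
            }
            return s; // slip: state unchanged
        }
    }

Here T((x,y), NORTH, (x,y+1)) = 0.8 and T((x,y), NORTH, (x,y)) = 0.2, but neither number is stored anywhere; that is the sense in which the simulator encodes T.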

Page 17: The MDP

• Entire RL environment defined by a Markov decision process:

• M = ⟨S, A, T, R⟩

• S: state space

• A: action space

• T: transition function

• R: reward function

• Q: What modules represent these in P2?
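As a hedged sketch only (the question above, which P2 modules play these roles, is yours to answer), the four pieces of M can be grouped behind a single interface:

    import java.util.Set;

    // Illustrative grouping of M = <S, A, T, R>. states()/actions()
    // enumerate S and A (discrete case); step() samples from T;
    // reward() is R.
    interface MDP<S, A> {
        Set<S> states();
        Set<A> actions();
        S step(S s, A a);       // sample s' ~ T(s, a, .)
        double reward(S s);     // R(s)
    }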

Page 18: Policies

• Total accumulated reward (value, V) depends on

• Where agent starts

• What agent does at each step (duh)

Page 19: Policies

• Total accumulated reward (value, V) depends on

• Where agent starts

• What agent does at each step (duh)

• Plan of action is called a policy, π

• Policy defines what action to take in every state of the system: π : S → A
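Since π : S → A over a discrete state space, the simplest concrete representation is a lookup table. A small sketch with the toy types from Page 13 (hypothetical names):

    import java.util.Map;

    // A deterministic policy pi: S -> A, stored as a table.
    class TabularPolicy {
        private final Map<GridState, GridAction> table;

        TabularPolicy(Map<GridState, GridAction> table) {
            this.table = table;
        }

        GridAction action(GridState s) {
            return table.get(s);    // pi(s)
        }
    }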

Page 20: Experience & histories

• Fundamental unit of experience in RL:

• At time t, in some state si, take action aj, get reward rt, end up in state sk

• Called an experience tuple or SARSA tuple

• Set of all experience during a single episode up to time T is a history or trajectory: the sequence of tuples ⟨st, at, rt, st+1⟩ for t = 1, ..., T
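An experience tuple is pure data, so a sketch is short (again using the toy Page 13 types; “Experience” is an assumed name):

    // One experience tuple <s, a, r, s'> from a single time step.
    record Experience(GridState s, GridAction a, double r, GridState sNext) { }

    // A history (trajectory) is then just the episode's tuples in
    // order, e.g. java.util.List<Experience> history.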

Page 21: How good is a policy?

• Value is a function of start state and policy: Vπ(s1)

• Value measures:

• How good is policy π, averaged over all time, if agent starts at state s1 and runs forever?
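One concrete, if naive, way to read Vπ(s1): run π from s1 in the simulator and accumulate reward. A sketch under the earlier assumed types, truncating “runs forever” at a finite horizon T (an approximation the slides don’t specify):

    import java.util.function.ToDoubleFunction;

    class PolicyEvaluation {
        // Estimate V^pi(s1) by rolling the policy out in the
        // simulator for T steps and summing the rewards collected.
        static double rollout(WorldSimulatorSketch world, TabularPolicy pi,
                              ToDoubleFunction<GridState> R,
                              GridState s1, int T) {
            GridState s = s1;
            double v = 0.0;
            for (int t = 0; t < T; t++) {
                v += R.applyAsDouble(s);          // collect R(s_t)
                s = world.act(s, pi.action(s));   // s_{t+1} ~ T(s_t, pi(s_t), .)
            }
            return v;
        }
    }

Because the transitions are stochastic, a single rollout is noisy; averaging many rollouts gives a better estimate.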

Page 22: The goal of RL

• Agent’s goal:

• Find the best possible policy: π*

• Find the policy π* that maximizes Vπ(s) for all s

Page 23: Design Exercise: WorldSimulator & Friends

Page 24: Design exercise

• Q1:

• Design the act() method in WorldSimulator

• What objects does it need to access?

• How can it take different terrains/agents into account?

• Q2:

• GridWorld2d<T> could be really large

• Most of the terrain tiles are the same everywhere

• How can you avoid millions of copies of the same tile? (One classic answer is sketched below.)
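As a hedged starting point for Q2, not the official solution: the Flyweight pattern keeps exactly one shared, immutable object per terrain type, and the grid stores references to it. The names here (Terrain, TerrainFactory) are assumptions, not P2’s actual classes:

    import java.util.HashMap;
    import java.util.Map;

    // Flyweight: one immutable, shared instance per terrain type.
    // A million grass cells then cost one Terrain object plus the
    // reference array GridWorld2d<Terrain> needs anyway.
    record Terrain(String name, double movementCost) { }

    final class TerrainFactory {
        private static final Map<String, Terrain> cache = new HashMap<>();

        // Return the shared instance, creating it on first request.
        static Terrain of(String name, double movementCost) {
            return cache.computeIfAbsent(name,
                    n -> new Terrain(n, movementCost));
        }
    }

    // Usage sketch: every grass cell points at the same object.
    //   Terrain grass = TerrainFactory.of("grass", 1.0);
    //   grid.set(x, y, grass);

This only works because the shared tile is immutable; anything per-cell (an agent standing on the tile, say) has to live in the grid or the agent, not in the Terrain. For Q1, that suggests act() needs at least the grid, the acting agent’s state, and the terrain at the destination cell.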