rl 2 it’s 2:00 am. do you know where your mouse is?
Post on 21-Dec-2015
TRANSCRIPT
RL 2: It’s 2:00 AM. Do you know where your mouse is?
First up: Vote!
•Albuquerque Municipal Election today (Oct 4)
•Not all of you are eligible to vote, I know...
•... But if you are, you should.
•Educate yourself first!
•Mayor
•City councilors
•Bonds (what will ABQ spend its money on?)
•Propositions (election finance, min wage, voter ID)
•Polls close at 7:00 PM today...
Voting resources
•City of Albuquerque web site: www.cabq.gov
•League of Women Voters web site: http://www.lwvabc.org/elections/2005VG_English.html
News o’ the day
•Wall Street Journal reports: “Microsoft Windows Officially Broken”
•In 2004, MS Longhorn (successor to XP) bogged down
•Whole code base had to be scrapped & started afresh ⇒ Vista
•Point: not MS bashing (much)
•Importance of software process
•MS moved to a more agile process for Vista
•Test first
•Rigorous regression testing
•Better coding infrastructure
Administrivia
•Grading: P1 rollout grading is finished
•I will send grade reports this afternoon & tomorrow morning
•Prof Lane out of town Oct 11
•Andree Jacobsen will cover
•Stefano Markidis out of town Oct 19
•Will announce new office hours presently
Your place in History
•Last time:
•Q2
•Introduction to Reinforcement Learning (RL)
Your place in History
•This time:
✓P2M1 due
✓Voting
✓News
✓Administrivia
✓Q&A
•More on RL
•Design exercise: WorldSimulator and Terrains
Recall: Mack & his maze
•Mack lives a hard life as a psychology test subject
•Has to run around mazes all day, finding food and avoiding electric shocks
•Needs to know how to find cheese quickly, while getting shocked as little as possible
•Q: How can Mack learn to find his way around?
Reward over time
[Figure: Mack’s possible trajectories through states s1–s11]
•One trajectory gives: V(s1) = R(s1) + R(s4) + R(s11) + R(s10) + ...
•Another trajectory gives: V(s1) = R(s1) + R(s2) + R(s6) + ...
Where can you go?
•Definition: Complete set of all states agent could be in is called the state space: S
•Could be discrete or continuous
•For Project 2: states are discrete
•Q: What is the state space for P2?
•Size of state space: |S|
•Q: How big is the state space for P2?
Where can you go?
•Definition: Complete set of actions an agent could take is called the action space: A
•Again, discrete or continuous
•Again, for P2: A is discrete
•Q: What is A for P2? How big is it: |A|?
What is it worth to you?
•Idea of “good” and “bad” places to go
•Quantified as “rewards”
•(This is where term “reinforcement learning” comes from. Originated in psychology.)
•Formally: R : S → Reals
•R(s) == reward for getting to state s
•How good or bad it is to reach state s
•Larger (more positive) is better
•Agent “wants” to get more positive rwd
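The mapping R : S → Reals can be sketched as a simple lookup. This is a hypothetical Java illustration only; the state names and reward values are invented for Mack’s maze and are not the actual P2 code.

```java
import java.util.Map;

public class RewardDemo {
    // R(s): reward for *reaching* state s; larger (more positive) is better.
    // The map is a stand-in for however P2 actually encodes rewards.
    static double reward(String state, Map<String, Double> R) {
        return R.getOrDefault(state, 0.0);
    }

    public static void main(String[] args) {
        Map<String, Double> R = Map.of(
            "cheese", 10.0,   // good: Mack finds food
            "shock",  -5.0,   // bad: electric shock
            "hall",    0.0);  // neutral corridor
        System.out.println(reward("cheese", R)); // 10.0
        System.out.println(reward("shock", R));  // -5.0
    }
}
```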
How does it happen?
•Dynamics of agent defined by transition function
•T : S × A × S → [0,1]
•T(s,a,s’) == Pr[next state is s’ | curr state is s, act a]
•Examples from P2?
•In practice: Don’t write T down explicitly. Encoded by WorldSimulator and Terrain/agent interactions.
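To make T(s,a,s’) concrete, here is a hedged Java sketch that stores the probabilities in a nested table and samples a next state. The class, states, and action names are hypothetical; as the slide says, P2 never writes T down explicitly.

```java
import java.util.Map;
import java.util.Random;

public class TransitionDemo {
    // T.get(s).get(a) maps each possible next state s' to Pr[s' | s, a].
    static String sampleNext(Map<String, Map<String, Map<String, Double>>> T,
                             String s, String a, Random rng) {
        double u = rng.nextDouble(), cum = 0.0;
        String last = null;
        for (Map.Entry<String, Double> e : T.get(s).get(a).entrySet()) {
            cum += e.getValue();
            last = e.getKey();
            if (u < cum) return e.getKey();
        }
        return last; // guard against floating-point round-off
    }

    public static void main(String[] args) {
        // From s1, action "east" usually succeeds but sometimes slips back.
        Map<String, Map<String, Map<String, Double>>> T = Map.of(
            "s1", Map.of("east", Map.of("s2", 0.8, "s1", 0.2)));
        System.out.println(sampleNext(T, "s1", "east", new Random()));
    }
}
```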
The MDP
•Entire RL environment defined by a Markov decision process:
•M = ⟨S, A, T, R⟩
•S: state space
•A: action space
•T: transition function
•R: reward function
•Q: What modules represent these in P2?
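The four-tuple M = ⟨S, A, T, R⟩ can be grouped in one container. This is a hypothetical sketch; the field types are simplified stand-ins for whatever modules actually play these roles in P2.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;

public class MDP {
    final Set<String> states;                    // S: state space
    final Set<String> actions;                   // A: action space
    final Map<List<String>, Double> transitions; // T: (s, a, s') -> probability
    final Function<String, Double> reward;       // R: S -> Reals

    MDP(Set<String> S, Set<String> A,
        Map<List<String>, Double> T, Function<String, Double> R) {
        states = S; actions = A; transitions = T; reward = R;
    }

    public static void main(String[] args) {
        MDP m = new MDP(
            Set.of("s1", "s2"), Set.of("east"),
            Map.of(List.of("s1", "east", "s2"), 1.0),
            s -> s.equals("s2") ? 10.0 : 0.0);
        System.out.println(m.states.size()); // 2
    }
}
```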
Policies
•Total accumulated reward (value, V) depends on
•Where agent starts
•What agent does at each step (duh)
•Plan of action is called a policy, π
•Policy defines what action to take in every state of the system: π : S → A
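A deterministic policy π : S → A is just a lookup giving one action per state. A minimal sketch, with invented state and action names:

```java
import java.util.Map;

public class PolicyDemo {
    public static void main(String[] args) {
        // pi: one action per state; must answer for every reachable state.
        Map<String, String> pi = Map.of(
            "s1", "east",   // in s1, always go east
            "s2", "north",
            "s3", "east");
        System.out.println(pi.get("s2")); // north
    }
}
```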
Experience & histories
•Fundamental unit of experience in RL:
•At time t, in some state si, take action aj, get reward rt, and end up in state sk
•Called an experience tuple or SARSA tuple: ⟨si, aj, rt, sk⟩
•Set of all experience during a single episode up to time T is a history or trajectory
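An experience tuple and a history can be sketched as a tiny class plus a list of its instances. The class name and episode contents are hypothetical, for Mack’s maze:

```java
import java.util.List;

public class ExperienceDemo {
    // One experience tuple <s, a, r, s'>.
    static class Experience {
        final String s, a, sNext;
        final double r;
        Experience(String s, String a, double r, String sNext) {
            this.s = s; this.a = a; this.r = r; this.sNext = sNext;
        }
    }

    public static void main(String[] args) {
        // A history (trajectory): s1 --east--> s2 --north--> cheese
        List<Experience> history = List.of(
            new Experience("s1", "east", 0.0, "s2"),
            new Experience("s2", "north", 10.0, "cheese"));
        System.out.println(history.size()); // 2
    }
}
```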
How good is a policy?
•Value is a function of start state and policy: Vπ(s1)
•Value measures:
•How good is policy π, averaged over all time, if agent starts at state s1 and runs forever?
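For a single finite episode, the accumulated reward is just the sum of rewards along the trajectory, as in the V(s1) sums earlier. A hedged sketch (no discounting or averaging shown; the reward values are invented):

```java
import java.util.List;

public class ValueDemo {
    // Return of one episode: sum of rewards received along the trajectory.
    static double episodeReturn(List<Double> rewards) {
        double v = 0.0;
        for (double r : rewards) v += r;
        return v;
    }

    public static void main(String[] args) {
        // Rewards along s1 -> s4 -> s11 -> s10: mostly 0, one shock, then cheese.
        System.out.println(episodeReturn(List.of(0.0, 0.0, -5.0, 10.0))); // 5.0
    }
}
```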
The goal of RL
•Agent’s goal:
•Find the best possible policy: π*
•Find policy, π*, that maximizes Vπ(s) for all s
Design Exercise: WorldSimulator & Friends
Design exercise
•Q1:
•Design the act() method in WorldSimulator
•What objects does it need to access?
•How can it take different terrains/agents into account?
•Q2:
•GridWorld2d<T> could be really large
•Most of the terrain tiles are the same everywhere
•How can you avoid millions of copies of same tile?
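One standard answer to Q2 is the Flyweight pattern: store each distinct terrain type once and let every grid cell reference the shared instance. A minimal sketch; the class name and terrain kinds are hypothetical, not P2’s actual `Terrain` code:

```java
import java.util.HashMap;
import java.util.Map;

public class TerrainFlyweight {
    // One shared instance per distinct terrain kind.
    static final Map<String, TerrainFlyweight> CACHE = new HashMap<>();
    final String kind; // e.g. "grass", "wall"

    private TerrainFlyweight(String kind) { this.kind = kind; }

    // Factory: reuse the cached instance if this kind already exists.
    static TerrainFlyweight of(String kind) {
        return CACHE.computeIfAbsent(kind, TerrainFlyweight::new);
    }

    public static void main(String[] args) {
        TerrainFlyweight a = TerrainFlyweight.of("grass");
        TerrainFlyweight b = TerrainFlyweight.of("grass");
        // A million "grass" cells in GridWorld2d can share one object:
        System.out.println(a == b); // true
    }
}
```

The grid then stores references (or even just kind indices), so memory grows with the number of *distinct* terrains, not the number of cells.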