Markov Decision Processes (continued)
Robot Image Credit: Viktoriya Sukhanova © 123RF.com
Based on slides by Dan Klein
Example: Grid World
§ A maze-like problem
  § The agent lives in a grid
  § Walls block the agent's path
§ Noisy movement: actions do not always go as planned (see the sketch after this list)
  § 80% of the time, the action North takes the agent North
  § 10% of the time, North takes the agent West; 10% East
  § If there is a wall in the direction the agent would have been taken, the agent stays put
§ The agent receives rewards each time step
  § Small "living" reward each step (can be negative)
  § Big rewards come at the end (good or bad)
§ Goal: maximize sum of (discounted) rewards
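The noisy movement model above is easy to pin down in code. Here is a minimal Python sketch of the 80/10/10 transition model; the coordinate convention and the is_wall helper are illustrative assumptions, not part of the slides.

    # Hypothetical sketch of the GridWorld transition model (assumed layout).
    DIRS = {'North': (0, 1), 'South': (0, -1), 'East': (1, 0), 'West': (-1, 0)}

    # The two perpendicular "slip" directions for each intended action.
    SLIPS = {'North': ('West', 'East'), 'South': ('East', 'West'),
             'East': ('North', 'South'), 'West': ('South', 'North')}

    def transition_probs(state, action, is_wall):
        """Return {next_state: probability} for taking `action` in `state`.

        80% of the time the intended direction is followed, 10% for each
        perpendicular slip; if the destination is blocked, the agent stays put.
        """
        left, right = SLIPS[action]
        probs = {}
        for direction, p in [(action, 0.8), (left, 0.1), (right, 0.1)]:
            dx, dy = DIRS[direction]
            nxt = (state[0] + dx, state[1] + dy)
            if is_wall(nxt):
                nxt = state                      # bump into a wall: stay put
            probs[nxt] = probs.get(nxt, 0.0) + p
        return probs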
Recap: MDPs
§ Markov decision processes:
  § States S
  § Actions A
  § Transitions P(s'|s,a) (or T(s,a,s'))
  § Rewards R(s,a,s') (and discount γ)
  § Start state s0
§ Quantities:
  § Policy = map of states to actions
  § Utility = sum of discounted rewards
  § Values = expected future utility from a state (max node)
  § Q-Values = expected future utility from a q-state (chance node)
[Diagram: search tree rooted at state s, branching over actions a to q-states (s, a), then over outcomes (s, a, s') to successor states s']
Optimal Quantities
§ The value of a state s: V*(s) = expected utility starting in s and acting optimally
§ The value of a q-state (s, a): Q*(s,a) = expected utility starting out having taken action a from state s and (thereafter) acting optimally
§ The optimal policy: π*(s) = optimal action from state s
[Diagram: the same search tree, annotated: s is a state, (s, a) is a q-state, (s, a, s') is a transition]
[Demo: gridworld values (L9D1)]
Gridworld Values V*
Gridworld: Q*
The Bellman Equations
How to be optimal:
Step 1: Take correct first action
Step 2: Keep being optimal
The Bellman Equations
§ Definition of "optimal utility" via expectimax recurrence gives a simple one-step lookahead relationship amongst optimal utility values (written out below)
§ These are the Bellman equations, and they characterize optimal values in a way we'll use over and over
Value Iteration
§ Bellman equations characterize the optimal values (above)
§ Value iteration computes them by repeating the update shown below
§ Value iteration is just a fixed-point solution method
[Diagram: one-step lookahead tree from V(s) through action nodes (s, a) and outcomes (s, a, s') down to V(s')]
Convergence*
§ How do we know the V_k vectors are going to converge?
§ Case 1: If the tree has maximum depth M, then V_M holds the actual untruncated values
§ Case 2: If the discount is less than 1
  § Sketch: For any state, V_k and V_{k+1} can be viewed as depth-(k+1) expectimax results in nearly identical search trees
  § The difference is that on the bottom layer, V_{k+1} has actual rewards while V_k has zeros
  § That last layer is at best all R_MAX
  § It is at worst R_MIN
  § But everything is discounted by γ^k that far out
  § So V_k and V_{k+1} are at most γ^k max|R| different (in symbols below)
  § So as k increases, the values converge
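In symbols, the sketch bounds the change between successive value vectors:

    max_s |V_{k+1}(s) − V_k(s)| ≤ γ^k max|R|

which shrinks geometrically as k grows, so the values converge.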
Policy Evaluation
Fixed Policies
§ Value iteration computes search trees that max over all actions to compute optimal values
§ If we fixed some policy π(s), then the tree would be simpler – only one action per state
  § … although the tree's value would depend on which policy we fixed
[Diagrams: the full tree that branches over all actions a ("Do the optimal action") next to the fixed-policy tree with the single action π(s) per state ("Do what π says to do")]
Utilities for a Fixed Policy
§ Another basic operation: compute the utility of a state s under a fixed (generally non-optimal) policy
§ Define the utility of a state s, under a fixed policy π: V^π(s) = expected total discounted rewards starting in s and following π
§ Recursive relation (one-step look-ahead / Bellman equation), written out below
Example: Policy Evaluation (Always Go Right vs. Always Go Forward)
Policy Evaluation
§ How do we calculate the V's for a fixed policy π?
§ Idea 1: Turn the recursive Bellman equations into updates (like value iteration); see the update below
  § Efficiency: O(S²) per iteration
§ Idea 2: Without the maxes, the Bellman equations are just a linear system (solver sketch below)
  § Solve with MATLAB (or your favorite linear system solver)
Policy Extraction
Computing Actions from Values
§ Let's imagine we have the optimal values V*(s)
§ How should we act?
  § It's not obvious!
§ We need to solve one step of lookahead (the argmax below)
§ This is called policy extraction, since it gets the policy implied by the values
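That one step, written out:

    π*(s) = argmax_a Σ_s' T(s, a, s') [ R(s, a, s') + γ V*(s') ]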
Computing Actions from Q-Values
§ Let's imagine we have the optimal q-values Q*(s, a)
§ How should we act?
  § Completely trivial to decide! (see below)
§ Important lesson: actions are easier to select from q-values than values!
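From q-values the lookahead disappears entirely:

    π*(s) = argmax_a Q*(s, a)

Both extraction routines as NumPy sketches (same assumed dense arrays as earlier):

    import numpy as np

    def policy_from_values(T, R, V, gamma):
        """One-step lookahead: the greedy policy implied by values V."""
        Q = np.sum(T * (R + gamma * V), axis=2)   # Q[s, a]
        return np.argmax(Q, axis=1)

    def policy_from_qvalues(Q):
        """Trivial: take the argmax action in each state."""
        return np.argmax(Q, axis=1)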
Policy Iteration
Problems with Value Iteration
§ Value iteration repeats the Bellman updates (restated below):
§ Problem 1: It's slow – O(S²A) per iteration
§ Problem 2: The "max" at each state rarely changes
§ Problem 3: The policy often converges long before the values
[Demo slides: gridworld values after k = 0, 1, 2, …, 12 and k = 100 rounds of value iteration; noise = 0.2, discount = 0.9, living reward = 0]
Policy Iteration
§ Alternative approach for optimal values:
  § Step 1: Policy evaluation: calculate utilities for some fixed policy (not optimal utilities!) until convergence
  § Step 2: Policy improvement: update policy using one-step look-ahead with resulting converged (but not optimal!) utilities as future values
  § Repeat steps until policy converges
§ This is policy iteration
  § It's still optimal!
  § Can converge (much) faster under some conditions
Policy Iteration
§ Evaluation: For fixed current policy π, find values with policy evaluation
  § Iterate until values converge (first update below)
§ Improvement: For fixed values, get a better policy using policy extraction
  § One-step look-ahead (second update below)
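Written out, the two alternating updates (shown as images in the slides) are:

    Evaluation:  V^{πi}_{k+1}(s) ← Σ_s' T(s, πi(s), s') [ R(s, πi(s), s') + γ V^{πi}_k(s') ]
    Improvement: π_{i+1}(s) = argmax_a Σ_s' T(s, a, s') [ R(s, a, s') + γ V^{πi}(s') ]

A compact sketch of the full loop in Python/NumPy, under the same dense-array assumptions as the earlier sketches:

    import numpy as np

    def policy_iteration(T, R, gamma, tol=1e-8):
        """Alternate evaluation and improvement until the policy is stable."""
        S = T.shape[0]
        states = np.arange(S)
        policy = np.zeros(S, dtype=int)               # arbitrary initial policy
        while True:
            # Step 1: evaluate the current fixed policy until values converge
            V = np.zeros(S)
            while True:
                V_new = np.sum(T[states, policy]
                               * (R[states, policy] + gamma * V), axis=1)
                converged = np.max(np.abs(V_new - V)) < tol
                V = V_new
                if converged:
                    break
            # Step 2: improve by one-step lookahead on the converged values
            Q = np.sum(T * (R + gamma * V), axis=2)   # Q[s, a]
            new_policy = np.argmax(Q, axis=1)
            if np.array_equal(new_policy, policy):
                return policy, V                      # policy stable: done
            policy = new_policy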
Comparison
§ Both value iteration and policy iteration compute the same thing (all optimal values)
§ In value iteration:
  § Every iteration updates both the values and (implicitly) the policy
  § We don't track the policy, but taking the max over actions implicitly recomputes it
§ In policy iteration:
  § We do several passes that update utilities with a fixed policy (each pass is fast because we consider only one action, not all of them)
  § After the policy is evaluated, a new policy is selected (slow like a value iteration pass)
  § The new policy will be better (or we're done)
§ Both are dynamic programs for solving MDPs
Summary: MDP Algorithms
§ So you want to…
  § Compute optimal values: use value iteration or policy iteration
  § Compute values for a particular policy: use policy evaluation
  § Turn your values into a policy: use policy extraction (one-step lookahead)
§ These all look the same!
  § They basically are – they are all variations of Bellman updates
  § They all use one-step lookahead
  § They differ only in whether we plug in a fixed policy or max over actions
Double Bandits
Double-Bandit MDP
§ Actions: Blue, Red
§ States: Win, Lose
[Diagram: two-state MDP over W and L. The Blue action pays $1 with probability 1.0 from either state; the Red action pays $2 with probability 0.75 and $0 with probability 0.25]
No discount; 100 time steps; both states have the same value
Offline Planning
§ Solving MDPs is offline planning
  § You determine all quantities through computation
  § You need to know the details of the MDP
  § You do not actually play the game!
Value of each fixed policy (no discount; 100 time steps; both states have the same value):
  Play Red: 150
  Play Blue: 100
[The double-bandit MDP diagram is repeated here]
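These values follow from the expected per-step rewards: Red earns 0.75 × $2 + 0.25 × $0 = $1.50 per step and Blue earns $1 per step, so over 100 undiscounted steps:

    V(Play Red)  = 100 × $1.50 = $150
    V(Play Blue) = 100 × $1.00 = $100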
Online Planning
§ Rules changed! Red's win chance is different.
[Diagram: the same double-bandit MDP, but Red's probabilities are now unknown, shown as "??"]
Let’sPlay!
$0 $0 $0 $2 $0$2 $0 $0 $0 $0
What Just Happened?
§ That wasn't planning, it was learning!
  § Specifically, reinforcement learning
  § There was an MDP, but you couldn't solve it with just computation
  § You needed to actually act to figure it out
§ Important ideas in reinforcement learning that came up
  § Exploration: you have to try unknown actions to get information
  § Exploitation: eventually, you have to use what you know
  § Regret: even if you learn intelligently, you make mistakes
  § Sampling: because of chance, you have to try things repeatedly
  § Difficulty: learning can be much harder than solving a known MDP
Next Time: Reinforcement Learning!