Markov Decision Processes (continued)
Robot Image Credit: Viktoriya Sukhanova © 123RF.com
Based on slides by Dan Klein
Example: Grid World
§ A maze-like problem
  § The agent lives in a grid
  § Walls block the agent's path
§ Noisy movement: actions do not always go as planned (see the sketch after this list)
  § 80% of the time, the action North takes the agent North
  § 10% of the time, North takes the agent West; 10% East
  § If there is a wall in the direction the agent would have been taken, the agent stays put
§ The agent receives rewards each time step
  § Small "living" reward each step (can be negative)
  § Big rewards come at the end (good or bad)
§ Goal: maximize sum of (discounted) rewards
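The noisy movement model above is easy to pin down in code. Here is a minimal Python sketch of the 80/10/10 transition model; the coordinate convention and the is_wall helper are illustrative assumptions, not part of the slides.

    # Hypothetical sketch of the GridWorld transition model (assumed layout).
    DIRS = {'North': (0, 1), 'South': (0, -1), 'East': (1, 0), 'West': (-1, 0)}

    # The two perpendicular "slip" directions for each intended action.
    SLIPS = {'North': ('West', 'East'), 'South': ('East', 'West'),
             'East': ('North', 'South'), 'West': ('South', 'North')}

    def transition_probs(state, action, is_wall):
        """Return {next_state: probability} for taking `action` in `state`.

        80% of the time the intended direction is followed, 10% for each
        perpendicular slip; if the destination is blocked, the agent stays put.
        """
        left, right = SLIPS[action]
        probs = {}
        for direction, p in [(action, 0.8), (left, 0.1), (right, 0.1)]:
            dx, dy = DIRS[direction]
            nxt = (state[0] + dx, state[1] + dy)
            if is_wall(nxt):
                nxt = state                      # bump into a wall: stay put
            probs[nxt] = probs.get(nxt, 0.0) + p
        return probs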
Recap: MDPs
§ Markov decision processes:
  § States S
  § Actions A
  § Transitions P(s'|s,a) (or T(s,a,s'))
  § Rewards R(s,a,s') (and discount γ)
  § Start state s0
§ Quantities:
  § Policy = map of states to actions
  § Utility = sum of discounted rewards
  § Values = expected future utility from a state (max node)
  § Q-Values = expected future utility from a q-state (chance node)
[Diagram: search tree rooted at state s, branching over actions a to q-states (s, a), then over outcomes (s, a, s') to successor states s']
Optimal Quantities
§ The value of a state s: V*(s) = expected utility starting in s and acting optimally
§ The value of a q-state (s, a): Q*(s,a) = expected utility starting out having taken action a from state s and (thereafter) acting optimally
§ The optimal policy: π*(s) = optimal action from state s
[Diagram: the same search tree, annotated: s is a state, (s, a) is a q-state, (s, a, s') is a transition]
[Demo: gridworld values (L9D1)]
Gridworld Values V*
Gridworld: Q*
The Bellman Equations
How to be optimal:
Step 1: Take correct first action
Step 2: Keep being optimal
The Bellman Equations
§ Definition of "optimal utility" via expectimax recurrence gives a simple one-step lookahead relationship amongst optimal utility values (written out below)
§ These are the Bellman equations, and they characterize optimal values in a way we'll use over and over
Value Iteration
§ Bellman equations characterize the optimal values (above)
§ Value iteration computes them by repeating the update shown below
§ Value iteration is just a fixed-point solution method
[Diagram: one-step lookahead tree from V(s) through action nodes (s, a) and outcomes (s, a, s') down to V(s')]
Convergence*
§ How do we know the V_k vectors are going to converge?
§ Case 1: If the tree has maximum depth M, then V_M holds the actual untruncated values
§ Case 2: If the discount is less than 1
  § Sketch: For any state, V_k and V_{k+1} can be viewed as depth-(k+1) expectimax results in nearly identical search trees
  § The difference is that on the bottom layer, V_{k+1} has actual rewards while V_k has zeros
  § That last layer is at best all R_MAX
  § It is at worst R_MIN
  § But everything is discounted by γ^k that far out
  § So V_k and V_{k+1} are at most γ^k max|R| different (in symbols below)
  § So as k increases, the values converge
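In symbols, the sketch bounds the change between successive value vectors:

    max_s |V_{k+1}(s) − V_k(s)| ≤ γ^k max|R|

which shrinks geometrically as k grows, so the values converge.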
Policy Evaluation
Fixed Policies
§ Value iteration computes search trees that max over all actions to compute optimal values
§ If we fixed some policy π(s), then the tree would be simpler – only one action per state
  § … although the tree's value would depend on which policy we fixed
[Diagrams: the full tree that branches over all actions a ("Do the optimal action") next to the fixed-policy tree with the single action π(s) per state ("Do what π says to do")]
Utilities for a Fixed Policy
§ Another basic operation: compute the utility of a state s under a fixed (generally non-optimal) policy
§ Define the utility of a state s, under a fixed policy π: V^π(s) = expected total discounted rewards starting in s and following π
§ Recursive relation (one-step look-ahead / Bellman equation), written out below
Example: Policy Evaluation (Always Go Right vs. Always Go Forward)
Policy Evaluation
§ How do we calculate the V's for a fixed policy π?
§ Idea 1: Turn the recursive Bellman equations into updates (like value iteration); see the update below
  § Efficiency: O(S²) per iteration
§ Idea 2: Without the maxes, the Bellman equations are just a linear system (solver sketch below)
  § Solve with MATLAB (or your favorite linear system solver)
Policy Extraction
Computing Actions from Values
§ Let's imagine we have the optimal values V*(s)
§ How should we act?
  § It's not obvious!
§ We need to solve one step of lookahead (the argmax below)
§ This is called policy extraction, since it gets the policy implied by the values
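That one step, written out:

    π*(s) = argmax_a Σ_s' T(s, a, s') [ R(s, a, s') + γ V*(s') ]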
Computing Actions from Q-Values
§ Let's imagine we have the optimal q-values Q*(s, a)
§ How should we act?
  § Completely trivial to decide! (see below)
§ Important lesson: actions are easier to select from q-values than values!
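From q-values the lookahead disappears entirely:

    π*(s) = argmax_a Q*(s, a)

Both extraction routines as NumPy sketches (same assumed dense arrays as earlier):

    import numpy as np

    def policy_from_values(T, R, V, gamma):
        """One-step lookahead: the greedy policy implied by values V."""
        Q = np.sum(T * (R + gamma * V), axis=2)   # Q[s, a]
        return np.argmax(Q, axis=1)

    def policy_from_qvalues(Q):
        """Trivial: take the argmax action in each state."""
        return np.argmax(Q, axis=1)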
Policy Iteration
Problems with Value Iteration
§ Value iteration repeats the Bellman updates (restated below):
§ Problem 1: It's slow – O(S²A) per iteration
§ Problem 2: The "max" at each state rarely changes
§ Problem 3: The policy often converges long before the values
[Demo slides: gridworld values after k = 0, 1, 2, …, 12 and k = 100 rounds of value iteration; noise = 0.2, discount = 0.9, living reward = 0]
Policy Iteration
§ Alternative approach for optimal values:
  § Step 1: Policy evaluation: calculate utilities for some fixed policy (not optimal utilities!) until convergence
  § Step 2: Policy improvement: update policy using one-step look-ahead with resulting converged (but not optimal!) utilities as future values
  § Repeat steps until policy converges
§ This is policy iteration
  § It's still optimal!
  § Can converge (much) faster under some conditions
Policy Iteration
§ Evaluation: For fixed current policy π, find values with policy evaluation
  § Iterate until values converge (first update below)
§ Improvement: For fixed values, get a better policy using policy extraction
  § One-step look-ahead (second update below)
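Written out, the two alternating updates (shown as images in the slides) are:

    Evaluation:  V^{πi}_{k+1}(s) ← Σ_s' T(s, πi(s), s') [ R(s, πi(s), s') + γ V^{πi}_k(s') ]
    Improvement: π_{i+1}(s) = argmax_a Σ_s' T(s, a, s') [ R(s, a, s') + γ V^{πi}(s') ]

A compact sketch of the full loop in Python/NumPy, under the same dense-array assumptions as the earlier sketches:

    import numpy as np

    def policy_iteration(T, R, gamma, tol=1e-8):
        """Alternate evaluation and improvement until the policy is stable."""
        S = T.shape[0]
        states = np.arange(S)
        policy = np.zeros(S, dtype=int)               # arbitrary initial policy
        while True:
            # Step 1: evaluate the current fixed policy until values converge
            V = np.zeros(S)
            while True:
                V_new = np.sum(T[states, policy]
                               * (R[states, policy] + gamma * V), axis=1)
                converged = np.max(np.abs(V_new - V)) < tol
                V = V_new
                if converged:
                    break
            # Step 2: improve by one-step lookahead on the converged values
            Q = np.sum(T * (R + gamma * V), axis=2)   # Q[s, a]
            new_policy = np.argmax(Q, axis=1)
            if np.array_equal(new_policy, policy):
                return policy, V                      # policy stable: done
            policy = new_policy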
Comparison
§ Both value iteration and policy iteration compute the same thing (all optimal values)
§ In value iteration:
  § Every iteration updates both the values and (implicitly) the policy
  § We don't track the policy, but taking the max over actions implicitly recomputes it
§ In policy iteration:
  § We do several passes that update utilities with a fixed policy (each pass is fast because we consider only one action, not all of them)
  § After the policy is evaluated, a new policy is selected (slow like a value iteration pass)
  § The new policy will be better (or we're done)
§ Both are dynamic programs for solving MDPs
Summary: MDP Algorithms
§ So you want to…
  § Compute optimal values: use value iteration or policy iteration
  § Compute values for a particular policy: use policy evaluation
  § Turn your values into a policy: use policy extraction (one-step lookahead)
§ These all look the same!
  § They basically are – they are all variations of Bellman updates
  § They all use one-step lookahead
  § They differ only in whether we plug in a fixed policy or max over actions
Double Bandits
Double-Bandit MDP
§ Actions: Blue, Red
§ States: Win, Lose
[Diagram: two-state MDP over W and L. The Blue action pays $1 with probability 1.0 from either state; the Red action pays $2 with probability 0.75 and $0 with probability 0.25]
No discount; 100 time steps; both states have the same value
Offline Planning
§ Solving MDPs is offline planning
  § You determine all quantities through computation
  § You need to know the details of the MDP
  § You do not actually play the game!
Value of each fixed policy (no discount; 100 time steps; both states have the same value):
  Play Red: 150
  Play Blue: 100
[The double-bandit MDP diagram is repeated here]
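These values follow from the expected per-step rewards: Red earns 0.75 × $2 + 0.25 × $0 = $1.50 per step and Blue earns $1 per step, so over 100 undiscounted steps:

    V(Play Red)  = 100 × $1.50 = $150
    V(Play Blue) = 100 × $1.00 = $100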
Online Planning
§ Rules changed! Red's win chance is different.
[Diagram: the same double-bandit MDP, but Red's probabilities are now unknown, shown as "??"]
Let’sPlay!
$0 $0 $0 $2 $0$2 $0 $0 $0 $0
What Just Happened?
§ That wasn't planning, it was learning!
  § Specifically, reinforcement learning
  § There was an MDP, but you couldn't solve it with just computation
  § You needed to actually act to figure it out
§ Important ideas in reinforcement learning that came up
  § Exploration: you have to try unknown actions to get information
  § Exploitation: eventually, you have to use what you know
  § Regret: even if you learn intelligently, you make mistakes
  § Sampling: because of chance, you have to try things repeatedly
  § Difficulty: learning can be much harder than solving a known MDP
Next Time: Reinforcement Learning!