
Page 1

Curiosity, unobserved rewards, and neural networks in RL

Csaba Szepesvári, DeepMind & University of Alberta

New Directions in RL and Control, Princeton 2019

(function approximation)

Page 2

Part I: Curiosity

“One of the striking differences between current reinforcement learning algorithms and early human learning is that animals and infants appear to explore their environments with autonomous purpose, in a manner appropriate to their current level of skills.”

ALT 2011, invited talk by Peter Auer

Page 3

Modeling Curiosity (ALW model)

• Controlled process
• Stochasticity: makes things more interesting/realistic
• Countably many states; the states are observed
  – Simplifying assumption
  – Hope: some of the principles/algorithms transfer to the general case
  – You have to start somewhere
• Reset to an initial state
  – Necessary
  – Engineer the environment to make this happen (robot moms!)
• Goal: extend the set of reliably reachable states as quickly as possible

Page 4

Performance metric

• # reliably reachable states / time
• Fix an arbitrary partial order $\prec$ on states
  – not known to the learner
• Fix $L > 0$. Define $\mathcal{S}_L^{\prec}$ as follows (a code sketch follows after this list):
  – $s_0 \in \mathcal{S}_L^{\prec}$
  – $s \in \mathcal{S}_L^{\prec}$ if $\exists \pi$ on $\{s' \prec s : s' \in \mathcal{S}_L^{\prec}\}$ s.t. $\tau(s \mid \pi) \le L$
• Define $\mathcal{S}_L^{\to} = \cup_{\prec} \mathcal{S}_L^{\prec}$.
• Note: simpler definitions don't work (counterexamples).
• Proposition: $\exists \prec$ s.t. $\mathcal{S}_L^{\to} = \mathcal{S}_L^{\prec}$ and $\mathcal{S}_L^{\to}$ is finite.
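To make the inductive definition above concrete, here is a minimal Python sketch (mine, not from the paper) that grows $\mathcal{S}_L^{\prec}$ to a fixed point; `prec(u, s)` encodes the partial order, and `min_expected_hitting_time(s0, target, allowed)` is a hypothetical planner, supplied by the caller, returning the best expected time to reach `target` from $s_0$ using only policies supported on the states in `allowed`.

```python
# Sketch only: grow S_L under a given partial order to a fixed point,
# following the inductive definition on this slide.
def reliably_reachable(states, s0, L, prec, min_expected_hitting_time):
    S = {s0}                      # s0 is reliably reachable by definition
    changed = True
    while changed:
        changed = False
        for s in states:
            if s in S:
                continue
            # policies may only use already-included states that precede s
            allowed = {u for u in S if prec(u, s)}
            if min_expected_hitting_time(s0, s, allowed) <= L:
                S.add(s)
                changed = True
    return S
```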

Page 5

UCBExplore

Known states

1. Discover
2. Propose
3. Verify

(schematic sketch below)

Lim and Auer (COLT 2012)
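A schematic rendering of those three phases (my reading of the slide, not Lim and Auer's pseudocode); the planning and statistical-testing logic is abstracted into caller-supplied, hypothetical callables, and the $(1+\epsilon)L$ tolerance is only indicative.

```python
# Schematic only: the Discover / Propose / Verify loop named above.
def ucb_explore(discover_candidate, propose_policy, verify, s0, L, epsilon):
    known = {s0}                                      # states verified to be reliably reachable
    while True:
        s_new = discover_candidate(known)             # 1. Discover: pick a candidate new state
        if s_new is None:
            return known                              # nothing left that looks L-reachable
        pi = propose_policy(known, s_new, L)          # 2. Propose: a policy meant to reach it quickly
        if verify(pi, s_new, (1 + epsilon) * L):      # 3. Verify: test that the policy is fast enough
            known.add(s_new)
```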

Page 6

Main result

Anytime, continual learning version:

Lim and Auer (COLT 2012)

Page 7

Nonstationarity

w. Auer, Gajane, Ortner (2019)

Page 8

Performance metric

• $F$: number of times the transition probabilities change
  – ($t = 1$: always a change)
• $W(L)$ time steps to find all $L$-reachable states in a single MDP ⇒ $F\,W(L)$ time steps when there are $F$ changes
• Classification of time steps: the algorithm either has correct knowledge of what is reachable, or it does not; the algorithm is competent vs. incompetent
• Goal: minimize the number of time steps when the algorithm is incompetent
• Difficulty: the location and number of the changes are unknown

Page 9

Main ideas

• Two phases:
  – Build the set of reachable states $\mathcal{K}$ (UCBExplore)
  – Repeat: check for new reachable states (UCBExplore) or disappearing states (as in the verification phase of UCBExplore); break out when UCBExplore often takes too long compared to its predicted runtime
• Checking starts when building is done
• Issues with building:
  – How can the algorithm know whether a change happened while building? E.g. a new state was found reachable: before the change, or after the change?
• Solution: staggered start of many parallel building processes; quit building when any of the processes finishes (toy sketch below)

w. Auer, Gajane, Ortner (2019)
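A toy rendering of that staggered-start idea (my reading of the slide, not the paper's algorithm): copies of the same building procedure, represented as generators, are started at staggered times and advanced in lockstep, and building stops as soon as any copy finishes.

```python
# Toy sketch of staggered parallel building (illustration only).
# make_builder() returns a generator that yields None while still building and
# finally yields the finished object (e.g. the current set of reachable states).
def staggered_building(make_builder, stagger=100, max_copies=10):
    builders, t = [], 0
    while True:
        if len(builders) < max_copies and t % stagger == 0:
            builders.append(make_builder())   # staggered start of a fresh copy
        for b in builders:
            out = next(b)
            if out is not None:               # quit building when any copy finishes
                return out
        t += 1
```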

Page 10

Result

• Theorem: up to lower-order terms and log factors, the total number of steps when the algorithm is incompetent is at most $W(L)\,F^2$, irrespective of when the changes happen.

• Questions:
  – Is the $W(L)$ cost necessary without changes?
  – Is the quadratic dependence on $F$ above necessary?
  – Nontabular?

w. Auer, Gajane, Ortner (2019)

Page 11

Part II: Unobserved rewards

• RL: rewards are always observed
  – internally computed
  – externally provided
• Is this reasonable?
• Is the environment state observable?
• What happens when rewards are not observable?
• Consequences for:
  – Planning
  – Learning ⇒ exploration, which will need planning!
• Bandits: MDPs w. i.i.d. state
• Partial monitoring: POMDPs w. i.i.d. state

w. Tor Lattimore (2019)

Page 12

Interacting user

Show web page (agent's choice)

User clicks (observed)

User happy (unobserved)

Page 13

Environment, $X$

Action, $A$

Observation: $O = \Phi(A, X)$

Loss $\mathcal{L}(A, X)$ (unobserved)

Page 14

Partial Monitoring [Rustichini, 1999]

The learner is given the maps $\mathcal{L}, \Phi$.

For rounds $t = 1, 2, \dots, n$ (a toy simulator of this loop follows below):

1. Environment chooses $X_t \in \mathcal{X}$
2. Learner chooses $A_t \in \mathcal{A}$
3. Learner suffers loss $\mathcal{L}(A_t, X_t)$, which remains hidden!
4. Learner observes feedback $\Phi(A_t, X_t)$

Regret: $R_n = \max_{a \in \mathcal{A}} \sum_{t=1}^{n} \mathcal{L}(A_t, X_t) - \mathcal{L}(a, X_t)$
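To make the protocol concrete, here is a small, self-contained simulator (an illustration, not tied to any particular partial-monitoring library): the learner only ever sees the feedback symbols $\Phi(A_t, X_t)$, while the simulator, which knows the outcomes, can compute the regret after the fact.

```python
import numpy as np

rng = np.random.default_rng(0)

def run_partial_monitoring(L, Phi, choose_action, n):
    """L: (num_actions, num_outcomes) loss matrix; Phi: same-shaped array of feedback
    symbols; choose_action(feedback_history) -> action index. Returns the regret
    against the best fixed action in hindsight, which only the simulator can compute:
    the learner never observes a single loss."""
    num_actions, num_outcomes = L.shape
    history, actions, outcomes = [], [], []
    for _ in range(n):
        x_t = int(rng.integers(num_outcomes))   # environment picks X_t (here: i.i.d. uniform)
        a_t = choose_action(history)            # learner picks A_t from past feedback only
        history.append((a_t, Phi[a_t, x_t]))    # learner observes Phi(A_t, X_t) and nothing else
        actions.append(a_t)
        outcomes.append(x_t)
    total_loss = L[actions, outcomes].sum()
    best_fixed = min(L[a, outcomes].sum() for a in range(num_actions))
    return total_loss - best_fixed

# Example: a uniformly random learner on a toy 2-action game where action 0 is
# safe but uninformative and action 1 is risky but reveals the outcome.
L_mat = np.array([[0.5, 0.5], [0.0, 1.0]])
Phi_mat = np.array([[0, 0], [1, 2]])
print(run_partial_monitoring(L_mat, Phi_mat, lambda h: int(rng.integers(2)), n=1000))
```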

Page 15

Why great?

• Informal examples of PM problems:
  – Dynamic pricing
  – Altruistic agents
  – Statistical testing (balancing power and cost)
  – Delayed rewards / surrogates
• Subsumes classic frameworks:
  – finite-armed bandits
  – prediction with expert advice
  – bandits with graph feedback
  – linear bandits
  – dueling bandits
  – …


Page 16

Partial Monitoring: Classification Theorem

Theorem: Let $\mathcal{A}, \mathcal{X}$ be finite and let $R_n^*(G)$ be the minimax regret of the PM problem $G = (\mathcal{L}, \Phi)$. Then
$$R_n^*(G) = \begin{cases} 0 & \text{if } G \text{ has no neighboring ("nb") actions,} \\ \Theta(\sqrt{n}) & \text{if } G \text{ is locally observable (L.O.) and has nb actions,} \\ \Theta(n^{2/3}) & \text{if } G \text{ is globally observable (G.O.) but not L.O.,} \\ \Omega(n) & \text{otherwise.} \end{cases}$$

[Cesa-Bianchi, Lugosi, Stoltz, 2006; Bartók, Pál, Sz., 2011; Foster and Rakhlin, 2012; Antos, Bartók, Pál and Sz., 2013; Bartók, Foster, Pál, Rakhlin, Sz., 2014; Lattimore and Sz., 2019a].

Page 17

Algorithms?

• Classical approaches fail in partial monitoring
  – Optimism / Thompson sampling / exponential weights
• Complicated algorithms exist; none are good!

Page 18

Exploration by Optimisation

Page 19

Theory

• A single algorithm works in all 'learnable' finite games
• Near-optimal for bandits, full information, graph feedback
• Best known bounds in the general case
• Essentially no tuning; the learning rate is tuned online

Page 20

Experiments

Page 21

Experiments

Page 22

Conclusions / Future plans

• It is sometimes good to be ambitious!
• More experiments needed
• How to solve the optimization problem? It is convex! But the cost is not $O(k)$…
• What happens when $\mathcal{X}$ is large or infinite?
• Generalizations?
  – Add state/context! Use "explore by optimization" beyond PM?
• Find more applications?

Page 23

Part III: RL & generalization

• The world is big
• Need approximate models
• Minimal assumptions to make RL + generalization work?
• policy error = f(approximation error of the "model")

Three results:
1. Generative model access / planning by solving a reduced-order model
2. Model-based RL: factored linear models, a convenient model class
3. Model-free RL

Page 24

LRA: Linearly Relaxed ALP

Setup: $c \ge 0$, $\mathbf{1}^\top c = 1$; nonnegative matrices $W_a \in [0,\infty)^{S \times m}$; a positive weight vector $\psi \in (0,\infty)^{S}$ with $\psi \in \operatorname{span}(\Phi)$; weighted norm $\|J\|_{\infty,\psi} = \max_s |J(s)|/\psi(s)$; and $\beta_\psi := \alpha \max_a \|P_a \psi\|_{\infty,\psi} < 1$.

Theorem: Let $\epsilon = \inf_{r \in \mathbb{R}^k} \|J^* - \Phi r\|_{\infty,\psi}$ and $J_{\mathrm{LRA}} = \Phi r_{\mathrm{LRA}}$, where $r_{\mathrm{LRA}}$ is the solution of the LRA linear program stated below. Then, under the said assumptions,
$$\|J^* - J_{\mathrm{LRA}}\|_{1,c} \;\le\; \frac{2\, c^\top \psi}{1 - \beta_\psi} \Big( 3\epsilon + \|\hat{J}^*_{\mathrm{LRA}} - \hat{J}^*_{\mathrm{ALP}}\|_{\infty,\psi} \Big),$$
with $\hat{J}^*_{\mathrm{ALP}}$ and $\hat{J}^*_{\mathrm{LRA}}$ defined below.

[Figure: Fig. 1. Results for a single queue with polynomial features. On both panels the x axis represents the state space (the length of the queue); the curves show the value functions $J^*$, $J_{\mathrm{CS}}$, $J_{\mathrm{CS\text{-}ideal}}$, $J_{\mathrm{LRA}}$ and the corresponding controls $u^*$, $u_{\mathrm{CS}}$, $u_{\mathrm{CS\text{-}ideal}}$, $u_{\mathrm{LRA}}$.]




IEEE TAC, 2018; w. Chandrashekar L. & Sh. Bhatnagar

The LRA linear program:
$$\min_{r \in \mathbb{R}^k} \; c^\top \Phi r \quad \text{s.t.} \quad \sum_a W_a^\top \Phi r \;\ge\; \sum_a W_a^\top \big(g_a + \alpha P_a \Phi r\big).$$

$$\hat{J}^*_{\mathrm{ALP}}(s) = \min\{ r^\top \phi(s) : \Phi r \ge J^*,\; r \in \mathbb{R}^k \}, \qquad \hat{J}^*_{\mathrm{LRA}}(s) = \min\{ r^\top \phi(s) : W^\top E\, \Phi r \ge W^\top E\, J^*,\; r \in \mathbb{R}^k \}.$$
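For small, explicitly represented problems the LRA program above is an ordinary LP in $r$: moving the $\alpha P_a \Phi r$ term to the left gives $\sum_a W_a^\top (I - \alpha P_a)\Phi\, r \ge \sum_a W_a^\top g_a$. The sketch below (an illustration, not the paper's code) feeds this to scipy.optimize.linprog.

```python
import numpy as np
from scipy.optimize import linprog

def solve_lra(Phi, P, g, W, c, alpha):
    """Toy LRA solver. Phi: (S, k) features; P: list of (S, S) transition matrices P_a;
    g: list of (S,) per-action rewards/costs g_a; W: list of (S, m) nonnegative matrices;
    c: (S,) nonnegative weights summing to 1; alpha: discount in (0, 1)."""
    S, k = Phi.shape
    # Constraints rearranged to A r >= b with r the only unknown.
    A = sum(W[a].T @ (np.eye(S) - alpha * P[a]) @ Phi for a in range(len(P)))
    b = sum(W[a].T @ g[a] for a in range(len(P)))
    # linprog expects A_ub x <= b_ub, so flip signs; the coefficients r are unbounded.
    res = linprog(c=Phi.T @ c, A_ub=-A, b_ub=-b, bounds=[(None, None)] * k)
    return res.x            # r_LRA; the value estimate is J_LRA = Phi @ r_LRA
```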

Page 25

Model-based RL

Good? Bad? Bonus question: can $\|V^* - V^{\hat{\pi}}\|$ be controlled by controlling $\|\hat{V} - V^*\|$?

Can we do better? Perhaps using extra structure?

w. Bernardo Ávila Pires, COLT 2016

Page 26

Structure: Factored linear models

w. Bernardo Ávila Pires, COLT 2016 / semi / linear / PhD Th.

$\mathcal{P}(dx' \mid x, a) \approx \xi(dx')^\top \psi(x, a)$

$\mathcal{R}V = \int V(x')\,\xi(dx') \;(= w) \in \mathbb{R}^d$

$(\mathcal{Q}w)(x, a) = w^\top \psi(x, a)$

$\mathcal{R}: \mathrm{VFUN} \to \mathbb{R}^d$

$\mathcal{Q}: \mathbb{R}^d \to \mathrm{AVFUN}$

$\mathcal{P}: \mathrm{VFUN} \to \mathrm{AVFUN}$, $(\mathcal{P}V)(x, a) = \int V(x')\,\mathcal{P}(dx' \mid x, a)$

$\mathcal{P} \approx \mathcal{Q}\mathcal{R}$

Special cases:
• Tabular
• Linear MDP
• KME
• Stoch. Fact.
• KBRL
• …

Legend: $\mathcal{V} = \mathrm{VFUN}$ (state-value functions), $\mathcal{W} = \mathbb{R}^{\mathcal{I}} = \mathrm{CVFUN}$, $\mathcal{V}^{\mathcal{A}} = \mathrm{AVFUN}$ (action-value functions)
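To illustrate why the factorization $\mathcal{P} \approx \mathcal{Q}\mathcal{R}$ is convenient, here is a toy planning sketch (mine, under the simplifying assumption of a finite state space, so that $\xi$ becomes a $d \times S$ matrix): approximate value iteration only ever has to carry the $d$-dimensional statistic $w = \mathcal{R}V$.

```python
import numpy as np

def factored_avi(g, Psi, Xi, alpha, iters=200):
    """Toy approximate value iteration in a factored linear model.
    g: (S, A) rewards; Psi: (S, A, d) features psi(x, a); Xi: (d, S) matrix whose rows
    play the role of the (signed) measures xi; alpha: discount factor in (0, 1)."""
    S, A, d = Psi.shape
    w = np.zeros(d)                    # w = R V, the compressed value function
    Q = np.zeros((S, A))
    for _ in range(iters):
        Q = g + alpha * (Psi @ w)      # (Q w)(x, a) = g(x, a) + alpha * psi(x, a)^T w
        V = Q.max(axis=1)              # greedy state values
        w = Xi @ V                     # R V: "integrate" V against xi
    return Q.argmax(axis=1), Q         # greedy policy and its action-value estimate
```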

Page 27

Policy error in factored linear models

$U^* = M\,T\,\mathcal{Q}\,u^*$, $\quad u^* = M'\,T\,\mathcal{R}_{\mathcal{A}}\,\mathcal{Q}\,u^*$

$\hat{\pi} = G\,T\,\mathcal{Q}\,u^*$

w. Bernardo Ávila Pires, COLT 2016 / semi / linear / PhD Th.

Questions:
• Is the bound tight?
• Time/action abstraction?
• Efficient learning? What specific models to use?

Page 28

Online, model-free RL w. neural nets

• Continuing RL; $R_T$: pseudo-regret; let $Q_t := Q_{\pi_t}$.
• Key identity:
$$R_T = \sum_{x} \nu_{\pi^*}(x) \sum_{t=1}^{T} \Big( \langle Q_t(x,\cdot),\, \pi_t(\cdot \mid x)\rangle - \langle Q_t(x,\cdot),\, \pi^*(\cdot \mid x)\rangle \Big)$$
• Then…
$$\begin{aligned}
\langle Q_t(x,\cdot), \pi_t(\cdot\mid x)\rangle - \langle Q_t(x,\cdot), \pi^*(\cdot\mid x)\rangle
&= \langle \hat{Q}_t(x,\cdot), \pi_t(\cdot\mid x)\rangle - \langle \hat{Q}_t(x,\cdot), \pi^*(\cdot\mid x)\rangle && \Rightarrow \text{control w. OLP} \\
&\quad + \langle Q_t(x,\cdot), \pi_t(\cdot\mid x)\rangle - \langle \hat{Q}_t(x,\cdot), \pi_t(\cdot\mid x)\rangle && \Rightarrow \text{A: } L_1(\nu_{\pi^*} \otimes \pi_t) \\
&\quad + \langle \hat{Q}_t(x,\cdot), \pi^*(\cdot\mid x)\rangle - \langle Q_t(x,\cdot), \pi^*(\cdot\mid x)\rangle && \Rightarrow \text{A: } L_1(\nu_{\pi^*} \otimes \pi^*)
\end{aligned}$$

w. N. Lazic, Y. Abbasi-Yadkori, G. Weisz, ICML’19 + arXiv

Page 29

Politex

w. N. Lazic, Y. Abbasi-Yadkori, G. Weisz, ICML’19 + arXiv
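The slide only names the algorithm, so as a reminder here is a tabular paraphrase (a sketch based on my reading of the ICML'19 paper, not its pseudocode): the phase-$k$ policy is a softmax of the sum of all earlier action-value estimates, and `estimate_q` stands in for the black-box value-function approximator assumed by the theorem on the next slide.

```python
import numpy as np

def politex(env, estimate_q, num_states, num_actions, phases, steps_per_phase, eta):
    """Tabular Politex sketch: pi_k(a|s) is proportional to exp(eta * sum_{j<k} Q_hat_j(s, a))."""
    q_sum = np.zeros((num_states, num_actions))
    policy = np.full((num_states, num_actions), 1.0 / num_actions)
    for _ in range(phases):
        # exponential-weights ("expert advice") update over the accumulated Q estimates
        logits = eta * q_sum
        policy = np.exp(logits - logits.max(axis=1, keepdims=True))
        policy /= policy.sum(axis=1, keepdims=True)
        # run the current policy and estimate its action-value function (black box)
        q_hat = estimate_q(policy, env, steps_per_phase)
        q_sum += q_hat
    return policy
```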

Page 30

Regret bounds

w. N. Lazic, Y. Abbasi-Yadkori, G. Weisz, ICML’19 + arXiv

Theorem

Assume that for any policy $\pi$, after following $\pi$ for $n$ steps, a black-box function approximator produces an action-value function whose error is $\epsilon_0 + 1/\sqrt{n}$, up to some universal constant.

Then the average pseudo-regret of Politex after $T$ steps is $\epsilon_0 + T^{-1/4}$.

Page 31

Refinements

• Problem: how to get the $\epsilon_0 + 1/\sqrt{n}$ error?
  – E.g. linear VFA? LSPE! $\epsilon_0$: the limiting error of LSPE could be ≫ the best error.
• Refinement 1:
  – Use an on-policy state-value function approximator
  – add extra action dithering per state
  – assume all policies excite the state features
• Refinement 2:
  – Assume access to an "exploration policy" that excites the features
  – Interleave exploration steps with policy steps
  – Use off-policy (!) VFA (which one?)
  – ⇒ Regret degrades a bit
• Questions:
  – Can we do better with other OL methods? Is averaging really necessary?
  – Better value-function learners?

w. N. Lazic, Y. Abbasi-Yadkori, G. Weisz, ICML’19 + arXiv

Page 32

Summary