Lecture notes, University of Texas at Austin, course 394R (Fall 2019): Monte Carlo methods and off-policy evaluation (pstone/Courses/394Rfall19/resources/week2b-…)


Page 3

[Backup diagrams: a DP-style tree branching over all possible one-step transitions, next to a single Monte Carlo sample trajectory running to the end of the episode.]

Bellman eqn. for V_π (Dynamic programming, DP):
- all possible transitions
- only one step
- bootstrapping

Monte Carlo est. of V_π (MC):
- only sampled transitions
- all the way to end of episode
- no bootstrapping

Sample of return:  G_t = Σ_i γ^i R_{t+i+1}  ⇒  V_π(s) = E[G_t | S_t = s]
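The sampled-return definition and the MC estimate of V_π can be sketched in code. This is a minimal illustration: the reward sequences are made up, and γ = 0.9 is an assumed discount factor.

```python
def discounted_return(rewards, gamma=0.9):
    """G_t = sum_i gamma^i * R_{t+i+1}: discounted sum of the rewards
    from time t to the end of the episode, accumulated backwards."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# Monte Carlo estimate of V_pi(s): average the sampled returns, each
# computed from a complete episode (no bootstrapping, unlike DP).
episodes = [[1.0, 0.0, 1.0], [0.0, 1.0], [1.0, 1.0, 1.0]]  # hypothetical rewards
returns = [discounted_return(ep) for ep in episodes]
v_estimate = sum(returns) / len(returns)
```

Accumulating backwards (g = r + γ·g) avoids recomputing powers of γ for each step.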


Page 7

G_s^{i,π}: i-th return starting from state s, collected from policy π.

On-policy prediction:
  V_π(s)   = (1/n) Σ_{i=1}^n G_s^{i,π}
  Q_π(s,a) = (1/n) Σ_{i=1}^n G_{s,a}^{i,π}

Off-policy prediction (data from behavior policy π_b):
  V_π(s)   = (1/n) Σ_{i=1}^n G_s^{i,π_b} · ρ_i,   where ρ_i = Π_t π(A_t|S_t) / π_b(A_t|S_t)
  Q_π(s,a) = (1/n) Σ_{i=1}^n G_{s,a}^{i,π_b} · ρ_i

Consider: on-policy control vs. off-policy control w/ ε-greedy exploration.
What value functions and policies does each converge to (off-policy vs. on-policy)?
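The off-policy Monte Carlo estimator with the per-episode importance ratio ρ_i can be sketched as follows. This is a hypothetical toy example: the policies are reduced to bare action probabilities (no state dependence), and the episode data is made up for illustration.

```python
def importance_ratio(actions, pi, pi_b):
    """rho_i for one episode: product over steps of pi(a) / pi_b(a)."""
    rho = 1.0
    for a in actions:
        rho *= pi[a] / pi_b[a]
    return rho

pi   = {"left": 1.0, "right": 0.0}   # target policy: always "left"
pi_b = {"left": 0.5, "right": 0.5}   # behavior policy: uniform random

# (actions taken, observed return G_i) pairs collected from pi_b -- hypothetical
episodes = [(["left"], 1.0), (["right"], 0.0), (["left", "left"], 2.0)]

# Off-policy prediction: V_pi(s) ~= (1/n) * sum_i G_i * rho_i
n = len(episodes)
v_off = sum(g * importance_ratio(acts, pi, pi_b) for acts, g in episodes) / n
```

Episodes containing an action the target policy never takes get ρ_i = 0, so they contribute nothing; episodes the target policy favors get up-weighted.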


Page 10

Safe off-policy evaluation:

Given π, δ, and data from π_b, return a probabilistic lower bound V_lb^π such that
  V^π ≥ V_lb^π   with prob. 1 − δ
without ever running policy π!

Confidence bounds: Chernoff-Hoeffding inequality. With probability at least 1 − δ:
  μ ≥ (1/n) Σ_{i=1}^n X_i − b √( ln(1/δ) / (2n) ),   for 0 ≤ X_i ≤ b

Applied to the importance-weighted returns:
  V^π ≥ (1/n) Σ_{i=1}^n G_i^{π_b} ρ_i − G_max √( ln(1/δ) / (2n) ),   for 0 ≤ G_i ρ_i ≤ G_max
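A minimal sketch of the lower-bound computation, assuming the importance-weighted returns G_i·ρ_i have already been computed and fall in [0, G_max]; the sample values and G_max below are hypothetical.

```python
import math

def hoeffding_lower_bound(xs, b, delta):
    """One-sided Chernoff-Hoeffding bound: with probability >= 1 - delta,
    the true mean satisfies mu >= mean(xs) - b * sqrt(ln(1/delta) / (2n)),
    assuming 0 <= x_i <= b for every sample."""
    n = len(xs)
    return sum(xs) / n - b * math.sqrt(math.log(1.0 / delta) / (2.0 * n))

# Hypothetical importance-weighted returns G_i * rho_i, all in [0, G_max]
weighted_returns = [0.8, 1.2, 0.0, 2.0, 0.5] * 20   # n = 100 samples
v_lb = hoeffding_lower_bound(weighted_returns, b=2.0, delta=0.05)
# v_lb is a 95%-confidence lower bound on V^pi, computed without running pi
```

The bound tightens as n grows (the correction term shrinks like 1/√n) and loosens as the range b = G_max grows.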


Page 14

Given returns G_1, …, G_n collected from policy π_b, and
  ρ_i = Π_{k=1}^T π(A_k|S_k) / π_b(A_k|S_k),
then:

OIS (ordinary importance sampling):
  V_π(s) = (1/n) Σ_{i=1}^n G_i ρ_i

WIS (weighted importance sampling):
  V_π(s) = Σ_{i=1}^n G_i ρ_i / Σ_{j=1}^n ρ_j

PDIS (per-decision importance sampling):
  V_π(s) = (1/n) Σ_{i=1}^n G̃_i,   where
  G̃_i = ρ_{1:1} R_1 + γ ρ_{1:2} R_2 + … + γ^{T−1} ρ_{1:T} R_T
  and ρ_{a:b} = Π_{k=a}^b π(A_k|S_k) / π_b(A_k|S_k)
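The three estimators can be sketched side by side. The trajectory data and per-step ratios below are hypothetical, and γ is an assumed discount factor.

```python
def ois(returns, rhos):
    """Ordinary importance sampling: (1/n) * sum_i G_i * rho_i."""
    return sum(g * r for g, r in zip(returns, rhos)) / len(returns)

def wis(returns, rhos):
    """Weighted importance sampling: sum_i G_i * rho_i / sum_j rho_j."""
    return sum(g * r for g, r in zip(returns, rhos)) / sum(rhos)

def pdis_return(step_rhos, step_rewards, gamma=1.0):
    """Per-decision IS return for one trajectory:
    G~ = sum_t gamma^(t-1) * rho_{1:t} * R_t,
    with rho_{1:t} built up one step ratio at a time."""
    g, cum_rho, discount = 0.0, 1.0, 1.0
    for rho_t, r_t in zip(step_rhos, step_rewards):
        cum_rho *= rho_t              # rho_{1:t} = rho_{1:t-1} * rho_t
        g += discount * cum_rho * r_t
        discount *= gamma
    return g

returns = [1.0, 3.0]   # G_1, G_2 (hypothetical)
rhos    = [2.0, 0.5]   # per-episode ratios rho_1, rho_2
```

WIS is biased but typically far lower variance than OIS; PDIS weights each reward only by the ratios of the steps that preceded it, rather than the whole trajectory's ratio.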