Lecture notes, University of Texas at Austin, course 394R (Fall 2019): Monte Carlo methods and off-policy evaluation (pstone/Courses/394Rfall19/resources/week2b-…)


Page 3

[Backup diagrams: a DP-style tree branching over all possible one-step transitions, next to a single Monte Carlo sample trajectory running to the end of the episode.]

Bellman eqn. for V_π (Dynamic programming, DP):
- all possible transitions
- only one step
- bootstrapping

Monte Carlo est. of V_π (MC):
- only sampled transitions
- all the way to end of episode
- no bootstrapping

Sample of return:  G_t = Σ_i γ^i R_{t+i+1}  ⇒  V_π(s) = E[G_t | S_t = s]
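The sampled-return definition and the MC estimate of V_π can be sketched in code. This is a minimal illustration: the reward sequences are made up, and γ = 0.9 is an assumed discount factor.

```python
def discounted_return(rewards, gamma=0.9):
    """G_t = sum_i gamma^i * R_{t+i+1}: discounted sum of the rewards
    from time t to the end of the episode, accumulated backwards."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# Monte Carlo estimate of V_pi(s): average the sampled returns, each
# computed from a complete episode (no bootstrapping, unlike DP).
episodes = [[1.0, 0.0, 1.0], [0.0, 1.0], [1.0, 1.0, 1.0]]  # hypothetical rewards
returns = [discounted_return(ep) for ep in episodes]
v_estimate = sum(returns) / len(returns)
```

Accumulating backwards (g = r + γ·g) avoids recomputing powers of γ for each step.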


Page 7

G_s^{i,π}: i-th return starting from state s, collected from policy π.

On-policy prediction:
  V_π(s)   = (1/n) Σ_{i=1}^n G_s^{i,π}
  Q_π(s,a) = (1/n) Σ_{i=1}^n G_{s,a}^{i,π}

Off-policy prediction (data from behavior policy π_b):
  V_π(s)   = (1/n) Σ_{i=1}^n G_s^{i,π_b} · ρ_i,   where ρ_i = Π_t π(A_t|S_t) / π_b(A_t|S_t)
  Q_π(s,a) = (1/n) Σ_{i=1}^n G_{s,a}^{i,π_b} · ρ_i

Consider: on-policy control vs. off-policy control w/ ε-greedy exploration.
What value functions and policies does each converge to (off-policy vs. on-policy)?
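The off-policy Monte Carlo estimator with the per-episode importance ratio ρ_i can be sketched as follows. This is a hypothetical toy example: the policies are reduced to bare action probabilities (no state dependence), and the episode data is made up for illustration.

```python
def importance_ratio(actions, pi, pi_b):
    """rho_i for one episode: product over steps of pi(a) / pi_b(a)."""
    rho = 1.0
    for a in actions:
        rho *= pi[a] / pi_b[a]
    return rho

pi   = {"left": 1.0, "right": 0.0}   # target policy: always "left"
pi_b = {"left": 0.5, "right": 0.5}   # behavior policy: uniform random

# (actions taken, observed return G_i) pairs collected from pi_b -- hypothetical
episodes = [(["left"], 1.0), (["right"], 0.0), (["left", "left"], 2.0)]

# Off-policy prediction: V_pi(s) ~= (1/n) * sum_i G_i * rho_i
n = len(episodes)
v_off = sum(g * importance_ratio(acts, pi, pi_b) for acts, g in episodes) / n
```

Episodes containing an action the target policy never takes get ρ_i = 0, so they contribute nothing; episodes the target policy favors get up-weighted.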


Page 10

Safe off-policy evaluation:

Given π, δ, and data from π_b, return a probabilistic lower bound V_lb^π such that
  V^π ≥ V_lb^π   with prob. 1 − δ
without ever running policy π!

Confidence bounds: Chernoff-Hoeffding inequality. With probability at least 1 − δ:
  μ ≥ (1/n) Σ_{i=1}^n X_i − b √( ln(1/δ) / (2n) ),   for 0 ≤ X_i ≤ b

Applied to the importance-weighted returns:
  V^π ≥ (1/n) Σ_{i=1}^n G_i^{π_b} ρ_i − G_max √( ln(1/δ) / (2n) ),   for 0 ≤ G_i ρ_i ≤ G_max
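A minimal sketch of the lower-bound computation, assuming the importance-weighted returns G_i·ρ_i have already been computed and fall in [0, G_max]; the sample values and G_max below are hypothetical.

```python
import math

def hoeffding_lower_bound(xs, b, delta):
    """One-sided Chernoff-Hoeffding bound: with probability >= 1 - delta,
    the true mean satisfies mu >= mean(xs) - b * sqrt(ln(1/delta) / (2n)),
    assuming 0 <= x_i <= b for every sample."""
    n = len(xs)
    return sum(xs) / n - b * math.sqrt(math.log(1.0 / delta) / (2.0 * n))

# Hypothetical importance-weighted returns G_i * rho_i, all in [0, G_max]
weighted_returns = [0.8, 1.2, 0.0, 2.0, 0.5] * 20   # n = 100 samples
v_lb = hoeffding_lower_bound(weighted_returns, b=2.0, delta=0.05)
# v_lb is a 95%-confidence lower bound on V^pi, computed without running pi
```

The bound tightens as n grows (the correction term shrinks like 1/√n) and loosens as the range b = G_max grows.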


Page 14

Given returns G_1, …, G_n collected from policy π_b, and
  ρ_i = Π_{k=1}^T π(A_k|S_k) / π_b(A_k|S_k),
then:

OIS (ordinary importance sampling):
  V_π(s) = (1/n) Σ_{i=1}^n G_i ρ_i

WIS (weighted importance sampling):
  V_π(s) = Σ_{i=1}^n G_i ρ_i / Σ_{j=1}^n ρ_j

PDIS (per-decision importance sampling):
  V_π(s) = (1/n) Σ_{i=1}^n G̃_i,   where
  G̃_i = ρ_{1:1} R_1 + γ ρ_{1:2} R_2 + … + γ^{T−1} ρ_{1:T} R_T
  and ρ_{a:b} = Π_{k=a}^b π(A_k|S_k) / π_b(A_k|S_k)
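The three estimators can be sketched side by side. The trajectory data and per-step ratios below are hypothetical, and γ is an assumed discount factor.

```python
def ois(returns, rhos):
    """Ordinary importance sampling: (1/n) * sum_i G_i * rho_i."""
    return sum(g * r for g, r in zip(returns, rhos)) / len(returns)

def wis(returns, rhos):
    """Weighted importance sampling: sum_i G_i * rho_i / sum_j rho_j."""
    return sum(g * r for g, r in zip(returns, rhos)) / sum(rhos)

def pdis_return(step_rhos, step_rewards, gamma=1.0):
    """Per-decision IS return for one trajectory:
    G~ = sum_t gamma^(t-1) * rho_{1:t} * R_t,
    with rho_{1:t} built up one step ratio at a time."""
    g, cum_rho, discount = 0.0, 1.0, 1.0
    for rho_t, r_t in zip(step_rhos, step_rewards):
        cum_rho *= rho_t              # rho_{1:t} = rho_{1:t-1} * rho_t
        g += discount * cum_rho * r_t
        discount *= gamma
    return g

returns = [1.0, 3.0]   # G_1, G_2 (hypothetical)
rhos    = [2.0, 0.5]   # per-episode ratios rho_1, rho_2
```

WIS is biased but typically far lower variance than OIS; PDIS weights each reward only by the ratios of the steps that preceded it, rather than the whole trajectory's ratio.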