TABULAR METHODS & VALUE APPROXIMATION REVIEW

Lecture 18

CSE4/510: Reinforcement Learning

October 24, 2019

MDP

Dynamic Programming

1. Policy Evaluation

2. Policy Improvement

Policy: On-policy

Model: Model-based

Pros:
• Finds optimal policies in polynomial time for most cases
• Guaranteed to find the optimal policy

Cons:
• Requires knowledge of the transition probabilities, which is an unrealistic requirement for many problems

Applications: Environments for which the state transition probabilities are known
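The two dynamic programming steps, policy evaluation and policy improvement, can be sketched as alternating loops over a known model. This is a minimal illustration, not the lecture's own code: the two-state MDP, the nested-list layout `P[s][a][s']` and `R[s][a]`, and every function name are assumptions made for the example.

```python
def backup(P, R, V, s, a, gamma):
    """Expected one-step return of taking action a in state s, then following V."""
    return R[s][a] + gamma * sum(p * V[s2] for s2, p in enumerate(P[s][a]))


def policy_evaluation(P, R, policy, gamma=0.9, tol=1e-8):
    """Sweep the Bellman expectation backup until the values stop changing."""
    V = [0.0] * len(P)
    while True:
        V_new = [backup(P, R, V, s, policy[s], gamma) for s in range(len(P))]
        delta = max(abs(x - y) for x, y in zip(V_new, V))
        V = V_new
        if delta < tol:
            return V


def policy_improvement(P, R, V, gamma=0.9):
    """Act greedily with respect to the current value estimates."""
    n_actions = len(P[0])
    return [
        max(range(n_actions), key=lambda a: backup(P, R, V, s, a, gamma))
        for s in range(len(P))
    ]


def policy_iteration(P, R, gamma=0.9):
    """Alternate evaluation and improvement until the policy is stable."""
    policy = [0] * len(P)
    while True:
        V = policy_evaluation(P, R, policy, gamma)
        improved = policy_improvement(P, R, V, gamma)
        if improved == policy:
            return policy, V
        policy = improved


# Hypothetical deterministic MDP with two states and two actions.
P = [
    [[1.0, 0.0], [0.0, 1.0]],  # state 0: action 0 stays, action 1 moves to state 1
    [[0.0, 1.0], [1.0, 0.0]],  # state 1: action 0 stays, action 1 moves to state 0
]
R = [[0.0, 1.0], [2.0, 0.0]]  # R[s][a]: expected immediate reward

policy, V = policy_iteration(P, R)
```

Both loops read `P` directly, which is the model-based requirement listed under Cons: without the transition probabilities, neither backup can be computed.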

Distribution vs Sample Model

1. Distribution model → B. Lists all possible outcomes and their probabilities

2. Sample model → A. Produces a single outcome, drawn according to its probability of occurring
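The difference between the two kinds of model can be seen in what each returns for the same state-action pair. The sketch below is illustrative only: the `MODEL` table, the state and action names, and both function names are hypothetical.

```python
import random

# Hypothetical one-step model: for each (state, action), the possible
# (next_state, reward) outcomes paired with their probabilities.
MODEL = {
    ("s0", "right"): [(("s1", 0.0), 0.8), (("s0", -1.0), 0.2)],
}


def distribution_model(state, action):
    """Distribution model: return every outcome together with its probability."""
    return list(MODEL[(state, action)])


def sample_model(state, action, rng=random):
    """Sample model: return one outcome, drawn according to its probability."""
    outcomes, probs = zip(*MODEL[(state, action)])
    return rng.choices(outcomes, weights=probs, k=1)[0]


all_outcomes = distribution_model("s0", "right")
one_outcome = sample_model("s0", "right", random.Random(0))
```

Dynamic programming needs the full list from a distribution model to compute expectations, whereas sample-based methods only need individual draws of this kind.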

Optimal Functions

Update Functions

1. Dynamic Programming

2. Monte Carlo

3. Q-learning

4. Temporal Difference
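For reference, the standard textbook forms of these four updates are sketched below. The notation follows Sutton and Barto, which is an assumption; the lecture may use different symbols. The Monte Carlo rule is shown in its constant step-size form and the temporal difference rule as TD(0).

```latex
\begin{align*}
\text{Dynamic Programming:}\quad
  & V_{k+1}(s) = \sum_{a} \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\,\bigl[r + \gamma V_k(s')\bigr] \\
\text{Monte Carlo:}\quad
  & V(S_t) \leftarrow V(S_t) + \alpha\,\bigl[G_t - V(S_t)\bigr] \\
\text{Q-learning:}\quad
  & Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha\,\bigl[R_{t+1} + \gamma \max_{a} Q(S_{t+1}, a) - Q(S_t, A_t)\bigr] \\
\text{Temporal Difference:}\quad
  & V(S_t) \leftarrow V(S_t) + \alpha\,\bigl[R_{t+1} + \gamma V(S_{t+1}) - V(S_t)\bigr]
\end{align*}
```

Only the dynamic programming update uses the model $p(s', r \mid s, a)$; the other three move an estimate toward a sampled target.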

Overview

MC / DP / TD ?

Monte Carlo

Policy: On-policy

Model: Model-free

Pros:
• Learns optimal behavior directly from interaction with the environment
• Can focus on a region of special interest and evaluate it accurately

Cons:
• Requires a terminal state
• Must wait until the end of an episode before the return is known; for problems with very long episodes this becomes too slow

Applications: Episodic tasks only; it cannot be used on continuing tasks
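The requirement to wait for the end of an episode follows from how the return is computed: it is accumulated backward from the final step. Below is a minimal sketch of first-visit Monte Carlo prediction under assumed conventions: each episode is a list of `(state, reward)` pairs, where the reward is the one received after leaving that state, and the function name and sample episodes are hypothetical.

```python
from collections import defaultdict


def first_visit_mc(episodes, gamma=1.0):
    """Estimate V(s) as the average return following the first visit to s."""
    returns_sum = defaultdict(float)
    returns_count = defaultdict(int)
    for episode in episodes:
        G = 0.0
        first_return = {}
        # Walk backward from the terminal step. Overwriting on each visit
        # leaves the return that followed the earliest visit to each state.
        for state, reward in reversed(episode):
            G = reward + gamma * G
            first_return[state] = G
        for state, g in first_return.items():
            returns_sum[state] += g
            returns_count[state] += 1
    return {s: returns_sum[s] / returns_count[s] for s in returns_sum}


# Two hypothetical finished episodes.
episodes = [
    [("A", 1.0), ("B", 0.0), ("A", 2.0)],
    [("B", 3.0)],
]
values = first_visit_mc(episodes)
```

Nothing can be updated until `episode` is complete, which is why the method needs a terminal state.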

SARSA vs Q-learning

SARSA

Q-learning

Policy: Off-policy

Model: Model-free

Pros:
• Easy to implement

Cons:
• Memory requirement increases with the number of states
• Does not perform well in stochastic environments

Applications: Environments with a limited number of states and discrete action spaces
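The on-policy versus off-policy distinction between SARSA and Q-learning comes down to one term in the update target. The sketch below uses a dictionary-backed Q-table; the function names, state and action labels, and step sizes are assumptions for illustration, and terminal-state handling is omitted.

```python
from collections import defaultdict


def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    """On-policy: bootstrap from the action the agent actually takes next."""
    target = r + gamma * Q[(s_next, a_next)]
    Q[(s, a)] += alpha * (target - Q[(s, a)])


def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """Off-policy: bootstrap from the greedy action, whatever is taken next."""
    target = r + gamma * max(Q[(s_next, b)] for b in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])


ACTIONS = ["left", "right"]

# Same transition, same starting values, different targets.
Q_sarsa = defaultdict(float, {("s1", "left"): 1.0, ("s1", "right"): 5.0})
Q_qlearn = defaultdict(float, {("s1", "left"): 1.0, ("s1", "right"): 5.0})

# The behavior policy happens to pick the non-greedy action "left" in s1.
sarsa_update(Q_sarsa, "s0", "right", 0.0, "s1", "left", alpha=0.5)
q_learning_update(Q_qlearn, "s0", "right", 0.0, "s1", ACTIONS, alpha=0.5)
```

The table holds one entry per state-action pair, which is the memory cost listed under Cons.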

Function Approximation

Deep Q-network (DQN)


Policy: Off-policy

Model: Model-free

Pros:
• Can generalize to unseen states
• Input is just a state

Cons:
• May overestimate values
• Not applicable to continuous action spaces

Applications: Environments with a limited number of states and discrete action spaces
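The overestimation listed under Cons can be reproduced without any network: taking a max over noisy value estimates is biased upward even when the noise has zero mean. The sketch below is a standalone simulation, not DQN itself; the function name, the Gaussian noise model, and all parameter values are assumptions.

```python
import random


def mean_max_of_noisy_estimates(n_actions=5, noise_std=1.0, n_trials=10_000, seed=0):
    """Average of max_a Q_hat(a) when every true Q(a) is exactly 0."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_trials):
        # Each estimate is the true value (0) plus zero-mean Gaussian noise.
        estimates = [rng.gauss(0.0, noise_std) for _ in range(n_actions)]
        total += max(estimates)
    return total / n_trials


bias = mean_max_of_noisy_estimates()
```

The true maximum is 0, yet `bias` comes out clearly positive, because the max picks whichever estimate happened to receive the largest noise.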

Double Deep Q-network (DDQN)


Policy: Off-policy

Model: Model-free

Pros:
• Value estimation is more accurate than in DQN
• Input is just a state

Cons:
• May take longer to train

Applications: Environments with a limited number of states and discrete action spaces
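The more accurate value estimate comes from a change in the bootstrap target: Double DQN selects the next action with the online network and evaluates it with the target network. The sketch below works on plain lists of next-state Q-values rather than real networks; the function names and the numbers are hypothetical.

```python
def dqn_target(reward, q_target_next, gamma=0.99, done=False):
    """DQN: the target network both selects and evaluates the next action."""
    if done:
        return reward
    return reward + gamma * max(q_target_next)


def double_dqn_target(reward, q_online_next, q_target_next, gamma=0.99, done=False):
    """Double DQN: the online network selects, the target network evaluates."""
    if done:
        return reward
    best_action = max(range(len(q_online_next)), key=lambda a: q_online_next[a])
    return reward + gamma * q_target_next[best_action]


# Hypothetical Q-values for the next state from the two networks.
q_online_next = [1.0, 2.0, 0.5]
q_target_next = [1.5, 1.0, 3.0]

y_dqn = dqn_target(1.0, q_target_next, gamma=0.9)
y_ddqn = double_dqn_target(1.0, q_online_next, q_target_next, gamma=0.9)
```

In this example the DQN target uses the target network's largest value (3.0), while the Double DQN target uses that network's value for the action the online network prefers (1.0), so an inflated estimate in one network is less likely to propagate.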

Dueling Deep Q-network (Dueling DQN)

Prioritized Experience Replay (PER)
