
ECE 517: Reinforcement Learning in Artificial Intelligence

Lecture 17: Case Studies and Gradient Policy

Dr. Itamar Arel

College of Engineering, Department of Electrical Engineering and Computer Science

The University of Tennessee, Fall 2015

October 29, 2015


Introduction

We’ll discuss several case studies of reinforcement learning

Illustrate some of the trade-offs and issues that arise in real-world applications

For example, we emphasize how domain knowledge is incorporated into the formulation and solution of the problem

We also highlight the representation issues that are so often critical to successful applications

Applications of RL are still far from routine and typically require as much art as science

Making applications easier and more straightforward is one of the goals in current RL research


TD-Gammon (Tesauro 1992, 1994, 1995, …)

One of the most impressive early applications of RL to date is Gerry Tesauro's (IBM) application to the game of backgammon

TD-Gammon required little backgammon knowledge, yet learned to play extremely well, near the level of the world's strongest grandmasters

The learning algorithm was a straightforward combination of the TD(λ) algorithm and nonlinear function approximation

FA using a FFNN trained by backpropagating TD errors

There are probably more professional backgammon players than there are professional chess players

BG is in part a game of chance, which can be viewed as a large MDP


TD-Gammon (cont.)

The game is played with 15 white and 15 black pieces on a board of 24 locations, called points

Here’s a typical position early in the game, seen from the perspective of the white player


TD-Gammon (cont.)

White has just rolled a 5 and a 2, so it can move one of its pieces 5 steps and one (possibly the same piece) 2 steps

The objective is to advance all pieces to points 19-24, and then off the board

Hitting – landing on a point occupied by a single opposing piece removes it (sends it to the bar)

30 pieces and 24 locations imply an enormous number of configurations (state set is ~10^20)

Effective branching factor of ~400, since there are ~20 distinct dice rolls and each roll has ~20 ways of being played


TD-Gammon - details

Although the game is highly stochastic, a complete description of the game's state is available at all times

The estimated value of any state was meant to predict the probability of winning starting from that state

Reward: 0 at all times except those on which the game is won, when it is 1

Episodic (game = episode), undiscounted

Non-linear form of TD(λ) using a FF neural network

Backpropagation of TD error (see the sketch below)

4 input units for each point; unary encoding of number of white pieces, plus other features

Use of “after-states”

Learning during self-play – fully incremental
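To make this concrete, here is a minimal sketch (Python/NumPy, not Tesauro's code) of nonlinear TD(λ): a small feedforward network estimates the probability of winning from the current after-state, and the TD error is backpropagated through per-weight eligibility traces. The network size, input encoding and step sizes are illustrative assumptions.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class TDLambdaNet:
    # Minimal TD(lambda) value network: one hidden layer, scalar output
    # interpreted as the probability of winning from the current position.
    def __init__(self, n_inputs, n_hidden, alpha=0.1, lam=0.7, gamma=1.0):
        rng = np.random.default_rng(0)
        self.W1 = rng.normal(0.0, 0.1, (n_hidden, n_inputs))
        self.W2 = rng.normal(0.0, 0.1, (1, n_hidden))
        self.alpha, self.lam, self.gamma = alpha, lam, gamma
        self.reset_traces()

    def reset_traces(self):
        # One eligibility trace per weight; cleared at the start of each game.
        self.z1 = np.zeros_like(self.W1)
        self.z2 = np.zeros_like(self.W2)

    def value(self, x):
        h = sigmoid(self.W1 @ x)
        return sigmoid(self.W2 @ h)[0], h

    def step(self, x, x_next, reward, terminal):
        # One TD(lambda) update for the transition from after-state x to x_next.
        v, h = self.value(x)
        v_next = 0.0 if terminal else self.value(x_next)[0]
        delta = reward + self.gamma * v_next - v          # TD error

        # Gradient of V(x) w.r.t. the weights (backprop through both sigmoids).
        dv = v * (1.0 - v)
        grad_W2 = dv * h[np.newaxis, :]
        grad_W1 = np.outer(dv * self.W2[0] * h * (1.0 - h), x)

        # Decay-and-accumulate traces, then move every weight along delta * trace.
        self.z2 = self.gamma * self.lam * self.z2 + grad_W2
        self.z1 = self.gamma * self.lam * self.z1 + grad_W1
        self.W2 += self.alpha * delta * self.z2
        self.W1 += self.alpha * delta * self.z1
        return delta

During self-play, each move would pick the legal after-state with the highest estimated value, and step() would be called once per move, with reward 1 only on the winning terminal transition.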


TD-Gammon – Neural Network Employed


Summary of TD-Gammon Results

Two players played against each other

Each had no prior knowledge of the game

Only the rules of the game were imposed

Humans learn from machines: TD-Gammon learned to play certain opening positions differently than was the convention among the best human players


Rebuttal on TD-Gammon

For an alternative view, see “Why did TD-Gammon Work?”, Jordan Pollack and Alan Blair, NIPS (1997)

Claim: it was the “co-evolutionary training strategy, playing games against itself, which led to the success”

No need for dealing with exploration/exploitation

Any such approach would work with backgammon

No sensitivity of state to value - success does not extend to other problems

e.g., Tetris and maze-type problems, where the exploration issue comes up


The Acrobot

Robotic application of RL

Roughly analogous to a gymnast swinging on a high bar

The first joint (corresponding to the hands on the bar) cannot exert torque

The second joint (corresponding to the gymnast bending at the waist) can

This system has been widely studied by control engineers and machine learning researchers


The Acrobot (cont.)

One objective for controlling the Acrobot is to swing the tip (the "feet") above the first joint by an amount equal to the length of one of the links, in minimum time

In this task, the torque applied at the second joint is limited to three choices: positive torque of a fixed magnitude, negative torque of the same magnitude, or no torque

A reward of –1 is given on all time steps until the goal is reached, which ends the episode. No discounting is used

Thus, the optimal value of any state is the minimum time to reach the goal (an integer number of steps)

Sutton (1996) addressed the Acrobot swing-up task in an on-line, model-free context (a sketch of the update follows below)
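As a rough illustration of this kind of learner (Sutton 1996 used Sarsa(λ) over tile-coded features of the continuous state), below is a minimal Sarsa(λ) sketch with binary features and replacing traces. The env simulator, the features(state, action) coder, and all constants are assumptions for illustration, not the original implementation.

import numpy as np

# Assumed constants and helpers (illustrative, not Sutton's values):
N_FEATURES = 4096                 # size of the binary (e.g., tile-coded) feature space
ACTIONS = (-1.0, 0.0, +1.0)       # torque choices at the second joint
alpha, lam, gamma = 0.1, 0.9, 1.0

w = np.zeros(N_FEATURES)          # one weight per binary feature
z = np.zeros(N_FEATURES)          # eligibility traces

def q(features, state, action):
    # Action value = sum of the weights of the active binary features.
    return w[features(state, action)].sum()

def run_episode(env, features, max_steps=100000):
    # One episode of Sarsa(lambda) with replacing traces on binary features.
    z[:] = 0.0
    state = env.reset()
    action = max(ACTIONS, key=lambda a: q(features, state, a))   # greedy selection
    for _ in range(max_steps):
        active = features(state, action)
        z[active] = 1.0                                          # replacing traces
        next_state, reward, done = env.step(action)              # reward is -1 per step
        delta = reward - w[active].sum()
        if done:                                                 # goal reached: episode ends
            w[:] += alpha * delta * z
            return
        next_action = max(ACTIONS, key=lambda a: q(features, next_state, a))
        delta += gamma * w[features(next_state, next_action)].sum()
        w[:] += alpha * delta * z
        z[:] *= gamma * lam
        state, action = next_state, next_action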


Acrobot Learning Curves for Sarsa(λ)


Typical Acrobot Learned Behavior


RL in Robotics

Robot motor capabilities were investigated using RL

Walking, grabbing and delivering – MIT Media Lab

Robocup competitions – soccer games

Sony AIBOs are commonly employed

Maze-type problems

Balancing themselves on an unstable platform

Multi-dimensional input streams


Policy Gradient Methods

Assume that our policy, π, has a set of n real-valued parameters, θ = {θ1, θ2, θ3, ..., θn}

Running the policy with a particular θ results in a reward, rθ

Estimate the reward gradient, ∂r/∂θi, for each θi, and update

θi ← θi + α · ∂r/∂θi

where α is another learning rate (a small sketch of one way to do this follows below)
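The gradient ∂r/∂θi is rarely available in closed form; one simple (if sample-hungry) way to estimate it is by finite differences, one parameter at a time, followed by the gradient-ascent step above. The sketch below assumes a hypothetical run_policy(theta) function that runs the parameterized policy and returns its reward (possibly averaged over several rollouts).

import numpy as np

def estimate_gradient(run_policy, theta, eps=1e-2):
    # Central finite-difference estimate of dr/dtheta_i, one parameter at a time.
    theta = np.asarray(theta, dtype=float)
    grad = np.zeros_like(theta)
    for i in range(theta.size):
        up, down = theta.copy(), theta.copy()
        up[i] += eps
        down[i] -= eps
        grad[i] = (run_policy(up) - run_policy(down)) / (2.0 * eps)
    return grad

def hill_climb(run_policy, theta, alpha=0.05, n_iters=100):
    # Gradient ascent in policy space: theta_i <- theta_i + alpha * dr/dtheta_i
    theta = np.asarray(theta, dtype=float)
    for _ in range(n_iters):
        theta = theta + alpha * estimate_gradient(run_policy, theta)
    return theta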


Policy Gradient Methods (cont.)

This results in hill-climbing in policy space

So, it's subject to all the problems of hill-climbing

But, we can also use tricks from search theory, like random starting points and momentum terms

This is a good approach if you have a parameterized policy

Let's assume we have a “reasonable” starting policy

Typically faster than value-based methods

“Safe” exploration, if you have a good policy

Learns locally-best parameters for that policy


An Example: Learning to Walk

RoboCup 4-legged league

Walking quickly is a big advantage

Historically, gaits were tuned manually

Robots have a parameterized gait controller with 12 parameters

Controls step length, height, etc.

The robot walks across the soccer field and is timed

Reward is a function of the time taken

The robots know when to stop (distance measure)


An Example: Learning to Walk (cont.)

Basic idea:

1. Pick an initial θ = {θ1, θ2, ..., θ12}

2. Generate N test parameter settings by perturbing θ:
   θj = {θ1 + δ1, θ2 + δ2, ..., θ12 + δ12}, with each δi ∈ {-ε, 0, +ε}

3. Test each setting and observe the rewards: θj → rj

4. For each θi ∈ θ, calculate the average rewards θi+, θi0, θi- (the average reward over all settings in which θi was perturbed by +ε, 0, -ε respectively) and set

   θ'i = θi + ε   if θi+ is largest
   θ'i = θi       if θi0 is largest
   θ'i = θi - ε   if θi- is largest

5. Set θ ← θ' and go to 2 (a code sketch of this procedure follows below)
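A compact sketch of this procedure, assuming a hypothetical evaluate(theta) function that runs the timed walk for a parameter vector and returns its reward; the number of test settings, ε and the iteration count are illustrative, not the values used on the real robots.

import numpy as np

def learn_gait(evaluate, theta, n_settings=15, eps=0.05, n_iters=50, seed=0):
    # Finite-difference policy search over the gait parameters (12 on the AIBO).
    rng = np.random.default_rng(seed)
    theta = np.asarray(theta, dtype=float)
    for _ in range(n_iters):
        # Step 2: generate N random perturbations, each component drawn from {-eps, 0, +eps}.
        deltas = rng.choice([-eps, 0.0, +eps], size=(n_settings, theta.size))
        # Step 3: test each perturbed setting and record its reward.
        rewards = np.array([evaluate(theta + d) for d in deltas])
        new_theta = theta.copy()
        for i in range(theta.size):
            # Step 4: average reward when parameter i was perturbed by +eps, 0, -eps.
            avgs = []
            for sign in (+1, 0, -1):
                mask = deltas[:, i] == sign * eps
                avgs.append(rewards[mask].mean() if mask.any() else -np.inf)
            avg_plus, avg_zero, avg_minus = avgs
            if avg_plus >= max(avg_zero, avg_minus):
                new_theta[i] += eps
            elif avg_minus > max(avg_plus, avg_zero):
                new_theta[i] -= eps
            # otherwise the unperturbed value was best: leave theta_i unchanged
        # Step 5: adopt the updated parameters and repeat.
        theta = new_theta
    return theta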


An Example: Learning to Walk (cont.)

Initial

Final


Value Function or Policy Gradient?

When should I use policy gradient?

When there's a parameterized policy

When there’s a high-dimensional state space

When we expect the gradient to be smooth

Typically on episodic tasks (e.g. AIBO walking)

When should I use a value-based method?

When there is no parameterized policy

When we have no idea how to solve the problem (i.e. no known structure)


Summary

RL is a powerful tool which can support a wide range of applications

There is an art to defining the observations, states, rewards and actions

Main goal: formulate an “as simple as possible” representation

Policy Gradient methods directly search in policy space