
ECE 517: Reinforcement Learning in Artificial Intelligence

Lecture 17: Case Studies and Gradient Policy

Dr. Itamar Arel

College of Engineering, Department of Electrical Engineering and Computer Science

The University of Tennessee, Fall 2015

October 29, 2015


Introduction

We’ll discuss several case studies of reinforcement learning

Illustrate some of the trade-offs and issues that arise in real-world applications

For example, we emphasize how domain knowledge is incorporated into the formulation and solution of the problem

We also highlight the representation issues that are so often critical to successful applications

Applications of RL are still far from routine and typically require as much art as science

Making applications easier and more straightforward is one of the goals in current RL research


TD-Gammon (Tesauro 1992, 1994, 1995, …)

One of the most impressive early applications of RL to date is Gerry Tesauro's (IBM) application to the game of backgammon

TD-Gammon required little backgammon knowledge, yet learned to play extremely well, near the level of the world's strongest grandmasters

The learning algorithm was a straightforward combination of the TD(λ) algorithm and nonlinear function approximation

FA using a FFNN trained by backpropagating TD errors

There are probably more professional backgammon players than there are professional chess players

BG is in part a game of chance, which can be viewed as a large MDP


TD-Gammon (cont.)

The game is played with 15 white and 15 black pieces on a board of 24 locations, called points

Here’s a typical position early in the game, seen from the perspective of the white player


TD-Gammon (cont.)

White has just rolled a 5 and a 2, so it can move one of its pieces 5 steps and one (possibly the same piece) 2 steps

The objective is to advance all pieces to points 19-24, and then off the board

Hitting – landing on a point occupied by a single opposing piece removes it (sends it to the bar)

30 pieces and 24 locations imply an enormous number of configurations (state set is ~10^20)

Effective branching factor of ~400, since there are ~20 distinct dice rolls and each roll has ~20 ways of being played


TD-Gammon - details

Although the game is highly stochastic, a complete description of the game's state is available at all times

The estimated value of any state was meant to predict the probability of winning starting from that state

Reward: 0 at all times except those on which the game is won, when it is 1

Episodic (game = episode), undiscounted

Non-linear form of TD(λ) using a FF neural network

Backpropagation of TD error (see the sketch below)

4 input units for each point; unary encoding of number of white pieces, plus other features

Use of “after-states”

Learning during self-play – fully incremental
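To make this concrete, here is a minimal sketch (Python/NumPy, not Tesauro's code) of nonlinear TD(λ): a small feedforward network estimates the probability of winning from the current after-state, and the TD error is backpropagated through per-weight eligibility traces. The network size, input encoding and step sizes are illustrative assumptions.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class TDLambdaNet:
    # Minimal TD(lambda) value network: one hidden layer, scalar output
    # interpreted as the probability of winning from the current position.
    def __init__(self, n_inputs, n_hidden, alpha=0.1, lam=0.7, gamma=1.0):
        rng = np.random.default_rng(0)
        self.W1 = rng.normal(0.0, 0.1, (n_hidden, n_inputs))
        self.W2 = rng.normal(0.0, 0.1, (1, n_hidden))
        self.alpha, self.lam, self.gamma = alpha, lam, gamma
        self.reset_traces()

    def reset_traces(self):
        # One eligibility trace per weight; cleared at the start of each game.
        self.z1 = np.zeros_like(self.W1)
        self.z2 = np.zeros_like(self.W2)

    def value(self, x):
        h = sigmoid(self.W1 @ x)
        return sigmoid(self.W2 @ h)[0], h

    def step(self, x, x_next, reward, terminal):
        # One TD(lambda) update for the transition from after-state x to x_next.
        v, h = self.value(x)
        v_next = 0.0 if terminal else self.value(x_next)[0]
        delta = reward + self.gamma * v_next - v          # TD error

        # Gradient of V(x) w.r.t. the weights (backprop through both sigmoids).
        dv = v * (1.0 - v)
        grad_W2 = dv * h[np.newaxis, :]
        grad_W1 = np.outer(dv * self.W2[0] * h * (1.0 - h), x)

        # Decay-and-accumulate traces, then move every weight along delta * trace.
        self.z2 = self.gamma * self.lam * self.z2 + grad_W2
        self.z1 = self.gamma * self.lam * self.z1 + grad_W1
        self.W2 += self.alpha * delta * self.z2
        self.W1 += self.alpha * delta * self.z1
        return delta

During self-play, each move would pick the legal after-state with the highest estimated value, and step() would be called once per move, with reward 1 only on the winning terminal transition.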


TD-Gammon – Neural Network Employed


Summary of TD-Gammon Results

Two players played against each other

Each had no prior knowledge of the game

Only the rules of the game were imposed

Humans learn from machines: TD-Gammon learned to play certain opening positions differently than was the convention among the best human players


Rebuttal on TD-Gammon

For an alternative view, see “Why did TD-Gammon Work?”, Jordan Pollack and Alan Blair, NIPS (1997)

Claim: it was the “co-evolutionary training strategy, playing games against itself, which led to the success”

No need for dealing with exploration/exploitation

Any such approach would work with backgammon

No sensitivity of state to value - success does not extend to other problems

e.g., Tetris and maze-type problems, where the exploration issue comes up


The Acrobot

Robotic application of RL

Roughly analogous to a gymnast swinging on a high bar

The first joint (corresponding to the hands on the bar) cannot exert torque

The second joint (corresponding to the gymnast bending at the waist) can

This system has been widely studied by control engineers and machine learning researchers


The Acrobot (cont.)

One objective for controlling the Acrobot is to swing the tip (the "feet") above the first joint by an amount equal to the length of one of the links, in minimum time

In this task, the torque applied at the second joint is limited to three choices: positive torque of a fixed magnitude, negative torque of the same magnitude, or no torque

A reward of –1 is given on all time steps until the goal is reached, which ends the episode. No discounting is used

Thus, the optimal value of any state is the minimum time to reach the goal (an integer number of steps)

Sutton (1996) addressed the Acrobot swing-up task in an on-line, model-free context (a sketch of the update follows below)
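As a rough illustration of this kind of learner (Sutton 1996 used Sarsa(λ) over tile-coded features of the continuous state), below is a minimal Sarsa(λ) sketch with binary features and replacing traces. The env simulator, the features(state, action) coder, and all constants are assumptions for illustration, not the original implementation.

import numpy as np

# Assumed constants and helpers (illustrative, not Sutton's values):
N_FEATURES = 4096                 # size of the binary (e.g., tile-coded) feature space
ACTIONS = (-1.0, 0.0, +1.0)       # torque choices at the second joint
alpha, lam, gamma = 0.1, 0.9, 1.0

w = np.zeros(N_FEATURES)          # one weight per binary feature
z = np.zeros(N_FEATURES)          # eligibility traces

def q(features, state, action):
    # Action value = sum of the weights of the active binary features.
    return w[features(state, action)].sum()

def run_episode(env, features, max_steps=100000):
    # One episode of Sarsa(lambda) with replacing traces on binary features.
    z[:] = 0.0
    state = env.reset()
    action = max(ACTIONS, key=lambda a: q(features, state, a))   # greedy selection
    for _ in range(max_steps):
        active = features(state, action)
        z[active] = 1.0                                          # replacing traces
        next_state, reward, done = env.step(action)              # reward is -1 per step
        delta = reward - w[active].sum()
        if done:                                                 # goal reached: episode ends
            w[:] += alpha * delta * z
            return
        next_action = max(ACTIONS, key=lambda a: q(features, next_state, a))
        delta += gamma * w[features(next_state, next_action)].sum()
        w[:] += alpha * delta * z
        z[:] *= gamma * lam
        state, action = next_state, next_action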


Acrobot Learning Curves for Sarsa(λ)


Typical Acrobot Learned Behavior


RL in Robotics

Robot motor capabilities were investigated using RL

Walking, grabbing and delivering – MIT Media Lab

Robocup competitions – soccer games

Sony AIBOs are commonly employed

Maze-type problems

Balancing themselves on an unstable platform

Multi-dimensional input streams


Policy Gradient Methods

Assume that our policy, π, has a set of n real-valued parameters, θ = {θ1, θ2, θ3, ..., θn}

Running the policy with a particular θ results in a reward, rθ

Estimate the reward gradient, ∂r/∂θi, for each θi, and update

θi ← θi + α · ∂r/∂θi

where α is another learning rate (a small sketch of one way to do this follows below)
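The gradient ∂r/∂θi is rarely available in closed form; one simple (if sample-hungry) way to estimate it is by finite differences, one parameter at a time, followed by the gradient-ascent step above. The sketch below assumes a hypothetical run_policy(theta) function that runs the parameterized policy and returns its reward (possibly averaged over several rollouts).

import numpy as np

def estimate_gradient(run_policy, theta, eps=1e-2):
    # Central finite-difference estimate of dr/dtheta_i, one parameter at a time.
    theta = np.asarray(theta, dtype=float)
    grad = np.zeros_like(theta)
    for i in range(theta.size):
        up, down = theta.copy(), theta.copy()
        up[i] += eps
        down[i] -= eps
        grad[i] = (run_policy(up) - run_policy(down)) / (2.0 * eps)
    return grad

def hill_climb(run_policy, theta, alpha=0.05, n_iters=100):
    # Gradient ascent in policy space: theta_i <- theta_i + alpha * dr/dtheta_i
    theta = np.asarray(theta, dtype=float)
    for _ in range(n_iters):
        theta = theta + alpha * estimate_gradient(run_policy, theta)
    return theta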


Policy Gradient Methods (cont.)

This results in hill-climbing in policy space

So, it's subject to all the problems of hill-climbing

But, we can also use tricks from search theory, like random starting points and momentum terms

This is a good approach if you have a parameterized policy

Let's assume we have a “reasonable” starting policy

Typically faster than value-based methods

“Safe” exploration, if you have a good policy

Learns locally-best parameters for that policy


An Example: Learning to Walk

RoboCup 4-legged league

Walking quickly is a big advantage

Historically, gaits were tuned manually

Robots have a parameterized gait controller with 12 parameters

Controls step length, height, etc.

The robot walks across the soccer field and is timed

Reward is a function of the time taken

The robots know when to stop (distance measure)


An Example: Learning to Walk (cont.)

Basic idea:

1. Pick an initial θ = {θ1, θ2, ..., θ12}

2. Generate N test parameter settings by perturbing θ:
   θj = {θ1 + δ1, θ2 + δ2, ..., θ12 + δ12}, with each δi ∈ {-ε, 0, +ε}

3. Test each setting and observe the rewards: θj → rj

4. For each θi ∈ θ, calculate the average rewards θi+, θi0, θi- (the average reward over all settings in which θi was perturbed by +ε, 0, -ε respectively) and set

   θ'i = θi + ε   if θi+ is largest
   θ'i = θi       if θi0 is largest
   θ'i = θi - ε   if θi- is largest

5. Set θ ← θ' and go to 2 (a code sketch of this procedure follows below)
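A compact sketch of this procedure, assuming a hypothetical evaluate(theta) function that runs the timed walk for a parameter vector and returns its reward; the number of test settings, ε and the iteration count are illustrative, not the values used on the real robots.

import numpy as np

def learn_gait(evaluate, theta, n_settings=15, eps=0.05, n_iters=50, seed=0):
    # Finite-difference policy search over the gait parameters (12 on the AIBO).
    rng = np.random.default_rng(seed)
    theta = np.asarray(theta, dtype=float)
    for _ in range(n_iters):
        # Step 2: generate N random perturbations, each component drawn from {-eps, 0, +eps}.
        deltas = rng.choice([-eps, 0.0, +eps], size=(n_settings, theta.size))
        # Step 3: test each perturbed setting and record its reward.
        rewards = np.array([evaluate(theta + d) for d in deltas])
        new_theta = theta.copy()
        for i in range(theta.size):
            # Step 4: average reward when parameter i was perturbed by +eps, 0, -eps.
            avgs = []
            for sign in (+1, 0, -1):
                mask = deltas[:, i] == sign * eps
                avgs.append(rewards[mask].mean() if mask.any() else -np.inf)
            avg_plus, avg_zero, avg_minus = avgs
            if avg_plus >= max(avg_zero, avg_minus):
                new_theta[i] += eps
            elif avg_minus > max(avg_plus, avg_zero):
                new_theta[i] -= eps
            # otherwise the unperturbed value was best: leave theta_i unchanged
        # Step 5: adopt the updated parameters and repeat.
        theta = new_theta
    return theta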


An Example: Learning to Walk (cont.)

Initial

Final


Value Function or Policy Gradient?

When should I use policy gradient?

When there's a parameterized policy

When there’s a high-dimensional state space

When we expect the gradient to be smooth

Typically on episodic tasks (e.g. AIBO walking)

When should I use a value-based method?

When there is no parameterized policy

When we have no idea how to solve the problem (i.e. no known structure)


Summary

RL is a powerful tool which can support a wide range of applications

There is an art to defining the observations, states, rewards and actions

Main goal: formulate an “as simple as possible” representation

Policy Gradient methods directly search in policy space