Racing F-ZERO with Imitation Learning

Stephanie Wang, Michael Brennan
Stanford University

Source: cs229.stanford.edu/proj2017/final-posters/5140437.pdf


Overview

- We are motivated by recent successes in applying machine learning to play games at a level that surpasses human experts. We use the Retro Learning Environment with the SNES game F-Zero to reproduce work in imitation learning.
- We implement Dataset Aggregation (DAgger) to address the dataset mismatch problem prevalent in sequential prediction tasks, where the states a learned policy visits drift away from the states in its training data.

Data Acquisition

- The Retro Learning Environment (RLE), a learning framework based on the Arcade Learning Environment, supports SNES games: http://github.com/nadavbh12/Retro-Learning-Environment
- Through 3 human playthroughs, we acquired 29,375 RGBA images labeled with controller input.

Automatic Player

- For automatic play, we read aspects of the game state from the emulator and base our decisions on a one-step greedy search. For each action in [RIGHT, NOOP, LEFT], we simulate the next 30 frames with that action choice, then greedily choose the action that maximizes the reward defined below:

reward = α (isForward * speed) + β (score) + γ (power)    (1)
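The one-step greedy search can be sketched as follows. This is only an illustration under stated assumptions: `ToyState`, `step`, the track half-width, and the penalty values are hypothetical stand-ins for the emulator state and dynamics, and the weights `alpha`, `beta`, `gamma` are placeholders because the poster does not report the values used. Only the action set, the 30-frame lookahead, and the form of the reward come from the poster.

```python
from dataclasses import dataclass, replace

ACTIONS = ["RIGHT", "NOOP", "LEFT"]   # action set from the poster
LOOKAHEAD_FRAMES = 30                 # lookahead length from the poster
MOVE = {"RIGHT": 1.0, "NOOP": 0.0, "LEFT": -1.0}
TRACK_HALF_WIDTH = 10.0               # hypothetical toy track width


@dataclass(frozen=True)
class ToyState:
    """Hypothetical stand-in for the state read from the emulator."""
    offset: float = 0.0      # lateral distance from the track center
    speed: float = 100.0
    score: float = 0.0
    power: float = 100.0
    is_forward: bool = True


def step(state, action):
    """Toy dynamics: leaving the track costs speed and power."""
    offset = state.offset + MOVE[action]
    if abs(offset) > TRACK_HALF_WIDTH:
        return replace(state, offset=offset,
                       speed=max(state.speed - 5.0, 0.0),
                       power=state.power - 2.0)
    return replace(state, offset=offset, score=state.score + 1.0)


def reward(state, alpha=1.0, beta=0.1, gamma=0.5):
    """Equation (1); the weight values here are assumed, not from the poster."""
    return (alpha * (float(state.is_forward) * state.speed)
            + beta * state.score
            + gamma * state.power)


def greedy_action(state):
    """Simulate each action for 30 frames and keep the best-scoring one."""
    best_action, best_reward = None, float("-inf")
    for action in ACTIONS:
        sim = state
        for _ in range(LOOKAHEAD_FRAMES):
            sim = step(sim, action)
        r = reward(sim)
        if r > best_reward:
            best_action, best_reward = action, r
    return best_action


print(greedy_action(ToyState(offset=25.0)))  # car is off track to the right
```

In the real setting the simulation step would presumably save and restore emulator state rather than copy a small dataclass, but the selection logic stays the same.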

DAgger Algorithm

Train CNN on dataset D as initial policy π̂_1
for i = 1 to N
    begin game
    while game has not ended
        run the CNN to obtain new trajectories
        if pred probs < 0.5 or rand(0-1) < ε
            get D_i = {(s, π*(s))} given by automatic player
            aggregate datasets: D ← D ∪ D_i
    end while
    Train CNN π̂_{i+1} on D
end for
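The loop above can be sketched in runnable form on a toy problem. Everything concrete here is an assumption made for illustration: the state is a single integer `offset`, `expert` stands in for the automatic player π*, and `train` builds a per-state frequency table in place of the CNN. Reading "pred probs < 0.5" as "the highest predicted probability is below 0.5" is also an interpretation, and the values of `epsilon`, `n_iters`, and `episode_len` are placeholders.

```python
import random
from collections import Counter, defaultdict

ACTIONS = ["RIGHT", "NOOP", "LEFT"]
MOVE = {"RIGHT": 1, "NOOP": 0, "LEFT": -1}


def expert(offset):
    """Stand-in for the automatic player: steer back toward offset 0."""
    if offset > 0:
        return "LEFT"
    if offset < 0:
        return "RIGHT"
    return "NOOP"


def train(dataset):
    """Stand-in for CNN training: count expert actions seen per state."""
    counts = defaultdict(Counter)
    for state, action in dataset:
        counts[state][action] += 1
    return counts


def predict_probs(policy, state):
    """Action probabilities; uniform for states never seen in training."""
    seen = policy.get(state)
    if not seen:
        return {a: 1.0 / len(ACTIONS) for a in ACTIONS}
    total = sum(seen.values())
    return {a: seen[a] / total for a in ACTIONS}


def dagger(initial_data, n_iters=5, episode_len=50, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    dataset = list(initial_data)
    policy = train(dataset)                      # initial policy
    for _ in range(n_iters):
        offset = 0                               # begin game
        for _ in range(episode_len):             # until the game ends
            probs = predict_probs(policy, offset)
            action = max(probs, key=probs.get)   # learner picks the action
            # Query the expert when the learner is unsure, or at random.
            if probs[action] < 0.5 or rng.random() < epsilon:
                dataset.append((offset, expert(offset)))   # D <- D U D_i
            # The learner's own action drives the rollout, plus random drift.
            drift = rng.choice([-1, 0, 1])
            offset = max(-5, min(5, offset + MOVE[action] + drift))
        policy = train(dataset)                  # retrain on aggregated D
    return policy, dataset


policy, data = dagger([(0, "NOOP")])
print(len(data), "labeled states after aggregation")
```

The key property the sketch preserves is that states are collected by running the learner, while labels always come from the expert, so the aggregated dataset covers the states the learner actually reaches.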

Training Convolutional Neural Network

- Based on the NVIDIA autopilot CNN for self-driving cars
- 39,366,570 total parameters
- Implemented in Keras with a TensorFlow backend, using 1 GPU
- Data split into 80% train and 20% test examples
- Images and labels are shuffled prior to training
- Raw pixel input of size 224 × 256 × 3
- Categorical cross-entropy loss function, where p is the true (one-hot) action label and q is the CNN's predicted distribution over actions:

\[ H(p, q) = -\sum_{x} p(x) \log q(x) \]
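As a small worked illustration of two of the items above, the sketch below shuffles image/label pairs together before an 80/20 split and evaluates the categorical cross-entropy for one example. The function names, the fixed seed, and the clipping constant `eps` are assumptions for this sketch; the poster's pipeline used Keras, whose built-in loss would normally be used instead.

```python
import math
import random


def shuffle_split(images, labels, train_frac=0.8, seed=0):
    """Shuffle images and labels as pairs, then split into train/test."""
    pairs = list(zip(images, labels))
    random.Random(seed).shuffle(pairs)   # pairs stay aligned while shuffling
    cut = int(round(len(pairs) * train_frac))
    return pairs[:cut], pairs[cut:]


def categorical_cross_entropy(p, q, eps=1e-12):
    """H(p, q) = -sum_x p(x) log q(x); q is clipped to avoid log(0)."""
    return -sum(pi * math.log(max(qi, eps)) for pi, qi in zip(p, q))


# One-hot label for NOOP over [RIGHT, NOOP, LEFT] and a predicted distribution.
p = [0.0, 1.0, 0.0]
q = [0.1, 0.8, 0.1]
print(categorical_cross_entropy(p, q))   # equals -log(0.8), about 0.223

train, test = shuffle_split(list(range(10)), ["NOOP"] * 10)
print(len(train), len(test))
```

With a one-hot label, the sum reduces to the negative log of the probability assigned to the correct action, which is why confident correct predictions give a loss near zero.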

Results and Evaluation

- CNN accuracy is 95% on the test set.
- DAgger, run for 30 iterations, is able to complete full laps.
- The saliency map visualizes attention over the 'NOOP' class. As expected, the boundaries of the track are most salient. Surprisingly, it also picks up the power bar and clock.
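The poster does not say how the saliency map was computed. One common approach is gradient saliency, which takes the absolute value of the derivative of a class score with respect to each input pixel. The sketch below illustrates that idea only: it estimates the gradient by central finite differences on a tiny image, and the linear `noop_score` with its `weights` is a hypothetical stand-in for the CNN's 'NOOP' output.

```python
def saliency_map(image, score_fn, eps=1e-4):
    """Estimate |d score / d pixel| for every pixel by central differences."""
    saliency = []
    for r, row in enumerate(image):
        out_row = []
        for c in range(len(row)):
            plus = [list(x) for x in image]    # copy, then nudge one pixel up
            minus = [list(x) for x in image]   # copy, then nudge one pixel down
            plus[r][c] += eps
            minus[r][c] -= eps
            grad = (score_fn(plus) - score_fn(minus)) / (2 * eps)
            out_row.append(abs(grad))
        saliency.append(out_row)
    return saliency


# Hypothetical linear "NOOP score": only two pixels influence the output.
weights = [[0.0, 2.0],
           [-3.0, 0.0]]


def noop_score(image):
    return sum(w * p
               for w_row, p_row in zip(weights, image)
               for w, p in zip(w_row, p_row))


image = [[0.5, 0.1],
         [0.9, 0.3]]
for row in saliency_map(image, noop_score):
    print([round(v, 3) for v in row])
```

For a real network the gradient would come from automatic differentiation rather than finite differences, but the interpretation is the same: pixels with large values are the ones whose change most affects the class score.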

Figure: Saliency Map

Figure: Final network architecture. This figure is generated by adapting the code from https://github.com/gwding/draw_convnet
