arXiv:1710.04459v1 [cs.AI] 12 Oct 2017


Arguing Machines: Perception-Control System Redundancy and Edge Case Discovery in Real-World Autonomous Driving

Lex Fridman, MIT

[email protected]

Benedikt Jenik, MIT

[email protected]

Bryan Reimer, MIT

[email protected]

Figure 1: Illustration of the two perception-control “arguing machines” under investigation in this work: (1) a Tesla Autopilot L2 steering system and (2) an end-to-end neural network trained to make steering decisions from a sequence of images.

Abstract

Safe autonomous driving may be one of the most difficult engineering challenges that any artificial intelligence system has been asked to take on since the birth of AI over sixty years ago. The difficulty lies not in the task itself, but rather in the extremely small margin of allowable error given the human life at stake and the extremely large number of edge cases that have to be accounted for. In other words, we task these systems to expect the unexpected with near 100% accuracy, which is a technical challenge for machine learning methods that to date have generally been better at memorizing the expected than predicting the unexpected. In fact, the process of efficiently and automatically discovering the edge cases of driving may be the key to solving this engineering challenge. In this work, we propose and evaluate a method for discovering edge cases by monitoring the disagreement between two monocular-vision-based automated steering systems. The first is a proprietary Tesla Autopilot system equipped in the first generation of Autopilot-capable vehicles. The second is an end-to-end neural network trained on a large-scale naturalistic dataset of 420 hours, or 45 million frames, of autonomous driving in Tesla vehicles.

Preprint: AAAI Conference on Artificial Intelligence (AAAI-18)

1 Introduction

Over the past three decades, the wide-scale adoption of microprocessor-based electronic control units (ECUs) in transport vehicles has fundamentally changed the task of designing and building a safe, reliable car. Today, software plays a critical role in almost every automotive sub-system. As an illustration, the 2017 Ford F150 pickup truck relies on over 150 million lines of source code, compared to 6.5 million in the avionics and online support systems of a Boeing 787 (Ford Motor Company 2016). Practically, this means that the challenge of achieving high reliability, robustness, and redundancy has fundamentally become a software engineering problem as much as, if not more than, a mechanical engineering one.

The potential wide-scale adoption of increasingly-automated driving technology may further expand the scope and difficulty of the underlying software engineering problem, the primary component of which is redundancy. Redundancy is key to safety. If one system fails, another must be ready to take over. For perception systems, redundancy can be achieved by adding several of the same sensor or adding sensors that are able to capture similar, overlapping, or otherwise complementary information.

Lidar, radar, ultrasonic, monocular vision, stereo vision,


1. Monocular Camera

2. Real-Time Perception-Control System (Neural Network)

3. Disagreement Notification

4. Steering Commands from Tesla (Autopilot or Human) and Neural Network

5. Temporal Difference Input to Neural Network


Figure 2: Implementation and evaluation of the system presented in this paper. The primary perception-control system is Tesla Autopilot. The secondary perception-control system is an end-to-end neural network. We equipped a Tesla Model S vehicle with a monocular camera, an NVIDIA Jetson TX2, and an LCD display that shows the steering commands from both systems, the temporal difference input to the neural network, and (in red text) a notice to the driver when a disagreement is detected.

infrared vision are all examples of sensor modalities that can serve as redundant support for each other (Koopman and Wagner 2016). But how do we offer redundancy for an artificial intelligence system tasked with managing the full pipeline from perception to planning to control to actuation? Who or what serves as the fallback mechanism? Two options have been considered: (a) a human driver (Koo et al. 2015) and (b) a human teleoperator (Kim and Ryu 2013). In both cases a human being monitors the system, prepared to take over at any moment if the automated driving system is unable to handle the situation at hand (Reimer 2014). In this work, we propose a solution where another AI system (termed “machine”) may serve as an initial fallback. Moreover, we propose that such a system can take an active role in anticipating challenging driving situations (termed “edge cases”) by monitoring the disagreement (termed “argument”) between its prediction and that of the primary AI system.

As shown in Fig. 1, the role of the primary machine in this paper is served by the first generation of Tesla Autopilot software, with the perception and steering predictions performed by the integrated Mobileye system (Pirzada 2015). The role of the secondary machine is served by an end-to-end convolutional neural network similar to that described and evaluated in (Bojarski et al. 2016), except that our model considers the temporal dynamics of the driving scene by taking as input some aspects of the visual change in the forward-facing video for up to 1 second back in time (see §4.1). The output of both systems is a steering angle. The difference in those outputs is what constitutes the argument based on which disengagement suggestions and edge case proposals are made. The network model is trained on a balanced dataset constructed through sampling from 420 hours of real-world on-road automated driving by a fleet of 16 Tesla vehicles (Anonymized 2017) (see §3).

We perform two evaluations in this work. First, we evaluate the ability of the end-to-end network to predict steering angles commensurate with the real-world steering angles that were used to keep the car in its lane. For this, we use distinct periods of automated lane-keeping during Autopilot engagement as the training and evaluation datasets. Second, we evaluate the ability of an argument arbitrator (termed “disagreement function”) to estimate, based on a short time window, the likelihood that a transfer of control is initiated, whether by the human driver (termed “human-initiated”) or the Autopilot system itself (termed “machine-initiated”). We have 6,500 total disengagements in our dataset. All disengagements (whether human-initiated or machine-initiated)

(a) The GPS location of all 6,500 Autopilot disengagements in the automated driving dataset used in this work.

(b) The distribution of speed (measured by fraction of time spent at each speed) when the Tesla Autopilot system is controlling the vehicle.

(c) The distribution of steering angles (measured by fraction of time spent at each steering angle) when the Tesla Autopilot system is controlling the vehicle.

Figure 3: Statistics capturing the spatial and temporal dimensions of the on-road Autopilot autonomous driving dataset used for training and evaluation in this work.

are considered to be representative of cases where the visual characteristics of the scene (e.g., poor lane markings, complex lane mergers, light variations) were better handled by a human operator. Therefore, we chose to evaluate the disagreement function by its ability to predict these disengagements, which it is able to do with 90.4% accuracy (see Fig. 6).

The central idea proposed in this work is that robustness of the artificial intelligence system behind the perception and planning necessary for automated driving can be achieved by supplementing the training dataset with edge cases automatically discovered through monitoring the disagreement between multiple machine learning models.

The source code and the training data used in this paper are made publicly available. Furthermore, we implement and deploy the system described in this work to show its capabilities and performance in real-world conditions. Its successful operation is exhibited in an extensive, on-road video demonstration that is made publicly available at (links anonymized). As Fig. 2 shows, we instrumented a Tesla Model S vehicle with an NVIDIA Jetson TX2 running the neural-network-based perception-control system and disagreement function in real time. The input to the system is a forward-facing monocular camera and the output is steering commands. The large display shows steering commands both from the primary system (Tesla) and the secondary system (neural network), and notifies the driver when a disagreement is detected.

2 Related Work

Software is taking on greater operational control in modern vehicles and in so doing is opening the door to machine learning. These approaches are fundamentally hungry for data, based on which they aim to take on the higher-level perception and planning tasks. As an example, over 15 million vehicles worldwide are equipped with Mobileye computer vision technology (Stein et al. 2015), including the first-generation Autopilot system that serves as the “primary machine” in this work.

Given the requirement of extremely low error rates and the need to generalize over countless edge cases, large-scale annotated data is essential to making these approaches work in real-world conditions. In fact, for driving, training data representative of all driving situations may be more important than incremental improvements in perception, control, and planning algorithms. Tesla, as an example, is acknowledging this need by asking its owners to share data with the company for the explicit purpose of training the underlying machine learning models. Our work does precisely this, applying end-to-end neural network approaches to training on large-scale, semi-autonomous, real-world driving data. The resulting model serves as an observer and critic of the primary system with the goals of (1) discovering edge cases in the offline context and (2) bringing the human back into the loop when needed in the online context.

2.1 End-to-End Approaches to Driving

In contrast to modular engineering approaches to self-driving systems, where deep learning only plays a role in the initial scene interpretation step, it is also possible to approach driving as a more holistic task that can possibly be solved in a data-driven way by a single learner: an end-to-end neural network. First attempts were made almost 30 years ago (Pomerleau 1989), long before the recent GPU-enabled performance breakthroughs in deep learning (Krizhevsky, Sutskever, and Hinton 2012).

A similar, but more modern approach using deeper, convolutional nets has been deployed in an experimental vehicle by NVIDIA (Bojarski et al. 2016), and further improvements to that were made using various forms of data augmentation (Ross, Gordon, and Bagnell 2011) and adapted to the driving context by (Zhang and Cho 2017). In addition to just using convolutional neural networks, more advanced approaches that combine temporal dynamics information have been proposed (Xu et al. 2016).

2.2 Ensemble of Neural Networks

The idea of multiple networks collaborating or competing against each other to optimize an objective has been implemented in various contexts. For example, multiple networks have been combined together in order to improve accuracy (Krogh, Vedelsby, and others 1995), as has traditionally been explored in machine learning as ensembles of classifiers. In these approaches, decision-level fusion is performed across many classifiers in order to increase the accuracy and robustness of the overall system.

Alternatively, generative adversarial networks (GANs) have networks working against each other for representation learning and subsequent generation of samples from those learned representations (Goodfellow et al. 2014), including generation of steering commands (Kuefler et al. 2017). Neural networks have also been used in different environments at the same time (Mnih et al. 2016) to learn from them in parallel, or, as in our work, to look at what the disagreement with other systems reveals about the underlying state of the world the networks operate in.

3 Dataset

The dataset used for the training and evaluation of the end-to-end steering network model comprising the “secondary machine” is taken from a large-scale naturalistic driving study of semi-autonomous vehicle technology (Anonymized 2017). Specifically, we used 420 hours of driving data where a Tesla Autopilot system was controlling both the longitudinal and lateral movement of the vehicle. The spatial and temporal characteristics of Autopilot use in our dataset are shown in Fig. 3.

This subset of the full naturalistic driving dataset served as ground truth for automated lane keeping. In other words, given the operational characteristics of Autopilot, we know that the vehicle only leaves the lane in two situations: (1) during automated lane changes and (2) as part of a “disengagement” where the driver elects or is forced to take back control of the vehicle. We have the full enumeration of both scenarios. The latter is of particular interest to the task of arguing machines, as one indication of a valuable disagreement is one that is associated with a human driver feeling sufficiently uncomfortable to elect to take back control of the vehicle. There are 6,500 such instances of disengagement that are used for evaluating the ability of the disagreement function to discover edge cases and challenging driving scenarios, as discussed in §4.2.

4 Arguing Machines

4.1 End-to-End Learning of Steering Task

Our model, which is inspired by (Bojarski et al. 2016), uses 5 convolutional layers, the first 3 with a stride of 2 × 2 and 5 × 5 kernels and the remaining 2 keeping the same stride while switching to smaller 3 × 3 kernels. On top of that, we add 4 fully connected layers going down to output sizes of 100, 50, 10, and 1, respectively. Throughout the net, ReLU activations (Nair and Hinton 2010) are used on the layers. In addition, we use Dropout (Krizhevsky, Sutskever, and Hinton 2012) as a regularization technique on the fully connected layers. The net is trained using an RMSprop (Hinton, Srivastava, and Swersky 2012) optimizer minimizing the mean squared error between the predicted and actual steering angle.
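The layer structure described above can be sketched in PyTorch as follows. The 5 convolutional layers, the fully connected sizes (100, 50, 10, 1), ReLU, Dropout on the fully connected layers, and the RMSprop/MSE training objective follow the text; the convolutional channel counts, dropout rate, and learning rate are assumptions, since the paper does not specify them.

```python
import torch
import torch.nn as nn

class SteeringNet(nn.Module):
    """Sketch of the end-to-end steering network described in the text.
    Channel counts (24, 36, 48, 64, 64) are assumed, not from the paper."""

    def __init__(self, dropout_p: float = 0.5):
        super().__init__()
        self.features = nn.Sequential(
            # first 3 conv layers: 5x5 kernels, stride 2
            nn.Conv2d(3, 24, kernel_size=5, stride=2), nn.ReLU(),
            nn.Conv2d(24, 36, kernel_size=5, stride=2), nn.ReLU(),
            nn.Conv2d(36, 48, kernel_size=5, stride=2), nn.ReLU(),
            # remaining 2 conv layers: same stride, smaller 3x3 kernels
            nn.Conv2d(48, 64, kernel_size=3, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=2), nn.ReLU(),
        )
        self.regressor = nn.Sequential(
            nn.Flatten(),
            # LazyLinear infers the flattened feature size on first use
            nn.LazyLinear(100), nn.ReLU(), nn.Dropout(dropout_p),
            nn.Linear(100, 50), nn.ReLU(), nn.Dropout(dropout_p),
            nn.Linear(50, 10), nn.ReLU(), nn.Dropout(dropout_p),
            nn.Linear(10, 1),  # predicted steering angle
        )

    def forward(self, x):
        return self.regressor(self.features(x))

model = SteeringNet()
# training objective per the text: RMSprop minimizing mean squared error
optimizer = torch.optim.RMSprop(model.parameters(), lr=1e-4)
loss_fn = nn.MSELoss()
```

A 3-channel 256 × 144 input (as produced by the preprocessing methods of §4.1) maps to a single scalar steering prediction.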

[Figure 4 panels, top to bottom: M5: Static RGB (R, G, B of frame 0); M4: Static RGB Edges (R, G, B of frame 0); M3: Dynamic Gray (frames −20, −10, 0); M2: Dynamic Diff Flashbacks (frames −10, −5, −1); M1: Dynamic Diff Intervals (frames −30, −20, −10).]

Figure 4: Visualization of one illustrative example of each of the 5 neural network preprocessing models evaluated in this paper. See Fig. 5 for the mean absolute error achieved by each model.

Since a large part of driving, and therefore also our dataset, consists of going straight, we had to specifically select input images to remove that imbalance and the resulting bias towards lower steering angle values the net would otherwise learn. To accomplish this dataset balancing task, we calculate a threshold using the minimum number of available frames across steering angle ranges of one degree. This threshold is then applied within the range of interest of [−10°, 10°] steering angle so that at most threshold frames are selected per range, achieving a balance. This results in about 100,000 training and 50,000 validation frames.
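A minimal sketch of this balancing step, assuming per-frame steering angles in degrees. Treating the threshold as the minimum over non-empty one-degree bins, and sampling the surplus frames uniformly at random, are assumptions; the text does not say how empty bins or the sampling are handled.

```python
import numpy as np

def balance_by_steering(angles, rng=None, lo=-10.0, hi=10.0):
    """Cap every 1-degree steering bin at the size of the smallest
    non-empty bin. `angles` is an array of per-frame steering angles
    in degrees; returns sorted indices of the frames to keep."""
    if rng is None:
        rng = np.random.default_rng(0)
    angles = np.asarray(angles, dtype=float)
    # restrict to the range of interest [-10, 10) degrees
    idx = np.flatnonzero((angles >= lo) & (angles < hi))
    bins = np.floor(angles[idx] - lo).astype(int)  # 1-degree bins
    counts = np.bincount(bins, minlength=int(hi - lo))
    threshold = counts[counts > 0].min()  # smallest non-empty bin
    keep = []
    for b in range(int(hi - lo)):
        members = idx[bins == b]
        if len(members) > threshold:
            # randomly subsample over-represented bins down to threshold
            members = rng.choice(members, size=threshold, replace=False)
        keep.extend(members.tolist())
    return np.sort(np.array(keep))
```

On a heavily straight-driving-dominated set of angles, the kept indices contain at most `threshold` frames per one-degree bin, removing the bias toward near-zero angles.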

For the input to the neural network we considered 5 different preprocessing methods (see Fig. 4), referenced as M1–M5 in the following sections, each producing a 256 × 144 image with 3 channels. M5 uses the method proposed in (Bojarski et al. 2016) as a comparison, consisting of the RGB channels of a single frame. M4 uses the same single frame, but precomputes edges on each color channel.

Figure 5: The mean absolute error achieved by each of the 5 models illustrated in Fig. 4.

Figure 6: The tradeoff between false accept rate (FAR) and false reject rate (FRR) achieved by varying the constant threshold used to make the binary disagreement classification. The red circle designates a threshold of 10 that is visualized on an illustrative example in Fig. 7.

To improve the accuracy beyond that, for input methods M1 to M3 we use a temporal component, meaning multiple frames, to improve situation awareness. M3, in addition to the current frame, also looks 10 and 20 frames back and provides the grayscale versions of them as the input image channels. M2 goes beyond that and, in addition to using multiple frames as input, also subtracts them from each other, which helps with an implicit input normalization, as well as automatically highlighting the important moving parts like lane markings. The exact mathematical formulation of the input is:

$I_t = \{(F_t - F_{t-10}),\; (F_t - F_{t-5}),\; (F_t - F_{t-1})\}$ (1)

where $I_t$ and $F_t$ are the input to the neural network and the video frame at time $t$, respectively. The unit of time is 1 video frame, or 33.3 milliseconds, given the 30 fps video used in this work. In (1), each channel is based on the current frame, but also incorporates a “flashback” to another frame further back.

Figure 7: Illustrative example showing snapshots of the forward roadway, plots of the steering angles suggested by the primary machine (black line) and secondary machine (blue line), and a plot of the disagreement function along with a threshold value of 10 that corresponds to the red circle in Fig. 6.

M1 does not use “flashbacks”, but instead looks at the changes that happened over a series of time segments, each 10 frames long, as follows:

$I_t = \{(F_{t-20} - F_{t-30}),\; (F_{t-10} - F_{t-20}),\; (F_t - F_{t-10})\}$ (2)

To evaluate the network using the different preprocessing methods, we compute the mean absolute steering angle error over the validation set. The results are shown in Fig. 5. Precomputing edges (M4) already leads to improved performance over just supplying the RGB image (M5), and providing temporal context (M1–M3) does even better, with “flashbacks” (M2) performing better than just providing multiple frames (M3), and comparing time segments (M1) performing best. For the evaluation of the disagreement function in §4.2, we use M1.
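As a concrete sketch, the M2 and M1 inputs of equations (1) and (2) can be computed with NumPy as below. The paper does not state whether frames are converted to grayscale before differencing or how the resize to 256 × 144 is performed; this sketch assumes grayscale differences (one per output channel) and leaves resizing out.

```python
import numpy as np

def to_gray(frame):
    """Luma grayscale of an RGB frame (H, W, 3) -> (H, W). The use of
    grayscale before differencing is an assumption, not from the paper."""
    return frame.astype(np.float32) @ np.array([0.299, 0.587, 0.114],
                                               dtype=np.float32)

def preprocess_m2(frames):
    """M2 'flashbacks' (Eq. 1): each channel is the current frame minus
    the frame 10, 5, and 1 steps back. frames[-1] is F_t; at least
    11 frames are required."""
    f = [to_gray(fr) for fr in frames]
    return np.stack([f[-1] - f[-11],   # F_t - F_{t-10}
                     f[-1] - f[-6],    # F_t - F_{t-5}
                     f[-1] - f[-2]],   # F_t - F_{t-1}
                    axis=-1)

def preprocess_m1(frames):
    """M1 interval differences (Eq. 2): changes over three consecutive
    10-frame segments. frames[-1] is F_t; at least 31 frames are required."""
    f = [to_gray(fr) for fr in frames]
    return np.stack([f[-21] - f[-31],  # F_{t-20} - F_{t-30}
                     f[-11] - f[-21],  # F_{t-10} - F_{t-20}
                     f[-1] - f[-11]],  # F_t - F_{t-10}
                    axis=-1)
```

Both functions return a 3-channel image of the same spatial size as the input frames, matching the 3-channel inputs described for M1–M3.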

4.2 Disagreement Function and Edge Case Discovery

The goal for the disagreement function is to compare the steering angle suggested by the “primary machine” (Autopilot) and the “secondary machine” (neural network) and, based on this comparison, to make a binary classification of whether the current situation is a challenging driving situation or not. The disagreement function can take many forms, including modeling the underlying entropy of the disagreement, but the function computed and evaluated in this work purposefully took on a simple form through the following process:

1. Normalize the steering angles for both the primary and secondary machines to [−1, 1] by dividing by the range [−10, 10]; all angles exceeding the range are clipped to the range limits.

2. Compute the difference between the normalized steering suggestions and sum these differences over a window of 1 second (or 30 samples).

3. Make the binary classification decision based on a disagreement threshold δ.
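The three steps above can be sketched as follows. Whether the summed differences are absolute values is not stated in the text, so the use of absolute differences here is an assumption.

```python
import numpy as np

def disagreement(primary_deg, secondary_deg, delta=10.0, limit=10.0):
    """Sketch of the three-step disagreement function. Inputs are
    per-frame steering angles in degrees over a 1-second (30-sample)
    window; returns (score, is_edge_case)."""
    # 1. clip to [-10, 10] degrees and normalize to [-1, 1]
    p = np.clip(np.asarray(primary_deg, dtype=float), -limit, limit) / limit
    s = np.clip(np.asarray(secondary_deg, dtype=float), -limit, limit) / limit
    # 2. sum the (absolute, by assumption) differences over the window
    score = np.abs(p - s).sum()
    # 3. binary decision against the threshold delta
    return score, bool(score > delta)
```

With 30 normalized samples, the score lies in [0, 60], so the δ = 10 threshold used in the evaluation corresponds to a sustained disagreement of a few degrees over the full window.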

The metrics used for evaluating the performance of the disagreement system are false accept rate (FAR) and false reject rate (FRR), where the detection event of interest is the Autopilot disengagement. In other words, an “accept” is a prediction that this moment in time is likely to be associated with a disengagement and can thus be considered an edge case for the machine learning system. A “reject” is a prediction that this moment in time is not likely to be associated with a disengagement. In order to compute the FAR and FRR measures for a given value of δ, we use classification windows evenly sampled from disengagement periods and non-disengagement periods. A disengagement period is defined as the 5 seconds leading up to a disengagement and the 1 second following it.
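Under the sampling scheme just described, FAR and FRR for a candidate threshold can be computed as below; the disagreement scores for positive (disengagement-period) and negative windows are assumed to have been computed already, and picking the threshold by minimizing the mean of FAR and FRR follows the "optimal mean error rate" used in the evaluation.

```python
import numpy as np

def far_frr(scores_pos, scores_neg, delta):
    """FAR/FRR at threshold delta, given disagreement scores for windows
    sampled from disengagement periods (positives) and from
    non-disengagement periods (negatives)."""
    scores_pos = np.asarray(scores_pos, dtype=float)
    scores_neg = np.asarray(scores_neg, dtype=float)
    frr = np.mean(scores_pos <= delta)  # missed disengagements (false rejects)
    far = np.mean(scores_neg > delta)   # false alarms (false accepts)
    return far, frr

def best_threshold(scores_pos, scores_neg, deltas):
    """Sweep candidate thresholds; return (mean error rate, delta) minimizing
    the mean of FAR and FRR."""
    return min((np.mean(far_frr(scores_pos, scores_neg, d)), d)
               for d in deltas)
```

Sweeping `deltas` this way traces out the FAR/FRR tradeoff curve of Fig. 6.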

The illustrative example in Fig. 7 shows the temporal dynamics of the two steering suggestions, the resulting disagreement, and the role of δ in marking the moment leading up to the disengagement as an edge case. The ROC curve in Fig. 6 shows, by varying δ, that the optimal mean error rate is 0.096, achieved when δ = 10. This means that given any 1-second period of Autopilot driving in our test dataset, the disagreement function can predict whether a disengagement will happen in the next 5 seconds with 90.4% accuracy. This is a promising result that motivates further evaluation of the predictive power of the disagreement function, both on a larger dataset of Autopilot driving and in real-world on-road testing.

5 Conclusion

This work proposes a solution for detecting the inevitable moments when a perception pipeline for the driving task encounters an edge case where it may struggle but is not aware of it, and thus confidently suggests a potentially suboptimal or even unsafe steering command. Our approach proposes an end-to-end neural network to serve as a “secondary machine” that argues with the “primary machine” and, when a disagreement arises, brings the human back into the control loop. Moreover, we explore the use of this disagreement process to discover new edge cases in naturalistic driving, and show that even a single neural network can be an effective supervisor of the steering suggestions provided by the primary system. Our approach to building a “secondary machine” has two benefits: (1) it treats the “primary machine” as a black box and thus can operate as an after-market product supervising any proprietary automotive AI system as long as the inputs and outputs of the system can be approximately observed, and (2) as with any machine learning system, it improves with more data, and in the case of Autopilot does not require any manual annotation since the lane keeping system automatically annotates the data, as discussed in §3. We evaluate both the secondary system and the disagreement function to show the ability to predict transfers of control from machine to human with 90.4% accuracy within the 5 seconds before they happen. Finally, we implement, deploy, and demonstrate our system in a Tesla Model S vehicle operating in real-world conditions.

References

[Anonymized 2017] Anonymized. 2017. (Anonymized) naturalistic driving dataset for understanding the human at the center of automation.

[Bojarski et al. 2016] Bojarski, M.; Del Testa, D.; Dworakowski, D.; Firner, B.; Flepp, B.; Goyal, P.; Jackel, L. D.; Monfort, M.; Muller, U.; Zhang, J.; et al. 2016. End to end learning for self-driving cars. arXiv preprint arXiv:1604.07316.

[Ford Motor Company 2016] Ford Motor Company. 2016. Ford invests in Pivotal to accelerate cloud-based software development; new labs drive Ford Smart Mobility innovation. [Online; posted 5-May-2016].

[Goodfellow et al. 2014] Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; and Bengio, Y. 2014. Generative adversarial nets. In Ghahramani, Z.; Welling, M.; Cortes, C.; Lawrence, N. D.; and Weinberger, K. Q., eds., Advances in Neural Information Processing Systems 27. Curran Associates, Inc. 2672–2680.

[Hinton, Srivastava, and Swersky 2012] Hinton, G.; Srivastava, N.; and Swersky, K. 2012. Neural networks for machine learning, lecture 6a: Overview of mini-batch gradient descent.

[Kim and Ryu 2013] Kim, J.-S., and Ryu, J.-H. 2013. Shared teleoperation of a vehicle with a virtual driving interface. In Control, Automation and Systems (ICCAS), 2013 13th International Conference on, 851–857. IEEE.

[Koo et al. 2015] Koo, J.; Kwac, J.; Ju, W.; Steinert, M.; Leifer, L.; and Nass, C. 2015. Why did my car just do that? Explaining semi-autonomous driving actions to improve driver understanding, trust, and performance. International Journal on Interactive Design and Manufacturing (IJIDeM) 9(4):269–275.

[Koopman and Wagner 2016] Koopman, P., and Wagner, M. 2016. Challenges in autonomous vehicle testing and validation. SAE International Journal of Transportation Safety 4(2016-01-0128):15–24.

[Krizhevsky, Sutskever, and Hinton 2012] Krizhevsky, A.; Sutskever, I.; and Hinton, G. E. 2012. ImageNet classification with deep convolutional neural networks. In Pereira, F.; Burges, C. J. C.; Bottou, L.; and Weinberger, K. Q., eds., Advances in Neural Information Processing Systems 25. Curran Associates, Inc. 1097–1105.

[Krogh, Vedelsby, and others 1995] Krogh, A.; Vedelsby, J.; et al. 1995. Neural network ensembles, cross validation, and active learning. Advances in Neural Information Processing Systems 7:231–238.

[Kuefler et al. 2017] Kuefler, A.; Morton, J.; Wheeler, T.; and Kochenderfer, M. 2017. Imitating driver behavior with generative adversarial networks. In IEEE Intelligent Vehicles Symposium.

[Mnih et al. 2016] Mnih, V.; Badia, A. P.; Mirza, M.; Graves, A.; Lillicrap, T.; Harley, T.; Silver, D.; and Kavukcuoglu, K. 2016. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, 1928–1937.

[Nair and Hinton 2010] Nair, V., and Hinton, G. E. 2010. Rectified linear units improve restricted Boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), 807–814.

[Pirzada 2015] Pirzada, U. 2015. Tesla Autopilot: an in-depth look at the technology behind the engineering marvel. [Online; posted 3-Dec-2015].

[Pomerleau 1989] Pomerleau, D. A. 1989. ALVINN: An autonomous land vehicle in a neural network. In Touretzky, D. S., ed., Advances in Neural Information Processing Systems 1. Morgan-Kaufmann. 305–313.

[Reimer 2014] Reimer, B. 2014. Driver assistance systems and the transition to automated vehicles: A path to increase older adult safety and mobility? Public Policy & Aging Report 24(1):27–31.

[Ross, Gordon, and Bagnell 2011] Ross, S.; Gordon, G. J.; and Bagnell, D. 2011. A reduction of imitation learning and structured prediction to no-regret online learning. In AISTATS, volume 1, 6.

[Stein et al. 2015] Stein, G.; Dagan, E.; Mano, O.; and Shashua, A. 2015. Collision warning system. US Patent 9,168,868.

[Xu et al. 2016] Xu, H.; Gao, Y.; Yu, F.; and Darrell, T. 2016. End-to-end learning of driving models from large-scale video datasets. arXiv preprint arXiv:1612.01079.

[Zhang and Cho 2017] Zhang, J., and Cho, K. 2017. Query-efficient imitation learning for end-to-end simulated driving. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, February 4-9, 2017, San Francisco, California, USA, 2891–2897.