TRANSCRIPT
Highly Autonomous Vehicle Validation:
It’s more than just road testing!
Prof. Philip Koopman
© 2017 Edge Case Research LLC
How Do You Validate Autonomous Vehicles?
Self-driving cars are so cool! But also kind of scary
Is a billion miles of testing enough? Perhaps not, even if you could afford it
Will simulation really solve this?
What exactly are you validating? Requirements vs. implementation validation
Can you map autonomy onto ISO 26262? Why machine learning is the hard part
NREC: 30+ Years Of Cool Robots
[Timeline figure, 1985–2010: NASA Dante II, ARPA Demo II, DARPA SC-ALV, NASA Lunar Rover, Auto Excavator, Auto Harvesting, Auto Forklift, Laser Paint Removal, Army FCS, DARPA PerceptOR, DARPA LAGR, DARPA Grand Challenge, Urban Challenge, DARPA UPI, Mars Rovers, Auto Haulage, Auto Spraying, AHS Safety, Robot & AV Safety]
A Billion Miles of On-Road Testing?
Best case testing scenario: for example, 134M miles/mishap (NYC taxi data)
Assumptions (see the sketch below):
– Random independent mishap arrivals in testing
– 95% confidence of >= 134M miles/mishap
⇒ 401.4M miles of testing if no mishaps
– More likely 1B+ miles if “just as good as” humans
Significant practical issues:
– Unlikely that software fails randomly
– Reset the testing meter if software changes
– Reset the testing meter if the environment changes
– Hard to know what changes are “small” vs. “big”

# Mishaps in Testing    Total Testing Miles for 95% Confidence
0                       401M
1                       636M
2                       844M
3                       1.039B

[Fatal and critical injury data / NYC Local Law 31 of 2014]
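The table's arithmetic follows from treating mishaps as a Poisson arrival process. A minimal sketch (assuming SciPy is available; the chi-square lower confidence bound on mean miles between mishaps is the standard formula, and the 134M miles/mishap target is from the slide) reproduces the figures:

```python
from scipy import stats

MILES_PER_MISHAP = 134e6  # NYC taxi data: ~134M miles per fatal/critical-injury mishap
CONFIDENCE = 0.95

# For a Poisson (random, independent) mishap process, demonstrating
# mean miles/mishap >= m with confidence C after observing k mishaps requires
#   miles = m * chi2.ppf(C, 2k + 2) / 2
for mishaps in range(4):
    miles = MILES_PER_MISHAP * stats.chi2.ppf(CONFIDENCE, 2 * mishaps + 2) / 2
    print(f"{mishaps} mishaps -> {miles / 1e6:,.0f}M miles of testing")

# Prints 401M, 636M, 844M, 1,039M -- matching the table above.
```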
How About Designing Tests Instead?
Validate implementation: does the vehicle behave correctly?
– Traffic scenarios (e.g., aggressive human driver behavior)
– Sensor limitations (e.g., contrast, glare, clutter)
– Adverse weather (e.g., snow, fog, rain)
– Anticipated road conditions (e.g., flooding, obstacles)
Scalability is an issue – big scenario cross-product (see the sketch below)
How long to drive around to collect scenario elements? Depends on the distribution of how often they appear
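To see how fast the cross-product blows up, here is a toy sketch; every dimension and value below is a hypothetical stand-in for a real ODD/scenario taxonomy:

```python
from itertools import product
from math import prod

# Hypothetical scenario dimensions; a real taxonomy has far more of each,
# and the product grows multiplicatively with every dimension added.
dimensions = {
    "traffic": ["nominal", "aggressive driver", "jaywalker", "cyclist"],
    "sensing": ["clear", "glare", "low contrast", "clutter"],
    "weather": ["dry", "rain", "fog", "snow"],
    "road":    ["nominal", "construction", "obstacle", "flooded"],
    "light":   ["day", "dusk", "night"],
}

print("scenarios in full cross-product:", prod(map(len, dimensions.values())))
# -> 768 from just five small dimensions; realistic taxonomies explode

# Each tuple is one designed test scenario:
for scenario in list(product(*dimensions.values()))[:3]:
    print(dict(zip(dimensions, scenario)))
```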
[Photos: extreme contrast; poor visibility; road obstacles; construction; water (appears flat!)]
Distribution of Scenario Elements Matters
Vehicle safety is limited by novel “black swan” arrivals
Assume 100K miles/novel hazard seen in road testing (see the sketch below):
– Case 1: 100 novel hazards @ 10M miles/hazard for each type
– Case 2: 100,000 novel hazards @ 10B miles/hazard for each type
» Note: the US fleet averages about 8B+ miles/day
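A minimal sketch of why the two cases diverge, assuming independent exponential inter-arrival times per hazard type with the Case 1/Case 2 rates above:

```python
import math

# Both cases start out producing one never-before-seen hazard per ~100K miles,
# but their tail behavior differs drastically.
cases = {
    "Case 1": (100, 10e6),       # 100 hazard types, 10M miles/hazard each
    "Case 2": (100_000, 10e9),   # 100K hazard types, 10B miles/hazard each
}

for miles in (1e6, 100e6, 1e9):
    for name, (n_types, miles_per) in cases.items():
        # P(a given type never appears in `miles` of testing)
        p_unseen = math.exp(-miles / miles_per)
        print(f"{name} @ {miles/1e6:>6.0f}M miles: "
              f"{n_types * p_unseen:,.0f} of {n_types:,} hazard types still unseen")
```

In Case 1, every hazard type has almost certainly been seen after a billion miles; in Case 2, the sketch shows roughly 90% of hazard types are still unseen at that point.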
[Plot: random independent arrival rate (exponential) vs. power law arrival rate (80/20 rule); many infrequent scenarios; total area is the same; cross-over at 4,000 hrs]
You might not see some everyday hazards even after a billion miles of on-road testing
http://piximus.net/fun/funny-and-odd-things-spotted-on-the-road
How Do We Map To the V Model?
Machine learning learns by example
– There is no design to trace to testing
– Training data are de facto requirements
We don’t know if the design is correct
– ML behavior has an inscrutable “design”
– This is one facet of the “legibility” problem
– How do you trace an unknown design to tests?
“Black swans” depend on what it has learned
– (We’ll come back to this shortly)
Possible approach: trace tests back to a safety argument
– The left side of the V just represents safety functions/requirements
– E.g., “doesn’t hit people” rather than analyzing ML weights
– Use non-ML software to enforce those safety requirements
Safety Envelopes to Mitigate ML Risks
Strategy: a non-ML safety checker
– Enforces safety; can be developed to ISO 26262
Safety envelope:
– Specify unsafe regions for safety
– Specify safe regions for functionality
– Deal with a complex boundary via:
  » Under-approximating the safe region
  » Over-approximating the unsafe region
– An envelope transition triggers the system safety response (see the sketch below)
Partition the requirements:
– Operation: functional requirements
– Failsafe: safety requirements (safety functions)
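A minimal sketch of the envelope-plus-gate idea. The state variables, stopping-distance check, and margin are hypothetical illustrations; a real checker would cover the full operational state space and be developed to the appropriate ASIL:

```python
from dataclasses import dataclass

# Hypothetical vehicle state, purely illustrative.
@dataclass
class State:
    speed_mps: float          # current speed
    gap_m: float              # distance to nearest obstacle ahead
    max_brake_mps2: float     # guaranteed achievable deceleration

def in_unsafe_region(s: State, margin_m: float = 5.0) -> bool:
    """Over-approximate the unsafe region: flag anything within a
    conservative stopping distance plus a fixed margin. Erring toward
    'unsafe' keeps the check simple and sound at the cost of some
    availability (false alarms)."""
    stopping_dist = s.speed_mps ** 2 / (2 * s.max_brake_mps2)
    return s.gap_m < stopping_dist + margin_m

def safety_gate(s: State, ml_command: float) -> float:
    """An envelope transition triggers the system safety response:
    override the ML 'doer' output with a failsafe stop."""
    if in_unsafe_region(s):
        return -s.max_brake_mps2   # safety response: full braking
    return ml_command              # otherwise pass the ML command through
```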
[Figure: state space partitioned into a safe operating envelope and UNSAFE! regions]
Practical Application: Runtime Monitoring Doer/Checker Pair
How did we make this safe? A fail-safe safety gate via a high-ASIL checker
– Untrusted machine learning is the “doer”
– The checker is a run-time monitor
– Enforces the safety envelope via Metric Temporal Logic safety invariants
– Works best for control functions (see the sketch below)
Fail-operational doer/checker for HAV: 2CASA dual-channel architecture
– Primary pair for normal autonomy
– Secondary pair for the safing mission
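A minimal sketch of one such run-time monitor, for a bounded-response invariant of the MTL form G(trigger → F[0,d] response); the specific "obstacle close implies braking within 500 ms" property is a hypothetical example:

```python
from collections import deque

class BoundedResponseMonitor:
    """Run-time monitor for a Metric Temporal Logic bounded-response
    invariant: G(trigger -> F[0,deadline] response). Hypothetical
    invariant: 'whenever an obstacle is close, braking must begin
    within 500 ms'. Illustrative only."""

    def __init__(self, deadline_s: float = 0.5):
        self.deadline_s = deadline_s
        self.pending = deque()   # timestamps of unanswered triggers

    def step(self, t: float, obstacle_close: bool, braking: bool) -> bool:
        """Feed one time-stamped sample; return False on violation."""
        if braking:
            self.pending.clear()            # response discharges all obligations
        if obstacle_close and not braking:
            self.pending.append(t)          # new obligation with a deadline
        # A violation occurs if any obligation outlives its deadline
        return not (self.pending and t - self.pending[0] > self.deadline_s)
```

The checker channel would evaluate a set of such invariants every control cycle; any violation triggers the safing mission.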
[Diagram: low-SIL ML “doer” paired with a high-SIL simple safety envelope checker. Target GVW: 8,500 kg; target speed: 80 km/hr]
[Approved for Public Release. TACOM Case #20247, Date: 07 OCT 2009]
What About ML Legibility?
Legibility problem:
– Did the system do the right thing for the right reason?
– How can you tell what the system will do next?
– “Black swans” are in the context of the machine’s learning
– If you don’t know what it learned, what is “novel”?
Machine learning is brittle and inscrutable
– Proving results of inductive learning is tough
– Surprises in what was learned ⇒ brittleness
Problem: will the system tolerate noise?
– “Noise” is likely to affect ML systems differently than humans
Problem: did it work for the correct reason?
– If you don’t know why the system acted, was it a “test pass” or did it get lucky?
– Statistically valid testing sets will increase the number of tests, but still leave doubt
[Images: “bus” vs. adversarially perturbed “not a bus” with magnified difference (Szegedy et al., 2013); learned “dumbbell” visualization (Mordvintsev et al., 2015)]
Robustness Stress Testing (“Noise”)
Control software robustness testing
– 1990s: operating systems (Ballista)
– 2010: HAVs and robots (ASTAA)
– Switchboard for SIL & HIL tests
Machine learning robustness testing
– Sensor robustness (RIOT project); see the sketch below
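A sketch of the flavor of ML robustness testing described here: sweep an injected degradation (Gaussian blur, as pictured on the slide) and count behavior changes. The detect() function is a hypothetical stand-in; a real campaign would drive the actual perception stack with recorded sensor data:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def detect(frame: np.ndarray) -> bool:
    """Placeholder detector: real tests would call the ML stack under test."""
    return frame.max() - frame.min() > 0.5   # crude 'object contrast' proxy

rng = np.random.default_rng(0)
frames = [rng.random((64, 64)) for _ in range(100)]   # stand-in test frames

# Increase the injected fault severity and record how often the
# perception output flips relative to the clean frame.
for sigma in (0.5, 1.0, 2.0, 4.0, 8.0):
    flips = sum(detect(f) != detect(gaussian_filter(f, sigma)) for f in frames)
    print(f"blur sigma={sigma}: detection changed on {flips}/{len(frames)} frames")
```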
[Figures: synthetic environment robustness testing; synthetic equipment faults; Gaussian blur example]
[DISTRIBUTION A – NREC case number STAA-2013-10-02; DISTRIBUTION A – NREC RIOT]
Design Machine Learning for Validation
Passing a test once might be by luck
– Difficult to infer causation given only correlation
Make the ML system explain why (see the sketch below)
– Pre-define scenario element “bins” (OEDR/ODD related)
– Bins are scenario elements that are present
– Trace bins to test scenario design
– ML says which bin(s) it thinks are in play
– Did it understand which scenario it was in? Or get lucky?
Must pass the test for the right reason
– ML is forced to learn bins for scenario elements
– A separate ML goes from scenario elements to behavior
Residual risks:
– ML lies about what it sees (but we can catch it in the act)
– The ML black-box pair learns a covert communication channel
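A sketch of what a bin-aware test oracle might look like; the names and the PASS/SUSPECT/FAIL policy are hypothetical illustrations of "passing for the right reason":

```python
from dataclasses import dataclass

@dataclass
class TestScenario:
    designed_bins: frozenset    # scenario elements built into this test
    expected_behavior: str      # e.g., "yield", "stop", "proceed"

def run_test(scenario: TestScenario, perceive, behave) -> str:
    """A test passes only if the behavior is correct AND the ML's
    declared scenario bins match the bins designed into the test."""
    declared_bins = perceive(scenario)    # ML #1: which bins are in play?
    behavior = behave(declared_bins)      # ML #2: bins -> behavior
    behavior_ok = behavior == scenario.expected_behavior
    bins_ok = declared_bins == scenario.designed_bins
    if behavior_ok and bins_ok:
        return "PASS"                     # right answer, right reason
    if behavior_ok:
        return "SUSPECT"                  # right answer, wrong reason: luck?
    return "FAIL"
```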
Simulation As Risk Reduction
Initial simulation set-up: use higher realism levels to validate simulation accuracy
Simulation used for AV validation:
– Push down to low-fidelity simulations for brute-force coverage
– Identify residual risks at each level:
  » Relevant simplifications
  » Simulation assumptions
– Higher fidelity tests the assumptions
Why are you simulating/testing? (see the sketch below)
– On-road: to discover requirement gaps
– Mid-level: to mitigate the risks of believing simulations
– Reduced-order models/low level: to get coverage
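One way to make this residual-risk bookkeeping explicit; the level names and risk descriptions are hypothetical examples, not a prescribed taxonomy:

```python
# Hypothetical encoding of the fidelity pyramid: each level states why it
# exists and which residual risk it leaves for a higher-fidelity level.
SIMULATION_LEVELS = [
    {"level": "reduced-order models", "purpose": "brute-force coverage",
     "residual_risk": "model simplifications miss real dynamics"},
    {"level": "full vehicle simulation", "purpose": "end-to-end scenario checks",
     "residual_risk": "simulation assumptions differ from reality"},
    {"level": "closed-course testing", "purpose": "validate simulation accuracy",
     "residual_risk": "course cannot cover the full ODD"},
    {"level": "on-road testing", "purpose": "discover requirement gaps",
     "residual_risk": "rare scenario elements remain unseen"},
]

# Each level's residual risk should be explicitly tested at the next
# higher fidelity level rather than silently assumed away.
for lower, higher in zip(SIMULATION_LEVELS, SIMULATION_LEVELS[1:]):
    print(f"{higher['level']} must test: {lower['residual_risk']}")
```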
Key Principles of Pragmatic HAV Validation
Traditional safety has its place
– Traditional functionality should be ISO 26262
– Safety envelopes for ML control functions
ML perception simulation and testing
– Use the simulation “hammer” effectively
– Simulation/test for SOTIF safety risk reduction
– Robustness testing helps understand maturity
Strategies for key risk areas:
– Machine learning: require ML to explain its actions (e.g., OEDR bins, ODD scenarios)
– Operational concepts: detect ODD violations during test & deployment
– Requirements: identify gaps in safety requirements; continuous scenario improvement
– Safety methodology: test assumptions made in a safety argument
– Societal & technical collaboration: industry consensus on understanding and continuously reducing residual risks via monitoring; how safe is safe enough?
[ECR]
Conclusions
A multi-pronged approach is required for validation
– Need rigorous engineering beyond vehicle testing/simulation
– Account for the heavy-tail distribution of scenario elements
– How does the system recognize it’s in a novel situation?
– Will the system have a safe enough response to novelty?
Unique machine learning validation challenges
– How do we create “requirements” for an ML system?
– How do we ensure that testing traces to the ML training data?
– How do we ensure adequate requirements and testing coverage for the real world?
Promising approaches:
– Safety monitor: let ML optimize behavior while guarding against the unexpected
– Robustness testing: inject faults into system building blocks to uncover defects
– In progress: comprehensive safety validation approach
[General Motors]