TRANSCRIPT
Highly Autonomous Vehicle Validation:
It’s more than just road testing!
Prof. Philip Koopman
© 2017 Edge Case Research LLC
How Do You Validate Autonomous Vehicles?
Self-driving cars are so cool! But also kind of scary
Is a billion miles of testing enough? Perhaps not, even if you could afford it
Will simulation really solve this?
What exactly are you validating? Requirements vs. implementation validation
Can you map autonomy onto ISO 26262? Why machine learning is the hard part
NREC: 30+ Years Of Cool Robots
[Timeline figure, 1985–2010: NASA Dante II, ARPA Demo II, DARPA SC-ALV, NASA Lunar Rover, Auto Excavator, Auto Harvesting, Auto Forklift, Laser Paint Removal, Army FCS, DARPA PerceptOR, DARPA LAGR, DARPA Grand Challenge, Urban Challenge, DARPA UPI, Mars Rovers, Auto Haulage, Auto Spraying, AHS Safety, Robot & AV Safety]
A Billion Miles of On-Road Testing?
Best case testing scenario: for example, 134M miles/mishap (NYC taxi data)
Assumptions (see the sketch below):
– Random independent mishap arrivals in testing
– 95% confidence of >= 134M miles/mishap
⇒ 401.4M miles of testing if no mishaps
– More likely 1B+ miles if “just as good as” humans
Significant practical issues:
– Unlikely that software fails randomly
– Reset the testing meter if software changes
– Reset the testing meter if the environment changes
– Hard to know what changes are “small” vs. “big”

# Mishaps in Testing    Total Testing Miles for 95% Confidence
0                       401M
1                       636M
2                       844M
3                       1.039B

[Fatal and critical injury data / NYC Local Law 31 of 2014]
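The table's arithmetic follows from treating mishaps as a Poisson arrival process. A minimal sketch (assuming SciPy is available; the chi-square lower confidence bound on mean miles between mishaps is the standard formula, and the 134M miles/mishap target is from the slide) reproduces the figures:

```python
from scipy import stats

MILES_PER_MISHAP = 134e6  # NYC taxi data: ~134M miles per fatal/critical-injury mishap
CONFIDENCE = 0.95

# For a Poisson (random, independent) mishap process, demonstrating
# mean miles/mishap >= m with confidence C after observing k mishaps requires
#   miles = m * chi2.ppf(C, 2k + 2) / 2
for mishaps in range(4):
    miles = MILES_PER_MISHAP * stats.chi2.ppf(CONFIDENCE, 2 * mishaps + 2) / 2
    print(f"{mishaps} mishaps -> {miles / 1e6:,.0f}M miles of testing")

# Prints 401M, 636M, 844M, 1,039M -- matching the table above.
```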
How About Designing Tests Instead?
Validate implementation: does the vehicle behave correctly?
– Traffic scenarios (e.g., aggressive human driver behavior)
– Sensor limitations (e.g., contrast, glare, clutter)
– Adverse weather (e.g., snow, fog, rain)
– Anticipated road conditions (e.g., flooding, obstacles)
Scalability is an issue – big scenario cross-product (see the sketch below)
How long to drive around to collect scenario elements? Depends on the distribution of how often they appear
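To see how fast the cross-product blows up, here is a toy sketch; every dimension and value below is a hypothetical stand-in for a real ODD/scenario taxonomy:

```python
from itertools import product
from math import prod

# Hypothetical scenario dimensions; a real taxonomy has far more of each,
# and the product grows multiplicatively with every dimension added.
dimensions = {
    "traffic": ["nominal", "aggressive driver", "jaywalker", "cyclist"],
    "sensing": ["clear", "glare", "low contrast", "clutter"],
    "weather": ["dry", "rain", "fog", "snow"],
    "road":    ["nominal", "construction", "obstacle", "flooded"],
    "light":   ["day", "dusk", "night"],
}

print("scenarios in full cross-product:", prod(map(len, dimensions.values())))
# -> 768 from just five small dimensions; realistic taxonomies explode

# Each tuple is one designed test scenario:
for scenario in list(product(*dimensions.values()))[:3]:
    print(dict(zip(dimensions, scenario)))
```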
[Photos: extreme contrast; poor visibility; road obstacles; construction; water (appears flat!)]
Distribution of Scenario Elements Matters
Vehicle safety is limited by novel “black swan” arrivals
Assume 100K miles/novel hazard seen in road testing (see the sketch below):
– Case 1: 100 novel hazards @ 10M miles/hazard for each type
– Case 2: 100,000 novel hazards @ 10B miles/hazard for each type
» Note: the US fleet averages about 8B+ miles/day
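A minimal sketch of why the two cases diverge, assuming independent exponential inter-arrival times per hazard type with the Case 1/Case 2 rates above:

```python
import math

# Both cases start out producing one never-before-seen hazard per ~100K miles,
# but their tail behavior differs drastically.
cases = {
    "Case 1": (100, 10e6),       # 100 hazard types, 10M miles/hazard each
    "Case 2": (100_000, 10e9),   # 100K hazard types, 10B miles/hazard each
}

for miles in (1e6, 100e6, 1e9):
    for name, (n_types, miles_per) in cases.items():
        # P(a given type never appears in `miles` of testing)
        p_unseen = math.exp(-miles / miles_per)
        print(f"{name} @ {miles/1e6:>6.0f}M miles: "
              f"{n_types * p_unseen:,.0f} of {n_types:,} hazard types still unseen")
```

In Case 1, every hazard type has almost certainly been seen after a billion miles; in Case 2, the sketch shows roughly 90% of hazard types are still unseen at that point.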
[Plot: random independent arrival rate (exponential) vs. power law arrival rate (80/20 rule); many infrequent scenarios; total area is the same; cross-over at 4,000 hrs]
You might not see some everyday hazards even after a billion miles of on-road testing
http://piximus.net/fun/funny-and-odd-things-spotted-on-the-road
How Do We Map To the V Model?
Machine learning learns by example
– There is no design to trace to testing
– Training data are de facto requirements
We don’t know if the design is correct
– ML behavior has an inscrutable “design”
– This is one facet of the “legibility” problem
– How do you trace an unknown design to tests?
“Black swans” depend on what it has learned
– (We’ll come back to this shortly)
Possible approach: trace tests back to a safety argument
– The left side of the V just represents safety functions/requirements
– E.g., “doesn’t hit people” rather than analyzing ML weights
– Use non-ML software to enforce those safety requirements
Safety Envelopes to Mitigate ML Risks
Strategy: a non-ML safety checker
– Enforces safety; can be developed to ISO 26262
Safety envelope:
– Specify unsafe regions for safety
– Specify safe regions for functionality
– Deal with a complex boundary via:
  » Under-approximating the safe region
  » Over-approximating the unsafe region
– An envelope transition triggers the system safety response (see the sketch below)
Partition the requirements:
– Operation: functional requirements
– Failsafe: safety requirements (safety functions)
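A minimal sketch of the envelope-plus-gate idea. The state variables, stopping-distance check, and margin are hypothetical illustrations; a real checker would cover the full operational state space and be developed to the appropriate ASIL:

```python
from dataclasses import dataclass

# Hypothetical vehicle state, purely illustrative.
@dataclass
class State:
    speed_mps: float          # current speed
    gap_m: float              # distance to nearest obstacle ahead
    max_brake_mps2: float     # guaranteed achievable deceleration

def in_unsafe_region(s: State, margin_m: float = 5.0) -> bool:
    """Over-approximate the unsafe region: flag anything within a
    conservative stopping distance plus a fixed margin. Erring toward
    'unsafe' keeps the check simple and sound at the cost of some
    availability (false alarms)."""
    stopping_dist = s.speed_mps ** 2 / (2 * s.max_brake_mps2)
    return s.gap_m < stopping_dist + margin_m

def safety_gate(s: State, ml_command: float) -> float:
    """An envelope transition triggers the system safety response:
    override the ML 'doer' output with a failsafe stop."""
    if in_unsafe_region(s):
        return -s.max_brake_mps2   # safety response: full braking
    return ml_command              # otherwise pass the ML command through
```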
[Figure: state space partitioned into a safe operating envelope and UNSAFE! regions]
Practical Application: Runtime Monitoring Doer/Checker Pair
How did we make this safe? A fail-safe safety gate via a high-ASIL checker
– Untrusted machine learning is the “doer”
– The checker is a run-time monitor
– Enforces the safety envelope via Metric Temporal Logic safety invariants
– Works best for control functions (see the sketch below)
Fail-operational doer/checker for HAV: 2CASA dual-channel architecture
– Primary pair for normal autonomy
– Secondary pair for the safing mission
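A minimal sketch of one such run-time monitor, for a bounded-response invariant of the MTL form G(trigger → F[0,d] response); the specific "obstacle close implies braking within 500 ms" property is a hypothetical example:

```python
from collections import deque

class BoundedResponseMonitor:
    """Run-time monitor for a Metric Temporal Logic bounded-response
    invariant: G(trigger -> F[0,deadline] response). Hypothetical
    invariant: 'whenever an obstacle is close, braking must begin
    within 500 ms'. Illustrative only."""

    def __init__(self, deadline_s: float = 0.5):
        self.deadline_s = deadline_s
        self.pending = deque()   # timestamps of unanswered triggers

    def step(self, t: float, obstacle_close: bool, braking: bool) -> bool:
        """Feed one time-stamped sample; return False on violation."""
        if braking:
            self.pending.clear()            # response discharges all obligations
        if obstacle_close and not braking:
            self.pending.append(t)          # new obligation with a deadline
        # A violation occurs if any obligation outlives its deadline
        return not (self.pending and t - self.pending[0] > self.deadline_s)
```

The checker channel would evaluate a set of such invariants every control cycle; any violation triggers the safing mission.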
[Diagram: low-SIL ML “doer” paired with a high-SIL simple safety envelope checker. Target GVW: 8,500 kg; target speed: 80 km/hr]
[Approved for Public Release. TACOM Case #20247, Date: 07 OCT 2009]
What About ML Legibility?
Legibility problem:
– Did the system do the right thing for the right reason?
– How can you tell what the system will do next?
– “Black swans” are in the context of the machine’s learning
– If you don’t know what it learned, what is “novel”?
Machine learning is brittle and inscrutable
– Proving results of inductive learning is tough
– Surprises in what was learned ⇒ brittleness
Problem: will the system tolerate noise?
– “Noise” is likely to affect ML systems differently than humans
Problem: did it work for the correct reason?
– If you don’t know why the system acted, was it a “test pass” or did it get lucky?
– Statistically valid testing sets will increase the number of tests, but still leave doubt
[Images: “bus” vs. adversarially perturbed “not a bus” with magnified difference (Szegedy et al., 2013); learned “dumbbell” visualization (Mordvintsev et al., 2015)]
Robustness Stress Testing (“Noise”)
Control software robustness testing
– 1990s: operating systems (Ballista)
– 2010: HAVs and robots (ASTAA)
– Switchboard for SIL & HIL tests
Machine learning robustness testing
– Sensor robustness (RIOT project); see the sketch below
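A sketch of the flavor of ML robustness testing described here: sweep an injected degradation (Gaussian blur, as pictured on the slide) and count behavior changes. The detect() function is a hypothetical stand-in; a real campaign would drive the actual perception stack with recorded sensor data:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def detect(frame: np.ndarray) -> bool:
    """Placeholder detector: real tests would call the ML stack under test."""
    return frame.max() - frame.min() > 0.5   # crude 'object contrast' proxy

rng = np.random.default_rng(0)
frames = [rng.random((64, 64)) for _ in range(100)]   # stand-in test frames

# Increase the injected fault severity and record how often the
# perception output flips relative to the clean frame.
for sigma in (0.5, 1.0, 2.0, 4.0, 8.0):
    flips = sum(detect(f) != detect(gaussian_filter(f, sigma)) for f in frames)
    print(f"blur sigma={sigma}: detection changed on {flips}/{len(frames)} frames")
```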
[Figures: synthetic environment robustness testing; synthetic equipment faults; Gaussian blur example]
[DISTRIBUTION A – NREC case number STAA-2013-10-02; DISTRIBUTION A – NREC RIOT]
Design Machine Learning for Validation
Passing a test once might be by luck
– Difficult to infer causation given only correlation
Make the ML system explain why (see the sketch below)
– Pre-define scenario element “bins” (OEDR/ODD related)
– Bins are scenario elements that are present
– Trace bins to test scenario design
– ML says which bin(s) it thinks are in play
– Did it understand which scenario it was in? Or get lucky?
Must pass the test for the right reason
– ML is forced to learn bins for scenario elements
– A separate ML goes from scenario elements to behavior
Residual risks:
– ML lies about what it sees (but we can catch it in the act)
– The ML black-box pair learns a covert communication channel
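A sketch of what a bin-aware test oracle might look like; the names and the PASS/SUSPECT/FAIL policy are hypothetical illustrations of "passing for the right reason":

```python
from dataclasses import dataclass

@dataclass
class TestScenario:
    designed_bins: frozenset    # scenario elements built into this test
    expected_behavior: str      # e.g., "yield", "stop", "proceed"

def run_test(scenario: TestScenario, perceive, behave) -> str:
    """A test passes only if the behavior is correct AND the ML's
    declared scenario bins match the bins designed into the test."""
    declared_bins = perceive(scenario)    # ML #1: which bins are in play?
    behavior = behave(declared_bins)      # ML #2: bins -> behavior
    behavior_ok = behavior == scenario.expected_behavior
    bins_ok = declared_bins == scenario.designed_bins
    if behavior_ok and bins_ok:
        return "PASS"                     # right answer, right reason
    if behavior_ok:
        return "SUSPECT"                  # right answer, wrong reason: luck?
    return "FAIL"
```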
Simulation As Risk Reduction
Initial simulation set-up: use higher realism levels to validate simulation accuracy
Simulation used for AV validation:
– Push down to low-fidelity simulations for brute-force coverage
– Identify residual risks at each level:
  » Relevant simplifications
  » Simulation assumptions
– Higher fidelity tests the assumptions
Why are you simulating/testing? (see the sketch below)
– On-road: to discover requirement gaps
– Mid-level: to mitigate the risks of believing simulations
– Reduced-order models/low level: to get coverage
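One way to make this residual-risk bookkeeping explicit; the level names and risk descriptions are hypothetical examples, not a prescribed taxonomy:

```python
# Hypothetical encoding of the fidelity pyramid: each level states why it
# exists and which residual risk it leaves for a higher-fidelity level.
SIMULATION_LEVELS = [
    {"level": "reduced-order models", "purpose": "brute-force coverage",
     "residual_risk": "model simplifications miss real dynamics"},
    {"level": "full vehicle simulation", "purpose": "end-to-end scenario checks",
     "residual_risk": "simulation assumptions differ from reality"},
    {"level": "closed-course testing", "purpose": "validate simulation accuracy",
     "residual_risk": "course cannot cover the full ODD"},
    {"level": "on-road testing", "purpose": "discover requirement gaps",
     "residual_risk": "rare scenario elements remain unseen"},
]

# Each level's residual risk should be explicitly tested at the next
# higher fidelity level rather than silently assumed away.
for lower, higher in zip(SIMULATION_LEVELS, SIMULATION_LEVELS[1:]):
    print(f"{higher['level']} must test: {lower['residual_risk']}")
```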
Key Principles of Pragmatic HAV Validation
Traditional safety has its place
– Traditional functionality should be ISO 26262
– Safety envelopes for ML control functions
ML perception simulation and testing
– Use the simulation “hammer” effectively
– Simulation/test for SOTIF safety risk reduction
– Robustness testing helps understand maturity
Strategies for key risk areas:
– Machine learning: require ML to explain its actions (e.g., OEDR bins, ODD scenarios)
– Operational concepts: detect ODD violations during test & deployment
– Requirements: identify gaps in safety requirements; continuous scenario improvement
– Safety methodology: test assumptions made in a safety argument
– Societal & technical collaboration: industry consensus on understanding and continuously reducing residual risks via monitoring; how safe is safe enough?
[ECR]
Conclusions
A multi-pronged approach is required for validation
– Need rigorous engineering beyond vehicle testing/simulation
– Account for the heavy-tail distribution of scenario elements
– How does the system recognize it’s in a novel situation?
– Will the system have a safe enough response to novelty?
Unique machine learning validation challenges
– How do we create “requirements” for an ML system?
– How do we ensure that testing traces to the ML training data?
– How do we ensure adequate requirements and testing coverage for the real world?
Promising approaches:
– Safety monitor: let ML optimize behavior while guarding against the unexpected
– Robustness testing: inject faults into system building blocks to uncover defects
– In progress: comprehensive safety validation approach
[General Motors]