Functional Safety in ML-based Cyber-Physical Systems
Practical and Scalable Verification
Lionel Briand
SEMLA, 2019, Montreal
Research Team
• Researchers: Donghwan Shin, Fabrizio Pastore, Shiva Nejati, Annibale Panichella, Raja Ben Abdessalem
• PhD students: Fitash Ul Haq
Cyber-Physical Systems
• A system of collaborating computational elements controlling physical entities
• Increasingly autonomous
• ML-based, e.g., DNNs in the perception layer
• Not just recommendations: they act on the physical environment
• Safety implications
Autonomous Driving

Advanced Driver Assistance Systems (ADAS):
• Automated Emergency Braking (AEB)
• Pedestrian Protection (PP)
• Lane Departure Warning (LDW)
• Traffic Sign Recognition (TSR)

Decisions are made over time based on sensor data: sensors/cameras observe the environment, a controller makes decisions, and actuators act back on the environment (closed loop).
Motivation and Goal
• Deep Neural Networks (DNNs) are increasingly used to capture and drive (part of) the behavior of ADAS.
• Autonomous driving: control steering angles based on images of the car's field of view.
• Certification: scalable and effective ways to systematically verify the safety of these DNN-based systems.
• Note: in software engineering, we are concerned with DNN-based systems, not individual DNN components.
CPS Development Process
• System engineering modeling (SysML): architecture modeling (structure, behavior, traceability); analysis via model execution and testing, model-based testing, traceability and change impact analysis, ...
• Model-in-the-Loop stage: functional modeling of controllers, plant, and decision logic with continuous and discrete Simulink models; model simulation and testing.
• Software-in-the-Loop stage: (partial) code generation from the models.
• Hardware-in-the-Loop stage: deployed executables on the target platform, with hardware (sensors, ...) and analog simulators; testing at this stage is expensive.
Formal Verification
• Formal verification: exhaustive, with deterministic or statistical guarantees.
• Requires models of the system, environment, and properties in formalisms for which there are efficient decision procedures [12].
• Challenges: undecidability, assumptions, scalability [13].
• Not even a remote possibility for DNN components in the foreseeable future.

Seshia et al., Towards Verified AI [12]
Testing
• We focus on falsification (testing) and explanation:
  • automated testing with a focus on safety violations;
  • explanation of safety violations → risk assessment.
• Testing takes place at different phases of CPS development: MiL, SiL, HiL.
• We do not test DNN components, but DNN-based systems that include DNN components.
• Testing does not prove that safety requirements are satisfied: it finds safety violations or provides assurance cases (evidence).
Standards

ISO 26262
• Defines vehicle safety as the absence of unreasonable risks arising from malfunctions of the system.
• Requires safety goals, necessary to mitigate the risks.
• Provides requirements and recommendations to avoid and control random hardware failures and systematic system failures that could violate safety goals.
Testing in ISO 26262
• Several recommendations for testing at the unit and system levels, e.g., different structural coverage metrics, black-box testing.
• However, such testing practices are not adequate for DNN-based ADAS:
  • The input space of ADAS is much larger than that of traditional automotive systems.
  • No (complete) system specifications, especially for DNN components.
  • ADAS may fail without a systematic fault, e.g., due to inherent limitations or incomplete training.
  • Imperfect environment simulators.
  • Traditional testing notions (e.g., coverage) are not well defined for DNN components.
DNN Coverage
• Goal: assess testing "completeness" and guide test generation.
• Neuron coverage: counts activated neurons over total neurons [1].
• Variants of neuron coverage: count activated neurons under certain conditions, e.g., k-multisection neuron coverage [5] and t-way combination sparse coverage [8].
• Surprise adequacy: measures the diversity of test inputs using continuous neuron activation values [6].
• Not focused on effectively finding and explaining system safety violations.
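To make the neuron-coverage idea [1] concrete, here is a minimal sketch: a neuron counts as "covered" if its activation exceeds a threshold on at least one test input. The tiny two-layer ReLU network and its weights are hypothetical stand-ins for a real DNN.

```python
# Illustrative neuron-coverage computation in the spirit of [1]: a neuron is
# "covered" if its activation exceeds a threshold on any test input.
# The two-layer network below uses hypothetical weights, not a real ADAS DNN.

def relu(v):
    return [max(0.0, x) for x in v]

def forward(weights, x):
    """Return the activation vector of each layer for input x."""
    activations = []
    for layer in weights:                      # layer: list of neuron weight vectors
        x = relu([sum(w * xi for w, xi in zip(neuron, x)) for neuron in layer])
        activations.append(x)
    return activations

def neuron_coverage(weights, test_inputs, threshold=0.0):
    covered = set()
    total = sum(len(layer) for layer in weights)
    for x in test_inputs:
        for li, layer_act in enumerate(forward(weights, x)):
            for ni, a in enumerate(layer_act):
                if a > threshold:
                    covered.add((li, ni))
    return len(covered) / total

weights = [[[1.0, -1.0], [-1.0, 1.0]],        # layer 1: 2 neurons
           [[1.0, 1.0]]]                      # layer 2: 1 neuron
tests = [[1.0, 0.0], [0.0, 1.0]]
print(neuron_coverage(weights, tests))        # → 1.0 (every neuron fires on some test)
```

As the slide notes, a high coverage score says nothing about whether system-level safety violations were found.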
SOTIF
• ISO/PAS standard: Safety of the Intended Functionality (SOTIF) [9].
• Autonomy: huge increase in functionalities relying on advanced sensing, algorithms (ML), and actuation.
• SOTIF accounts for limitations and risks related to the nominal performance of sensors and software:
  • the inability of the function to correctly comprehend the situation and operate safely, including functions that use machine learning algorithms;
  • insufficient robustness of the function with respect to sensor input variations or diverse environmental conditions.
SOTIF: Scenes and Scenarios
• Domain conceptual model (terminology) [9]
• Temporal view of scenes, events, actions, and situations in a scenario
Verification in SOTIF
• Decision algorithm verification: “… ability of the decision-algorithm to react when required and its ability to avoid unwanted action.”
• Integrated system verification: “Methods to verify the robustness and the controllability of the system integrated into the vehicle …”
Decision Algorithm Verification

Table 6 — Decision algorithm verification methods [9]:
A. Verification of robustness to interference from other sources, e.g., white noise, audio frequencies, Signal-to-Noise Ratio degradation (e.g., by noise injection testing)
B. Requirement-based test (e.g., classification, sensor data fusion, situation analysis, function)
C. Verification of the architectural properties, including independence, if applicable
D. In-the-loop testing (e.g., SiL/HiL/MiL) on selected SOTIF-relevant use cases and scenarios
E. Vehicle-level testing on selected SOTIF-relevant use cases and scenarios
F. Injection of inputs into the system that trigger potentially hazardous behaviour
NOTE: For test case derivation, the method of combinatorial testing can be used.
10.4 Actuation verification
Methods to verify the actuators for their intended use and for their reasonably foreseeable misuse in the decision algorithm can be applied as illustrated by Table 7.

Table 7 — Actuation verification methods [9]:
A. Requirements-based test (e.g., precision, resolution, timing constraints, bandwidth)
B. Verification of actuator characteristics when integrated within the vehicle environment
C. Actuator test under different environmental conditions (e.g., cold conditions, damp conditions)
D. Actuator test between different preload conditions (e.g., change from medium to maximum load)
E. Verification of actuator ageing effects (e.g., accelerated life testing)
F. In-the-loop testing (e.g., SiL/HiL/MiL) on selected SOTIF-relevant use cases and scenarios
G. Vehicle-level testing on selected SOTIF-relevant use cases and scenarios

10.5 Integrated system verification
Methods to verify the robustness and the controllability of the system integrated into the vehicle can be applied as illustrated by Table 8.
Integrated System Verification

Table 8 — Integrated system verification methods [9]:
A. Verification of robustness to Signal-to-Noise Ratio degradation (e.g., by noise injection testing)
B. Requirement-based test when integrated within the vehicle environment (e.g., range, precision, resolution, timing constraints, bandwidth)
C. In-the-loop testing (e.g., SiL/HiL/MiL) on selected SOTIF-relevant use cases and scenarios
D. System test under different environmental conditions (e.g., cold, damp, light, visibility conditions)
E. Verification of system ageing effects (e.g., accelerated life testing)
F. Randomized input tests (a)
G. Vehicle-level testing on selected SOTIF-relevant use cases and scenarios
H. Controllability tests (including reasonably foreseeable misuse)
(a) Randomized input tests can include erroneous patterns, e.g., in the case of image sensors, adding flipped images or altered image patches; or, in the case of radar sensors, adding ghost targets to simulate multi-path returns.

Annex D provides examples for the verification of perception systems.
11 Validation of the SOTIF (Area 3)

11.1 Objectives
The functions of the system and the components (sensors, decision algorithms, and actuators) shall be validated to show that they do not cause an unreasonable level of risk in real-life use cases (see Area 3 of Figure 9). This requires evidence that the validation targets are met.
To support the achievement of this objective, the following information can be considered:
- Validation strategy, as defined in Clause 9;
- Verification results in defined use cases, as defined in Clause 10;
- Functional concept, including sensor, actuator, and decision-algorithm specifications;
- System design specification;
- Validation targets, as defined in Clause 6;
- Vehicle design (e.g., sensor mounting position); and
- Analysis of triggering-event results, as described in Clause 7.2.

11.2 Evaluation of residual risk
Methods to evaluate the residual risk arising from real-life situations that could trigger hazardous behaviour of the system when integrated in the vehicle can be applied as illustrated by Table 9.
How to perform such testing efficiently and effectively?

Research is needed.
Existing Work
• DNN verification can focus on different kinds of errors:
  • Type 1: adversarial examples
  • Type 2: failures from single-image test inputs
  • Type 3: failures from self-driving scenario test inputs
  • Type 4: unexpected situations
Type 1: Adversarial Examples
• Szegedy et al. [7] first exposed an intriguing weakness of DNNs in the context of image classification:
• applying an imperceptible perturbation to a test image can arbitrarily change the DNN's prediction.

Adversarial example due to noise [4]
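The gradient-sign attack of [4] perturbs each input feature along the sign of the loss gradient. A minimal sketch on a linear classifier (hypothetical weights; a real attack would differentiate through the DNN):

```python
# A minimal FGSM-style sketch in the spirit of [4], on a linear classifier
# rather than a DNN. For a linear score s = w·x, the gradient of s w.r.t. x
# is simply w, so perturbing each feature by -eps * sign(w_i) pushes a
# positively classified input toward the decision boundary.

def sign(v):
    return 1.0 if v > 0 else (-1.0 if v < 0 else 0.0)

def predict(w, x):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) > 0 else 0

def fgsm_perturb(w, x, y, eps):
    """Move x against its current class y along the gradient sign."""
    direction = -1.0 if y == 1 else 1.0
    return [xi + direction * eps * sign(wi) for wi, xi in zip(w, x)]

w = [0.5, -0.3, 0.2]                     # hypothetical model weights
x = [0.2, -0.1, 0.1]                     # score 0.15 → classified as 1
x_adv = fgsm_perturb(w, x, y=1, eps=0.2)

print(predict(w, x), predict(w, x_adv))  # → 1 0: a small perturbation flips it
```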
Type 1: Adversarial Examples
• Recently, both digital and physical adversarial examples have been studied, no longer limited to "imperceptible" perturbations; these can be interpreted as attacks.

Adversarial billboard example (physical attack) [3]
Type 2: Single-Image Test Inputs
• Several works have been proposed in the context of ADAS, where test inputs are generated by applying label-preserving changes to existing, already-labeled data [1, 2].

Original image vs. test image (generated by adding fog)

• While those works focus more on safety by considering real-world changes, the analysis is still limited to label-preserving changes on a single image.

Snowy and rainy scenes synthesized by a Generative Adversarial Network (GAN) [2]
State of the Art: Summary
• Most existing research focuses on:
  • Type 1 and 2 errors
  • label-preserving changes to images
  • Limited in terms of searching for safety violations.
• Further, existing research does not account for the impact of object dynamics (e.g., car speed) in different scenarios (e.g., specific road configurations).
• Not compliant with SOTIF requirements: in-the-loop testing of "relevant" scenarios in different environmental conditions.
Type 3: Scenario Test Inputs
• SOTIF scenarios: sequences of scenes (e.g., images) that cause some ADAS safety requirements to be violated.
• Account for the impact of car dynamics (e.g., speed) on different scenarios (e.g., specific road configurations).

Example: the same road driven at 20 km/h (Scenario 1) yields no misclassification, while at 40 km/h (Scenario 2) a misclassification occurs.
Type 4: Unexpected Situations
• Unusual and rare scenarios may not be included in the training sets used for DNNs.
• For example, hardware failures, such as a broken camera sensor, may cause unexpected situations.
• It is important to assess the robustness of DNN-based systems to plausible but rare situations.
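One way to exercise such situations is fault injection at the sensor level. The sketch below is a hypothetical example (names, frame encoding, and threshold are all illustrative): it blanks out rows of a camera frame to mimic a partially broken sensor and checks whether a simple plausibility monitor flags the degraded frame.

```python
# Illustrative sensor-fault injection: blank out rows of a camera frame to
# mimic a partially dead image sensor, then check whether a simple
# plausibility monitor flags the frame. All names and thresholds here are
# hypothetical, not part of any actual ADAS implementation.

def inject_dead_rows(frame, first_row, last_row):
    """Return a copy of the frame with rows [first_row, last_row] zeroed."""
    return [[0.0] * len(row) if first_row <= i <= last_row else list(row)
            for i, row in enumerate(frame)]

def frame_plausible(frame, min_mean=0.2):
    """Flag frames whose mean intensity is implausibly low (e.g., dead sensor)."""
    pixels = [p for row in frame for p in row]
    return sum(pixels) / len(pixels) >= min_mean

frame = [[0.5] * 4 for _ in range(4)]         # healthy 4x4 frame
faulty = inject_dead_rows(frame, 0, 2)        # 3 of 4 rows dead

print(frame_plausible(frame))   # → True
print(frame_plausible(faulty))  # → False
```

A DNN-based system robust to rare situations would need to either tolerate such inputs or detect them and degrade gracefully.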
How to interpret such results for risk analysis?
Explanations for Misclassifications
• Based on visual heatmaps: colors capture the extent to which different features contribute to the misclassification of the input.
• State of the art:
  • black-box techniques: perturbations of the input image [14, 15]
  • white-box techniques:
    • backward propagation of the prediction score [16, 17]
    • smallest image portion correctly classified [18]

(Example: image misclassified as cow)
What we really need
• Software engineering needs are different:
  • Associate scenario (SOTIF) characteristics with the probability of safety violations.
  • Assess risk (probability × loss).
  • Generate more training examples in high-risk areas.
  • Reduce and increasingly better characterize risk.

So what do we do?
Motivation and Goal
• How to gain confidence in the safety of these DNN-based systems? (risk analysis based on assurance cases)
• Goal: scalable and practical techniques to help verify DNN-based systems and demonstrate their safety, according to standards:
  • Effectively finding scenarios violating safety requirements.
  • Explaining violations: characterizing the conditions leading to them.
  • Assessing risks.
Example
Testing via Physics-based Simulation (MiL, SiL)
• The ADAS (system under test) is exercised in a simulator (Matlab/Simulink) modeling:
  • the physical plant (vehicle, sensors, actuators)
  • other cars
  • pedestrians
  • the environment (weather, roads, traffic signs)
• A test input configures the simulation; the test output is the time-stamped simulator output.
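The execution loop above can be sketched as follows. The kinematics and fitness function here are hypothetical stand-ins for a physics-based Simulink co-simulation: a scenario is simulated step by step, and a fitness function summarizes how close the run came to a safety violation (here, minimum car-pedestrian distance).

```python
# Illustrative simulation-based test execution. The simple kinematics below
# stand in for a physics-based simulator; scenario fields are hypothetical.

def simulate(scenario, time_step=0.1, duration=5.0):
    """Return a time-stamped trace of car/pedestrian positions."""
    trace = []
    car_x, ped_x, ped_y = 0.0, scenario["ped_x0"], scenario["ped_y0"]
    t = 0.0
    while t <= duration:
        trace.append((t, car_x, ped_x, ped_y))
        car_x += scenario["car_speed"] * time_step   # car drives along x
        ped_y -= scenario["ped_speed"] * time_step   # pedestrian crosses road
        t += time_step
    return trace

def fitness(trace):
    """Minimum car-pedestrian distance; lower = closer to a safety violation."""
    return min(((cx - px) ** 2 + py ** 2) ** 0.5 for _, cx, px, py in trace)

scenario = {"car_speed": 10.0, "ped_x0": 30.0, "ped_y0": 3.0, "ped_speed": 1.0}
print(round(fitness(simulate(scenario)), 2))  # → 0.0: this run is a near-collision
```

Each fitness evaluation requires a full simulation run, which is why, in the real setting, evaluations are expensive.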
Domain Model
• A test scenario fixes the simulation time and time step and instantiates:
  • Environment inputs: Weather (weatherType: fog, rain, snow, normal), SceneLight (intensity), Road (roadType: curved, straight, ramped), and roadside objects (parked cars, trees).
  • Mobile object inputs: a Vehicle (initial speed v0) with its Camera Sensor (field of view) and AEB, and a Pedestrian (initial position x0, y0, orientation θ, speed v0); dynamic objects have a Position (x, y).
  • Outputs: Detection (certainty), Collision (state), AWA, and the output trajectory.
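A hypothetical encoding of this domain model as test-input data structures might look as follows. Field names and values mirror the classes on the slide; they are illustrative, not the authors' actual implementation.

```python
# Hypothetical encoding of the ADAS test-scenario domain model.
from dataclasses import dataclass

@dataclass
class Pedestrian:
    x0: float        # initial position (m)
    y0: float
    theta: float     # initial orientation (degrees)
    v0: float        # initial speed (km/h)

@dataclass
class TestScenario:
    simulation_time: float   # s
    time_step: float         # s
    weather: str             # one of: fog, rain, snow, normal
    road_type: str           # one of: curved, straight, ramped
    vehicle_v0: float        # initial car speed (km/h)
    pedestrian: Pedestrian

scenario = TestScenario(simulation_time=10.0, time_step=0.1,
                        weather="normal", road_type="straight",
                        vehicle_v0=40.0,
                        pedestrian=Pedestrian(x0=30.0, y0=3.0, theta=218.6, v0=7.2))
print(scenario.weather, scenario.pedestrian.v0)  # → normal 7.2
```

A concrete test input is then one fully instantiated `TestScenario`, and the search space is the cross product of all field ranges.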
ADAS Testing Challenges
• The test input space is multidimensional, large, and complex.
• Explaining failures and localizing faults are difficult.
• Executing physics-based simulation models is computationally expensive.
Search Guided by Classification
• Inputs: input data ranges/dependencies + simulator + fitness functions.
• Test input generation (NSGA-II): build a classification tree, select/generate tests in the fittest regions, apply genetic operators.
• Evaluating test inputs: simulate every (candidate) test and compute the fitness functions; the fitness values feed back into test generation.
• Output: test cases revealing worst-case system behaviors + a characterization of critical input regions [10].
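A much-simplified sketch of such classification-guided search: candidate scenarios are bred with genetic operators, and a "critical region" predicate (standing in for a decision-tree path condition learned from past runs) focuses sampling where violations are likely. The fitness function, region predicate, and parameters are hypothetical; the real approach [10] uses NSGA-II with simulation-based fitness.

```python
# Simplified sketch of search guided by classification. Fitness here is a
# cheap hypothetical surrogate for an expensive simulation run.
import random

def fitness(speed, angle):
    """Lower = more critical (hypothetical surrogate for a simulation)."""
    return abs(speed - 35.0) + abs(angle - 200.0) / 10.0

def in_critical_region(speed, angle):
    """Stand-in for a learned decision-tree path condition."""
    return speed >= 7.2 and angle < 218.6

def search(generations=30, pop_size=20, seed=1):
    rng = random.Random(seed)
    pop = [(rng.uniform(0, 60), rng.uniform(0, 360)) for _ in range(pop_size)]
    for _ in range(generations):
        # prefer parents inside the learned critical region
        parents = [p for p in pop if in_critical_region(*p)] or pop
        children = [(max(0.0, s + rng.gauss(0, 2)), max(0.0, a + rng.gauss(0, 10)))
                    for s, a in (rng.choice(parents) for _ in range(pop_size))]
        # (mu + lambda) survival by fitness
        pop = sorted(pop + children, key=lambda p: fitness(*p))[:pop_size]
    return min(pop, key=lambda p: fitness(*p))

best = search()
print(round(fitness(*best), 3))   # close to 0: a worst-case scenario was found
```

The design point is that the classifier cheaply steers the expensive fitness evaluations toward regions already known to contain critical behavior.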
Decision Trees
• Partition the scenario input space into homogeneous regions. Example tree (1200 points overall, 21% "critical"):
  • Root split on pedestrian speed: vp0 ≥ 7.2 km/h (564 points, 41% critical) vs. vp0 < 7.2 km/h (636 points, 2% critical).
  • The vp0 ≥ 7.2 km/h branch splits on pedestrian orientation: θp0 < 218.6° (412 points, 51% critical) vs. θp0 ≥ 218.6° (152 points, 16% critical).
  • The θp0 < 218.6° branch splits on road topology: RoadTopology(CR = 5, Straight, RH = [4, 12] m) (230 points, 69% critical — the high-risk area) vs. RoadTopology(CR = [10, 40] m) (182 points, 28% critical).
Genetic Evolution Guided by Classification
• Loop: initial input → fitness computation → classification → selection → breeding → (repeat).

Better Guidance
• Fitness computations rely on simulations and are very expensive.
• Classification trees help provide better guidance.
Explaining Risk
• In the same decision tree over the scenario space, the path condition to the 69%-critical leaf (vp0 ≥ 7.2 km/h ∧ θp0 < 218.6° ∧ RoadTopology(CR = 5, Straight, RH = [4, 12] m)) characterizes the high-risk area.
Generated Decision Trees
47
Goo
dnes
sOfFit
Reg
ionS
ize
1 5 642 30.40
0.50
0.60
0.70
tree generations
(b) 0.80
71 5 642 30.00
0.05
0.10
0.15
tree generations
(a) 0.20
7
Goo
dnes
sOfFit-crt
1 5 642 3
0.30
0.50
0.70
tree generations
(c) 0.90
7
The generated critical regions consistently become smaller, more homogeneous and more precise over successive tree generations
Matching our Needs
• Find safety violations in the entire scenario space.
• Associate scenario characteristics with the probability of safety violations: path conditions.
• Assess risk (probability × loss): leaf probability estimates.
• Automatically generate more training examples in high-risk areas: iterations generate more examples in targeted leaves and re-train the decision tree.
• Increasingly better characterize risk: iteratively refining decision trees.
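The "probability × loss" step can be illustrated directly from the tree's leaves: each leaf gives an estimated violation probability, and ranking leaves by risk tells us where to re-train and generate more tests. Leaf statistics below are the slide's example tree; the path-condition strings and the loss value are hypothetical.

```python
# Illustrative risk assessment from decision-tree leaves: risk = P(critical) x loss.
# Leaf counts/probabilities follow the slides' example tree; loss is hypothetical.

# (path condition, count, estimated P(critical))
leaves = [
    ("vp0 >= 7.2 and theta_p0 < 218.6 and straight_road", 230, 0.69),
    ("vp0 >= 7.2 and theta_p0 < 218.6 and curved_road",   182, 0.28),
    ("vp0 >= 7.2 and theta_p0 >= 218.6",                  152, 0.16),
    ("vp0 < 7.2",                                         636, 0.02),
]

LOSS = 1000.0   # hypothetical cost of a safety violation

def risk(prob, loss=LOSS):
    return prob * loss

ranked = sorted(leaves, key=lambda leaf: risk(leaf[2]), reverse=True)
for cond, count, p in ranked:
    print(f"risk={risk(p):7.1f}  n={count:4d}  {cond}")
# The highest-risk leaf (risk = 690.0) is the one whose path condition should
# drive re-training and further test generation.
```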
But what happens when many ADAS features concurrently run on a car?
Feature Interactions
• Multiple autonomous features run in parallel, each mapping sensor/camera data to an actuator command (steering, acceleration, braking).
• Their actuator commands can conflict.

Integration Component
• An integration component sits between the features and the actuators: it receives the actuator command proposed by each autonomous feature and issues a single, resolved actuator command.
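A hypothetical sketch of such an integration rule: each feature proposes an actuator command, and the integration component resolves conflicts with a fixed priority (safety-critical braking first). The rule set and names are illustrative, not the actual integration logic tested in [11].

```python
# Hypothetical integration-component sketch: resolve conflicting actuator
# commands from parallel features by fixed priority, braking first.

PRIORITY = {"brake": 0, "steer": 1, "accelerate": 2}   # lower = higher priority

def integrate(commands):
    """commands: list of (feature, action, magnitude); return the winning one."""
    return min(commands, key=lambda c: (PRIORITY[c[1]], -c[2]))

commands = [
    ("AEB", "brake", 0.9),          # emergency braking
    ("ACC", "accelerate", 0.3),     # cruise control wants to speed up
    ("LKA", "steer", 0.1),          # lane keeping nudges the wheel
]
print(integrate(commands))  # → ('AEB', 'brake', 0.9): braking overrides the rest
```

Undesirable feature interactions arise precisely when such resolution rules pick the wrong command in some scenario, which is what the search in [11] tries to expose.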
Goals
• Automatically test for undesirable feature interactions [11]
• Automatically repair the integration component
Other CPS Domains
• Similar ISO standards exist in other CPS domains.
• With the rise of DNNs, these domains will likely develop SOTIF-like standards.
• Most CPS are safety critical.
• Most CPS follow similar development phases.
• Different domain models, e.g., the notions of scenes and scenarios.
• Different, dedicated simulators.
Conclusions
• Research must be driven by SE needs, e.g., safety certification needs, which differ from those of DNN research.
• Assess safety risks; develop assurance cases.
• Research cannot only focus on verifying DNN components, but must address DNN-based systems.
• Research cannot only focus on single system features, as DNN-based systems include DNN components interacting through shared sensors and actuators.
• Scalability of verification solutions is paramount.
References
[1] Tian, Yuchi, et al. "DeepTest: Automated testing of deep-neural-network-driven autonomous cars." Proceedings of the 40th International Conference on Software Engineering. ACM, 2018.
[2] Zhang, Mengshi, et al. "DeepRoad: GAN-based metamorphic autonomous driving system testing." arXiv preprint arXiv:1802.02295 (2018).
[3] Zhou, Husheng, et al. "DeepBillboard: Systematic physical-world testing of autonomous driving systems." arXiv preprint arXiv:1812.10812 (2018).
[4] Goodfellow, Ian J., Jonathon Shlens, and Christian Szegedy. "Explaining and harnessing adversarial examples." arXiv preprint arXiv:1412.6572 (2014).
[5] Ma, Lei, et al. "DeepGauge: Multi-granularity testing criteria for deep learning systems." Proceedings of the 33rd ACM/IEEE International Conference on Automated Software Engineering. ACM, 2018.
[6] Kim, Jinhan, Robert Feldt, and Shin Yoo. "Guiding deep learning system testing using surprise adequacy." arXiv preprint arXiv:1808.08444 (2018).
[7] Szegedy, Christian, et al. "Intriguing properties of neural networks." arXiv preprint arXiv:1312.6199 (2013).
[8] Ma, Lei, et al. "DeepCT: Tomographic combinatorial testing for deep learning systems." 2019 IEEE 26th International Conference on Software Analysis, Evolution and Reengineering (SANER). IEEE, 2019.
[9] SOTIF, ISO/PAS 21448, https://www.iso.org/standard/70939.html
[10] Ben Abdessalem et al. "Testing vision-based control systems using learnable evolutionary algorithms." ICSE 2018.
[11] Ben Abdessalem et al. "Testing autonomous cars for feature interaction failures using many-objective search." ASE 2018.
[12] Seshia et al. "Towards Verified Artificial Intelligence." arXiv preprint.
[13] Nejati et al. "Evaluating model testing and model checking for finding requirements violations in Simulink models." arXiv:1905.03490, 2019.
[14] Dabkowski et al. "Real time image saliency for black box classifiers." arXiv:1705.07857, 2017.
[15] Petsiuk et al. "RISE: Randomized input sampling for explanation of black-box models." 29th British Machine Vision Conference, arXiv:1806.07421, 2018.
[16] Selvaraju et al. "Grad-CAM: Visual explanations from deep networks via gradient-based localization." IEEE International Conference on Computer Vision, 2017.
[17] Montavon et al. "Methods for interpreting and understanding deep neural networks." Digital Signal Processing, 2017.
[18] Du et al. "Towards explanation of DNN-based prediction with guided feature inversion." Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD '18). ACM, 2018.