© 2013 carnegie mellon university analytical architecture fault models software engineering...
TRANSCRIPT
© 2013 Carnegie Mellon University
Analytical Architecture Fault Models
Software Engineering InstituteCarnegie Mellon UniversityPittsburgh, PA 15213
Peter H. FeilerJan 24, 2013
2
Analytical Architecture Fault ModelsFeiler, Jan 24, 2013
© 2013 Carnegie Mellon University
Copyright 2012 Carnegie Mellon University and IEEE
This material is based upon work funded and supported by the Department of Defense under Contract No. FA8721-05-C-0003 with Carnegie Mellon University for the operation of the Software Engineering Institute, a federally funded research and development center.
NO WARRANTY. THIS CARNEGIE MELLON UNIVERSITY AND SOFTWARE ENGINEERING INSTITUTE MATERIAL IS FURNISHED ON AN “AS-IS” BASIS. CARNEGIE MELLON UNIVERSITY MAKES NO WARRANTIES OF ANY KIND, EITHER EXPRESSED OR IMPLIED, AS TO ANY MATTER INCLUDING, BUT NOT LIMITED TO, WARRANTY OF FITNESS FOR PURPOSE OR MERCHANTABILITY, EXCLUSIVITY, OR RESULTS OBTAINED FROM USE OF THE MATERIAL. CARNEGIE MELLON UNIVERSITY DOES NOT MAKE ANY WARRANTY OF ANY KIND WITH RESPECT TO FREEDOM FROM PATENT, TRADEMARK, OR COPYRIGHT INFRINGEMENT.
This material has been approved for public release and unlimited distribution.
This material may be reproduced in its entirety, without modification, and freely distributed in written or electronic form without requesting formal permission. Permission is required for any other use. Requests for permission should be directed to the Software Engineering Institute at [email protected].
DM-0000087
3
Analytical Architecture Fault ModelsFeiler, Jan 24, 2013
© 2013 Carnegie Mellon University
Outline
SAE AADL and the Error Model AnnexHazards & Fault Impact AnalysisComponent Error BehaviorCompositional AnalysisOperational and Failure ModesThe Intricacies of Desired Timing BehaviorSummary and Conclusion
4
Analytical Architecture Fault ModelsFeiler, Jan 24, 2013
© 2013 Carnegie Mellon University
Mismatched Assumptions in Embedded SWSystem
EngineerControl
Engineer
System
Under Contro
l
Control
System
Physical Plant CharacteristicsLag, proximity
Operator ErrorAutomation & human actions
Sy
ste
m
Us
er/
En
vir
on
me
nt
HazardsImpact of
system failures
Ap
plic
atio
n
De
velo
perComput
ePlatform
RuntimeArchitectur
e
Application
Software
Embedded SW System Engineer
Data Stream Characteristics
ETE Latency (F16)State delta (NASA)
Measurement UnitsAriane 4/5Air Canada
Concurrency Communication
ITunes crashes on dual-cores
Distribution & RedundancyVirtualization of HW
(ARPA-Net split)
Hardware
Engineer
Why do system level failures still occur despite fault tolerance techniques being deployed in systems?
5
Analytical Architecture Fault ModelsFeiler, Jan 24, 2013
© 2013 Carnegie Mellon University
SAE Architecture Analysis & Design Language (AADL) for Software-reliant Systems
The Computer System
The System
Computer SystemHardware & OS
Physical platformAircraft
ControlGuidance
Deployed onUtilizes
Physical interfacePlatform component
AADL focuses on interaction between the three elements of a software-intensive system based on architectural abstractions of each.
Embedded Application Software
Flight control & MissionThe Software
Focus on software runtime architecture
6
Analytical Architecture Fault ModelsFeiler, Jan 24, 2013
© 2013 Carnegie Mellon University
System Level Fault Root Causes
Violation of data stream assumptions• Stream miss rates, Mismatched data representation, Latency jitter & age
Partitions as Isolation Regions• Space, time, and bandwidth partitioning• Isolation not guaranteed due to undocumented resource sharing• fault containment, security levels, safety levels, distribution
Virtualization of time & resources• Logical vs. physical redundancy• Time stamping of data & asynchronous systems
Inconsistent System States & Interactions• Modal systems with modal components• Concurrency & redundancy management• Application level interaction protocols
Performance impedance mismatches• Processor, memory & network resources• Compositional & replacement performance mismatches• Unmanaged computer system resources
Fault propagation Security analysis
Architectural redundancy patterns
End-to-end latency analysisPort connection consistency
Partitioned architecture modelsModel compliance
Resource budget analysis & task roll-up analysisResource allocation &
deployment configurations
Virtual processors & busesSynchronization domains
7
Analytical Architecture Fault ModelsFeiler, Jan 24, 2013
© 2013 Carnegie Mellon University
Architecture-Centric Modeling Approach
Security• Intrusion
• Integrity
• Confidentiality
Safety & Reliability
• MTBF
• FMEA
• Hazard analysis
Real-timePerformance• Execution time/
Deadline
• Deadlock/starvation
• Latency
ResourceConsumption• Bandwidth
• CPU time
• Power consumption
• Data precision/accuracy
• Temporal correctness
• Confidence
Data Quality
Architecture Model
Single Annotated Architecture Model Addresses Impact Across Non-Functional Properties
Auto-generated analytical models
8
Analytical Architecture Fault ModelsFeiler, Jan 24, 2013
© 2013 Carnegie Mellon University
AADL Error Model Scope and Purpose
System safety process uses many individual methods and analyses, e.g.•hazard analysis• failure modes and effects analysis• fault trees•Markov processes
Related analyses are also useful for other purposes, e.g.•maintainability•availability• Integrity
Goal: a general facility for modeling fault/error/failure behaviors that can be used for several modeling and analysis activities.
SAE ARP 4761 Guidelines and Methods for Conducting the Safety Assessment Process on Civil Airborne Systems and Equipment
Annotated architecture model permits checking for consistency and completeness between these various
declarations.
System
Component
Subsystem
Capture FMEA model
Capture hazards
Capture risk mitigation architecture
9
Analytical Architecture Fault ModelsFeiler, Jan 24, 2013
© 2013 Carnegie Mellon University
Error Propagation Paths
Component A Component B
Processor 1 Processor 2Bus
Error_Free Failed Error_Free Failed
Error_Free Failed Error_Free FailedError_Free Failed
Error Model and the Architecture
Error eventError propagation Color: Different types of error
Error propagation pathError flow
Propagation of errors of different types from error sources along propagation paths between architecture components.Error flows as abstractions of propagation through components.Component error behavior as transitions between states triggered by error events and incoming propagations.Composite error behavior in terms of component error behavior states.
10
Analytical Architecture Fault ModelsFeiler, Jan 24, 2013
© 2013 Carnegie Mellon University
Error Propagation PathsAn Error Propagation View
Error eventError propagation Color: Different types of error
Error propagation pathError flow
Components can have incoming and outgoing error propagationsError propagations follow propagation paths determined by connections and bindings in the architecture or undocumented observable paths.Error flows specify flow of errors through a component, or the component acting as propagation source or sink.
Component AComponent B Error Propagation View
11
Analytical Architecture Fault ModelsFeiler, Jan 24, 2013
© 2013 Carnegie Mellon University
Fault Propagation & Transformation Calculus
Default mapping• Any to all
Single in port to multiple out ports• late → (value, *, late)
Multiple in ports• (late, _) → (value, late) : wildcard• (late, f) → (f, late) : variable
Overlapping rules• (*, late) → (*, *)• (*, f) → (*, f)
Conditional mappings (considered too complex)• (f, g) → late, if f = late• *, if f = value and g = value• value, if f = g = value
Build on York U. work
Leverage architecture model
Overcome short comings
to address practical problems
Fault mapping rules
12
Analytical Architecture Fault ModelsFeiler, Jan 24, 2013
© 2013 Carnegie Mellon University
Component Error Propagation
Component CNoData
NoDataBadValue
ProcessorMemory
Bus NoResource
Error Flow:
Path P1.NoData->P3.NoData
Source P2.BadData;
Path processor.NoResource -> P2.NoData
P3
P1
LateData
ValueError
Incoming/Assumed• Propagated errors
• Errors not propagated
Outgoing/Intention• Propagated errors
• Errors not propagated
Bound resources• Propagated errors
• Errors not propagated
• Propagation to resource
Incoming
Outgoing
Binding
NoData
P2BadValue
“Not“ indicates that this error type is not intended to be propagated.
This allows us to determine whether propagation specification is complete.
Propagation of Error Types
Not propagated
Propagated Error Type
Error Flow through component
Path P1.NoData->P2.NoData
Source P2.BadData
Path processor.NoResource -> P2.NoData
P1 Port
ProcessorHW Binding
Legend
Direction
Error propagation specification without error behavior state machine is possible.
13
Analytical Architecture Fault ModelsFeiler, Jan 24, 2013
© 2013 Carnegie Mellon University
Consistency in Error Propagation
Component ANoData
NoDataBadData
ProcessorMemory
Bus NoResource
P2P1
LateDataBadData
Component BNoData
ProcessorMemory
Bus NoResource
P2P1
BadData
Mismatched fault propagation and masking assumptions
Robustness to unexpected fault propagation
Component A
Outgoing Propagation
P2
Not propagated
Component BP1
Not propagated
Propagated
Propagated
Not propagated
Propagated
Not propagated
PropagatedUnspecified
UnspecifiedPropagatedUnspecified
Unspecified
Unspecified
IncomingPropagation
14
Analytical Architecture Fault ModelsFeiler, Jan 24, 2013
© 2013 Carnegie Mellon University
Error Propagation PathsComponent Error Behavior Specification
Error eventError propagation Color: Different types of error
Propagation pathError flow
Components can have error behavior specified by an error behavior state machineTransitions between states triggered by error events and incoming propagations.Conditions for outgoing propagations are specified in terms of the current state and incoming propagations.Detection of error states and incoming propagations is mapped into a reporting port in the system architecture model
Detection report Binding
Port/access point
Component A
Operational Failed
Detection
Recover/repair event
15
Analytical Architecture Fault ModelsFeiler, Jan 24, 2013
© 2013 Carnegie Mellon University
Compositional Abstraction of Fault Model
• Abstracted error behavior of FGS and subcomponents• Error behavior and propagation specification
• Composite error behavior specification of FGS• State in terms of subcomponent states
[1 ormore(FG1.Failed or AP1.Failed) and
1 ormore(FG2.Failed or AP2.Failed) or AC.Failed]->Failed
FG1
FG2
AP1
AP2
AC
FGSOperational Failed
Operational FailedOperational Failed
Operational Failed
Operational Failed
NoValue
NoValueNoValue
NoValue
NoValue
FG Operational Failed
NoValue
Failure
16
Analytical Architecture Fault ModelsFeiler, Jan 24, 2013
© 2013 Carnegie Mellon University
Error Types & Error Type SetsError type declarationsTimingError: type ;EarlyValue: type extends TimingError;LateLate: type extends TimingError;ValueError: type ;BadValue: type extends ValueError;
An error type set represents a set of type tokens •Elements in a type set are mutually exclusive•An error type with subtypes includes tokens of any subtype•A type product represents a simultaneously occurring types
–Combinations of subtypesInputOutputError : type set {TimingError, ValueError, TimingError+ValueError};
An error type token represents a typed error instance •Represents actual event, propagation, or state types{LateValue + BadValue} or {LateValue}
Error Type Set as Constraint{T1} tokens of one type hierarchy{T1, T2} tokens of one of two error type hierarchies{T1+T2} type product (one error type from each error type hierarchy){NoError} represents the empty setConstraint on state, propagation, flow, transition condition, detection condition, outgoing propagation condition, composite state condition
17
Analytical Architecture Fault ModelsFeiler, Jan 24, 2013
© 2013 Carnegie Mellon University
A Common Set of Error Propagation TypesIndependent hierarchies of error typesCan be combined to characterize failure modes, resulting error states, and types of error propagation
Use aliases to adapt to domainNoPower = ServiceOmission
Introduce user defined types
18
Analytical Architecture Fault ModelsFeiler, Jan 24, 2013
© 2013 Carnegie Mellon University
Formal Error Type SpecificationsService is a sequence of action items. Service errors with respect to the service as a whole rather than individual service items
• Item Omission is perceived as a transient fault in that a service item is not provided. Service Omission: ai A | ai= where is the empty action.
•Service Omission is perceived as a permanent fault in that no service items are provided after the point of failure. Service Omission: [ai aj] A |S([ai aj]) with S defined as ak [ai aj] | ak= .
Value errors with respect to the value of an individual service item•Out Of Range error indicating that the value is outside the expected range of
values, a detectable error. V represents the range of acceptable values across the service sequence.Out Of Range error: aiA | vi > max(V) OR vi < min(V)
19
Analytical Architecture Fault ModelsFeiler, Jan 24, 2013
© 2013 Carnegie Mellon University
Outline
SAE AADL and the Error Model AnnexHazards & Fault Impact AnalysisComponent Error BehaviorCompositional AnalysisOperational and Failure ModesThe Intricacies of Desired Timing BehaviorSummary and Conclusion
20
Analytical Architecture Fault ModelsFeiler, Jan 24, 2013
© 2013 Carnegie Mellon University
Hazard Information in AADL Error Model
Hazards modeled as error propagations in the AADL Error Model
21
Analytical Architecture Fault ModelsFeiler, Jan 24, 2013
© 2013 Carnegie Mellon University
Sample FHA in Spreadsheet View
Hazard information exported to spreadsheet format
Architecture Fault Model allows hazards to be recorded for use throughout all development phases
22
Analytical Architecture Fault ModelsFeiler, Jan 24, 2013
© 2013 Carnegie Mellon University
Original System Safety Analysis
Auto Pilot
FMSProcessor
Operational
Failed
Flight Mgnt System
Anticipated: No Stall Propagation
FMS Power
AirspeedData
Failed
ActuatorCmd
Stall
NoService
Anticipated: NoService
Operational
NoData
EGI
EGI HW
EGI Logic
Oper’l
Failed
Oper’l
Failed
Anticipated: No EGI data
NoData
23
Analytical Architecture Fault ModelsFeiler, Jan 24, 2013
© 2013 Carnegie Mellon University
Discovery of Unexpected PSSA Hazard
Auto Pilot
FMSProcessor
Operational
Failed
Flight Mgnt System
FMS Power
AirspeedData
Failed
ActuatorCmd
Stall
NoService
Anticipated: NoService
CorruptedData
Unexpected propagation of corrupted Airspeed data results in Stall due to miss-correction
Operational
NoData
EGI
EGI HW
EGI Logic
Oper’l
Failed
Oper’l
Failed
Corrupted
EGI maintainer adds corrupted data hazard to model. Error Model analysis detects unhandled propagation.
Transient hardware failure corrupts EGI data
Anticipated: No EGI data
Stall caused by corrupted data
24
Analytical Architecture Fault ModelsFeiler, Jan 24, 2013
© 2013 Carnegie Mellon University
Recent Automated FMEA Experience
Failure Modes and Effects Analyses are rigorous and comprehensive reliability and safety design evaluations•Required by industry standards and Government policies•When performed manually are usually done once due to cost and schedule• If automated allows for
–multiple iterations from conceptual to detailed design–Tradeoff studies and evaluation of alternatives–Early identification of potential problems
Largest analysis of satellite to date consists of 26,000 failure modes• Includes detailed model of satellite bus•20 states perform failure mode•Longest failure mode sequences have 25 transitions (i.e., 25 effects)
24
Myron Hecht, Aerospace Corp.Safety Analysis for JPL, member of DO-178C committee
25
Analytical Architecture Fault ModelsFeiler, Jan 24, 2013
© 2013 Carnegie Mellon University
26
Analytical Architecture Fault ModelsFeiler, Jan 24, 2013
© 2013 Carnegie Mellon University
Outline
SAE AADL and the Error Model AnnexHazards & Fault Impact AnalysisComponent Error BehaviorCompositional AnalysisOperational and Failure ModesThe Intricacies of Desired Timing BehaviorSummary and Conclusion
27
Analytical Architecture Fault ModelsFeiler, Jan 24, 2013
© 2013 Carnegie Mellon University
Simple Error Behavior State Machine
annex Error_Model {** error behavior Example events -- both events will have mode-specific occurrence values for powered,unpowered SelfCheckedFault: error event; UncoveredFault: error event; SelfRepair: recover event; Fix: repair event; states Operational: initial state ; FailStopped: state; FailTransient: state; FailUnknown: state; transitions SelfFail: Operational -[SelfCheckedFault]-> (FailStopped with 0.7, FailTransient with 0.3); Recover: FailTransient -[SelfRepair]-> Operational; UncoveredFail: Operational -[UncoveredFault]-> FailUnknown;end behavior;**};
Simple state machine with branching transition
28
Analytical Architecture Fault ModelsFeiler, Jan 24, 2013
© 2013 Carnegie Mellon University
Component Error Behavior SpecificationComponent-specific behavior specification• Identifies an error behavior state machine•Specifies transition trigger conditions in terms of incoming propagated errors
or working condition of connected component•Specifies propagation conditions for outgoing propagated errors in terms of
states & incoming propagated errors•Specifies detection events, i.e., conditions under which an event is raised in
the component (core AADL model)component error behavior use types MyErrorLibrary ; use behavior MyErrorLibrary::SWStateBehavior ; transitions Operational-[port1{BadData}]->FailTransient; FailStopped-[port1{BadData}]->Mask; propagations all –[2 ormore (Port1{BadData}, Port2{BadData},Port3{BadData})]-> Outport3{BadData}; detections FailedState –[]-> Self.Failed ; properties Occurrence => 0.0005 Poisson applies to BadDataFault; end behavior;
29
Analytical Architecture Fault ModelsFeiler, Jan 24, 2013
© 2013 Carnegie Mellon University
Transient ErrorsIn error behavior state machine•Probability that error event results in transient error behavior
–Branching transition with branch probability•Error state with transition triggered by recover event
–Recover event has property to indicate time (range) and distribution over time (range)• Distribution over single value vs. range
Operational
FailStop
TransientFailure
Recover Event
Error EventPermenant failure
Probability 0.1
30
Analytical Architecture Fault ModelsFeiler, Jan 24, 2013
© 2013 Carnegie Mellon University
Presence & Absence of Incoming Propagations
Conjunctions on incoming propagations with single type• Inport1{LateValue} vs. Inport1{LateValue} and inport2{EarlyValue} vs.
Inport1{LateValue} and inport2{noerror}•noerror means that no error propagation occurs at a given time on a feature
Conditions on incoming propagation type set tuples• Inport1{LateValue} vs. Inport1{LateValue,BadValue} •Tuple with just a LateValue means that no other error types have occurred
31
Analytical Architecture Fault ModelsFeiler, Jan 24, 2013
© 2013 Carnegie Mellon University
Error Detection and Error Recovery
Error detection and reporting•Example: novalue as detectable persistent fault•Self detection and reporting (internal to component)•Detection of error propagation by recipient component
Error recovery with probability of success and failure•Branch transitions
Operational
BadValueState
BadValueEvent
out propagation BadValue
Outport2
BadValueState –[RecoverEvent]-> ( Operational with 0.99, PersistentFaultState with 0.01);
PersistentFaultState
ErrorOutport
RecoverEventOn BranchTransition
Recover FailureRecover Success
detectionsPersistentFaultState-[]-> ErrorOutPort {NoValue} ;
Component internal detection and reporting
External detection and reporting
Inport1
detectionsall -[Inport1{NoValue}]-> ErrorOutPort {NoValue} ;
ErrorOutport
out propagation NoValue
32
Analytical Architecture Fault ModelsFeiler, Jan 24, 2013
© 2013 Carnegie Mellon University
Outline
SAE AADL and the Error Model AnnexHazards & Fault Impact AnalysisComponent Error BehaviorCompositional AnalysisOperational and Failure ModesThe Intricacies of Desired Timing BehaviorSummary and Conclusion
33
Analytical Architecture Fault ModelsFeiler, Jan 24, 2013
© 2013 Carnegie Mellon University
Error Model at Each Architecture Level
• Abstracted error behavior of FMS• Error behavior and propagation specification
• Composite error behavior specification of FMS• State in terms of subcomponent states
[1 ormore(FG1.Failed or AP1.Failed) and
1 ormore(FG2.Failed or AP2.Failed) or AC.Failed]->Failed
FG1
FG2
AP1
AP2
AC
FMSOperational Failed
Operational FailedOperational Failed
Operational Failed
Operational Failed
NoValue
NoValueNoValue
NoValue
NoValue
FMSOperational Failed
NoValue
Failure
Consistency Checking Across Levels of the
Hierarchy
Composite error models lead to fault trees and reliability predictions
Fault occurrence probability
Fault occurrence probability
34
Analytical Architecture Fault ModelsFeiler, Jan 24, 2013
© 2013 Carnegie Mellon University
Impact of Deployment Configuration Changes
FMS Failure on 2 or 3 processor configuration (CPU failure rate = 10-5)
FMS Failure Rate 0 5*10-6 5*10-5
MTTF – One CPU operational 112,000 67,000 14,000
MTTF – Two CPU operational 48,000 31,000 7,000
Side effects of design and deployment decisions in reliability predictions
Workload balancing of partitions later in development affects reliability
3 processor configuration can be less reliable than 2 processor configuration
Example: AP and FG distributed across two processors
35
Analytical Architecture Fault ModelsFeiler, Jan 24, 2013
© 2013 Carnegie Mellon University
SAFEbus with Dual Communication Channel•Network is aware of dual host nature
• Reflected in replication factor properties for connections and bus access • Dual channel communication as abstraction reflected in Replication Errors
•Integrity gate keeper for host system• Maps inconsistent value/omission errors into item omissions• Expressed by error path from incoming to outgoing binding propagation
•Fail-op/fail-stop tolerance of SAFEbus faults• Operational, SingleErrorOp, Failed states: SAFEbus error events cause state change• Operational: gate keeper mappings• SingleErrorOp: no error source & full gate keeper mappings• Failed: service omission as sole outgoing propagation• Modeled by error behavior state machine and component behavior conditions
SAFEbus
SAFEbus as Application Gate Keeper and Source of Error
2X 2X
Inconsistent Value Item/ServiceOmission
Inconsistent omissionInconsistent Value
Corrupt Value
36
Analytical Architecture Fault ModelsFeiler, Jan 24, 2013
© 2013 Carnegie Mellon University
Assumptions & Hazards in Use of SAFEbus
•Assumption: no replicated Host stand-by operation• Item/service omission on one Host-S channel => Item/service omission for both Host-
T channels• Modeled as Inconsistent Omission propagation in stand-by operational mode
•Assumption: no identical corrupt/bad value propagation from sender• Need to ensure no replication of corrupt/bad value (e.g., sensor fan-out)• Modeled as no outgoing symmetric value error from sending host
•SAFEbus as shared resource• Manage babbling host (item/service commission)• Modeled as propagated commission that is expected to be masked
SAFEbus
Host-S Host-T
Replicated Host System
2X
Item/ServiceOmission
Service/item Commission
Service/item Commission
Mask incoming Mask incoming
Single sourceCorrupt/Bad value
Corrupt/Bad value2X
Inconsistent ValueCorrupt/Bad value
Inconsistent Omission
37
Analytical Architecture Fault ModelsFeiler, Jan 24, 2013
© 2013 Carnegie Mellon University
Multi-layered Platforms
Component A Component B
Processor 1 Processor 2Bus
Operational Failed Operational Failed
Operational Failed Operational FailedOperational Failed
Processor
Bindings
Processor
Bindings
Bindings
Binding
Virtual Bus
Binding
Bindings
Operational Failed
Type transformation rules specify how binding propagations affect /augment nominal communication and error propagation between component A and component B
Connection Bindingto buses, processors, devices,
virtual processors, virtual buses
Virtual Processor
Partition X
Bindings
Binding
Platform layers exposed via virtual bus and virtual processor
38
Analytical Architecture Fault ModelsFeiler, Jan 24, 2013
© 2013 Carnegie Mellon University
Outline
SAE AADL and the Error Model AnnexHazards & Fault Impact AnalysisComponent Error BehaviorCompositional AnalysisOperational and Failure ModesThe Intricacies of Desired Timing BehaviorSummary and Conclusion
39
Analytical Architecture Fault ModelsFeiler, Jan 24, 2013
© 2013 Carnegie Mellon University
Operational Modes and Failure Modes•Nominal system behavior
• Operational modes• Functional behavior• Represented by AADL modes and Behavior Annex
•Anomalous behavior reflecting failure modes• Deviation from nominal behavior• Due to design error• Due to physical failure (operational error)• Represented by AADL Error Model Annex
•System awareness of anomalous behavior• Detection, reporting/recording• Transformation, masking• Represented by AADL Error Model Annex
40
Analytical Architecture Fault ModelsFeiler, Jan 24, 2013
© 2013 Carnegie Mellon University
Example: GPS•Operational modes: Hi-Precision, Lo-Precision, Off
• User initiated transitions•Failures: Sensor failure, Processing failure•Error states of GPS:
• Dual Sensor Op(operational) • Sensor1{NoError} and Sensor2{NoError} and Processing{NoError}
• Single Sensor Op (Degraded)• 1 orless(Sensor1{Failed}, Sensor2{Failed})
• Dual sensor failure or processing failure (FailStop)• 1 ormore(Sensor1{Failed} and Sensor2{Failed},
Processing{Failed})
•Mapping of Error States onto Operational Modes
HiP LoP
HiP, LoP, Off
Sensor1
Sensor2
Processing
Off
Degraded
Operational
Sensor Fails
FailStopOmission
Value
Recover Event
(Reset)
Processing Fails
HiP LoP Off
Operational
DegradedFailStop
41
Analytical Architecture Fault ModelsFeiler, Jan 24, 2013
© 2013 Carnegie Mellon University
Combined GPS Behavior Model•Error states constrain operational modes
• Degraded supports LoP and Off• FailStop shows Off behavior
•Forced mode transitions• New error state that excludes current mode• Explicit reflection of forced transition in mode behavior
• Detection event and “Emergency” mode transition•Behavioral record of error state (Mode state/Error state pairs)
• Specify detection and reporting of error state• Allows for detection and reporting of limits on user initiated transitions
Expected mode
Operational Degraded Fail Stop
HiP HiP Force transition to LoP{ValueError}
Force transition to Off{Omission}
LoP LoP LoPOk: LoP <-> Off
Force transition to Off{Omission}
Off Off Off Off Perceived {NoError}
HiP-O LoP-O Off-O
Operational
Degraded
FailStop
LoP-D Off-D
Off-F
42
Analytical Architecture Fault ModelsFeiler, Jan 24, 2013
© 2013 Carnegie Mellon University
Outline
SAE AADL and the Error Model AnnexHazards & Fault Impact AnalysisComponent Error BehaviorCompositional AnalysisOperational and Failure ModesThe Intricacies of Desired Timing BehaviorSummary and Conclusion
46
Analytical Architecture Fault ModelsFeiler, Jan 24, 2013
© 2013 Carnegie Mellon University
Analyzable Architecture Fault Models
Through model-based analysis identify architecture induced unhandled, testable, and untestable faults and understand the root cause, impact, and potential mitigation options.
Fault Impact & FDIR Analysis
Architecture Fault Model Analysis
Discover testable and untestable faults
Discover unhandled faults &
safety violations
FADEC Operational Mode & Fault Mgnt Behavior Analysis
Model validation
Requirements
Faults that can be tested
Decision coverage
Faults that cannot be testedRace conditions
Improved documentation &
design
Faults that are unhandled
Transient data loss in protocol
C
Fault Impact Analysis
Omission
Detection of Unhandled Data Loss Fault
Sequence
Fault propagation Effects Engine Control Mode to Issue Shut Down Engine Sequence
Reachability AnalysisOf Unsafe States
Root Cause of Data Loss Is Non-deterministic Temporal Buffer Read/Write Ordering
ProcessorCyclic Executive RMS
Config1 Config2
Read/write Timeline Analysis Under Cyclic Executive &
Preemptive Scheduler
47
Analytical Architecture Fault ModelsFeiler, Jan 24, 2013
© 2013 Carnegie Mellon University
End-to-end Latency in Control Systems
System Engineer
Control Engineer
System
Under
Control
Control
System
Operational
Environment
•Processing latency•Sampling latency•Physical signal latency
Impact of Scheduler Choice on Controller Stability
A. Cervin, Lund U., CCACSD 2006
Impact of Software Implemented Tasks
Jitter is typically managed by Periodic I/O
AADL immediate & delayed connections specify deterministic sampling
48
Analytical Architecture Fault ModelsFeiler, Jan 24, 2013
© 2013 Carnegie Mellon University
Software-Based Latency Contributors
Execution time variation: algorithm, use of cacheProcessor speedResource contentionPreemptionLegacy & shared variable communicationRate group optimizationProtocol specific communication delayPartitioned architectureMigration of functionalityFault tolerance strategy
Quantification of Timing Errors
How much jitter is acceptable?
When is early/late too early/late?
49
Analytical Architecture Fault ModelsFeiler, Jan 24, 2013
© 2013 Carnegie Mellon University
Migration of Dual Fault Tolerant Control SystemHighly unstable system being controlledControl software fault tolerance•Simplex: Baseline, experimental, recovery controllers•Monitor experimental, recover to baseline control•Unrelated higher priority task shares processor
Dual redundancy to address hardware failures•Two instances of Simplex control system•Distributed leadership decision making•Asynchronous processor hardware
Migration to dual core processors•One core dedicated to Simplex control system•Unrelated task on second core•=> Failure to provide control
PC1 PC2
LeaderLogic
LeaderLogic
SimplexControl
SimplexControl
Ind
epen
den
tTa
sk
Ind
epen
den
tTa
sk
Core1Core2
PC1
IndependentTask
SimplexControl
Synchronous system was not explicitly specified
PALS protocol [Sha] provides time drift containment
Accidental Resilience
50
Analytical Architecture Fault ModelsFeiler, Jan 24, 2013
© 2013 Carnegie Mellon University
Outline
Challenges in Safety-critical Software-intensive systemsSAE AADL and the Error Model AnnexHazards & Fault Impact AnalysisComponent Error BehaviorCompositional AnalysisOperational and Failure ModesThe Intricacies of Desired Timing BehaviorSummary and Conclusion
51
Analytical Architecture Fault ModelsFeiler, Jan 24, 2013
© 2013 Carnegie Mellon University
Architecture Fault Modeling Summary
Architecture Fault Modeling with AADL•SAE AADL V2.1 published in Sept 2012 (V1 published in 2004)•Error Model Annex was published in 2006
–Supported in AADL V1 and AADL V2•Error Model concepts and ontology not specific to AADL, can be applied to
other modeling notations•Revised Error Model Annex (V2) based on user experiences currently in
review
Safety Analysis and Verification •Error Model Annex front-end available in OSATE open source toolset
–Allows for integration with in-house safety analysis tools•Multiple tool chains support various forms of safety analysis (Honeywell,
Aerospace Corp., AVSI SAVI, ESA COMPASS, WW Technology)•FHA, FMEA, fault tree, Markov models, stochastic Petri net generation from
AADL/Error Model–Open source implementation as part of Error Model V2 publication
52
Analytical Architecture Fault ModelsFeiler, Jan 24, 2013
© 2013 Carnegie Mellon University
References
Website www.aadl.info
Public Wiki https://wiki.sei.cmu.edu/aadl
Dependability Modeling with AADL (V1), SEI Technical Report, 2006.
Draft Error Model V2 Annex & Architecture Fault Modeling report, in progress.
Requirements Definition and Analysis Annex supporting FAA Requirements Engineering Handbook Process and Example, Dominique Blouin. https://wiki.sei.cmu.edu/aadl/index.php/Jan_2012_User_Day
Reliability Validation and Improvement Framework, SEI Special Report CMU/SEI-2012-SR-013, November 2012.
Virtual Upgrade Validation (VUV) Method, SEI Technical Report CMU/SEI-2012-TR-005, June 2012.
53
Analytical Architecture Fault ModelsFeiler, Jan 24, 2013
© 2013 Carnegie Mellon University
Contact Information
Peter H. FeilerSenior MTSRTSSTelephone: +1 412-268-7790Email: [email protected]
U.S. MailSoftware Engineering InstituteCustomer Relations4500 Fifth AvenuePittsburgh, PA 15213-2612USA
Webwww.sei.cmu.eduwww.sei.cmu.edu/contact.cfm
Customer RelationsEmail: [email protected] Phone: +1 412-268-5800SEI Fax: +1 412-268-6257