software reliability techniques applied to constellation
Software Reliability Techniques Applied to Constellation. Technical Briefing, NASA OSMA Software Assurance Symposium, September 9-11, 2008. Allen P. Nikora, JPL/Caltech, PI; Sergio Guarro, ASCA, Inc., Co-I.
National Aeronautics and Space Administration
Software Reliability Techniques Applied to Constellation
Allen P. Nikora, JPL/Caltech, PI
Sergio Guarro, ASCA, Inc., Co-I
This research was carried out at the Jet Propulsion Laboratory, California Institute of Technology under a contract with the National Aeronautics and Space Administration. The work was sponsored by the NASA Office of Safety and Mission Assurance under the Software Assurance Research Program led by the NASA Software IV&V Facility. This activity is managed locally at JPL through the Assurance and Technology Program Office
Technical Briefing
NASA OSMA Software Assurance Symposium September 9-11, 2008
09/09/2008
Agenda
• Problem/Approach
• Relevance to NASA
• Accomplishments and/or Tech Transfer Potential
• Technology Readiness Level
• Data Availability
• Impediments to Research or Application
• Next Steps
Problem/Approach
• Software-related failures were responsible for more than half of NASA major space mission losses or malfunctions between 1996 and 2007
– The large majority were due to system conditions that had not been anticipated or fully understood in the system / software specification and design process
– As NASA space missions are increasingly controlled by software, the probability of mission failure due to software may increase if no action is taken
– Minimizing loss of crew/loss of mission requires appropriate techniques to evaluate the reliability of on-board and ground-based support software during all development phases.
Problem/Approach (cont’d)
• Modeling of a software system in its anticipated operational context is an important aspect of assuring software reliability.
– Recognized in the concept of “operational profile” and in software reliability model assumptions
– Many techniques for modeling software reliability treat software in isolation from the hardware on which it runs and which it controls.
• Goals:
– Demonstrate feasibility of applying the Context-based Software Risk Modeling (CSRM) technique to CxP applications/scenarios
  • Focus on mission-critical applications such as GN&C, Safety and Health Monitoring, Launch Abort
– Develop guidelines for use of context-based techniques
– Infuse context-based SW reliability modeling techniques into other NASA SW development efforts
Relevance to NASA
• Reliability of a software component depends on its operating environment. CSRM explicitly includes context in system/software models.
• Unlike traditional software reliability modeling techniques, CSRM helps guide software testing
• CSRM can be used to evaluate risk of software failure during specification and design phases as well as during implementation and test.
– Identifying risk-prone areas earlier in development reduces the number of defects passed through to test and operations
– Earlier identification of risk-prone areas allows more effective management of development resources
Accomplishments and/or Tech Transfer Potential
• Selected PA-1 as initial scenario to be modeled
• Acquired relevant artifacts from Windchill, JSC contacts
• Analysis of PA-1 software specifications/design in progress
• Development of CSRM models of PA-1 software in progress.
– GNC is the initial software component selected for modeling
Technology Readiness Level
• CSRM is TRL 9
– Actual system has been thoroughly demonstrated and tested in its operational environment.
– All documentation completed.
– Successful operational experience.
– Sustaining engineering support in place.
• Goal of this effort is to apply CSRM to CxP rather than developing new software reliability modeling techniques
Data Availability
• Access to Windchill repository, CxP artifacts
• Contact points at JSC, GSFC to
– Help with navigation through repository
– Obtain needed artifacts from contractors that aren’t in repository
Impediments to Research or Application
• Large volume of data – difficult to navigate through repository and identify appropriate artifacts.
Next steps
• Complete development of PA-1 model(s)
• Analyze models; evaluate software failure risk
• Review models, results
• Refine models
• Select further applications to model
Technical Detail
CSRM Key Features
From “Risk-Informed Software Assurance for NASA Space Missions”, Sergio Guarro, ASCA Inc., November, 2007
• CSRM: Context-based Software Risk Model
• A practical approach and framework for assurance of mission-critical software-intensive systems for NASA programs’ use
– System and mission scenario analysis oriented
– Integrates traditional PRA event-tree / fault-tree models with Dynamic Flowgraph Methodology (DFM) models suited to handle software-intensive and human-in-the-loop systems (“dynamic PRA” environments)
– Can be applied for both preliminary assessments of yet-to-be-written software and in-depth assessment of existing, testable software
– Produces software test guidance, as well as assurance and PRA-integrated risk models and metrics
– Supported by implementation toolsets
  • Classical PRA and DFM software
CSRM Technical Highlights
From “Risk-Informed Software Assurance for NASA Space Missions”, Sergio Guarro, ASCA Inc., November, 2007
• PRA-style development of mission and risk scenario models
• Uses traditional event-tree / fault-tree logic models at top modeling level to capture the basic aspects of mission scenarios
• Uses Dynamic Flowgraph Methodology (DFM) models to capture dynamic and logically complex aspects of system/software/operator interactions
– DFM analytical and quantitative results are fully compatible with, and can be integrated with, PRA tool binary models and results (SAPHIRE, CAFTA)
• Can incorporate risk, reliability and assurance info from other tools and sources
– SW-process-quality information and non-project-specific reliability data and assessments
  • SW reliability info collected in other projects and deemed applicable as first estimates of risk levels in current SW modules of interest
– SW defect / reliability model output (e.g., Schneidewind’s model or other)
– Traditional test results
CSRM Analysis Overview
From “Risk-Informed Software Assurance for NASA Space Missions”, Sergio Guarro, ASCA Inc., November, 2007
1. Inspect / examine conventional PRA ET/FT models and identify SW-related system functions and events
2. Quantify SW functions and events via process-quality assessment methods and/or generic SW data (as needed and applicable for preliminary assessment and prioritization purposes)
[Figure: AERCam event tree. Branch headings: RETRIEVAL (Hard docks with Dock. Bay), RETURN (Maneuver Back to Dock), SEND-VID (Send Video), NO-COLL (No Collision Occurred), MOTION (AERCam is in Motion), HOLD-SAFE (Auto Safe Mode Executed), HOLD-EXE (Hold Cmd. Executed), HOLD-CMD (Auto Hold Cmd. Rec.). End states by sequence: 1 SUCCESS; 2 to 5 F-RECOVRD; 6 LOSS; 7 COLLISION; 8 to 10 F-RECOVRD; 11 LOSS; 12 COLLISION; 13 to 15 F-RECOVRD. Callout: Event-tree branch-point to be further modeled and analyzed.]
CSRM Analysis Overview (cont’d)
From “Risk-Informed Software Assurance for NASA Space Missions”, Sergio Guarro, ASCA Inc., November, 2007
3. Develop DFM model of high-priority SW-related functions, accordingly expanding ET branch-point or FT events of interest
[Figure: the AERCam event tree (same branch headings and 15 end states as before), with the branch-point selected for expansion labeled with branch probabilities P1 and 1-P1.]
CSRM Analysis Overview (cont’d)
From “Risk-Informed Software Assurance for NASA Space Missions”, Sergio Guarro, ASCA Inc., November, 2007
4. Use DFM multi-valued logic / dynamic analysis of higher-level ET or FT event to identify SW and HW/SW potential failure mode sub-scenarios (e.g., a “cut-set” consisting of < HW-failure-X AND SW-faulty-response-Y >)
5. Test HW/SW in actual or simulated integrated system set-up, to exclude or establish a risk upper-bound for the existence of analytically identified potential cut-sets
6. Insert and integrate Step 4 and 5 results into overall PRA ET/FT models, to obtain a full system-level mission assurance, risk analysis and quantification perspective
[Figure: the AERCam event tree (same branch headings and 15 end states as before).]
CSRM Data Needs
From “Risk-Informed Software Assurance for NASA Space Missions”, Sergio Guarro, ASCA Inc., November, 2007
Data needs by design maturity (from Early Design Phase to System Integration Phase; the middle stage is unlabeled on the slide):
Data need for SW
– Early Design Phase: Interface documents, preliminary SW design spec., Preliminary Hazard Analyses, FMECAs, classification of SW failure data for similar designs
– Middle stage: Detailed SW design docs., pseudo code, preliminary module testing (qualitative results, e.g. types of contexts tested, types of errors encountered)
– System Integration Phase: Executable code, module & integration testing (qualitative results)
Data need for Balance-of-system
– Early Design Phase: Conceptual design docs., high level qualitative risk assessment models such as FMEAs, master logic diagrams
– Middle stage: Detailed design docs., preliminary qualitative risk assessment models such as event sequence diagrams, event trees, fault trees, fish bone models etc.
– System Integration Phase: System integration docs., system PRA model
• Logic model(s) development and qualitative analysis
– Logic model(s) development and qualitative (i.e., logic) analysis are iterative processes.
– Logic model(s) for the software and the balance-of-system will evolve with the design of the system.
– The fidelity of the model(s) and the qualitative analytical results increases with this evolution process.
CSRM Data Needs (cont’d)
From “Risk-Informed Software Assurance for NASA Space Missions”, Sergio Guarro, ASCA Inc., November, 2007
Data needs by design maturity (from Early Design Phase to System Integration Phase; the middle stage is unlabeled on the slide):
Data need for SW
– Early Design Phase: Generic SW failure data or reliability / risk assessments for similar designs
– Middle stage: Preliminary module testing (qualitative / quantitative results, e.g. type and no. of contexts tested, no. of tests executed, type & no. of errors encountered)
– System Integration Phase: Executable code, module & integration testing (quantitative results)
Data need for Balance-of-system
– Early Design Phase: High level quantitative risk assessment models such as top-level event tree / fault-tree quantifications
– Middle stage: Preliminary quantitative risk assessment results, such as quantitative estimates for failure modes of sub-systems interacting w/ the SW
– System Integration Phase: Quantitative risk assessment results
• Quantitative Analysis
– Quantitative analysis is also an iterative process:
  • Preliminary qualitative and quantitative results identify SW error-forcing contexts to be tested and establish the testing criteria for meeting the reliability threshold.
  • More detailed qualitative and quantitative results identify areas of refinement for risk management and risk reduction.
  • Final qualitative and quantitative results estimate the contribution of the SW to the overall system risk.
Dynamic Flowgraph Methodology
From “Risk-Informed Software Assurance for NASA Space Missions”, Sergio Guarro, ASCA Inc., November, 2007
• DFM is a directed-graph modeling methodology that uses multi-valued logic and discrete-event dynamic representation of system parameter and component states
• Capable of handling, within the limits of the discrete state and time representations:
– Cause-effect relationships
– Time-dependent relationships
– Feedback and logic loops
– Cognitive models of human operator actions
• A DFM system model, once constructed, can be analyzed in either deductive (e.g., “fault-tree like”) or inductive (e.g., “FMEA or event-tree like”) mode
– Deductive analysis produces the “prime implicants” for any “top event” that can be defined in terms of combinations of possible system parameter and/or component states (even across time boundaries)
– Inductive analysis tracks the evolution of parameter, component and system states over discrete time and logic steps, starting from any user-defined combination of states that represents a possible system state
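The deductive mode can be illustrated with a brute-force sketch. Everything in it is an assumption made for illustration: the three nodes, their state names, and the `hold_fails` top-event rule are hypothetical and static (no time steps), and the search is plain enumeration, not the algorithm used by any DFM tool. It treats an implicant as a partial assignment of node states that forces the top event whatever the other nodes do, and a prime implicant as one from which no condition can be dropped.

```python
from itertools import combinations, product

# Hypothetical discretized nodes and their states (illustrative only).
STATES = {
    "valve": ("Open", "Stuck Closed"),
    "leak": ("None", "Small"),
    "command": ("Correct", "Slightly Inaccurate"),
}

def hold_fails(state):
    """Hypothetical top event: the hold fails for a stuck valve, or for a
    small leak combined with a slightly inaccurate command."""
    return state["valve"] == "Stuck Closed" or (
        state["leak"] == "Small" and state["command"] == "Slightly Inaccurate"
    )

def is_implicant(partial):
    """True if every completion of the partial assignment triggers the top event."""
    free = [name for name in STATES if name not in partial]
    for values in product(*(STATES[name] for name in free)):
        state = {**partial, **dict(zip(free, values))}
        if not hold_fails(state):
            return False
    return True

def prime_implicants():
    """Enumerate partial assignments by increasing size, keeping the minimal implicants."""
    primes = []
    names = list(STATES)
    for size in range(1, len(names) + 1):
        for subset in combinations(names, size):
            for values in product(*(STATES[name] for name in subset)):
                partial = dict(zip(subset, values))
                # Skip anything that already contains a smaller prime implicant.
                if any(p.items() <= partial.items() for p in primes):
                    continue
                if is_implicant(partial):
                    primes.append(partial)
    return primes

for pi in prime_implicants():
    print(pi)
```

Inductive analysis runs the other way: it starts from one fully specified state and evaluates the model forward, which in this static toy model reduces to calling `hold_fails` on that state.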
DFM and PRA/PSA Tools
From “Risk-Informed Software Assurance for NASA Space Missions”, Sergio Guarro, ASCA Inc., November, 2007
• DFM is not intended to be a substitute for any existing PRA tool (although in “binary mode” it can mimic both event-tree and fault-tree models)
• DFM can be most useful as a PRA/PSA modeling supplement, for those special portions of a system or mission that call for the use of non-static, non-binary models
• DFM can be integrated with an existing PRA/PSA framework by inserting its results into an existing ET / FT model framework
– This can be automated if the ET / FT tool offers a data interchange utility and / or an “open API”
DFM Constructs and Modeling Representations
From “Risk-Informed Software Assurance for NASA Space Missions”, Sergio Guarro, ASCA Inc., November, 2007
• Nodes and discretized state-vectors represent key process parameters and/or components
• Mapping between the discretized state-vectors is governed by multi-valued logic rules
– Transfer-boxes (decision tables)
– Transition-boxes (decision tables with built-in time transitions)
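As a minimal sketch of the two constructs, a decision table can be written as a lookup from input node states to an output node state. The node names, state names, and every table entry below are hypothetical; the slides do not give the actual rules of any DFM model. The transfer box applies the table within the same time step, while the transition box applies the same kind of table with a built-in delay (one clock tick here, an assumed value).

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical decision table: (leak, thruster_command) -> attitude.
ATTITUDE_TABLE: Dict[Tuple[str, str], str] = {
    ("None", "Correct"): "Correct",
    ("None", "Slightly Inaccurate"): "Slightly Inaccurate",
    ("None", "Inaccurate"): "Inaccurate",
    ("Small", "Correct"): "Correct",
    ("Small", "Slightly Inaccurate"): "Inaccurate",
    ("Small", "Inaccurate"): "Inaccurate",
    ("Large", "Correct"): "Slightly Inaccurate",
    ("Large", "Slightly Inaccurate"): "Inaccurate",
    ("Large", "Inaccurate"): "Inaccurate",
}

def transfer_box(leak: str, command: str) -> str:
    """Transfer box: the output state follows the inputs within the same time step."""
    return ATTITUDE_TABLE[(leak, command)]

def transition_box(history: List[Tuple[str, str]], delay: int = 1) -> List[Optional[str]]:
    """Transition box: the output at step t is looked up from the inputs at step t - delay."""
    outputs: List[Optional[str]] = []
    for t in range(len(history)):
        if t < delay:
            outputs.append(None)  # no input is old enough yet
        else:
            outputs.append(ATTITUDE_TABLE[history[t - delay]])
    return outputs

print(transfer_box("Small", "Slightly Inaccurate"))
print(transition_box([("None", "Correct"), ("Small", "Slightly Inaccurate"), ("None", "Correct")]))
```

Because every combination of input states has an explicit row, the table is a multi-valued generalization of a truth table, which is what lets both deductive and inductive analysis run over the same model.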
Steps in Typical DFM Analysis
From “Risk-Informed Software Assurance for NASA Space Missions”, Sergio Guarro, ASCA Inc., November, 2007
Step 1: Model Construction
• Construct DFM model of system of interest
– Representing the system behavior and flow of causality
– (Model is a network of nodes, transfer-boxes, transition-boxes, and associated arc connections)
Step 2: System Analysis
• Use DFM inductive and deductive engines to:
1. Verify specified behavior (can be done on system “design model”)
2. Identify system failure modes in terms of basic component failure modes (“Automated FMEA”)
3. Develop “Dynamic Scenario Trees” (similar to dynamic event trees)
4. Identify prime implicants for system failure (“Top-Events” of interest)
5. Define test sequences specifically suited to identify and isolate various classes of possible faults. (This feature is useful for generating input vectors for testing software-based systems)
Steps in Typical DFM Analysis
From “Risk-Informed Software Assurance for NASA Space Missions”, Sergio Guarro, ASCA Inc., November, 2007
Step 3: Quantification of System Analysis
• DFM model results usually identify sub-events that contribute probability to the branch-points of a system / mission event tree
– DFM analysis is equivalent in concept and results to the fault-tree analyses carried out in traditional PRA to provide further definition and quantification to system sequences initially defined via event-tree models
• DFM “top events” are quantified in a fashion similar to fault-tree “top events”
• To quantify a DFM Top Event, the set of associated n prime implicants (PIs) is first converted into a set of m mutually exclusive implicants (MEIs)
$\text{Top Event} = \mathrm{MEI}_1 \cup \dots \cup \mathrm{MEI}_m$
• The sum of probabilities for the MEIs yields the probability of the Top Event
$P(\text{Top Event}) = P(\mathrm{MEI}_1) + \dots + P(\mathrm{MEI}_m)$
• The above is in essence the multi-value logic equivalent of the BDD (Binary Decision Diagram) quantification process for fault-trees
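One simple way to see why the MEI sum works is to make the prime implicants disjoint by construction. The sketch below is an illustration under stated assumptions, not the conversion used by DFM software: it assumes statistically independent nodes, uses made-up state probabilities, and defines the k-th MEI as "PI k holds and none of the earlier PIs hold", evaluated by enumerating all joint states.

```python
from itertools import product

# Hypothetical, independent node-state probabilities (illustrative values only).
NODE_PROBS = {
    "valve": {"Open": 0.99, "Stuck Closed": 0.01},
    "leak": {"None": 0.95, "Small": 0.05},
    "command": {"Correct": 0.9, "Slightly Inaccurate": 0.1},
}

# Hypothetical prime implicants, each a conjunction of node states.
PRIME_IMPLICANTS = [
    {"valve": "Stuck Closed"},
    {"leak": "Small", "command": "Slightly Inaccurate"},
]

def mei_probabilities(pis, node_probs):
    """Return P(MEI_k) for each k, where MEI_k = PI_k and not (PI_1 or ... or PI_{k-1})."""
    names = list(node_probs)
    totals = [0.0] * len(pis)
    for values in product(*(node_probs[name] for name in names)):
        state = dict(zip(names, values))
        p_state = 1.0
        for name in names:
            p_state *= node_probs[name][state[name]]
        # Credit each joint state to the first PI it satisfies, so the sets are disjoint.
        for k, pi in enumerate(pis):
            if all(state[name] == value for name, value in pi.items()):
                totals[k] += p_state
                break
    return totals

meis = mei_probabilities(PRIME_IMPLICANTS, NODE_PROBS)
p_top = sum(meis)
print(meis, p_top)
```

Summing the raw PI probabilities instead would double count the joint states in which both implicants hold; crediting each state to a single MEI is what removes that overlap.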
Use of DFM in CSRM Framework
From “Risk-Informed Software Assurance for NASA Space Missions”, Sergio Guarro, ASCA Inc., November, 2007
• CSRM (Context-based Software Risk Model) is a framework to address and guide the integration of functional models of software-related risk into “classical” PRA / PSA frameworks
• CSRM is the modeling approach for software intensive space systems recommended and illustrated in the NASA PRA Procedures Guide
• CSRM can be implemented for simpler systems using only standard ET / FT PRA models
• For more complex systems, use of methods with more advanced and dynamic features (such as DFM or “colored Markov”) is recommended, at least for part of the modeling and analytical effort
Example: Top Level DFM Model of Mini-AERCam System
From “Risk-Informed Software Assurance for NASA Space Missions”, Sergio Guarro, ASCA Inc., November, 2007
[Figure: top-level DFM model of the Mini-AERCam system; 1 clk = 1 sec.]
Callout: This is the sub-model for the GN&C Software. It is expanded in the next slide.
Callout: This node represents the actual attitude of the Mini-AERCam. It is discretized into 3 states:
1. Correct (Error < 3˚)
2. Slightly Inaccurate (Error of 3˚ to 10˚)
3. Inaccurate (Error > 10˚)
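The discretization of the attitude node can be sketched as a small mapping function. The function name is hypothetical, and treating the boundary values 3˚ and 10˚ as part of the middle state is an assumption read from the "3˚ to 10˚" wording.

```python
def discretize_attitude_error(error_deg: float) -> str:
    """Map an attitude error in degrees to one of the three node states."""
    error = abs(error_deg)  # assumption: only the magnitude of the error matters
    if error < 3.0:
        return "Correct"
    if error <= 10.0:
        return "Slightly Inaccurate"
    return "Inaccurate"

for sample in (1.0, 3.0, 7.5, 10.0, 12.0):
    print(sample, discretize_attitude_error(sample))
```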
Example: DFM Model of Mini-AERCam GN&C Sub-Model
From “Risk-Informed Software Assurance for NASA Space Missions”, Sergio Guarro, ASCA Inc., November, 2007
[Figure: DFM model of the Mini-AERCam GN&C sub-model; 1 clk = 1 sec.]
Callout: This sub-model includes the GPS hardware and the translational navigation software.
Callout: This sub-model includes the angular rate gyro hardware and the rotational navigation software.
Example: DFM Model of Mini-AERCam Propulsion Subsystem
From “Risk-Informed Software Assurance for NASA Space Missions”, Sergio Guarro, ASCA Inc., November, 2007
[Figure: DFM model of the Mini-AERCam propulsion subsystem.]
Callout: This node represents a leak in the propulsion system fuel lines after the isovalve but before the thruster solenoids. It is discretized into 4 states:
1. None
2. Small (1-40%): A small leak produces thrust and torque of less than 40% of the total thrust and torque the Mini-AERCam can produce to counteract it. A leak of this magnitude should not significantly affect the performance of the Mini-AERCam.
3. Large (41-80%): Produces thrust or torque within 80% of the Mini-AERCam’s. The Mini-AERCam can compensate and should be recoverable, but its performance is inadequate to perform its mission safely.
4. Critical (> 81%): The Mini-AERCam is expected to be uncontrollable.
Analysis of Mini-AERCam DFM Model
From “Risk-Informed Software Assurance for NASA Space Missions”, Sergio Guarro, ASCA Inc., November, 2007
• Analysis of the Autonomous Hold Failure Top Event yields n prime implicants (PIs)
$\text{Top Event} = \mathrm{PI}_1 \cup \dots \cup \mathrm{PI}_n$
• DFM prime implicants identify:
– HW-only fault conditions
– SW-only fault conditions
– Combinations of HW & SW fault conditions
• For example:
– Prime Implicant 1 is IsoValveCond = Stuck Closed at time-1 (HW-only fault)
– Prime Implicant 2 is TargetAtt = Inaccurate at time-1 (SW-only error)
(The TargetAtt node in the GN&C sub-model represents the accuracy of the target attitude determined by the rotational guidance software function. The PI identifies the possibility that a programmer introduced an error when coding the module, resulting in severely inaccurate output when the module is used.)
Mini-AERCam Model Analysis (cont’d)
From “Risk-Informed Software Assurance for NASA Space Missions”, Sergio Guarro, ASCA Inc., November, 2007
• Prime Implicant 3 is PropLineLeak = Small Leak at time-2 .and. RotThrusterComm = Slightly Inaccurate at time-1
This Prime Implicant corresponds to a combination of hardware and software conditions. (The hardware condition is a small leak in one of the propellant lines. The software condition is an algorithmic fault that causes drifting of the attitude control given a sub-nominal thrust caused by a line leak.)
– If only one of the two conditions exists, the Mini-AERCam does not fail:
  • The GN&C software works properly when no leak exists.
  • If a small leak occurs but there is no drift error in the attitude control, the GN&C is able to compensate for the leak by using the thrusters.
• This PI example shows how DFM analysis can identify an off-nominal entry condition for which the SW may have to be tested, one that:
– does not correspond to a normal state of the system;
– would not usually be identified and tested for in a standard SW V&V process addressing the SW operational profile.
Risk-Informed Testing of Potential SW Risk Scenario and Quantification of DFM Prime Implicant
From “Risk-Informed Software Assurance for NASA Space Missions”, Sergio Guarro, ASCA Inc., November, 2007
• Prime Implicant 3 is one of the mutually exclusive implicants. It can be quantified by considering:
– The “entry condition” (i.e. small propellant line leak)
– The conditional probability that the software causes an attitude shift under this triggering condition
• From a HW failure rate database (e.g., NPRD), the entry condition can be determined to occur with a failure rate of 6.00E-06/hr. For a 5 hour mission duration, the associated probability is P(C3) = 3.00E-05.
• The SW attitude control function can then be tested in the (real or simulated) presence of the system (HW fault) entry condition to determine whether it performs correctly or not
– Without the specific identification of the HW fault condition, random sampling of the SW normal operational input space may never cover the actual system condition!
• In the case discussed, the risk quantification process was completed via a simulated “hardware in the loop” test process
– Sampling was conducted across the possible range of initial states (i.e., Mini-AERCam spatial and rotational positions, compatible thruster command settings, etc.) in which the system could be at the onset of the leak condition.
• With the aid of the CSRM / DFM analysis, a normalized sampling set of 450 tests was sufficient to “demonstrate” a risk contribution in the order of 1.E-6 from this scenario, if no erroneous GN&C SW response was observed in the tests
– This was obtained via a straight Bayesian estimation, starting from a uniform, non-informative prior
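The arithmetic behind this kind of quantification can be sketched as follows. The entry-condition probability uses the rate and duration quoted above. The Bayesian part assumes a Beta(1, 1) uniform prior on the per-demand SW failure probability and zero failures observed in 450 tests; the function names are hypothetical, and the sketch does not attempt to reproduce the briefing's 1.E-6 figure, since the normalization of the sampling set is not described in the slides.

```python
import math

# Entry condition: small propellant line leak (values quoted in the briefing).
failure_rate_per_hr = 6.00e-06
mission_hours = 5.0

# Rare-event approximation P ~ rate * time, next to the exact exponential form.
p_entry_linear = failure_rate_per_hr * mission_hours
p_entry_exact = 1.0 - math.exp(-failure_rate_per_hr * mission_hours)

def posterior_mean(tests: int, failures: int = 0) -> float:
    """Posterior mean of the failure probability under a uniform Beta(1, 1) prior.
    The posterior is Beta(1 + failures, 1 + tests - failures)."""
    return (failures + 1) / (tests + 2)

def posterior_upper_bound(tests: int, confidence: float = 0.95) -> float:
    """Upper credible bound for zero observed failures.
    The posterior is Beta(1, tests + 1), whose CDF is 1 - (1 - p) ** (tests + 1)."""
    return 1.0 - (1.0 - confidence) ** (1.0 / (tests + 1))

p_sw_mean = posterior_mean(450)
p_sw_bound = posterior_upper_bound(450)

# Scenario risk contribution = P(entry condition) * P(SW fails | entry condition).
print(f"P(C3), linear approximation: {p_entry_linear:.2e}")
print(f"P(C3), exact exponential:    {p_entry_exact:.2e}")
print(f"Risk using posterior mean:   {p_entry_linear * p_sw_mean:.2e}")
print(f"Risk using 95% upper bound:  {p_entry_linear * p_sw_bound:.2e}")
```

Which summary of the posterior (mean, or an upper bound at some confidence level) is carried into the PRA model is a modeling choice; the sketch shows both so the sensitivity to that choice is visible.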