University of Toronto, Department of Computer Science
© 2008 Steve Easterbrook. This presentation is available free for non-commercial use with attribution under a Creative Commons license.

Lecture 10: Managing Risk
General ideas about risk
Risk management
Identifying risks
Assessing risks
Case study: Mars Polar Lander
Risk Management: About Risk

Risk is "the possibility of suffering loss". Risk itself is not bad; it is essential to progress. The challenge is to manage the amount of risk.

Two parts: Risk Assessment and Risk Control.

Useful concepts:
For each risk, Risk Exposure: RE = p(unsatisfactory outcome) × loss(unsatisfactory outcome)
For each mitigation action, Risk Reduction Leverage: RRL = (RE_before − RE_after) / cost of intervention
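To make the two formulas concrete, here is a minimal sketch with purely hypothetical numbers (the dollar values and probabilities are illustrative, not from the lecture):

```python
def risk_exposure(probability: float, loss: float) -> float:
    """RE = p(unsatisfactory outcome) x loss(unsatisfactory outcome)."""
    return probability * loss

def risk_reduction_leverage(re_before: float, re_after: float, cost: float) -> float:
    """RRL = (RE_before - RE_after) / cost of the mitigation action."""
    return (re_before - re_after) / cost

# Hypothetical: a 20% chance of a $500k loss, cut to 5% by a $30k mitigation.
re_before = risk_exposure(0.20, 500_000)   # 100,000
re_after = risk_exposure(0.05, 500_000)    # 25,000
print(risk_reduction_leverage(re_before, re_after, 30_000))  # 2.5
```

An RRL above 1 means the mitigation buys more exposure reduction than it costs; here the intervention returns $2.50 of reduced exposure per dollar spent.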
Risk Assessment

Quantitative: measure risk exposure using standard cost and probability measures. Note: probabilities are rarely independent.

Qualitative: develop a risk exposure matrix, e.g. for NASA:

                          Likelihood of Occurrence
Undesirable outcome       Very likely     Possible        Unlikely
(5) Loss of Life          Catastrophic    Catastrophic    Severe
(4) Loss of Spacecraft    Catastrophic    Severe          Severe
(3) Loss of Mission       Severe          Severe          High
(2) Degraded Mission      High            Moderate        Low
(1) Inconvenience         Moderate        Low             Low
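The qualitative matrix is just a lookup, so it is easy to encode. A sketch using the NASA matrix above (the dictionary layout is my own, not NASA's):

```python
# Rows: undesirable outcome (severity 1-5); columns: likelihood of occurrence.
RISK_MATRIX = {
    5: {"very likely": "Catastrophic", "possible": "Catastrophic", "unlikely": "Severe"},
    4: {"very likely": "Catastrophic", "possible": "Severe",       "unlikely": "Severe"},
    3: {"very likely": "Severe",       "possible": "Severe",       "unlikely": "High"},
    2: {"very likely": "High",         "possible": "Moderate",     "unlikely": "Low"},
    1: {"very likely": "Moderate",     "possible": "Low",          "unlikely": "Low"},
}

def classify(severity: int, likelihood: str) -> str:
    """Map a (severity, likelihood) pair to its qualitative risk rating."""
    return RISK_MATRIX[severity][likelihood]

print(classify(4, "possible"))  # "Severe" -- possible loss of spacecraft
```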
Identifying Risks: Checklists
(Source: adapted from Boehm, 1989)

Personnel shortfalls: use top talent; team building; training
Unrealistic schedules/budgets: multisource estimation; designing to cost; requirements scrubbing
Developing the wrong software functions: better requirements analysis; organizational/operational analysis
Developing the wrong user interface: prototypes, scenarios, task analysis
Gold plating: requirements scrubbing; cost-benefit analysis; designing to cost
Continuing stream of requirements changes: high change threshold; information hiding; incremental development
Shortfalls in externally furnished components: early benchmarking; inspections, compatibility analysis
Shortfalls in externally performed tasks: pre-award audits; competitive designs
Real-time performance shortfalls: targeted analysis; simulations, benchmarks, models
Straining computer science capabilities: technical analysis; checking scientific literature
Identifying Risks: Fault Tree Analysis

Example, adapted from Leveson, "Safeware", p321, for a patient-monitoring system. Indentation shows the tree structure; children combine through the or-gates and and-gates of the original figure:

Wrong or inadequate treatment administered
    Vital signs erroneously reported as exceeding limits
    Vital signs exceed critical limits but are not corrected in time
        Frequency of measurement too low
            Computer does not read within required time limits
            Human sets frequency too low
        Vital signs not reported
            Computer fails to raise alarm
            Sensor failure
            Nurse fails to input them or does so incorrectly
            etc.
        Nurse does not respond to alarm

Notation: an event that results from a combination of causes sits above a gate (or-gate / and-gate); a basic fault event requires no further elaboration.
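Once a fault tree is written down, checking which combinations of basic faults reach the top event is mechanical. A minimal sketch (the event names are taken from the example above; the representation is my own, not Leveson's):

```python
from dataclasses import dataclass, field

@dataclass
class Event:
    name: str
    gate: str = "basic"            # "basic", "or", or "and"
    children: list = field(default_factory=list)

    def occurs(self, faults: set) -> bool:
        """True if this event happens given the set of basic faults present."""
        if self.gate == "basic":
            return self.name in faults
        results = [c.occurs(faults) for c in self.children]
        return any(results) if self.gate == "or" else all(results)

not_reported = Event("vital signs not reported", "or", [
    Event("computer fails to raise alarm"),
    Event("sensor failure"),
])
top = Event("vital signs not corrected in time", "or", [
    not_reported,
    Event("nurse does not respond to alarm"),
])

print(top.occurs({"sensor failure"}))  # True: a single basic fault propagates up
```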
Continuous Risk Management
(Source: adapted from the SEI Continuous Risk Management Guidebook)

Identify: search for and locate risks before they become problems, using systematic techniques to discover risks.
Analyse: transform risk data into decision-making information. For each risk, evaluate its impact, probability, and timeframe; then classify and prioritise risks (see the sketch after this list).
Plan: choose risk mitigation actions.
Track: monitor risk indicators; reassess risks.
Control: correct for deviations from the risk mitigation plans.
Communicate: share information on current and emerging risks.
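As a concrete illustration of the Analyse step, here is a minimal sketch of a risk record and a prioritisation pass. The field names (impact, probability, timeframe) follow the slide; the scoring rule (probability × impact, i.e. risk exposure) is an assumption, not prescribed by the SEI guidebook:

```python
from dataclasses import dataclass

@dataclass
class Risk:
    description: str
    impact: float        # loss if the risk materialises (e.g. dollars)
    probability: float   # likelihood of occurrence, 0.0-1.0
    timeframe: str       # when action is needed: "near", "mid", or "far"

    @property
    def exposure(self) -> float:
        """Assumed scoring rule: risk exposure = probability x impact."""
        return self.probability * self.impact

# Hypothetical risk register entries.
risks = [
    Risk("key developer leaves", impact=200_000, probability=0.1, timeframe="mid"),
    Risk("requirements churn",   impact=80_000,  probability=0.5, timeframe="near"),
]
for r in sorted(risks, key=lambda r: r.exposure, reverse=True):
    print(f"{r.exposure:>10,.0f}  {r.timeframe:<4}  {r.description}")
```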
Principles of Risk Management
(Source: adapted from the SEI Continuous Risk Management Guidebook)

Global perspective: view software in the context of a larger system; for any opportunity, identify both the potential value and the potential impact of adverse results.
Forward-looking view: anticipate possible outcomes, identify uncertainty, and manage resources accordingly.
Open communications: free-flowing information at all project levels; value the individual voice, with its unique knowledge and insights.
Integrated management: project management is risk management!
Continuous process: continually identify and manage risks; maintain constant vigilance.
Shared product vision: everybody understands the mission; common purpose, collective responsibility, shared ownership, and a focus on results.
Teamwork: work cooperatively to achieve the common goal; pool talent, skills and knowledge.
Case Study: Mars Climate Orbiter

Launched: 11 Dec 1998
Mission: interplanetary weather satellite; communications relay for Mars Polar Lander
Fate: arrived 23 Sept 1999; no signal received after initial orbit insertion
Cause: faulty navigation data, caused by a failure to convert imperial to metric units
MCO Events

Locus of error: a ground software file called "Small Forces" gives thruster performance data, used to process telemetry from the spacecraft. The effects of Angular Momentum Desaturation (AMD) maneuvers were underestimated, by a factor of 4.45.

Cause of error: the Small Forces data was given in pound-force seconds (lbf-s); the specification called for newton-seconds (N-s).

Result of error: as the spacecraft approached orbit insertion, the trajectory was corrected, aiming for a periapse of 226km on the first orbit. The estimates were adjusted as the spacecraft approached orbit insertion:
1 week prior: first periapse estimated at 150-170km
1 hour prior: down to 110km
(the minimum periapse considered survivable was 85km)
MCO entered Mars occultation 49 seconds earlier than predicted, and the signal was never regained after the predicted 21-minute occultation. Subsequent analysis estimates a first periapse of 57km.
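The factor of 4.45 is simply the pound-force to newton conversion (1 lbf ≈ 4.448 N). A sketch of the mismatch, with a hypothetical impulse value:

```python
# The "Small Forces" file carried thruster impulse in pound-force seconds
# (lbf-s); the navigation software consumed the numbers as newton-seconds (N-s).
LBF_S_TO_N_S = 4.448  # 1 lbf = 4.448 N: the "factor of 4.45" in the report

def impulse_in_newton_seconds(impulse_lbf_s: float) -> float:
    """The conversion the ground software should have applied."""
    return impulse_lbf_s * LBF_S_TO_N_S

reported = 10.0                 # hypothetical file value, actually in lbf-s
interpreted_as = reported       # the bug: read as if it were already N-s
actual = impulse_in_newton_seconds(reported)
print(actual / interpreted_as)  # ~4.45: every AMD maneuver underestimated by this factor
```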
[Figure: MCO Navigation Error. Diagram of the approach to Mars (with the direction to Earth marked), comparing the estimated trajectory and AMD ΔV's, aimed at a 226km periapse, with the actual trajectory and AMD ΔV's, which passed at 57km. The larger AMD ΔV's drove the trajectory down relative to the ecliptic plane. TCM-4 marks the final trajectory correction maneuver.]
Contributing Factors

For 4 months, the AMD data was not used (due to file format errors); navigators calculated the data by hand. The file format was fixed by April 1999, and anomalies in the computed trajectory became apparent almost immediately.

Limited ability to investigate: thrust effects were measured along the line of sight using doppler shift, but AMD thrusts are mainly perpendicular to the line of sight.

Poor communication: the navigation team was not involved in key design decisions, and did not report the anomalies in the issue tracking system.

Inadequate staffing: the operations team was monitoring 3 missions simultaneously (MGS, MCO and MPL).

Operations navigation team unfamiliar with the spacecraft: a different team from development & test; did not fully understand the significance of the anomalies; surprised that AMD was performed 10-14 times more often than expected.

Inadequate testing: the Software Interface Specification was not used during unit testing of the small forces software; the end-to-end test of the ground software was never completed; the ground software was considered less critical.

Inadequate reviews: key personnel were missing from critical design reviews.

Inadequate margins…
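On the testing point: a single unit test of the small-forces output against the Software Interface Specification would have caught the error. A hypothetical sketch (the function and values are illustrative, not the actual MCO ground software):

```python
import unittest

LBF_S_TO_N_S = 4.448

def small_forces_record(impulse_lbf_s: float) -> float:
    """Per the SIS, should return the impulse in N-s; reproduces the actual bug."""
    return impulse_lbf_s  # no conversion applied

class TestSmallForcesUnits(unittest.TestCase):
    def test_output_is_in_newton_seconds(self):
        # 10 lbf-s must appear in the file as ~44.48 N-s.
        self.assertAlmostEqual(small_forces_record(10.0), 44.48, places=2)

if __name__ == "__main__":
    unittest.main()  # this test fails -- exactly the check that was never run
```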
[Images: the Mars Climate Orbiter and Mars Global Surveyor spacecraft]
Lessons?

If your teams don't coordinate, neither will their software (see: Conway's Law).
With software, everything is connected to everything else -- every subsystem is critical.
If it doesn't behave how you expect, it's not safe (yes, really!).
Sidetrack: the SNAFU Principle

"Full communication is only possible among peers; subordinates are too routinely rewarded for telling pleasant lies, rather than the truth."

It is not a good idea to have the IV&V teams reporting to the program office!
Failure to Manage Risk

Inadequate margins: every major parameter was fixed, leaving risk as the only variable:
Science (functionality): fixed (with growth)
Schedule: fixed
Cost: fixed
Launch vehicle: fixed (some relief)
Risk: the only variable

Adapted from MPIAT, Mars Program Independent Assessment Team Summary Report, NASA JPL, March 14, 2000. See http://www.nasa.gov/newsinfo/marsreports.html
Symptoms of Failure to Manage Risk

Are overconfidence and complacency common? The Titanic effect: "it can't happen to us!" Do managers assume it's safe unless someone can prove otherwise?

Are warning signs routinely ignored? What happens to diagnostic data during operations? Does the organisation regularly collect data on anomalies? Are all anomalies routinely investigated?

Is there an assumption that risk decreases? E.g. are successful missions used as an argument to cut safety margins?

Are the risk factors calculated correctly? E.g. what assumptions are made about independence between risk factors?

Is there a culture of silence? What is the experience of whistleblowers? (Can you even find any?)