university of toronto lecture 10: managing risksme/csc302/2009s/notes/10-risk.pdf · mars polar...

8
1 University of Toronto Department of Computer Science © 2008 Steve Easterbrook. This presentation is available free for non-commercial use with attribution under a creative commons license. 2 Lecture 10: Managing Risk General ideas about Risk Risk Management Identifying Risks Assessing Risks Case Study: Mars Polar Lander University of Toronto Department of Computer Science © 2008 Steve Easterbrook. This presentation is available free for non-commercial use with attribution under a creative commons license. 3 Risk Management About Risk Risk is “the possibility of suffering loss” Risk itself is not bad, it is essential to progress The challenge is to manage the amount of risk Two Parts: Risk Assessment Risk Control Useful concepts: For each risk: Risk Exposure RE = p(unsat. outcome) X loss(unsat. outcome) For each mitigation action: Risk Reduction Leverage RRL = (REbefore - REafter) / cost of intervention

Upload: others

Post on 29-Mar-2020

0 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: University of Toronto Lecture 10: Managing Risksme/CSC302/2009S/notes/10-risk.pdf · Mars Polar Lander University of Toronto Department of Computer Science ... interplanetary weather

1

University of Toronto Department of Computer Science

© 2008 Steve Easterbrook. This presentation is available free for non-commercial use with attribution under a creative commons license. 2

Lecture 10:Managing Risk

General ideas about RiskRisk Management

Identifying RisksAssessing Risks

Case Study:Mars Polar Lander

University of Toronto Department of Computer Science

© 2008 Steve Easterbrook. This presentation is available free for non-commercial use with attribution under a creative commons license. 3

Risk ManagementAbout Risk

Risk is “the possibility of suffering loss”Risk itself is not bad, it is essential to progressThe challenge is to manage the amount of risk

Two Parts:Risk AssessmentRisk Control

Useful concepts:For each risk: Risk Exposure

RE = p(unsat. outcome) X loss(unsat. outcome)For each mitigation action: Risk Reduction Leverage

RRL = (REbefore - REafter) / cost of intervention

Page 2: University of Toronto Lecture 10: Managing Risksme/CSC302/2009S/notes/10-risk.pdf · Mars Polar Lander University of Toronto Department of Computer Science ... interplanetary weather

2

University of Toronto Department of Computer Science

© 2008 Steve Easterbrook. This presentation is available free for non-commercial use with attribution under a creative commons license. 4

Likelihood of Occurrence

Very likely Possible Unlikely

(5) Loss of Life Catastrophic Catastrophic Severe

(4) Loss of Spacecraft Catastrophic Severe Severe

(3) Loss of Mission Severe Severe High

(2) Degraded Mission High Moderate Low Un

de

sir

ab

le

ou

tco

me

(1) Inconvenience Moderate Low Low

Risk AssessmentQuantitative:

Measure risk exposure using standard cost & probability measuresNote: probabilities are rarely independent

Qualitative:Develop a risk exposure matrix

Eg for NASA:

University of Toronto Department of Computer Science

© 2008 Steve Easterbrook. This presentation is available free for non-commercial use with attribution under a creative commons license. 5

Source: Adapted from Boehm, 1989Identifying Risk: Checklists

Personnel Shortfallsuse top talentteam buildingtraining

Unrealistic schedules/budgetsmultisource estimationdesigning to costrequirements scrubbing

Developing the wrong Softwarefunctions

better requirements analysisorganizational/operational analysis

Developing the wrong User Interfaceprototypes, scenarios, task analysis

Gold Platingrequirements scrubbingcost benefit analysisdesigning to cost

Continuing stream of requirementschanges

high change thresholdinformation hidingincremental development

Shortfalls in externally furnishedcomponents

early benchmarkinginspections, compatibility analysis

Shortfalls in externally performedtasks

pre-award auditscompetitive designs

Real-time performance shortfallstargeted analysissimulations, benchmarks, models

Straining computer sciencecapabilities

technical analysischecking scientific literature

Page 3: University of Toronto Lecture 10: Managing Risksme/CSC302/2009S/notes/10-risk.pdf · Mars Polar Lander University of Toronto Department of Computer Science ... interplanetary weather

3

University of Toronto Department of Computer Science

© 2008 Steve Easterbrook. This presentation is available free for non-commercial use with attribution under a creative commons license. 6

Identifying Risks: Fault Tree AnalysisWrong or inadequate

treatment administered

Vital signserroneously reportedas exceeding limits

Vital signs exceedcritical limits but not

corrected in time

Frequency ofmeasurement

too low

Vital signsnot reportedComputer

fails to raisealarm

Nurse doesnot respondto alarm

Computer doesnot read withinrequired time

limits

Human setsfrequencytoo low

Sensorfailure

Nurse failsto input themor does soincorrectly

etc

Event that results froma combination of causes

Basic fault eventrequiring no further

elaboration

Or-gate

And-gate

Source: Adapted from Leveson, “Safeware”, p321

University of Toronto Department of Computer Science

© 2008 Steve Easterbrook. This presentation is available free for non-commercial use with attribution under a creative commons license. 7

Source: Adapted from SEI Continuous Risk Management GuidebookContinuous Risk Management

Identify:Search for and locate risks before theybecome problems

Systematic techniques to discover risks

Analyse:Transform risk data into decision-makinginformationFor each risk, evaluate:

ImpactProbabilityTimeframe

Classify and Prioritise Risks

PlanChoose risk mitigation actions

TrackMonitor risk indicatorsReassess risks

ControlCorrect for deviations from the riskmitigation plans

CommunicateShare information on current andemerging risks

Page 4: University of Toronto Lecture 10: Managing Risksme/CSC302/2009S/notes/10-risk.pdf · Mars Polar Lander University of Toronto Department of Computer Science ... interplanetary weather

4

University of Toronto Department of Computer Science

© 2008 Steve Easterbrook. This presentation is available free for non-commercial use with attribution under a creative commons license. 8

Source: Adapted from SEI Continuous Risk Management GuidebookPrinciples of Risk Management

Global PerspectiveView software in context of a largersystemFor any opportunity, identify both:

Potential valuePotential impact of adverse results

Forward Looking ViewAnticipate possible outcomesIdentify uncertaintyManage resources accordingly

Open CommunicationsFree-flowing information at all projectlevelsValue the individual voice

Unique knowledge and insights

Integrated ManagementProject management is risk management!

Continuous ProcessContinually identify and manage risksMaintain constant vigilance

Shared Product VisionEverybody understands the mission

Common purposeCollective responsibilityShared ownership

Focus on results

TeamworkWork cooperatively to achieve thecommon goalPool talent, skills and knowledge

University of Toronto Department of Computer Science

© 2008 Steve Easterbrook. This presentation is available free for non-commercial use with attribution under a creative commons license. 9

Case Study: Mars Climate OrbiterLaunched

11 Dec 1998

Missioninterplanetary weather satellitecommunications relay for Mars PolarLander

Fate:Arrived 23 Sept 1999No signal received after initial orbitinsertion

Cause:Faulty navigation data caused by failureto convert imperial to metric units

Page 5: University of Toronto Lecture 10: Managing Risksme/CSC302/2009S/notes/10-risk.pdf · Mars Polar Lander University of Toronto Department of Computer Science ... interplanetary weather

5

University of Toronto Department of Computer Science

© 2008 Steve Easterbrook. This presentation is available free for non-commercial use with attribution under a creative commons license. 10

MCO EventsLocus of error

Ground software file called “Small Forces” gives thruster performance datadata used to process telemetry from the spacecraft

Angular Momentum Desaturation (AMD) maneuver effects underestimated(by factor of 4.45)

Cause of errorSmall Forces Data given in Pounds-seconds (lbf-s)The specification called for Newton-seconds (N-s)

Result of errorAs spacecraft approaches orbit insertion, trajectory is corrected

Aimed for periapse of 226km on first orbitEstimates were adjusted as the spacecraft approached orbit insertion:

1 week prior: first periapse estimated at 150-170km1 hour prior: this was down to 110kmMinimum periapse considered survivable is 85km

MCO entered Mars occultation 49 seconds earlier than predictedSignal was never regained after the predicted 21 minute occultationSubsequent analysis estimates first periapse of 57km

University of Toronto Department of Computer Science

© 2008 Steve Easterbrook. This presentation is available free for non-commercial use with attribution under a creative commons license. 11

Mars

To Earth

TCM-4

TCM-4

Larger AMD ΔV’sDriving trajectory downrelative to ecliptic plane

Estimated trajectoryand AMD ΔV’s

Actual trajectoryand AMD ΔV’s

226km57km

MCO Navigation Error

Periap

se

Page 6: University of Toronto Lecture 10: Managing Risksme/CSC302/2009S/notes/10-risk.pdf · Mars Polar Lander University of Toronto Department of Computer Science ... interplanetary weather

6

University of Toronto Department of Computer Science

© 2008 Steve Easterbrook. This presentation is available free for non-commercial use with attribution under a creative commons license. 12

Contributing FactorsFor 4 months, AMD data notused (file format errors)

Navigators calculated data by handFile format fixed by April 1999Anomalies in the computed trajectorybecame apparent almost immediately

Limited ability to investigate:Thrust effects measured along line ofsight using doppler shiftAMD thrusts are mainly perpendicular toline of sight

Poor communicationNavigation team not involved in keydesign decisionsNavigation team did not report theanomalies in the issue tracking system

Inadequate staffingOperations team monitoring 3 missionssimultaneously (MGS, MCO and MPL)

Operations Navigation teamunfamiliar with spacecraft

Different team from development & testDid not fully understand significance ofthe anomaliesSurprised that AMD was performed 10-14times more than expected

Inadequate TestingSoftware Interface Spec not used duringunit test of small forces softwareEnd-to-end test of ground software wasnever completedGround software considered less critical

Inadequate ReviewsKey personnel missing from criticaldesign reviews

Inadquate margins…

University of Toronto Department of Computer Science

© 2008 Steve Easterbrook. This presentation is available free for non-commercial use with attribution under a creative commons license. 13

Mars Climate Orbiter Mars Global Surveyor

Page 7: University of Toronto Lecture 10: Managing Risksme/CSC302/2009S/notes/10-risk.pdf · Mars Polar Lander University of Toronto Department of Computer Science ... interplanetary weather

7

University of Toronto Department of Computer Science

© 2008 Steve Easterbrook. This presentation is available free for non-commercial use with attribution under a creative commons license. 14

Lessons?

If your teams don’t coordinate, neither will their software

(See: Conway’s Law)

With software, everything is connected to everything else -- every subsystem is critical

If it doesn’t behave how you expect, it’s not safe(yes, really!)

University of Toronto Department of Computer Science

© 2008 Steve Easterbrook. This presentation is available free for non-commercial use with attribution under a creative commons license. 15

Sidetrack: SNAFU principle

Full communication is only possible among peers;Subordinates are too routinely rewarded for telling

pleasant lies, rather than the truth.

Not a good idea to have theIV&V teams reporting to the program office!!

Page 8: University of Toronto Lecture 10: Managing Risksme/CSC302/2009S/notes/10-risk.pdf · Mars Polar Lander University of Toronto Department of Computer Science ... interplanetary weather

8

University of Toronto Department of Computer Science

© 2008 Steve Easterbrook. This presentation is available free for non-commercial use with attribution under a creative commons license. 16

Failure to manage risk

InadequateMargins

Science (functionality)Fixed

(growth)

ScheduleFixed

CostFixed

Launch VehicleFixed

(Some Relief)

RiskOnly

variable

Adapted from MPIAT - Mars Program Independent Assessment Team Summary Report, NASA JPL, March 14, 2000.

See http://www.nasa.gov/newsinfo/marsreports.html

University of Toronto Department of Computer Science

© 2008 Steve Easterbrook. This presentation is available free for non-commercial use with attribution under a creative commons license. 17

Symptoms of failure to manage risk:Are overconfidence and complacency common?

the Titanic effect - “it can’t happen to us!”Do managers assume it’s safe unless someone can prove otherwise?

Are warning signs routinely ignored?What happens to diagnostic data during operations?Does the organisation regularly collect data on anomalies?Are all anomalies routinely investigated?

Is there an assumption that risk decreases?E.g. Are successful missions used as an argument to cut safety margins?

Are the risk factors calculated correctly?E.g. What assumptions are made about independence between risk factors?

Is there a culture of silence?What is the experience of whistleblowers? (Can you even find any?)