Death by Software: The Therac-25 Radio-Therapy Device
Brian MacKay
ESE6361 - Requirements Engineering – Fall 2013
2. The Atomic Age
• World War II ushered in the atomic age
  • The start of the nuclear arms race
• In many countries…
  • The question was how to harness this power for peaceful purposes
3. In Canada: AECL
• Atomic Energy of Canada Limited is a “Crown Corporation”
• Designed and implemented a heavy-water nuclear reactor
  • The CANDU system
• It also included AECL-Medical
  • Harnessing the atom for medical purposes
4. AECL & CGR – Medical Accelerator Technology
• AECL-Medical and the French company la Compagnie Générale de Radiologie (CGR)
• Worked together during the 1970s on using linear accelerators for radio-therapy
  • High energy, low dose, Electron beams, or
  • A stream of photons in the X-Ray spectrum
• The two companies’ partnership produced
  • The 6 MeV, X-Ray only “Therac-6”
  • The dual mode, 20 MeV “Therac-20”
5. Therac-6 & Therac-20
• Stand-alone electro-mechanical units
• Operator could
  • Set all settings manually
  • Position beam devices manually
  • Once everything was set, and the system was “safe” – deliver the dose
• The system had an optional computer that allowed a simpler UI
  • A Digital Equipment PDP-11
  • 32 kilobytes of memory
  • All assembly code
6. True Innovation: the Therac-25
• AECL only – CGR partnership had dissolved
• Used a Double-Pass accelerator
  • Halved the space that the Therac-6 & Therac-20 had occupied
• Made the computer the primary controller
  • No stand-alone manual mode
• Shipped in 1983
  • Still used a DEC PDP-11
7. It was the best on the market…
• Except…
• It seriously injured 6 patients between 1985 and 1987
• Killing 3 of those patients
• All because of software
8. Hubris
• When an engineer graduates in Canada, he/she attends The Ritual of the Calling of an Engineer
• And gets an Iron Ring
• Rudyard Kipling wrote the ceremony
  • Instills a sense of professionalism
  • And humility
9. Supreme Faith in Software
• It appears that this device had rigorous safety engineering on the hardware side
  • Complete hazard analysis – fault tree
• On the software side, the likelihood of error was described in insanely low terms
  • Fault probabilities on the order of 10^-9 and 10^-11
• “Software does not degrade due to wear, fatigue or the reproduction process”
• They had no expectation that a bug could cause a problem
10. Malfunction 54
• When there was a problem, the UI displayed the word “Malfunction” followed by a number from 1 to 64
  • The user manual did NOT document what these codes meant
  • An internal AECL service manual described #54 as “dose input 2” and noted that this error code existed only for internal diagnostic purposes
• Under normal conditions, an operator might see as many as 40 malfunction codes in a day
  • But Malfunction 54 was very rare
  • Malfunctions were easily dismissed by pressing [P] (for “Proceed”)
11. Electron Mode vs. X-Ray Mode
• In Electron Mode a low power beam is scanned across the patient
• In X-Ray mode a high power beam is aimed at a target, producing X-Rays, which then irradiate the patient
• The electron scanning mechanism and X-Ray target were mounted on a turntable
  • The position was controlled by the computer
12. Usability
• User interface was a VT-100 Green Screen
• Contained the Prescription
  • Entered by the operator
• Originally – on error, the prescription had to be re-entered
  • Usability studies changed this, near the end of the dev cycle
• The change introduced a major error
PATIENT NAME   : JOHN DOE
TREATMENT MODE : FIX     BEAM TYPE: X     ENERGY (MeV): 25

                             ACTUAL    PRESCRIBED
UNIT RATE/MINUTE             0         200
MONITOR UNITS                50 50     200
TIME (MIN)                   0.27      1.00

GANTRY ROTATION (DEG)        0.0       0         VERIFIED
COLLIMATOR ROTATION (DEG)    359.2     359       VERIFIED
COLLIMATOR X (CM)            14.2      14.3      VERIFIED
COLLIMATOR Y (CM)            27.2      27.3      VERIFIED
WEDGE NUMBER                 1         1         VERIFIED
ACCESSORY NUMBER             0         0         VERIFIED

DATE   : 84-OCT-26     SYSTEM : BEAM READY     OP.MODE: TREAT AUTO
TIME   : 12:55. 8      TREAT  : TREAT PAUSE             X-RAY 173777
OPR ID : T25VO2-RO3    REASON : OPERATOR       COMMAND:
13. A Race Condition – UI & Operations Threads
• In the Therac-25, the operator entered
  • The prescription information
  • The Electron/X-Ray mode
  • Then a command to execute
• If the operator
  • Entered an X-Ray command in error
  • Re-edited the page and changed it to Electron
  • Then executed the dose, all within 8 seconds
• Then the patient received the high-power X-Ray-mode beam with the turntable in the Electron position, i.e. without the X-Ray target in the beam path
[Screen display as on slide 12, now showing the message: Malfunction 54]
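The sequence on this slide can be illustrated with a small deterministic model. This is only a sketch under stated assumptions, not AECL's code (which was PDP-11 assembly): the names `SharedState`, `run_treatment` and `is_hazardous` are hypothetical, the 8-second window is taken from the slide, and the two "tasks" are simulated in a single loop rather than run as real concurrent tasks.

```python
from dataclasses import dataclass

SETUP_SECONDS = 8  # from the slide: edits inside this window went unnoticed


@dataclass
class SharedState:
    """Hypothetical global variables shared by both tasks, with no synchronization."""
    entered_mode: str = "X"    # what the operator has typed on screen
    beam_power: str = "unset"  # latched once by the setup task
    turntable: str = "unset"   # positioned from the on-screen mode at execute time


def latch_power(mode: str) -> str:
    return "high" if mode == "X" else "low"


def run_treatment(edit_at_second=None) -> SharedState:
    """Simulate one treatment; the operator may change X to E at edit_at_second."""
    state = SharedState(entered_mode="X")
    # Setup task starts: reads the mode ONCE and latches the beam power.
    state.beam_power = latch_power(state.entered_mode)
    for second in range(SETUP_SECONDS):
        # UI task: the operator's edit rewrites the shared variable...
        if edit_at_second == second:
            state.entered_mode = "E"
        # ...but the setup task is busy and never re-reads entered_mode.
    if edit_at_second is not None and edit_at_second >= SETUP_SECONDS:
        # An edit after setup has finished is noticed, so setup is redone.
        state.entered_mode = "E"
        state.beam_power = latch_power(state.entered_mode)
    # The turntable follows whatever mode is on screen when the dose is executed.
    state.turntable = "xray_target" if state.entered_mode == "X" else "electron_scan"
    return state


def is_hazardous(state: SharedState) -> bool:
    """High beam power without the X-Ray target in the beam path."""
    return state.beam_power == "high" and state.turntable != "xray_target"
```

In this model, `is_hazardous(run_treatment(edit_at_second=3))` is true, while a treatment with no edit, or with an edit after the 8-second window, is not hazardous. The point is that two pieces of state derived from the same input are read at different times with nothing forcing them to agree.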
14. Why Have One Deadly Bug?
• A second deadly bug was eventually found in the Therac-25
• The system periodically tested whether everything was positioned properly, setting a variable with the result of the test
  • A zero indicated OK
• Instead of simply setting the value to 1 or 0, the program incremented the value
  • And the variable was a single byte, so it wrapped around to zero on overflow
• The result was that on every 256th test of the positioning, the system would falsely indicate that everything was ready to proceed.
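The wrap-around arithmetic can be shown in a few lines. This is a minimal sketch, assuming a one-byte flag where zero means "OK" as the slide describes; the function names are hypothetical and the masking with `0xFF` stands in for the 8-bit register of the original machine.

```python
def passes_falsely_ok(n_tests: int) -> list[int]:
    """Return the test numbers on which a 'not ready' flag reads as zero (OK).

    The positioning is assumed to be wrong on every test, so the flag
    should never be zero.
    """
    flag = 0
    false_ok = []
    for test_number in range(1, n_tests + 1):
        flag = (flag + 1) & 0xFF  # buggy: increment within a single byte
        if flag == 0:             # zero is interpreted as "everything is OK"
            false_ok.append(test_number)
    return false_ok


def passes_falsely_ok_fixed(n_tests: int) -> list[int]:
    """Same loop, but the flag is assigned a constant instead of incremented."""
    false_ok = []
    for test_number in range(1, n_tests + 1):
        flag = 1                  # fixed: any nonzero constant, never wraps
        if flag == 0:
            false_ok.append(test_number)
    return false_ok
```

With 600 tests, the buggy version reports a false "OK" on tests 256 and 512, and the fixed version reports none. If the operator happened to issue a command on exactly one of those passes, the positioning check was effectively skipped.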
15. Noteworthy: The Users Found the Bugs
• It’s worth noting that AECL’s initial reaction to the problems was denial
  • Eventually, they got to the stage where they did piecemeal fixes
• Without the efforts of the staff at the East Texas Cancer Center in Tyler, AECL might never have acknowledged the first bug
  • After two accidents – with the same operator – the hospital staff spent time trying to recreate the race condition
• After the Therac-25, the FDA changed the way it evaluated software (and software engineering) in medical devices.
16. The Scorecard

                                   Total Accidents    Deaths
Malfunction 54 Race Condition             3              2
Incorrect Increment Logic                 3              1*
Total                                     6              3

* One patient died of cancer, but would have died of radiation poisoning in a few weeks had the cancer not killed him
17. Not the Bugs – The Software Engineering
• All software systems have bugs
  • Even Knuth hands out the occasional $2.56 check
• AECL coalesced their entire operator interface, control system and safety system into one program
• They apparently had very little in the way of formal requirements gathering, design or development standards
  • All of the software was developed by one programmer
• Their reaction to the problems was to fix them one at a time
18. Software Reuse
• The Therac-20 reused some of the software from the Therac-6
• The Therac-25 reused software from both of the previous models
• But
  • The earlier models had hardware interlocks to prevent overdosing
• The desire to reuse previous software resulted in a
  • Home-made real-time operating system
  • On an expensive, 10-year-old computer system
  • Running a program written entirely in assembly language
  • That relied on global variables for inter-task communication – without synchronization
19. No Requirement to Separate Layers
• AECL architected the Therac-25’s software into a single point of failure
• This was far from accepted practice in the early 1980s
  • Safety systems were migrating from hardware to software
  • But… they were usually separate, simpler systems – e.g. PLCs
• By the early 80s, there were usually three distinct layers
  • Safety and integrity
  • Control and positioning
  • Operator interface and supervisory
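The value of a separate safety layer can be sketched as follows. This is an illustration only, assuming a simple veto model; `MachineSensors`, `control_layer_ready`, `safety_layer_permits` and `may_fire` are hypothetical names and do not describe any real Therac or PLC interface.

```python
from dataclasses import dataclass


@dataclass
class MachineSensors:
    """Hypothetical readings taken independently of the control software."""
    beam_power: str          # "low" or "high"
    turntable_position: str  # "electron_scan" or "xray_target"


def control_layer_ready(software_flag: int) -> bool:
    """Control layer: trusts its own status variable (zero means OK)."""
    return software_flag == 0


def safety_layer_permits(sensors: MachineSensors) -> bool:
    """Safety layer: a small, separate rule based only on sensor readings."""
    if sensors.beam_power == "high":
        return sensors.turntable_position == "xray_target"
    return True


def may_fire(software_flag: int, sensors: MachineSensors) -> bool:
    """The beam fires only if BOTH layers agree; either one can veto."""
    return control_layer_ready(software_flag) and safety_layer_permits(sensors)
```

For example, `may_fire(0, MachineSensors("high", "electron_scan"))` is false even though the control layer's flag says OK, because the independent check still sees high power without the X-Ray target in place. This mirrors what the hardware interlocks did on the earlier models.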
20. Testability – Auditing
• AECL’s task architecture and real time OS made adequate testing nearly impossible
  • Look at the deadly errors – neither is discoverable through testing
• No auditing of operations or failures was included in the system
• After all the issues with the Therac-25, a check was done on the Therac-20 system and the same bugs were found
  • But, because that system had mechanical interlocks, no injuries resulted
21. References
• “Medical Devices – The Therac-25”, Leveson, Nancy. http://sunnyday.mit.edu/papers/therac.pdf
• “An Investigation of the Therac-25 Accidents”, Leveson, Nancy and Turner, Clark S., IEEE Computer, Vol. 26, No. 7, July 1993, pp. 18-41. http://courses.cs.vt.edu/~cs3604/lib/Therac_25/Therac_1.html
• “Fatal Dose - Radiation Deaths linked to AECL Computer Errors”, Rose, Barbara Wade, Saturday Night (magazine), June 1994. http://www.ccnr.org/fatal_dose.html
• “Safety-Critical Computing: Hazards, Practices, Standards, and Regulation”, Jacky, Jonathan. http://staff.washington.edu/jon/pubs/safety-critical.html
• “Therac-25”, Wikipedia. http://en.wikipedia.org/wiki/Therac-25
• “PDP-11”, Wikipedia. http://en.wikipedia.org/wiki/PDP-11
• “PDP-11 architecture”, Wikipedia. http://en.wikipedia.org/wiki/PDP-11_architecture