Oct 30
2009

Safe Systems from Unreliable Parts

Posted by Walter Bright in Software Development Methodology and Management, Security, Best Practices, Architecture and Design, Application Development, Anecdotes

A recent article in Wired [1] piqued my interest. It describes a medical "gamma knife", used to treat cancer patients, that suffered a software glitch. The emergency shutoff switch had no effect, and the staff had to run in and extract the patient before he was killed.

How does one create a safe machine? The first thought one has is to make the machine perfect so it cannot fail. Evidently, the designers of that gamma knife followed this principle, hoping their software is perfect and following up on imperfections with bug fixes.

The problem with that approach, of course, is that it is impossible to create a component that cannot fail. As programmers, we all know it is impossible to create bug-free software. Even if no expense were spared, perfection can only be approached asymptotically, at exponentially increasing cost. Even if the software were perfect, a hardware failure could corrupt it.

How can a safe system be created using flawed, unreliable parts, at a practical cost? I used to work for Boeing on flight critical systems design, and received quite an education on this. Boeing is, of course, spectacularly successful at making an inherently unsafe activity, flying, astonishingly safe. They do this by acknowledging the principle:

Any component can fail, at any time, in the worst conceivable manner.

Now, what to do about that?

Any critical component or system must have a backup.

Let's illustrate this with a bit of math. Given component A that has a 10% failure rate, we need to get it down to 1%. Improving the quality of that component by a factor of 10 will get us there, but at a cost explosion of 10 times the price. But suppose we add a backup component B that also has a 10% failure rate. The odds of A and B both failing simultaneously are 10% of 10%, or 1%. This is achieved by merely doubling the cost instead of an order of magnitude increase. The failure rate can be further reduced to 0.1% by adding another backup component C, at 3 times the cost instead of a hundred times.
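The arithmetic can be checked with a few lines of Python. This is only a sketch: the function name is made up for illustration, and it assumes that the components fail independently and that cost grows linearly with the number of components.

```python
def combined_failure_rate(failure_rate, copies):
    """Probability that every one of `copies` independent components fails."""
    return failure_rate ** copies


# One primary component, then one and two backups, each with a 10% failure rate.
for copies in (1, 2, 3):
    rate = combined_failure_rate(0.10, copies)
    print(f"{copies} component(s): failure rate {rate:.1%} at {copies}x the cost")
```

Each added backup multiplies the failure rate by 0.10 while adding only one more unit of cost.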

In order for this to work, though, A and B (and C) must be completely independent of each other. A failure in one cannot propagate to the other, and the circumstances that cause one to fail must also not cause the other to fail. This independence is where the hard work comes in.
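To see why independence matters, here is a rough sketch of what a shared cause does to the numbers. The model and the 5% figure are assumptions chosen for illustration, not data from any real machine: a common cause (for example, shared software) takes out A and B together, and otherwise each fails on its own.

```python
def both_fail(p_independent, p_shared):
    """Probability that A and B are both down at the same time.

    p_independent: chance that each component fails on its own.
    p_shared:      chance of a common cause that fails both at once
                   (a hypothetical figure, used only for illustration).
    """
    # Either the common cause strikes, or it does not and both fail separately.
    return p_shared + (1 - p_shared) * p_independent ** 2


fully_decoupled = both_fail(0.10, 0.0)
coupled = both_fail(0.10, 0.05)
print(f"decoupled: {fully_decoupled:.2%}, with a shared cause: {coupled:.2%}")
```

With no shared cause the result is the 1% computed earlier. With even a 5% shared cause, the combined failure rate can never drop below 5%, no matter how many backups are added.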

In the gamma knife example (I am inferring based on the scanty information in the article) the emergency shutoff switch relied on the same software as the rest of the system did, so when the software crashed, the failure propagated to the shutoff system, and that didn't work either.

A shutoff system not coupled to failures in the software could be one that simply turns off all power to the machine. The gamma radiation source would be automatically blocked upon power failure by shields that electromagnets normally hold back while power is on.
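The logic of such a failsafe can be sketched as a small truth table. The names here are hypothetical, and the model assumes the shield closes by itself whenever its electromagnet loses power.

```python
def beam_blocked(power_on, software_requests_beam):
    """Return True when the shield covers the radiation source."""
    # The electromagnet can hold the shield back only while power is present.
    magnet_energized = power_on and software_requests_beam
    # Assumption: with the magnet off, the shield falls closed on its own.
    return not magnet_energized


# A crashed program may keep requesting the beam forever;
# cutting power still blocks the source.
for power_on in (True, False):
    for software_requests_beam in (True, False):
        state = beam_blocked(power_on, software_requests_beam)
        print(f"power={power_on}, request={software_requests_beam}: blocked={state}")
```

The only state in which the beam is exposed requires power to be on, so a power cutoff switch works regardless of what the software is doing.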

Unfortunately, for the makers of the gamma knife, and in the resolution of earlier problems with similar machines [3], the focus was instead on rolling out bug fixes and similar attempts to make the software perfect. The comments on the article [1] also show a focus on trying to make the software perfect. While it is worthwhile to try to make the software better and less buggy, that is not the path to making safe systems, because it is impossible to make perfect software.

Conclusion

Reliance on writing perfect software and rolling out bug fixes to correct any imperfections is a fundamentally unsound and unsafe approach. A safe system is one with a backup or failsafe shutdown that is independent and completely decoupled from the primary system. Redundancy is not enough; decoupling is required as well, in order to prevent the failure of one system from propagating to the backup.

In the next installment, I'll talk about incorporating some of these ideas into software to improve the reliability of that software.

References

[1] 'Known Software Bug' Disrupts Brain-Tumor Zapping
http://www.wired.com/threatlevel/2009/10/gamma/

[2] History's Worst Software Bugs
http://www.wired.com/software/coolapps/news/2005/11/69355?currentPage=all

[3] An Investigation of the Therac-25 Accidents
http://courses.cs.vt.edu/~cs3604/lib/Therac_25/Therac_1.html

Thanks to Bartosz Milewski, Brad Roberts, David Held, and Jason House for reviewing this.



Comments (5)
Therac 25
written by David Wilson, October 30, 2009
Hi Walter,

Insightful article. I just wanted to point out that the Therac-25 incidents were not the last software-related radiotherapy failures to occur. For another example, see the IAEA report linked at the bottom of this entry:

http://www.johnstonsarchive.net/nuclear/radevents/2000PAN1.html
...
written by Jack Woehr, October 30, 2009
Simplicity as a design principle might help.
Design so that failures produce a safe condition
written by George Grimes, November 03, 2009
In the case above, have the power to the radiation source controlled by a countdown timer circuit that the system software can only reset. The timer could be a one-second countdown that the software resets 10 times per second. Once the software fails to reset it for one second, the power for the gamma knife would be gone.
This is a system-level rather than a pure software approach, but it is a system that we are talking about.
...
written by Harlan Cohen, November 04, 2009
Well, you took me right back to Tandem Non-Stop systems and DEC's guides for reliable systems.

Those systems looked towards monitoring points of failure and the quick implementation of redundant backups.

Countdown Counters
written by Ronald Martin, November 05, 2009
External countdown counters have long been a useful way to reset non-responsive or runaway programs, but they are not perfect. Especially in today's world of multi-tasking, it is easy to imagine a task dedicated to resetting the countdown timer that would continue to function normally while a critical task has failed. Even if the critical tasks are responsible for resetting the countdown timer, if there are multiple critical tasks, the situation is the same as before. If you only have one critical task, your software design might well be crippled.

The machine tool industry, which you would think might be less safety conscious than the medical device industry, has long required external devices to implement emergency stop functions, while most of the machine control is implemented by a programmable controller or by a computer.

The bottom line is that no number of automatic solutions can ever be sufficient when human life or injury is at stake. There must also be manually operated shut-off controls that are not dependent on the functionality of other parts of the system. These controls should be as simple as possible, and their installed functionality should be verifiable and periodically tested.
