Intermediate Form

Margin of Safety


One of the biggest differences between writing computer software and practicing other forms of engineering is that it's not obvious how to design safety margins into software the way they can be designed into more physical engineering projects. This is a large part of why coming up with correct designs and implementations of computer programs seems to be so much harder than coming up with good designs for physical systems.

When designing a building, safety margins can be, and are, built in. While it's possible to try to absolutely minimize the amount of material used in the building's construction, doing so is generally not very useful. Instead, one designs with margins over the absolute minimum. These margins cover for small errors in the specification, design, and construction. While massive overdesign wastes material, some margin probably reduces total cost, by making the design problem simpler.

At least at the analog level, electronics can be similarly overdesigned. I am by no means an expert on this, but from what I understand, digital circuits recognize ranges of voltages as high and low. A gate may consider voltages below 1 volt low, and voltages above 4 volts high. These ranges allow the gates to function even when voltages vary from the ideal 0 and 5 volts. (When it comes to working with the gates themselves, the problems are similar to those faced by software designers.)
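A toy model of that idea in Python, using the hypothetical 1 volt and 4 volt thresholds from above:

```python
# Toy model of logic-level thresholds: anything below V_LOW reads as 0,
# anything above V_HIGH reads as 1, and the band in between is undefined.
# The 1 V / 4 V figures are the hypothetical values from the text.

V_LOW = 1.0   # volts; anything below this is a valid logic low
V_HIGH = 4.0  # volts; anything above this is a valid logic high

def read_logic_level(voltage: float) -> int:
    """Interpret a noisy analog voltage as a digital bit."""
    if voltage < V_LOW:
        return 0
    if voltage > V_HIGH:
        return 1
    raise ValueError(f"{voltage} V is in the forbidden zone: no margin left")

# A "low" signal that has drifted up to 0.7 V still reads correctly:
assert read_logic_level(0.7) == 0
# A "high" signal sagging to 4.3 V still reads correctly:
assert read_logic_level(4.3) == 1
```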

Software, on the other hand, suffers from the fact that it's generally not possible to overdesign it in the same way. Take a simple program, one that controls a traffic light. It needs to cycle the lights in the appropriate order. It's not obvious what we could do to overdesign this system. The only safe additional thing to do is to make all the lights red, but that isn't useful behavior in normal operation.
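For concreteness, here is a minimal sketch of such a controller; the state order, timings, and hardware call are all invented for illustration:

```python
import itertools
import time

# The cycle the specification demands; the timings here are made up.
CYCLE = [("green", 30), ("yellow", 5), ("red", 35)]

def set_lights(state: str) -> None:
    # Stand-in for the real hardware interface.
    print(f"lights -> {state}")

def run_controller() -> None:
    """Cycle the lights forever; on any internal error, fall back to
    the one safe state identified above: everything red."""
    try:
        for state, duration in itertools.cycle(CYCLE):
            set_lights(state)
            time.sleep(duration)
    except Exception:
        set_lights("all red")
        raise
```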

Software, to be correct, has to do what its specification says. In general, it's not sensible for it to do anything else. There's no general way to overdesign software systems the way there is for physical ones. That makes software hard to write, as for it to be correct, it needs to be perfect all the time, rather than simply on the safe side of a dividing line.

Now, it's possible for a software system to be designed robustly, so that small failures don't turn into large ones. The best systems are designed this way, although it's not obvious to me that there's a general method for doing so. But at the same time, I expect small glitches to occur from time to time when using my computer, while not expecting the building I'm in to fall down.

- Tom | permalink | changelog | Last updated: 2003-11-20 23:10


Comments

Posted on Friday, November 21, 2003 by Chris:

There are some things that you can do. For example, you can run a process concurrently with the one responsible for the current state, to make sure that the state is correct. I did this for a webserver solution I built, and it dramatically improved turnaround time after errors (i.e. after an error in the underlying fileserver, it would be noticed and fixed in under 60 seconds).
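Something like this sketch of the watchdog idea (the probe and repair hooks here are placeholders, not the actual webserver code):

```python
import time

def state_is_valid() -> bool:
    # Hypothetical probe of the system being watched
    # (e.g. the underlying fileserver mentioned above).
    return True

def repair_state() -> None:
    # Hypothetical recovery action (restart a daemon, remount, ...).
    print("repairing")

def watchdog(poll_interval: float = 30.0) -> None:
    """Poll the live system; a 30-second interval keeps the worst-case
    detect-and-fix time inside the 60-second window mentioned above."""
    while True:
        if not state_is_valid():
            repair_state()
        time.sleep(poll_interval)
```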

There are similar sorts of error checks that can be done. Looping constructs can include code to check for infinite-loop conditions (sort of). Also, one can be rigorous about always maintaining maximal state, both to catch errors (e.g. thrown exceptions) and to be able to restart from them. What do the shuttlecraft people do?
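A loop guard along those lines (the bound and the work function are made up):

```python
MAX_ITERATIONS = 1_000_000  # generous bound; exceeding it means a bug

def process(item) -> None:
    # Stand-in for the real per-item work.
    pass

def drain_queue(queue: list) -> None:
    """Process a queue that should empty quickly. The counter is the
    '(sort of)' infinite-loop check: it can't prove termination, but it
    turns a silent hang into a loud, restartable error."""
    for _ in range(MAX_ITERATIONS):
        if not queue:
            return
        process(queue.pop())
    raise RuntimeError("loop bound exceeded: probable infinite loop")
```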

Posted on Friday, November 21, 2003 by Tom:

I'd say running a second process to monitor the first is an example of robustness. The first process has already failed to do its job. It's the equivalent of having home insurance to rebuild your house after it has fallen down.

Coding standards help, but they don't mitigate the fundamental problem.

Assuming by "shuttlecraft" you mean the Space Shuttle, there are two ways it tries to achieve correctness. First, it has multiple computers running the programs, with mechanical summing of their results, thus turning software decisions into hardware ones. Second, there's an independent set of software written by a completely different team, hopefully with a different set of bugs.
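In pure software, the voting half of that idea looks something like this sketch (the Shuttle does its summing in hardware, and these example functions are invented):

```python
from collections import Counter

def voted_result(implementations, *args):
    """Run independently written implementations of the same spec
    and take the majority answer, so one version's bug is outvoted."""
    answers = [impl(*args) for impl in implementations]
    answer, count = Counter(answers).most_common(1)[0]
    if count <= len(answers) // 2:
        raise RuntimeError("no majority: redundant versions disagree")
    return answer

# Three hypothetical independently written versions of "double a number":
versions = [lambda x: x * 2, lambda x: x + x, lambda x: x << 1]
assert voted_result(versions, 21) == 42
```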

It's possible to reduce the frequency and impact of bugs, but it's hard to eliminate them entirely, since there's no way to add a margin of error.

Posted on Friday, November 21, 2003 by Chris:

I disagree that running a second process is like home insurance. It's much closer to putting guardrails on highways to protect against mechanical failures in cars. It means that in failure situations the failure doesn't have as much impact.

It's not really analogous to overdesigning bridges, though; you're right about that. How about redundant error checking? Checking return values and correctness throughout code paths: e.g., every function checks the validity of its arguments, even if it's only called internally by another function which is supposed to check the validity of those arguments as well. That would seem to be analogous.

Or checking the return value of every function, even those which should never return null (back in the days of C). That seems like structural redundancy.
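In Python terms, a sketch of both kinds of redundant check (the function names and checks are invented):

```python
def fetch_record(user_id: int):
    # Stand-in for a real data store.
    return {"id": user_id}

def lookup_user(user_id):
    # Redundant argument check: callers are supposed to validate too.
    if not isinstance(user_id, int) or user_id < 0:
        raise ValueError(f"bad user_id: {user_id!r}")
    record = fetch_record(user_id)
    # Redundant return check: fetch_record "should never" return None,
    # but if that assumption breaks we fail loudly instead of silently.
    if record is None:
        raise RuntimeError(f"no record for user {user_id}")
    return record
```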

Also, using bignum floats with 100 digits of precision in calculations needing only 2 sigfigs would be analogous. To generalize: if a calculated number requires n sigfigs, use m + 100 digits of precision, where m is the number of digits of precision required to get n correct digits at the end of the calculation. To make it a little more concrete, imagine if Python defaulted to bignum floats with 1000 digits of precision instead of doubles when doing floating-point calculation. I think that would be directly analogous to building a bridge for a 10,000 lb load to a 50,000 lb load specification, just to be safe.
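For instance, a small demonstration using Python's standard decimal module as a stand-in for true bignum floats:

```python
from decimal import Decimal, getcontext

# Plain doubles: cancellation destroys the answer entirely.
x = (1e16 + 3.14159) - 1e16
print(x)  # 4.0 -- wrong in the very first digit

# "Overdesigned" precision: 100 digits where only a few are needed.
getcontext().prec = 100
y = (Decimal(10) ** 16 + Decimal("3.14159")) - Decimal(10) ** 16
print(y)  # 3.14159 -- the margin absorbs the cancellation
```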

Posted on Friday, November 21, 2003 by Tom:

Precision only gets you so far, as it's impossible to improve upon the precision of a boolean value. Boolean values tend to appear far more often in computers than they do in the real world.

Error checking isn't really the same thing as adding in margin. Adding error checking increases the complexity of a design, generally by requiring a specification of what to do in each error case. Margin allows the same design to work for a larger range of inputs.

It's the large gap between true and false that needs to be overcome far more often in the digital world than the real one.
