I’ve talked about failure modes before, but there was an incident last weekend that reminded me of them again. United 328, leaving Denver for Honolulu, had a failure. A pretty catastrophic one. Most likely they lost a fan blade, which took out the rest of the engine. But as scary as it was for the folks on the plane and on the ground under it, no one got hurt. That’s because a group of engineers thought about the safety case and the possible failure modes and planned for them.
And that meant that the aircraft had systems built in to minimize the impact and risk of rapid, unplanned disassembly. Things like a containment system for the flying fan blades, automated fuel cutoff and fire extinguishers, and sufficient thrust from the other engine to keep making forward progress. Not just mechanical systems, but also operational systems like cockpit resource management, ATC procedures, and cabin crew plans. All of which worked together to get the plane safely back on the ground.
What makes this even more impressive is that all of those mechanical safety systems are heavy, and one thing you try to eliminate on planes is weight. Because every pound of the aircraft’s own weight is a pound of revenue-generating stuff you can’t carry. That’s why the seats are so uncomfortable. And all those processes and training cost time and money, but they do them anyway. Because it’s just that important.
The same thing applies to software in general, not just robot software. What are the potential failure modes? What can you do to make sure you don’t make things worse? What do you need to do to ensure the best possible outcome when something goes wrong?
We can’t ensure 100% safety, but we need to do everything we can to minimize risk before something goes wrong and then do everything we can to get to the best result possible if it does.