by Leon Rosenshein

Failure Modes

How does your system fail? What are the impacts of the most common/likely modes of failure? What are the workarounds? Does it fail-safe? Does it fail-degraded?

Consider the lowly moving walkway. The Pittsburgh airport has a bunch of them. They just sit there, going around and around. They make your life easier. But sometimes they fail. They can fail in several ways, but the most common is to simply stop moving. No one likes that, but you know what happens when a moving walkway stops moving? It becomes a floor. It stops reducing your workload, but you can still walk on it. The same goes for an escalator. If your escalator stops moving, it just becomes a set of stairs. It might be steeper than you want, and the risers at the top and bottom are likely a different height than the ones in the middle (that's a building code violation in most places), but you can still go up and down them. The Denver airport is like that. Yes, every escalator has stairs nearby in case there's a problem, especially when it's being worked on, but if it stops moving, you don't have to. That's fail-degraded.

Elevators have a different failure mode. When they stop moving, at best they're closets. The closets are safe, but there's no useful functionality. Fire doors often have a mechanism to hold them open, but if something happens and the power fails, they close. Not ideal, but safe. Back at the airport again, that luggage cart you rented has a bar you need to hold on to or it won't move. If the linkage breaks, the cart doesn't roll away; it's stuck. Truck and train air brakes work on the same principle. If there's no pressure, the brakes are full on. You need to apply the default pressure to release the brakes, then you add more to engage them again. If the hose or line breaks, you stop. Fail-safe. Your car, on the other hand, won't stop if there's a hole in the brake line.

So how does this apply to software? How do you respond to failures of subsystems or bad input? Can you keep operating, store things in some durable fashion, and then catch up? How do you ensure that you don't make things worse by only applying part of a change? If your snappy local cache is down, can you go back to the source of truth yourself? Or do you just say no and let the upstream thing, with more (hopefully enough) knowledge of the situation, do the right thing?
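
To make the cache question a little more concrete, here is a minimal Python sketch of the two choices. Everything in it is made up for illustration: `CacheDown`, `Unavailable`, `read_degraded`, `read_fail_safe`, and the stand-in cache and source functions are hypothetical names, not part of any real library, and a real system would have more failure cases than this.

```python
class CacheDown(Exception):
    """Raised by the (hypothetical) cache when it is unavailable."""


class Unavailable(Exception):
    """Raised to tell the caller we could not answer at all."""


def read_degraded(key, cache_get, source_get):
    """Fail-degraded: slower without the cache, but still answers."""
    try:
        return cache_get(key)
    except CacheDown:
        # The walkway stopped, so walk: go to the source of truth ourselves.
        return source_get(key)


def read_fail_safe(key, cache_get):
    """Fail-safe: refuse cleanly and let the upstream caller decide."""
    try:
        return cache_get(key)
    except CacheDown as err:
        raise Unavailable(f"cannot read {key!r} right now") from err


# Hypothetical stand-ins for a real cache and a real source of truth.
SOURCE_OF_TRUTH = {"gate": "B12"}


def broken_cache(key):
    raise CacheDown("cache is down")


def source_lookup(key):
    return SOURCE_OF_TRUTH[key]
```

With the cache down, `read_degraded("gate", broken_cache, source_lookup)` still returns the value from the source of truth, while `read_fail_safe("gate", broken_cache)` raises `Unavailable` and leaves the decision to whoever called it. Neither is right everywhere. Which one you want depends on whether the source of truth can take the extra load and whether the caller knows more about the situation than you do.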