by Leon Rosenshein

For Want Of A Nail

We build fault tolerant distributed systems. Not just fault tolerant distributed systems, but fault tolerant systems built on top of fault tolerant distributed systems. And they're connected by a distributed, fault tolerant network, using diverse fibers from different companies. That's a lot of fault tolerance. You'd think something that tolerant couldn't fail. Yet sometimes we get failures.

Most of the time they're simple failures and the system is degraded. We lose a fiber and traffic gets a little slower. We lose a few nodes and the rest of the system picks up the work. Some bad data slips into the system, so instead of surge pricing the cost stays flat. There are problems, but they're localized and isolated, and the system gets better when the problem is fixed.

But sometimes the problem triggers a feedback loop. That's when things get interesting, and they get interesting fast. Think back to that fiber we lost in the earlier example. Generally, no big deal. Traffic is rerouted and things move on. But what if we lost 10% of our capacity and we only had 15% overhead? Everything's fine, right? Usually, but not always. Suddenly things take longer, so we get a few timeouts. But since the system is fault tolerant, instead of throwing an error we retry the message. Now we've increased our network traffic, so we get more timeouts, which increases the traffic. Then some service somewhere gets overloaded because of the traffic increase. But that's OK, because we've got auto-scaling that notices and spins up more instances of the service. That spreads the load, but increases the traffic even more, and now our healthchecks are impacted, so the proxy decides to stop sending traffic to those "failed" instances. So the load on the remaining instances increases, and they actually fall over. Now we've got even more traffic, and nothing left to respond to it.
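To see how fast that loop runs away, here's a back-of-the-envelope sketch. The numbers are assumptions for illustration, not measurements from any real incident: we're offering 95% of the capacity that's left after the fiber cut, the added latency makes 5% of requests time out, and every timeout gets retried up to 3 times.

```go
package main

import "fmt"

// Toy model of the retry feedback loop: retries add traffic, extra traffic
// pushes load past capacity, which causes more timeouts, which cause more
// retries. All constants below are illustrative assumptions.
func main() {
	const (
		capacity    = 1.00 // normalized remaining capacity after the fiber cut
		baseLoad    = 0.95 // offered load before any retries
		baseTimeout = 0.05 // fraction timing out just from the added latency
		maxRetries  = 3.0  // retries per timed-out request
	)

	load := baseLoad
	for step := 0; step < 8; step++ {
		// Anything beyond capacity times out, on top of the
		// latency-driven timeouts we started with.
		overload := 0.0
		if load > capacity {
			overload = (load - capacity) / load
		}
		timeoutFrac := baseTimeout + overload

		// Retries add their own traffic, which feeds into the next step.
		load = baseLoad * (1 + maxRetries*timeoutFrac)
		fmt.Printf("step %d: offered load %.2fx capacity, %.0f%% timing out\n",
			step, load, timeoutFrac*100)
	}
}
```

Run it and the offered load climbs from just under capacity to several times capacity in a handful of steps, even though the "real" demand never changed. That's the feedback loop doing the damage, not the original fault.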

It's always the little things. Someone with an excavator dug a little too deep or a little too wide and broke a buried fiber-optic cable in Kansas. Some messages were retried to route around the gap. Retries caused healthchecks to fail. Failed healthchecks concentrated traffic on a few instances of a service in Virginia. Increased load meant the service couldn't respond in time. Jane Public in Cape Town pushed a button and a car didn't show up. And that's a cascading failure. And that's how well designed, isolated, localized, fault tolerant systems die.

One of the hardest problems with fault tolerant distributed systems is that the fault tolerance that keeps them working up to a point is the very thing that takes them out once you pass that point. And cascading failures are very hard to recover from, because all of the things you're doing to recover are suddenly making the problem worse. To the point where sometimes the best solution is turning it off and back on. Sometimes physically, and sometimes you can get away with a virtual reboot.

There are lots of ways to help prevent cascading failures, but they mostly come down to figuring out when to stop trying to be fault tolerant and just fail quickly. And restart quickly. You can read more (and find more links) in these articles.
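One common shape for that "fail quickly" decision is a circuit breaker: after enough consecutive failures, stop calling the struggling dependency for a while and return an error immediately instead of piling retries on top of it. Here's a minimal sketch; the names and thresholds are mine for illustration, not from any particular library.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// breaker is a deliberately simple circuit breaker: after `threshold`
// consecutive failures it "opens" and fails fast for `cooldown`,
// instead of continuing to hammer a dependency that's already struggling.
type breaker struct {
	failures  int
	threshold int
	openUntil time.Time
	cooldown  time.Duration
}

var errOpen = errors.New("circuit open: failing fast")

func (b *breaker) call(fn func() error) error {
	// While the breaker is open, don't even try the call.
	if time.Now().Before(b.openUntil) {
		return errOpen
	}
	if err := fn(); err != nil {
		b.failures++
		if b.failures >= b.threshold {
			b.openUntil = time.Now().Add(b.cooldown)
			b.failures = 0
		}
		return err
	}
	b.failures = 0 // a success resets the count
	return nil
}

func main() {
	b := &breaker{threshold: 3, cooldown: 2 * time.Second}
	flaky := func() error { return errors.New("timeout") }

	// The first few calls fail normally; once the breaker opens,
	// later calls fail fast without touching the dependency at all.
	for i := 0; i < 5; i++ {
		fmt.Println(i, b.call(flaky))
	}
}
```

The important design choice is that failing fast costs the struggling service nothing, while one more "helpful" retry costs it everything it has left.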