by Leon Rosenshein

Chaos

Distributed systems are hard. And they're hard in lots of ways. One of them is emergent behaviour: the idea that simple rules, or small changes to them, can have big, system-wide impacts. Sometimes good, sometimes bad, but often surprising.

Consider a simple Nginx load balancer. Let's say you've got 5 stateless backends behind a single address. By default Nginx does round-robin load balancing, passing the same number of requests to each backend. Great. You'll end up with consistent, even load. Or not. You only get consistent, even load when all of the requests cost about the same to handle. So what happens if 20% of your requests take 3x as long? Round-robin balances the count of requests, not the work they carry, so if that ⅕ of your traffic happens to arrive in a pattern that lines up with the rotation, say every fifth request, one backend catches nearly all of the expensive ones, becomes overloaded, and falls over. Now you only have 4 servers, which may not be enough to handle the load, so they start failing too, and suddenly you have users at the gate with pitchforks and torches.
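
For reference, here's roughly what that setup looks like in an Nginx config. The backend names are made up for the example; with no balancing directive inside the upstream block, Nginx falls back to its default round-robin behavior.

```nginx
# Hypothetical upstream: five stateless backends, round-robin by default
upstream app_backends {
    server app1.internal:8080;
    server app2.internal:8080;
    server app3.internal:8080;
    server app4.internal:8080;
    server app5.internal:8080;
}

server {
    listen 80;
    location / {
        # Each request goes to the next backend in the rotation,
        # regardless of how expensive that request is to serve.
        proxy_pass http://app_backends;
    }
}
```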

But you can't think of everything. So what can you do? That's where chaos engineering comes in. Chaos engineering is the idea that instead of waiting for something weird to happen at oh-dark-30, you build systems that cause those weird things to happen while you're awake and watching. Then you notice them, figure out how to mitigate and prevent them, and stop worrying about them happening in the middle of the night.

There are lots of things that your chaos injector can do. It can add latency, remove instances, fuzz data, increase load, eat memory or disk, fail a sensor, or otherwise muck with inputs and capacity. And it might do them one at a time, or it might do them in combination, because if you run low on memory you can just swap to disk, unless the disk is also full. Then what? The whole thing falls over.
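
To make that concrete, here's a minimal sketch of the latency-and-failure flavor of injection, written as a Python decorator. It isn't any particular chaos tool; the names (`chaos`, `handle_request`) and the probabilities are made up for illustration, and a real system would put these knobs behind configuration and monitoring.

```python
import random
import time
from functools import wraps

def chaos(latency_prob=0.1, max_delay_s=2.0, failure_prob=0.02):
    """Wrap a handler and randomly inject extra latency or a hard failure."""
    def decorator(handler):
        @wraps(handler)
        def wrapper(*args, **kwargs):
            if random.random() < latency_prob:
                # Simulate a slow dependency or an overloaded downstream.
                time.sleep(random.uniform(0, max_delay_s))
            if random.random() < failure_prob:
                # Simulate a dead instance or a dropped connection.
                raise ConnectionError("chaos: injected failure")
            return handler(*args, **kwargs)
        return wrapper
    return decorator

@chaos(latency_prob=0.2, failure_prob=0.05)
def handle_request(payload):
    # Stand-in for real request handling.
    return {"ok": True, "echo": payload}
```

The combinations are where it gets interesting: inject latency into one dependency while an instance is down, or while the disk is filling up, and see whether your fallbacks actually hold.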

For us it's software, but really it's just another application of system failure analysis. In aerospace we had the "iron bird": hook up as many real parts as you can, simulate or work around the rest, and add an external environment. Then break something and see what happens. More advanced rigs break things with switches and valves, but I've talked to people who have simulated hydraulic failures by taking an axe to a hydraulic line. Kind of messy, but very realistic.

About 10 years ago Netflix popularized the concept in the software world. Things have come a long way since the original Chaos Monkey turned off a server just to see what would happen. So think about what adding that kind of testing to your systems might show you.

And in case you were wondering, the solution to the Nginx problem we came up with was to change the distribution policy from round-robin to least-used. It's a little bit harder for Nginx to keep track of, but it does a better job of balancing time spent handling requests instead of balancing the count of requests handled. In our case that made a big difference. YMMV.
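
Assuming "least-used" maps to Nginx's built-in least-connections method, the change is a single directive in the upstream block (backend names again hypothetical):

```nginx
upstream app_backends {
    # Route each new request to the backend with the fewest active connections,
    # which roughly tracks time spent handling requests rather than request count.
    least_conn;
    server app1.internal:8080;
    server app2.internal:8080;
    server app3.internal:8080;
    server app4.internal:8080;
    server app5.internal:8080;
}
```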