by Leon Rosenshein


Sometimes the wrong solution is the right solution at a particular moment in time. There are times when less-than-elegant code is appropriate. Back in the days of boxed products and in-store shopping, getting the gold master disk to the manufacturer in time to make your production slot was one of them. Miss your slot and you might miss holiday sales. Today it’s outage mitigation. The longer an outage goes on, the more your customers suffer.

Which is not to say that you can or should write bad code at those times. Code that is full of bugs and unhandled edge cases is just as much a problem when you’re in crisis as it is when you’re not. But a quick, hard-coded if <weird, well-defined special case> then return at the right time and place can get your system up and running and keep your customers satisfied. You almost certainly need to add some work to your backlog to go back and develop the long-term fix, but that doesn’t negate the importance of the quick fix.
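As a rough illustration of that kind of guard, here is a minimal Python sketch. Everything in it is hypothetical: the tenant ID, the Request type, and the process function are made up for the example, and the special case in a real outage would be whatever you have actually pinned down.

```python
from dataclasses import dataclass

# Hypothetical: the one tenant whose malformed records triggered the outage.
KNOWN_BAD_TENANT = "tenant-1234"


@dataclass
class Request:
    tenant: str
    payload: str


def process(request: Request) -> str:
    """Stand-in for the normal (currently fragile) processing path."""
    return request.payload.upper()


def handle(request: Request) -> str:
    # MITIGATION: narrow, well-defined special case. Remove once the
    # long-term fix tracked in the backlog lands.
    if request.tenant == KNOWN_BAD_TENANT:
        return "skipped"
    return process(request)
```

The guard is deliberately narrow and sits in one obvious place, which makes it easy to test now and easy to delete later.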

Shipping/Mitigating is a feature. And it’s probably the most important one. Even if you get everything else perfect, if it never ships, have you really done anything? You’ve expended energy, but have you done any work?

So how do you know when to do the “right” design vs. the “expedient” design? Which one adds more customer value today? How much will the expedient solution slow you down tomorrow? How long will it take to earn back the time spent on the “right” solution? If the quick solution takes 5 minutes and doesn’t change your overall velocity, but the elegant solution will take a couple of man-months, then you probably shouldn’t bother being elegant.
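The payback question can be put into back-of-the-envelope arithmetic. The sketch below is one simple way to frame it, not a rigorous model. The function name, the choice of hours and weeks as units, and the figure of 320 hours for "a couple of man-months" are all assumptions made for the example.

```python
import math


def weeks_to_break_even(extra_cost_hours: float, drag_hours_per_week: float) -> float:
    """Weeks until the "right" design repays its extra up-front cost.

    extra_cost_hours: how much longer the right design takes than the quick one.
    drag_hours_per_week: time the quick fix costs the team each week it stays in place.
    """
    if drag_hours_per_week <= 0:
        # The quick fix never slows you down, so the extra cost is never repaid.
        return math.inf
    return extra_cost_hours / drag_hours_per_week


# Assumed numbers: roughly two person-months (320 hours) of extra work.
print(weeks_to_break_even(320, 0))  # inf: no drag, so don't bother being elegant
print(weeks_to_break_even(320, 4))  # 80.0: four hours of drag a week pays back in 80 weeks
```

If the break-even point is further out than the code is likely to live, the expedient design wins.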

A classic example of this is the Aurora scheduler for Mesos. Aurora is a stable, highly available, fault-tolerant, distributed scheduler for long-running services. It has all of the things you’d expect from such a scheduler. It handles health checks, rolling restarts/upgrades, cron jobs, log display, and more. If something happens to the lead scheduler, another one is automatically elected. If the scheduling system falls over, then when it comes back it looks at where it thinks it left off and reconciles itself with the actual running system. It also has an internal memory leak. There’s something in the code that the lead scheduler runs that will cause it to eventually lose its mind. I don’t know exactly what it is, and neither do the authors/maintainers. But they know it’s there and they have a solution. Every 24 hours the lead scheduler terminates itself and restarts. At that point the rest of the system chooses a new lead scheduler and everything continues happily. I don’t know how much time they spent looking for the problem, but the solution they came up with took a lot less time. It works in this case because the system is designed to be resilient and they had to get all of those features working anyway. They just found a way to take advantage of that work to handle another edge case. With that fix in place there’s no need to spend more time worrying about it. It’s an isolated, testable (and well-tested) bit of code, so it doesn’t impact their overall velocity.
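The restart-on-a-timer idea is small enough to sketch. To be clear, this is not Aurora’s actual code. It is a minimal Python illustration, with hypothetical names (LeaderLifetime, run_leader), of a leader that steps down after a fixed uptime and relies on the election machinery that already exists to pick a successor.

```python
import time
from typing import Callable

MAX_UPTIME_SECONDS = 24 * 60 * 60  # the 24-hour limit described above


class LeaderLifetime:
    """Tracks how long this process has been leader and says when to step down."""

    def __init__(self, max_uptime: float = MAX_UPTIME_SECONDS,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock  # injectable so tests do not have to wait 24 hours
        self._max_uptime = max_uptime
        self._started = clock()

    def should_step_down(self) -> bool:
        return self._clock() - self._started >= self._max_uptime


def run_leader(lifetime: LeaderLifetime, do_work: Callable[[], None]) -> None:
    """Do scheduling work until the lifetime expires, then return so the process can exit."""
    while not lifetime.should_step_down():
        do_work()
    # Returning here stands in for a clean exit; in this sketch the existing
    # failover logic is assumed to elect a new leader.
```

Injecting the clock is what keeps this kind of fix isolated and testable: a test can advance a fake clock past the limit and check that the loop ends.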

On the other hand, adding more and more band-aids and special cases to your event handler will eventually lead to emergent behavior, and your velocity will slow. Back in the Opus days we had a thing called the Flying Spaghetti Monster, or FSM. Its job was to manage the lifespan of your data. Basically, you would put a tag on a directory and then the system would delete it when its lifespan was over. It only had 3 modes: TimeSinceCreation, ChildTimeSinceCreation, and Quota. But the logic of how you could nest them, and the time periods on the first two, could lead to some odd corners, so whenever we had to touch that code the first thing we’d do was add more tests for the new case, because it just kept getting more complex. After about a year I realized that it would take me less time to redo the whole set of logic than it would to modify the house of cards it had turned into. Luckily we had all the tests to make sure the logic didn’t change.
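Those tests are what made the rewrite safe, and the pattern can be sketched in a few lines. This is not the FSM code. The is_expired rule below is an assumed, simplified reading of the TimeSinceCreation mode only, and the cases are invented for illustration. The point is the shape: a table of cases that pins current behavior, which any replacement implementation has to pass.

```python
from datetime import datetime, timedelta


def is_expired(created: datetime, lifespan: timedelta, now: datetime) -> bool:
    """Assumed TimeSinceCreation rule: expired once `lifespan` has elapsed since creation."""
    return now - created >= lifespan


# (description, age of the data, configured lifespan, expected result)
CASES = [
    ("fresh data is kept", timedelta(days=1), timedelta(days=30), False),
    ("data exactly at the limit is deleted", timedelta(days=30), timedelta(days=30), True),
    ("old data is deleted", timedelta(days=45), timedelta(days=30), True),
]


def run_cases(rule=is_expired) -> list:
    """Return descriptions of failing cases; an empty list means behavior is unchanged."""
    now = datetime(2020, 1, 1)
    return [desc for desc, age, lifespan, expected in CASES
            if rule(now - age, lifespan, now) != expected]
```

Passing a rewritten rule to run_cases and getting back an empty list is the signal that the new logic matches the old one on every case collected so far.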

So when do you do the right thing vs. the expedient thing? It depends on which gives more customer value over time, remembering that real value now is much more valuable than potential value later.