by Leon Rosenshein

Stable Code

Stable code is a good thing, right? Write it, build it, deploy it, and it just keeps working. It runs and runs. You don't need to worry about it anymore. What more could you ask for? How about repeatability? There's more to a stable service than the first release. There are all the subsequent releases as well. You need to keep doing it. New requirements and requests come in, you write some more code, build it again, then deploy it. In most cases those releases come on a fairly frequent cadence, and you get pretty good at it. And if you're not that good at it, you get better. You practice. You write runbooks. You automate things. You evolve with the rest of your environment.

But sometimes it doesn't work that way. Maybe it was the first release, or maybe it was the 10th, but eventually it happens. You wrote the code so well, with so much forethought and future-proofing, that months go by and you don't have to touch the system. Your monitoring and your customers are telling you everything is going great, so you stop thinking about it and move on to other things. And inevitably a customer comes to you with a new requirement.

This happened to us the other day. We had a website that had been working great for months. A stateless website that was adaptive enough to handle every upstream service change we threw at it without modification. Then the upstream service we knew would always be there went away, and we needed to change one hard-coded value. So what do you do in that case? First, you remember where you put the code and the runbooks. Then you build it so you can make a simple change and test it. Simple, right?

In this case, not really. Turns out the world had moved on. The runbooks didn't work. Our laptops had upgraded to a new set of tools. The build and deployment systems had changed. A testing dependency wasn't available. But we needed to deploy, so we adapted. We removed the missing dependency. We turned off a bunch of validation tools. We got some bigger hammers and made it work. At the expense of repeatability. Luckily for us we're decommissioning the tool, so hopefully we won't need to ever update it again.

Bottom line: because of our apparent success, we took shortcuts. We never made sure our build system was not just stable and repeatable, but hermetic enough to stand up to the passage of time. We were offline for about 6 hours because our builds relied on external dependencies we had no control over, and we had to work around them when they changed or disappeared.

The lesson? Be hermetic. Own your dependencies. Exercise your systems even if you haven't needed them.
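
One way to act on that lesson, as a rough sketch and not what we actually ran: keep copies of your build dependencies alongside the code, record a hash for each one, and check them on a schedule so you find out about drift before you need to ship. Everything named here is hypothetical, including the `verify_vendored` function, the vendor directory layout, and the idea of a `pinned` mapping of file names to SHA-256 digests.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_vendored(vendor_dir: Path, pinned: dict[str, str]) -> list[str]:
    """Check vendored artifacts against pinned hashes.

    `pinned` maps a file name inside `vendor_dir` to its expected SHA-256
    digest (a hypothetical format chosen for this sketch). Returns a list
    of problems; an empty list means every pinned artifact is present and
    unchanged.
    """
    problems = []
    for name, expected in sorted(pinned.items()):
        artifact = vendor_dir / name
        if not artifact.is_file():
            problems.append(f"missing: {name}")
        elif sha256_of(artifact) != expected:
            problems.append(f"hash mismatch: {name}")
    return problems
```

Run something like this from a scheduled job, followed by a full build and a throwaway deploy, and the system gets exercised even during the months when nobody has a reason to touch it.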