by Leon Rosenshein

You Broke What?

They say with motorcycles there are riders that have laid their bike down at least once, and there are riders that will lay there bike down in the future. That there are no other kinds of riders. I’m not a motorcycle rider, so I can’t comment on the accuracy of that statement or not.

I am, however, a software developer, and I can say that there are 10 kinds of developers. Those that understand binary and those that don’t. And there are 2 other kinds of developers. Those that have broken production software at least once, and those that will.

Web comic from workchronicles.com. One developer worrying about being fired for breaking production and another telling him not to worry. If developers were fired for breaking production everyone would be fired.

There are two important things to take away from that. First, you will break production at some point. Don’t let that paralyze you. Second, don’t just YOLO it over the wall and walk away. Pay attention to what happens. Make sure things are doing what you expect, and not doing what you don’t expect. And be prepared to fix things.

Actually, there are a lot of things you should take from that besides those two important things. And the biggest of those revolve around 2 area. Preventing issues and recovering from issues.

You want to do everything you can to prevent issues from happening. The things you’ve seen happen. The things you’ve heard about happening. The things you can think of going wrong. The things others can think of going wrong.

Things like running unit and acceptance tests first. If possible, shadowing production. Then sending a percentage of production work to the new version. And things like making sure the sequencing is in the right order. You add things before you reference them. You stop referencing things before you delete them. Making sure everything is done before flipping to new versions. Those sorts of things. And don’t forget Hyrum’s Law. Somewhere, someone is using your system in ways you didn’t expect, and probably relying on something you didn’t realize was happening. When you find it you’ll need to deal with it.

Bottom line? You don’t want to keep getting bitten by the same kinds of problems when you can find and prevent them.

The other thing to take away is that you need to be prepared to deal with problems. The simplest one to be prepared for is the roll-back. Something bad happened? Putting it back the way it was should be a fast, one-click, operation. If you can’t do that, work towards it. Because it will happen. And don’t forget to periodically test the roll-back. There are much better times than when you’re under pressure to find out that you have a write-only backup.

And after the fact, don’t forget the blameless review. You might call it a post-mortem, an incident review, lessons learned, or something else entirely. Amazon calls them COE’s. Some parts of Microsoft call them Root Cause Analysis. Regardless of what you call them, you should do them. And pay attention to the results. Check out what the Pragmatic Engineer has to say about them.

How careful should you be? It Depends. It depends on how bad things can go and how fast you can recover. So think about what you’re doing. Think about how you can recover. Then make the change. And eventually, you’ll join the club of those who have broken production. I look forward to welcoming you.