Emergency Procedures
The other day I ran into a quote on the internet about the problem with emergency procedures. I generally agree with it. The quote went like this:
If you wouldn’t use your emergency process to deliver normal changes because it’s too risky, why the hell would you use it in an emergency?
But, as always, It Depends. It’s about the risk/reward ratio. You want the ratio to be low. If the system is down, the risk of breaking the system is low. If it’s down, you can’t crash it.
In general, you want one, and only one, deployment process. You want it to be easy, automated, idempotent, and recoverable. You want it to be well exercised, well documented, well tested, and fail-safe. And in almost all cases, you should use it. All of the checks, validations, and audit trails are there for a reason (see Chesterton’s Fence).
The main goal of any deployment process is to make sure the user experience is not degraded. Or at least only temporarily degraded by a very small, broadly agreed-upon amount. That means making sure that nothing happens by mistake and without an appropriate amount of validation. There can be tests (unit, integration, or system) that need to pass. There can be configuration validators. There can be business, legal, and communications sign-off. All in service of making sure no one has a bad interaction. Actually deploying the new thing is often just a tiny part of the actual process. There’s a high risk of something going wrong, so you need to be careful to keep the overall risk/reward ratio down.
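To make that concrete, here’s a minimal sketch of a normal deployment pipeline as a list of gates, with the actual deploy as one small step at the end. All of the names and gates are hypothetical, not tied to any particular CI/CD tool.

```python
# Hypothetical sketch: a normal deployment is mostly gates; the deploy
# itself is a tiny step that only runs if every gate passes.
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Gate:
    name: str
    check: Callable[[], bool]  # returns True if the gate passes


def run_pipeline(gates: List[Gate], deploy: Callable[[], None]) -> bool:
    """Run every gate in order; only deploy if all of them pass."""
    for gate in gates:
        if not gate.check():
            print(f"blocked at gate: {gate.name}")
            return False
    deploy()
    return True


if __name__ == "__main__":
    normal_gates = [
        Gate("unit tests", lambda: True),
        Gate("integration tests", lambda: True),
        Gate("config validation", lambda: True),
        Gate("business/legal/comms sign-off", lambda: True),
    ]
    run_pipeline(normal_gates, deploy=lambda: print("deploying new version"))
```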
In an emergency situation though, the constraints are different. If the system is down, you can’t crash it. You can’t reduce the throughput rate. You can’t make the experience worse for users¹. The risk of making things worse is low, so the risk/reward ratio is biased lower.
In fact, many things you normally do to make sure you don’t have an outage are unneeded. You don’t need to keep in-flight operations going (because there are no in-flight operations). Instead, you can skip the step of your process that drains running instances. You don’t need to do a phased update to maintain your throughput. When nothing is happening, getting anything running is a step forward. Because nothing is running, you don’t need to do a phased roll-out to check for performance deltas or emergent behavior or edge cases. After all, things can’t get much worse. Those are just a few of the things you don’t have to worry about when you’re trying to mitigate an outage.
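As a sketch of how the two paths might differ (again hypothetical, with made-up host names and steps): both reuse the same deploy step, but the outage-mitigation path skips draining and the phased roll-out, because nothing is in flight and there is no healthy traffic to protect.

```python
# Hypothetical sketch: the same deploy step, reached by two different paths.
from typing import List


def deploy(instances: List[str]) -> None:
    for instance in instances:
        print(f"pushing new version to {instance}")


def normal_upgrade(instances: List[str]) -> None:
    # Protect the current user experience: drain, then roll out in phases.
    for instance in instances:
        print(f"draining in-flight work on {instance}")
    deploy(instances[:1])  # canary first
    print("watching canary for regressions...")
    deploy(instances[1:])  # then the rest


def outage_mitigation(instances: List[str]) -> None:
    # Nothing is running, so there is nothing to drain and no reason to
    # stage the roll-out: getting anything serving again is the win.
    deploy(instances)


if __name__ == "__main__":
    fleet = ["host-1", "host-2", "host-3"]
    outage_mitigation(fleet)
```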

Or to put it more simply, outage recovery is a different operation from a system upgrade. When dealing with an outage, the first step is to mitigate the problem. When doing an upgrade, the most important goal is to have customers/users only see the good changes. There should be no negative changes for any user. Many steps can (and should) be shared between the two processes. But the goals are different, so the process is going to be different.
-
¹ Ok, there are things you can do to make it worse. Like lose data. Or expose personal data. But generally speaking, if your system is down, you can’t make the user experience worse.