I'm not talking about Nyquil or Tylenol PM or [Program|Project|Product] Manager. I'm talking about Post-Mortems, or Post-Incident Reviews as they're starting to be known. Whatever you call them, they're critical to improving the system. And that's the goal. Make the system better. Make it more reliable. More self-healing. To do that you need to figure out two things. What happened, both leading up to and during the incident, and why those things happened. You don't need to figure out where to place blame, and trying to do that makes it less likely that you'll be able to find out the important things.
What and why leading up to the incident. How we got into this situation. The 5 whys. It's so important to know how we got there that it's baked right into the incident response template. But here's the thing. We call it the 5 whys, but it often ends up being the 5 whats.
- The server was overloaded.
- The server was overloaded because there were 3 instances instead of 5.
- There were 3 instances instead of 5 because the autoscaler didn't scale the service up.
- The autoscaler didn't scale up because there was no more capacity.
- There was no more capacity because growth wasn't in the plan.
That's what happened, not why. Why didn't we know the service was getting close to overloading? Why did we get down to 3 instances without knowing about it? Why didn't we know the autoscaler wasn't able to autoscale? Why didn't we know we were close to capacity? Why didn't we anticipate more growth? You need to answer those questions if you want to make sure to fix that class of problem. Each of those "whys" has a solution that would have prevented the problem, or at least let you know before it got that bad.
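To make that concrete, here is a minimal sketch of the kind of checks that would have answered some of those "why didn't we know" questions before the incident. Everything in it is an assumption for illustration: `ServiceSnapshot`, its fields, `early_warnings`, and the default thresholds are hypothetical names and numbers, not part of any real monitoring tool.

```python
from dataclasses import dataclass


@dataclass
class ServiceSnapshot:
    """Hypothetical point-in-time view of a service (all fields are assumptions)."""
    running_instances: int   # instances actually serving traffic
    desired_instances: int   # what the autoscaler is trying to run
    capacity_instances: int  # most instances the platform can hold right now
    load_fraction: float     # 0.0 to 1.0, how busy the running instances are


def early_warnings(snap, load_limit=0.8, headroom_limit=0.2):
    """Return one warning per "why didn't we know?" question that applies.

    The thresholds are illustrative defaults, not recommendations.
    """
    warnings = []

    # Why didn't we know the service was getting close to overloading?
    if snap.load_fraction >= load_limit:
        warnings.append("load: service is close to overloading")

    # Why didn't we know the autoscaler wasn't able to autoscale?
    if snap.running_instances < snap.desired_instances:
        warnings.append("autoscaler: running fewer instances than desired")

    # Why didn't we know we were close to capacity?
    free = snap.capacity_instances - snap.running_instances
    if snap.capacity_instances <= 0 or free / snap.capacity_instances < headroom_limit:
        warnings.append("capacity: little or no headroom left to scale into")

    return warnings


if __name__ == "__main__":
    # The situation from the list above: 3 instances, 5 wanted, no room to grow.
    incident = ServiceSnapshot(running_instances=3, desired_instances=5,
                               capacity_instances=3, load_fraction=0.95)
    for warning in early_warnings(incident):
        print(warning)
```

Each check maps to one of the "why" questions rather than to one of the "what" statements, which is the point: the output tells you something is about to go wrong, not just what went wrong afterwards.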
A good PM goes over what happened during the incident. And that's good. Understanding the steps is important if you want to make an SOP in case something like that happens again. But there's more to it than that. Unless your incident/solution is such a well-worn path that you know exactly what to do and just do it (in which case, why isn't it automated?), chances are there was some uncertainty at the beginning. Those 6 stages of debugging. So instead of just recording/reporting the steps that were taken to mitigate the problem, think about that uncertainty. Why did it happen? Was there an observability gap? A documentation gap? A knowledge gap? Maybe it was permissions, or too many conflicting signals. Ask questions about why there was uncertainty and what could be done to reduce the time spent clearing up that uncertainty.
We naturally gravitate to the "what" questions. They typically have very concrete answers, and it's easy to point to those answers and the process and say that's why there was an incident. And while we're focusing on the what, we tend to ignore the why. However, if you really want to make the system better, to prevent things from happening in the first place, you need to spend less time on what happened and more time on why it happened.