When we talk about incident management there are lots of things to keep in mind, and knowing which comes first matters. My formative years were in aviation, so to keep the priorities in order I always think of it as aviate, navigate, communicate.
Something bad happened. First and foremost, keep the airplane flying: don’t make things worse or let them get worse. Then find someplace safe to land: figure out how to mitigate the problem. Then call ATC, declare an emergency, and keep them informed: open an incident and update your customers. That’s the order, and in the heat of the moment it’s important to keep your priorities straight.
But that’s only the priority order. Just because something is lower priority doesn’t mean it’s less important. In fact, over the long term the last part, communication, can be more important than the others, because properly communicating the experience and following through on the learnings can keep the problem from happening again.
Consider this. One of the things that often happens during an incident is that a quick fix is put in place to alleviate the symptoms. You find that suddenly your service is taking twice as much memory and crashing. So as a mitigation you allocate 2x the memory and things are working again. You’re done, right? Time to head home and celebrate.
It might be time to celebrate, but you’re not done. That quick fix is a pressure cooker, quietly sitting there until the next time it explodes and takes out your service again. You might even be able to keep turning up the allocation a few times, but you haven’t solved the problem. It will come back and bite you later. At some point you’ll need more memory than you can get. What are you going to do then?
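To put rough numbers on how few times you can turn that knob, here’s a minimal sketch. The function name and the figures (8 GiB today, a 64 GiB ceiling, doubling each time) are made up for illustration and don’t come from any real incident; plug in your own.

```python
def bumps_until_exhausted(current_gib: float, ceiling_gib: float, factor: float = 2.0) -> int:
    """Count how many times the allocation can be multiplied by `factor`
    before the next bump would exceed the ceiling."""
    if current_gib <= 0 or factor <= 1:
        raise ValueError("need a positive allocation and a growth factor above 1")
    bumps = 0
    # Keep applying the quick fix while the bigger allocation still fits.
    while current_gib * factor <= ceiling_gib:
        current_gib *= factor
        bumps += 1
    return bumps


# Hypothetical numbers: the service now holds 8 GiB and the largest
# machine you can get has 64 GiB.
print(bumps_until_exhausted(8, 64))  # 3 (8 -> 16 -> 32 -> 64, then nothing left)
```

Three more rounds of the same quick fix and you’re out of room, and that assumes the usage only doubles each time.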
In reality, that quick, brittle fix is an obligation you’ve taken on. An obligation to yourself, your team, and the business. An obligation to take the time to fix things the right way: the forever fix that identifies why the incident happened and keeps it from happening again. Because incidents happen. What’s important is how we deal with them.