by Leon Rosenshein


Things go wrong. We make mistakes. We don't always consider all the possibilities. New, different things can happen. The goal is to do better in the future. That's where Root Cause Analysis comes in. And hindsight is 20/20, right, so we're done, right?

First, as I talked about last year, there's a difference between the proximate cause and the root cause. We need to avoid the trap of thinking that the first point at which a person could have done something different to prevent the issue is the root cause. It almost never is, so we need to keep digging.

Second, while our view of the past may be 20/20 (it probably isn't, but that's a different issue), it also tends to suffer from target fixation. As part of mitigating/solving the problem and then preventing it, we (hopefully) wrote down all the steps taken and noted problems along the way. And if we're really on top of things, at the post-incident review we even created tasks on the backlog for the longer-term things that need to happen to prevent a recurrence.

But what we almost never do is document the struggles we had diagnosing the problem and getting to the root cause. We don't document/remember the dead ends we followed, the tools we built (or wish someone else had built), or the side learnings. Even when doing RCA, we follow the happy path through the solution and forget about many of the possible error cases.

And that's bad. For a bunch of reasons. Those dead ends are probably incidents waiting to happen. Yes, this time the problem wasn't bad input from a sensor we weren't properly filtering, but it can still happen. Unless we fix the underlying issue, by capturing it in the backlog and then completing it, in the fullness of time it will happen. And then we'll go through the whole process again.

At which point we'll reach for the same tool we wished we had but didn't, and find that it's still missing. So we'll throw something together. Which will do the trick, but slow us down. Which means that we should have put building that tool on the backlog as well. As a side benefit, I'll bet you'll find that there are other uses for that tool, too.

When you're dealing with an incident and doing RCA you're an explorer. While it's possible that you got lucky and your path to the root cause was simple and straightforward, more likely it wasn't. And each of those bends and twists is a part of the map that currently says "terra incognita", but you've got an opportunity to update it and make the blank spot smaller. Next time someone reaches that spot and has a working map they'll be happy you did. And that person might even be you.

So remember. Even when dealing with errors there's more to life than the happy path, and there are lots of learnings to be had if you keep your focus broad enough to see them.