I've been doing the software thing for 30+ years, but I come out of the Mechanical/Aerospace Engineering world. Man-rated, safety-of-flight discussions. Long product cycles. Lots of physical prototypes, bench test models, and iron birds, up to and including destructive testing to figure out exactly where the limits are. Then flying models and test pilots to figure out where the edge of the envelope is.
And still, sometimes things go wrong. Pilots find themselves outside the envelope. Often they can recover, but not always. What happens next, though, is what I want to talk about. Whether it's the formal processes of the NTSB, articles in AOPA or Aviation Week magazine, or a group of pilots sitting around the FOB, hands waving around and saying "There I was …", taking one out of the luck bucket and putting it in the experience bucket, there's a discussion about what happened, how they got into that situation, how they got out, and how to avoid it next time.
Similarly, the medical field has the Morbidity and Mortality meeting. The goals here are the same. Something bad happened. What can we learn from it? What can we do so that things go better next time, or, if possible, never happen again?
In the software world it's the blameless post-incident review (PIR). Sub-optimal things happen. We see build breaks, service outages, downed websites, late releases, or anything else that didn't go as well as planned. We have runbooks for dealing with them while they're occurring (your team does have them, doesn't it?) and incident management SOPs for keeping people informed during the incident. And right there at the end of the incident management SOP is the PIR.
The PIR is at least as important as resolving the incident. Do it correctly and we can avoid not only that specific issue, but other similar issues. Whether you call it 5 Whys, Root Cause Analysis, or Continuous Learning, it's critical to understand not just what happened, but why it happened. What part of the system design/implementation allowed it to happen? Only by understanding why can you do something to keep it from happening again, not just in that specific case, but in cases just like it. Think of it like EMPOC for incident prevention.
There are a lot of things that go into a good blameless PIR, but I think the most important are:
- Make it blameless. Talk about what was done and what happened, not who did what.
- Document what you're going to do (and when) to keep it from happening again.
- Have an accurate timeline. You need to know what was cause and what was effect.
- Know how you mitigated and then resolved the issue.
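
If it helps to make that list concrete, here's a rough sketch of a PIR record in Python. None of this comes from a real tool or template: the class names, field names, and the `missing_pieces` check are all made up for illustration, and your own PIR format will likely carry more than this. Note that the timeline entries record what happened, with no field for who did it.

```python
from dataclasses import dataclass, field
from datetime import date, datetime


@dataclass
class TimelineEntry:
    """One event in the incident. Hypothetical structure."""
    at: datetime
    what_happened: str  # describe the action or event, not the person


@dataclass
class ActionItem:
    """A follow-up to keep the incident from recurring."""
    description: str
    due: date  # the "and when" part of the commitment


@dataclass
class PostIncidentReview:
    """Minimal, illustrative PIR record; all names are assumptions."""
    summary: str
    timeline: list[TimelineEntry] = field(default_factory=list)
    mitigation: str = ""  # what stopped the immediate impact
    resolution: str = ""  # what actually fixed the underlying problem
    action_items: list[ActionItem] = field(default_factory=list)

    def ordered_timeline(self) -> list[TimelineEntry]:
        # Sort by timestamp so cause can be told apart from effect.
        return sorted(self.timeline, key=lambda entry: entry.at)

    def missing_pieces(self) -> list[str]:
        # Name whichever of the listed items has not been filled in yet.
        gaps = []
        if not self.timeline:
            gaps.append("timeline")
        if not self.mitigation:
            gaps.append("mitigation")
        if not self.resolution:
            gaps.append("resolution")
        if not self.action_items:
            gaps.append("action items")
        return gaps
```

A review isn't done until `missing_pieces()` comes back empty, which is a cheap way to keep the follow-up commitments from quietly falling off the end of the meeting.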