by Leon Rosenshein

Cost Of Control

The other day I ran across a presentation which appeared to have a very different take on incident management. It talked about some of my hot topics (hidden complexity, cognitive load, and their costs), so I was looking forward to seeing the recommendations and how I might be able to apply them. But the presentation kind of fizzled out, ending up with some observations of how things were and identifying some issues/bottlenecks, but not taking the next step.

Thinking about it, I realized that what the presentation described was a C3 (command, control, and communication) system, and that the difference between the examples that worked well and the ones that didn't wasn't the process/procedure, it was the way the teams worked together. Not the individuals on the team, not the incident commander (IC) or the person in the corner who came up with the answer alone, but how those individuals saw each other and worked together.

The example of a well-handled incident was the Apollo 13 emergency, and the counterexample was a typical service outage and Zoom. And that took me back to something I learned about early in my Microsoft career: the 4 stages of team development, Forming, Storming, Norming, and Performing. There's no question that how the IC does things will impact the team's performance, but generally the IC isn't the one who does the technical work to solve the problem.

And that's how Gene Kranz brought the Apollo 13 astronauts home safely. Not by solving the problem or rigidly controlling the process and information flow, but by making sure the team had what it needed, sharing information as it became available, providing high-level guidance, and above all, setting the tone. And he was able to do that because he had a high-performing team: people who had worked together before, knew what the others were responsible for and capable of doing, and had earned each other's trust.

Compare that with a typical service outage. You get paged, open an incident, and then you try to figure out which upstream provider had an issue. You get that oncall, then someone adds the routing team and neteng gets involved. Pretty soon you've got a large team of experts who know everything they might need to know about their own system, but nothing about the other people/systems involved. It's a brand new team. And when a team is put together it goes through the forming/storming phases. The team needs to figure out its roles, so you end up with duplicate work and work that falls through the cracks. You get both rigid adherence to process and people quietly trying random things to see what sticks. We've got good people and basic trust, so we get through that quickly and solve the problem. Then we tear the team down, and the next time there's an incident the cycle starts again.

So what can we do about it? One thing that Core Business is doing is the Ring0 idea: a group of people who have worked together forms the core of incident response, so the forming/storming is minimized. Another thing you can do is get to know the teams responsible for the tools/functionality you rely on. They don't need to be your best friends, but having context on each other really helps. Finally, be aware of the stages of team development. Whether you're the IC or just a member of the team, understanding how others are feeling and how things look to them helps you communicate. Because when you get down to it, it's not command and control that will solve the problem, but communication.

And if you're interested in talking about the 4 stages of team development, let me know. There's a lot more to talk about there.