A system that breaks randomly but frequently…and requires very short bursts of work to resolve the issue… will sap the morale of a team far out of proportion to the “size” of any ticket
-- John Cutler
When you overload a database and it falls over, no one is surprised that things stop. When that happens, the team(s) involved dig in, pick it back up, and get things going again. Next, they figure out the series of unfortunate events that caused the outage and do something to make sure it doesn’t happen again, or at least to make sure they get a warning before it does. Then they close out the issue and move on.
But sometimes that doesn’t happen. Instead, what happens is that an alert fires, but by the time someone starts looking the alert clears. Or maybe there aren’t any alerts, but throughput drops. Or someone tries to access a different S3 bucket and gets an error. There’s no outage, but there’s always something not quite right.
In many ways those on-call shifts are worse than ones with an outage. Because you never get a chance to catch your breath. Or think about root causes. Or do something about it.
What’s a team to do in that situation? The simple answer is to never let it get like that. And maybe that’s possible. But probably not. Because when the rubber meets the road you find out that things aren’t the way you expected. It could be because there was a constraint you didn’t see, or because several known constraints were reached at the same time. Or there’s just a new use case with very different characteristics. It doesn’t really matter which of those reasons it is or if it’s something different. The situation will arise, so you need to deal with it.
IME the best way to deal with it is to own it. Acknowledge both the pain you’re feeling and that it needs to be dealt with. Acknowledge it as individuals and as a team. Take the time to understand what’s really going on. Then take the time to fix it right. Sure, it’s going to impact near-term deliverables. If you spend a few days analyzing, experimenting, and delivering a fix, you have slipped your immediate delivery by that many days. Own it. Tell your customers. But also recognize that after you put in the effort you’re going to be in better shape.
Drag will be lower. You’ll move faster. With everything else staying the same you’ll find that pretty soon you’re ahead of where you would have been if you hadn’t taken the time to fix the issue.
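If you want a rough feel for "pretty soon", here is a back-of-the-envelope sketch. It assumes a deliberately simple model that is mine, not a measurement: the recurring issue costs the team a fixed fraction of every day (the drag), the fix takes a fixed number of days during which nothing else ships, and the drag drops to a lower fraction afterwards. The function name, parameters, and numbers are all hypothetical.

```python
def break_even_day(fix_days: float, drag_before: float, drag_after: float) -> float:
    """Return the day on which the 'fix it' path catches up with the 'ignore it' path.

    drag_before / drag_after: fraction of each day lost to the recurring issue
    (hypothetical inputs; estimate them from your own on-call experience).
    """
    if not 0 <= drag_after < drag_before < 1:
        raise ValueError("expected 0 <= drag_after < drag_before < 1")
    # Ignore it: output after t days is t * (1 - drag_before).
    # Fix it:    nothing ships for fix_days, then output accrues at
    #            (1 - drag_after) per day: (t - fix_days) * (1 - drag_after).
    # Setting the two equal and solving for t gives:
    return fix_days * (1 - drag_after) / (drag_before - drag_after)


# Made-up numbers: a 3-day fix that cuts the drag from 25% to 12.5% of each day.
print(break_even_day(fix_days=3, drag_before=0.25, drag_after=0.125))  # 21.0
```

With those made-up numbers the team is back to even after 21 working days and ahead on every day after that. Real drag is lumpier than a constant fraction, so treat the result as a conversation starter, not a forecast.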
And you know what happens if you don’t take care of the issue? You’ve created a precedent. Similar to the broken windows theory. If it looks like things aren't being taken care of they won't be. When the next issue comes up people will look around and see that others haven’t done anything about their issues. Which makes them less likely to do something about the new one. Now you’ve got an even stronger precedent. Which makes it that much more likely that the third issue will be ignored.
Then one day you pick your head up and realize there’s a culture of not worrying about the little things. That’s not a culture anyone wants. Friction builds up. Progress slows. Deadlines are missed. You see things like this on the big display at the airport that's supposed to tell you when your flight is departing.
So instead, build a culture where the little things are important. Where people and morale are important. Where problems are acknowledged, owned, and acted on. And you’ll find that you not only have happier, more fulfilled people and teams, but you’ll get more done as well.