by Leon Rosenshein

System Goals

As CEO and Co-Founder of Honeycomb, Charity Majors has a lot to say about observability. She also has a lot to say about software development in general. And software leadership. Over the years I’ve learned a lot from her stuff.

Recently she was on the How AI Is Built channel. The episode is a pretty good overview of how to do observability and why it’s important. On top of that, the title of the episode really spoke to me. It describes one of those non-functional requirements, even if it’s not strictly worded as an -ility.

Build systems that can be debugged at 4am by tired humans with no context

That right there. That’s the goal. Make sure the maintainer has life as easy as possible.

[Image: Developer working late at night thinking about code]

Of course, we want to write perfect code. Code that always works. That handles upstream and downstream issues. That scales seamlessly. That never fails and doesn’t need maintenance. But the reality is that code can fail. You will need to maintain it. Especially when the environment it’s running in changes. Different inputs. Different scale. Different latency requirements.

And of course, when it does fail, it won’t be at 10:00 in the morning while the original author is standing there expecting it. Instead, it will be early on a weekend morning, and you, as the on-call engineer, need to notice there’s a problem, figure out what’s happening at that moment, come up with a remediation, and implement it. All while you’re barely awake and working on code you’ve only seen in passing.

That’s where runbooks help. They tell you where to start looking and how to deal with similar issues. That’s when observability shines. It tells you what’s really happening. Not what you expect to be happening or what you think is happening, but the actual ground truth. Current comments and design docs are critical as well. They tell you what the original (and hopefully current) intent and constraints were. And that’s when you need accurate logs and databases to show you how you got to where you are.
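To make the "ground truth" idea a little more concrete, here is a minimal sketch of what context-rich logging might look like, using only Python’s standard library. It is an illustration, not anything from the episode: the `orders` logger name, the `charge_order` function, and its fields are all hypothetical. The point is that each unit of work emits one structured event carrying its inputs, outcome, and duration, so a tired human can query for what happened instead of guessing.

```python
import json
import logging
import time


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object on one line."""

    def format(self, record):
        event = {
            "ts": self.formatTime(record),
            "level": record.levelname,
            "msg": record.getMessage(),
        }
        # Fields passed as extra={"ctx": {...}} show up on the record as `ctx`.
        event.update(getattr(record, "ctx", {}))
        return json.dumps(event)


def build_logger(stream=None):
    """Create a logger that writes JSON events to `stream` (stderr by default)."""
    logger = logging.getLogger("orders")  # hypothetical service name
    logger.setLevel(logging.INFO)
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.propagate = False
    return logger


def charge_order(logger, order_id, amount_cents, upstream_call):
    """Hypothetical unit of work that logs one event with full context."""
    ctx = {"order_id": order_id, "amount_cents": amount_cents}
    start = time.monotonic()
    try:
        result = upstream_call(order_id, amount_cents)
    except Exception as exc:
        # Record what failed and why, then let the caller decide what to do.
        ctx.update(outcome="error", error_type=type(exc).__name__, error=str(exc))
        ctx["duration_ms"] = round((time.monotonic() - start) * 1000, 1)
        logger.error("charge_order", extra={"ctx": ctx})
        raise
    ctx.update(outcome="ok")
    ctx["duration_ms"] = round((time.monotonic() - start) * 1000, 1)
    logger.info("charge_order", extra={"ctx": ctx})
    return result
```

With something like this in place, the 4 AM question "which orders are failing, and how?" becomes a filter on `outcome`, `error_type`, and `order_id` rather than a search through free-form text.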

At 4 AM you have no context. You have no understanding. Before you can do anything, you need to build the context to understand the situation. Then, and only then, can you make a considered change to remediate. Sure, you can just start changing things and seeing what happens, but when you do that you’re more likely to make things worse than to fix them. Remember, slow is smooth and smooth is fast.

And that’s Coding for the Maintainer.