Monitoring vs. Observability
Pop quiz. What’s the difference between monitoring and observability? First of all, monitoring is a verb. It’s something you do. You monitor a system. Observability, on the other hand, is a noun. It’s one of the -ilities. Observability describes a capability of a system.
Monitoring is collecting logs and metrics: throughput, latency, failure rates, and system usage. Monitoring is going over logs and looking for issues, patterns, and keywords, among other things. More advanced monitoring includes some way of tracing a request through the system, whether it’s a distributed system or not. And in many cases, you compare the results to a threshold and fire some kind of alert if that threshold is crossed.
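To make the threshold-and-alert part concrete, here is a minimal Python sketch. The function names, the 500 ms threshold, and the nearest-rank percentile method are all assumptions chosen for illustration, not the behavior of any particular monitoring tool.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def check_latency(samples_ms, threshold_ms=500.0):
    """Compare p95 latency to a threshold (hypothetical default) and report whether to alert."""
    p95 = percentile(samples_ms, 95)
    return {
        "metric": "p95_latency_ms",
        "value": p95,
        "threshold": threshold_ms,
        "alert": p95 > threshold_ms,
    }


if __name__ == "__main__":
    # 18 fast requests and 2 slow ones: enough slow traffic to move the p95.
    print(check_latency([100] * 18 + [900] * 2))
```

Note what the result contains: which metric crossed which threshold, and nothing about why.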
Observability is the ability to know and understand the system’s current state by looking at the metrics. Not just the explicit bits of state you get from the logs and metrics, but the overall system state. Not just whether some threshold of a particular metric has been crossed, like how much RAM is being used, but why that amount of RAM is being used.
In other words, good observability includes monitoring, but just having monitoring doesn’t mean you have good observability.
In a system without good observability, if you’ve got a good understanding of your system, you can look at the metrics and deduce whether there’s a problem. You can look at the logs, work back from the error, and identify the log message that shows where the problem started. In that case you might be able to use metrics to find the root cause of why a threshold was crossed (and an alert fired that woke you up at 2 o’clock in the morning).
If you don’t have that understanding then what? You start reading code. And tracing it. Looking for code paths that might have caused the issue. You might search for where log messages are being generated and try to understand why. You’re trying to build understanding of the system and rebuild the internal state of the system at the same time. Meanwhile the fire rages.
If your system has good observability then you don’t need that deep understanding. The metrics (and logs and traces) point you to where the problem is, not where the symptom surfaced. So when your p95 response latency crosses your threshold you know not only what happened, but what caused it.
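One way to picture this: if every request is recorded as an event that carries its own context, you can group the slow requests by that context and see where they concentrate. The sketch below is a minimal Python illustration. The field names (`duration_ms`, `endpoint`, `db_shard`) and the in-memory event list are hypothetical, not the schema of any real telemetry system.

```python
from collections import Counter


def slow_request_breakdown(events, threshold_ms, fields):
    """Count slow requests per value of each context field, most common first."""
    slow = [e for e in events if e["duration_ms"] > threshold_ms]
    return {
        field: Counter(e.get(field, "unknown") for e in slow).most_common()
        for field in fields
    }


# Hypothetical request events; real ones would come from your telemetry pipeline.
events = [
    {"endpoint": "/search", "db_shard": "shard-1", "duration_ms": 80},
    {"endpoint": "/search", "db_shard": "shard-3", "duration_ms": 950},
    {"endpoint": "/cart", "db_shard": "shard-3", "duration_ms": 1200},
    {"endpoint": "/cart", "db_shard": "shard-2", "duration_ms": 60},
    {"endpoint": "/login", "db_shard": "shard-3", "duration_ms": 1100},
]

breakdown = slow_request_breakdown(events, 500, ["endpoint", "db_shard"])
```

In this made-up data the slow requests are spread evenly across endpoints but all land on one shard, which is the difference between knowing the symptom (latency is up) and knowing where to look.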
And you usually want that ability. Because at first, before you fully understand a system, you need help figuring out why something went wrong. Later, you understand the system. You’ve put in mechanisms to deal with issues you’ve learned about. You’ve got good metrics to tell you about that state just in case something does happen. So what happens next? Instead of the problem you understand, something you don’t understand happens. Something where your deep knowledge of the system isn’t deep enough. And you’re right back to the not-understanding state.
Which is why good observability is so important.