by Leon Rosenshein

Lies, Damn Lies, And Statistics

When you’re running code in production, metrics are important. You need to know that things are working correctly. You need to know when things are starting to go in the wrong direction before any of your users or customers notice. Metrics can tell you that.

The problem with metrics at scale is that, in order to make sense of them, they’re often aggregates. Mean, median, mode, p90, and p95. All ways to turn lots of numbers into a single representative number. Unfortunately, aggregates like that are very lossy compression. The more variable the data, the less useful things like the average are.

Consider Van Gogh’s The Starry Night. According to John Rauser the average color in The Starry Night is #4C5C6D. That’s interesting, but it really doesn’t say much about the painting. Who wants to look at a colored rectangle?

There’s much more information in the painting, and I’d much rather look at it.

Van Gogh's Starry Night

Not only is there a lot of missing information, I don’t think that average color even exists in the painting.

Or consider something like a P99 latency. Not only are there many ways for the P99 value to be wrong, but even if it’s accurate you could still easily be missing the point. Statistics can be tricky things, especially percentiles. Sure, your P99 value means 99% of the values are lower, but it doesn’t tell you anything about why that last percent of items is higher, nor does it tell you how much higher.

Image you have 10,000 web requests be served by 100 servers. 9,900 complete in less that 30ms, so your P99 is 30ms. That’s a pretty good P99. That’s probably within your SLA, so according to the SLA you’re all set. What that P99 doesn’t tell you is that the other 100 take 5 seconds each. Somewhere in there you’ve got a significant problem. The other thing it doesn’t tell you is that all 100 of the slow requests were handled by the same server. Which means you’ve probably got a routing/hardware problem. Just looking at your P99 and you wouldn’t have known there was a problem, let alone that you it’s on the physical side, not the software side.

Which is another way of saying don’t just look at the aggregates, look at your raw data and see what else it’s telling you. It can tell you a lot about how things are really going, if you let it.