by Leon Rosenshein

Situational Awareness

We all have biases. Some we know about, and some we don't. One of the benefits of being data-driven is that, if you pick your metrics correctly, you can avoid those unconscious biases.

Going back to my aviation roots, that kind of blind spot is why one of the leading causes of aircraft accidents is controlled flight into terrain (CFIT). Basically, it's the pilot not knowing there's a problem until it's too late to do anything about it. In aviation it's most likely to happen when flying in instrument conditions, where you can't see the horizon. In those cases your eyes and inner ear will lie to you, telling you you're flying along straight and level when you're actually spiraling into the ground. That's what we think happened to JFK Jr. off the coast of Martha's Vineyard. It's a mindset problem. It's a "what you know that just ain't so" problem. Kennedy was a relatively new pilot, and as experience shows (and the Dunning-Kruger effect explains), that's one of the most dangerous times for a pilot. He "knew" he was doing fine, so he ignored what his instruments were telling him.

But what does this have to do with software development? The same thing happens to us. We "know" things are true, so we don't take the time to make sure they really are. We make network requests and check the result to make sure things worked. But what about timing? Yes, the response is almost always fast, but depending on the environment it might be 30 seconds or more before the request times out and you finally get a failure. Sure, you'll handle it correctly, but by then it's too late.
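For instance, here's a minimal sketch in Go (not the code from any particular system, and the health endpoint URL is made up) of giving an HTTP client an explicit timeout, so a hung request becomes a fast, visible failure instead of a 30-second surprise:

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

func main() {
	// Without an explicit timeout, a failing request can sit there far
	// longer than you expect before it finally returns an error.
	client := &http.Client{
		Timeout: 2 * time.Second, // fail fast instead of waiting on the default behavior
	}

	resp, err := client.Get("https://example.com/health") // hypothetical endpoint
	if err != nil {
		fmt.Println("request failed (or timed out):", err)
		return
	}
	defer resp.Body.Close()
	fmt.Println("status:", resp.Status)
}
```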

We ran into this with our health checks. We were doing user-level testing of a distributed system: starting a job and making sure we could stop it. Everything was fine, usually. But every once in a while that process would take 10+ minutes. By itself that might or might not have been OK, but because it was a blocking call in a set of serial tests, everything behind it got delayed. Which threw off metrics, which caused other problems.

So we took it as a learning experience and ended up fixing it in two ways. First, we parallelized the tests, which cut down the total time in the happy path. Second, we added a timeout to each test, so if one took too long we'd know, mark it as a failure, and move on. We also did a quick review of other systems and added some other timeouts we hadn't realized we needed.
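As a rough sketch of the shape of that fix (the check names and timeouts here are invented, not our actual test harness), here's what running checks in parallel with a per-check timeout might look like in Go:

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"time"
)

// check is a stand-in for one user-level test; the names are hypothetical.
type check struct {
	name string
	run  func(ctx context.Context) error
}

func runChecks(checks []check, perCheckTimeout time.Duration) {
	var wg sync.WaitGroup
	for _, c := range checks {
		wg.Add(1)
		go func(c check) { // run each check in parallel instead of serially
			defer wg.Done()
			ctx, cancel := context.WithTimeout(context.Background(), perCheckTimeout)
			defer cancel()

			done := make(chan error, 1)
			go func() { done <- c.run(ctx) }()

			select {
			case err := <-done:
				if err != nil {
					fmt.Printf("%s: FAILED: %v\n", c.name, err)
					return
				}
				fmt.Printf("%s: ok\n", c.name)
			case <-ctx.Done():
				// Took too long: mark it as a failure and move on rather
				// than blocking everything behind it.
				fmt.Printf("%s: FAILED: timed out after %s\n", c.name, perCheckTimeout)
			}
		}(c)
	}
	wg.Wait()
}

func main() {
	runChecks([]check{
		{name: "start-stop-job", run: func(ctx context.Context) error {
			return nil // imagine the real start/stop test here
		}},
		{name: "slow-check", run: func(ctx context.Context) error {
			time.Sleep(5 * time.Second) // simulates the occasional long hang
			return nil
		}},
	}, 2*time.Second)
}
```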

So what's your blind spot, and how can you close it?