by Leon Rosenshein

Error Types

Statistics is all about the null hypothesis. You assume it’s true and then look for evidence that it’s false. Consider a fire alarm. The null hypothesis is that there’s no fire. If the alarm isn’t ringing you assume there’s no fire. If it is ringing you assume there is a fire. The state of the alarm is a simple binary. It’s either ringing or it’s not. And either there is a fire or there’s not. So you have the following truth table:

                 Alarm
           Ringing   Not Ringing
         |---------|-------------|
 No Fire | Type I  |   CORRECT   |
         |---------|-------------|
    Fire | CORRECT |   Type II   |
         |---------|-------------|

Simple and clean. Two correct states and two error cases: the Type I error, or false positive, and the Type II error, or false negative.
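The truth table above can be written as a tiny function. This is just an illustrative sketch; the name `classify` and its parameters are made up for this example.

```python
def classify(alarm_ringing: bool, fire: bool) -> str:
    """Map one (alarm, reality) pair onto a cell of the truth table."""
    if alarm_ringing and not fire:
        return "Type I"   # false positive: alarm with no fire
    if not alarm_ringing and fire:
        return "Type II"  # false negative: fire with no alarm
    return "CORRECT"      # alarm state matches reality


if __name__ == "__main__":
    # Walk all four cells of the table.
    for ringing in (True, False):
        for fire in (False, True):
            print(f"ringing={ringing} fire={fire} -> {classify(ringing, fire)}")
```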

As developers we need to deal with this kind of problem all the time. One of the more common cases is alerting, or error detection. If you have a perfect signal for an error case then you can always do the right thing. If your service is not running, that’s an error. Simple. But what if you’re not getting the signal that your service is running? What type of error is that? Does that mean your service isn’t running or that the signal is blocked? What do you do in that case?

Well, it depends. Mostly it depends on the various costs of being wrong and benefits of being right. For monitoring the datacenter most of our alerts will fire if we don’t get the signal. If the DC is actually fine, that’s a Type I error, and we accept that risk for a few reasons. First, even if the DC is OK, the fact that we’re not getting a signal is a problem, and the cost of DC outages is high. Even if it’s not something in our control (e.g. the fiber-seeking backhoe strikes again), we still want to know so we can do something about it. Second, the actual cost of the alert is pretty low. Just a phone call, albeit potentially in the middle of the night. Actually, the cost for a single event is low.

The problem is that this is a distributed system. There are latencies. There are networks. There are many reasons why we might not get a datum on time, and if we fired the alert every time that happened we’d quickly succumb to alert fatigue and start ignoring them. That’s not an error, but it is a real problem.

So to avoid that we build some latency into the system. The signal needs to be bad for some time before we fire. The longer the time, the less likely we are to have a Type I error. Unfortunately, the longer the time, the more likely we are to have a Type II error, a false negative. And the cost of those is high: hundreds of people and thousands of tasks failing. We really don’t want that, so it’s a balance.
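Here’s a minimal sketch of that idea, assuming health samples arrive with timestamps in seconds. The class name `DebouncedAlert` and the `grace_seconds` parameter are hypothetical and not taken from any real monitoring system. A longer grace period suppresses more spurious alerts, but it also delays detection of a real outage.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DebouncedAlert:
    """Fire only after the signal has been bad for grace_seconds in a row."""

    grace_seconds: float               # hypothetical tuning knob
    bad_since: Optional[float] = None  # when the current bad streak began

    def observe(self, now: float, healthy: bool) -> bool:
        """Record one sample; return True if the alert should fire."""
        if healthy:
            # Any good sample resets the streak.
            self.bad_since = None
            return False
        if self.bad_since is None:
            self.bad_since = now
        return now - self.bad_since >= self.grace_seconds


if __name__ == "__main__":
    alert = DebouncedAlert(grace_seconds=60)
    samples = [(0, False), (30, False), (60, False), (70, True), (80, False)]
    for now, healthy in samples:
        print(now, healthy, alert.observe(now, healthy))
```

With a 60 second grace period, a blip that recovers within a minute never pages anyone, but a real outage also goes unreported for that first minute.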

Your situation might be different. For mission-critical safety decisions you might choose to eliminate Type II errors at the price of more Type I errors. It’s not comfortable, but a spurious hard braking event is better than no brakes and hitting something. Other things might be OK to just ignore.

And that completely skips the Type III (the right answer to the wrong problem) and Type IV (the right answer for the wrong reason) errors, but those are topics for another time.