by Leon Rosenshein


One of the things I was taught about driving, particularly night driving, way back when, was that you should never drive faster than you can see. Taken literally, that either makes no sense or says never drive faster than the speed of light. But the intent makes sense. It means you should always have time to avoid any obstacle that suddenly appears, or at least be able to stop before you hit it.

Simple enough, but there’s more to it than making sure you can identify something further away than your braking distance. Especially if you define braking distance as how far you go after the brakes are fully applied, on a track, under ideal conditions. In the real world it depends on things like road conditions, brake temperature, gross vehicle weight, lag in the system, and tire condition. And then there’s the person in the loop. How long does it take you to notice that something is in the way? Then decide that you need to stop? Then actually mash the brakes? Depending on your speed, covering that reaction distance can take longer than actually stopping.
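The reaction-time point is easy to check with a back-of-the-envelope calculation. This sketch uses the standard physics approximation (braking distance ≈ v²/2μg) with illustrative numbers for reaction time and tire friction, not measured values:

```python
def stopping_distance(speed_mps, reaction_s=1.5, friction=0.7, g=9.8):
    """Split total stopping distance into its two parts:
    the distance covered while the driver reacts, and the
    distance covered while actually braking (v^2 / (2*mu*g)).
    reaction_s and friction are illustrative assumptions."""
    reaction_distance = speed_mps * reaction_s
    braking_distance = speed_mps ** 2 / (2 * friction * g)
    return reaction_distance, braking_distance

# At ~54 km/h the human in the loop covers more ground than the brakes do;
# at ~108 km/h the braking part dominates again.
city_react, city_brake = stopping_distance(15)
highway_react, highway_brake = stopping_distance(30)
```

At 15 m/s the reaction distance (about 22 m) exceeds the braking distance (about 16 m), which is exactly the "noticing can take longer than stopping" effect.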

The same kind of thing applies to automatic scaling. You want to scale up before you run out of whatever you’re scaling so your customers don’t notice, but you don’t want to scale up too much or too soon and pay for unused resources just because you might need them. The same goes for scaling down. You don’t want to hold on to resources too long, but you also don’t want to give them up only to need them again 2 seconds later. Which means your auto-scalers have to be Oracles.
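The usual way to approximate that oracle is hysteresis plus a cooldown: scale up at a lower utilization than you’d naively think, scale down only well below that, and refuse to change anything again for a while. Here’s a minimal sketch; the thresholds and cooldown are hypothetical illustrative values, not anyone’s production config:

```python
import time

class Autoscaler:
    """Scale-up-early / scale-down-late hysteresis with a cooldown.
    All three parameters are illustrative assumptions."""

    def __init__(self, scale_up_at=0.75, scale_down_at=0.40, cooldown_s=300):
        self.scale_up_at = scale_up_at      # act before you actually run out
        self.scale_down_at = scale_down_at  # well below the up threshold
        self.cooldown_s = cooldown_s        # don't flap on a 2-second dip
        self.last_change = 0.0

    def decide(self, utilization, now=None):
        """Return +1 (add capacity), -1 (release capacity), or 0 (hold)."""
        now = time.monotonic() if now is None else now
        if now - self.last_change < self.cooldown_s:
            return 0                        # still in cooldown, do nothing
        if utilization >= self.scale_up_at:
            self.last_change = now
            return +1
        if utilization <= self.scale_down_at:
            self.last_change = now
            return -1
        return 0
```

The gap between the two thresholds is the whole point: a single threshold oscillates around its own setpoint, which is the "need them again 2 seconds later" failure mode.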

Sure, you could be greedy and just grab the most you might ever need. That might work for you, but it’s pretty hard on everyone else, and in any kind of multi-tenant system it stops working entirely if everyone is greedy all the time.

One place we see this a lot is with AWS and scaling processing. Not only do we need to predict future load, but we also need to account for the fact that scaling up the system can take 10 minutes or more. Which means if we wait too long to scale up, our customers ping us and ask why their jobs aren’t making progress, and if we don’t scale down, we get asked why utilization is so low and whether we can do more with less. To make matters worse, because we run multiple clusters in multiple accounts in someone else’s multi-tenant system, we’ve seen situations where one cluster has cannibalized another for resources. There are ways around it, but it’s non-trivial.
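A long provisioning lag means you have to target the load you expect when the capacity arrives, not the load you see now. One simple (and simplistic) way to do that is to extrapolate the recent trend forward by the lag. This is a sketch under assumed parameters; real systems use smarter forecasts, and the 20% headroom here is an arbitrary illustrative buffer:

```python
def capacity_needed(samples, lag_minutes=10, headroom=1.2):
    """Linearly extrapolate load `lag_minutes` ahead so that capacity
    requested now arrives roughly when it's needed.

    samples: list of (minute, load) observations, oldest first.
    lag_minutes and headroom are illustrative assumptions."""
    (t0, y0), (t1, y1) = samples[0], samples[-1]
    slope = (y1 - y0) / (t1 - t0) if t1 != t0 else 0.0
    predicted = y1 + slope * lag_minutes
    # Never target below the load we can already see.
    return max(predicted, y1) * headroom
```

With load growing from 100 to 150 units over the last ten minutes, this targets capacity for 200 units (plus headroom) rather than the 150 currently observed, because that’s where the trend will be when the new instances finally come up.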

Bottom line: prediction is hard, but the benefits are manifold. And the key is understanding what the driving variables are.