by Leon Rosenshein

Current State

State is important. And state is hard. It's one of the two hard things in computer science. Even something as simple as checking if a file exists or the value of a variable as a test to do/not do something. Between the time you check and the time you act the state could change. Unless you do some kind of mutual exclusion to prevent it. a mutex if you will. Having multiple cores and multiple CPUs in a system running multiple simultaneous processes make it even harder, but with global locking and integrated test and set routines in the silicon it's doable. Painful and hard on performance, doable.

Move to an even more distributed system, say a bunch of computers connected over a mostly reliable network, and it gets even harder. And for the times that you distributed synchronized state there are things like Paxos, Raft, and Zookeeper. But that just provides you with a distributed/available source of truth. Using it is still hard, and can have scaling issues.

One important thing that can help here is knowing how "correct" and "current" your view of the state needs to be. Sure, you can ask the system for everything each time you need to know, but as the size of the state grows and the number of things that need the state increases simple data transfer will become the choke point.

The traditional way around that is a cache. That's great for data with good persistence. And it matches the usual mental model of pulling data when you need it. In the right situation it works really well. That's all a CDN is really, and the internet lives on them.

Another way is to turn the system around and switch from pull to push. Instead of having the user ask for state when needed, it maintains its own view and receives deltas from the source. If your rate of data change is small (relative to the size of the state) AND the changes don't happen at predictable intervals (time-to-live is random) then having the source of truth tell you about them might make sense. It's more complicated, requires changes on both ends of the system, makes startup different from runtime, and adds the requirement of reconciliation (making sure the local view does in fact match reality), but it can greatly reduce the amount of data flying around your system and make the data more timely.

We've run into this a few times with our kubernetes clusters. As a batch cluster the size (both in nodes and tasks) is pretty dynamic. Some things run per-node and try to keep track of what's going on locally (I'm looking at you fluentd) and in steady state it's not too bad, but in rapid scale-ups all of those nodes, asking for lots of data at the same time, can put the system under water. It's partly the thundering herd, which we manage with splaying, but it's also the give me everything every time problem. That we're solving with watchers. So far it looks promising. There will be some other scale issue we'll hit in the future, but that got us past this one.