by Leon Rosenshein

GC At Scale

Many of us have used garbage collected languages like Golang, Java, and C# (or maybe experimented with Rust, which avoids GC entirely). When we ran the HDFS cluster in WBU2 we routinely had heap sizes of 250+ GB, and a GC pass would sometimes reap 50+ GB and take multiple seconds. That caused all sorts of runtime problems for our customers, because the Namenode would go unresponsive and clients would time out. We solved that one by switching garbage collectors and tuning the new one better, as one does. But that's not the biggest or longest GC I've had to deal with.

Back in the Bing Maps days we managed a 330 PB distributed file system that was intimately connected to our distributed processing engine. Every file it tracked, from the smallest temporary file to the biggest output file, was associated with the job that created it. There are lots of benefits to that. Since we also tracked all the inputs and outputs of each task, it was easy to trace the provenance of any given file. So if we found a problem somewhere in the middle of a processing chain, it was easy to fix the problem, find everything that depended on it, re-run those tasks with the same input labels, and get everything back up to date.
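
To make that concrete, here's a minimal sketch in Go of the kind of bookkeeping that makes this possible. The names and types (Task, Catalog, Downstream) are mine, not the real system's; the point is that once every file knows which task wrote it and which tasks read it, finding everything downstream of a bad file is just a graph walk.

```go
package main

import "fmt"

// Hypothetical, simplified bookkeeping: every file knows the task that
// produced it, and every task knows which files it read and wrote.
type Task struct {
	ID      string
	JobID   string
	Inputs  []string // files read by this task
	Outputs []string // files written by this task
}

type Catalog struct {
	producedBy map[string]*Task   // file -> task that wrote it
	readers    map[string][]*Task // file -> tasks that read it
}

func NewCatalog() *Catalog {
	return &Catalog{
		producedBy: map[string]*Task{},
		readers:    map[string][]*Task{},
	}
}

func (c *Catalog) Record(t *Task) {
	for _, out := range t.Outputs {
		c.producedBy[out] = t
	}
	for _, in := range t.Inputs {
		c.readers[in] = append(c.readers[in], t)
	}
}

// Downstream returns every task that (transitively) consumed a file,
// i.e. everything that needs to be re-run if that file was bad.
func (c *Catalog) Downstream(file string) []*Task {
	var affected []*Task
	seen := map[string]bool{}
	queue := []string{file}
	for len(queue) > 0 {
		f := queue[0]
		queue = queue[1:]
		for _, t := range c.readers[f] {
			if seen[t.ID] {
				continue
			}
			seen[t.ID] = true
			affected = append(affected, t)
			queue = append(queue, t.Outputs...)
		}
	}
	return affected
}

func main() {
	c := NewCatalog()
	c.Record(&Task{ID: "ingest", JobID: "import_001", Outputs: []string{"raw.img"}})
	c.Record(&Task{ID: "tile", JobID: "job_42", Inputs: []string{"raw.img"}, Outputs: []string{"tiles.dat"}})
	c.Record(&Task{ID: "render", JobID: "job_43", Inputs: []string{"tiles.dat"}, Outputs: []string{"map.png"}})

	for _, t := range c.Downstream("raw.img") {
		fmt.Println("re-run:", t.ID)
	}
}
```

That's roughly why re-processing after a fix was cheap: everything downstream of a bad file was a traversal of metadata we were already keeping.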

Now disk space, just like memory, is a precious resource. And we were always running out of it. One of the features we had was the ability to mark every file generated by processing as task scope or job scope. Task scope files went away when the overall job was marked as done or cancelled. Job scope files, on the other hand, only went away when the job was cancelled. And by "go away" I mean they were marked as deleted, and the cleaner would come along periodically and delete them out of band. So far, so good.
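
Here's a rough sketch, again in Go with made-up names and states, of how those lifetime rules might look: task scope files become garbage in either terminal state, job scope files only become garbage on cancellation, and nothing is removed immediately, it's just marked for the out-of-band cleaner.

```go
package main

import "fmt"

type Scope int

const (
	TaskScope Scope = iota // reclaimed when the job is done OR cancelled
	JobScope               // reclaimed only when the job is cancelled
)

type JobState int

const (
	Running JobState = iota
	Done
	Cancelled
)

type File struct {
	Name    string
	Scope   Scope
	Deleted bool // marked for the out-of-band cleaner
}

type Job struct {
	Name  string
	State JobState
	Files []*File
}

// markGarbage flags files as deleted according to the scope rules;
// the actual bytes get reclaimed later by a periodic cleaner pass.
func markGarbage(j *Job) {
	for _, f := range j.Files {
		switch f.Scope {
		case TaskScope:
			if j.State == Done || j.State == Cancelled {
				f.Deleted = true
			}
		case JobScope:
			if j.State == Cancelled {
				f.Deleted = true
			}
		}
	}
}

func main() {
	imp := &Job{
		Name: "import_001",
		Files: []*File{
			{Name: "scratch.tmp", Scope: TaskScope},
			{Name: "source_imagery.bin", Scope: JobScope},
		},
	}

	// Cancelling instead of closing: job scope files become garbage too.
	imp.State = Cancelled
	markGarbage(imp)
	for _, f := range imp.Files {
		fmt.Printf("%s deleted=%v\n", f.Name, f.Deleted)
	}
}
```

The important detail is that done and cancelled are both terminal states, but they reclaim very different sets of files.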

And of course, whenever we got close to running out of disk space we'd remind users to close or cancel any old jobs so the space would be reclaimed. And they would, and we would move on. But remember when I said every file was associated with the job that created it? That included the files that were imported as part of bootstrapping the lab. Those files included the original source imagery for pretty much everything we had when the cluster came online, and they were added by a bunch of jobs called import_XXX.

You can probably guess where this is going. We had never marked those import jobs as done, and as part of one of those periodic cleanups, someone cancelled those import jobs instead of closing them. And the system then happily GC'd all of that data. About 30 PB worth of data. Now that's garbage collection at scale :)