by Leon Rosenshein

Sharding Shouldn't Be Hard

When you get down to it, most of what we do is keep track of things. Objects in motion. Objects at rest. Files. Data in files. Where we put the files. Metadata about the data. Who wrote the data. Who read the data. For source control systems (Git, TFS, Perforce, VSS, etc.), that's all they do. And when that's your business, you keep track of a lot of data. Typically that means storing the data in a database.

And when you're starting out, you keep everything in the same database to make joins and queries faster and management easier. As your business grows, you grow your DB. That's what we did when we were making maps for Bing. We used Microsoft SQL Server to track every file, ingested or created, in the system. At its peak we were managing billions of files across 3K machines and 300 PB of spinning disk. All managed by a single SQL Server instance. Just keeping that beast happy was a full-time job for a DBA, a SQL architect, and a couple of folks who worked on the client. We had a dedicated high-speed SAN backing it and 128 CPU cores running things.

When Uber acquired us we were in the process of figuring out how to split things up into something we could scale horizontally, because even without needing to pay SQL Server's per-core license fees, we had pretty much hit the ceiling on vertical scaling. Every once in a while someone would run a bad query, or a query would be larger than expected, and everything would come crashing to a halt because of table locks, memory limits, or just a lack of CPU cycles. We knew we had to do something.

And it wasn't going to be easy. We had hundreds of cars out collecting data, dozens of folks doing the manual part of map building, and of course those 3K nodes (~100K cores) doing the automated part of map building. We couldn't just stop doing that for a few days or weeks while we migrated, and we didn't have the hardware to set up a parallel system, so we had to migrate in place. So we thought about it. We came up with our desired system. Isolated, scalable, reliable. And we girded our loins and prepared to make the switch.
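The core of that kind of split is choosing a shard key and routing every query through it, so a runaway query can only hurt the one shard it lands on. Here's a minimal sketch of the idea, assuming a simple hash-mod scheme; the shard count and the file-path key are illustrative, not what we actually built:

```python
import hashlib

NUM_SHARDS = 16  # illustrative; real systems pick this with growth in mind

def shard_for(key: str, num_shards: int = NUM_SHARDS) -> int:
    """Deterministically map a key to a shard.

    Uses a stable hash (MD5) instead of Python's built-in hash(),
    which is randomized per process and so unusable for routing.
    """
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards

# Every lookup for the same file goes to the same shard, and a bad
# query can only take table locks or burn CPU on that one shard.
path = "/maps/ingest/run42/tile_1583.dat"
print(shard_for(path))  # some value in [0, NUM_SHARDS)
```

The catch, of course, is that cross-shard joins stop being free, which is exactly why everything started out in one database in the first place.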

Then we got lucky. We got acquired and never had to go through with our plans. Not everyone is that lucky. GitHub ran into many of the same problems, but they had to implement their fix. Read on to see how that worked out, and why there have been so many GitHub incidents recently.