HDFS is great for storing large amounts of data reliably, and we had the largest HDFS cluster at Uber, one of the largest in the world, in WBU2: 100 PB, 1,000 nodes, 25,000 disks. HDFS manages all of those resources, and it does so in a reasonably performant manner. It does it by keeping track of every directory, file, and file block. You can think of it like the master file table on a single disk drive. It knows the name of each file, where it sits in the directory hierarchy, and its metadata (owner, perms, timestamps, the blocks that make it up, their locations in the cluster, etc.). The datanodes, on the other hand, just keep track of the state and correctness of each block they hold. For performance reasons the datanodes keep track of what they've sent to the namenode and what the namenode has acknowledged seeing, and from there they only send deltas. Much less information, so less traffic on the wire and fewer changes to process. Goodness all around. And to keep the datanodes from all hitting the namenode at once, the updates are splayed out over time.
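That splay is tunable in HDFS itself. As a point of reference, these are the hdfs-site.xml knobs that control block-report cadence (the values shown are illustrative, not our production settings):

```xml
<!-- hdfs-site.xml: block-report tuning (illustrative values) -->
<property>
  <!-- How often each datanode sends a full block report, in ms (default 6h). -->
  <name>dfs.blockreport.intervalMsec</name>
  <value>21600000</value>
</property>
<property>
  <!-- Max random delay, in seconds, before a datanode's first block report.
       This is what splays the post-startup herd out over time. -->
  <name>dfs.blockreport.initialDelay</name>
  <value>120</value>
</property>
```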
The problem we had was that our users were creating lots of little files, and lots of very deep directory structures with those tiny files sprinkled within them. So the number of directories, files, and blocks grew. Quickly. To well over a billion.
At that point startup became a problem. We had a 1,000-node thundering herd sending full updates to a single namenode. 1,000 nodes splayed out over 20 minutes gives you just over 1 second per node. With 100 TB of disk per machine, transmitting and processing that data took well over a second. So the system backed up, connections timed out, and the namenode forgot about the datanodes, or the datanodes forgot about the namenode, and the cycle started all over again. So we had to do something.
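The per-node budget is just arithmetic; the shell here is only a calculator:

```shell
# Back-of-envelope: splaying 1,000 full reports over a 20-minute window
# leaves a bit over a second of namenode attention per datanode.
nodes=1000
window_secs=$((20 * 60))        # 1200 seconds
awk -v w="$window_secs" -v n="$nodes" \
    'BEGIN { printf "%.1f seconds per node\n", w / n }'
```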
We could have turned everything off and then turned the datanodes back on slowly, but hardware really doesn't like that. Power-cycling the nodes in a DC is a really good way to break things, so we couldn't do that. And we didn't want to build and maintain a multi-host distributed state machine. We wanted a solution that ran in one place, where we could watch and control it.
So what runs on one machine and lets you control what kind of network traffic gets in? IPTables, of course. Now, IPTables is a powerful and dangerous tool, and this was right around the time someone misconfigured IPTables at Core Business and took an entire data center offline, requiring re-imaging to bring it back, so we wanted to be careful. Luckily we knew which hosts would be contacting us and which port they'd be using, so we could write very specific rules.
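The rules looked roughly like this. This is an illustrative sketch, not our exact ruleset: the address is made up, 8020 is simply a common namenode RPC port, and IPT defaults to echo so the snippet prints the rules instead of applying them (set IPT=iptables and run as root to apply for real):

```shell
# Illustrative only: made-up address; 8020 is a common namenode RPC port.
# IPT defaults to a dry run that prints the rules instead of applying them.
IPT="${IPT:-echo iptables}"

# 1. Default-deny: drop all inbound connections to the namenode RPC port.
$IPT -A INPUT -p tcp --dport 8020 -j DROP

# 2. Then punch holes ahead of the DROP, one datanode at a time.
$IPT -I INPUT -p tcp -s 10.0.0.17 --dport 8020 -j ACCEPT
```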
So that's what we did. Before we brought up the namenode we applied a bunch of IPTables rules. Then we watched the metrics and slowly let the datanodes talk. That got things working, but it still took 6+ hours and lots of attention. So we took it to the next level: instead of doing things manually, we wrote a bash script that applied the initial rules, monitored the live metrics until the namenode was feeling OK, then released another set of nodes. That got us down to ~2.5 hours, fully automated. We still worried during restarts, since we were either down or running with no redundancy for that window, but it was reliable and predictable.
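The automation amounted to a batching loop. Here's a minimal sketch under stated assumptions: it is not our production script, the batch size and port are made up, namenode_is_healthy is a stub (ours polled live namenode metrics), and IPT again defaults to echo for a dry run:

```shell
#!/usr/bin/env bash
# Minimal sketch, not our production script. Assumes a default-DROP rule on
# the namenode RPC port (8020 here) is already in place, so "releasing" a
# datanode means inserting an ACCEPT rule ahead of that DROP.
IPT="${IPT:-echo iptables}"       # dry run by default
BATCH_SIZE="${BATCH_SIZE:-50}"

namenode_is_healthy() {
  # Placeholder: in production this polled live namenode metrics (RPC queue
  # depth, processing backlog) and succeeded only once they looked sane.
  return 0
}

release_batch() {
  for dn in "$@"; do
    # Punch a hole for this datanode ahead of the default DROP rule.
    $IPT -I INPUT -p tcp -s "$dn" --dport 8020 -j ACCEPT
  done
  # Don't release more nodes until the namenode has digested this batch.
  until namenode_is_healthy; do sleep 30; done
}

release_datanodes() {             # args: datanode addresses
  batch=""
  count=0
  for dn in "$@"; do
    batch="$batch $dn"
    count=$((count + 1))
    if [ "$count" -eq "$BATCH_SIZE" ]; then
      release_batch $batch
      batch=""
      count=0
    fi
  done
  if [ "$count" -gt 0 ]; then
    release_batch $batch
  fi
}
```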
And that's how we used IPTables to calm the thundering herd.