by Leon Rosenshein

Disk Errors

I've been working on datacenters and parallel computing for over 10 years now. Back in 2007 we had about 100 PB of spinning disk and our own home-grown distributed file system. Roughly equivalent to HDFS, but of course Windows (NTFS) based, since we were part of Microsoft. We had filesets (roughly equivalent to an HDFS file) composed of individual NTFS files (blocks, in HDFS terms), and instead of a namenode/journalnode we had a monstrous MS-SQL server that kept track of everything. And, like HDFS, we had redundancy (nominally 3 copies), and that usually worked. But sometimes it didn't. And we couldn't figure out why.

Luckily, as part of Microsoft we had access to the folks who actually wrote NTFS and the drivers. So we talked to them. We explained what we were doing and why. Number of hosts. Number of disk drives. Number of files and filesets. Average file size. What our data rates were (read and write). We talked about our pipelines and validation. We listened to best practices. We learned that no one else was pushing Windows Server that hard with that many disks per node. And all through the discussion there was one guy in the back of the room scribbling away furiously.

After about 30 minutes of this, he picks his head up and says: You have at least 2 problems. First, at those data rates, at least 7 times a day you're going to write a zero or a one to the disk and it's not going to stick. Second, in your quest for speed in validation you're validating the disk's on-board write-through cache, not the actual data on the disk.

We'd run into one of the wonders of statistics. All disk writes are probabilistic, but we were pushing so much data that we were pretty much guaranteed to have multiple write errors every day. And then we compounded the problem by doing a bad job of validating. So we fixed the validation problem (not much we could do about the physics) and things got a lot better. After that, over 10 years and scaling up to 330 PB of data, we lost exactly one file to anything other than someone deleting it by mistake. Human error is a story for another day.
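To see the shape of that statistics problem, here's a back-of-envelope sketch. The write rate and per-bit error rate below are illustrative guesses, not our real figures (the real numbers came from the NTFS team's models), but anything in this ballpark hands you a few bad bits every single day:

```python
# Back-of-envelope: expected silent write errors per day.
# Both inputs are illustrative guesses, not real measurements.
aggregate_write_rate = 10e9    # bytes/second across the whole fleet (assumed)
silent_error_rate = 1e-15      # chance a written bit doesn't stick (assumed)

bits_per_day = aggregate_write_rate * 86_400 * 8
expected_bad_bits_per_day = bits_per_day * silent_error_rate

print(f"{expected_bad_bits_per_day:.1f} expected bad bits per day")  # ~6.9
```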
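And the validation fix, in spirit, was to stop trusting anything that might be answered from a cache: checksum on the way in, force the data out of the OS, and verify by reading it back. This isn't our actual code, just a minimal sketch of the idea; really getting past the drive's on-board cache needs platform-specific flags (FILE_FLAG_NO_BUFFERING / FILE_FLAG_WRITE_THROUGH on Windows, O_DIRECT on Linux) or a verification pass that runs long after the write.

```python
import hashlib
import os

def write_and_verify(path: str, data: bytes) -> bool:
    """Write data, push it past the OS buffers, then re-read and compare."""
    expected = hashlib.sha256(data).hexdigest()

    with open(path, "wb") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())  # flush OS buffers; the drive's cache may still hold it

    # Read it back and compare checksums. A delayed or unbuffered read is
    # what actually exercises the platters instead of a cache.
    with open(path, "rb") as f:
        actual = hashlib.sha256(f.read()).hexdigest()

    return actual == expected
```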

But that didn't solve all of our problems. We'd solved the data loss problem, but not the availability problem. That was a whole different problem, and it was caused by us being too smart for our own good. For safety we distributed our files randomly across the cluster. Many of our filesets contained hundreds of files, and we made sure we didn't put them together. That way, when we lost a node we didn't take out all of the replicas of any file. Sounds good so far. But that caused its own problems.
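The placement policy itself was about as simple as it sounds. Something like this sketch (not the real code, which also had to worry about racks, free space, and so on): every file's copies go on distinct nodes chosen at random.

```python
import random

def place_replicas(nodes: list[str], copies: int = 3) -> list[str]:
    # Pick `copies` distinct nodes at random so that losing any one node
    # can never take out more than one replica of a given file.
    return random.sample(nodes, copies)
```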

We had about 2000 nodes. How do you do a rolling reboot of 2000 nodes when you can only take down 2 at a time? You do it 2 at a time, 1000 times. At 10 minutes a reboot that's 10,000 minutes, 166+ hours if everything goes smoothly. 20+ days if you only do it during working hours. That sucks. Also, disks popped like popcorn. Power supplies failed. Motherboards died. On any given day we'd have 10-20 nodes down. You wouldn't think that was a problem, but the way we did things, filesets were the unit of availability, not files, so if even one file was broken then that instance of the fileset was down. And three broken files on three different machines could, and often did, make all three copies of a fileset unavailable.
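It's worth putting rough numbers on just how bad that is. Treating each file's placement as independent (a slight simplification) and assuming a 300-file fileset, 2000 nodes, and 15 of them down on a typical day, a big fileset has all three copies impaired most of the time:

```python
N = 2000    # nodes in the cluster
D = 15      # nodes down on a typical day (the 10-20 range above)
F = 300     # files in a big fileset ("hundreds"; exact number assumed)

# One copy of the fileset scatters its F files across F different nodes,
# so that copy is impaired if *any* of those nodes happens to be down.
p_copy_impaired = 1 - (1 - D / N) ** F        # ~0.90

# Three independently scattered copies, all impaired at the same time.
# (Treats placements as independent, which is a slight simplification.)
p_all_three_impaired = p_copy_impaired ** 3   # ~0.72

print(f"one copy impaired:  {p_copy_impaired:.2f}")
print(f"all three impaired: {p_all_three_impaired:.2f}")
```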

So how do you solve those problems? With intelligent placement. Since we had 3 copies of every file, we stopped spreading things out randomly and made triplets: 3 nodes that looked exactly like each other. You could reboot ⅓ of the nodes at once and do the whole lab in a morning. When a node went down we fixed it, and the other two carried the load. If we ever had 2 nodes in a triplet down, ops knew and jumped all over it. And that's how we solved the availability problem.
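Run the same kind of back-of-envelope math on triplets and the difference is stark. A triplet only goes dark if all three of its mirrored nodes are down at once, so with the same 15 nodes down and a fileset spread across (say) 100 triplets, unavailability drops from "most of the time" to "approximately never":

```python
N = 2000                     # nodes, i.e. roughly 666 mirrored triplets
D = 15                       # nodes down on a typical day
TRIPLETS_PER_FILESET = 100   # assumed spread for a big fileset

# A triplet goes dark only if all three of its nodes are down at once.
# (Again approximate; it treats each node's "down" as independent.)
p_triplet_dark = (D / N) ** 3                                        # ~4.2e-07

p_fileset_impaired = 1 - (1 - p_triplet_dark) ** TRIPLETS_PER_FILESET  # ~4.2e-05

print(f"triplet dark:     {p_triplet_dark:.1e}")
print(f"fileset impaired: {p_fileset_impaired:.1e}")
```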

So what's the lesson here? We ran into issues no one had ever seen, at all sorts of levels, from the firmware on the hard drive to the macro level of how you manage a fleet of that size. Could we have foreseen them? Possibly, but we didn't. We had to recognize them as they occurred and deal with them. Because scale is a problem all its own. The law of large numbers turns the possible or occasional problem into a daily event that you just have to deal with. It's no longer an exception. It's just another state in the graph.
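To make that last point concrete: in code, "just another state in the graph" means failure modes are ordinary values that placement and scheduling logic plans around, not exceptions it throws. A hypothetical sketch (the names are mine, not the real system's):

```python
from enum import Enum, auto

class NodeState(Enum):
    HEALTHY = auto()
    REBOOTING = auto()     # planned; back in about 10 minutes
    DISK_FAILED = auto()   # disks pop like popcorn
    DEAD = auto()          # power supply, motherboard, ...

def can_serve(state: NodeState) -> bool:
    # At scale, some fraction of the fleet is always in a non-HEALTHY
    # state, so every placement decision starts from this check rather
    # than treating failure as an exceptional case.
    return state is NodeState.HEALTHY
```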