Scale Out, Not Up
If you've spent much time around folks who work on large scale data centers you've probably heard them talking about pets and cattle and wondered why that was. Yes, data centers are sometimes called server farms, but that's not why. Or at least it's not that simple.
It all goes back to the idea of scale and how that has changed over the years. 50 years ago computers were big. Room sized big. And they took a lot of specialized knowledge just to keep running, let alone doing anything useful. So every computer had a team of folks dedicated to keeping it happy and healthy. Then computers got smaller and more powerful and more numerous. So the team of specialists grew. But that can only scale so far. Eventually we ended up with more computers than maintainers, but they were still expensive, and they were all critical.
Every computer had a name. Main characters from books or movies, or the 7 dwarfs. Really big data centers used things like city or star names. It was nice. Use a name and everyone knew what you were talking about, and there just weren't that many of them. We had a render farm for Flight Sim that built the art assets for the game every night. They had simple names with a number, renderer-xx. But we only had 10 of them, so they all had to be working. Even the FlightSim team had a limited number of machines.
Eventually machines got cheap enough, and the workloads got large enough, that we ended up with 1000s of nodes. And the fact is that you can't manage 5000 nodes the way you manage 5, or even 50. Instead, you figure out which ones are really important and which ones aren't. You treat the important ones carefully, and the rest are treated as replaceable. Or, to use the analogy, you have pets and cattle. Pets are special. They have names. You watch them closely and when they're unwell you nurse them back to health. Cattle, on the other hand, have a number. If it gets sick or lost or confused, who cares. There's 5000 more just like it and it can be replaced.
That concept, coupled with some good automation, is what allowed a small team to manage the 5000 node WBU2 data center. Out of those 5000 nodes we had 28 pets, 28 semi-pets, and 4900+ cattle. 8 of those 28 pets were the namenodes I talked about in our HDFS cluster. We really cared about them and made sure they were happy. The other 20 we watched carefully, and made sure they were working, but re-imaging them was an easy fall-back if they got sick. And we really didn't care about the cattle. At the first sign of trouble we'd reimage. If that didn't work we'd take offline and have someone fix the bad hardware. And our users never knew. Which was the whole point of pets and cattle.