by Leon Rosenshein

Debugging

It's always the little things. Ran into a problem yesterday with the opus tool and pushing to the WBU2 docker registry. Docker is a wonderful thing, and we've made it scale to terabyte-sized registries serving data to thousands of nodes simultaneously, but it has its warts. Crossing the CORP/PROD boundary and mixing HTTP/HTTPS are a couple of them. On OSX you get the added bonus of docker-machine, so there are lots of moving parts. Add in Kraken for distribution and there are plenty of places to look when you can pull but not push.

I'm on-call and we got a notification that people were having trouble pushing images. At first it was an isolated incident, so it must be that person's machine, right? Start down the debug path. What OS? What version of the software? What network? How big is the image? Are your tunnels working? Everything seems to be OK there, and look, another user with the same problem.
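For the record, the first-pass triage looked something like this. The commands and the image name are illustrative, not a transcript of what I actually typed:

```bash
# First-pass triage on the reporting user's machine (illustrative, not a transcript)
docker version                                             # client and daemon versions, OS/arch
docker-machine ls                                          # on OSX: is the VM up and reachable?
docker images registry.example.com/team/service:latest     # how big is the image?
docker push registry.example.com/team/service:latest       # reproduce and capture the error
```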

Keep digging. Hey look, there are error messages in the log. Something about "seek after file closed". But wait, the file referenced in the error message isn't there. In fact, the folder path doesn't exist beyond the fourth level. Something's really not right here. Sounds like a disk error. Must be the hardware. Nope, or at least there's nothing in the logs, and since this is a distributed system and all of the nodes are logging similar messages, it's probably not a specific disk.
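The sanity checks at that point were roughly the ones below. Whether the path in the error lives on local disk or in the backing store depends on your storage driver, so treat the local-disk checks as an assumption about a setup like ours:

```bash
# Stand-in for whatever path the "seek after file closed" error actually named
BROKEN_PATH="/path/from/the/error/message"

ls -l "$BROKEN_PATH"      # does the file exist? (it didn't)
namei -l "$BROKEN_PATH"   # walk each path component -- the tree stopped existing a few levels up

# Any sign of a failing disk on this node? (there wasn't)
dmesg | grep -iE 'i/o error|xfs|ext4' | tail
```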

Keep digging. What's common to all of the nodes? The persistent store, HDFS. Well, that's working. Lots of reads/writes. Check the quota. Well below the limit for both space and objects, so that's not it. fsck says that HDFS is OK. Maybe someone changed perms on a file? Let's check the perms and owner of some files.
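The HDFS-side checks were roughly these. The registry root shown is illustrative (it matches the path in the error message further down):

```bash
REGISTRY_ROOT=/infra/dockerRegistry       # the registry's HDFS root (illustrative)

# Namespace and space quotas vs. actual usage -- we were well under both
hdfs dfs -count -q -h "$REGISTRY_ROOT"

# Filesystem health check -- came back healthy
hdfs fsck "$REGISTRY_ROOT"

# Spot-check owner, group, and permissions on some files
hdfs dfs -ls "$REGISTRY_ROOT"
```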

Oh, that's interesting. hdfs dfs -ls crashed with a GC out-of-memory error. That shouldn't happen. The only time I've seen that is when there are too many files in a folder. Let's try counting the files. hdfs dfs -count -h shows 1M files. Now that's an interesting number. Did you know that the default limit in HDFS is 1M files/subfolders per folder? Try to write one more and the write fails.
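For reference, it looked roughly like this. The 1M ceiling comes from the namenode setting dfs.namenode.fs-limits.max-directory-items, which defaults to 1048576:

```bash
# Listing the folder blows up the client JVM before it can print anything
hdfs dfs -ls /infra/dockerRegistry/_uploads          # GC out-of-memory error

# Counting it still works: about 1M files in a single directory
hdfs dfs -count -h /infra/dockerRegistry/_uploads

# That suspiciously round number is the namenode's per-directory ceiling
hdfs getconf -confKey dfs.namenode.fs-limits.max-directory-items   # 1048576 by default
```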

Time for a workaround. Rename the folder from _uploads to _uploads_old. Pushes start working. Great. We've mitigated. Now to figure out what happened. First, actually look at the files in that folder. A few magic HDFS env variables later and we've got a list of files. Turns out files that were supposed to be temporary weren't being cleaned up. There are files in there from 18 months ago, and the size of the folder is 1.7 PB. That also explains why we're using so much space in our registry.
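Roughly, the mitigation and the follow-up audit were the commands below. The client heap bump is the "magic env variable" part; the exact value is illustrative:

```bash
# Mitigation: move the full folder aside so new uploads land in a fresh one
hdfs dfs -mv /infra/dockerRegistry/_uploads /infra/dockerRegistry/_uploads_old

# Give the HDFS client enough heap to actually list ~1M entries
export HADOOP_CLIENT_OPTS="-Xmx8g"

# How old are the leftovers, and how much space are they holding?
hdfs dfs -ls /infra/dockerRegistry/_uploads_old | head
hdfs dfs -du -s -h /infra/dockerRegistry/_uploads_old      # ~1.7 PB
```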

At this point, as on-call, I'm done. I've mitigated the issue and gotten our customers moving, opened issues with the Kraken team to better manage their space so it doesn't happen again, and hey, I've even got a post-incident review here.

But there are a few lessons to take out of this. First of all, clean up after yourself, and make sure you actually do. Second, error messages with incorrect info send you down the wrong path. Yes, the file that was being seek'd on was closed, because it didn't exist. It didn't exist because a write failed, but that error wasn't logged, so we didn't know. We would have saved a lot of time if the error had been

"ERROR: Unable to write to hdfs://oppusprodda3e-2/infra/dockerRegistry/_uploads/<filename>"

or something like that. We would have gone right to that folder and seen the problem. Third, don't just log errors to stdout; emit metrics and alert on them. User complaints should never be your alerting mechanism.
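As a sketch of what that could look like in a setup like ours: count the (now properly logged) write failures and push them somewhere an alert can watch. The log path, Pushgateway host, and metric name are all assumptions, not our real config:

```bash
# Count HDFS write failures in the registry log and push the number to a
# Prometheus Pushgateway so an alert can fire on it -- purely as an example.
LOG=/var/log/docker-registry/registry.log              # assumed log location
failures=$(grep -c 'Unable to write to hdfs://' "$LOG")

cat <<EOF | curl --data-binary @- http://pushgateway.example.com:9091/metrics/job/docker_registry
# TYPE registry_hdfs_write_failures gauge
registry_hdfs_write_failures $failures
EOF
```

However you wire it up, the point is the same: the alert should fire before the first user notices.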