by Leon Rosenshein

Murder Mystery Theater - Acting All Roles

-- By Andrew Gemmel

There’s been a mild annoyance bothering developers on our team - and likely others - for a few months now. Occasionally the ssh-agent on development machines will die. Needing a ussh cert for most remote actions, the remedy for one terminal session is a quick eval $(ssh-agent) or more permanently, restarting the machine. We all chalked it up to a bad chef configuration or similar, at least until today.

Today, @mike.deats was debugging a separate IDE issue on his machine and noticed something odd. Without fail, he could reproduce this issue by running all tests in the atg-services repo. Ok, that’s disconcerting. A quick bisect effort isolated the problem to a single Golang package. One that I had written. Heavily unit tested, in fact notoriously so. This package is the taskhost program for the BatchAPI. If you’ve ever run a BatchAPI job, you can thank this code for its success. 

The taskhost is the thin wrapper between kubernetes and your user code that reports any issues back to the BatchAPI and ensures that your logs end up in the right place. The tests for this program basically mimic various job scenarios in kubernetes, kicking off a number of taskhost processes masquerading as docker containers and observing the state of the filesystem and output streams that result. 

In order to do this in a test environment, the taskhost always interacts with the outside world through a dependency injection context that provides things like a filesystem, log writers, AWS clients, and a shell process runner. 

Or at least, that was true until pull request 981 was landed. This was a late-night code change that I deployed while on-call to mitigate an outage. Long story short, an issue with the rNA log-reader was overwhelming the disks in our cluster and causing machines to hit 100% disk usage and get wedged. To mitigate this, that change deletes the log-reader cache in /tmp between each BatchAPI task run.

If you read through that PR carefully, you’ll notice that the RemoveContents() function I so carefully copy and pasted from StackOverflow does not use the dependency injection filesystem. That’s right, every single time the taskhost unit tests run on a machine, they delete everything in /tmp on the user’s machine, including the ssh agent’s ussh cert.

Wow. It’s a miracle that killing ssh agents was the worst thing that this mistake did. The corresponding fix was as simple as deleting that mitigation code, as the underlying log-reader problem has long since been remedied.

There’s a few lessons here. One, hot fix code is a necessary evil but checking it in without careful audit is A Bad Thing. Second, when that evil code is checked in, a ticket to ensure it’s removed as soon as possible would be A Good Thing. Third, debugging can often become a game of murder mystery theater where you are not only the detective, but the murderer and victim too.