Scream Tests

devops backups hoarders

The other day I mentioned to someone that we had a scream test scheduled for a few weeks from now, and they gave me a very confused look. Since I know that it’s my responsibility to make sure my messages are received properly, I asked if they knew what a scream test was. Turns out they didn’t. That day they were one of the lucky 10,000. If you don’t know, today’s your day.

In short, a scream test is for when you think something is unused, but you’re not sure, and you don’t know how to find out. One way to be sure is to simply turn it off. If no one screams, then it was unused. If there are screams, you’ve identified the users. It’s a pretty simple and accurate test.

I’m honestly not sure when I first heard the term, but it was probably sometime in the late 1990s. That’s when I was first running servers and services that other folks used on a regular basis. They were file shares, and back then storage was both expensive and hard to get, so dealing with hoarders was a full-time job. But we never knew for sure who was using the data. We’d ask folks to clean up. We’d tell people it was going to go away on a given day. We’d remind them. Sometimes we’d get a couple of folks saying they didn’t care, but generally, silence. So eventually we’d make the data unavailable. Usually nothing happened. So we’d wait a week, archive to tape, do a restore (this is a critical step in backup planning), then remove it from the server. Space was recovered and everyone was happy.

But every once in a while, about 30 seconds after we took the folder offline, someone would do the office gopher and scream “Where’s my data?” Then we knew who was using that data. We’d either keep it or figure out exactly what was needed and only keep that. Not perfect, but not a bad outcome.
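If you wanted to track that process in code, a minimal sketch might look like the following. Everything in it is invented for illustration: the `ScreamTest` class, the stage names, and the share name are hypothetical, and the seven-day quiet period is just the week we happened to wait. It only models the bookkeeping; it doesn’t touch any real storage or backup system.

```python
from dataclasses import dataclass, field
from datetime import date, timedelta
from enum import Enum


class Stage(Enum):
    ANNOUNCED = "announced"  # users were told the date; still online
    OFFLINE = "offline"      # unavailable, but nothing deleted yet
    ARCHIVED = "archived"    # quiet period passed; copy written to archive
    REMOVED = "removed"      # restore verified; data removed from the server
    KEPT = "kept"            # somebody screamed; bring it back


@dataclass
class ScreamTest:
    resource: str
    offline_on: date
    quiet_period: timedelta = timedelta(days=7)  # assumed: the week we waited
    stage: Stage = Stage.ANNOUNCED
    screamers: list = field(default_factory=list)

    def scream(self, who):
        """Record a user. Before removal, a scream means we keep the data."""
        self.screamers.append(who)
        if self.stage is not Stage.REMOVED:
            self.stage = Stage.KEPT

    def advance(self, today, restore_verified=False):
        """Move at most one stage forward and return the current stage."""
        if self.stage is Stage.ANNOUNCED and today >= self.offline_on:
            self.stage = Stage.OFFLINE
        elif (self.stage is Stage.OFFLINE
              and today >= self.offline_on + self.quiet_period):
            self.stage = Stage.ARCHIVED
        elif self.stage is Stage.ARCHIVED and restore_verified:
            # Never delete until a test restore from the archive has worked.
            self.stage = Stage.REMOVED
        return self.stage


# Hypothetical share that goes offline on a made-up date.
test = ScreamTest("old-game-assets", offline_on=date(2024, 1, 8))
test.advance(date(2024, 1, 8))   # taken offline
test.scream("accounting")        # 30 seconds later...
print(test.stage, test.screamers)
```

The one design choice worth noting is that removal is gated on `restore_verified`, which mirrors the rule above: the data doesn’t leave the server until a restore from the archive has actually been tried.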

Even more rarely, something completely unrelated would stop working. Then the real fun began. Why would the monthly accounting run care if the asset share for a 3-year-old game was offline? It cared because the person who wrote the script had originally used their desktop for temporary storage as part of the process, but had to stop when a memo told everyone to turn off their machines at night to save power and the script started failing. So they found some temp storage they could use and started using it. That’s wrong for many reasons, not the least of which was that everyone at the company had access to that share and could have seen the data if they knew it was there. In that case we not only identified an unknown user, but also closed a security hole that we didn’t even know we had.

Over the years since, I’ve done similar things with file systems, internal and external services, and once even an entire data center. Turns out turning off a DC can break all sorts of things. Starting with DNS and routing 😊

And now you know what a scream test is.
