by Leon Rosenshein

Back It Up There

Back in the house now and mostly back to normal. Heat and electricity are on, but we still need to boil our water. Turns out most of our processes and preparations worked, but not all of them. The biggest failure appears to be the emergency alert system. The city/county uses Everbridge to share alerts. That’s fine and dandy, and they’re highly available and distributed. The problem is, they’re OPT IN, so if you don’t register for alerts you don’t get them. And of course, if you don’t know about the system you don’t register. Well, we’re registered now, so we’ve got that going for us.

Which got me thinking about systems testing and assumptions. When something absolutely, positively, has to work right, how do you verify it? Do you believe that if your car’s “Check Engine” light is off everything is fine? That’s a pretty bold statement. Especially if you don’t have a test to tell you if that light even works. Even if everything is working as planned, is there something that can go wrong that the light doesn’t tell you about?

Having a little green light that says everything is OK is a little better. At least then if the bulb burns out you know something happened and you can check. But even that leaves you at the mercy of what the system behind the light checks.

The only way to really know if your system can handle a particular fault is to try it out. Back when I was working on Falcon 4.0 the team down the hall was working on Top Gun, and they were trying out a new source control system, CVS. It seemed great. It understood merges. Multiple people could work concurrently on the same file and deal with it later. And our IT team took good care of the server. RAID array for the disks, redundant power supplies, daily backups, the whole 9 yards. And of course, the server died. It was the Pepsi Syndrome. Someone knocked over the server. Disks crashed. Motherboards broke. Network ports got ripped out.

Short story, it wasn’t coming back. But that’s ok. We’ve got backups. Weekly full backups and daily incrementals. Pick up a new server, restore the backup, and keep developing. Just a couple of days delay. Until we tried to restore the backup. Turns out we had write-only backups. The little green lights came on every week. The scripts returned successfully. The tapes were rotated off site for physical safety as planned. We just couldn’t restore the server from them.

Luckily we were able to piece together a close enough approximation of the current state from everyone’s machines and the project moved forward. We also added a new step to the weekly tape rotation process. Every Monday we’d restore from tape and apply an incremental on top of it to make sure the entire process worked. And of course, after that we never needed the backups again.

So next time you trust the absence of a red light, or the existence of a little green light, make sure you know what you’re really trusting.