Incident Response
As I’ve mentioned before, I’m a Mechanical/Aerospace Engineer by training and a plane geek. One thing I’ve noticed is that there’s a lot software engineers can learn from aerospace engineers and aviation in general. One of those things is incident response.
A few days ago, a couple of Alaska Airlines jets headed for Hawaii had tail strikes within six minutes of each other.
Now, tail strikes happen occasionally. There’s a procedure for them. The airplane returns to the departure point and gets checked out, and depending on the severity of the strike, the passengers either continue on the same plane or a replacement is brought in and they leave on that one.
Two in a row, six minutes apart, from the same airline, at the same airport, with similar destinations, however, is not normal. There’s no defined procedure for it. Instead, Alaska’s director of operations looked at the situation and made a call. He declared an incident. All Alaska flights on the ground would stay on the ground until the issue was understood and, if needed, mitigated. It’s a safety-of-flight issue. When you don’t know what’s going on, stop and figure it out.
In this case, it turned out that a recent software update was at fault. The flight planning software that was supposed to figure out weight and balance information and use it to set takeoff parameters had a concurrency/load issue. When operations got busy and built too many flight plans too quickly, it got the weights wrong on heavy flights, leading to incorrect power settings, which eventually led to the tail strikes.
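To make that class of bug concrete, here’s a minimal, purely hypothetical Python sketch. The actual flight planning software isn’t public, and the FlightPlanner class, its fields, and the weights below are all invented for illustration; the point is only how one piece of shared state, reused across concurrent requests, can hand one flight another flight’s weight under load.

```python
import threading
import time

class FlightPlanner:
    """Hypothetical sketch of the bug class described above: one shared
    slot reused across concurrent requests instead of per-request state."""

    def __init__(self):
        self.pending_weight = None  # shared across requests: the flaw

    def build_plan(self, flight_id, takeoff_weight_lbs):
        self.pending_weight = takeoff_weight_lbs
        time.sleep(0.01)  # simulated work; another request can overwrite pending_weight here
        weight = self.pending_weight  # may now belong to a different flight
        setting = "reduced thrust" if weight < 150_000 else "near-full thrust"
        return (flight_id, weight, setting)

planner = FlightPlanner()
results = []

def request(flight_id, weight):
    results.append(planner.build_plan(flight_id, weight))

# Build two plans nearly at once: under load, the heavy flight can pick up
# the light flight's weight and get a power setting that is too low for it.
heavy = threading.Thread(target=request, args=("heavy flight", 170_000))
light = threading.Thread(target=request, args=("light flight", 120_000))
heavy.start(); light.start()
heavy.join(); light.join()
print(results)  # the heavy flight will often show 120,000 lbs and "reduced thrust"
```

With only one plan being built at a time, the sketch behaves correctly; it’s the burst of concurrent requests that exposes the bug, which matches the “busy operations” failure mode described above.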
There are lots of lessons to be learned here, and that’s setting aside the fact that the test plan for the software didn’t find the bug. First, the system has defense in depth against failures. The flight plan is designed to use less than full power to save fuel and wear and tear on the engines, but it doesn’t use the absolute minimum required; there’s a good-sized safety margin built in. The pilots have a checklist covering the length of runway remaining, acceleration, speeds, and altitudes as the takeoff proceeds. If any of those checklist items hadn’t been met, the takeoff would have been aborted. Even with the margins and the checklist, there were a couple of tail strikes, but there was no injury or loss of life and minimal damage to the planes. One of the planes continued on its way about 4 hours later. The system itself is resilient.
Second, while there were no pre-defined triggers around the number of tail strikes in a short period of time, the director of ops understood how the system works and recognized an anomalous situation. He was empowered to act, so he pulled the Andon Cord to stop things. There was a system in place to halt operations, so things stopped quickly.
The next step was information gathering. It quickly became apparent that the weight info was sometimes wrong. To resume operations, the pilots, the line operators most familiar with the weights and balances, were empowered to make a decision. They were told to check the weight, apply a TLAR (That Looks About Right) check, and, if they felt there was a problem, to check with operations and get the right number.
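Translated into software terms, that mitigation is just a plausibility guard placed in front of a value you no longer fully trust. Here’s a minimal sketch; the tlar_check helper, the average weights, and the tolerance are all invented for illustration, and the real check was human judgment, not code.

```python
# Rough, independent estimate used only as a sanity check, not as the answer.
AVG_PASSENGER_LBS = 190          # invented average including carry-ons
EMPTY_WEIGHT_LBS = 100_000       # invented empty weight for the sketch

def tlar_check(flight_id, computed_weight_lbs, passengers, fuel_lbs, tolerance=0.10):
    """Hypothetical TLAR-style guard: flag a computed weight that strays too far
    from a crude independent estimate. Numbers and tolerance are invented."""
    estimate = EMPTY_WEIGHT_LBS + passengers * AVG_PASSENGER_LBS + fuel_lbs
    if abs(computed_weight_lbs - estimate) / estimate > tolerance:
        # Doesn't look right: stop and confirm with operations before dispatch.
        raise ValueError(
            f"{flight_id}: computed weight {computed_weight_lbs:,} lbs is far from "
            f"rough estimate {estimate:,.0f} lbs; get the right number from ops"
        )
    return computed_weight_lbs
```

The design choice mirrors the real mitigation: the guard doesn’t try to compute the correct answer, it only catches numbers that can’t plausibly be right and escalates to a human.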
Twenty-two minutes after operations stopped, they were restarted. Meanwhile, work continued on a long-term solution to the problem. They handled things the right way. Stop the damage. Mitigate the problem and resume operations. Identify and implement the long-term fix. Figure out how the problem got into the system in the first place and make sure it doesn’t happen again.
That’s not just good incident response in a safety-critical system. That’s how you should respond to every incident.