by Leon Rosenshein

Let It Crash

This is the philosophy of Erlang. Only do the right thing, as specified in the requirements. If you can't do that, fail. Don't try to make it better. Don't try to hide it. Don't pretend it didn't happen. Just fail. And let something else deal with the problem. Something with more scope/visibility/understanding of what's going on. And that's not a bad idea. So maybe we should just let it crash.

I don't mean the robot. I mean the code. And I don't mean crash in some uncontrolled manner and leave it there. What I mean is that if you don't have the information/understanding/scope to appropriately handle a failure, don't. That's what the VIM is for.
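Here's a minimal sketch of that idea in Python. The names `parse_speed` and `parse_speed_hiding_failure` are hypothetical, made up for illustration and not taken from any real system. The first function does only what its requirements say and fails loudly when it can't. The second shows the temptation to avoid: papering over the failure with a plausible-looking value.

```python
def parse_speed(raw: str) -> float:
    """Parse a speed reading in m/s. Raises ValueError on input it can't handle."""
    speed = float(raw)  # float() raises ValueError for non-numeric text
    if speed < 0:
        raise ValueError(f"negative speed: {speed}")
    return speed


def parse_speed_hiding_failure(raw: str) -> float:
    """Anti-pattern: pretend the failure didn't happen."""
    try:
        return parse_speed(raw)
    except ValueError:
        # The caller can no longer tell "stopped" from "the sensor is broken".
        return 0.0
```

With the first version, whatever calls `parse_speed` sees the `ValueError` and can decide, with more context, what a bad reading actually means. With the second, that decision has already been made by the code that knew the least.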

And this doesn't mean you get to ignore errors. Oh no. Quite the opposite. You need to handle them, but you need to handle them in the right place and the right way. It might be killing and restarting the process. It might be failing over to a different master. It might be a voting algorithm to decide who's in charge now. If you have a triple-redundant flight control computer and one of the three units gets confused, rather than try to fix the problem in place, restart that unit and pick up from where you are. Embedded systems often have supervisor or watchdog processes that know when something bad happens and trigger restarts rather than continue the bad behavior. In high-availability distributed software you often find multiple instances ready and waiting to take over if something happens to the current master.
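To make the supervisor idea concrete, here's a rough sketch in Python. Everything in it (`supervise`, `TooManyRestarts`, `make_flaky_worker`, the `max_restarts` limit) is a hypothetical name chosen for this example, and it runs the worker in-process to keep things short. A real supervisor or watchdog would typically watch a separate process or device.

```python
from typing import Callable


class TooManyRestarts(Exception):
    """Raised when the worker keeps crashing and the supervisor gives up."""


def supervise(worker: Callable[[], None], max_restarts: int = 3) -> int:
    """Run worker until it returns cleanly, restarting it when it crashes.

    Returns the number of restarts that were needed.
    """
    restarts = 0
    while True:
        try:
            worker()
            return restarts
        except Exception as exc:
            if restarts >= max_restarts:
                # This supervisor is out of options, so it fails too and
                # lets whatever sits above it deal with the problem.
                raise TooManyRestarts(f"gave up after {restarts} restarts") from exc
            restarts += 1


def make_flaky_worker(failures: int) -> Callable[[], None]:
    """Build a worker that crashes `failures` times, then succeeds."""
    remaining = failures

    def worker() -> None:
        nonlocal remaining
        if remaining > 0:
            remaining -= 1
            raise RuntimeError("worker got confused")

    return worker
```

Calling `supervise(make_flaky_worker(2))` restarts the worker twice and then returns normally. A worker that never recovers exhausts `max_restarts`, and the supervisor raises `TooManyRestarts` instead of looping forever. The worker never tries to repair itself. The supervisor never tries to understand what went wrong. Each does only the job it has enough scope for.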

Everything is a trade-off. Can we build everything with standby systems ready to take over at a moment's notice? Do we need to? Should we do it anyway? Is it enough to make sure that we "fail safe" and that we can recover later? While there are wrong answers, there's not always a right answer. The important thing is to think about it and understand what the failure cases are and how we're going to deal with them.