It Happens
"That hardly ever happens is another way of saying 'it happens'."
When someone says “That hardly ever happens,” there are two ways to approach the situation. The first is to take it at face value and put just enough effort into the rare case to ensure the system continues to operate, even if that little part fails. Like adding timeouts and retries to network requests. Something might be delayed or a user might have to click a button again, but things still happen. And the happy path stays happy.
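As a rough illustration of that first approach, here is a minimal retry sketch. The name `call_with_retries` and its parameters are hypothetical, and it assumes the wrapped operation enforces its own timeout and raises `TimeoutError` or `ConnectionError` when the network misbehaves.

```python
import time


def call_with_retries(operation, attempts=3, base_delay=0.1, sleep=time.sleep):
    """Run operation(), retrying transient network errors with exponential backoff.

    Hypothetical helper: operation is assumed to apply its own timeout and to
    raise TimeoutError or ConnectionError on a transient failure.
    """
    last_error = None
    for attempt in range(attempts):
        try:
            return operation()
        except (TimeoutError, ConnectionError) as exc:
            last_error = exc
            # Back off before the next try, but not after the final attempt.
            if attempt < attempts - 1:
                sleep(base_delay * 2 ** attempt)
    # Every attempt failed: surface the last error instead of hiding it.
    raise last_error
```

The rare case costs a short delay and a few extra lines, and any other exception passes straight through rather than being retried blindly.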
The second is to focus on those “rare” events. Put just enough effort into tracking the happy case so you know things worked and spend the rest of your time dealing with the outliers. Like silent failures when writing data to disk. Consider BatchAPI, which, like its predecessors, handles millions of tasks per week, sometimes per day. Even with six nines of reliability, roughly one failure per million tasks, at those scales you’re going to have tasks failing every week, and on the busiest days several per day. And those are the ones that people care about. The ones people want to dig into and understand what went wrong.
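The arithmetic behind that is short enough to sketch. The volumes below are made-up round numbers for illustration, not real BatchAPI figures, and `expected_failures` is a hypothetical helper.

```python
def expected_failures(tasks, nines):
    """Expected failed tasks given a reliability of `nines` nines.

    Six nines means 99.9999% success, i.e. one failure per 10**6 tasks.
    """
    return tasks / 10 ** nines


# Illustrative daily volumes only (assumptions, not measured numbers).
for tasks_per_day in (1_000_000, 5_000_000, 10_000_000):
    failures = expected_failures(tasks_per_day, 6)
    print(f"{tasks_per_day:>10,} tasks/day -> about {failures:g} failures/day")
```

At a million tasks a day that is about one failure a day; at ten million it is about ten. Rare per task, routine per platform.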
In cases like that, in platform code, “hardly ever happens” is the place where you need to focus. As much as scale is important, error handling is more important. Both internal and external. Internal errors, the platform’s own, get handled inside the system and ideally never affect the user (other than a possible delay in seeing the result). That handling needs to be rock solid. Redundant. Failsafe. Because, like those disk write errors, something will happen. It’s up to the platform to make sure there’s no impact to users. And at the very least, the platform should clearly take responsibility for the error when it fails.
On the other hand, when a user’s bit execution fails, then what? It’s a successful failure. All of the tooling and framework code did what it should, and did it correctly, but the user’s code failed. Now you need to help the user. What failed? Why did it fail? How can it be reproduced? Was it a transient problem that a retry will fix?
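One way to make those questions concrete is to capture the answers in a structured report whenever user code fails. This is only a sketch: `Owner`, `FailureReport`, and `run_user_task` are hypothetical names, and treating `TimeoutError` and `ConnectionError` as the transient cases is an assumption for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Owner(Enum):
    PLATFORM = "platform"  # the platform takes responsibility
    USER = "user"          # a "successful failure": the user's code failed


@dataclass
class FailureReport:
    task_id: str
    owner: Owner
    what: str        # what failed
    why: str         # why it failed
    repro: str       # how to reproduce it
    transient: bool  # is a retry likely to work?

    def summary(self):
        advice = "retrying may succeed" if self.transient else "a retry is unlikely to help"
        return (f"[{self.owner.value}] task {self.task_id}: {self.what} "
                f"({self.why}); {advice}. Reproduce with: {self.repro}")


def run_user_task(task_id, user_fn, repro):
    """Run user code and return (result, report); report is None on success."""
    try:
        return user_fn(), None
    except Exception as exc:
        # Assumption: only these exception types count as transient here.
        transient = isinstance(exc, (TimeoutError, ConnectionError))
        report = FailureReport(task_id, Owner.USER, type(exc).__name__,
                               str(exc), repro, transient)
        return None, report
```

The platform did its job either way; the report is what lets the user do theirs.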
As a platform you need to think about these things and how to respond. Because when you get down to it, the platform is really just a giant error handling system. And regardless of how rarely it happens, it will happen.