by Leon Rosenshein


Ever run into an error dialog on a web page that said something like "An error has occurred. Hit OK to continue"? Not very helpful, was it? What about "An error has occurred and has been logged. Please try again later."? Still not really helpful, but at least you feel better because someone knows about it now.

Well, this isn't about that error dialog (although some of it applies). This is about that log that got mentioned. We've got a decently sized team, and every 7 weeks I'm on call, so I get to deal with those log files. We (and our customers) use lots of 3rd party libraries and tools, so the logs I get to wade through are of varying quality. Lots of information. Lots of status. Some indication of what's happening, and the occasional bit of info about something that's gone wrong. Spark and HDFS in particular are verbose, but not all that informative. There are lots of messages about calls that succeeded and counts of the things that are happening. The occasional message about a retry that succeeded. And then the log ends with "The process terminated with a non-zero exit code". Thus starts the search for the error. Somewhere a few pages (or more) up in the log is the call that failed.

So what can we do to make it better? First and foremost, log less. Or at least make it possible to log less by setting verbosity levels. In most cases logging the start, middle, and end of every loop doesn't help, so only do that if someone sets a flag. All that logging does is add noise and hide the signal.
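Here's a minimal sketch of what "only if someone sets a flag" could look like, using Python's standard `logging` and `argparse` modules. The logger name `batch_job`, the `-v`/`--verbose` flag, and the `process` function are all made up for illustration; the point is just that the per-iteration chatter exists but stays quiet by default.

```python
import argparse
import logging

log = logging.getLogger("batch_job")  # hypothetical logger name


def configure_logging(argv=None):
    """Map a hypothetical -v/--verbose flag onto a logging level."""
    parser = argparse.ArgumentParser()
    parser.add_argument("-v", "--verbose", action="count", default=0)
    args = parser.parse_args(argv)
    # No flag: warnings and errors only. -v: progress. -vv or more: per-item detail.
    level = {0: logging.WARNING, 1: logging.INFO}.get(args.verbose, logging.DEBUG)
    logging.basicConfig(format="%(asctime)s %(levelname)s %(name)s: %(message)s")
    log.setLevel(level)
    return level


def process(items):
    """Hypothetical work loop: summary at INFO, per-item noise at DEBUG."""
    log.info("processing %d items", len(items))
    for item in items:
        # Hidden unless someone explicitly asks for it with -vv.
        log.debug("processing item %r", item)
    log.info("done")
```

Run without flags and the loop is silent unless something goes wrong; run with `-vv` when you actually need to watch every iteration.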

Second, don't go it alone. Use existing structured logging where possible. Your primary consumer is going to be a human, so log messages need to be human-readable, but the sheer size of our logs means we need some automated help to winnow them down to a manageable size, so add some machine-parsable structure.
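As a rough illustration of "human-readable plus machine-parsable", here's a sketch that emits one JSON object per line using only Python's standard library. The `JsonLineFormatter` class, the field names (`ts`, `level`, `msg`), and the convention of passing extra data under a `fields` key are my assumptions, not an established standard; in practice an existing structured logging library would likely be the better choice.

```python
import json
import logging


class JsonLineFormatter(logging.Formatter):
    """Render each record as one JSON object per line (hypothetical field names)."""

    def format(self, record):
        entry = {
            "ts": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "msg": record.getMessage(),  # still a sentence a human can read
        }
        # Anything passed via extra={"fields": {...}} becomes a filterable key.
        entry.update(getattr(record, "fields", {}))
        return json.dumps(entry, default=str)


handler = logging.StreamHandler()
handler.setFormatter(JsonLineFormatter())
ingest_log = logging.getLogger("ingest")  # hypothetical logger name
ingest_log.addHandler(handler)
ingest_log.setLevel(logging.INFO)

# The message reads fine on its own; the fields are there for the tooling.
ingest_log.warning(
    "retry succeeded",
    extra={"fields": {"attempt": 2, "host": "worker-17"}},
)
```

With output like this, finding every retry on a given host is a filter on `host` and `msg` instead of a page-by-page scroll.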

Third, when you do detect an error, log as much info as possible. What were you doing? What was the actual error/return code? What was the call stack? What were the inputs? The more specific you can be, the better. The one thing you want to be careful about is logging PII or financial info. If you're processing a credit card payment and it fails, logging the order id and the error code is good. Logging the user name, address, credit card number and security code is too much.
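To make the credit card example concrete, here's a small sketch. `PaymentError`, `charge`, and the `gateway` callable are hypothetical stand-ins, not a real payment API. It logs what was being done, the order id, the error code, and the call stack, and deliberately leaves the card number out of the message.

```python
import logging

log = logging.getLogger("payments")  # hypothetical logger name


class PaymentError(Exception):
    """Hypothetical error raised by a payment gateway client."""

    def __init__(self, code, message):
        super().__init__(message)
        self.code = code


def charge(order_id, card_number, gateway):
    """Charge an order; on failure log what, which order, and why, never the card."""
    try:
        return gateway(order_id, card_number)
    except PaymentError as err:
        # log.exception logs at ERROR level and appends the current traceback.
        log.exception(
            "charge failed: order_id=%s error_code=%s",
            order_id,
            err.code,
        )
        raise
```

The order id is enough to look up everything else in a system that's allowed to hold it, so nothing is lost by keeping the sensitive fields out of the log.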

Finally, think about the person who's going to be looking at it cold months from now. Is the information actionable? Will the person looking at it know where to start? There's a high probability that the person dealing with it will be you, so instead of setting your future self up for a long search for context, provide it up front. You'll thank yourself later.