by Leon Rosenshein


I'm one of the folks down in the engine room (the infra team). Our job is to make sure that all of our customers are happy and that their appetite for processing (batch, long-running services, stateful, stateless, etc.) and storage (HDFS, Blob, POSIX, etc.) is met. As such, we're essentially a service team. That doesn't just mean we write services (though we do that), but that we provide a service to our customers. Our customers use those services to do their work, whatever it is. And like other service teams, when things are going well and our customers' needs are being met, they don't think about us too much. That's the way it should be: processing and storage as reliable as running water.

Of course, when things go wrong, people notice. That's an incident. So we do what's called Incident Management, which consists of two parts. The first is dealing with the problem. We mitigate the problem as soon as possible, work to understand not just what happened but why, and then do what we can to make sure it doesn't happen again.

The second part, which is at least as important, is communicating. Making sure people know there's a problem, and that we're working on it. Making sure people understand what caused it. This communication helps in lots of ways. It keeps people from wasting time trying to figure out what they're doing wrong when the problem isn't theirs. It keeps people from wondering what's going on and distracting us from fixing the problem. It helps our customers understand how they can make better use of our services without impacting others.

Communication is such a big part of the process that the communicator is one of the named roles in our SOP. We typically have two people on call for each area. When there's a problem, the primary is the lead, responsible for making sure the problem is solved and that the right people are working on it. The secondary is the communicator. Their job is to keep everyone else informed so that the primary can work the problem. This includes posting to Slack in #breaking, updating our dashboard with a banner, creating/updating an incident, and making sure that there's a post-incident review. They also handle all the ad-hoc queries that come in during the incident.

And while not everyone at ATG is working on services handling 1000s of requests/sec, we all have customers. Some internal, some at Core Business, and some external. And they all have expectations. So, whether you're responsible for a function call, library, REST/gRPC service, website, or a car showing up, what's your plan for when your customer complains? If you take a look at our SOP, it explicitly says it's not about how to solve the problem. It's about how to manage the process of solving the problem and making sure everyone who wants to know what's happening knows.