by Leon Rosenshein

Surprise And Delight

You read about how you should always surprise and delight your customer, but really, you just want them to be delighted. And they really shouldn't be surprised to be delighted. An unexpected new feature they like is OK, but it really indicates a failure to properly market the new feature and create demand. The flip side of surprise is when your customer expects something to work and it doesn't.

There are lots of ways that can happen. Some are accidents or outside of our control, like when a worker violates safety protocol, shorts 5 MW to ground, and blows all the transformers for your new DC. Others could have been predicted in retrospect, like when a new dataset is sufficiently different from the usual ones that it causes some processing to fail. The final kind is when you know it's going to happen, but you still surprise your customers.

That last kind happens, but it really never should. I'm not saying that you can't make breaking changes or that you can't do heavy maintenance with downtime, just that your customers shouldn't be surprised. And that's called Change Management. There are lots of ways to do it. In the physical world there are whole departments dedicated to the change management process, from the reason for the change through the scheduling and implementation of it. If you need to make a change to a multi-million dollar production line, you'd better have a pretty good idea of what you're doing.

Another kind of change management is a calendar. Just list and publicize changes that might impact others. It's based on the honor system. List the things you're doing and when, and watch the list for changes that might impact you. It's decentralized and can be low impact, but it's not no impact. You need to take the time to look for what might impact you for two reasons. First, there's no central authority that will do it for you, and second, even if there were, they wouldn't know everything you do, so they might miss something.
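To make the "watch the list" half concrete, here's a minimal sketch of a shared change calendar and the filter each team would run against it. Everything in it is an assumption for illustration: the `PlannedChange` fields, the `changes_affecting` helper, and the system names are hypothetical, not a description of any real tool.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class PlannedChange:
    """One entry on the shared calendar (hypothetical shape)."""
    owner: str               # team making the change
    summary: str             # what is changing
    when: date               # planned date
    affects: frozenset       # names of systems the owner thinks are touched


def changes_affecting(calendar, my_dependencies, today):
    """Return upcoming changes that touch anything I depend on, soonest first."""
    deps = set(my_dependencies)
    relevant = [c for c in calendar if c.when >= today and c.affects & deps]
    return sorted(relevant, key=lambda c: c.when)


# Made-up entries, purely for illustration.
calendar = [
    PlannedChange("storage", "Migrate blob store", date(2024, 6, 10),
                  frozenset({"blob-store"})),
    PlannedChange("network", "Rotate TLS certs", date(2024, 6, 3),
                  frozenset({"gateway", "auth"})),
    PlannedChange("build", "Upgrade compiler", date(2024, 5, 1),
                  frozenset({"ci"})),
]

for change in changes_affecting(calendar, {"auth", "blob-store"}, date(2024, 5, 20)):
    print(change.when, change.owner, change.summary)
```

The `affects` field is only the owner's best guess, which is exactly why you still have to read the list yourself: the filter can only find the interactions somebody thought to write down.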

So how do we do this? Right now we have two basic mechanisms: RFCs and blameless postmortems. The RFC process lets you know what's coming so you can think about how it interacts with whatever it is you do. In many cases you can ignore the implementation details as long as you understand the "what" and "why" well enough to know how you interact with it. If there is some interaction, then you can figure out if it's breaking and what you need to do to coordinate.

The second is less for libraries and more for services/tools. If you have a planned outage/upgrade/scream-test, let people know. Create a proactive incident. Send out a few announcements before it happens. Give people time to respond, then remind them when it's about to happen.
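As a rough sketch of "a few announcements, then a reminder", here's one way to turn an outage start time into an announcement schedule. The `announcement_times` helper and the lead times of a week, a day, and an hour are my assumptions, not a prescribed policy. Pick whatever gives your customers enough time to respond.

```python
from datetime import datetime, timedelta

# Assumed lead times: a week out, a day out, and a last-call reminder.
DEFAULT_LEAD_TIMES = (timedelta(days=7), timedelta(days=1), timedelta(hours=1))


def announcement_times(start, lead_times=DEFAULT_LEAD_TIMES, now=None):
    """Return when to announce a planned outage, earliest first.

    If `now` is given, announcement times that have already passed are dropped.
    """
    times = sorted(start - lead for lead in lead_times)
    if now is not None:
        times = [t for t in times if t >= now]
    return times


outage_start = datetime(2024, 5, 20, 9, 0)
for when in announcement_times(outage_start):
    print("announce at", when.isoformat())
```

If the outage gets scheduled late, passing `now` drops the announcements you already missed, which is a decent hint that you may not be giving people enough warning.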

Now go and delight your customers without surprising them.