by Leon Rosenshein

Remediation Surprise

Some call them Postmortems, Post-Incident Reviews (PIR), or Root Cause Analysis. Some call them something else entirely. Regardless of what you call them, taking the time to figure out why something bad happened and what you’re going to do to make sure it doesn’t happen again is a crucial part of building stable, performant, flexible systems. Unfortunately they can also lead to remediation surprise.

A key part of the PIR is the list of action items. The things you’re going to do to ensure it doesn’t happen again. Or at least that you’re alerted to it sooner so you can react faster. After all, if you don’t do anything with the new learnings, why bother to learn them?

Things on that list of action items have varying priorities. Some are things that need to happen soon. They’re not going to change the day-to-day, but they will make the future better. Some are anti-things. Don’t turn knob “X” without first adjusting knob “Y”. Put that one in the knowledge base and set a reminder to yourself to do some validation later.

And there are things that need to happen right now. Things that are critical to continuing to make forward progress. If you don’t do them, you’re going to continue to spend way too much time fighting fires. Some of those are simple changes, so you just do them immediately. They eat into your slack time, but that’s what it’s there for, and it doesn’t change any of your schedules.

Others though are bigger. An algorithm change. A dataflow change. Things that are much bigger than would fit into your slack time. You need to get them done, but something’s got to give. You just don’t know what. And neither do the folks depending on you. Which is bad. You want to surprise and delight your customer/user, but that’s not the kind of surprise they want.

By the time you get to this kind of remediation you’ve already surprised your customer once, so you really want to avoid surprising them again. One really good way to do that is to tell them what you’re not going to do. What you need to delay in order to do the work to stop giving them bad surprises. And then ask them if the trade-off makes sense from their perspective. Maybe there’s something else they’d rather delay. Or drop entirely. You won’t know until you ask. When you ask, you avoid the remediation surprise.

by Leon Rosenshein

Poka-Yoke

I’ve talked about guardrails and Root Cause Analysis before. And one of the questions I often ask candidates during an interview is what they (or their team) did to prevent a recurrence of a mistake with significant impact. These are all ways to make things safer in the future.

A general term for building that kind of safety into a system is poka-yoke. The idea that you can, by design, put special steps or checkpoints into a system so that mistakes are hard to miss. Checkpoints along the way that are easier to acknowledge than ignore. So you keep mistakes in the process from turning into defects in the results.

There are many examples of this in manufacturing. Your typical Ikea bookshelf comes with a bunch of boards and a cardboard sheet with the exact number of nuts, bolts, anchors, pegs, and tools needed to build the bookshelf. When you get to the end of the build, if you’ve got parts left over you’ve missed something. You might not know exactly what the problem is, but it is clear that there’s a problem.

There are ways to do this with software as well. Some of it is input validation. It can be as simple as using an Enum instead of a string in your API. If you take a type called HttpVerb, and the only valid values are HttpVerb.Get, HttpVerb.Post, HttpVerb.Put, HttpVerb.Head, HttpVerb.Delete, HttpVerb.Trace, HttpVerb.Connect, HttpVerb.Options, and HttpVerb.Patch then the user can’t specify an invalid value. And if your API only supports a subset of those you can only define that subset, again keeping the user from making a mistake.
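
Here’s a minimal sketch of that idea in Python. The `send_request` function and the three-verb subset are made up for illustration; they’re not from any particular library. Python only enforces types at runtime (or through a separate type checker), so the sketch adds an explicit check.

```python
from enum import Enum


class HttpVerb(Enum):
    """Only the verbs this hypothetical API supports."""
    GET = "GET"
    POST = "POST"
    DELETE = "DELETE"


def send_request(verb: HttpVerb, url: str) -> str:
    """Hypothetical API call. Anything that isn't an HttpVerb is rejected."""
    if not isinstance(verb, HttpVerb):
        raise TypeError(f"verb must be an HttpVerb, got {verb!r}")
    # A real client would make the call here; the sketch just echoes it.
    return f"{verb.value} {url}"
```

A caller can pass `HttpVerb.GET`, but a typo like `"GTE"` fails right at the call, and `HttpVerb("GTE")` raises a `ValueError` instead of quietly creating a bad value.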

The Builder pattern is a great example. You create a builder, set all the properties you want, then call the “Build” method. If you get something back it will work. If you get an error you know why. There’s no way to get a half-baked, inconsistent, or otherwise invalid thing. It might not be exactly what you want, but it is what you asked for. Similarly, another way is through proper separation of concerns. If the only way to get an instance of something is to ask the service/library/factory for it then you don’t have to worry about someone creating an invalid one on their own.
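
As a rough sketch of what that can look like, here’s a small builder in Python. `Connection`, `ConnectionBuilder`, and the specific validation rules are all hypothetical, chosen just to show the shape. Python can’t really stop someone from constructing a `Connection` directly, so treat the “only through the builder” part as a convention in this sketch.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Connection:
    """Immutable result. By convention it only comes from the builder."""
    host: str
    port: int
    use_tls: bool


class ConnectionBuilder:
    def __init__(self) -> None:
        self._host: Optional[str] = None
        self._port: int = 443
        self._use_tls: bool = True

    def host(self, host: str) -> "ConnectionBuilder":
        self._host = host
        return self

    def port(self, port: int) -> "ConnectionBuilder":
        self._port = port
        return self

    def use_tls(self, use_tls: bool) -> "ConnectionBuilder":
        self._use_tls = use_tls
        return self

    def build(self) -> Connection:
        # All validation lives here: you get a valid object or a clear error.
        if not self._host:
            raise ValueError("host is required")
        if not 1 <= self._port <= 65535:
            raise ValueError(f"port out of range: {self._port}")
        return Connection(self._host, self._port, self._use_tls)
```

Calling `ConnectionBuilder().host("db.example.com").port(5432).build()` gives back a complete `Connection`. Leave out the host, or pick a port that can’t exist, and `build()` raises instead of handing you something half-baked.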

You can also have a poka-yoke step later in the process. If you’re doing some kind of image analysis and the number of answers is supposed to be the same as the number of images analyzed you can add a step to the process that validates the number of results. It doesn’t prevent a problem, and it doesn’t make sure the analysis is correct, but it does ensure that everything is analyzed. You can make it even simpler. If a step in the process is supposed to have an output then check if there is any output. If not, it’s an error. Regardless of whether the step says it succeeded or not.
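
Both checks are only a few lines. This is a sketch, not a real pipeline API: `StepOutputError`, `require_output`, and `require_one_result_per_input` are names invented here for illustration.

```python
from typing import Sequence


class StepOutputError(Exception):
    """Raised when a step's output fails a basic sanity check."""


def require_output(step_name: str, outputs: Sequence) -> None:
    # Simplest check: a step that should produce output produced none,
    # no matter what status the step itself reported.
    if len(outputs) == 0:
        raise StepOutputError(f"{step_name}: no output produced")


def require_one_result_per_input(
    step_name: str, inputs: Sequence, outputs: Sequence
) -> None:
    # Count check: says nothing about correctness, only completeness.
    if len(outputs) != len(inputs):
        raise StepOutputError(
            f"{step_name}: expected {len(inputs)} results, got {len(outputs)}"
        )
```

Run `require_one_result_per_input("analyze", images, results)` right after the analysis step, and a dropped image becomes a loud error at that step instead of a quiet gap someone finds later.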

Regardless of whether it happens early or late in the process, the important part is that you detect the mistake, and make sure the mistake is recognized, as soon as possible so it can be handled then instead of turning into a problem for others.

How it gets handled is outside the scope of poka-yoke and a topic for another day.

by Leon Rosenshein

Raspberry vs. Strawberry Jam

    “The wider you spread it, the thinner it gets.”

         – Gerald M. Weinberg

Otherwise known as the Law of Raspberry Jam. You take some raspberry jam and spread it over so many slices of toast that it just adds a little bit of color and a hint of sweetness. None of the wonderful flavor or texture of the raspberries you were trying to add. I’ve called it peanut buttering things before. Either way, you declare that you’re going to take advantage of something. Then you take the idea, set of resources, your time, or level of effort and spread it so thin that you don’t get any of the benefits you’re expecting. When it doesn’t work (or at least not as well as you expected) you declare that doing it was a waste of time and go back to the old way.

That just doesn’t seem honest to me. If you’re going to give something a chance, you should give it a chance. Really try it out. Put enough effort behind it to know if it works. Unfortunately, in reality you can’t always drop everything and put full effort into an attempt. There is, however, something you can do.

Because Weinberg has another law. The law of Strawberry Jam.

    “As long as it has lumps, you can never spread it too thin.”

That’s the important difference between strawberry jam and raspberry jam. Strawberry jams, at least the good ones, have lumps in them. Things that you just can’t spread too thin. No matter how hard you try, those lumps of strawberry don’t get spread around. They might move around, but they don’t get smaller and they retain that great strawberry taste. The trick is to apply that idea to the thing you’re doing/trying. Find the indivisible nuggets of the idea, resource, time, or effort you invest.

Consider the simple idea of a sprint. If all you do is change your planning from every quarter to every 2 weeks all you’ve done is change the cycle length. You haven’t actually changed much. Instead of seeing how many stories you can fit in a sprint, start by deciding what the goal is. Define what benefit/value your customer/user will see when you’re done and then ensure that you’re working on the right things to make that happen.

That’s the indivisible nugget. Defining the customer value and implementing it. Do that and you’re not just changing the cycle time for the sake of changing the time. You’re changing the time so that you deliver more customer value sooner.

That’s giving the idea a chance.

by Leon Rosenshein

Planning and Time Sinks

Paraphrasing Eisenhower and von Moltke, planning is important, plans, not so much. The biggest reason for that is that no matter how much or how well you plan, things don’t always go the way you expect. Some things go faster, some go slower, and some just don’t work. Which leads to Hofstadter’s Law: it always takes longer than you expect, even when you take into account Hofstadter’s Law. Of course, Hofstadter’s Law is self-referential, so you can’t just plan it away.

What you can do, though, is understand why things take longer. There are lots of specific reasons, but generally they break down into 5 time thieves. They are:

  • Too much WIP: Context switching and always being busy makes you slower
  • Unknown Dependencies: It’s the unknown unknowns that you find later that slow you down
  • Unplanned Work: It’s hard to prevent the next fire while you’re fighting the current one
  • Conflicting Priorities: Not knowing what to do first/next. Changing focus (or even goals) leaves you with lost time (and potentially wasted effort)
  • Neglected Work: The tech debt that slows you down. All the things that make it harder to get things done

They’re all related. Unplanned Work is usually because of Conflicting Priorities. Something comes up that is suddenly higher priority than what you’re doing. Unknown Dependencies are often caused by Neglected Work. A dependency pops up because you haven’t done the work to define or update your domains or context boundaries when things changed. And anything that has you switching tasks before they’re done leads to Too much WIP.

Together, all those things mean that your plans shouldn’t be written in stone. Things will change, and you need to be adaptable. But thinking ahead will help you know the things that might happen and lets you prepare for them. And being prepared lets you minimize their impact. You can’t eliminate the impact, but you can reduce it.

Which brings us full circle. As Eisenhower said,

    “Plans are worthless, but planning is everything.”

by Leon Rosenshein

Are Your KRs Really Tasks?

OKRs. Objectives and Key Results. The Objective comes from your vision, and the Key Results define what success looks like on your KPIs (Key Performance Indicators). One nice thing about Objectives is that they can be decomposed into smaller, more focused objectives as you go down the org hierarchy. That works great at the company, SVP/VP, and director level, and can work for managers of managers.

When the org is using OKRs every level is expected to have their objectives support one of the OKRs at the level above. This helps with prioritization and ensuring the right level of effort/support is put behind each high level OKR. And it provides a path for issues to bubble up so they can be addressed while they’re still small issues, not big problems.

But it often breaks down at the individual team level. The team is trying to increase customer value and decrease developer workload. The problem comes when decomposing the higher level Objectives into team specific Key Results. At a task level the work looks something like “Document service X”, “Implement sharding for the database”, “Enable real-time queuing for requests”, or “Improve error handling”. Those may be critical tasks, and are probably the right things to do, but they’re tasks, not Key Results.

Unfortunately, what often happens is that those tasks end up in the OKR roll-up as Key Results. But they’re not Key Results. They’re not even results at all. They’re just tasks. Remember, Key Results are supposed to be the desired values of the KPIs.

What you need to do is figure out what the right KPIs are for the Objective. KPIs that define how things look from the outside. Then come up with how you expect the completion of those tasks to impact the KPIs. Those are the Key Results. The improvement (outcomes, not outputs) that comes about because of the work done.

That doesn’t mean the team doesn’t need to keep track of the tasks. In fact, it’s critical to understand what the work needed to reach the result is. But it’s even more important to ensure that the result is what’s being tracked, not the work being done.

by Leon Rosenshein

Heroes

I’ve spent the last 10 years or so down in the metaphorical engine room. Working on developer experience, infrastructure, platforms, and tools. The things you often don’t notice or talk about unless there’s a problem. Then folks complain about it. Then, when it’s better they thank the person who told them about the fix. And I think that’s a problem.

Because it leads to things like this comic.

Comic: Two people each see a small problem. One fixes the one they see. The other watches it grow bigger, impacting lots of people. When everyone is aware of the big problem the second person fixes it. The first person is ignored, and the second is marked as a hero.

Two people saw small problems but their responses were very different. The first person saw the problem, fixed it, and moved on. The second person just looked at theirs, or maybe didn’t even notice it. Then, when the problem was big and lots of folks saw it and were impacted, they fixed it. The first person was ignored. The second person was declared a hero.

And that’s wrong. Preventing the big fire is much more impressive. Noticing a small issue and keeping it from growing has a much bigger impact on the group as a whole. We need to be thanking the first person.

And understanding why the second person didn’t do anything originally. And getting them to fix problems when they find them.

The first person is being a hero. That’s what we should be celebrating. A culture of quality.

by Leon Rosenshein

Are we there yet?

No, I’m not channeling my kids on a 20 hour car ride (although that has happened multiple times). If you’ve ever made the drive from New York to Miami you know what I’m talking about. You drive for hours. All day in fact. A long day. A very long day. You finally get to Florida. Yay. We’re almost there. Then you realize you still have 6 hours to go. You’re only 2/3 of the way there. Getting to Florida is a significant milestone, but you’re not done yet.

In many ways development in large, complex systems and environments is like that. You spend time figuring out where you want to go (identifying the requirements and sequencing them). You start out heading down the best route you’ve found (getting a tool/service/library working in your local or test environment). You make some side trips, either because you need more supplies (modify an upstream service) or something looks relevant (understanding a dependency). There can be detours along the way (pop-up tasks, outages, etc.), and sometimes you take a wrong turn (incorrect architecture and algorithm choices). And you have some intermediate stops planned along the way (milestones). Eventually you get to where you’re going (your customer is happy with the results).

Getting to Florida on the trip from NYC to Miami is like getting it to work in your local environment. It’s a key milestone, and you should celebrate it, but it’s not the end of the journey. You still need to get the work you’ve done into your customer’s hands. You have to deploy it. Until then, the journey isn’t finished.

Sometimes that’s easy. The change is small, or additive. Or the thing you’re deploying has no state. So after some testing in various environments you deploy it. Other times though, deploying it can be as complicated as (or more complicated than) writing it. Test environments can be hard to work with. There might not be enough automation. There might be stateful services that need careful coordination to replace. All of the above. And the final deploy to production, to your customers, can be all that and more.

So two bits of advice:

  1. Don’t say something is done until it’s in your customer’s hands. If you’re using a Kanban-style tracking system make sure you have a column for deployment. Track the time it’s in the To Be Deployed state and you’ll find your bottlenecks pretty quickly.
  2. Make your test/deployment process as automated as possible. Use scripts, throw away environments, stored data, live data streams, or whatever it takes to make it simple to go through the process. And since your task isn’t done until the deployment is done, you’ve got both the time and incentive to make it faster/easier to do.
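
Tracking time in the To Be Deployed state doesn’t take much. Here’s a small sketch, assuming your tracker can give you a time-ordered list of (timestamp, column) moves for a card. The data shape and the `time_in_columns` name are assumptions for illustration, not any tracker’s actual API.

```python
from collections import defaultdict
from datetime import datetime, timedelta
from typing import Dict, List, Tuple


def time_in_columns(
    transitions: List[Tuple[datetime, str]]
) -> Dict[str, timedelta]:
    """Total time a card spent in each column.

    `transitions` is a time-ordered list of (timestamp, column entered).
    The last entry is the card's current column and accrues no time here.
    """
    totals: Dict[str, timedelta] = defaultdict(timedelta)
    # Pair each move with the next one; the gap is the time in that column.
    for (entered, column), (left, _next) in zip(transitions, transitions[1:]):
        totals[column] += left - entered
    return dict(totals)
```

Sum the `"To Be Deployed"` entry across cards and compare it to the time spent in your other columns. If it’s the biggest number on the board, you’ve found where to invest in automation.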

by Leon Rosenshein

Tasking

As a general and President, Eisenhower had a big impact on the US. One of his contributions that’s more relevant to development is the Eisenhower Matrix. It breaks tasks down into 4 types and helps you prioritize which ones you should work on when. It also tells you which ones to ignore.

But what if 4 categories is too many for you? Shreyas breaks it down into 3 categories, giving you one less thing to remember.

The three categories are:

  • Leverage: Tasks that have an inordinate amount of impact on the future. Do a great job on these. They’re the important ones and the time won’t be wasted.
  • Neutral: Tasks that are important, but the world goes on. Do what you can, but don’t obsess over them.
  • Overhead: Tasks that are overhead. Someone needs them done, so you need to do them. However, actively try to do as little as possible to meet the requirement.

If you do that you spend the time needed on the important tasks and get the overhead out of the way with minimal effort. You get to focus and your customer partners are happy.

Of course, the trick is to make sure you get the categorization right. Invert that and no one will be happy. Personally I prefer Eisenhower’s matrix, but I’ve seen many people be successful with this method.

by Leon Rosenshein

All Aboard

Today I started the latest chapter in my life. Just started my new job with Amazon, working on Just Walk Out technology. This is a big job change for me. It’s the first time in almost 25 years that I had to interview for a new job. It’s also the first time in almost that long that I made the transition alone.

Prior to Amazon I was at Aurora Innovation. I got there as part of an acquisition. Before that I was at Uber’s ATG subsidiary. I joined that organization, along with the rest of my team, as part of a re-org as Uber decided not to source its own maps from the ground up. I moved to Uber in a similar fashion. Microsoft decided that it wanted out of the map-making business and sold part of Bing Maps to Uber. While I was at Microsoft I spent most of my time in either the Bing Maps org or the Simulations org. Inside (and even between) those two orgs I moved to follow interesting work and advance my career. No (or very informal) interviews, and teams moved together.

This time it’s very different. I’m the only one moving. I had to interview with people I didn’t know and who were unfamiliar with my work. Despite the fact that I’ve interviewed ~1500 people over the years, being interviewed was very different. Sure, the experience helped, and Amazon’s interview process was really smooth, but there’s still a lot of pressure.

And now, instead of working on the same things I’ve been working on, or learning new systems with the rest of my team, I’m drinking from the firehose. So much to learn. New technologies. New people. New processes. I can’t wait to see where it leads.

by Leon Rosenshein

Chesterton's Fence

I’ve mentioned Chesterton’s Fence in a couple of previous blogs, but I never did define it. Chesterton’s Fence is a parable about how lacking context can be dangerous. In his 1929 book, The Thing, Chesterton wrote:

In the matter of reforming things, as distinct from deforming them, there is one plain and simple principle; a principle which will probably be called a paradox. There exists in such a case a certain institution or law; let us say, for the sake of simplicity, a fence or gate erected across a road. The more modern type of reformer goes gaily up to it and says, “I don’t see the use of this; let us clear it away.” To which the more intelligent type of reformer will do well to answer: “If you don’t see the use of it, I certainly won’t let you clear it away. Go away and think. Then, when you can come back and tell me that you do see the use of it, I may allow you to destroy it.”

Or, in other words, if you don’t know why something was done, don’t undo it without figuring out why it was done in the first place. Only when you know why something was put in place can you decide if it’s not needed any more. Consider the typical bathroom sink.

Many of them have a small hole or two in the near side a little below the top. They just sit there. Nothing comes out of them. Nothing goes into them. So they should be removed, right?

Not so fast. In normal operation those holes are unused. But bathroom sink drains are notorious for being slow. Not blocked, because that’s obvious and gets fixed, but slow. And very often, when in the bathroom using the sink with the water running you’re doing something else. Brushing your teeth, shaving, removing makeup, or just distracted.

That’s when those holes come into play. As the drain runs slowly the water builds up in the sink. Eventually so much water builds up that it starts to run through those holes and down the drain. It’s not perfect, but it helps, and gives you a little more time to correct the problem. But if you never saw them used you might buy a sink without them, then find a small lake in your bathroom.

The same thing can happen in code. You want to break long lines at a space if possible. As you’re going through the code you come across a strange byte-counting/examining loop. You’re confused because it doesn’t just start at column 80 and go backward looking for a space. It starts at the beginning, checks values, skips spaces, checks more, and keeps track of the last space it found. Very complicated. So you remove it and go with the simple solution.

You write your tests, deploy, and close out the task. Done and done. Then suddenly you start getting bug reports. Turns out you assumed that all of the input was simple ASCII code, but in reality it’s Unicode/MultiByte characters, so you can’t do the simple thing. Without enough context, you broke the code.
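
To make that failure concrete, here’s a small Python sketch. It isn’t the original code, which isn’t shown here, just an illustration that assumes UTF-8 input. `break_by_bytes` is the “simple solution” that treats every byte as one column, and `break_by_chars` counts decoded characters instead. Both names are made up for this example.

```python
def break_by_bytes(line: bytes, width: int) -> bytes:
    """Naive version: assume 1 byte == 1 column, cut at the last space."""
    if len(line) <= width:
        return line
    cut = line.rfind(b" ", 0, width + 1)
    # No space found: hard cut, which can land in the middle of a character.
    return line[: cut if cut != -1 else width]


def break_by_chars(line: str, width: int) -> str:
    """Character-aware version: count decoded characters, not bytes."""
    if len(line) <= width:
        return line
    cut = line.rfind(" ", 0, width + 1)
    return line[: cut if cut != -1 else width]


# "naive cafe deja vu" with four accented letters, written with escapes.
text = "na\u00efve caf\u00e9 d\u00e9j\u00e0 vu"  # 18 characters, 22 UTF-8 bytes
fits = break_by_chars(text, 18)                   # whole line fits in 18 columns
early = break_by_bytes(text.encode("utf-8"), 18)  # wrapped two words too soon
```

With plain ASCII the two functions agree, which is why the tests passed. With accented text the byte version wraps early, and on a long run with no spaces it can cut a character in half and leave bytes that no longer decode. Even the character-aware version is only a sketch, since characters and display columns aren’t always the same thing either.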

That’s Chesterton’s fence. So the next time you want to remove some seemingly unimportant code, make sure you know why it was added in the first place.