
by Leon Rosenshein

Hangers and Optimization

I’ve talked about context before. I know context is important. Even so, I still get reminded of its importance by the smallest of things. After just over 2 years I’m on a business trip again, and something that happened last night reminded me of it once more.

When I got to my room I went to hang up my shirts, because I don’t want to have to iron them. Some of the hangers were hung hook in and some were hung hook out. Which raises the question: hook in, hook out, or random?

Which gets right to the question of context and optimization. My father was a policeman. And a volunteer fireman. If you asked him how hangers should be arranged, the answer you got would depend on which hat he was wearing. The fireman would say to make them all point the same way, preferably in. That makes it easier to grab lots of them at once and take them off the bar if you need to move them in a hurry. The policeman, on the other hand, would say that you should hang them randomly. That makes it harder for someone to grab a bunch of them and run off. So which way you hang your clothes depends on what you’re trying to optimize for.

In engineering, software or otherwise, the answer is also often “it depends.” There are lots of ways to meet the base requirements. Which one to choose depends on what you’re trying to optimize for. Are you optimizing for resource (memory, disk, CPU, network, etc.) usage? Maybe it’s throughput, or latency. It might be one of those non-functional requirements, like reliability or uptime. Are you optimizing for time? Is the short term more important, or the long term? Or maybe it’s a business requirement. Minimizing cost in one area. Minimizing waste in a division. Maximizing revenue on a product.

Remember that regardless of what you’re optimizing for, there’s always something you’re giving up. Minimizing resource usage might increase latency or decrease throughput. What’s best for tomorrow might not be best for next week, next month, or next year. But there are times when you need to optimize for the short term or there won’t be a long term to worry about. You can minimize development costs by not having any developers, but that’s really going to delay your release date.

The trick is to recognize that just because you’ve found a local maximum (or minimum), that doesn’t mean you’ve found the global maximum (minimum). Whether or not that’s the right answer depends.

by Leon Rosenshein

Loops and Ratios

Test Driven Development, as I’ve said before, is really design by example. But it’s also a systems thinking approach to the problem. Loops within loops. Feedback and feed forward. Change something (a requirement or a result) in one loop and it impacts the other.

TDD loops within loops

What that image doesn’t capture though is the impact of time spent on various parts of the loop. It doesn’t touch on how the ratios of those times form incentives. And it really doesn’t say anything about how having the wrong ratios can lead to some pretty odd incentives. Which then leads to behaviors which are, shall we say, sub-optimal.

Since time isn’t part of the picture, a natural assumption is that each step takes the same amount of time, and the arrows don’t take any time. Of course, that’s not true. “Make the test pass” is where you spend the majority of the time, right? If that’s true then things work out the way you want them to. The incentive is to make the tests pass and you don’t need to compensate for anything. Things move forward and people are happy.

We want continuous integration. We want commits that do one thing. We want reversible changes. We don’t want long lived branches. We don’t want big bang changes. Those lead to code divergence and merge hell. And no one wants that. When most of the time is spent in “Make the test pass” then incentives match and we’re naturally pushed in that direction.

Unfortunately, that distribution of time spent isn’t always the case. There are loops hidden inside each step. Those loops have their own time distribution. To make matters worse, the arrows are actually loops as well. Nothing is instant, and time gets spent in places somewhat outside your control. It’s that time, and the ratio of those times, that lead to bad incentives.

For instance, “Make the test pass” is really a loop of write code -> run tests -> identify issues. If the majority of time is in “write code”, great. But if it takes longer to run the tests than it does to write the code or identify issues then the incentive changes from make a small change and validate it to make a bunch of changes and validate them. This makes folks feel more efficient because less time is spent waiting. Unfortunately, it might make the individual more efficient, but it has all the downsides of bigger changes. More churn. More merge issues. Later deployment.
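
To make the ratio concrete, here’s a minimal sketch in Go. The Discount function and its ten-percent rule are invented for illustration; the point is that a small, focused test like this runs in milliseconds, so the write code -> run tests -> identify issues loop stays short and the incentive stays “make a small change and validate it”:

```go
package pricing

import "testing"

// Discount is a stand-in for whatever behavior the tests are driving out.
// It works in integer cents so the example stays exact.
func Discount(cents int) int {
	if cents >= 10000 {
		return cents - cents/10 // 10% off orders of $100 or more
	}
	return cents
}

// A fast table-driven test keeps the inner loop measured in seconds,
// so there's no temptation to batch up changes before validating them.
func TestDiscount(t *testing.T) {
	cases := []struct {
		name  string
		cents int
		want  int
	}{
		{"under threshold, no discount", 5000, 5000},
		{"at threshold, ten percent off", 10000, 9000},
	}
	for _, c := range cases {
		t.Run(c.name, func(t *testing.T) {
			if got := Discount(c.cents); got != c.want {
				t.Errorf("Discount(%d) = %d, want %d", c.cents, got, c.want)
			}
		})
	}
}
```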

Consider the arrow to deployable system. It’s not just an arrow. That arrow is hiding at least two loops: the Create/Edit a PR or commit -> Have it reviewed/tested -> Understand feedback loop, and the Deploy to environment X -> Validate -> Fix loop. Those loops can take days. Depending on what’s involved in testing, they can take weeks. If it takes weeks to get a commit deployed, imagine how far out of sync everyone else is going to be when it gets deployed. Especially if you’ve saved some time by having multiple changes in that deployment.

All of those (and other) delays lead to bigger changes and more churn across the system. Churn leads to more issues. Incompatible changes. Bad merges. Regressions. All because the time ratios are misaligned. So before you think about dropping things from the workflow, think about how you can save time. After all, chances are those steps in the workflow are there for a reason.

by Leon Rosenshein

Weird Person #2 or First Joiner?

Who’s more important, the person who comes up with a weird idea, or the person who validates it? Sure, if no one comes up with the idea then it won’t come to fruition. But if someone comes up with the idea and no one joins in, it probably won’t happen either.

According to legend Steve Wozniak was happily building technology demonstrators for himself, but probably would have stayed at his HP day job if it wasn’t for that other Steve. Steve Jobs, who saw what he was doing, thought it was cool, joined in, and started a movement. That movement grew and the rest is history.

Or consider the Sasquatch Music Festival in 2009. Someone got up and started dancing. Nothing happened for a while. The guy was clearly noticed. Someone was recording him, after all. But no one joined him. Then someone did. That did two things. First, it encouraged the first guy to continue. Second, it changed the narrative from one weird guy dancing to a couple of guys enjoying the music. The second guy made it acceptable for others to join in. Eventually the whole crowd joined in and it turned into an event.

dance party

It’s the same if you’re developing tools and libraries. Someone builds an API. It doesn’t matter if it’s REST, gRPC, a C++ shared library, or a Go package. Until someone uses it, it’s just like that dancing guy. Out there, noticed but unused, waiting for someone to try it out. Your job, as a potential person #2, is to decide if you should try it out. Does it appear to fill a need you have? How reversible is the decision to try it out? While you’re trying it, what are you not doing? If it fills a need, appears to add something valuable, is a reversible decision, and you’re not risking too much time/energy, maybe give it a shot. You might end up being person #2, bringing goodness and joy to the world.

by Leon Rosenshein

Remediation Surprise

Some call them Postmortems, Post-Incident Reviews (PIR), or Root Cause Analysis. Some call them something else entirely. Regardless of what you call them, taking the time to figure out why something bad happened and what you’re going to do to make sure it doesn’t happen again is a crucial part of building stable, performant, flexible systems. Unfortunately, they can also lead to remediation surprise.

A key part of the PIR is the list of action items. The things you’re going to do to ensure it doesn’t happen again. Or at least that you’re alerted to it sooner so you can react faster. After all, if you don’t do anything with the new learnings, why bother to learn them?

Things on that list of action items have varying priorities. Some are things that need to happen soon. They’re not going to change the day-to-day, but they will make the future better. Some are anti-things. Don’t turn knob “X” without first adjusting knob “Y”. Put that one in the knowledge base and set a reminder for yourself to do some validation later.

And there are things that need to happen right now. Things that are critical to continuing to make forward progress. If you don’t do them, you’re going to continue to spend way too much time fighting fires. Some of those are simple changes, so you just do them immediately. They eat into your slack time, but that’s what it’s there for and doesn’t change any of your schedules.

Others though are bigger. An algorithm change. A dataflow change. Things that are much bigger than would fit into your slack time. You need to get them done, but something’s got to give. You just don’t know what. And neither do the folks depending on you. Which is bad. You want to surprise and delight your customer/user, but that’s not the kind of surprise they want.

By the time you get to this kind of remediation you’ve already surprised your customer once, so you really want to avoid surprising them again. One really good way to do that is to tell them what you’re not going to do. What you need to delay in order to do the work to stop giving them bad surprises. And then ask them if the trade-off makes sense from their perspective. Maybe there’s something else they’d rather delay. Or drop entirely. You won’t know until you ask. When you ask, you avoid the remediation surprise.

by Leon Rosenshein

Poka-Yoke

I’ve talked about guardrails and Root Cause Analysis before. And one of the questions I often ask candidates during an interview is what they (or their team) did to prevent a recurrence of a mistake with significant impact. These are all ways to make things safer in the future.

A general term for building that kind of safety into a system is poka-yoke. The idea that you can, by design, put special steps or checkpoints into a system so that mistakes are hard to miss. Checkpoints along the way that are easier to acknowledge than ignore. So you keep mistakes in the process from turning into defects in the results.

There are many examples of this in manufacturing. Your typical Ikea bookshelf comes with a bunch of boards and a cardboard sheet with the exact number of nuts, bolts, anchors, pegs, and tools needed to build the bookshelf. When you get to the end of the build, if you’ve got parts left over you’ve missed something. You might not know exactly what the problem is, but it is clear that there’s a problem.

There are ways to do this with software as well. Some of it is input validation. It can be as simple as using an Enum instead of a string in your API. If you take a type called HttpVerb, and the only valid values are HttpVerb.Get, HttpVerb.Post, HttpVerb.Put, HttpVerb.Head, HttpVerb.Delete, HttpVerb.Trace, HttpVerb.Connect, HttpVerb.Options, and HttpVerb.Patch then the user can’t specify an invalid value. And if your API only supports a subset of those you can only define that subset, again keeping the user from making a mistake.
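
In Go, which doesn’t have a built-in enum, a typed constant set gives the same protection. This is just a sketch; the package and function names are made up for illustration:

```go
package client

import "fmt"

// HttpVerb is a closed set of values, so callers can't pass an arbitrary string.
type HttpVerb int

// Only the verbs this hypothetical API supports are defined. Anything else
// simply isn't available for the caller to pass.
const (
	Get HttpVerb = iota
	Post
	Put
	Delete
)

// Do accepts only a defined HttpVerb, keeping invalid input out of the request path.
func Do(verb HttpVerb, url string) error {
	switch verb {
	case Get, Post, Put, Delete:
		return nil // ... build and issue the request here ...
	default:
		return fmt.Errorf("unsupported verb: %d", verb)
	}
}
```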

The Builder pattern is a great example. You create a builder, set all the properties you want, then call the “Build” method. If you get something back it will work. If you get an error you know why. There’s no way to get a half-baked, inconsistent, or otherwise invalid thing. It might not be exactly what you want, but it is what you asked for. Similarly, another way is through proper separation of concerns. If the only way to get an instance of something is to ask the service/library/factory for it then you don’t have to worry about someone creating an invalid one on their own.
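
Here’s what that might look like in Go, as a rough sketch with made-up names. The only way to get a Report is through Build, which either returns a valid instance or an error explaining why not:

```go
package report

import "errors"

// Report can only be produced by Build, so there's no way to end up with a
// half-baked or inconsistent instance.
type Report struct {
	title  string
	format string
}

// Builder accumulates settings; nothing is checked until Build is called.
type Builder struct {
	title  string
	format string
}

// NewBuilder starts with sensible defaults.
func NewBuilder() *Builder { return &Builder{format: "pdf"} }

func (b *Builder) Title(t string) *Builder  { b.title = t; return b }
func (b *Builder) Format(f string) *Builder { b.format = f; return b }

// Build returns either a fully valid Report or an error that says why not.
func (b *Builder) Build() (*Report, error) {
	if b.title == "" {
		return nil, errors.New("report: title is required")
	}
	return &Report{title: b.title, format: b.format}, nil
}
```

A caller chains the setters and calls Build, and either gets something usable or a clear error. There’s no path to a half-configured Report.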

You can also have a poka-yoke step later in the process. If you’re doing some kind of image analysis and the number of answers is supposed to be the same as the number of images analyzed, you can add a step to the process that validates the number of results. It doesn’t prevent a problem, and it doesn’t make sure the analysis is correct, but it does ensure that everything is analyzed. You can make it even simpler. If a step in the process is supposed to have an output, then check that there is any output. If not, it’s an error, regardless of whether the step says it succeeded or not.
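
A sketch of that kind of check in Go (the names are invented for illustration):

```go
package pipeline

import "fmt"

// checkAnalysisOutput is a poka-yoke step. It doesn't verify that the analysis
// is correct; it only catches the mistake where a step produced nothing, or
// produced a different number of results than there were images to analyze.
func checkAnalysisOutput(imagesIn, resultsOut int) error {
	if resultsOut == 0 {
		return fmt.Errorf("analysis step produced no output")
	}
	if resultsOut != imagesIn {
		return fmt.Errorf("analyzed %d images but got %d results", imagesIn, resultsOut)
	}
	return nil
}
```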

Regardless of whether it happens early or late in the process, the important part is that you detect the mistake, and make sure the mistake is recognized, as soon as possible so it can be handled then instead of turning into a problem for others.

How it gets handled is outside the scope of poka-yoke and a topic for another day.

by Leon Rosenshein

Raspberry vs. Strawberry Jam

    “The wider you spread it, the thinner it gets.”

         – Gerald M. Weinberg

Otherwise known as the Law of Raspberry Jam. You take some raspberry jam and spread it over so many slices of toast that it just adds a little bit of color and a hint of sweetness. None of the wonderful flavor or texture of the raspberries you were trying to add. I’ve called it peanut buttering things before. Either way, you declare that you’re going to take advantage of something. Then you take the idea, set of resources, your time, or level of effort and spread it so thin that you don’t get any of the benefits you’re expecting. When it doesn’t work (or at least not as well as you expected) you declare that doing it was a waste of time and go back to the old way.

That just doesn’t seem honest to me. If you’re going to give something a chance, you should give it a chance. Really try it out. Put enough effort behind it to know if it works. Unfortunately, in reality you can’t always drop everything and put full effort into an attempt. There is, however, something you can do.

Because Weinberg has another law. The law of Strawberry Jam.

    “As long as it has lumps, you can never spread it too thin.”

That’s the important difference between strawberry jam and raspberry jam. Strawberry jams, at least the good ones, have lumps in them. Things that you just can’t spread too thin. No matter how hard you try, those lumps of strawberry don’t get spread around. They might move around, but they don’t get smaller, and they retain that great strawberry taste. The trick is to apply that to whatever you’re doing or trying: the indivisible nuggets of the idea, resource, time, or effort you invest.

Consider the simple idea of a sprint. If all you do is change your planning from every quarter to every 2 weeks, all you’ve done is change the cycle length. You haven’t actually changed much. Instead of seeing how many stories you can fit in a sprint, start by deciding what the goal is. Define what benefit/value your customer/user will see when you’re done and then ensure that you’re working on the right things to make that happen.

That’s the indivisible nugget. Defining the customer value and implementing it. Do that and you’re not just changing the cycle time for the sake of changing the time. You’re changing the time so that you deliver more customer value sooner.

That’s giving the idea a chance.

by Leon Rosenshein

Planning and Time Sinks

Paraphrasing Eisenhower and von Moltke, planning is important, plans, not so much. The biggest reason for that is that no matter how much or how well you plan, things don’t always go the way you expect. Some things go faster, some go slower, and some just don’t work. Which leads to Hofstadter’s Law. Of course, Hofstadter’s Law is self-referential, so you can’t just plan it away.

What you can do, though, is understand why things take longer. There are lots of specific reasons, but generally they break down into 5 time thieves. They are:

  • Too much WIP: Context switching and always being busy makes you slower
  • Unknown Dependencies: It’s the unknown unknowns that you find later that slow you down
  • Unplanned Work: It’s hard to prevent the next fire while you’re fighting the current one
  • Conflicting Priorities: Not knowing what to do first/next. Changing focus (or even goals) leaves you with lost time (and potentially wasted effort)
  • Neglected Work: The tech debt that slows you down. All the things that make it harder to get things done

They’re all related. Unplanned Work is usually because of Conflicting Priorities. Something comes up that is suddenly higher priority than what you’re doing. Unknown Dependencies are often caused by Neglected Work. A dependency pops up because you haven’t done the work to define or update your domains or context boundaries when things changed. And anything that has you switching tasks before they’re done leads to Too much WIP.

Together, all those things mean that your plans shouldn’t be written in stone. Things will change, and you need to be adaptable. But thinking ahead will help you know the things that might happen and lets you prepare for them. And being prepared lets you minimize their impact. You can’t eliminate the impact, but you can reduce it.

Which brings us full circle. As Eisenhower said,

    “Plans are worthless, but planning is everything.”

by Leon Rosenshein

Are Your KRs Really Tasks?

OKRs. Objectives and Key Results. The Objective comes from your vision, and the Key Results define what success looks like on your KPIs (Key Performance Indicators). One nice thing about Objectives is that they can be decomposed into smaller, more focused objectives as you go down the org hierarchy. That works great at the company, SVP/VP, and director level, and can work for managers of managers.

When the org is using OKRs every level is expected to have their objectives support one of the OKRs at the level above. This helps with prioritization and ensuring the right level of effort/support is put behind each high level OKR. And it provides a path for issues to bubble up so they can be addressed while they’re still small issues, not big problems.

But it often breaks down at the individual team level. The team is trying to increase customer value and decrease developer workload. The problem comes when decomposing the higher level Objectives into team specific Key Results. At a task level the work looks something like “Document service X”, “Implement sharding for the database”, “Enable real-time queuing for requests”, or “Improve error handling”. Those may be critical tasks, and are probably the right things to do, but they’re tasks, not Key Results.

Unfortunately, what often happens is that those tasks end up in the OKR roll-up as Key Results. But they’re not Key Results. They’re not even results at all. They’re just tasks. Remember, Key Results are supposed to be the desired values of the KPIs.

What you need to do is figure out what the right KPIs are for the Objective. KPIs that define how things look from the outside. Then come up with how you expect the completion of those tasks to impact the KPIs. Those are the Key Results. The improvement (outcomes, not outputs) that comes about because of the work done.
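
To make that concrete with an invented example: if the task is “Implement sharding for the database”, the KPI might be p99 query latency, and the Key Result might be “p99 query latency stays under 200 ms while the dataset doubles”. The sharding work is how you get there; the latency number is the result you track.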

That doesn’t mean the team doesn’t need to keep track of the tasks. In fact, it’s critical to understand what the work needed to reach the result is. But it’s even more important to ensure that the result is what’s being tracked, not the work being done.

by Leon Rosenshein

Heroes

I’ve spent the last 10 years or so down in the metaphorical engine room. Working on developer experience, infrastructure, platforms, and tools. The things you often don’t notice or talk about unless there’s a problem. Then folks complain about it. Then, when it’s better, they thank the person who told them about the fix. And I think that’s a problem.

Because it leads to things like this comic.

hero

Two people each see a small problem. One fixes the one they see. The other watches it grow bigger, impacting lots of people. When everyone is aware of the big problem, the second person fixes it. The first person is ignored, and the second is marked as a hero.

Two people saw small problems but their responses were very different. The first person saw the problem, fixed it, and moved on. The second person just looked at theirs, or maybe didn’t even notice it. Then, when the problem was big and lots of folks saw it and were impacted, they fixed it. The first person was ignored. The second person was declared a hero.

And that’s wrong. Preventing the big fire is much more impressive. Noticing a small issue and keeping it from growing has a much bigger impact on the group as a whole. We need to be thanking the first person.

And understanding why the second person didn’t do anything originally. And getting them to fix problems when they find them.

The first person is being a hero. That’s what we should be celebrating. A culture of quality.

by Leon Rosenshein

Are we there yet?

No, I’m not channeling my kids on a 20 hour car ride (although that has happened multiple times). If you’ve ever made the drive from New York to Miami you know what I’m talking about. You drive for hours. All day in fact. A long day. A very long day. You finally get to Florida. Yay. We’re almost there. Then you realize you still have 6 hours to go. You’re only 2/3 of the way there. Getting to Florida is a significant milestone, but you’re not done yet.

In many ways development in large, complex systems and environments is like that. You spend time figuring out where you want to go (identifying the requirements and sequencing them). You start out heading down the best route you’ve found (getting a tool/service/library working in your local or test environment). You make some side trips, either because you need more supplies (modify an upstream service) or something looks relevant (understanding a dependency). There can be detours along the way (pop-up tasks, outages, etc.), and sometimes you take a wrong turn (incorrect architecture and algorithm choices). And you have some intermediate stops planned along the way (milestones). Eventually you get to where you’re going (your customer is happy with the results).

Getting to Florida on the trip from NYC to Miami is like getting it to work in your local environment. It’s a key milestone, and you should celebrate it, but it’s not the end of the journey. You still need to get the work you’ve done into your customer’s hands. You have to deploy it. Until then, the journey isn’t finished.

Sometimes that’s easy. The change is small, or additive. Or the thing you’re deploying has no state. So after some testing in various environments you deploy it. Other times though, deploying it can be as complicated as writing it, or more so. Test environments can be hard to work with. There might not be enough automation. There might be stateful services that need careful coordination to replace. All of the above. And the final deploy to production, to your customers, can be all that and more.

So two bits of advice:

  1. Don’t say something is done until it’s in your customer’s hands. If you’re using a Kanban style tracking system, make sure you have a column for deployment. Track the time it’s in the To Be Deployed state and you’ll find your bottlenecks pretty quickly.
  2. Make your test/deployment process as automated as possible. Use scripts, throw away environments, stored data, live data streams, or whatever it takes to make it simple to go through the process. And since your task isn’t done until the deployment is done, you’ve got both the time and incentive to make it faster/easier to do.