Exsqueeze Me?
With all due respect to Mike Myers as Wayne Campbell, I saw something on the internet and the only possible response was Exsqueeze Me? The quote started out OK, not great, but OK. Then, right there at the end, it took a sharp left into crazy town.
> Good teams can and will delete tests that have high false positive rates – or that never fail.
Here’s the thing. If you’ve got a test with a high false positive rate, that’s bad. Flaky tests are very bad. Bad in many dimensions. To list just a few of them:
- They waste your time: Every time a test fails, you’re expected to go analyze what failed and why, then figure out how to prevent that specific issue. But if you go analyze the failure and it turns out there really wasn’t a problem, you’ve wasted however long it took you to figure out there wasn’t a problem.
- Failing tests become business as usual: It’s the broken window effect. When no tests are failing, a sudden failing test is noteworthy. When you have a test that randomly fails, an additional failing test isn’t nearly as noticeable. If it wasn’t important enough to fix the only failing test, then each additional failing test is that much less important.
- Alert (or Alarm) Fatigue is real: When the siren sounds, it’s supposed to be unusual and noteworthy. If your alarm is going off all the time, it’s not an alarm, it’s just life. Just like the boy who cried wolf, if the alarm keeps going off and there’s nothing you can, should, or must do, you start to ignore it.
- Flaky tests indicate a lack of understanding: It could be a lack of understanding of the domain, the environment, the test setup, or any combination of the three. If you don’t understand the system and the situation in this specific case, what else aren’t you understanding? What are you missing that’s going to cause you problems later on?
Those are just some of the reasons flaky tests are bad. Deleting them isn’t the worst thing you could do, and it will fix the first three problems above, but it does nothing to fix the fourth. In fact, it just hides the problem. Ignoring a problem rarely makes it go away.
So most of the quote is almost correct. Instead of just removing a flaky test, a much better response is to fix the test so that it’s not flaky. The flakiness could be a bug in the code, a bug in the test, a problem with your test methodology, or a lack of understanding. Whichever it is, once you make the test pass consistently, you’re in much better shape. You don’t waste time. You’re incentivized to keep things clean. Alerts mean something. You understand your situation that much better. Which means you get to sleep that much better at night.
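To make that concrete, here’s a minimal, hypothetical sketch in Python (the `greeting` function and test names are invented for illustration). The first test is flaky because it secretly depends on the wall clock; the fix isn’t to delete it, it’s to make the hidden dependency explicit:

```python
import datetime


def greeting(now=None):
    """Return a greeting that depends on the hour of day."""
    now = now or datetime.datetime.now()
    return "Good morning" if now.hour < 12 else "Good afternoon"


# Flaky: passes or fails depending on when the suite happens to run.
def test_greeting_flaky():
    assert greeting() == "Good morning"


# Fixed: the clock is an explicit input, so the test is deterministic.
def test_greeting_fixed():
    nine_am = datetime.datetime(2024, 1, 1, 9, 0)
    assert greeting(now=nine_am) == "Good morning"
```

Injecting the clock is just one way to do it; the point is that the fix removes the nondeterminism while keeping the protection the test provides.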
It’s the last part of the quote that’s just plain WRONG. Some will say that a test that never fails serves no purpose and is wasting resources. Time. Bandwidth. Cognitive load. For no measurable benefit. Such tests haven’t stopped a single bug from getting through.
Those facts are real. All tests take time and bandwidth, and add some amount of cognitive load for developers. But all of that, across all of your unit tests, should be minimal. If it’s not minimal, then you have other problems (bad tests) you should fix¹.
Just because a test hasn’t caught a bug yet doesn’t mean it won’t ever catch one. Even if no one is changing the code directly, those tests can still help keep you safe. They do things like:
- Protect against changes in dependencies: Dependencies outside of your control can change. Those changes can break your code. If you don’t test, you don’t know.
- Protect against environmental changes: There are lots of things in the environment that can change. Networks come and go. Clock speeds change. Processors get replaced and new ones show up. There can be subtle differences in the environment. If you don’t test, you don’t know.
- Protect against bugs in tooling changes: Similarly, tools change. Runtime environments, compilers, and interpreters can change. Are you relying on undefined behavior? That can change without you knowing it. If you don’t test, you don’t know.
- Provide examples of how to use the code being tested: Tests are great examples. They can be documentation. They can be used as a learning environment. They can be a reference design.
- Acknowledge Hyrum’s Law: Given enough time and users, everything your code does, intended or not, is going to be relied upon by someone. You never want to change behavior on your users without knowing about it. That is not how you want to surprise them. (There’s a sketch of such a test just after this list.)
- Prevent bugs in the code under test: Finally, and certainly not the least important, you never know when a test is going to show you that you’ve broken something. Past performance is a good indicator of future performance, but it is not a guarantee. If you don’t test, you don’t know.
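Here’s a minimal, hypothetical sketch of what that looks like in practice (the `slugify` function and its expected outputs are invented for illustration): a pinning test that may never fail, right up until a refactor, a dependency upgrade, or a runtime change alters behavior that someone out there relies on.

```python
import re


def slugify(title):
    """Convert a title to a URL slug."""
    return re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")


def test_slug_format_is_stable():
    # Callers may have stored these slugs in URLs or databases.
    # If a refactor, a library upgrade, or a runtime change alters
    # the output, this "never fails" test is the first thing that
    # will tell us.
    assert slugify("Hello, World!") == "hello-world"
    assert slugify("  Exsqueeze Me?  ") == "exsqueeze-me"
```

The test has never failed, and with luck it never will. But the day it does fail is exactly the day you want to know about it.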
And that’s why the only possible response to someone saying you should delete tests that never fail is Exsqueeze Me?
---
1. Those tests might not be flaky, but they can be bad for many of the same reasons. And you should fix them, not delete them. But that’s a slightly different post. ↩︎