by Leon Rosenshein

Stuttering

There are two hard things in computer science: cache invalidation, naming things, and off-by-1 errors. Today’s post is about the middle one of the two.

Naming is hard. Noun clumps[1] for data. Verbs[2] for functions. Hungarian Notation[3] for clarity? Everyone’s got an opinion on what’s right. Even languages have opinions on what’s allowed. Almost all languages allow roughly the same subset of ASCII characters in identifier names. Some have rules about which characters can come first in an identifier and how long it can be. Others assume certain types for different identifiers (I’m looking at you, FORTRAN).

And some languages are more opinionated than others. APL has its own character set (and keyboard). Python doesn’t care much. Go, the language, is not very opinionated. Go, the ecosystem and the community, on the other hand, are very opinionated.

gofmt, the Go formatter, is so opinionated that, unlike every other formatting tool I’ve ever come across, it has ZERO options. You can add rewrite rules, but you can’t turn any off. You can’t pick the brace pattern. You can’t pick tabs or spaces, how much to indent each block, or anything else.
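To make that concrete, here’s a toy function (invented for this post, not from any real codebase). However you mangle the spacing, gofmt emits one canonical form: tabs for indentation, normalized spacing, and the one true brace style.

```go
package main

import "fmt"

// Before gofmt, spacing and indentation were the author's whim:
//
//	func add(a,b int)int{
//	        return a+b}
//
// After gofmt, there is exactly one accepted form:
func add(a, b int) int {
	return a + b
}

func main() {
	fmt.Println(add(1, 2)) // 3
}
```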

It’s simple, it’s fast, and all Go code looks the same. I consider that a win.

Go’s linters are similar. Both vet and golint let you choose which checks to run, but there is no way to ignore a false positive. You’ve got to fix it or rewrite the code to remove the false positive. I consider that mostly a win, but I have seen bad rewrites done just to make it happy.

Another thing golint will do for you is help prevent what it calls stuttering. According to golint, stuttering is when a public name repeats the name of the thing that contains it, so that callers see something like run.Run(). I get that. Stuttering like that is both annoying and ambiguous. When you’re talking about the code and you mention run, are you talking about the container or the method? You need to be careful when you talk about it, and the person you’re talking to needs to pay close attention to what you’re saying. That adds cognitive load, and part of Go’s design philosophy was the idea that it would be easy to read. Both for a person and for the compiler. The Go Proverbs really lean into this. Make it easy to understand. A few extra lines or methods is fine.
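Here’s a minimal sketch of the kind of stutter golint complains about, using a hypothetical package invented for illustration:

```go
// Package runner coordinates jobs. (A hypothetical package,
// invented to show the shape of the warning.)
package runner

// Stutters: every caller reads this as runner.RunnerConfig,
// and golint suggests dropping the package name from the type.
type RunnerConfig struct {
	Retries int
}

// Reads cleanly at the call site: runner.Config, runner.New().
type Config struct {
	Retries int
}

// New returns a Config with sensible defaults.
func New() *Config {
	return &Config{Retries: 3}
}
```

The call site is where the difference shows up: runner.Config tells you everything once; runner.RunnerConfig tells you twice.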

But it’s not perfect. Sometimes code still stutters. And sometimes it’s hard to read and understand. Especially when there’s lots of text in the output to look at.

All of which is a very long way to say that naming is hard, and when a name stutters it adds to your cognitive load. The higher the cognitive load, the more likely you are to miss something. Missing something simple can cause you a lot of grief.

I was thinking of this because of an issue I tripped over the other day. I misread the name of a test in a test report because of a stuttering issue. Starting from that simple mistake, I spent 4 hours questioning how computers work, the correctness of a very simple, core feature of our build system, bazel, and my sanity.

You see, bazel has the ability to filter out tests based on the keywords you tag them with. Want to run only tests tagged with needs_network? Just add --test_tag_filters=needs_network to your bazel test command. Want to skip any tests with that tag? Just add --test_tag_filters=-needs_network. Simple and straightforward.

And it seemed to work. I’d been happily marking bad tests with a tag and then filtering them out. Then, all of a sudden, one of them showed up in my list of failures. Why was it running? It was supposed to be filtered out. And when I tried to run just that test with the filter applied, it didn’t run.

However, when I ran all of the tests in that directory with the same filter, it did run. This apparent contradiction led me to 4 hours of questioning myself and computers. I did all sorts of debugging. Printf, tracing, log grepping. And it just didn’t make any sense.

I started asking others if they had seen anything like that. No one had. Until finally, someone (thanx Alex) pointed out the simple thing that I was missing. Our test names stuttered. We had one test called //a/really/long/path//that/leads/to/the/thing:go_default_test[4], and one test called //a/really/long/path//that/leads/to/the/thing/thing:go_default_test. I marked //a/really/long/path//that/leads/to/the/thing:go_default_test as a test that should be skipped, and when I tried to run it by itself, it got skipped. Then, when I ran all the tests under //a/really/long/path//that/leads/to/the/…[5] and looked at the results, it was still skipped. But I didn’t notice it.

What I did notice was //a/really/long/path//that/leads/to/the/thing/thing:go_default_test. And because the prefix was the same for all the tests, my eye was drawn to the suffix. And I saw /thing:go_default_test, which I wasn’t expecting. It wasn’t supposed to be there. And because I was task-focused I completely ignored the prefix. Instead of recognizing that these were two different tests, I thought the tools were broken. And down the rabbit hole I went.

To wrap this up, the real problem was that, because of the stutter, I had marked the wrong test. I was excluding a test that was fine, but running a test that sometimes failed. Then, when debugging it, I never noticed that there were two tests with almost identical names.

The moral of the story? First, beware of target fixation. Instead of looking at the bigger picture, I got stuck looking for how the tools were broken. Second, make it easier on yourself and don’t stutter in the first place. There didn’t need to be two tests that started with the same long prefix and ended with the same medium-sized suffix, with just a tiny difference in the middle. That was a choice we made. We should have chosen better. If the test names weren’t so similar I wouldn’t have missed the obvious.


  1. A bunch of nouns and adjectives that describe a thing. Generally, a good way to name a data item. ↩︎

  2. Methods should be named for what they do. If you find you want to put “and” in the name, you’re probably wrong. ↩︎

  3. The real Hungarian Notation wasn’t about prefixing with base types, it was about prefixing with intent. And that’s not necessarily a bad thing. ↩︎

  4. Not only do we use bazel, we use gazelle, and we’ve been using it for a while, which makes things more complicated. ↩︎

  5. For the uninitiated, that weird notation basically means “all of the tests defined in or below this path”. ↩︎

by Leon Rosenshein

Optimization

I’m traveling on business this week. I’m staying in a hotel, as you do when traveling for work. As I spend time in the hotel, I’ve noticed some very interesting things and learned from them. Just like I’ve learned things from my dryer. One of the things I’ve learned is about different kinds of optimization. Take a look at this photo.

Picture of disposable silverware and soap from a hotel stay

Look at that bar of soap. It’s got three holes all the way through it. It’s maybe half soap. That’s just the hotel being cheap, right?

Look at the silverware. They’ve all got big holes in the middle of the shaft. Only the ends of the silverware are covered. That’s just the hotel being cheap, right?

Wrong. That’s optimization. The question is, what are they optimizing for? Sure, cost is one of the things being optimized for. But it’s not the only thing. There are multiple other things. Like lifespan. Usefulness. Material used. Waste. And of course, customer satisfaction.

It’s not just any one of those things. Instead, it’s maximizing the collective impact of all those things. It’s got to be low cost, because the hotel goes through hundreds of thousands of them. If they’re too expensive, the price of the room goes up. If there’s too much material used then there’s more waste, which takes up more space and costs more to handle, which also drives up the cost of a room. On the other hand, if the silverware is made with too little material, it bends or breaks too easily and can’t be used.

The bar of soap needs to last for a few days, maybe a week. But that’s not a lot of soap. If it were solid, it would be too small to be used comfortably. The machine to make a bar of soap with holes in it is more complicated than one that makes a smaller solid bar. It’s more expensive to operate. But customer satisfaction is important.

The design of the silverware and the soap takes all that into account. They’re not the cheapest they could be. They’re not the best experience they could be. But they’re good enough. And they’re cheap enough. They’re not optimized for long term home use, and they’re not optimized for camping or restaurant use. They’re optimized for short stays in a hotel.

Software is similar. It’s optimized for use in a specific context. Kubernetes is great if you need to orchestrate the deployment of hundreds of different workloads with multiple instances of each across thousands of computers. But if you’re hosting this static website then Kubernetes is a bad idea. Need to store a dozen rows for a database? A simple text file can work. If it’s a complicated schema then maybe JSON. You don’t need, or want, a Postgres or Cassandra database. On the other hand, if you need to store data related to millions of images with thousands being added every day, Postgres might be what you’re looking for.
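As an illustration of that right-sizing, here’s a sketch of a “database” for a dozen rows. The types and file name are invented for the example; the point is the scale, not the schema.

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
)

// Image is a hypothetical record type.
type Image struct {
	ID   int    `json:"id"`
	Path string `json:"path"`
}

// save rewrites the whole dataset at once. At a dozen rows,
// rewriting one file is cheaper than running a database.
func save(path string, images []Image) error {
	data, err := json.MarshalIndent(images, "", "  ")
	if err != nil {
		return err
	}
	return os.WriteFile(path, data, 0o644)
}

// load reads the whole dataset back into memory.
func load(path string) ([]Image, error) {
	data, err := os.ReadFile(path)
	if err != nil {
		return nil, err
	}
	var images []Image
	err = json.Unmarshal(data, &images)
	return images, err
}

func main() {
	if err := save("images.json", []Image{{ID: 1, Path: "a.png"}}); err != nil {
		panic(err)
	}
	images, err := load("images.json")
	if err != nil {
		panic(err)
	}
	fmt.Println(images)
}
```

When the data outgrows this, the interface (save/load) is small enough that swapping in a real database later is cheap.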

And it’s not just in choice of technology. It applies to languages, architecture, and how you deploy things as well. These are all decisions we make on purpose.

Or at least we should be doing that. That’s what software engineering is really all about. Understanding not just the problem, but the context, the situation, and the environment, that you are solving the problem in. Then finding the combination of parameters that optimizes the value produced compared to the cost of production. I can’t tell you the right answer to all of those questions without a lot more context.

But I can tell you that the answer isn’t as simple as you think. And that the answer will change as the conditions change. So be prepared.

by Leon Rosenshein

Writing Legacy Code

One of the people I regularly read is Tim Ottinger. His writing has either put into words things that I’ve felt but hadn’t figured out how to say, or said things that I’ve said, but in a much clearer/more powerful way than I have. His recent post is one of the former.

One of the dominant, less-disciplined, processes that programmers follow is:

  1. Write a bunch of code (being very careful, of course)
  2. Edit it until it will compile.
  3. Debug it until it gets the right answers.
  4. Look it over for other errors and correct them.
  5. Send it off to code review

If they write tests, they usually do so between (4) and (5).

Notice that by performing steps 1-4 first, they have placed themselves in the legacy code situation.

You can read the long form version on his website. That article has a lot to say about TDD, and, as usual, I agree with all of it.

But what I really want to talk about here is New Legacy Code.

Or more specifically, the fact that so often, we do it to ourselves. I see it all the time. Code that gets more and more complex as time goes on. Where the lines between domains blur and functions start to do very different things depending on what parameters they’re called with. Where abstractions start leaking details up and down the call stack. Making changes (and testing) harder and harder as time goes on.

It starts with little things. Someone adds a special case based on one or two parameters, or skips a step if we’re in some particular state. After a while the code starts looking like an arrow, with all the nested conditionals. Instead of refactoring and isolating things, we make new connections to a database or call time.Now() directly. We take a function that did the one thing its name described and add things to it until a better name would be doTheOldThingThenDoTheNewThingUnlessConditionXorYorZ().
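A hypothetical sketch of where that ends up (invented names, not anyone’s production code). The nesting forms the arrow, and the direct time.Now() call means the Saturday branch can only be exercised on a Saturday:

```go
package main

import (
	"fmt"
	"time"
)

// Order is a hypothetical type for the example.
type Order struct {
	Preferred bool
	Total     float64
}

// Discount shows the drift: nested special cases form the arrow,
// and calling time.Now() hard-wires the wall clock into the logic.
func Discount(o Order) float64 {
	if o.Preferred {
		if time.Now().Weekday() == time.Saturday {
			if o.Total > 100 {
				return 0.15
			}
			return 0.10
		}
		return 0.05
	}
	return 0
}

// DiscountAt flattens the arrow into guard clauses and takes the
// clock as a parameter, so tests can pick any day they like.
func DiscountAt(o Order, now time.Time) float64 {
	if !o.Preferred {
		return 0
	}
	if now.Weekday() != time.Saturday {
		return 0.05
	}
	if o.Total > 100 {
		return 0.15
	}
	return 0.10
}

func main() {
	sat := time.Date(2024, 6, 1, 0, 0, 0, 0, time.UTC) // a Saturday
	fmt.Println(DiscountAt(Order{Preferred: true, Total: 150}, sat))
}
```

Same behavior, but the seams stay visible, and the next special case has an obvious, testable place to go.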

I’m guilty of it. I get impatient. I have a problem. I have an idea for a solution. I start working on the solution. And of course, it’s not quite right, and I find myself in the legacy code situation. It doesn’t really feel like legacy code because at that moment I’ve got it all in my head, so making changes with context is easy. But months later, or even the next day, when I come back to it, all that context is gone. And I need to back my way into it.

It is undisciplined. I know better. Sometimes I forget. Usually I remember and do something about it, but sometimes I forget. Sometimes I get lazy and choose not to. Either way, I always regret it when I don’t.

So go read Tim’s thoughts in his own words.

And remember that much of the pain you’re feeling when working without a net is self-inflicted.

by Leon Rosenshein

Who Do You Love?

On the topic of leading with the why, emotions are important. Two of the strongest emotions are love and its not-quite opposite, hate[1]. Three of the biggest drivers of innovation are necessity, laziness, and love. If you want a strong driver of innovation, love is a pretty good choice.

Like George Thorogood asked, the question is, Who Do You Love? Answer that question and you’re on your way. Get the whole team loving the same thing and you’ll be amazed at what you create. So, who should you love? In Inspired: How to Create Products Customers Love, Marty Cagan gives us the answer.

Fall in love with the customer’s problem, not your solution.

That’s it. It’s that simple. Love (and hate) your customer’s problem. Be passionate about it. If that problem is the most important thing in your universe, and you hate it so much you want to eliminate it, your customer is going to be surprised and delighted.

It means understanding the domain and the problem so well that you understand not just what happens, but why. You understand the fundamentals and the theory behind them. You know not just what the problem is, but why it exists.

What you don’t want to do is fall in love with your solution. Yes, you should deeply understand your solution. What it is, how it works, and why it works.

But you also need to remember that your solution is not the important thing. It’s just a means to an end. A way to eliminate the customer’s problem.

The problem with loving your solution is that it becomes the important thing. You start seeing it as the solution to not only the customer’s specific problem, but the reason you’re doing the work in the first place. And that’s just wrong.

When the solution is the most important thing, you start rephrasing the problem to match the solution. You stop delighting your customer. Instead, you start teaching your customer why they’re misunderstanding their problem, and why they should listen to you about what the problem really is[2]. And that’s the tail wagging the dog.

When you and your team share a why of love (and hate) for the customer’s problem, you’ve tapped into some of the most powerful emotions and drivers of innovation.

Which will result in delivering maximum value for the customer.


  1. Yes, yes, yes, love and hate are in many ways opposites. However, the opposite of love is also indifference. Love and hate are strong emotions directed at a person/idea. The opposite of a strong emotion is no emotion, indifference. ↩︎

  2. Loving the problem doesn’t mean you’ll never have to educate your customers. Very often your outside view gives you an insight that customers are too close to the problem to see. The difference is that you’re not educating them on how wonderful your solution is, you’re helping them understand the problem better. After that, the solution comes naturally. ↩︎

by Leon Rosenshein

Practice Makes Perfect?

As I heard in marching band, don’t practice until you get it right, practice until you can’t get it wrong. It was certainly true there. It’s mostly true in Software Development too. Not that we’re doing the same thing, day in and day out, like a marching band, but there are a lot of processes we practice every day, and getting better at those processes, learning them so well that we can’t get them wrong, is a good thing, right?

Well, It Depends. Generally, it’s better to improve your understanding of, and ability to use, those processes. But not always. Consider this quote:

Do you want to get better at what you’re doing, or find a better way to get the results you want?

That’s a pretty powerful question. It cuts right to the heart of the matter. As I often ask, “What are you really trying to do?” Your goal is to add value. As much value as you can over time. Sometimes you can do that by getting better at doing what you’re doing. But sometimes, you can add even more value by rethinking the situation and approaching it differently[1].

Here’s an example for you. Over the years I’ve built about half a dozen batch/parallel processing systems. The first few were very bespoke. They were ad-hoc, distributed build systems for game assets. Turning complex 3D models built with expensive modeling tools into runtime versions that could be rendered in-game quickly. No options on where sources would be found or where to put results. Run one command and hope it worked. Retry it if it didn’t. That was about it.

Over the years they got more generic. Multi-step pipelines. Then map-reduce like things. We added some error handling. We figured out how to do steps that involved people. We started to track the data so we could re-run things based on changed data, automatically. We got better and better at handling different teams’ specific needs and workflows. But through all that, changes to the pipeline required changes to the system, which only the framework team really understood. And that bottlenecked the overall system.

Then we had an epiphany. We were getting better at understanding our customers and their needs, but we would never understand them as well as they did. No matter how much better we got, we were rapidly approaching a capability limit.

So instead of getting more detailed, we got more abstract. Instead of doing things for our users, we gave them more power and let them do things themselves.

We gave our users the ability to write their own control logic. They could look at their data, do some thinking and planning, then tell the processing system what their data and processing flow looked like. All we had to do was do what they wanted. And do that well.

Instead of expanding our responsibility to include someone else’s domain and doing an OK job, we restricted our domain to what we knew best. Turning a graph of data processing tasks into an optimal (ish) usage of hardware and network resources. So that our customers could do what they needed to do, in their domain.
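In a very simplified, hypothetical sketch (invented names, not the actual system’s API), the shape of that division of labor looks something like this: users declare the tasks and the dependencies between them, and the platform owns the execution.

```go
package main

import "fmt"

// Task is what a user declares: a name, the tasks it depends
// on, and the work itself. (Hypothetical types for illustration.)
type Task struct {
	Name string
	Deps []string
	Run  func()
}

// Execute runs tasks in dependency order. A real scheduler would
// also parallelize independent tasks, detect cycles, and retry
// failures; the division of labor is the point here.
func Execute(tasks []Task) {
	byName := map[string]Task{}
	for _, t := range tasks {
		byName[t.Name] = t
	}
	done := map[string]bool{}
	var visit func(name string)
	visit = func(name string) {
		if done[name] {
			return
		}
		done[name] = true
		for _, dep := range byName[name].Deps {
			visit(dep)
		}
		byName[name].Run()
	}
	for _, t := range tasks {
		visit(t.Name)
	}
}

func main() {
	// The user describes their flow; they never touch scheduling.
	Execute([]Task{
		{Name: "publish", Deps: []string{"render"}, Run: func() { fmt.Println("publish") }},
		{Name: "ingest", Run: func() { fmt.Println("ingest") }},
		{Name: "render", Deps: []string{"ingest"}, Run: func() { fmt.Println("render") }},
	})
}
```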

That’s just one example of how re-imagining the situation and really solving the user’s problem is what we want to do. There was no way we could be good enough at everyone’s domain and be the best at our own domain at the same time. Getting better at understanding other teams’ domains helped, but there’s a limit to how well we could understand them. There were multiple teams that we needed to support at the same time. We just couldn’t do all of it well.

So, instead of trying to get better at the wrong thing, we made sure we understood the other teams well enough to know what they needed, then we focused on our strength[2]. That’s what let us add even more value.

Even more problematic, there’s often no path that keeps adding value while you switch from getting better at what you’re doing to getting good at doing something more valuable[3].


  1. For those mathematically inclined, it’s the difference between a local maximum and a global maximum. ↩︎

  2. More on this later, but for those interested, check out Now Discover Your Strengths ↩︎

  3. Step changes (0 -> 1) are hard. But that’s a topic for another day. ↩︎

by Leon Rosenshein

Power Dynamics

Positional power is an interesting thing. The HIPPO effect is real. When you’re in a position of power, by definition, you have that power. Whether you’re aware of it or not. Whether you use it intentionally or not.

Of course, with great power comes great responsibility. One of the hardest things to remember is that because of that power, people don’t always hear things the way they are intended. And as the person who’s trying to communicate something, it’s on you to make sure the message gets across. The misinterpretation is your fault, not theirs.

Cartoon from workchronicles.com. Frame 1: Boss makes an offhand remark about color on a website. Frame 2: Team hears the comment. Frame 3: Team assumes there’s a problem with the color and starts a deep investigation. Frame 4: Two weeks later team presents the results of their research. The boss has no idea what they’re talking about or why they did it.

One of the most famous examples of this is from the movie Becket, where King Henry II exclaims “Will no one rid me of this meddlesome priest?”, and the next day, the priest is dead. While this almost certainly didn’t happen exactly as depicted in the film, it’s not hard to imagine that something similar did. Did the King order his friend’s death? Not directly. However, if the King of England, arguably the most powerful man in Europe at the time, complained about something, it’s not at all surprising that his faithful followers did something about it.

Medieval England is not the only place and time that happened. It happens all the time. The less contact a person has with someone with significantly more positional power[1], or the bigger the power differential, the more likely the subordinate is to listen very closely to the words spoken and do something about them. Whether the person speaking is a King, a General, or just someone with strong influence over your paycheck.

So as the person with that kind of power, it’s critical to think about how your words are heard. That doesn’t mean you shouldn’t say anything. That doesn’t mean you shouldn’t be relaxed and make comments or say what you think. It does, however, mean that you need to think about how your words are perceived. And you need to be explicit about the difference between orders, suggestions, questions, and personal opinions.

I learned this the hard way myself, back when I was a new manager. I asked one of the developers on my team why they had chosen to implement things in a new way when we already had an implementation that was very close. I commented that I would probably have just tweaked the existing functionality. At the time, to me, it was just a comment.

The next Monday I got to the office to find that the developer had done a major refactor of the old code to support the existing functionality and be able to handle the new situation. The new code was good. Clean boundaries, no repetition, and very flexible. And over the next 4 years we never used that flexibility. We did occasionally have to go back in and separate the functionality even further. The developer gave up a weekend of their personal time to make a change that wasn’t needed, and in fact, cost us time later. All from a casual comment I made.

Because without enough context, it’s hard to distinguish between a personal opinion, a suggestion, and explicit direction. And in the absence of clarity, people often defer to power.


  1. It’s not just positional power that can skew interpretation. Situational power can do the same thing, as can reputation. Or even volume. ↩︎

by Leon Rosenshein

Exsqueeze Me?

With all due respect to Mike Myers as Wayne Campbell, I saw something on the internet and the only possible response was Exsqueeze Me? The quote started out OK, not great, but OK, then, right there at the end, it took a sharp left into crazy town.

Good teams can and will delete tests that have high false positive rates – or that never fail.

Here’s the thing. If you’ve got a test with a high false positive rate, that’s bad. Flaky tests are very bad. Bad in many dimensions. To list just a few of them,

  • They waste your time: Every time a test fails, you’re expected to go analyze what failed and why. Then figure out how to prevent that specific issue. However, if you go analyze the failure and it turns out there really wasn’t a problem, you’ve wasted however long it took you to figure out there wasn’t a problem.

  • Failing tests become business as usual: It’s the broken window effect. When no tests are failing then a sudden failing test is noteworthy. When you have a test that randomly fails then an additional failing test isn’t nearly as noticeable. If it wasn’t important enough to fix the only failing test, then each additional failing test is that much less important.

  • Alert (or Alarm) Fatigue is real: When the siren sounds it’s supposed to be unusual and noteworthy. If your alarm is going off all the time then it’s not an alarm, it’s just life. Just like the boy who cried wolf, if the alarm keeps going off and there’s nothing you can, should, or must do, you start to ignore it.

  • Flaky tests indicate a lack of understanding: It could be a lack of understanding of the domain, the environment, the test setup, or any combination of those three. If you don’t understand the system and situation in this specific case, what else aren’t you understanding? What are you missing that’s going to cause you problems later on?

That’s just some of the reasons flaky tests are bad. Deleting them isn’t the worst thing you could do, and it will fix the first three problems above, but it doesn’t do anything to fix the fourth. In fact, it just hides the problem. Ignoring a problem rarely makes it go away.

Therefore, most of the quote is almost correct. Instead of just removing a flaky test, a much better response is to fix the test so that it’s not flaky. It could be a bug in the code, a bug in the test, a problem with your test methodology, or a lack of understanding. Whichever it is, once you make the test pass consistently, you’re in much better shape. You don’t waste time. You’re incentivized to keep things clean. Alerts mean something. You understand your situation that much better. Which means you get to sleep that much better at night.
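For instance, a common source of flakiness is a test that races the real clock. Here’s a hypothetical sketch (invented cache, invented names) of fixing the test by fixing the design: inject “now” instead of sleeping and hoping the scheduler agrees.

```go
package cache

import (
	"testing"
	"time"
)

// Cache is a minimal expiring cache whose notion of "now" is
// injected, so tests never depend on the real clock.
type Cache struct {
	ttl   time.Duration
	now   func() time.Time
	items map[string]entry
}

type entry struct {
	val     string
	expires time.Time
}

func New(ttl time.Duration, now func() time.Time) *Cache {
	return &Cache{ttl: ttl, now: now, items: map[string]entry{}}
}

func (c *Cache) Set(k, v string) {
	c.items[k] = entry{val: v, expires: c.now().Add(c.ttl)}
}

func (c *Cache) Get(k string) (string, bool) {
	e, ok := c.items[k]
	if !ok || c.now().After(e.expires) {
		return "", false
	}
	return e.val, true
}

// TestExpiry is the deterministic replacement for a sleep-based
// test: advance the fake clock instead of sleeping.
func TestExpiry(t *testing.T) {
	now := time.Unix(0, 0)
	c := New(50*time.Millisecond, func() time.Time { return now })
	c.Set("k", "v")
	if _, ok := c.Get("k"); !ok {
		t.Fatal("expected key before expiry")
	}
	now = now.Add(51 * time.Millisecond)
	if _, ok := c.Get("k"); ok {
		t.Fatal("expected key to be expired")
	}
}
```

The sleep-based version fails whenever the machine is slow; this one can’t, because the test owns time.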

It’s the last part of the quote that’s just plain WRONG. Some will say that a test that never fails serves no purpose and is wasting resources. Time. Bandwidth. Cognitive load. For no measurable benefit. Those tests haven’t stopped a single bug from getting through.

Those facts are real. All tests take time and bandwidth, and add some amount of cognitive load for developers. But all of that, for all of your unit tests, should be minimal. If they’re not minimal then you have other problems (bad tests) you should fix[1].

Just because a test hasn’t caught a bug yet, you can’t know that it won’t ever catch a bug. Even if no-one is changing the code directly, those tests can still help keep you safe. They do things like:

  • Protect against changes in dependencies: Dependencies outside of your control can change. Those changes can make your code break. If you don’t test, you don’t know.

  • Protect against environmental changes: There are lots of things in the environment that can change. Networks come and go. Clock speed changes. Processors get replaced and new ones show up. There can be subtle differences in the environment. If you don’t test, you don’t know.

  • Protect against bugs in tooling changes: Similarly, tools change. Runtime environments, compilers, and interpreters can change. Are you relying on undefined behavior? That can change without you knowing it. If you don’t test, you don’t know.

  • Provide examples of how to use the code being tested: Tests are great examples. They can be documentation. They can be used as a learning environment. They can be a reference design.

  • Acknowledge Hyrum’s Law: Given enough time and users, everything your code does, intended or not, is going to be relied upon by someone. You never want to change behavior on your users without knowing about it. That is not how you want to surprise your users.

  • Prevent bugs in the code under test: Finally, and certainly not the least important, you never know when a test is going to show you that you’ve broken something. Past performance is a good indicator of future performance, but it is not a guarantee. If you don’t test, you don’t know.

And that’s why the only possible response to someone saying you should delete tests that never fail is Exsqueeze Me?


  1. Those tests might not be flaky, but they can be bad for many of the same reasons. And you should fix them, not delete them. But that’s a slightly different post. ↩︎

by Leon Rosenshein

The Messy Middle

I’ve talked about a messy middle before, but that’s not the only messy middle. That one was about the land of It Depends. Where you can’t know the answer until you have all of the context. But there are other messy middles.

Consider the space between products and platforms. The split between what customers bought, and what the company built internally so that they could then build the things that customers bought. In almost all cases people know about the products, but almost no-one outside the company knows about the platforms.

A software layer diagram showing two distinct layers, product and platform, and a messy middle blurring the interface of the two layers

Not every company has this split. There have to be enough products, or at least enough engineers, for that to happen. At some point they become not just different things to work on, they become entirely different organizations within a single company.

It’s something you don’t see working at small, single-product companies. You often don’t see it at companies with only a few products. Back when I was working for MicroProse the company was working on multiple games at any given time, but each game development team was completely self-contained. There were a few people who worked across games, but the code for each one was essentially unique. Everything was a product.

The first place I really saw the split was at Microsoft, where there was a Windows team, that was the platform for everything else, a few other big teams, like Office, Developer Division, and SQL, and a bunch of smaller orgs, like Online, Home, Games, and Research. The odd thing there was that Windows was also a product itself.

At the time I was on the Product side, working on games, and didn’t pay too much attention to the platform side of the world. I only cared about what APIs it provided. Also, we were such a tiny part of the Microsoft world that Windows, our platform, might as well have been made by a different company. We could ask a few questions that outsiders couldn’t, but we didn’t have much influence. The Office and Developer Division teams had that.

Then I moved to Uber. When I started at Uber they had just gone through their split. By that time I was on the platform side, working deep in the engine room. At that time Uber was already big enough that we even saw differences between real-time and offline/batch processing.

You see, down in the engine room we spent a lot of time building a product agnostic processing engine. We didn’t care what folks did with our CPUs and persistent storage. We abstracted away the complexity of scheduling and placement engines and let our customers focus on data and control flow. And that’s where I ran head-first into another messy middle.

Besides making sure the system worked and was easy to use, I spent a lot of my time as an evangelist. Showing countless other teams how they could benefit from our platform. The more success I had, the more I saw that there was at least one missing layer. I saw multiple teams creating their own internal platforms, based on our platform.

Being closer to the product, those platforms were more opinionated than ours. Because they knew what their 2-5 teams were doing and didn’t care about the other 20 or 30 teams using the base platform, they were able to make assumptions and build decisions into their tools.

That’s where it started to get messy. Those little infrastructure teams didn’t really know about or talk to each other. They started to duplicate work. Or at least solve similar problems. Not just each other’s, but problems that we were solving. The line between the different teams and the platform ebbed and flowed as problems were identified and solved.

It was a bit of a surprise, but it wasn’t all bad. Yes, it was messy. Yes, there was some duplication. But there was also a lot of advancement. By not forcing groups of teams to wait for a shared solution those teams were able to move faster. By waiting for consensus on the solution to form and move into the platform we avoided churn. By consolidating known shared work we freed up resources for product specific work.

We turned what could have been a bottleneck into a competitive advantage. Through lots of communication. Of problems. Of intent. Of timelines. By keeping the cross coupling down. By encouraging teams with similar context to make those specialized solutions that kept domain specific work with the domain. By taking the common parts of those domain specific solutions and commoditizing them. By recognizing that one team’s platform is another team’s product.

by Leon Rosenshein

Speed Vs. Quality

As I’ve noted before, while I’ve worked all over the computing stack, from simulation for games and requirements analysis to 3D rendering to fleet management to tools, I’ve spent the last 15 years or so working down in the engine room. Working on internal platforms, tools, and infrastructure. And time and time again I’ve seen teams and entire organizations micro-optimize for immediate velocity only to run into a wall and find themselves stuck. I’ve seen it happen around data storage, data shipment, data ingestion, and data validation. I’ve seen it happen with custom processing engines and backwards compatibility.

A group of developers running into a wall of code

Every time, what it came down to was every case becoming a special case, so that nothing was special. Instead of developers having to understand a few simple systems that did one thing and one thing well and could be composed as needed, developers had to understand lots of highly specialized systems that were used for one thing only. Worse, those specialized systems weren’t isolated. They had unique and subtle interactions that developers had to keep in mind at all times. The cognitive load became too large, and most changes did most of what they were supposed to, but also caused downstream issues that now needed to be fixed.

Not only that, but because deadlines were always approaching, decisions were made based on those deadlines. Each one adding more coupling and cognitive load. The rate of change stayed high, but actual progress got slower and slower.

Inevitably, the cost of progress would get to be so high that someone decided to do something. Somebody would say “I’m declaring a Code Red”. People would focus on the problem, get back to fundamentals, and re-work things to both solve the problem, at the moment, but also eliminate complexity and coupling. The result was not just the solving of a specific problem, but also a step change in real progress.

I’ve experienced it multiple times. Sometimes I’ve been able to help people avoid the problem by pointing out the end result of the path they were on. But what I’ve never done is a systematic study of the problem. I’ve been too busy doing the work.

Luckily, someone else has. In March of 2022 Tornhill and Borg published Code Red: The Business Impact of Code Quality – A Quantitative Study of 39 Proprietary Production Codebases.

It’s not perfect. It’s based on Jira and Git histories for 39 projects, so it’s pretty good at determining correlation. It’s based on projects of various sizes, but none of them are all that large. It’s a cross section of languages. There’s some selection bias because all of the code bases were ones that had already chosen to at least purchase a specific code quality tool. Finally, because there are no actual experiments, it can’t prove anything about causality.

There’s lots of interesting conclusions in there, but to me, the most interesting is not about tickets per file or line of code, or how long ticket resolution took on average, but the variability of resolution time. In the code bases that were identified as the least healthy, the variability in resolution time for a ticket was 9 times higher.

graph showing that medium quality code has 2x the resolution time variability and poor quality code has 9x the variability of the high quality code

In other words, not only did it take longer to resolve issues in low quality code, but the issues were of wildly different sizes. We can’t be sure, but it’s likely that developers very often didn’t know how much work it would take to resolve the issue.

And that points at higher cognitive load. The more complex the problem, the harder it is to reason about, and the harder it is to solve. Particularly without making something else worse. And that means that you really have no idea how much work it’s going to be until you’ve actually done the work. That lack of predictability makes planning harder. And pushes you to the quick fix. Which just makes it worse.

But now, we’ve got proof that, at least in these cases, maintaining high quality and clean boundaries, avoiding coupling, and reducing cognitive load, is highly correlated with both faster execution and reduced variability.

While correlation is not the same as causation, it’s often a strong hint. And my personal experience lines up very well with their conclusions. So when someone asks you what the business value of quality code is, now you can show them the academic proof.

by Leon Rosenshein

Culture: Not Just For Yogurt

Culture isn’t just for yogurt and sourdough. It’s also for teams. Or really, any organization, formal or informal. There’s a lot that goes into culture. What your goals are. How you incentivize. How you teach. How you do things, in good times, and in bad. And perhaps most importantly, how you handle uncertainty.

Overheard on the internet:

Your “Company Culture” is more about what happens when somebody on a team says “I don’t know” than what team building events you plan.

That right there. That’s what culture really is. It’s not the perks you have or the benefits you provide. It’s not your mission or your goal. It’s not even what you do when things get hard. It’s what you do when you’re faced with uncertainty. It’s what happens when someone on your team is asked a hard question and they say “I don’t know”, or worse, make a mistake.

The first thing to realize is that if people feel confident enough to say “I don’t know” you’re at least on the right path. You’re getting honest answers, which is a prerequisite for having the kind of culture that helps people do their best. It’s not the best place to be, but it could be a lot worse.

Worse would be where, instead of saying “I don’t know”, the answer is defensive, makes excuses, or even worse, redirects blame. Worse is no accountability and everyone putting themselves first. At the expense of their co-workers, their teams, and of whatever it is they’re supposed to be doing. Worse can be paraphrased as “I don’t have to be good, I just have to make sure someone else is worse.”

people helping each other vs people holding each other back

On the other hand, high performing teams, the ones with the best cultures, don’t just say “I don’t know.” They continue from there. They finish the sentence with something along the lines of “and we’re going to do X, Y, and Z to find out.” That’s not only being accountable for where they are, that’s taking ownership for fixing the issue and moving forward. To the benefit of their co-workers, their teams, and whatever it is they’re supposed to be doing. The best can be paraphrased as “A rising tide lifts all boats.”

Assuming that’s what you’re aiming for, you might need to make some changes. Changing culture is hard. It’s been said that

In the battle between change and culture, culture almost always wins.

One of the biggest parts of building that culture is modeling honesty and trust. One easy way to do that is when someone is honest enough to say “I don’t know”, don’t just jump in and say what you want or need, or start being directive. Instead, trust them to finish the sentence with both an explanation and next steps. Offer help. Help via guidance. Help with resources. Help with the time and space to take those next steps.

While doing it yourself might get you those answers sooner, it comes with a very high cost. You don’t get done what you were supposed to be doing. The next time this sort of situation comes up (and it will) the same thing is going to happen. The only thing people have learned is that you don’t trust them.

If instead, you wait for them to finish the sentence and then make sure they’ve got what they need to find the answer, you still get to do your job. And next time, instead of them saying “I don’t know.”, they might say something like “It took a bit more work than I planned, but here’s the answer to your question.” And even if they don’t get it that time, they might get it the time after that. Or the one after that. What they will do is keep getting closer.

That’s a rising tide lifting all boats. That’s a better culture. And, you’ll have more fun at your team-building events. Because the team is trying to lift each other up, not keep each other down.