Recent Posts (page 5 / 65)

by Leon Rosenshein

0, 1, Many

Continuing on with looking at numbers, think about counting. We all know how to count. Or we think we do. But do we really think about how we should be counting? Consider the following quote.

“Common programmer thought pattern: there are only three numbers: 0, 1, and n.”     – Joel Spolsky

There’s more than a little truth to that statement. After all, from a linguistic standpoint there’s lots of precedent for it. My non-linguistic experience also tells me that there’s not just a quantitative difference between 0, 1, and N; there’s also a qualitative difference.

The qualitative difference shows up in many different ways. 0 is the same as never. That can’t/doesn’t happen, so don’t worry about it. 1 is the same as always. Count on it happening. Assume it already happened. Either way, always or never, 1 or 0, TRUE or FALSE, it’s a constant. There are no decisions needed. N, on the other hand, is maybe. You don’t know. It might happen. It might not. You can’t count on it. You need to handle it happening. You need to handle it NOT happening. Be prepared for both cases1.
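
One way to act on that difference in code is to design for N from the start. Here’s a minimal sketch (the function and names are made up): treat “how many” as a collection, and the 0, 1, and N cases all fall out of the same code path.

```python
# A minimal sketch: treat "how many" as a collection from the start.
# The same code then handles never (0), always (1), and maybe (N)
# without any special cases.

def total_shipping_weight(package_weights: list[float]) -> float:
    """Works identically for 0, 1, or N packages."""
    return sum(package_weights, 0.0)

print(total_shipping_weight([]))          # 0 packages -> 0.0
print(total_shipping_weight([2.5]))       # 1 package  -> 2.5
print(total_shipping_weight([2.5, 1.0]))  # N packages -> 3.5
```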

Another qualitative difference is that when there is a choice, it’s often not either/or, but one (or more) of many. In code that shows up as something that started as if/else, but eventually morphed into a series of if/elseif/elseif/elseif/…/else. Sure, that can work, but there are better ways. Listen to your data and let it guide you in your programming. This is where object-oriented programming, in its true sense, really comes into its own. You make the decision early about what the object is, then you just act on/with it in a linear sense. You get back to always (or never) for most decisions and let the object worry about what it means in that specific case.
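
Here’s a minimal sketch of that idea (the notifier classes are hypothetical): the “which one?” decision happens exactly once, when the object is created, and every call site after that is linear.

```python
# A minimal sketch: decide early what the object is, then act on it
# linearly. The if/elseif/elseif chain collapses into one lookup.

from abc import ABC, abstractmethod

class Notifier(ABC):
    @abstractmethod
    def send(self, message: str) -> None: ...

class EmailNotifier(Notifier):
    def send(self, message: str) -> None:
        print(f"emailing: {message}")

class SmsNotifier(Notifier):
    def send(self, message: str) -> None:
        print(f"texting: {message}")

def make_notifier(kind: str) -> Notifier:
    """The one place the 'which one?' decision happens."""
    return {"email": EmailNotifier, "sms": SmsNotifier}[kind]()

notifier = make_notifier("email")  # decide early...
notifier.send("build finished")    # ...then act linearly; the object handles its own case
```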

Then there’s the learning case. I’ve said before that versions 1 and 2 of something are easy. It’s when you get to version N that things get interesting. Again, that first version is the never case. No one has done it before, so there are no special cases. Just do something and it will be ok. Version 2 is the always case. For anyone who has used version 1, it’s always been that way. There’s no ambiguity. Everyone, on both sides, knows what to expect. It’s only when you get to version 3+ that you get into the maybe case. You don’t know what your customer has experienced. You don’t know what they expect. They don’t know what is coming from you. And as I’ve said, that’s where the learning is. Dealing with the ambiguity is where you stretch and grow.

So, whether you’re thinking about your design, your implementation, your career, or life in general, think about how you deal with counting.


  1. Hint: You might think it’s a 0/1 situation, and it might be, but our assumptions are often wrong, so think them through. ↩︎

by Leon Rosenshein

1 > 2 > 0

I’m pretty sure this story is true, because I’ve heard it so many times, sometimes from people who could have been there. The people involved and the timeline match up. And even if it’s not true, there’s still something to learn.

Amazon has always been about 2-pizza teams. You should be able to feed an entire team lunch with 2 pizzas. The idea is to keep them agile and innovative. To minimize communications delays and bottlenecks. It works pretty well too. It says nothing about the software architecture, only the scope of responsibility of a team.

Back around 2002, Amazon’s internal system was a large monolith. And there were lots of 2-pizza teams trying to work on it at the same time. It was pretty well designed, but with that much going on there were lots of interactions and coupling between teams and the work they were doing. So much that it really started to slow things down. It got so bad that Jeff Bezos issued an ultimatum.

All teams will henceforth expose their data and functionality through service interfaces.

That’s a pretty bold requirement. It meant that everything needed to change. From the code to the tooling to the deployment processes and automation. Over the next 2-3 years, Amazon did it. They changed to a Service-Oriented Architecture that endures to this day. It broke a lot of the direct coupling that had crept into the monolith. And it led directly to the creation of AWS and to AWS becoming the dominant cloud platform. A cloud platform that can do everything from hosting Netflix’s compute and storage to hosting this blog.

It did that by clearly defining boundaries and letting teams innovate inside those boundaries to their hearts’ content. But it also led to some new problems. Because each team was responsible for everything inside those boundaries, teams started to write their own versions of things that were once shared libraries. And we all know that duplicated code is bad. You duplicate bugs. It makes updates hard. Teams ended up with code they were responsible for that they didn’t necessarily understand.

Enter Brian Valentine. He’d recently joined Amazon (in 2006) as a Senior VP, coming from Microsoft, where he’d led, among other things, the core Windows OS team. A huge organization with thousands of people developing the hundreds of libraries and executables that made it up. He looked at what was going on and realized that lots of teams were writing the same code. That there were multiple implementations of the same functionality scattered throughout the codebase. That it was inefficient and wasteful and that those sorts of functionality should be provided by a set of core 2-pizza teams so that the other teams could focus on their specific parts of the business.

He worked with his team and his peers to define a system where those core teams would be identified and created, then all the other teams would start using them. They wrote the 6-pager that defined the system, how it would be created, and all the benefits. Eventually it got to a meeting with Jeff Bezos, then Amazon CEO. I believe everything up to this point is true. Here’s where it gets apocryphal. I want to believe it’s true, but I just don’t know.

After the required reading, Valentine summarized the point of the meeting by writing a single line on the whiteboard:

1 > 2

Huh? One is definitely not greater than two. One is strictly less than two. What Valentine meant was that having one way to do some shared bit of functionality that is actually shared, not copied/reimplemented, is better. It’s more efficient. It means less duplicated effort. It lets teams focus on their value add instead of doing the same thing everyone else was doing. That makes sense. So that’s what Amazon does now, right?

Nope. After Valentine said that was how things should be done and stepped back, Bezos stepped up and changed it slightly. To

1 > 2 > 0

What? That makes even less sense. Two is greater than 0, and so is one, but two is not between one and zero. What Bezos was saying was that having one solution might be better than having two, but waiting for some central team to build something new or update some existing service to have a new capability takes time. It adds coupling back in. It makes things more complex. And while you’re waiting for the central team to do its part, the dependent team can’t do anything. So for some, potentially long, period of time, you don’t have 1 solution, you don’t have 2 solutions, you have 0 solutions. And having zero solutions is even worse than having multiple solutions. The plan pretty much ended there. Like they say in the Go proverbs, “A little copying is better than a little dependency.”

Which is not to say you never go back and centralize common things. Amazon doesn’t expect every team to write their own operating system. They don’t write their own S3 access layer. They use Linux, or the AWS SDK. And when there is a critical mass of people doing something common then you write a centralized library that is shared. Or write a new service with a defined interface that everyone should call.

The trick is to do it at the right time. After there are enough instances to let you know what to build, but before there are too many versions to be able to replace them all with the central version.

by Leon Rosenshein

What You Do Next

“If you hit a wrong note, it’s the next note that you play that determines if it’s good or bad.”     – Miles Davis

That’s pretty deep. And it applies to things very far removed from jazz. Things that are very structured and precise. Things like code. Or, maybe, software isn’t as structured, precise, and linear as we think.

Before you get upset and tell me I’m nuts (which may be true, but doesn’t matter here), I want to be clear. Writing code that is well structured and precise is important. Being clear and understandable is important. Separation of concerns and listening to the data is important. Big ball of mud development is not what I’m talking about. I’m talking about how we write code, not the code we write.

Consider this alternate phrasing of what Miles Davis said.

You learned something new. How you respond determines if the knowledge was good or bad.

Those two sentences pretty much define the software development process. Everything else is an implementation detail. Waterfall or agile, greenfield or brownfield, startup or established in the industry, it doesn’t matter.

Put another way, it’s the OODA loop. Observe (see where you are). Orient (understand the situation). Decide (choose the next step). Act (do it). Because you are where you are, and the situation is what it is. Your involvement in getting there (Miles’ wrong note) doesn’t matter anymore. The only thing that matters is what you do next, and how often (or how fast) you run the loop.

Think about how empowering that is. The immutable past has happened. You can’t do anything about it. But you have a lot of control over the future. You have agency. You have power. You have the ability to make sure that the next step puts you in a better place than you are now.

If your next action moves you closer to your goal than where you currently find yourself, then your move was good. Whether you find yourself closer or farther1, run the loop again. And keep running it. Eventually you’ll find yourself at your goal. The more experience you have in the field, and with the situation, the more likely your choice of action will move you closer to your goal.

That applies whether you’re playing jazz or writing software. And now that I think about it, to life in general.


  1. Closer or farther in the minimizing-time-and-effort sense. Sometimes refactoring, which takes time, appears to be irrelevant, and has no visible impact, is actually what gets you to your goal with less time and effort. ↩︎

by Leon Rosenshein

What Are You Testing? II

I’ve written about what you’re testing before. That article was about writing tests in a way that you could look at the test and understand what it was and wasn’t testing. It’s about writing readable and understandable code, and I stand by that article.

However, there are other ways to think about what you’re testing. The first, and most obvious, way is to think about what functionality the test is validating. Does the code do what you think it does? That’s pretty straightforward. You could think about it the way I described in that article, in terms of how readable and understandable the test is. Or, you could think a little more abstractly, and think about what part of the development process the test was written in support of.

Taking inspiration from Agile Otter, you can think about the test at the meta level. What are you writing the test for? Is the test to help you write the code? Is the test to help you maintain the code? Is the test supposed to validate your assumptions about how users will use your code? Is the test supposed to characterize the performance of the code? Is the test supposed to help you understand how your code fails or what happens when some other part of the system fails? The reason you’re writing the test and the requirements you have for it help you write it. They help you know how to write the test and what a successful result looks like.

[Slide from Industrial Logic: “So, only write fast automated tests, right? No. Only fast, automated microtests will support refactoring. Other tests support system correctness, release-worthiness, etc.”]

Why we write tests.

A set of tests written to validate the internal functionality of a class, with knowledge of how the class operates, has very different characteristics from a test written to validate the error handling when 25% of the network traffic times out or is otherwise lost. Tests written to validate the public interface of that class also look and feel different. They all have different runtime characteristics as well and are expected to be run at different times.

Knowing, understanding, and honoring those differences is key to writing good tests. The tests need to be correct not only in the sense that they test what their names say they do, but also in the sense that their results accrue to the meta goal for the test. Integration and system level performance tests are great for testing how things will work as a whole, but they’re terrible for making sure you cover all of the possible branches in a utility class. Crafting a specific input to the entire system and then expecting that you can control exactly which branches get executed through a call chain of multiple microservices and database transactions is not going to work. You need unit tests and class functionality tests for that. The same thing applies if you need to do some refactoring. Testing a low level refactor by exercising the system is unlikely to test all of the cases and will probably take too long. On the other hand, if you have good unit tests for the public interface of a class, you can refactor to your heart’s content and feel confident that the system level results will be the same.
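
As a minimal sketch of that last point (the Stack class and its tests are made up for illustration): nothing below touches the internals, so the implementation is free to change underneath the tests.

```python
# A minimal sketch: tests pinned to the public interface only. Because
# nothing here reaches into the internals, the implementation can be
# refactored freely and these tests still vouch for the behavior.

import unittest

class Stack:
    def __init__(self):
        self._items = []  # internal detail, free to change in a refactor

    def push(self, item):
        self._items.append(item)

    def pop(self):
        if not self._items:
            raise IndexError("pop from empty stack")
        return self._items.pop()

class TestStackInterface(unittest.TestCase):
    def test_last_in_first_out(self):
        s = Stack()
        s.push(1)
        s.push(2)
        self.assertEqual(s.pop(), 2)
        self.assertEqual(s.pop(), 1)

    def test_pop_on_empty_raises(self):
        self.assertRaises(IndexError, Stack().pop)

if __name__ == "__main__":
    unittest.main()
```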

Conversely, testing system performance by measuring the time taken in a specific function is unlikely to help you. Unless you already know that most of the total time is taken in that function, assuming you know anything about system perf from looking at a single function is fooling yourself. Even if it’s a long-running function. Say you do some work on a transaction and take it from 100ms to 10ms. A 90% reduction in time. Sounds great. But if you only measure that execution time you don’t know the whole story. If the transaction only takes place 0.1% of the time, and part of that workflow involves getting a user’s response to a notification, saving 90ms is probably never going to be noticed.
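
The arithmetic behind that intuition is essentially Amdahl’s Law, and it’s worth writing out with the numbers from above:

```python
# The arithmetic above, written out (this is essentially Amdahl's Law):
# a 10x speedup of something that is only 0.1% of the overall workload
# barely moves the end-to-end number.

fraction_of_time = 0.001  # the transaction is 0.1% of overall time
local_speedup = 10        # 100ms -> 10ms within the transaction

overall = 1 / ((1 - fraction_of_time) + fraction_of_time / local_speedup)
print(f"overall speedup: {overall:.4f}x")  # ~1.0009x -- nobody will notice
```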

So when you’re writing tests, don’t just remember the one thing you’re testing, also keep in mind why you’re testing it.

by Leon Rosenshein

Release Stabilization

Release stabilization is traditionally the period at the end of a development cycle where the team minimizes change and spends time fixing things and making sure the result is stable. That sounds like a good thing, doesn’t it? And compared to its most obvious opposite, release de-stabilization, it’s definitely a good thing. Before you release any software, executable, library, website, whatever, you want to know that it’s good and that it’s stable and resilient to whatever the real world will throw at it. When thought of that way, it’s something we should always do.

On the other hand, there’s a different opposite state that’s implied by that term. A term that makes me think really hard about how we develop software in general. That other state is Development Instability. If we need to take some time at the end to make the software work, what the hell were we doing all the time before that? Working on job security? Of course not. But…

“Release stabilization?” I don’t understand. Why did you choose to make it unstable? In what world does that make sense?     – Kent Beck

I know at least one reason. The drive to get things done. Or at least call things done. And be able to close the Jira ticket. Because often, in the short term, that’s how we’re measured. And as Eli Goldratt said,

Tell me how you will measure me, and then I will tell you how I will behave. If you measure me in an illogical way, don’t complain about illogical behavior.

That doesn’t make it the right thing to do though. It’s a classic local maximum issue. The fastest thing I can do right now is the least amount of work needed to be able to close the ticket. Even if that means I leave a bunch of work for tomorrow (the release stabilization period). If it were as simple as delaying some work until later, we’d be OK, and it wouldn’t be a local maximum.

Unfortunately, it’s not just delaying some work. The work we’re delaying isn’t just moved to the end of the project, it also slows down progress for the rest of the project. Every time we delay work, we make things a little slower in the future. If you think of it as technical debt, which isn’t a bad metaphor, eventually the interest becomes too high and you go bankrupt.

You can hit a local maximum the other way too. You can spend too much time making some small thing perfect. You think by doing that you’ve made it so you don’t need the time at the end, and don’t have the problem of slowing yourself down in the future. It’s a good thought, but again, there’s a problem. You spend all that time making it perfect, given what you know about the rest of the system. Then you learn something new about the system and what was perfect becomes not perfect. So you perfect it again. Then you learn something new. And the same thing happens. It turns out there’s a name for this, Rework Avoidance Theory, and it doesn’t work either. You end up slowing yourself down because you need to keep changing things as you learn more about what you’re doing.

We know these things. We struggle against doing them. Like everything else in software (and life), the right choice at any specific time is dependent on the context. I can’t tell you what you should do in any specific context, but I can tell you that the right way to approach the problem is to be aware of your context. To think about which work to do now based on what you know, and which work to leave for later when you know more.

I can’t be sure, but experience tells me that you should probably be doing a little more work after you think you’re done, not less. There might be some extra conditions/experiments you run when you get closer to the end, but overall, doing more along the way and less stabilization will probably get you to the finish line sooner.

by Leon Rosenshein

Average Practice

I’ve talked about best practices a bunch of times. Generally, in praise of them, with the caveat that context is important. What was the best practice for one team, in one specific situation, might not be the best practice for you in your situation.

Over time, best practices get diluted. As they get described and implemented with less and less connection to the original context they came out of, they get less specific and focused. They become generic advice that applies everywhere. Also, by the time you hear about them, the team that came up with them has probably changed what they’re doing. Because their situation changed, the practices they follow have also changed, optimized for that new situation.

Blindly following best practices doesn’t mean you’re doing what the best teams (whatever that actually means) are doing or that you’re going to get the same results. Think about it. Most documentation of best practices is so lacking in specifics that it’s easy to convince yourself that you’re already following them. And if they do have specifics, the specifics are so focused on the original situation that you can’t apply them in yours. What you end up with isn’t the best practice for you. It’s an average practice that is probably a very good idea.

Don’t get me wrong. I’m not saying you should ignore best practices. You shouldn’t ignore them. Instead, you should look closely at them. They may not tell you exactly what to do in order to get the same results as their originator, but they usually provide a very good baseline. Instead of considering them a goal and something to strive for, think of them as defining a baseline that is pretty good, but could be adjusted and optimized to provide even better results for your specific situation.

So don’t avoid best practices. The exact opposite in fact. Embrace them. Embrace them so hard you understand what problem they were created to solve. Then look at how they actually contributed to solving the problem, in the original context. Then look at how that context is the same as yours and how it differs. Then, and only then, should you start to think about how to apply a modified version of the best practice in your situation.

Consider the idea of code reviews. At some of the biggest tech firms, Microsoft, Google, Amazon, Meta, etc., there are bespoke tools for managing code reviews and ensuring that (almost) every change is reviewed by the right folks in a timely fashion. These companies all have billion-dollar market caps and huge profits. Therefore, if you want to have that kind of success, you need to build your own bespoke system to do the same, or even better, find the FOSS version of one of those systems and do exactly what they did, right?

Wrong. First, what problem was that system built to solve? What were the incentives and counterincentives in the environment it was built in? Consider the Microsoft Windows team. Thousands of developers, broken into larger and smaller orgs that need to work together. That need to share knowledge. That need to protect themselves from each other’s best intentions. They need to build thousands of executables and libraries, in concert, quickly, and test them separately and together. The result is the Virtual Build Lab and all of the tooling and infrastructure around it. Microsoft has a large team of people whose only job is to run the system that does builds of Windows. Hundreds of folks who make sure the right changes go into the right libraries and executables and are then packaged up into some deployable thing that can be run through a set of automated tests, which are built and maintained by an equivalent team. So if you want to build and deploy something as ubiquitous as Windows, you need to do the same thing.

Or not. Do you have thousands of developers working on the same product? Do you have millions of lines of source code? Do you need to maintain backward compatibility with 20-year-old software and hardware? Can you afford to dedicate 100 people to running your build system? Probably not, so why try?

Instead, look at what core business problems they’re trying to solve. Allow different teams to develop at their own pace. Make sure all teams have access to the latest code. Make sure builds don’t slow everyone down (too much). Decouple teams, but make sure the right people look at the right code before it gets into the system. Once you understand what they were trying to do, decide how much of that you care about.

Maybe decoupling with proper oversight is the problem you really want to solve. So solve that problem. At your scale. With the tools you already have at your disposal. Augment those tools as needed, but only with what’s needed. Something that assigns and requires specific reviewers based on the section of the codebase. Or maybe mob/ensemble programming solves the real problem, and you don’t even need to have additional code reviews after the code is written.
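
That “specific reviewers based on the section of the codebase” idea is simple enough to sketch (the paths and team names below are made up; GitHub’s CODEOWNERS file is a real-world version of the same idea):

```python
# A minimal sketch: require reviewers by codebase section. Paths and
# team names are made up for illustration. (GitHub's CODEOWNERS file
# is a real-world version of this idea.)

REQUIRED_REVIEWERS = {
    "billing/": "payments-team",
    "infra/": "platform-team",
}

def reviewers_for(changed_files: list[str]) -> set[str]:
    """Every team that owns a touched path must review the change."""
    return {
        team
        for path in changed_files
        for prefix, team in REQUIRED_REVIEWERS.items()
        if path.startswith(prefix)
    }

print(reviewers_for(["billing/invoice.py", "infra/deploy.sh"]))
# -> {'payments-team', 'platform-team'}
```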

Remember, the goal is to solve the business problem and provide value, not to implement what someone else did. So next time someone tells you that you should do what some big company does, make sure you know why they did it, make sure you want to have the same result, then use their best practice as a baseline and starting point for coming up with a best practice that works for you, in your situation.

by Leon Rosenshein

Green Fields And Platforms

I recently had the opportunity to work on a greenfield project. You know, the kind of project where you recognize a common problem a set of users have and you have an idea of how to solve it for them. It can be a great space to operate in. You have a green field to build your castle in. Sure, you’ve got the eternal constraints of time and resources, but other than that you’re free to innovate.

You get to choose your architecture. Need lots of extensibility governed by a common control? Maybe use a micro-kernel architecture. Have lots of different components that need to scale differently? Try microservices. You set the requirements and you decide on the architecture.

You get to examine the problem and solution space and figure out what the bounded contexts are. You pick the domains that drive the design. You pick the nouns (objects) and verbs (actions/API). You get to do it in a way that makes sense to you.

I was in that situation. We chose the language. We chose the architecture. We chose the domains. We built a proof of concept that showed us we were on the right path. Things were looking good.

Then we ran headlong into an observation Tolstoy made in Anna Karenina.

All happy families are alike; each unhappy family is unhappy in its own way.

We were trying to build a platform for others to build their solutions on. Like a good platform, one of the things we were doing was hiding complexity. We were trying to make it easier for our potential customers to focus on doing the things that added value and that their customers wanted. We wanted to make it so that they didn’t need to worry about the complexity and details of configuring and providing low latency connections between any two of a few hundred endpoints.

We spent some time talking to potential customers and it turns out that Tolstoy was right. At a high level, all of them were doing the same kinds of things. It was easy to see how a single platform could provide a foundation for all of them to build on and be happy. However, when we dug into the details of what they were unhappy about, it wasn’t as uniform.

Sure, they all ran into the same problems, and they all used most of the same tools to approach a solution. Unfortunately for us though, the solution to those common problems wasn’t the same across customers. And they weren’t just different in the details and approach, they were also different in how those solutions integrated with the rest of their systems.

The domains had different shapes. The bounded contexts had similar names, but the boundaries were different. The set of verbs they used were generally synonyms of each other, but because the domains and contexts were different, they were incompatible. They were all unhappy with their bespoke solutions, but they were all unhappy in very specific, and different, ways.

Getting rid of that unhappiness still made sense as a solution, but the field wasn’t as green as we thought. Sure, we could have whatever architecture we wanted on the inside, but along the edges we found a whole new set of constraints. We had to (mostly) match existing code and APIs. Much more of a brownfield project than we originally thought.

As a platform we needed to make customers’ lives easier. And one of the big things we needed to do to make their lives easier was make sure the transition from their bespoke solutions to our framework was straightforward and painless. We realized that doing that had just become our biggest problem. In some ways it was a bigger problem than the original problem our platform was supposed to solve. Instead of just needing to hide the underlying complexity, we needed to hide the complexity of building a single API that could support the workflows and patterns of multiple different approaches to the problem.

To solve the problem, we kept reducing the scope of the initial version. Instead of a full-featured platform that could drop in, we focused on a very small subset of the problem space. Something we knew was important enough that all of our customers would want the change. Something we knew was separable enough that our customers could accept the change without having to do too much architectural surgery on their existing code. Something that would be the thin edge of the wedge and let us insert our platform into our customers’ codebases and then expand out from there.

So next time someone tries to convince you that you’re going to be working on a greenfield project, remember that even the greenest of fields has edges, and the shape and context of those edges will turn what you think is a greenfield project into a brownfield project.

That doesn’t mean the idea is a bad one or that you should run away from it. On the contrary, it can be a great opportunity to learn a new domain, add lots of customer value, and have a chance to really stretch your design muscle. Just don’t forget the things on the edges, because that’s where the constraints and headaches come from.

by Leon Rosenshein

What Is Done?

The Definition of Done is an important part of the software development process. I’ve talked about being Done Done and Telling You What I Want before. I’ve also talked about the difference between Stories and Tasks. All of those things tell you about the importance of knowing what done means. There’s also some guidance in there on how to define done.

What those posts don’t do though, is point out any of the potential hazards along the way. Like most other sets of directions, they provide guidance on what to do, but they don’t tell you much about how to recognize when you’re off course. How to recognize the situations where your definition of done is actually causing more problems than it solves.

The first, and arguably most important, is being too rigid. You have your acceptance criteria and, damn it, you either meet them or you don’t. You said there would be 95% line coverage with unit tests, but there’s only 94.87% coverage. You failed. Or you said there would be a P95 response time of less than 100ms, but P95 is 110ms. Sorry, you tried hard, but you failed. That’s pretty demotivating, and no one wants that, so instead of admitting to failure, you ignore the definition, declare victory, and move on. Everyone feels good. But after a few such occurrences, the definition of done becomes unimportant. Since everyone knows you’re just going to move on anyway, why bother doing what you said?

Being too rigid in the definition, then ignoring whether you met it or not, quickly leads to the definition of done being irrelevant. The trick is to make the definition of done comprehensive enough that you’re always adding value, but not so strict that 50+% of the time you end up missing it. Instead, leave some room in the definition of done so that it’s almost always achievable.

Which leads right to the second problem. Failing to inspect and adapt. Whether you call your process Scrum, Agile, XP, Kanban, BDUF, or something else, if you’re not paying attention to what’s really happening, you’re not going to end up where you want to be. Not meeting a definition of done by the specified date is a problem, but a bigger problem is to just accept that it happened, and will continue to happen, and not try and make things better. Every failure to meet a date is not a problem, it’s an opportunity.

Acknowledge the miss, then look at your process and your definition of done and make changes. As I mentioned above, you can make the definition less rigid. Instead of saying 95% coverage, say that coverage will be > 95% or the coverage at the beginning, whichever is lower. Now, your goal is not to suddenly meet some arbitrary coverage number, but to keep getting better until you meet the threshold you’ve defined you need to maintain.
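
That ratcheting idea is easy to turn into an automated gate. Here’s a minimal sketch (the file names and JSON shape are made up):

```python
# A minimal sketch of a ratcheting coverage gate: the bar is 95% or the
# coverage you started with, whichever is lower. File names and JSON
# shape are made up for illustration.

import json
import sys

TARGET = 95.0

with open("coverage_baseline.json") as f:
    baseline = json.load(f)["line_coverage"]
with open("coverage_current.json") as f:
    current = json.load(f)["line_coverage"]

required = min(TARGET, baseline)  # achievable: never demands a sudden jump
if current < required:
    sys.exit(f"coverage {current:.2f}% is below the required {required:.2f}%")
print(f"coverage gate passed: {current:.2f}% >= {required:.2f}%")
```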

Along the same lines, and worth a whole post of its own in the future, is that using a standard definition of done is a problem. It might be too rigid. It might be too lax. If it’s standardized and shared, it’s almost certainly out of context. Even if it’s a “best practice”, the label of best practice comes from someone else, and might not (probably doesn’t?) apply in this specific case.

So by all means, have a definition of done. Make sure it fits your team and context. Define it so that it’s achievable. Achieve it whenever you can. When you can’t, fess up to the miss. And always, always, always, take the time to look back and learn how to make your definition of done better.

by Leon Rosenshein

High Quality Quality

Before today’s topic, a little housekeeping. After a little over 14 months, I’m back at Aurora. I’ve taken on a Tech Lead Manager role, responsible for measuring and helping to improve software quality in support of Aurora’s mission to deliver self-driving, safely, quickly, and broadly.

Of course, before you can measure (or improve) something, you need to define what it is. In this case it’s software quality. There are lots of ways to measure software quality. To me, the most important is suitability to the problem at hand. Whatever software you write needs to add value and solve the problem it intends to solve.1

If you never intend to touch the code again, and it will never be used in a situation you didn’t expect or with inputs you hadn’t considered, you might have quality software. For the rest of us, writing code that will need to be changed, either to handle new requirements or new operating conditions or new environments, solving the problem at hand is a necessary, but not sufficient, condition for having quality software.

Then there’s the definition of solving the problem. Have you solved the problem if you only solve it in some specific cases? Maybe, but it depends on the failure mode. And what the expectations are for that failure mode. A basic calculator is pretty simple. If it adds positive numbers correctly and says “invalid input” for negative numbers that might be quality. If it gives you the wrong answer, it’s not quality software.

You might be saying to yourself, wait a minute, this isn’t about software quality, it’s about correctness and suitability, or External Software Quality (ESQ). That’s sort of true, but you can’t have quality software without it, and if we let ourselves forget that, we’ve got a whole different problem.2

Having said all that, what about code quality, or Internal Software Quality (ISQ)? When you start talking about ISQ, you’re talking about some of the ilities. Readability. Maintainability. Extensibility. Scalability. Securability. Optionality. Things that matter when you come back to the code weeks or months later because you’ve got a new requirement, or you found a problem.

At the heart of ISQ is the notion that while we might write code occasionally, we spend way more time reading and modifying code that’s already written. That’s where things like coding styles and linting come in. Things that make it easier to read the code because things look the same. Loop boundaries are clear. Variables are named clearly and not re-used for strange purposes. Things that make it easier to understand not just the exact meaning of a specific line, but how that line fits into the overall flow of the code. Reducing nesting and indentation. Things that make it easier to change code and be confident that you haven’t broken something. Increased code and branch coverage. Static analysis. Reducing cyclomatic complexity. Things that keep the reader’s cognitive load down so that they can focus on the why, and not get caught up understanding the what.
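
As one concrete illustration of “reducing nesting” (the order-shipping names are hypothetical): guard clauses keep the behavior identical while letting the reader read straight down.

```python
# A minimal sketch of reducing nesting with guard clauses. Same
# behavior, less indentation for the reader to hold in their head.

from dataclasses import dataclass, field

@dataclass
class Order:
    paid: bool = False
    items: list = field(default_factory=list)

def dispatch(order: Order) -> str:  # stand-in for the real work
    return f"shipped {len(order.items)} items"

# Before: the happy path is buried three levels deep.
def ship_order_nested(order):
    if order is not None:
        if order.paid:
            if order.items:
                return dispatch(order)
    return None

# After: guard clauses handle the "never" cases up front.
def ship_order(order):
    if order is None:
        return None
    if not order.paid:
        return None
    if not order.items:
        return None
    return dispatch(order)

print(ship_order(Order(paid=True, items=["book"])))  # shipped 1 items
print(ship_order(Order(paid=False)))                 # None
```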

Things like coverage, static analysis, and cyclomatic complexity are nice because they are quantifiable. You can look at today’s numbers and compare them to the numbers from last week or last month and see if you’re getting better or worse. They can be calculated for small parts of the code and changes to the numbers can be attributed to specific changes to the code, so you know what areas need work. You can use the numbers as gates to allow (or block) changes if you don’t like the way the numbers are impacted by the change. Doing that can help your ISQ. Just like ESQ, you need to measure and respond to what those numbers are telling you. But you also need to be careful to understand what you’re doing. You might be moving the numbers in the right direction, but there’s no guarantee that you’re improving your ISQ. If they’re getting worse, ISQ is probably going down, but the inverse isn’t always true.

Or put another way, the set of things you need to do to improve quality is also the set of things that make developers more productive. Things like DORA and SPACE. High ISQ leads to happy developers, and happy developers lead to high ESQ.3 Unfortunately, if you think measuring code quality is hard, measuring developer sentiment and productivity is even harder.

There’s no silver bullet or magic potion. Instead, you have to implement all the individual measures, think about them, and make a change that you think will drive things in the right direction. Then you reevaluate and see if you’ve made things better. When something works, you do more of it. You look at incentives and goals, then you make sure the incentives align with the goals. You take many more much smaller steps, improving things one step at a time.

Or, you start instrumenting your code reviews and minimize the WTFs/Min.

[Image: two doors to rooms where code reviews are happening. One door is labeled “Good Code”, the other “Bad Code”. Thought bubbles show 2 WTFs coming from the good-code room and 5 from the bad-code room.]

WTFs/m


  1. It’s ok if the street finds its own use for your software. If it solves that problem, it has quality for that problem, even if it’s not the one you originally tried to solve. ↩︎

  2. It’s a problem for a different time, but for now, take a look at these search results ↩︎

  3. That’s also a topic for a different time ↩︎

by Leon Rosenshein

A Problem Vs. The Problem

If you’re looking for the All Models Are Wrong that was here by mistake, just follow this link.

I’ve talked about Root Cause Analysis (RCA) before. Otherwise known as “Why did this really happen?” It’s an important tool in deciding if you’re fixing a problem or alleviating a symptom. Both are very doable, and both can be appropriate. The question you need to ask yourself is “What impact do you want the solution you’re coming up with to have?”

The answer to that question is very contextual. If you’re doing a market analysis or trying to find product/market fit, then you’re looking to solve a problem that your customers might not even know they have yet. At the other end of the spectrum are outages and incident response. Something that used to work is suddenly not working. Users know exactly what their problem is (they can’t do the thing they’re trying to do), but have no ability to get it working again. You, on the other hand, have no idea what the users’ problems are, but you can see that your software/service isn’t responding correctly.

In the first, arguably easier, case, you have a couple of luxuries. No one is upset with you (yet). They have limited, or no, expectations. You have the time to think deeply about the problem and figure out what the root cause is and come up with a solution that solves that root cause. Then, with the root cause addressed, you can meet the user’s specific needs and add value for them.

When you’re dealing with an outage on the other hand, you don’t have the luxury of time. Something is broken and your users are suffering. You might just be slowing them down a little, but you could also be costing them time and money. At Uber, if the rider/driver matcher wasn’t working you had both problems. The riders were unable to get where they needed to be on time, and the drivers weren’t able to earn money. That puts more than a little urgency on solving the problem. That urgency drives you away from doing RCA and solving the underlying problem, and toward addressing the proximate cause instead. However, you still need to solve the root problem.

Which leads right back to a problem vs the problem. Sometimes you need to solve a problem, the one right in front of you. Door A or door B? It’s highly unlikely you’ll be in the same situation again, so there’s no deeper, root cause to find and fix.

Consider the case where you’re at some kind of social event and you’ve walked up to the buffet and they have lots of choices you like. Take a little of everything, or a lot of one thing. Make a choice. If you’re still hungry after, make a choice again. There’s no need to investigate why you’re at the buffet, how the selection of dishes was determined, or figure out a way to change those things.

On the other hand, if you’re at the event and you’re not hungry, the situation is different. Even if you have some kind of restricted diet, you’re not hungry, so there’s no need to find something to eat. You can just grab a drink and think about what you could do if you were counting on the buffet having something you could eat. There are lots of ways to handle dietary restrictions at buffets, from carefully picking foods to talking to the provider to eating before or bringing your own food with you. There’s no pressure, so take the time to figure out what you want to do next time you end up at a social event that might have a buffet.

On the gripping hand, you’re at the event, you’ve been on the road for 12 hours, so you’re more than a little hungry, and you’ve got those dietary restrictions. Now you need to solve both problems. You’re hungry, and clearly you hadn’t fully thought through the situation. What can you do to get something to eat now (the proximate problem) and how can you make sure you don’t find yourself dealing with the same problem again (the root cause)? To make matters worse, you’re hangry because you haven’t eaten in 12 hours. In that case, fall back to the incident response pattern. Stop the damage and grab a drink. Mitigate by finding something to eat. Identify and implement the long-term fix by keeping some iron rations in the car so you don’t go 12 hours without eating again.

So next time you find yourself solving a problem, make sure you’re solving the right problem at the right time.