Recent Posts (page 25 / 70)

by Leon Rosenshein

Engineering Excellence

I maintain that we are engineers, and I’m not the only one. And of course, we all want to be excellent to each other. But what is engineering excellence, really?

It’s not doing code reviews or design docs or outage post mortems, but those are all part of it. It’s not SOLID or DRY or KISS, but those are part of it too. And it’s not staging environments, release processes, or alerting. Or at least it’s not just any of those things.

Engineering excellence, and especially a culture of engineering excellence, isn’t about the things you do. It’s about why you are doing those things. It’s the difference between trying to ensure bugs aren’t released with the product at the end and trying to ensure bugs aren’t added to the product during development. A sufficiently rigorous late-stage QA process can make sure bugs are never released, but it can’t ensure you actually release something. Conversely, if you make sure bugs aren’t added (or don’t live long if they are) you can release whenever you want.

It’s engineering excellence that takes you from executing all of the processes in the world because you’ve been told that process makes you better, to doing all those things because “we care enough to do the best” and they let us know that we’re doing excellent work. Engineering excellence is about craftsmanship and effectiveness. It’s about fit and finish. It’s about doing things in a way that makes things easier in the future, not what’s expedient now. It’s about doing things in a way that gives you more average velocity, not more instantaneous velocity.

What you end up with is pretty impressive as well. You end up being both happier and more productive. You do things you didn’t know you could, and before you expected to be able to do them, so more value is released to your customers. You find that the little things that annoyed you are gone. There’s less friction when you try to do something. The time you used to spend sitting around waiting for something to finish is gone.

It happens at multiple levels, not just the individual. The same thing happens at the organizational level. Build roadmaps not just for the quarterly review, but with customers so that the roadmap is built around customer needs. Collaborating with partners to reach a goal quicker. Having processes and norms to make things easier instead of using them as a constraint. A smoothing of the rough edges between and around teams and goals so they work together better.

And of course, not a simple copy of some other company’s ideas, but done in a way that fits our context.

by Leon Rosenshein

Context Matters

I’ve talked about contexts before. The contexts of what you and your audience know. The ubiquitous language of bounded contexts. And the shared context of experience. There’s another kind of context that I haven’t talked about. The kind of context that comes with size and environment.

Because size and environment change things. In a company of 10 people it’s possible for everyone to know everyone else. And for everyone to know as much (or as little) as they want about everything else. A company of 100 people can probably do it, if they work hard at it. In a company of 1000 people it won’t work.

When TK started Uber, in one city with a few people, the size of the company was right and the environment he did it in was ready for what he was doing. Trying the same thing today, in this context won’t get you far.

Consider three very large tech companies: Microsoft, founded in 1975; Amazon, founded in 1994; and Google, founded in 1998. They all have a search product. They all have a large part of their company dedicated to building and selling cloud services. They are all considered successful. But while their product portfolios have a lot of overlap, if you look under the covers no one would mistake one of them for either of the others. And if you tried to transplant a specific process or piece of the tech stack from one to another it wouldn’t work.

There are books by company insiders about how Microsoft and Google do software development. In both cases they’re distillations of hard-won lessons at the respective companies. And they’re different. Because they evolved over time in a specific context. A context that includes things like company processes, number of employees, company wide and org specific tools, and very different leaders at the top.

Which is a long way of saying that while there’s a lot to learn from the successes (and failures) of other companies and the tools and processes they used, we, or any other company for that matter, can’t just blindly apply those lessons and expect them to work as well.

Instead, we should view the experiences of others through the lens of our context and apply those learnings. Building the tools and processes, the culture, that works in our context.

by Leon Rosenshein

Measurement

"What gets measured gets managed - even when it’s pointless to measure and manage it, and even if it harms the purpose of the organization to do so."

Peter Drucker

“It is wrong to suppose that if you can’t measure it, you can’t manage it – a costly myth.”

W. Edwards Deming

“Managers who don’t know how to measure what they want settle for wanting what they can measure.”

Russell Ackoff

"Tell me how you measure me, and I will tell you how I will behave."

Eliyahu Goldratt

You’ve probably heard shortened versions of the first two. “What gets measured gets managed” and “If you can’t measure it, you can’t manage it”. Now compare the common, well-known, shortened versions to the full quotes. In context the meaning is very different. But that never stopped a good quote from entering the zeitgeist. Or putting the new cover sheet on all your TPS reports. And it’s not just conventional wisdom. I’ve been in multiple training sessions over the years where the shortened, incorrect versions were used. That’s a real problem, because it just cements the misunderstanding.

Beyond the misunderstanding, the real problem with the short versions is that they lead directly to the second two quotes. Measuring how much value a team has added, especially when a company is pre-revenue, is hard. You can’t look at a bump in sales or monthly average users. There are no downloads to count or even user comments to look at. What you can do, though, is measure activity. Tickets opened/closed, PRs landed, or documents written. Things like that are simple and easy to measure. And get you a new minivan.

As is often the case, the antidote to incomplete understanding is to go back and look at what was actually said. Don’t just measure something because you can. First, measure something because the information conveyed by the measurement actually helps you manage things in the direction you want to go. If it doesn’t, stop measuring it.

Second, manage the things you need to manage. Some things, like the average response time of an RPC, are measurable. It makes sense to track them. And do something when they’re demonstrably outside a threshold. But when you don’t have a direct, explicit measurement for something there’s probably a qualitative one. Unmeasurable doesn’t mean unknowable. You can’t measure the happiness or energy level of a team, but you can tell if it’s higher or lower than it has been. That’s important information. Use it.

So instead of measuring what you can and then managing that data, manage so that value is added. So that, even when no one is watching, the decisions that get made all day long, the little tiny ones, are made in a way that value gets added. Which means we’re back to talking about culture. Again.

by Leon Rosenshein

Broken Windows

A system that breaks randomly but frequently…and requires very short bursts of work to resolve the issue… will sap the morale of a team far out of proportion to the “size” of any ticket

-- John Cutler

When you overload a database and it falls over, no one is surprised that things stop. When that happens, the team(s) involved dig in, pick it back up, and get things going again. Next they figure out the series of unfortunate events that caused the outage and do something to make sure it doesn’t happen again, or at least warn them before it does. Then they close out the issue and move on.

But sometimes that doesn’t happen. Instead, what happens is that an alert fires, but by the time someone starts looking the alert clears. Or maybe there aren’t any alerts, but throughput drops. Or someone tries to access a different S3 bucket and gets an error. There’s no outage, but there’s always something not quite right.

In many ways those on-call shifts are worse than ones with an outage. Because you never get a chance to catch your breath. Or think about root causes. Or do something about it.

What’s a team to do in that situation? The simple answer is to not ever let it get like that. And maybe that’s possible. But probably not. Because when the rubber meets the road you find out that things aren’t the way you expected. It could be because there was a constraint you didn’t see, or a set of known constraints were reached at the same time. Or there’s just a new use case with very different characteristics. It doesn’t really matter which of those reasons it is or if it’s something different. The situation will arise, so you need to deal with it.

IME the best way to deal with it is to own it. Acknowledge both the pain you’re feeling and that it needs to be dealt with. Acknowledge it as individuals and as a team. Take the time to understand what’s really going on. Then take the time to fix it right. Sure, it’s going to impact near-term deliverables. If you spend a few days analyzing, experimenting, and delivering a fix you have slipped your immediate delivery by that many days. Own it. Tell your customers. But also recognize that after you put in the effort you’re going to be in better shape.

Drag will be lower. You’ll move faster. With everything else staying the same you’ll find that pretty soon you’re ahead of where you would have been if you hadn’t taken the time to fix the issue.

And you know what happens if you don’t take care of the issue? You’ve created a precedent. Similar to the broken windows theory. If it looks like things aren't being taken care of they won't be. When the next issue comes up people will look around and see that others haven’t done anything about their issues. Which makes them less likely to do something about the new one. Now you’ve got an even stronger precedent. Which makes it that much more likely that the third issue will be ignored.


Then one day you pick your head up and realize there’s a culture of not worrying about the little things. That’s not a culture anyone wants. Friction builds up. Progress slows. Deadlines are missed. You see things like this on the big display at the airport that's supposed to tell you when your flight is departing.

So instead, build a culture where the little things are important. Where people and morale are important. Where problems are acknowledged, owned, and acted on. And you’ll find that you not only have happier, more fulfilled people and teams, but you’ll get more done as well.



by Leon Rosenshein

Drive

You’ve probably heard of Maslow’s Hierarchy of Needs. Especially the first four levels. What Maslow called the D-Needs. The basic life and safety needs. Things like air, food, and shelter. If those aren’t met, nothing else matters. And that makes sense. And for a long time that was motivational theory. In life and at work. People were motivated to have their D-Needs met, and a steady paycheck ensured that you could meet those needs. But what happens when you’ve met those needs? How do you motivate people after that? Sure, Maslow had levels beyond the D-Needs, but they weren’t as well fleshed out and there wasn’t a coordinated way to approach motivation at that level.

Then, in 2009, Daniel Pink published Drive: The Surprising Truth About What Motivates Us. Pink took as his starting point not Taylorism and the assembly-line/repetitive-task worker, but the knowledge worker. According to Pink, once those basic needs are met, a different approach is needed to motivate people. Instead of a hierarchy, it’s a combination of three distinct, yet related, things: autonomy, mastery, and purpose.

Autonomy: The desire to be self-directed. To figure out what the thing is that you should be doing.

Mastery: The desire to learn and grow. Being able to develop new capabilities for yourself.

Purpose: The desire to do something important. Something that will have a broad impact.

But how do you have a team, let alone a business, when all of the team members are doing what motivates them individually? How do you keep a team from turning into a group of individuals with a common manager, if not devolving into anarchy? That’s where the interrelatedness comes in.

The interrelatedness that comes from vision. Because when the company’s vision aligns with the team’s vision/purpose, which aligns with the individual’s vision/purpose, anarchy is not the issue. If everyone’s purpose aligns, then, in general, work will align. That’s a two-fer right there. Making sure you have alignment of purpose means your autonomy won’t turn into anarchy (assuming you have trust and clear communications).

One thing that alignment doesn’t do though, is ensure that all of the needed work is being done. That’s where mastery can help. There’s lots of work to be done, and there are probably people who think it would be interesting, but don’t know how to do it. Instead of silo-ing folks into narrow areas, you can encourage them to learn new things in different areas. That’s another two-fer. You get more work done in the areas that need it and you bring different viewpoints/perspectives, which give you better work as well.

The real magic is in how to use that kind of motivation. To do that you’re talking about culture. Which is a whole different topic for another day.

by Leon Rosenshein

DNS Linkage

Question: How dynamic is your linker? Or put another way, can DNS be considered a really late binding, dynamic linker? Or put a third way, is the only difference between a monolith and a gRPC microservice ecosystem when the binding occurs?

Of course DNS isn’t a linker. A static linker pre-calculates the memory address of a function call. Then it sets the value that will be loaded into the instruction pointer. Then the normal operation of the program jumps to the function, and when it’s done the result (if any) is in a well-known place and execution continues. A dynamic linker does essentially the same thing, only it does it much later in the process, just before it’s needed.

DNS on the other hand just translates a well-known name to a routable address. It doesn’t set any registers. The normal operation of your program doesn’t go to a different memory location based on what DNS comes up with. Instead, your program does the exact same thing, regardless of what DNS returns. It sends a message to the address. Then it stops and waits for the answer (if any) to show up in a well-known place. From an architectural perspective that really isn’t any different from a local call. That’s probably why it’s called a REMOTE procedure call. It’s just like any other procedure call, but it happens somewhere else.

That’s interesting and all, but it means there’s something really useful you can do with it. You can expose your API as an RPC and as a library call. Yes, it’s more work up front, but it enables lots of interesting things.

Consider a system that, in mature, Uber-scale production, needs to handle thousands of concurrent operations. You’ll probably want to scale different parts differently, avoid single points of failure, and be able to update bits and pieces individually. This might naturally lead you to a microservice architecture. It meets all of those requirements. But before you get to that scale you’re handling a few concurrent operations. Or as a developer building/testing the thing, in most cases you’re looking at one transaction at a time. So you don’t need the scale.

So maybe at the beginning you build libraries and link them together into one thing. It’s easier to build. It’s easier to test. It’s easier to deploy. Then as load and scale increase and you really need it, you turn your function calls into gRPC calls and split the monolith into a set of microservices.

What if you’ve already got a set of microservices and you need to do some testing? Sure, you could spin up a cluster and deploy a whole new set of services and make your calls. Or you could docker compose the entire system. Good luck step debugging, though. You’ll need great logs, tracing, and a lot of patience with the debugger of your choice as you chase data around the system.

The other option would be to recompose them as a monolith. Debugging becomes easy. Following the execution is step-into instead of finding the right instance of the right service and hoping your conditional breakpoint is correct so you catch what you’re looking for. And it runs on your local machine. Fewer hoops. Fewer hops. Less latency. More productivity.

So no, DNS isn’t a linker. But sometimes it acts like one. 

by Leon Rosenshein

That's a Drag


Speaking of flow and WIP, the reason to reduce WIP is to increase flow. The reason for increasing flow is to reduce time to value. To add value sooner. So obviously, if you want to add value sooner, you need to go faster.

But what do you need to go faster at? Increasing top speed can help, but not as much as you think. What you really want to increase is your average speed. After all, how much of your time is spent at “top speed” (whatever that means)? If you want a car analogy, it’s the difference between the big oval at Indianapolis Motor Speedway and the road course at Laguna Seca. At Indy you spend a lot of time at top speed on the long straights. Going a little faster there can make a lot of difference. Laguna Seca, on the other hand, has a top speed, but it’s not limited by the car’s top speed. Instead it’s limited by the shape of the course and how hard you can brake, how much grip you have around the turns, and then how quickly you can accelerate towards the next turn.

Developing large, complex and complicated software projects is much more like that road course than the Brickyard. If you want to bring your average speed up, the best way to do it is to bring up the speed of the slowest parts, letting you spend less time there and more time at the higher speeds. That’s where drag comes in.

When I say drag, I mean the things that slow you down while you’re doing something else. Time to compile and time to run tests are part of it, and maybe the most visible parts, but by no means all of it. Time spent configuring your environment to your (and your team’s) liking is drag. Time spent trying to find the right piece of documentation about how a library or tool works is also drag. Learning a new process or tool is in that category. Planning can be a drag. Then there’s the big one: KTLO, keeping the lights on. The more time you spend KTLO, the less time you’re spending at top speed. Heck, reading random blog posts on the corporate intranet could be considered drag.

You can’t eliminate all drag. It’s an inherent part of the system. You need to compile your code. You need to keep the lights on. You should always be learning (and reading blog posts :)). But you need to be aware of drag. And minimize it.

Simple things, like incremental builds and tests. Using remote builds to speed up your builds. Scripts to automate common tasks like setup and deploy.

More complex things like build/borrow/buy decisions or planning at the correct level for the time horizon. Taking a little bit longer up front to reduce KTLO later. Building test harnesses so that you can do integration testing without needing a full environment.

So next time you’re trying to figure out how to reduce the time it takes to add value, remember that increasing top speed is only one of the things you can do. In many cases you can add value sooner and easier by reducing drag.

by Leon Rosenshein

Flow and WIP

Way back in 1975 Fred Brooks told us about the mythical man month and that adding people to a project that’s late will just make it later. There are lots of reasons for that. One of the biggest, according to Brooks, is that the number of communications channels between all of the people involved goes up roughly with the square of the number of people involved (actually n(n-1)/2), and all that communication at best slows things down, but more likely creates bottlenecks, which really limit things.

The other thing that can slow down a team is too much work in progress (WIP). With lots of WIP you have a few issues. First, you often end up doing work that isn’t quite right. If you build the dependent system (B) before the thing it depends on (A), you end up either constraining A to be something other than it should be, or you end up doing rework on B to match what A ended up being.

Second, if there is no B, but there’s some unrelated task that keeps you busy, you end up with extra context switches. And every one of those takes time. Which ends up taking time away from actually doing those tasks. So you actually make less progress.

But the biggest issue is that while on average tasks might take a certain amount of time, in practice they’re all different. And that difference creates bottlenecks and work queues. Especially when combined with the notion that you should be busy, so you start something else while you’re waiting. Which means you’re not ready when the work is, which delays it even more.

You’ve probably heard that before, but it’s hard to visualize. How can being busy slow you down? It just feels wrong. I recently came across this video which shows a simulation of just how this works, or doesn’t work. How adding stages makes things slower. Even when you add more people at the bottleneck. And how limiting WIP and expanding an individual's capability makes the overall system faster.

And that’s what we’re really going for. Making the system faster, not the individual pieces.

by Leon Rosenshein

GetXXX()?

Language is important. The ubiquitous language of your domain. The metaphors you choose to use. The names you give things. Even the words that go into the name of a method are important.

Consider the word "get". It gets stuck on the beginning of a lot of functions to indicate that a value is being returned. But what is it really telling you? 

According to the folks at Oxford, there are multiple uses of "get" as a verb, but only two that are related to transferring or acquiring, and one of those is about a person.

  1. Come to have or hold (something); receive
  2. Catch or apprehend (someone)

So get probably means "Come to have or hold; receive" when used in a function name. But there's a problem there. It's referring to something. If you get it, you have it. That's great, and probably what the caller wants. But if you have it, whoever used to have it doesn't. Which, as the author of the function, you probably don't want. In C++ when you want to transfer ownership of something you call std::move, not getXXX. So instead of transferring the thing, you return a copy. Which means you're breaking the contract implied by the name. If I call the getBalance function and give it my user id, I don't want the balance transferred from my account to the caller. That would be bad.

Skipping the whole ownership transfer thing, what is your getXXX function actually doing? If all it's doing is exposing some internal value, then maybe don't call it getXXX. Consider calling it XXX, and treat it as a property, not a function. Speaking of properties, if it's a boolean property, think about isXXX. Having isReady() be true or false makes it pretty easy to understand the intent.

If, on the other hand, it's a query or calculation, make that explicit. findXXX sounds like it might find nothing, while retrieveXXX implies an error if it's not there. getXXX doesn't tell you how it's going to respond. Given a file path, extractFilename tells you the file name you get back is part of the full path. Who knows what getFilename will do. It probably gets the filename part of the path, but it also might return the relative path, the fully qualified path, or maybe turn the Windows short name into the long file name.

So next time you need to name a function, think about what the name implies.

by Leon Rosenshein

++ / --

History Lesson

The ++ (and --) operators have been around for a long time. Like 50+ years. They’re functions with side effects, with prefix (increment the variable and return the new value) and postfix (return the value, then increment) versions. You see them in loops a lot. They’re syntactic sugar, created so that programmers could write code with fewer keystrokes. But what’s that got to do with strings?

They’re syntactic sugar because back in the day, when the only data type you had was the WORD, and the only data structure you had was the VECTOR, you ended up with lots of code that looked something like

copy(from, to, count) {
  i = 0;
  while (i < count) {
    to[i] = from[i];
    i = i + 1;
  }
}

Add in pointers and the new operators, and it turns into

copy(from, to, count) {
  while (count--) {
    *to++ = *from++;
  }
}

5 lines instead of 7. No initialization. Less array access math. Smaller code. Faster code. And almost as readable.

But what has that got to do with strings? It’s related to strings, because, if you assume null terminated strings, and 0 == FALSE, that loop turns into

strcpy(from, to) {
  while (*to++ = *from++) ;
}

Simple. Efficient. 3 Lines. Minimal overhead. And the source of one of the biggest security nightmares, the buffer overflow. Add in multi-byte character sets and unicode and it gets even scarier.

But that’s not the only scary thing about ++/--. As I mentioned before, they’re functions with side effects. All by themselves, without an assignment, you’re just getting the side effect: incrementing/decrementing the variable. That’s pretty straightforward. When you start using the return value of the function, things get more interesting.

Consider this snippet

#include <stdio.h>

int main() {
  int i;

  printf ("\ni = i + 1\n");
  i = 0;
  while (i < 100) {
    i = i + 1;
    if (i % 10 == 0) {
      printf ("%d is a multiple of 10\n", i);
    }
  }

  printf ("\n++i\n");
  i = 0;
  while (++i < 100) {
    if (i % 10 == 0) {
      printf ("%d is a multiple of 10\n", i);
    }
  }

  printf ("\ni++\n");
  i = 0;
  while (i++ < 100) {
    if (i % 10 == 0) {
      printf ("%d is a multiple of 10\n", i);
    }
  }

  return 0;
}

The three loops are almost identical, but the results are different for ++i and i++. The first and third loops print 100 as a multiple of 10; the ++i version stops at 90. Because all of a sudden your test, either ++i < 100 or i++ < 100, has a side effect.

But at least now you know why you can shoot yourself in the foot like that. For more history on why things are the way they are, check out Dave Thomas.