Recent Posts

by Leon Rosenshein

When You Have A Hammer

You haven’t mastered a tool until you understand when it should not be used.

Kelsey Hightower

I’ve talked about patterns and best practices before. They’re great. They give you a starting point. But they’re no substitute for thinking about your situation.

Consider the hammer, screwdriver, and pliers. In theory, there’s very little you can’t do with that set of tools. You use the hammer to put screws in. You use the pliers to take screws out. And you use the screwdriver to open paint cans, pry valve covers off, and, with the hammer, chisel out notches in wood. That’s mastery of your tools, right?

Not exactly. Sure, it works, but screwdrivers are much better at putting in and taking out screws. Using screwdrivers as prybars or chisels can work, but you end up with a damaged screwdriver.

And the less said about using pliers for every nut, bolt, and pipe you need to turn, the better.

The same can be said of software. Consider DRY. Don’t repeat yourself. Short functions. Generally good ideas. But even that can be taken to extremes. Consider this silliness

package main

import (
    "fmt"
)

func increment(x int) int {
    return x + 1
}

func is_less_than(x int, threshold int) bool {
    return x < threshold
}

func print_i_and_j(i int, j int) {
    fmt.Printf("I: %d, J: %d\n", i, j)
}

func main() {
    for i := 0; is_less_than(i, 10); i = increment(i) {
        for j := 0; is_less_than(j, 15); j = increment(j) {
            print_i_and_j(i, j)
        }
    }
}

It works, but it’s actually less clear. Sure, we can guess what those functions do, but we can’t be sure. Especially if they’re in another package/library.

Microservices are a different example. Sure, at planet scale lots of things need to scale at different rates/shapes, so when you get to that scale you need them. But using a microservice architecture for a simple static website doesn’t make a lot of sense.

So when you reach into your toolbox, make sure you grab the right tool, not the one on top :)

by Leon Rosenshein

Wat

Heading into the weekend, a little programming humor for you. All languages have their oddities, but some are a little odder than others. For instance, JavaScript can lead to watman.

And if that’s not enough oddness, here are some more weird languages. Take a look, one of them might be just the thing for your next project, but probably not. 

by Leon Rosenshein

OOP at Scale

Quick, what’s the defining item of object oriented programming? Classes? Hierarchy? Inheritance? Enterprise Java Factories? 

According to Alan Kay, who came up with the term over 50 years ago, the 3 defining parts of OOP are:

  • Message passing
  • Encapsulation
  • Dynamic binding

Nothing about objects, classes, or inheritance. More about isolation and clear boundaries. You can (and should) do that with your code as well. Use function calls (message passing) to ask for work to be done instead of doing it yourself. Don’t set up some shared state and pass control off; explicitly send the required info and explicitly collect the answer.
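
In Go, that idea might look something like this minimal sketch (the function and values here are made up for illustration):

package main

import "fmt"

// nextBalance gets everything it needs in the call itself and hands
// the answer back explicitly. No shared state, no side effects.
func nextBalance(balance int, deposit int) int {
    return balance + deposit
}

func main() {
    balance := 100
    // The call is the message: it carries the required info, and the
    // caller explicitly collects the answer.
    balance = nextBalance(balance, 25)
    fmt.Println(balance)
}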

Keep data and the functionality (methods) that goes with it together. The most common way to do that is with a class, but duck typing can suffice. It doesn’t need to be a class. It could just be a library. Everything about an employee could start with Employee_ and just be a loose method in there. Use data structures that collect all the relevant information together, not just a big heap of data.
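
As a hypothetical sketch, keeping an employee’s data and the behavior that belongs to it together might look like this:

package main

import "fmt"

// Employee collects all the relevant information in one place...
type Employee struct {
    Name   string
    Salary int
}

// ...and the functionality that belongs to that data lives with it.
func (e *Employee) GiveRaise(amount int) {
    e.Salary += amount
}

func main() {
    e := Employee{Name: "Pat", Salary: 50000}
    e.GiveRaise(5000)
    fmt.Println(e.Name, e.Salary)
}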

Wait until as late as possible to decide exactly which method to call. Sure, you know what you want to do and how to call it (which message to send), but don’t decide exactly which implementation handles it until later. Dependency injection is a great example. If you need to durably store a blob of data, just call the Store() method. Which store (file system, S3, database, testing, etc.) you get is determined at runtime, because it’s encapsulated.
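
Here’s a hedged sketch of that in Go. BlobStore, fileStore, memStore, and newStore are all made-up names, not a real library; the point is that the caller only ever sends the Store() message:

package main

import "fmt"

// BlobStore is the message the caller knows how to send.
type BlobStore interface {
    Store(data []byte) error
}

type fileStore struct{}

func (f *fileStore) Store(data []byte) error {
    fmt.Println("pretend this went to disk:", len(data), "bytes")
    return nil
}

type memStore struct {
    blobs [][]byte
}

func (m *memStore) Store(data []byte) error {
    m.blobs = append(m.blobs, data)
    return nil
}

// newStore decides, at runtime, which concrete implementation will
// handle the message. The caller never sees the choice.
func newStore(env string) BlobStore {
    if env == "test" {
        return &memStore{}
    }
    return &fileStore{}
}

func main() {
    store := newStore("test") // could just as easily come from a config file
    if err := store.Store([]byte("hello")); err != nil {
        fmt.Println("store failed:", err)
    }
}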

That’s Object Oriented Programming.

Now consider microservices. While there are lots of transports, by definition, all communication between services is via a message. It has to be because the thing you're talking to is potentially running on a computer miles away. You’ve got to send a message.

And since that service is somewhere else you can’t just reach inside it for some info or to change something. You have to send a message. It doesn’t get much more encapsulated than that.

As for dynamic binding, there’s DNS, which can change between one message and another. And most microservice systems have another level of load balancing behind that DNS entry, so the actual service you’re talking to isn’t decided until after you make the request, milliseconds before the message is handled. It doesn’t get much more dynamic than that.

And that’s OOP at Scale

by Leon Rosenshein

Empirical Software Engineering

What is software engineering? What is important to software engineering? What do we know about software engineering? How do we know what we know about software engineering?

Perhaps more importantly, what do we know we don’t know about software engineering? According to Hillel Wayne

Empirical Software Engineering is the study of what actually works in programming. Instead of trusting our instincts we collect data, run studies, and peer-review our results.

Seems reasonable. But what does it mean? Are shorter functions better or worse? Is there a language that is better to use than some other language? Is a microservice architecture better than an event driven one? For that matter, what does better mean?

Like many things, the answer is, it depends. It depends on context. It depends on the problem space. It depends on the constraints you have to deal with. There’s a great scene in Apollo 13 where they need to “make this, fit into the hole for this, using nothing but that”. Those are constraints. Doesn’t matter what the perfect answer might be. That’s the best answer, right now.

In his talk, Hillel spends a lot of time on the definition of better. Not just what better means, but how you know if you’re right, and whether the cause you’re positing is really the cause. And qualitative vs. quantitative research. And the validity of the findings themselves.

I’ll let you watch the talk yourself for the details, but here are a few of the biggest takeaways.

The best predictor of the number of bugs in a code base is the number of lines of code. Pretty obvious. But also not very helpful, because it’s not very actionable.

Org structure has a big impact. The higher the number of different people/teams/groups adding or changing code in a module, the higher the number of bugs. But it’s a balance. Silos and gatekeepers may have a small positive impact on the number of bugs, but they have a large negative impact on velocity.

Finally, the things that are most impactful to development are Stress and Sleep. Missing a little bit of sleep on one day makes you less productive. Missing a few hours a night adds up until it’s like missing a whole night’s sleep. And even worse, along with that reduction in productivity and effectiveness comes an inability to notice the reduction. Which means not only do you make mistakes, you can’t tell you’re making mistakes. And you can’t do anything about them.

So whatever else you do, make sure you get enough sleep.

And on a mostly unrelated note, take a look at the slide deck that goes with the talk. That’s a really interesting way to do a talk. Very minimalist. So you focus on the presenter. Which is often a good thing. But if you consider a presentation as two channels of information (spoken and written), it basically leaves the second channel unused. But that’s a topic for another day.

by Leon Rosenshein

Rice and Garlic

Here’s another one from GeePaw Hill, all about rice and garlic. Actually, it’s not about rice and garlic. It’s about generic advice.

As GeePaw tells it, he knows a chef who has a standard answer to the context-less question, “How can I make my cooking better?”. The blind answer is “That’s too much rice, and not enough garlic.” And you know, that’s pretty good advice when you don’t know what’s being cooked. Of course, there are edge cases. You can’t put too much rice in a bowl of rice as a side dish, and contrary to what they say in Gilroy, some things don’t need more garlic.

So what’s the software development equivalent? GeePaw says you should take more, smaller, steps. Again, without much context, that’s pretty good advice. Rather than try to get from 0 to 100 in a single step, try 0 to 1. Then 1 to 2. Continue that approach and you’ll still get to 100. Assuming that along the way you haven’t realized that what you really want is 75, and stopped there. Or, more likely, you’ve found that the goal isn’t 100 anymore, it’s actually a close neighbor of 100, but off in a different dimension. Which you couldn’t have known without working towards the goal incrementally.

And not just that. That kind of advice scales as well. It doesn’t matter if you’re going to the moon, building a website, or writing a command line calculator. Make a step in the right direction. Test fire an engine. Get your website to deploy and log that it’s listening on a port. Get the CLI calculator to return an error about invalid input. And you can break those down further into actionable tasks. Figure out what each step is supposed to do. Then come up with a way to tell if it’s doing it, and every time you take a step, use that mechanism to see whether you’re getting closer, holding your ground, or falling backward.
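
As a hypothetical sketch, that first calculator step might be nothing more than a stub and the test that tells you it’s doing its one job:

package calc

import (
    "errors"
    "testing"
)

var errInvalidInput = errors.New("invalid input")

// Eval is deliberately tiny. The only behavior so far is the first
// small step: invalid input returns an error.
func Eval(expr string) (int, error) {
    if expr == "" {
        return 0, errInvalidInput
    }
    return 0, errors.New("not implemented yet") // the next small step
}

// The test is the mechanism that tells you whether you're getting
// closer, holding your ground, or falling backward.
func TestEvalRejectsEmptyInput(t *testing.T) {
    if _, err := Eval(""); !errors.Is(err, errInvalidInput) {
        t.Fatalf("Eval(\"\") error = %v, want errInvalidInput", err)
    }
}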

There’s a name for that kind of development. Test Driven Development (TDD). It has all kinds of benefits. Add in some customer value/feedback and now it’s not just TDD, it’s Agile TDD. Which is great.

Because you get to work with your customer to not only show your progress, but to give them value along the way. And you get a better understanding of their problem(s) and can focus on that, instead of trying to get to 100 in one fell swoop because 3 months ago you thought that was the exact target to hit.

by Leon Rosenshein

Discipline

I cannot emphasize too much that architecture is as much about programmer discipline as any technical consideration. Without programmer discipline, all systems, no matter how well designed, degrade quickly into gray goo at the hands of people who don’t understand the "why." -- Allen Holub

Back in the stone age (actually the late 80s/early 90s) I was working for a 3rd tier aerospace company in southern California. One of the things we built was a tool we called ARENA. Basically a squadron level air combat simulation system. We handled flight modeling of aircraft and missiles, Air Combat AI, and a really fancy (for the time) display and replay system. And, I think, a pretty good domain driven architecture.

It helped that the domain was pretty simple (vehicles moving in 3D space) and that the entities didn’t really communicate with each other. At least not actively. Each object did its own physical modeling and put its location in a distributed shared memory system. After that each object was on its own to detect and respond to the other things in the world. And for each object we broke things down like the real world. Propulsion, aerodynamics, sensors, and either an AI or a set of input devices. Interfaces between systems were defined up front and we all knew what they were and respected them. We had external customers and internal users doing research into different flight modes and doing requirements tradeoffs.

Then we got a new customer. Someone wanted to use our system as a test bench for their mission computer (MC). Seemed reasonable at first. We already had a simulated world for the MC to live in, and we had well defined models of an individual aircraft, so how hard could it be to add a little more hardware to the loop? Turns out that it’s approximately impossible. At least with the architecture we had. Because our idea of the interfaces inside an aircraft was purely logical, while the MC expected distinct physical components talking to it over a single bus. So we wrote some adapters. That worked for some things, like the engine, because there was one input (throttle) and 3 outputs (fuel burn, thrust, and ingest drag). But it didn’t work for some of the more complex systems, like the radar. It had lots of inputs, including vehicle state, pilot commands, world state, and mission computer commands. And to get all of them we needed our adapter to reach into multiple systems. The timeline was tight so, instead of refactoring, we did the expedient thing and reached around our interfaces and directly coupled things. And it almost worked.

Everything was very close to correct, and often was, but the timing just didn’t work out. Things would drift. Or miss a simulation frame and stop, or worse, go backwards. So we added some double and triple buffers. That got rid of the backwards motion, and the pauses were usually better, but sometimes worse. So we added some extrapolation to keep things moving. Then we added another adjustment. And another. What was supposed to be a 4-week installation turned into a 3-month death march and resulted in a system that worked well enough to pass acceptance tests, but really wasn’t very good at what it was supposed to do. It went from a clean distributed system to a distributed ball of mud.

And that happened with the same team that built the initial simulation doing the mods. We knew why we had built ARENA the way we did. The reasons we put the boundaries and interfaces where we did. And why we shouldn’t have violated those boundaries. But we did. Not all of them, but enough of them. And we paid the price. And the system suffered because of it. Because we didn’t have the discipline to do the right thing.

Now imagine what would have happened if we didn’t know why things were the way they were. And there wasn’t any documentation of why. Which of course there wasn’t, because hey, we all knew what was going on, so why write it down? Any semblance of structure would have fallen apart that much faster. And we probably would have not just had problems with the interface, but likely broken other things as well. It’s Chesterton’s Fence all over again. That’s where Architectural Decision Records (ADRs) come in. Writing down why you made the decisions you did. The code documents what the decision was, and the ADR documents why.
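
There’s no one required format for an ADR. A minimal sketch (a hypothetical template, not the one true format) just needs a few sections:

  • Title: a short noun phrase naming the decision
  • Status: proposed, accepted, or superseded
  • Context: the constraints and forces that shaped the decision
  • Decision: what you chose to do
  • Consequences: what gets easier, what gets harder, and what you gave up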

So, two takeaways. First, next time you get into a situation where you have a choice between the “right” thing and the “expedient” thing, remember the long-term cost of that decision. Remember that once you start down that slippery slope of expediency, it just gets easier and easier to make those decisions. Because once the boundaries and abstractions are broken, why bother to keep them almost correct? I’ll tell you why. Because that way lies madness, otherwise known as the big ball of mud.

Second, write down the why. The next person to work on that area won’t know all the background and the experiments that led to the current architecture/design. Having the why documented lets them avoid making those same mistakes all over again. Even (especially?) if that person is future you.

by Leon Rosenshein

NFRs and You

NFRs are non-functional requirements. I’ve talked about them before, but lately the name has been bothering me. It’s that non in the name. Unlike non-goals, which are the things you’re not going to do or worry about, your NFRs are not things that your project won’t do, and they’re not requirements you can ignore. NFRs are things you do need to worry about.

I mean, look at the phrase non-functional requirement. You mean you don’t want it to work? Or are you saying it doesn’t do anything, but it’s required anyway? Why would you put any effort into something that has no function?

Requirements without function or benefit should be pushed back against, but what we typically call NFRs have both. NFRs are usually drawn from the list of -ilities. Things like burstability, extendability, reliability, and durability. For any given project you will likely need one or more of them, so ignoring them is a bad idea.

That said, the -ilities aren’t your typical functional requirements either. “The method shall return a (nil, error code) tuple when receiving invalid input.” That’s a pretty straightforward requirement that defines how the code should function. It’s also pretty easy to unit test. Write up a set of test cases that gives you whatever level of coverage you need over the combinatorial explosion of possible inputs, and call the method. As long as you get the expected answer, you’ve met the functional requirement.
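
For instance, here’s a sketch of such a unit test. Parse and Widget are hypothetical stand-ins for your method and its return type:

package widget

import (
    "fmt"
    "strings"
    "testing"
)

type Widget struct {
    Name string
}

// Parse is a hypothetical stand-in for the method under test.
func Parse(s string) (*Widget, error) {
    if !strings.HasPrefix(s, "widget:") {
        return nil, fmt.Errorf("invalid input %q", s)
    }
    return &Widget{Name: strings.TrimPrefix(s, "widget:")}, nil
}

// The functional requirement maps directly onto a table-driven test:
// every invalid input must produce (nil, non-nil error).
func TestParseRejectsInvalidInput(t *testing.T) {
    cases := []string{"", "not-a-widget", "{unbalanced"}
    for _, in := range cases {
        got, err := Parse(in)
        if got != nil || err == nil {
            t.Errorf("Parse(%q) = (%v, %v), want (nil, error)", in, got, err)
        }
    }
}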

No, I think what we call NFRs are really operational requirements. Requirements around the way things behave when they’re operational. Things that are hard (impossible?) to test with unit testing. Consider burstability. Sure, you could write a test that, in parallel, calls your method hundreds of times per second. But does that give you any confidence that the system overall can handle it? To really get some confidence on burstability you need to actually try it. Maybe not in production, but in a production-like environment, with all of the vagaries that implies.
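
Here’s a sketch of that unit-level burst test, reusing the hypothetical Parse from the sketch above. Note how little it actually proves: the method survives 500 concurrent calls on one machine, and that’s all:

package widget

import (
    "sync"
    "testing"
)

func TestParseUnderBurst(t *testing.T) {
    var wg sync.WaitGroup
    for i := 0; i < 500; i++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            if _, err := Parse("widget:burst"); err != nil {
                t.Error(err) // t.Error is safe to call from multiple goroutines
            }
        }()
    }
    wg.Wait()
}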

And like any other requirement, operational requirements need to be part of the plan. You can’t just bolt extensibility on at the end. Or at least you can’t do it in a sustainable way. Which means you need to think about operational requirements from the beginning. They’re requirements, just like any other requirement. They’re just usually a little harder to figure out. Especially if you’re just looking at the business logic, because they’re not in the business logic, they’re in the business case.

The business logic will tell you what to do when a customer places an order, and what should happen if they do something wrong. It won’t tell you what to do on the Friday after Thanksgiving. It’s the business case that will tell you that while your typical order rate is X, on that day it will be 10x, and for the subsequent 6 weeks it will average 5x. And you better be ready for it, or else.

So next time you hear someone asking you about your NFRs, think about what they’re really asking you. Think about the scope of the question, and plan for it.

by Leon Rosenshein

Toasters

Toasters are pretty simple. Put in a slice of bread. Push a knob/lever and a few minutes later your toast pops up, with some approximation of the level of doneness you want. If you’ve got a commercial toaster with a conveyor belt you don’t even need to push the knob.

But it’s only an approximation of the level of doneness you want. And it seems to drift over time. And with different breads and slice thicknesses. And bagels, which should only be toasted on one side, are a whole different kettle of fish. Which leads to this story, which is almost 30 years old.

But even the simple toaster is hard to build, if you have to go back to first principles. I don’t mean being a maker in the modern sense, with a shop complete with 3D printers, CNC mills, and laser cutters. I mean first principles. Consider this attempt to make a toaster from scratch. It didn’t exactly work, but it didn’t exactly fail either.

But what has that got to do with us? Just this. On any given day we make lots of choices. Choices about what to build, what to reuse, and what to extend. Choices about building what we need right now vs what we know we’ll need next week/month vs what we think we might want in the future. So we need a framework to make those choices in.

A good place to start that framework is integrated value over time. Which usually means building on existing things instead of building from scratch. Improving things instead of replacing them. But not always. Sometimes a new requirement comes along and the cost of adapting current systems exceeds the cost of making something new. So do it. Carefully. Within the existing framework. And then iterate on that thing as well.

And that’s where it gets really interesting. You find out how well your mental model of the system, as translated into architecture and then code, matches the problem domain. If they match well, it’s easy to add something new. If they don’t, the new thing will expose the gaps between your code and the problem space. Which brings you right back to deciding whether to adapt what you’ve got or build from scratch.

Because in the end, just like building a toaster, the right answer depends on figuring out the right place to start and the right place to go to.

by Leon Rosenshein

Pointers and Errors

Go is pass by value. But it has pointers. And multiple return values. So here’s a question. Let’s say you’ve got a Builder of some kind, but there are combinations of parameters that are invalid, so your builder returns a struct and an error. Simple enough. The caller just checks the returned error, and if it’s non-nil does whatever is needed. If it is nil, just continue on and use the struct.

But what if the user forgets to check the returned error? Then what? They’ve got a struct that might or might not be usable, but it’s certainly not what they asked for. What’s going to happen if they try to use it? Even worse, what happens if they store it for a while, pass it down some long call tree, and then it gets used, and fails, very far from where it was created? That can be a nightmare to debug.

One way to prevent that is by returning a pointer to a struct, instead of a struct itself. Then in the error case you can return a nil pointer and the error. There’s no chance of that nil pointer working. As soon as the code tries to use it you’ll get a panic. Panics aren’t good, but they’re often better than something random that almost works. So let’s always do that. Right?

Not so fast, because now you’re using a pointer to things, and suddenly you’re exposed to spooky action at a distance. If one of those functions deep in the call tree changes something in the struct via the pointer, it’s going to impact everyone who uses the struct from then on. That might or might not be what you want.
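
A minimal sketch of both edges of that trade-off, using a made-up Config type rather than the playground code:

package main

import (
    "errors"
    "fmt"
)

type Config struct {
    Retries int
}

// Value return: the caller always gets *something* back, even on error.
func NewConfig(retries int) (Config, error) {
    if retries < 0 {
        return Config{}, errors.New("retries must be non-negative")
    }
    return Config{Retries: retries}, nil
}

// Pointer return: on error there's nothing usable. Ignoring the error
// and using the nil pointer panics immediately instead of limping along.
func NewConfigPtr(retries int) (*Config, error) {
    if retries < 0 {
        return nil, errors.New("retries must be non-negative")
    }
    return &Config{Retries: retries}, nil
}

func main() {
    c, _ := NewConfig(-1) // error ignored: the zero value silently "works"
    fmt.Println(c.Retries)

    p, _ := NewConfigPtr(-1) // error ignored: p is nil
    fmt.Println(p == nil)    // p.Retries here would panic, right at the call site

    // The aliasing risk: everyone holding the pointer sees the change.
    q, _ := NewConfigPtr(3)
    r := q
    r.Retries = 0          // spooky action at a distance...
    fmt.Println(q.Retries) // ...q sees 0 too
}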

Consider this playground. When you create a Stopwatch you get one back, regardless of whether your offset string is valid (s1) or not (s2). So even if you get an error, .Elapsed() will return a value. You don’t get a panic, but you do get an answer you weren’t expecting.

On the other hand, if you create a *Stopwatch with a valid offset you get one back (s3), but if your offset is invalid (s4, s5) you don’t get one back. As long as you check your error value (s3, s4) you can do the right thing with a *Stopwatch, but if you don’t (s5), boom - panic.

Finally, consider what happens in the timeIt methods. With both the Stopwatch (s1) and the *Stopwatch (s3) you get the right sub-time, but when you get back to the main function and check the overall time, the Stopwatch (s1) gives you the right elapsed time while the *Stopwatch (s3) has been reset and shows only 1 second of elapsed time.

So what’s the right answer? Like everything else, it depends. Don’t be dogmatic. I lean towards pointer returns because it’s harder to use an invalid one, and not pointers for method parameters unless I want to mutate the object. But that can change based on use case and efficiency. What doesn’t change is the need to think about the actual use case and make an informed decision.

by Leon Rosenshein

Understanding

Quick. What happens when someone passes a poorly formatted string to your interface? The interface notices and responds with the correct error code. Great. What if they set the colorize_output and the no_output flags? Do you colorize nothing? Arguably correct, but also arguably wrong, and probably indicative of a confused user. You can, and should, be testing all those cases, and others, and have well defined answers. Unit testing can help you avoid lots of customer issues. But that's just based on your interpretation, not your customer's.
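
Here's a sketch of what a well defined answer can look like, using the flag names above (the validation logic is an assumption, one reasonable choice among several):

package main

import (
    "errors"
    "flag"
    "fmt"
    "os"
)

// validateFlags turns a confused combination into a well defined error
// instead of a guess about what the user meant.
func validateFlags(colorize bool, noOutput bool) error {
    if colorize && noOutput {
        return errors.New("colorize_output and no_output are mutually exclusive")
    }
    return nil
}

func main() {
    colorize := flag.Bool("colorize_output", false, "colorize the output")
    noOutput := flag.Bool("no_output", false, "suppress all output")
    flag.Parse()
    if err := validateFlags(*colorize, *noOutput); err != nil {
        fmt.Fprintln(os.Stderr, err)
        os.Exit(2)
    }
}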

Then there's the whole "You gave me what I asked for, not what I wanted" situation. Again, you're arguably correct, but you don't have a happy customer. Remember, your customers want a problem solved, not a bag of features. So unit testing isn't enough. You need to do more, which leads to:

"Testing leads to failure, and failure leads to understanding." - Burt Rutan

Understanding what? Not just the specific feature on the ticket you're implementing. Understanding your customer's problems. Their workflow. Their environment. And their expectations. The kind of understanding you get from testing the thing you're building in the environment it will live in. That doesn't mean that you don't do unit testing or integration testing. Instead, it's another kind of testing you need to do.

And it's not just throw-it-over-the-wall testing, although that has value too. At least at first, you need to work with your customer to understand how they're using it. You need to help them understand how you expect it to be used. And together you come to understand the differences between the two. Then you can close the understanding gap.

You need to interact with your customer. You need to interact frequently to keep cycle times low. You need to interact deeply because you need to understand not just the surface issues, but why they're issues. And that kind of understanding requires a lot of information to be transferred. So you want high bandwidth communications. Bug reports might tell you what, but they're not too good at the why. So in person if possible, or at least screen share.

Because you can't solve a problem you don't understand, and you can't understand your customer without really seeing how they work. And that applies to all customers, internal and external.