
by Leon Rosenshein

It Happens

"That hardly ever happens is another way of saying 'it happens'."

  -- Douglas Crockford

When someone says “That hardly ever happens,” there are two ways to approach the situation. The first is to take it at face value and put just enough effort into the rare case to ensure the system continues to operate, even if that little part fails. Like adding timeouts and retries to network requests. Something might be delayed or a user might have to click a button again, but things still happen. And the happy path stays happy.
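Here's a minimal sketch of that first approach, a timeout plus a few retries wrapped around a request. The names (call_with_retries, request, backoff_s) are made up for illustration, and the request callable is assumed to accept a timeout and raise TimeoutError or ConnectionError when the network misbehaves.

```python
import time


def call_with_retries(request, attempts=3, timeout_s=2.0, backoff_s=0.1,
                      sleep=time.sleep):
    """Call request(timeout_s) up to `attempts` times.

    `request` is a hypothetical callable that takes a timeout in seconds.
    Only timeouts and connection errors are retried; anything else is
    treated as a real failure and propagates immediately.
    """
    last_error = None
    for attempt in range(attempts):
        try:
            return request(timeout_s)
        except (TimeoutError, ConnectionError) as err:
            last_error = err
            if attempt < attempts - 1:
                # Exponential backoff: wait longer before each new try.
                sleep(backoff_s * 2 ** attempt)
    # Every attempt failed, so surface the last error to the caller.
    raise last_error
```

The happy path pays almost nothing for this, and the rare slow request costs the user a short delay instead of an error.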

The second is to focus on those “rare” events. Put just enough effort into tracking the happy case so you know things worked and spend the rest of your time dealing with the outliers. Like silent failures when writing data to disk. Consider BatchAPI, which, like its predecessors, handles millions of tasks per week, sometimes per day. Even with six 9s of reliability, at those scales you’re going to have multiple tasks failing every day. And those are the ones that people care about. The ones people want to dig into and understand what went wrong.
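To put rough numbers on that, here's a back-of-the-envelope sketch. The task volumes are assumptions picked to match “millions per day”, not real BatchAPI figures, and six 9s is taken to mean one failure per million tasks.

```python
def expected_failures(tasks, nines):
    """Expected number of failed tasks at a reliability of `nines` nines.

    Six nines (99.9999%) means one failure per 10**6 tasks on average.
    """
    return tasks / 10 ** nines


# Hypothetical daily volumes, for illustration only.
for tasks_per_day in (1_000_000, 5_000_000, 10_000_000):
    failures = expected_failures(tasks_per_day, 6)
    print(f"{tasks_per_day:>10,} tasks/day -> {failures:.0f} expected failures/day")
```

At an assumed 5 million tasks a day that's about 5 failed tasks every day, each one belonging to someone who wants to know what happened.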

In cases like that, in platform code, “hardly ever happens” is the place where you need to focus. As much as scale is important, error handling is more important. Both internal and external. Internal, platform, errors get handled inside the system, and ideally never impact the user (other than potentially a delay in seeing the result). That needs to be rock solid. Redundant. Failsafe. Because, like those disk write errors, something will happen. It’s up to the platform to make sure there’s no impact to users. And at the very least, the platform should clearly take responsibility for the error when it fails.

On the other hand, when a user’s bit execution fails, then what? It’s a successful failure. All of the tooling and framework code did what it should, and correctly, but the user’s code failed. Now you need to help the user. What failed? Why did it fail? How can it be reproduced? Was it a transient problem and a retry will just work?
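One way to picture a “successful failure” is a task runner that records who owns the failure and answers those questions for the user. This is a sketch only, not how BatchAPI works. PlatformError, TaskReport, and run_task are hypothetical names, and treating timeouts and connection errors as transient is an assumption.

```python
import traceback
from dataclasses import dataclass


class PlatformError(Exception):
    """Hypothetical: raised when the platform itself, not user code, fails."""


@dataclass
class TaskReport:
    succeeded: bool
    owner: str = ""         # "user" or "platform": who owns the failure
    error_type: str = ""    # what failed
    message: str = ""       # why it failed
    retryable: bool = False  # was it transient, so a retry might just work?
    details: str = ""       # traceback, to help reproduce the problem


# Assumption: these error types usually indicate a transient problem.
TRANSIENT = (TimeoutError, ConnectionError)


def run_task(user_fn, *args):
    """Run user code and describe any failure instead of just crashing."""
    try:
        user_fn(*args)
    except PlatformError as err:
        # The platform takes responsibility for its own failures.
        return TaskReport(False, "platform", type(err).__name__, str(err),
                          True, traceback.format_exc())
    except Exception as err:
        # The tooling worked; the user's code failed. Tell them what and why.
        return TaskReport(False, "user", type(err).__name__, str(err),
                          isinstance(err, TRANSIENT), traceback.format_exc())
    return TaskReport(True)
```

The point of the sketch is the report, not the runner: every failure comes back with an owner, a reason, and a hint about whether retrying is worth it.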

As a platform you need to think about these things and how to respond. Because when you get down to it, the platform is really just a giant error handling system. And regardless of how rarely it happens, it will happen.

by Leon Rosenshein

Slacking Off

Firefighters, especially professional ones, have buckets of slack time. According to one study firefighters should have less than 25% utilization (time responding to incidents) to avoid burnout. 75% slack time. Built into the fabric of the system. If they don’t have that much slack time they scale up. Cities build more firehouses, buy more engines, and hire more firefighters to get that slack time back.

That’s probably a reasonable goal for your on-call person. First of all, you want your on-call to be fresh if something happens. Second, slack time doesn’t mean idle time. It means there’s not something specific scheduled to be done. There’s always plenty of maintenance work to be done, so the on-call isn’t idle. Things like updating runbooks, alerts, and documentation, automating common on-call tasks, and digging into perennial trouble spots.

But what about the rest of the team? No slack time there. That’s the way to get the most done, right? Wrong. I don’t know about you, but I know the planning I’ve been involved with does a pretty good job of spec’ing out the known knowns, and identifying the known unknowns, but the unknown unknowns and the things we know that just ain’t so always come up.

Even with perfect planning and no unknowns you need some slack. Vacations. Injuries. Illnesses. Outages (other people’s). All of those things, and more, say that if you schedule 100% of the time you’re not going to complete your plan. That seems to be true even if you follow Hofstadter's Law.

One option is to not commit to anything. Just work on the most important thing at any given moment. Things take as long as they take, but you’re never late. That works really well at a small scale, when there’s only one person/group deciding what the most important thing is. And that thing doesn’t change often while you’re working on something. But when the most important thing changes a lot, and the cost of your context switch is high, that leads to lots of churn and peanut buttering your progress. And that ignores anyone potentially waiting for you to be done.

If there are others depending on your work being done by a certain time and you miss then they’re going to miss as well. And their dependencies will miss something. Small delays compound and you end up with long delays. So we do need to make commitments to dates and hit them. Especially when working in a deeply interconnected system (which we do).

Which brings us back to having enough slack time. Or conversely, only committing a certain (less than 100%) amount of your time to work against deadlines. It won’t guarantee you meet those commitments, but if you don’t have enough slack, I can guarantee you won’t meet them.

by Leon Rosenshein

Experience -> Expertise -> Wisdom

ex·pe·ri·ence

practical contact with and observation of facts or events.
"he had already learned his lesson by painful experience"

ex·per·tise

expert skill or knowledge in a particular field.
"technical expertise"

wis·dom

the quality of having experience, knowledge, and good judgment; the quality of being wise.
"listen to his words of wisdom"


That’s the typical progression, right? You have experience(s). You gain expertise. You turn that into wisdom. That’s certainly the progression we all want, but is it typical? I’ve talked about the difference between data and wisdom before, but there’s a similar progression that isn’t about raw numbers and understanding.

The 10,000 hour rule, popularized in Outliers, says that to excel at something you need to spend 10,000 hours doing it. We can argue about the number of hours it takes, but what’s really key to that is not the number of hours of experience, but the amount of experience in those hours.

In the US people work about 50 weeks/year, and the workweek is nominally 40 hours long, so each year is ~2000 work hours. By that logic, it would take 5 years to be an expert in whatever it is you do.

Now consider a worker on an assembly line. Putting wheels on a car. After 5 years that person has 10K hours and is an expert at putting wheels on a car. And probably putting nuts on bolts in general. But not putting the muffler on, let alone building a car. Or designing a car. Or driving a car. Because that 10K hours is really the same hour 10K times. It’s critical to getting the car correctly built and out the door, but from an experience standpoint, there’s really not much there.

To be an expert on building cars would require not just 10K hours on one task, but experience with all of the tasks required. Welding the frame. Building sub-assemblies. Installing them. Electrical work, etc. To be an expert in automobile building you need not just hours of experience, but lots of different experiences. So after those 10K hours the worker has seen all of the common things that can go wrong, and many of the uncommon ones. They’ll have worked out ways of dealing with them and be able to work through them. That person would be considered an expert in the field of automobile building.

Wisdom though, includes good judgment. Not just experience or expertise. It requires the ability to learn from your experiences, then realize what you’ve learned might not apply in some cases, and then learn something else. Something more general. Wisdom goes beyond the what and how into the why. You have to understand why something is being done, and be able to generalize how a seemingly unrelated action will have an impact on the end result. Learning how to learn, unlearn, relearn, generalize, and extrapolate is a whole different set of muscles than just being an expert. You could say you need 10,000 hours of being an expert and working with new and different experiences in differing contexts to turn expertise into wisdom.

That applies to knowledge work just as much as it does putting wheels on a car or building cars from parts. First you need to learn the tools. Then you need to spend time doing the work. Not just the same 10 or 100 hours over and over again, but new and different hours. Exposing you to new and different situations and constraints. 10,000 different hours. Which can take longer than 5 years. And that just gets you to the expert level. There’s still plenty of room for growth. And wisdom doesn’t magically appear after 10K hours of being an expert. It just starts small and narrowly scoped. As you gain more experience with your expertise the scope broadens. And that never stops. Your scope isn’t limited to your day job or even the art of software engineering in general.

So as you journey along the path of experience -> expertise -> wisdom keep looking for opportunities for new experiences. To expand your expertise and wisdom. To expand your scope.

by Leon Rosenshein

Is That An Error?

“. . . the errors are errors now, but they weren’t errors then.”

    -- Marianne A. Paget, The Unity of Mistakes: A Phenomenological Interpretation of Medical Work.

I’ve talked about tech debt and agility before. It’s the choices you make to get value sooner by pushing work out to the future. And it’s absolutely a good idea. When done in moderation and at the right time. Just like taking on any other debt.

And I’ve also talked about decision making. How the process of making decisions and their quality is related to, but not the same as, the outcome of the decision. We should be making the best decisions we can with the information we have when we need to make the decision.

Communication is important. That means definitions are important. It’s important to label things correctly so that everyone has the same, shared understanding of the situation. Which brings me to my point.

Not everything that isn’t the way it should be now is tech debt. We learn new things all the time. When what we know changes we need to respond to it. That often adds work, but it’s not tech debt.

Consider this scenario. I’ve been at least tangentially involved with computer graphics for a long time now. Simulation, games, 3D mapping. When I started the limiting factor was geometry transforms. Doing all that floating point math to figure out which pixels the corners of a triangle mapped to on the screen took a long time. Especially compared to the simple integer math involved with filling the triangle once you had the corners. So we built our models with as few polygons as possible. Did aggressive culling by knowing that from one octant it was physically impossible to see some of the model, so don’t even try to draw that triangle. And z-buffering was expensive too, so we used the painter’s algorithm and spent a lot of time figuring out how to time drawing the model in the correct order, from relative back to front and just overwriting things.

Then the world changed and we got transform engines in silicon. Chips designed to do that math fast. And suddenly we were fill rate limited. So all of the old models and the rendering engines were suboptimal. We needed to rebuild them. Some people called that tech debt. But it’s not. It’s new work. It’s an opportunity to add value based on new information and capabilities. In fact, not doing the work to take advantage of the new capabilities and releasing the next version without supporting them would be adding tech debt.

And building models and rendering engines that took the transform limit into account was the right thing to do. When the models and engines were built that was the way to get the most performance from the system. Doing anything else would have been a bad decision, because we had information showing us that more polygons and higher framerate added value to the customer. Doing anything else at the time would have been an error.

But needing to do the work to support the new graphics cards was neither tech debt nor caused by a bad decision. We needed to do it because the environment, the context changed. The work needed is the work needed. You still need to do it. But don’t put the wrong label on it just because the terms are handy. Be honest with yourself and each other. Saying someone made the wrong decision because the world changed and now there’s a better choice doesn’t help anyone.

by Leon Rosenshein

Empowered?

That’s what we all want, right? Managers should empower their teams. Individual contributors should be empowered to design the things they’re responsible for. That sounds right, but is it? When I hear about people being “empowered” I start to wonder. Before you think I’m going full-on Taylorism on you, think about the word empower.

Empower:


  • to give power or authority to; authorize, especially by legal or official means:
    I empowered my agent to make the deal for me. The local ordinance empowers the board of health to close unsanitary restaurants.


  • to enable or permit:
    Wealth empowered him to live a comfortable life.

Giving power, authorizing, permitting. Is that really what you want? It seems kind of backwards to me. We’ve all got roles and responsibilities. We’ve been put in them because people have made the conscious decision that we can execute the roles and meet the responsibilities. But if you have to be empowered to do something, what was the situation before you were empowered? Did you have responsibility without authority? Were you overly constrained?

Don’t get me wrong. There are many times where being empowered is the right thing to do. Delegating decision authority to someone that normally wouldn't have it can have lots of benefits. The person with the temporary authority gets to stretch and learn. The person delegating frees up time to work on something else. And if you empower the right person the decision will get made closer to the action and with more context. Wins all around.

But when it comes to empowering a team or individual to do their job, I think using the term “empower” is wrong. It sends the wrong message. It says “This isn’t really your job. Normally you should do what you’re told, but in this case I’m giving you special permission to do more.” It implies that the person doing the empowering has all the power and is relinquishing it.

I believe that’s almost always not the case. When someone tells you that you’re empowered to do your job, what they really mean is that your job includes making a set of decisions, and you should do your job, making all of the decisions that are required. But the choice of words matters. It sends a message. And we should all be careful of the message we’re sending.

So next time you feel the urge to use the term “empower”, think about what message you’re trying to send. If you’re delegating, then empower away. If, on the other hand, you are trying to encourage someone to use the authority they already have, be explicit. Say something more like “What do you think we should do? It’s always been up to you.”

by Leon Rosenshein

Cost Of Time

“I have only made this code more complex because I have not had the time to make it simpler.”

   -- Grady Booch

Getting things right takes time. Making things simple takes more time. The question is, how much time do you have, and when do you have it? Expediency says make it work and move on. But that leaves you with the complexity. And complexity leads to drag.

The question is, how do you deal with the drag? That’s easy. You put in the time to make it simpler, less complex, easier to work with. But when? When you need it, of course.

The trick is figuring out when you need it. You don’t need it right after you implement and release it. It’s working, so doing extra work is taking away from adding more customer value. It’s probably not when you need to make the first change. Chances are the change is small and specific. You haven’t lived with the system that long so you still don’t have a good understanding of system interactions yet. So you make the simple change. You go through that cycle a couple of times and notice that it’s getting harder to make the simple changes.

That’s when it’s time to put the effort in to make things simpler. At some point the drag of the system means the amortized cost of making the system simpler and cleaner, over the next N changes, is lower than just making the changes as they are needed. Once you make things simpler everything is faster.
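As a rough illustration of that break-even point, here's a sketch that assumes the cost per change is constant before and after the cleanup. That's a simplification, since real drag tends to keep growing. The function name and the numbers are made up.

```python
import math


def break_even_changes(cleanup_cost, cost_per_change_now, cost_per_change_after):
    """Smallest N where cleaning up first is cheaper over the next N changes.

    Cleaning up wins when:
        cleanup_cost + N * cost_per_change_after < N * cost_per_change_now
    All costs are in the same unit (say, engineer-days).
    """
    saving = cost_per_change_now - cost_per_change_after
    if saving <= 0:
        return None  # the cleanup never pays for itself
    return math.floor(cleanup_cost / saving) + 1


# Hypothetical numbers: a 10-day cleanup that takes each change from 3 days to 1.
print(break_even_changes(10, 3, 1))
```

With those made-up numbers the cleanup pays for itself by the sixth change: 10 + 6×1 = 16 days versus 6×3 = 18 days without it. The harder each change has become, the sooner that point arrives.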

And that’s how you maintain velocity. Not by adding people to a team. That just increases the number of connections and slows things down. It’s not by doing more BDUF. You don’t know what you don’t know, and that just causes more rework. And it’s certainly not by taking on more tech debt. Over the long run the interest on the debt will make it so hard to get things done that nothing happens.

So don’t let perfect be the enemy of the good. Make it work, however complex it needs to be to work. But when the time is right, go back and make it simpler.

by Leon Rosenshein

Engineering Excellence

I maintain that we are engineers, and I’m not the only one. And of course, we all want to be excellent to each other. But what is engineering excellence, really?

It’s not doing code reviews or design docs or outage post mortems, but those are all part of it. It’s not SOLID or DRY or KISS, but those are part of it too. And it’s not staging environments, release processes, or alerting. Or at least it’s not just any of those things.

Engineering excellence, and especially a culture of engineering excellence isn’t about the things you do. It’s about why you are doing those things. It’s the difference between trying to ensure bugs aren’t released with the product at the end and trying to ensure bugs aren’t added to the product during development. A sufficiently rigorous late stage QA process can make sure bugs will never be released, but it can’t ensure you actually release something. Conversely, if you make sure bugs aren’t added (or live long if they are) you can release whenever you want.

It’s engineering excellence that takes you from executing all of the processes in the world because you’ve been told that process makes you better to “We care enough to do the best” so we want to do all those things which let us know that we’re doing excellent work. Engineering excellence is about craftsmanship and effectiveness. It’s about fit and finish. It’s about doing things in a way that makes things easier in the future, not what’s expedient now. It’s about doing things in a way that gives you more average velocity, not more instantaneous velocity.

What you end up with is pretty impressive as well. You end up being both happier and more productive. You do things you didn’t know you could, and before you expected to be able to do them, so more value is released to your customers. You end up happier when you do it as well. You find that the little things that annoyed you are gone. There’s less friction when you try to do something. The time you used to sit around waiting for something to finish is gone.

It happens at multiple levels, not just the individual. The same thing happens at the organizational level. Building roadmaps not just for the quarterly review, but with customers so that the roadmap is built around customer needs. Collaborating with partners to reach a goal quicker. Having processes and norms to make things easier instead of using them as a constraint. A smoothing of the rough edges between and around teams and goals so they work together better.

And of course, not a simple copy of some other company’s ideas, but done in a way that fits our context.

by Leon Rosenshein

Context Matters

I’ve talked about contexts before. The contexts of what you and your audience know. The ubiquitous language of bounded contexts. And the shared context of experience. There’s another kind of context that I haven’t talked about. The kind of context that comes with size and environment.

Because size and environment change things. In a company of 10 people it’s possible for everyone to know everyone else. And for everyone to know as much (or as little) as they want about everything else. A company of 100 people can probably do it, if they work hard at it. In a company of 1000 people it won’t work.

When TK started Uber, in one city with a few people, the size of the company was right and the environment he did it in was ready for what he was doing. Trying the same thing today, in this context won’t get you far.

Consider three very large tech companies: Microsoft, founded in 1975, Amazon, founded in 1994, and Google, founded in 1998. They all have a search product. They all have a large part of their company dedicated to building and selling cloud services. They are all considered successful. But while their product portfolios have a lot of overlap, if you look under the covers no one would mistake one of them for either of the others. And if you tried to transplant a specific process or piece of the tech stack from one to another it wouldn’t work.

There are books by company insiders about how Microsoft and Google do software development. In both cases they’re distillations of hard-won lessons at the respective companies. And they’re different. Because they evolved over time in a specific context. A context that includes things like company processes, number of employees, company wide and org specific tools, and very different leaders at the top.

Which is a long way of saying that while there’s a lot to learn from the successes (and failures) of other companies and the tools and processes they used, we, or any other company for that matter, can’t just blindly apply those lessons and expect them to work as well.

Instead, we should view the experiences of others through the lens of our context and apply those learnings. Building the tools and processes, the culture, that works in our context.

by Leon Rosenshein

Measurement

"What gets measured gets managed - even when it’s pointless to measure and manage it, and even if it harms the purpose of the organization to do so."

Peter Drucker

"It is wrong to suppose that if you can’t measure it, you can’t manage it – a costly myth."

W. Edwards Deming

“Managers who don’t know how to measure what they want settle for wanting what they can measure.”

Russell Ackoff

"Tell me how you measure me, and I will tell you how I will behave."

Eliyahu Goldratt

You’ve probably heard shortened versions of the first two. “What gets measured gets managed” and “If you can’t measure it, you can’t manage it”. Now compare the common, well known, shortened version to the full quote. In context the meaning is very different. But that never stopped a good quote from entering the zeitgeist. Or putting the new cover sheet on all your TPS reports. And it’s not just conventional wisdom. I’ve been in multiple training sessions over the years where the shortened, incorrect versions were used. That’s a real problem because all it does is cement the problem.

Beyond the misunderstanding, the real problem with the short versions is that they lead directly to the second two quotes. Measuring how much value a team has added, especially when a company is pre-revenue, is hard. You can’t look at a bump in sales or monthly average users. There are no downloads to count or even user comments to look at. What you can do though is measure activity. Like tickets opened/closed, PRs landed, or documents written. Things like that are simple and easy to measure. And get you a new minivan.

As is often the case, the antidote to incomplete understanding is to go back and look at what was actually said. Don’t just measure something because you can. First, measure something because the information conveyed by the measurement actually helps you manage things in the direction you want to go. If it doesn’t, stop measuring it.

Second, manage the things you need to manage. Some things, like the average response time of an RPC, are measurable. It makes sense to track them. And do something when the response time is demonstrably above a threshold. But when you don’t have a direct, explicit measurement for something there’s probably a qualitative measurement. Unmeasurable doesn’t mean unknowable. You can’t measure the happiness or energy level of a team, but you can tell if it’s higher or lower than it has been. That’s important information. Use it.

So instead of measuring what you can and then managing that data, manage so that value is added. So that, even when no one is watching, the decisions that get made all day long, the little tiny ones, are made in a way that value gets added. Which means we’re back to talking about culture. Again.

by Leon Rosenshein

Broken Windows

A system that breaks randomly but frequently…and requires very short bursts of work to resolve the issue… will sap the morale of a team far out of proportion to the “size” of any ticket

-- John Cutler

When you overload a database and it falls over, no one is surprised that things stop. When that happens the team(s) involved dig in, pick it back up, and get things going again. Next they figure out the series of unfortunate events that caused the outage and do something to make sure it doesn’t happen again, or at least warn them before it does. Then they close out the issue and move on.

But sometimes that doesn’t happen. Instead, what happens is that an alert fires, but by the time someone starts looking the alert clears. Or maybe there aren’t any alerts, but throughput drops. Or someone tries to access a different S3 bucket and gets an error. There’s no outage, but there’s always something not quite right.

In many ways those on-call shifts are worse than ones with an outage. Because you never get a chance to catch your breath. Or think about root causes. Or do something about it.

What’s a team to do in that situation? The simple answer is to not ever let it get like that. And maybe that’s possible. But probably not. Because when the rubber meets the road you find out that things aren’t the way you expected. It could be because there was a constraint you didn’t see, or a set of known constraints were reached at the same time. Or there’s just a new use case with very different characteristics. It doesn’t really matter which of those reasons it is or if it’s something different. The situation will arise, so you need to deal with it.

IME the best way to deal with it is to own it. Acknowledge both the pain you’re feeling and that it needs to be dealt with. Acknowledge it as individuals and as a team. Take the time to understand what’s really going on. Then take the time to fix it right. Sure, it’s going to impact near term deliverables. If you spend a few days analyzing, experimenting, and delivering a fix you have slipped your immediate delivery by that many days. Own it. Tell your customers. But also recognize that after you put in the effort you’re going to be in better shape.

Drag will be lower. You’ll move faster. With everything else staying the same you’ll find that pretty soon you’re ahead of where you would have been if you hadn’t taken the time to fix the issue.

And you know what happens if you don’t take care of the issue? You’ve created a precedent. Similar to the broken windows theory. If it looks like things aren't being taken care of they won't be. When the next issue comes up people will look around and see that others haven’t done anything about their issues. Which makes them less likely to do something about the new one. Now you’ve got an even stronger precedent. Which makes it that much more likely that the third issue will be ignored.

[Image: Windows setup screen]

Then one day you pick your head up and realize there’s a culture of not worrying about the little things. That’s not a culture anyone wants. Friction builds up. Progress slows. Deadlines are missed. You see things like this on the big display at the airport that's supposed to tell you when your flight is departing.

So instead, build a culture where the little things are important. Where people and morale are important. Where problems are acknowledged, owned, and acted on. And you’ll find that you not only have happier, more fulfilled people and teams, but you’ll get more done as well.