
by Leon Rosenshein

The Audience Is Listening

There are lots of ways to present information. When you’ve got some information you want to share, it’s your responsibility to share it well. There are lots of things to think about. Of course, you need to make sure the information you want to share is presented clearly, but that’s only the most basic, obvious part.

Another thing to keep in mind is context. The context that you share with the people you’re talking to. Your audience. It’s your responsibility, as the person sharing the information, to ensure that you and your audience have a shared context. That the underlying foundation and framework that supports what you’re sharing is in fact also shared.

One of the big things in that kind of context is jargon. Oxford defines jargon as

special words or expressions that are used by a particular profession or group and are difficult for others to understand.

When you and your audience share those words and expressions you can take shortcuts. You can say one or two words and describe an entire situation. That kind of shorthand can be extremely useful to set the stage and make sure you and your audience are on the same page. As long as they understand the jargon.

On the other hand, if they don’t share the jargon, it makes things worse. Instead of inviting your audience in to join you, incomprehensible jargon pushes them away. It’s a barrier to entry. While the audience is trying to understand what you said, you’ve finished the point and moved on to something else. So not only did your audience not understand the jargon, they probably missed the next point as well.

Which gets us back to the title. They’re in the room. They’re probably paying attention. That, however, is not enough. You need to be listening too. You need to know the audience before you start. If you expect the audience to have enough of the same context, you can start with jargon and use it for clarity and brevity. On the other hand, if you think the audience doesn’t have the context, then you need to avoid jargon. At least at the start.

You also need to read the room while you’re speaking. Your initial understanding could easily be wrong. Is the audience leaning in and nodding? Are they looking confused? Are they asking questions to get you to define your terms, or are they responding with jargon and extrapolating from what you’ve said to what you’re getting ready to say? Confusion, requests for definitions, and an audience that’s already ahead of you all indicate that there’s a mismatch between you and the audience. And if there’s a mismatch between what you’re saying and what they’re hearing, nothing is going to be shared.

It’s on you, as the person trying to share something, to be aware of the audience. To understand what they’re understanding and what they’re not. To adjust yourself to meet them where they are.

Because, of course, the audience is listening, so you should be too.

by Leon Rosenshein

Incident Response

As I’ve mentioned before, I’m a Mechanical/Aerospace Engineer by training, and a plane geek. One thing I’ve noticed is that there’s a lot that software engineers can learn from aerospace engineers and aviation in general. One of those things is incident response.

A few days ago a couple of Alaska Airlines jets headed for Hawaii had tail strikes, within 6 minutes of each other.

Now tail strikes happen occasionally. There’s a procedure for it. The airplane returns to the departure point, gets checked out, and depending on the intensity of the strike, the passengers either continue on the same plane or a different plane is brought in and the passengers leave on that one.

Two in a row (6 minutes apart), from the same airport, with a similar destination, however, is not normal. There’s no defined procedure for it. Instead, Alaska’s director of operations looked at the situation and made a call. He declared an incident. All Alaska flights on the ground would stay on the ground until the issue was understood and mitigated if needed. It’s a safety of flight issue. When you don’t know what’s going on, stop and figure it out.

In this case, it turns out that a recent software update had some kind of issue. The flight planning software that was supposed to figure out weight and balance information and use it to set takeoff parameters had a concurrency/load issue. When operations got busy and built too many flight plans too quickly, it got the weights wrong on heavy flights, leading to incorrect power settings, which eventually led to the tail strikes.

There are lots of lessons to be learned here. And that’s ignoring how the test plan for the software itself didn’t find the bug. First, the system has defense in depth against failures. The flight plan is designed to use less than full power to save fuel and wear and tear on the engines, but it doesn’t use the absolute minimum required. There’s a good sized safety margin built in. The pilots have a checklist involving the length of runway remaining, acceleration, speeds, and altitudes as the takeoff proceeds. If any of those checklist items hadn’t been met on takeoff the takeoff would have been aborted. Even with the margins and checklist, there were a couple of tail strikes, but there was no injury or loss of life, and minimal damage to the planes. One of the planes continued on its way about 4 hours later. The system itself is resilient.

Second, while there are no pre-defined triggers around the number of tail strikes in a short period of time, the director of ops understood how the system works and recognized an anomalous situation. He was empowered to act. So he pulled the Andon Cord to stop things. There was a system in place to stop operations, so things quickly stopped.

The next step was information gathering. It quickly became apparent that the weight info was sometimes wrong. To resume operations the pilots, the line operators, who were most familiar with the weights and balances, were empowered to make a decision. They were told to check the weight, apply a TLAR (That Looks About Right) check, and if they felt there was a problem, to check with operations and get the right number.

22 minutes after operations stopped, they were restarted. Meanwhile work continued on a long-term solution to the problem. They handled things in the right manner. Stop the damage. Mitigate the problem and resume operations. Identify and implement the long-term fix. Figure out how the problem got into the system in the first place and make sure it doesn’t happen again.

That’s not just good incident response in a safety critical system. That’s how you should respond to every incident.

by Leon Rosenshein

Done Done

One of the questions that gets asked a lot when you’re working on something is “When will it be done?” Lots of people want to know, for lots of good reasons. Leaving aside all of the questions around how to estimate (story points, blink estimation, BDUF, etc.), or whether you should estimate at all, there’s still the question of what Done means.

This is where it gets interesting. According to Scrum, the definition of done is

The Definition of Done is a formal description of the state of the Increment when it meets the quality measures required for the product.

– The 2020 Scrum Guide

which says approximately nothing. Just that you’re done when you’re done, with quality. So let’s break it down a little.

If you’re using User Stories you need to get them sized right. Even then, it’s pretty vague, since a user story is a placeholder for a conversation with the user. You’re not done until the user says you’re done. Then you’re done.

If you have a task list you might be in better shape. At least you likely have a list of acceptance criteria (AC). Your goal is then to meet the ACs with as little code as possible. Meet those criteria and you’re done.

Simple, right? Wrong. You still have lots of problems. In the user story case, you don’t know when you’re done until you get there. There is no definition of done, just a recognition of the fact that you haven’t gotten there until you arrive at done. On the plus side, when you’re done in that case you’ve added value for the user.

In the task list case, on the other hand, you know exactly when to stop. You’ve met the ACs. Unfortunately, just meeting the ACs doesn’t mean you’ve added any value. The task is probably one small part of the work needed to do something. Updating a schema. Implementing an endpoint. Automating a manual step. None of those on their own add any value. The work you’ve done for the task probably needs to go through a set of deployment steps before it adds value. Those are probably tasks themselves and suffer similar problems. Doing a deployment without changing anything is good practice, but doesn’t add any value.

Which brings us to what I think done really means. You’re done with something when you’ve added value for someone. The definition of value, and figuring out who that someone is, are things worthy of their own posts in the future. For now though, a simple placeholder for value is “This work makes a user’s task easier.” Who that someone is can be anyone. It could be you, or your team, or a partner team, or even a customer. It really doesn’t matter.

That’s how you define done.

by Leon Rosenshein

Engineering Project Risks

I’ve mentioned Gergely Orosz before. He currently writes the Pragmatic Engineer Newsletter and has worked at Uber, Skype, and other places before deciding to write his newsletter full time. He’s written about lots of interesting things, so check out his newsletter if you get a chance.

For those who haven’t subscribed to his newsletter, he also posts teasers and summaries on Twitter. One of them is about Engineering Project Risks. 7 types of risks and how to manage them.

  1. Technology
  2. Engineering Dependencies
  3. Non-Engineering Dependencies
  4. Missing Decisions/Context
  5. Unrealistic Timelines
  6. Not Enough People/Bandwidth
  7. A “Surprise” Midway Through the Project

All very real and valid risks. And the mitigations are pretty good too. There’s another one though, a combination of numbers 1, 2, and 4, that I wanted to bring up.

It’s the New Domain issue. Consider the case where you’re part of a new team being formed to address pain points in a domain no one on the team is familiar with. Someone with domain experience has identified a very real pain point. They’ve also identified a direction for a technical solution. The team now needs to figure out how to design and implement something that eliminates the pain point(s).

There are multiple things you’re dealing with at once, and all are important. I’d say they break down into 3 major areas:

Team Dynamics

It’s a new team, so all the issues a new team has are in play. The team will go through at least the first 3 stages of Tuckman’s model. Critical at the beginning is building trust. Trust that the idea makes sense. Trust that the team has the support it needs. And most important, trust within the team.

The best way to build that trust is honesty and openness. Everyone saying what they know, and admitting when they don’t. With a foundation of trust the forming and storming is minimized and the team can do its norming.

Domain Context

Not only does the team not know each other, they don’t know the space they suddenly find themselves in. There are folks in the domain that are using existing technology. They’re using jargon and acronyms that the team doesn’t know. They have a set of shared beliefs about what they can’t touch and why.

Here, that lack of context is also one of the team’s greatest strengths. Asking questions to learn, and then sharing the new info across the team helps generate trust within the team and between the team and stakeholders/partners/customers (see above). The lack of context also lets the team look at things in new ways. After all, one of the 4 categories of information is the things we know that just ain’t so. As a new team without context you don’t know those things, so you ask the questions and approach things without those blinders on.

Dependencies

Sure, there’s lots of ambiguity, and there’s no existing code. You probably think you’re working on a greenfield project. But really, it’s a brownfield project. Unless you’re doing something entirely self-contained, you have dependencies. If nothing else, there’s the hardware and operating system you’re using. And way more likely, there’s a lot more you depend on. You’re using libraries and tools and systems that are developed and supported by other teams. Other teams with their own goals and priorities.

Here the lack of context and ambiguity is working against you. You don’t know what you don’t know, so you need to clear up the ambiguity. You need to figure out what you don’t know, then identify how you’ll figure it out. You need to identify the dependencies and their limitations. How you’ll work with them. Again, openness and honesty are your superpowers here. The more open and honest you are, the better the answers you’ll get.

And of course, once you get through the special risks of a new team working on a new project in a new space, you still have to deal with the same set of risks as any other project. For that, I refer you back to Gergely’s list (above).

by Leon Rosenshein

E_INSUFFICIENT_CONTEXT

As I mentioned the other day, I say “It depends” a lot. Then I go on to describe why it depends in that particular case. Yesterday I ran into a very short Mastodon Toot that is the underlying reason it depends.

“It depends” is human for E_INSUFFICIENT_CONTEXT_PROVIDED

The more I thought about it, the more true it became. It’s just that simple. At least the reason “it depends” is that simple. Doing something about it is a whole different thing. To answer the question, you still need the context. That leads to lots of follow-on questions and conversations so you can get that missing context. Once the context is provided the answer changes from “It depends.” to at least “Here’s a good way to start.”, if not “Here’s what’s worked for me in the past and what I recommend you do.”

And it doesn’t just lead to my stock “It Depends” answer. It also leads to my favorite question. “What are you really trying to do?”. That comes from the exact same place. Because that question is really about getting at the missing context. When someone presents you with an XY problem they’re not providing you with the context of the problem they’re really trying to solve, just the context of the blocker they’ve most recently run up against.

There are lots of ways to avoid providing insufficient context. Rubber Ducking is a great place to start. If your rubber duck can understand the problem you’ve probably provided enough context. Another good place to start is Jon Skeet’s Stack Overflow Question Checklist. Even if you’re not asking on Stack Overflow, that was a great checklist 10 years ago and it’s a great checklist now. Especially that last one. After you’ve gone through the checklist, re-read the question and make sure you’ve addressed everything on the checklist.

Providing sufficient context (asking your question well) helps everyone. It’s more than just being a good neighbor, although it is that as well. It reduces the number of back-and-forth cycles required to get enough shared context. Being mindful of others’ time makes it easier for the answerer. Which makes them more likely to have a good answer to your question. Which makes it more likely that you get an appropriate answer to your question. Which is what you really want anyway.

So next time you have a question, make sure the response isn’t E_INSUFFICIENT_CONTEXT_PROVIDED.

by Leon Rosenshein

Don't Be Too DRY

DRY, or Don’t Repeat Yourself, is a good rule of thumb. Like all other rules of thumb though, if you were to ask me if you should remove some specific bit of duplication, my answer would be “It depends”.

On the other hand, one of the Go Proverbs is, as Rob Pike tells us,

A little copying is better than a little dependency

So, when should you stay DRY, and when should you repeat yourself? It depends. It depends on whether or not you’re really repeating yourself or just have the same sequence of characters.

At one extreme you have something like this

numberOfSidesOfSquare := 4
numberOfSeasons := 4
defaultLegsOnDog := 4

There are three things all set to the same value. Strict DRY might have you believe that you should define a constant called FOUR and then assign that to each of the variables, or write a function, setVariableTo4(), that takes a variable and changes its value. But that doesn’t make sense. First, the integer 4 is already constant, so defining another constant is redundant. Second, while the action being taken is the same, semantically each of those lines does something different. There is no logical, domain relationship between the number of seasons, the number of sides of a square, and the default number of legs on a dog. The only thing they have in common is the value, 4.

At the other extreme you have a bunch of classes that need to send out notifications. They all have functions with a signature something like


func sendNotification(smtpHost string, fromEmailAddress string, toEmailAddress string, subject string, body string) error {
	// validate the inputs, build the message, send it through smtpHost
	return nil
}

Some have the arguments in a different order. Some don’t default smtpHost and subject if you pass in nil or an empty string. One doesn’t accept a from address at all. Inside the function some validate the inputs, others don’t. The error messages are different. They use different internal flow. If you look at the sequence of characters in the function bodies, they’re all different. No textual repeating at all. But logically, semantically, they do exactly the same thing. They validate the inputs. They transform the inputs into the right format and structures needed to talk to an SMTP host and send an email. They return an error if there’s some kind of problem sending the email. That’s something you should write once. For multiple reasons.

First, it makes future maintenance easier. If the default SMTP host changes you only need to change it in one place, and you can make sure no-one else even knows about it. If you want to change from email to Slack notification (or SMS or anything else) you can do that in one place. Or do them all. Again, only in one place. Second, from a Domain Driven Design standpoint, notification doesn’t belong in those classes. Yes, the classes need to do the notification, and they probably care who’s notified, but that’s it. They shouldn’t know or care how. It’s an external capability they should be told about.
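One way to picture "an external capability they should be told about" is a small interface that gets handed to each class. The sketch below is only an illustration: Notifier, EmailNotifier, OrderService, and the injected Send function are hypothetical names, not anything from a real library, and Send stands in for the actual SMTP call so the example runs without a mail server.

```go
package main

import (
	"errors"
	"fmt"
)

// Notifier is a hypothetical interface: callers say who to notify and
// what to say, but not how the message is delivered.
type Notifier interface {
	Notify(to, subject, body string) error
}

// EmailNotifier is one possible implementation. The host and from address
// live here, in one place, instead of in every caller.
type EmailNotifier struct {
	SMTPHost string
	From     string
	// Send is injected (an assumption of this sketch) so the example
	// does not need a real SMTP server. It must be non-nil.
	Send func(host, from, to, subject, body string) error
}

// Notify validates the inputs once, applies defaults once, then sends.
func (n EmailNotifier) Notify(to, subject, body string) error {
	if to == "" {
		return errors.New("notify: missing recipient")
	}
	if subject == "" {
		subject = "(no subject)"
	}
	if err := n.Send(n.SMTPHost, n.From, to, subject, body); err != nil {
		return fmt.Errorf("notify %s: %w", to, err)
	}
	return nil
}

// OrderService stands in for one of the classes that needs to notify.
// It is told about a Notifier and never sees an SMTP host.
type OrderService struct {
	Notifier Notifier
}

// Ship does its domain work and then notifies the customer.
func (s OrderService) Ship(customer string) error {
	return s.Notifier.Notify(customer, "Your order shipped", "It is on its way.")
}
```

Under these assumptions, switching from email to Slack or SMS would mean writing another type with a Notify method and handing that to OrderService. OrderService itself would not change.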

Of course, those two are the extremes. For other cases you need to think more deeply about it. Are the things the same semantically/logically? Are they parts of multiple domains? Are they really parts of those domains or are they used by those domains? Should the functionality be extracted and put somewhere else entirely? Would not repeating yourself add some coupling you don’t want? If you have some coupled things, can you break the coupling by repeating yourself?

TL;DR: being DRY is good, but don’t be too DRY. Think about the costs and benefits first.

by Leon Rosenshein

Lies, Damn Lies, And Statistics

When you’re running code in production, metrics are important. You need to know that things are working correctly. You need to know when things are starting to go in the wrong direction before any of your users or customers notice. Metrics can tell you that.

The problem with metrics at scale is that, in order to make sense of them, they’re often aggregates. Mean, median, mode, p90, and p95. All ways to turn lots of numbers into a single representative number. Unfortunately, aggregates like that are very lossy compression. The more variable the data, the less useful things like the average are.

Consider Van Gogh’s The Starry Night. According to John Rauser the average color in The Starry Night is #4C5C6D. That’s interesting, but it really doesn’t say much about the painting. Who wants to look at a colored rectangle?

There’s much more information in the painting, and I’d much rather look at it.

[Image: Van Gogh’s The Starry Night]

Not only is there a lot of missing information, I don’t think that average color even exists in the painting.

Or consider something like a P99 latency. Not only are there many ways for the P99 value to be wrong, but even if it’s accurate you could still easily be missing the point. Statistics can be tricky things, especially percentiles. Sure, your P99 value means 99% of the values are lower, but it doesn’t tell you anything about why that last percent of items is higher, nor does it tell you how much higher.

Imagine you have 10,000 web requests being served by 100 servers. 9,900 complete in less than 30ms, so your P99 is 30ms. That’s a pretty good P99. That’s probably within your SLA, so according to the SLA you’re all set. What that P99 doesn’t tell you is that the other 100 take 5 seconds each. Somewhere in there you’ve got a significant problem. The other thing it doesn’t tell you is that all 100 of the slow requests were handled by the same server. Which means you’ve probably got a routing/hardware problem. Just looking at your P99, you wouldn’t have known there was a problem, let alone that it’s on the physical side, not the software side.
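To make that scenario concrete, here is a minimal sketch in Go. Everything in it is an assumption made for illustration: the Request type, the nearest-rank percentile method, the synthetic latencies, and the choice of server 42 as the bad one. It computes the P99, then groups the slow requests by server, which is the view the aggregate hides.

```go
package main

import "sort"

// Request is a hypothetical record of one served request.
type Request struct {
	Server    int     // index of the server that handled it
	LatencyMS float64 // time to complete, in milliseconds
}

// Percentile returns the nearest-rank p-th percentile (1 <= p <= 100)
// of the request latencies, or 0 if there are no requests.
func Percentile(reqs []Request, p int) float64 {
	if len(reqs) == 0 {
		return 0
	}
	latencies := make([]float64, len(reqs))
	for i, r := range reqs {
		latencies[i] = r.LatencyMS
	}
	sort.Float64s(latencies)
	rank := (p*len(latencies) + 99) / 100 // ceil(p/100 * n) in integer math
	if rank < 1 {
		rank = 1
	}
	if rank > len(latencies) {
		rank = len(latencies)
	}
	return latencies[rank-1]
}

// SlowByServer counts, per server, the requests slower than thresholdMS.
func SlowByServer(reqs []Request, thresholdMS float64) map[int]int {
	slow := make(map[int]int)
	for _, r := range reqs {
		if r.LatencyMS > thresholdMS {
			slow[r.Server]++
		}
	}
	return slow
}

// Scenario builds synthetic data matching the numbers in the text:
// 100 servers with 100 requests each, where every request on one server
// (index 42, chosen arbitrarily) takes 5 seconds.
func Scenario() []Request {
	var reqs []Request
	for server := 0; server < 100; server++ {
		for i := 0; i < 100; i++ {
			latency := 10 + float64(i%20) // 10 to 29 ms
			if server == 42 {
				latency = 5000
			}
			reqs = append(reqs, Request{Server: server, LatencyMS: latency})
		}
	}
	return reqs
}
```

With this data, Percentile(Scenario(), 99) stays under 30ms, while SlowByServer(Scenario(), 30) has a single entry that puts all 100 slow requests on one server. The first number alone looks healthy. The second points at where to look.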

Which is another way of saying don’t just look at the aggregates, look at your raw data and see what else it’s telling you. It can tell you a lot about how things are really going, if you let it.

by Leon Rosenshein

Pass or Fail?

We say a test passes when the results are as expected and that it fails when one or more of them is not. When one or more asserts fails. That makes sense. When you take a test in school and get the right answers you pass. If you don’t, you fail.

When you took that test in school and got the right answers who passed? You, or the test? It’s been a long time, but the last time I took a test in school I got enough correct answers. When I got the results, they said I passed. They didn’t say anything about the test itself passing or failing.

So here’s another framing question for you. Why do we say we run tests on our software when as a student we take a test? Why do we say the tests pass or fail for software when we say the student passed or failed? I’ve got some thoughts on why.

I think it goes back to Taylorism, Legibility, and language. We tried to divide by function for efficiency. We tried to make sure we always had good, centralized metrics to tell us we were on track. We execute the tests, not the software, to get those metrics.

Taylorism is all about assembly lines, decomposing work into separable tasks, and optimizing repetitive steps for efficiency. That doesn’t make sense for writing code, but it might make sense for testing code. After all, the tests are a separable step, and we repeat testing the code over and over again. So Taylorism leads us to considering the tests to be their own thing, completely separate from the code.

Legibility is the idea that the only way to make sure you do things efficiently is to centrally plan, record, and respond to deviations from the plan. To do that you need to have the results of all of the phases. You break the coding into tasks and give them a state. You identify all of the tests and give them a state. Since the tests exist for a long time and, according to the plan, never change, the state of the test is the result of the last time the software was tested: pass or fail.

And then there’s language. In this case, it’s the language of executables. Tests are code. You run code. Tests, especially unit tests, but even integration and beta tests, are code you run. The code doesn’t run the tests, the tests run the code. The Sapir-Whorf Hypothesis says that language influences thought. The language makes us say and think about running the tests to get results instead of testing the software and seeing if the software passes or fails the test(s).

That’s how our frame gets us to talk about running tests and tests passing or failing. It’s interesting, and mostly harmless, but it’s not completely harmless. It’s not harmless because the frame does more than just define the words we use. It also influences how we think of our software and its state. We say the test failed. We don’t say the software failed. That frame makes us think of fixing the test to make it pass, not fixing the software. Sure, we usually fix the software, but since the test is what’s failing it makes complete sense to change the test until it passes. The frame makes us think that might be the right thing to do. But it’s probably not.

So next time you find yourself talking about tests failing, consider reframing it and talk about the software failing the test. Because how you look at things influences what you do about it.

by Leon Rosenshein

Autonomy, Alignment, Purpose, and Urgency

I’ve talked about Autonomy, Alignment, Purpose, and Urgency before. What I haven’t done much of is compare and relate them. Particularly the Autonomy/Alignment and Purpose/Urgency pairs.

Autonomy is great. You get to decide what you (and/or your team) do. You decide what you think is important and you go do it. You expect the work to be impactful, and you get to have fun doing it. Ask for forgiveness instead of permission.

Alignment gets everyone moving in the same general direction. It’s vector math. To make forward progress the sum of the vectors needs to be positive (for whatever positive means for you). To use a buzzword, it’s synergy between teams/groups. Everyone moves forward more than any individual would/could.
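The vector math can be sketched in a few lines of Go. This is a toy model under stated assumptions: Effort and NetProgress are hypothetical names, each person is reduced to a magnitude and a direction, and an angle of 0 means "straight at the goal".

```go
package main

import "math"

// Effort is a hypothetical model of one person's work: how hard they push
// (Magnitude) and in which direction (AngleDeg, where 0 points at the goal).
type Effort struct {
	Magnitude float64
	AngleDeg  float64
}

// NetProgress sums the efforts as vectors and returns the component of
// the total that points toward the goal.
func NetProgress(team []Effort) float64 {
	var toward float64
	for _, e := range team {
		toward += e.Magnitude * math.Cos(e.AngleDeg*math.Pi/180)
	}
	return toward
}
```

Four people pushing equally hard within a few degrees of the goal add up to nearly four units of progress. The same four people pushing at 0, 90, 180, and 270 degrees add up to roughly none, even though everyone worked just as hard.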

Purpose is the ever important why. Not the business purpose, although that’s important, but the internal desire to do something that will have a big impact in an area that’s personally important. As Daniel Pink said in Drive, “All of us want to be part of something bigger than ourselves, something that matters.”

Urgency is about timeliness. When does something need to be done? When a building is on fire the urgent things are to make sure people are safe, make sure the fire doesn’t spread, and put the fire out, in that order. Only after all that is done should you work on preventing the next fire.

The question is, how do they interact?

How much autonomy should you have? How much alignment? When there is neither autonomy nor alignment you end up with Brownian motion. People wander in random directions and overall, not much changes. A lot of alignment and not much autonomy gives you robots. People blindly doing what they’re told. Hopefully it’s the right thing. If not, oh well. Do it anyway since those are the orders. Too much autonomy and no alignment ends up with everyone moving fast, but in random directions, so again, there’s very little overall progress. It’s when there’s a good balance of alignment and autonomy that you get the most progress towards the goal.

Figure: four-quadrant graph. Low autonomy, low alignment shows little motion in random directions. Low autonomy, high alignment shows conformance and everyone moving a little in the same direction. High autonomy, low alignment shows everyone moving farther, but in wildly different and changing directions. High autonomy, high alignment shows everyone converging on the goal.

Then there’s urgency and purpose. They pair a little differently. Urgency is short term. Too much urgency, or urgency for too long, leads to burnout. In gaming, the everything-is-urgent Crunch Time regularly led to people leaving as soon as something shipped. Purpose is long term. It’s sustainable. The more purpose you have the more energy you have to apply. And when you achieve your purpose, you don’t feel exhausted, you feel whole.

Urgency is extrinsic. A sense of urgency is imposed from the outside. It often exhibits as micromanagement, or some seemingly arbitrary timeline. It happens through power differentials. The manager/boss wants something done and makes the team work harder. The team doesn’t have a choice. Work harder or suffer the consequences. Purpose, on the other hand, is intrinsic. It’s personal. It’s empowering. It’s shareable. Others can see and understand your purpose and join you in it.

Where it gets really interesting is how Purpose and Urgency interact with Alignment and Autonomy.

Purpose can drive alignment. If everyone has the same purpose, then they’re much more likely to align on the same goal. Urgency, with its external definition of the goal has the same result. The difference comes from the time horizon. If you’re only caring about the next few hours, you’ll probably get better results from an externally defined goal. If you’re thinking about the long-term results for your customer/user, then shared purpose is going to serve you better.

Purpose also drives you towards autonomy. While everyone might have the same general purpose, it’s unlikely that everyone’s exact purpose is the same. This leads people to group themselves by similarity of purpose, and then want to do what they see as the most important thing. They want autonomy to make those decisions. Urgency, with its externally defined goal and micromanagement, will reduce autonomy.

High urgency, while it can be effective in the short term, has the overall effect of pushing us to the top left quadrant of the graph. Which might be ok in the short term, if you really know what the goal is and no-one else does. But for long term success, it’s much better to have shared purpose and allow matching autonomy that pushes you to the top right quadrant. Which is where you have the best overall outcome.

by Leon Rosenshein

The Immutable Past

Comments are important. I’ve written about them a bunch of times. Comments in docs. Comments in code. Comments in git commits and PRs. As an aside, git commit messages and Pull Request messages are often very different things with different intents.

All the old things still apply. Comment about why you did something, not what. Comment about what you decided to not do. Comment about when you might revisit the decision. The kind of comments that don’t go out of date as soon as someone else touches the code.

A quick heuristic that can help you know if you’re making a comment that won’t go out of date is to see if the comment makes sense if you prepend it with something like “On <today’s date> we wanted/needed/decided to/chose not to …”. That’s because the past is fairly immutable. The code (at the moment) says what was decided, and the comment says why.

The present, on the other hand, changes a lot. Our understanding of the domain will grow. The situation changes. What seemed correct then is wrong now. But the thoughts and reasons behind the decision didn’t change. They were valid when we made them, and that hasn’t changed, so those comments are still correct.

But now we know more, whether through deeper understanding, the situation changing, a shifting goal, or something else entirely. So we want/need to change the code to match. And that’s a good thing. That’s being Agile. The situation changed, so we adapted.

We adapt the code, but we don’t need or want to change the comments. They’re recording a moment in time and that moment is still very real and very valid. We might add to the comments with our new understanding and why the old choice is no longer the appropriate one. But again, it’s documenting the why and the why not. Documenting the moment in time, right next to the code it’s relevant to.
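As a rough illustration of what that can look like, here is a hypothetical Go function. The function, the dates, and the reasons in the comments are all invented for this sketch. The first comment records the original decision and still reads correctly with the "On <date> we decided" prefix. The second adds the new understanding without rewriting the first.

```go
package main

import "time"

// maxDelay caps how long RetryDelay will ever ask a caller to wait.
const maxDelay = 8 * time.Second

// RetryDelay returns how long to wait before retry number attempt (0-based).
//
// On 2023-01-10 (hypothetical date) we decided to use a fixed 1 second
// delay instead of exponential backoff, because the only caller was a
// nightly batch job and simplicity mattered more than latency. Revisit
// if an interactive caller shows up.
//
// On 2023-06-02 (hypothetical date) an interactive caller showed up, so
// the fixed delay is no longer the appropriate choice. The delay now
// doubles per attempt, capped at maxDelay so that one retry never keeps
// a user waiting longer than that.
func RetryDelay(attempt int) time.Duration {
	d := time.Second
	for i := 0; i < attempt && d < maxDelay; i++ {
		d *= 2
	}
	if d > maxDelay {
		d = maxDelay
	}
	return d
}
```

Neither comment says what the code does line by line, and neither lists the diff. The code changes themselves are left to version control.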

What you don’t want to do is turn your comments into a changelog. After all, we have git (or whatever version control you’re using) to record the actual code changes.