
by Leon Rosenshein

Enter The Conversation

“Always enter the conversation already taking place in the customer’s mind.” ~Robert Collier

Interesting quote. It comes from the sales and marketing world. Basically, it says that your message will be heard more clearly if what you’re saying fits into what your customer is already thinking than if it’s disconnected.

That makes sense. Sure, sometimes a cold call really does pique your interest and get you to follow up. The more common response to a cold call, however, is to just say “Thanks. I’ll think about it.”, then promptly forget about the entire encounter. Which, as a marketer, is the exact opposite of the desired response.

It’s a good point, and it explains why targeted marketing is a thing and why the online advertising market is so big. Advertising your sneakers to someone who appears to be looking for sneakers has a much higher likelihood of turning into a sale than advertising your sneakers to someone who’s sad and is looking for a funny cat video to cheer them up.

What has that got to do with software though? Of course, it applies to marketing software. But it also applies to UI design. And documentation. And API design.

Why are people using your API? They’re (probably) not using your API because they like you and using it is fun. They’re using it to get a job done. That’s where they are.

That means that your API is a conversation with your customer and that the goal of the conversation is to get the job done. Your API should be written in a way that makes it easier to get the job done. It should be consistent. It should speak your customer’s language. It should hide complexity, not create it.

The key is having the conversation at the right level of abstraction. Let’s say you’re writing an API that a website can use to display and manage a user’s travel schedule. The API you build will be somewhere between a single function that lets the user make a single SQL call and a single function that returns the HTML to be rendered. You could do either one of those, but the former is really the API you’re going to call when you write a new API, and the latter is what the website is supposed to do.

Instead, think of it from the user’s point of view. What do they know, and what do they want to do? They know the end user they’re dealing with. They know if they want to display the existing trips, show the detail for a specific trip, add a trip, or possibly change a trip. That’s the information that tells you what your API should do. It also gives you good insight into what you should call the functions, what the arguments are, and what the return values should be. That’s the conversation you’re having with your customer. In their terms, in their language.
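To make that concrete, here’s a sketch of what such an API might look like in Go. All of the names (Trip, TripService, and so on) are assumptions for illustration, not a real library. The point is that the functions speak in the caller’s terms, trips and destinations, not SQL rows or HTML.

```go
package main

import "fmt"

// Trip is a hypothetical domain type. The fields speak the customer's
// language: destinations and dates, not rows and columns.
type Trip struct {
	ID          string
	Destination string
	Start, End  string // ISO dates, kept as strings for the sketch
}

// TripService is shaped around what the caller wants to do: list trips,
// look at one, add one, change one. No SQL, no HTML.
type TripService interface {
	ListTrips(userID string) ([]Trip, error)
	GetTrip(userID, tripID string) (Trip, error)
	AddTrip(userID string, t Trip) (Trip, error)
	UpdateTrip(userID string, t Trip) (Trip, error)
}

// memoryTripService is a toy in-memory implementation, just enough to
// show the shape of the conversation.
type memoryTripService struct {
	trips map[string][]Trip
}

func newMemoryTripService() *memoryTripService {
	return &memoryTripService{trips: map[string][]Trip{}}
}

func (s *memoryTripService) ListTrips(userID string) ([]Trip, error) {
	return s.trips[userID], nil
}

func (s *memoryTripService) GetTrip(userID, tripID string) (Trip, error) {
	for _, t := range s.trips[userID] {
		if t.ID == tripID {
			return t, nil
		}
	}
	return Trip{}, fmt.Errorf("trip %q not found", tripID)
}

func (s *memoryTripService) AddTrip(userID string, t Trip) (Trip, error) {
	s.trips[userID] = append(s.trips[userID], t)
	return t, nil
}

func (s *memoryTripService) UpdateTrip(userID string, t Trip) (Trip, error) {
	for i, old := range s.trips[userID] {
		if old.ID == t.ID {
			s.trips[userID][i] = t
			return t, nil
		}
	}
	return Trip{}, fmt.Errorf("trip %q not found", t.ID)
}

func main() {
	var svc TripService = newMemoryTripService()
	svc.AddTrip("user-1", Trip{ID: "t1", Destination: "Hawaii"})
	trips, _ := svc.ListTrips("user-1")
	fmt.Println(len(trips), trips[0].Destination)
}
```

Notice that the website could render these results however it likes, and the implementation is free to change its storage entirely. The conversation happens at the level the caller is already thinking at.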

Just don’t forget Hyrum’s Law 😄

by Leon Rosenshein

Let Me Tell You What I Want

I’ve talked about user stories and ISBAT before. The idea is that a user story should be focused on something the user wants to accomplish. The acceptance criteria (AC) should show how the result adds value to the user. While the ACs are about the definition of done, they’re not about the steps along the way.

Very often, however, user stories, especially on backlogs and on teams that work as groups of individuals, turn into tasks. One of the biggest reasons for that is that while ISBAT makes it clear what the end state is, it doesn’t really define the goal. Consider this user story.

As a customer, I should be able to check out.

The user is defined. The end state is defined. What more could you want in a user story? Consider this slightly extended user story.

As a customer, I should be able to check out, so that I get the product I picked.

The beginning of the story is exactly the same. The only change is adding the so that part at the end. It might not even change how you end up implementing the story. And if it doesn’t change anything, why write it down? It’s just syntactic noise with no added value.

However, there is added value. It’s slightly hidden, but there’s actually a lot of value there. The first version of the story is really just a task for the implementer. The second version is a task with the costs and benefits for the user stated. The benefit is that the user gets the product they picked. The cost is that they have to check out. When you recognize that, you’ve put a cap on how much the “cost” to the customer can be. It has to be less than the value they receive. If the cost is less than the value, then you have a happy (and hopefully repeat) customer. If it’s higher, you might have a one-time customer because of their sunk cost (they’ve already picked the product and done whatever else, so it’s worth it at that point), but you’re not going to have a repeat customer. From the customer’s standpoint there’s no value proposition.

Your goal, when coming up with user stories, is to ensure the story highlights the user’s value proposition without constraining the folks doing the implementation. It lets the implementors work backwards from the user’s value proposition and maximize it.

If instead, you use this version of the user story,

As a customer, I should be able to get the product I picked.

you empower the folks implementing the solution to come up with a one-click solution. If you already know the user’s payment and shipping information you can have two buttons on the web page. One has the traditional checkout flow. Verify quantities, payment info, shipping info, and whatever else you need. And you can have a second one that says something like “Instant Purchase”, which takes those defaults, processes the order, and responds with an order ID and a tracking number.

That adds a lot of value for the user. It gives them the choice to pay the price in time and effort to go through the standard flow, or to save that time and effort. To have a way to make the purchase with minimal time and energy cost. It also gives the user more control over the process. Which adds more value.

Which is why you need to be careful to not only make your user stories actually be user stories, not tasks, but also to keep them focused on adding value and improving the user’s cost/benefit ratio.

by Leon Rosenshein

The Audience Is Listening

There are lots of ways to present information. When you’ve got some information you want to share, it’s your responsibility to share it well. There are lots of things to think about. Of course, you need to make sure the information you want to share is presented clearly, but that’s only the most basic, obvious part.

Another thing to keep in mind is context. The context that you share with the people you’re talking to. Your audience. It’s your responsibility, as the person sharing the information, to ensure that you and your audience have a shared context. That the underlying foundation and framework that supports what you’re sharing is in fact also shared.

One of the big things in that kind of context is jargon. Oxford defines jargon as

special words or expressions that are used by a particular profession or group and are difficult for others to understand.

When you and your audience share those words and expressions you can take shortcuts. You can say one or two words and describe an entire situation. That kind of shorthand can be extremely useful to set the stage and make sure you and your audience are on the same page. As long as they understand the jargon.

On the other hand, if they don’t share the jargon, it makes things worse. Instead of inviting your audience in to join you, incomprehensible jargon pushes them away. It’s a barrier to entry. While the audience is trying to understand what you said, you’ve finished the point and have moved on to something else. So not only did your audience not understand the jargon, they probably missed the next point as well.

Which gets us back to the title. They’re in the room. They’re probably paying attention. That, however, is not enough. You need to be listening too. You need to know the audience before you start. If you expect the audience to have enough of the same context, you can start with jargon and use it for clarity and brevity. On the other hand, if you think the audience doesn’t have the context, then you need to avoid jargon. At least at the start.

You also need to read the room while you’re speaking. Your initial understanding could easily be wrong. Is the audience leaning in and nodding? Are they looking confused? Are they asking questions to get you to define your terms, or are they responding with jargon and extrapolating from what you’ve said to what you’re getting ready to say? All of those signals tell you whether you and the audience are in sync. And if there’s a mismatch between what you’re saying and what they’re hearing, nothing is going to be shared.

It’s on you, as the person trying to share something, to be aware of the audience. To understand what they’re understanding and what they’re not. To adjust yourself to meet them where they are.

Because, of course, the audience is listening, so you should too.

by Leon Rosenshein

Incident Response

As I’ve mentioned before, I’m a Mechanical/Aerospace Engineer by training, and a plane geek. One thing I’ve noticed is that there’s a lot that software engineers can learn from aerospace engineers and aviation in general. One of those things is incident response.

A few days ago a couple of Alaska Airlines jets headed for Hawaii had tail strikes, within 6 minutes of each other.

Now tail strikes happen occasionally. There’s a procedure for it. The airplane returns to the departure point, gets checked out, and depending on the intensity of the strike, the passengers either continue on the same plane or a different plane is brought in and the passengers leave on that one.

Two in a row (6 minutes apart), from the same airport, with a similar destination, however, is not normal. There’s no defined procedure for it. Instead, Alaska’s director of operations looked at the situation and made a call. He declared an incident. All Alaska flights on the ground would stay on the ground until the issue was understood and mitigated if needed. It’s a safety of flight issue. When you don’t know what’s going on, stop and figure it out.

In this case, it turns out that a recent software update had some kind of issue. The flight planning software that was supposed to figure out weight and balance information and use it to set takeoff parameters had a concurrency/load issue. When operations got busy and built too many flight plans too quickly it got the weights wrong on heavy flights, leading to incorrect power settings, which eventually led to the tail strikes.

There are lots of lessons to be learned here. And that’s ignoring how the test plan for the software itself didn’t find the bug. First, the system has defense in depth against failures. The flight plan is designed to use less than full power to save fuel and wear and tear on the engines, but it doesn’t use the absolute minimum required. There’s a good sized safety margin built in. The pilots have a checklist involving the length of runway remaining, acceleration, speeds, and altitudes as the takeoff proceeds. If any of those checklist items hadn’t been met on takeoff the takeoff would have been aborted. Even with the margins and checklist, there were a couple of tail strikes, but there was no injury or loss of life, and minimal damage to the planes. One of the planes continued on its way about 4 hours later. The system itself is resilient.

Second, while there are no pre-defined triggers around the number of tail strikes in a short period of time, the director of ops understood how the system works and recognized an anomalous situation. He was empowered to act. So he pulled the Andon Cord to stop things. There was a system in place to stop operations, so things quickly stopped.

The next step was information gathering. It quickly became apparent that the weight info was sometimes wrong. To resume operations the pilots, the line operators, who were most familiar with the weights and balances, were empowered to make a decision. They were told to check the weight, apply a TLAR (That Looks About Right) check, and if they felt there was a problem, to check with operations and get the right number.

22 minutes after operations stopped, they were restarted. Meanwhile work continued on a long-term solution to the problem. They handled things in the right manner. Stop the damage. Mitigate the problem and resume operations. Identify and implement the long-term fix. Figure out how the problem got into the system in the first place and make sure it doesn’t happen again.

That’s not just good incident response in a safety critical system. That’s how you should respond to every incident.

by Leon Rosenshein

Done Done

One of the questions that gets asked a lot when you’re working on something is “When will it be done?” Lots of people want to know, for lots of good reasons. Leaving aside all of the questions around how to estimate (story points, blink estimation, BDUF, etc.), or if you should not even estimate at all, there’s still the question of what Done means.

This is where it gets interesting. According to Scrum, the definition of done is

The Definition of Done is a formal description of the state of the Increment when it meets the quality measures required for the product.

– The 2020 Scrum Guide

which says approximately nothing. Just that you’re done when you’re done, with quality. So let’s break it down a little.

If you’re using User Stories you need to get them sized right. Even then, it’s pretty vague, since a user story is a placeholder for a conversation with the user. You’re not done until the user says you’re done. Then you’re done.

If you have a task list you might be in better shape. At least you likely have a list of acceptance criteria (AC). Your goal is then to meet the ACs with as little code as possible. Meet those criteria and you’re done.

Simple, right? Wrong. You still have lots of problems. In the user story case, you don’t know when you’re done until you get there. There is no definition of done, just a recognition of the fact that you haven’t gotten there until you arrive at done. On the plus side, when you’re done in that case you’ve added value for the user.

In the task list case, on the other hand, you know exactly when to stop. You’ve met the ACs. Unfortunately, just meeting the ACs doesn’t mean you’ve added any value. The task is probably one small part of the work needed to do something. Updating a schema. Implementing an endpoint. Automating a manual step. None of those on their own add any value. The work you’ve done for the task probably needs to go through a set of deployment steps before it adds value. Those are probably tasks themselves and suffer similar problems. Doing a deployment without changing anything is good practice, but doesn’t add any value.

Which brings us to what I think done really means. You’re done with something when you’ve added value for someone. The definition of value, and figuring out who that someone is, are things worthy of their own posts in the future. For now though, a simple placeholder for value is “This work makes a user’s task easier.” Who that someone is can be anyone. It could be you, or your team, or a partner team, or even a customer. It really doesn’t matter.

That’s how you define done.

by Leon Rosenshein

Engineering Project Risks

I’ve mentioned Gergely Orosz before. He currently writes the Pragmatic Engineer Newsletter and has worked at Uber, Skype, and other places before deciding to write his newsletter full time. He’s written about lots of interesting things, so check out his newsletter if you get a chance.

For those who haven’t subscribed to his newsletter, he also posts teasers and summaries on Twitter. One of them is about Engineering Project Risks. 7 types of risks and how to manage them.

  1. Technology
  2. Engineering Dependencies
  3. Non-Engineering Dependencies
  4. Missing Decisions/Context
  5. Unrealistic Timelines
  6. Not Enough People/Bandwidth
  7. A “Surprise” Midway Through the Project

All very real and valid risks. And the mitigations are pretty good too. There’s another one though, a combination of numbers 1, 2, and 4, that I wanted to bring up.

It’s the New Domain issue. Consider the case where you’re part of a new team being formed to address pain points in a domain no one on the team is familiar with. Someone with domain experience has identified a very real pain point. They’ve also identified a direction for a technical solution. The team now needs to figure out how to design and implement something that eliminates the pain point(s).

There are multiple things you’re dealing with at once, and all are important. I’d say they break down into 3 major areas:

Team Dynamics

It’s a new team, so all the issues a new team has are in play. The team will go through at least the first 3 stages of Tuckman’s model. Critical at the beginning is building trust. Trust that the idea makes sense. Trust that the team has the support it needs. And most important, trust within the team.

The best way to build that trust is honesty and openness. Everyone saying what they know, and admitting when they don’t. With a foundation of trust, the forming and storming are minimized and the team can do its norming.

Domain Context

Not only does the team not know each other, they don’t know the space they suddenly find themselves in. There are folks in the domain that are using existing technology. They’re using jargon and acronyms that the team doesn’t know. They have a set of shared beliefs about what they can’t touch and why.

Here, that lack of context is also one of the team’s greatest strengths. Asking questions to learn, and then sharing the new info across the team helps generate trust within the team and between the team and stakeholders/partners/customers (see above). The lack of context also lets the team look at things in new ways. After all, one of the 4 categories of information is the things we know that just ain’t so. As a new team without context you don’t know those things, so you ask the questions and approach things without those blinders on.

Dependencies

Sure, there’s lots of ambiguity, and there’s no existing code. You probably think you’re working on a greenfield project. But really, it’s a brownfield project. Unless you’re doing something entirely self-contained, you have dependencies. If nothing else, there’s the hardware and operating system you’re using. And way more likely, there’s a lot more you depend on. You’re using libraries and tools and systems that are developed and supported by other teams. Other teams with their own goals and priorities.

Here the lack of context and ambiguity is working against you. You don’t know what you don’t know, so you need to clear up the ambiguity. You need to figure out what you don’t know, then identify how you’ll figure it out. You need to identify the dependencies and their limitations. How you’ll work with them. Again, openness and honesty are your superpowers here. The more open and honest you are, the better the answers you’ll get.

And of course, once you get through the special risks of a new team working on a new project in a new space, you still have to deal with the same set of risks as any other project. For that, I refer you back to Gergely’s list (above).

by Leon Rosenshein

E_INSUFFICIENT_CONTEXT

As I mentioned the other day, I say “It depends” a lot. Then I go on to describe why it depends in that particular case. Yesterday I ran into a very short Mastodon Toot that is the underlying reason it depends.

“It depends” is human for E_INSUFFICIENT_CONTEXT_PROVIDED

The more I thought about it, the more true it became. It’s just that simple. At least the reason “it depends” is that simple. Doing something about it is a whole different thing. To answer the question, you still need the context. That leads to lots of follow-on questions and conversations so you can get that missing context. Once the context is provided the answer changes from “It depends.” to at least “Here’s a good way to start.”, if not “Here’s what’s worked for me in the past and what I recommend you do.”

And it doesn’t just lead to my stock “It Depends” answer. It also leads to my favorite question. “What are you really trying to do?”. That comes from the exact same place. Because that question is really about getting at the missing context. When someone presents you with an XY problem they’re not providing you with the context of the problem they’re really trying to solve, just the context of the blocker they’ve most recently run up against.

There are lots of ways to avoid providing insufficient context. Rubber Ducking is a great place to start. If your rubber duck can understand the problem you’ve probably provided enough context. Another good place to start is Jon Skeet’s Stack Overflow Question Checklist. Even if you’re not asking on Stack Overflow, that was a great checklist 10 years ago and it’s a great checklist now. Especially that last one. After you’ve gone through the checklist, re-read the question and make sure you’ve addressed everything on the checklist.

Providing sufficient context (asking your question well) helps everyone. It’s more than just being a good neighbor, although it is that as well. It reduces the number of back-and-forth cycles required to get enough shared context. Being mindful of others’ time makes it easier for the answerer. Which makes them more likely to have a good answer to your question. Which makes it more likely that you get an appropriate answer to your question. Which is what you really want anyway.

So next time you have a question, make sure the response isn’t E_INSUFFICIENT_CONTEXT_PROVIDED

by Leon Rosenshein

Don't Be Too DRY

DRY, or Don’t Repeat Yourself is a good rule of thumb. Like all other rules of thumb though, if you were to ask me if you should remove some specific bit of duplication, my answer would be “It depends”.

On the other hand, one of the Go Proverbs is, as Rob Pike tells us,

A little copying is better than a little dependency

So, when should you stay DRY, and when should you repeat yourself? It depends. It depends on whether you’re really repeating yourself or just have the same sequence of characters.

At one extreme you have something like this

numberOfSidesOfSquare := 4
numberOfSeasons := 4
defaultLegsOnDog := 4

There are three things, all set to the same value. Strict DRY might have you believe that you should define a constant called FOUR and then assign that to each of the variables, or write a function, setVariableTo4(), that takes a variable and changes its value. But that doesn’t make sense. First, the integer 4 is already constant, so defining another constant is redundant. Second, while the action being taken is the same, semantically each of those lines does something different. There is no logical, domain relationship between the number of seasons, the number of sides of a square, and the default number of legs on a dog. The only thing they have in common is the value, 4.

At the other extreme you have a bunch of classes that need to send out notifications. They all have functions with a signature something like


func sendNotification(smtpHost, fromEmailAddress, toEmailAddress, subject, body string) error {
}

Some have the arguments in a different order. Some don’t default smtpHost and subject if you pass in nil or an empty string. One doesn’t accept a from address at all. Inside the function some validate the inputs, others don’t. The error messages are different. They use different internal flow. If you look at the sequence of characters in the function bodies, they’re all different. No textual repeating at all. But logically, semantically, they do exactly the same thing. They validate the inputs. They transform the inputs into the right format and structures needed to talk to an SMTP host and send an email. They return an error if there’s some kind of problem sending the email. That’s something you should write once. For multiple reasons.

First, it makes future maintenance easier. If the default SMTP host changes you only need to change it in one place, and you can make sure no-one else even knows about it. If you want to change from email to Slack notification (or SMS or anything else) you can do that in one place. Or do them all. Again, only in one place. Second, from a Domain Driven Design standpoint, notification doesn’t belong in those classes. Yes, the classes need to do the notification, and they probably care who’s notified, but that’s it. They shouldn’t know or care how. It’s an external capability they should be told about.
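As a sketch of what that extraction might look like in Go, here’s a hypothetical Notifier interface. All of the type and field names here are illustrative assumptions, and the SMTP call itself is stubbed out. The point is that the domain type depends on the capability, not on the delivery mechanism.

```go
package main

import "fmt"

// Notifier is the extracted capability. Domain types depend on this
// interface, not on SMTP (or Slack, or SMS) details.
type Notifier interface {
	Notify(recipient, subject, body string) error
}

// emailNotifier owns all of the email knowledge in one place.
type emailNotifier struct {
	smtpHost string
	from     string
}

func (n emailNotifier) Notify(recipient, subject, body string) error {
	// Validation lives here, once, instead of in every caller.
	if recipient == "" {
		return fmt.Errorf("no recipient provided")
	}
	// Real code would talk to n.smtpHost (e.g. via net/smtp) here;
	// the sketch just shows that callers never see those details.
	fmt.Printf("to=%s from=%s subject=%q\n", recipient, n.from, subject)
	return nil
}

// Order only knows that it can notify someone. Swapping email for
// Slack means handing it a different Notifier, not changing Order.
type Order struct {
	customerEmail string
	notifier      Notifier
}

func (o Order) Ship() error {
	return o.notifier.Notify(o.customerEmail, "Order shipped", "On its way!")
}

func main() {
	o := Order{
		customerEmail: "customer@example.com",
		notifier:      emailNotifier{smtpHost: "smtp.example.com", from: "shop@example.com"},
	}
	if err := o.Ship(); err != nil {
		fmt.Println("shipping notification failed:", err)
	}
}
```

If the default SMTP host changes, or notifications move to Slack, only the Notifier implementation changes; every domain type that sends notifications is untouched.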

Of course, those two are the extremes. For other cases you need to think more deeply about it. Are the things the same semantically/logically? Are they parts of multiple domains? Are they really parts of those domains or are they used by those domains? Should the functionality be extracted and put somewhere else entirely? Would not repeating yourself add some coupling you don’t want? If you have some coupled things, can you break the coupling by repeating yourself?

TL;DR: being DRY is good, but don’t be too dry. Think about the costs and benefits first.

by Leon Rosenshein

Lies, Damn Lies, And Statistics

When you’re running code in production, metrics are important. You need to know that things are working correctly. You need to know when things are starting to go in the wrong direction before any of your users or customers notice. Metrics can tell you that.

The problem with metrics at scale is that, in order to make sense of them, they’re often aggregates. Mean, median, mode, p90, and p95. All ways to turn lots of numbers into a single representative number. Unfortunately, aggregates like that are very lossy compression. The more variable the data, the less useful things like the average are.

Consider Van Gogh’s The Starry Night. According to John Rauser the average color in The Starry Night is #4C5C6D. That’s interesting, but it really doesn’t say much about the painting. Who wants to look at a colored rectangle?

There’s much more information in the painting, and I’d much rather look at it.

Van Gogh's Starry Night

Not only is there a lot of missing information, I don’t think that average color even exists in the painting.

Or consider something like a P99 latency. Not only are there many ways for the P99 value to be wrong, but even if it’s accurate you could still easily be missing the point. Statistics can be tricky things, especially percentiles. Sure, your P99 value means 99% of the values are lower, but it doesn’t tell you anything about why that last percent of items is higher, nor does it tell you how much higher.

Imagine you have 10,000 web requests served by 100 servers. 9,900 complete in less than 30ms, so your P99 is 30ms. That’s a pretty good P99. That’s probably within your SLA, so according to the SLA you’re all set. What that P99 doesn’t tell you is that the other 100 take 5 seconds each. Somewhere in there you’ve got a significant problem. The other thing it doesn’t tell you is that all 100 of the slow requests were handled by the same server. Which means you’ve probably got a routing/hardware problem. Just looking at your P99, you wouldn’t have known there was a problem, let alone that it’s on the physical side, not the software side.

Which is another way of saying don’t just look at the aggregates, look at your raw data and see what else it’s telling you. It can tell you a lot about how things are really going, if you let it.

by Leon Rosenshein

Pass or Fail?

We say a test passes when the results are as expected and that it fails when one or more of them is not. When one or more asserts fails. That makes sense. When you take a test in school and get the right answers you pass. If you don’t, you fail.

When you took that test in school and got the right answers who passed? You, or the test? It’s been a long time, but the last time I took a test in school I got enough correct answers. When I got the results, they said I passed. They didn’t say anything about the test itself passing or failing.

So here’s another framing question for you. Why do we say we run tests on our software when as a student we take a test? Why do we say the tests pass or fail for software when we say the student passed or failed? I’ve got some thoughts on why.

I think it goes back to Taylorism, Legibility, and language. We tried to divide by function for efficiency. We tried to make sure we always had good, centralized metrics to tell us we were on track. We execute the tests, not the software, to get those metrics.

Taylorism is all about assembly lines, decomposing work into separable tasks, and optimizing repetitive steps for efficiency. That doesn’t make sense for writing code, but it might make sense for testing code. After all, the tests are a separable step, and we repeat testing the code over and over again. So Taylorism leads us to considering the tests to be their own thing, completely separate from the code.

Legibility is the idea that the only way to make sure you do things efficiently is to centrally plan, record, and respond to deviations from the plan. To do that you need to have the results of all of the phases. You break the coding into tasks and give them a state. You identify all of the tests and give them a state. Since the tests exist for a long time and, according to the plan, never change, the state of the test is the result of the last time the software was tested, pass or fail.

And then there’s language. In this case, it’s the language of executables. Tests are code. You run code. Tests, especially unit tests, but even integration and beta tests, are code you run. The code doesn’t run the tests, the tests run the code. The Sapir-Whorf Hypothesis says that language influences thought. The language makes us say and think about running the tests to get results instead of testing the software and seeing if the software passes or fails the test(s).

That’s how our frame gets us to talk about running tests and tests passing or failing. It’s interesting, and mostly harmless, but it’s not completely harmless. It’s not harmless because the frame does more than just define the words we use. It also influences how we think of our software and its state. We say the test failed. We don’t say the software failed. That frame makes us think of fixing the test to make it pass, not fixing the software. Sure, we usually fix the software, but since the test is what’s failing it makes complete sense to change the test until it passes. The frame makes us think that might be the right thing to do. But it’s probably not.

So next time you find yourself talking about tests failing, consider reframing it and talk about the software failing the test. Because how you look at things influences what you do about it.