
by Leon Rosenshein

Customers vs. Partners

Customers and partners are different. According to the dictionary, it’s like this:

Customer: a person or organization that buys goods or services from a store or business.

Partner: any of a number of individuals with interests and investments in a business or enterprise, among whom expenses, profits, and losses are shared.

Or to put it more succinctly: you work for a customer, and with a partner. And that’s a huge difference. Your customers’ motivations are different from yours, with only a narrow overlap. Of course, your customer wants to buy the thing and you want to sell it, so there’s that overlap. They have a need to fill and you want to fulfill that need, so there’s that overlap too. But at a higher level your goals diverge: your goal is to fill the need, while for your customer the purchase is just a means to an end, whatever their purpose is. You succeed if they make the purchase. You don’t care whether they succeed after that, except that if they do succeed, they’ll buy more of what you’re selling.

Conversely, your and your partner’s motivations have considerable overlap. There might be some differences in the details, but you share the same high-level motivations. You’re working together to meet the same external need or goal. You succeed or fail together, by working together to meet that external goal.

One big way this difference shows up is in how influence flows. Of course, when you build something for a customer you should listen to them. If enough of your customers want something, it’s probably a good idea to add it. How you build your thing influences your customers in turn. They’re invested in using what you provide and want it to work, so it shapes them and the thing they’re building. But it’s a weak influence.

With partners the influence is much tighter. You have more influence because you’re doing things together. It’s a collaboration. All sides get to state their wants and needs directly, and you work together, in partnership, to come up with a solution that works for everyone and fulfills the shared need of everyone in the partnership.

Being successful, regardless of whether you’re working with a customer or a partner, means you need to be aware of those differences.

As important as that awareness is, it gets even more interesting when you’re a platform team working with internal customers or partners. But that’s another story for another time.

by Leon Rosenshein

Best Practices and Cargo Cults

A best practice is a “method or technique” that often produces superior results. A cargo cult, on the other hand, is slavishly following the forms of something that you’ve seen work (or at least think works). The problem is that from the outside, and often from the inside, they look awfully similar. The question is, how can you tell the difference?

Actually, the difference is pretty straightforward. What makes it hard to see is that it’s not in what you’re doing. In both cases you’re probably doing the same thing. In fact, the more closely you follow the original, the more likely it is that you’re cargo culting, not following a best practice.

The difference lies in why you’re doing the technique. Cargo culting is simple: you just do the technique. It goes something like this.

You’ve seen <successful team/company> do a thing. You read an article they wrote about what they did and how well it worked out. You saw someone tweet something similar, about how it’s the new hotness, the new best practice. You can’t afford the consultant, but you like the reported results and decide to try it out. They have donuts every Tuesday morning and lattes on Thursday afternoon. They have a shared doc that they all use to write down what they’re working on, and they update it daily. You can do that, so you try it out. And unsurprisingly, it doesn’t work. Nothing changes.

Best practices are harder to implement. You start by looking at the technique. Then you look at the context it was used in. You look at the culture. You look at the environment. You look at the problems they were trying to solve. Things like team and company size. How long the team has been together. What the communication patterns were before they tried the technique. You think about it and understand how the technique, when applied in that specific context, solved those problems.

Then you look at your context. You look at the problems you’re trying to solve. You look at how it’s the same as the original. You look at how it’s different. Especially how it’s different, because as Tolstoy said in Anna Karenina,

Happy families are all alike; every unhappy family is unhappy in its own way.

That’s critical. You can’t just blindly apply someone else’s solution, from their context and their specific problems, to your context and your problems. You need to look not at what they did, but at why they did it, and what you can learn from their experience. Then you apply those lessons to your context and your problems.

Or put more simply, Google has more engineers building and running their build system than you have total employees, let alone developers. The build tools and systems that work for them won’t work for you. Instead of building your own version of blaze/borg, figure out what’s really slowing you down and fix that.

by Leon Rosenshein

Applesauce, Names, and Refactoring

Did you know that applesauce can be an important part of refactoring, understanding code, and honest function naming? It’s true, and here’s how it works.

[Image: a bowl of applesauce]

First and foremost, I’m not talking about mashed apples. As much as I enjoy applesauce on my latkes, that’s not what I mean. I’m talking about using applesauce as an honest name for something.

Naming is hard. It’s one of the two big problems in computer science. Names are also a design problem. You don’t know what you don’t know, so names are fluid. As you learn more about the domain (by spending time in it) and the system (by building it), you find that the names you’ve come up with are often, to a greater or lesser degree, wrong. And as your understanding grows, your methods start to accrete more functionality. So you end up with a method called AddUser that also updates the user’s profile and sends email if they haven’t logged on in 3 months, something like the sketch below. At some point you realize that you need to refactor the code. You need to split things up into methods that do one thing and that have good names.
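As a sketch of what that accretion can look like (the types and names here are hypothetical, just following the post’s AddUser example):

```go
package user

import "time"

// Minimal stand-in types so the sketch is self-contained.
type Profile struct{ DisplayName string }

type User struct {
	ID        string
	Email     string
	LastLogin time.Time
	Profile   Profile
}

type Store interface {
	Find(id string) (*User, error)
	Create(u User) error
	UpdateProfile(id string, p Profile) error
}

type Mailer interface {
	Send(to, subject string) error
}

type Service struct {
	store  Store
	mailer Mailer
}

// AddUser has quietly accreted behavior well past its name: it also
// updates the user's profile, and emails anyone who hasn't logged on
// in roughly three months.
func (s *Service) AddUser(u User) error {
	existing, err := s.store.Find(u.ID)
	if err != nil {
		return err
	}
	if existing == nil {
		if err := s.store.Create(u); err != nil {
			return err
		}
	}
	if err := s.store.UpdateProfile(u.ID, u.Profile); err != nil {
		return err
	}
	if time.Since(u.LastLogin) > 90*24*time.Hour {
		return s.mailer.Send(u.Email, "We miss you!")
	}
	return nil
}
```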

The problem is that you often don’t know just how to split things up. And if you don’t know how to split things up, how can you come up with a good, honest name? You probably can’t. One option is to not bother trying, or at least not at first.

Which is where Naming as a Process comes in. It describes a process that lets you go from working code to working code while adding useful information to method names. And it starts with applesauce. You don’t know what the method does, when you should call it, what it returns, or even why it exists. Calling it applesauce is the first step on the path to a completely honest, informative, intentional, domain-related name. You don’t know what it does, and unless you’re writing an app that manages a kitchen or a food storage system, no one is going to think the name means anything. So no one will assume it does the wrong thing. They’ll have to look at it to find out.

You started with AddUser, then you went to applesauce. If you stop there, bad developer. No cookie for you. But if you follow the process, the name will get better. Next you make it honest, if slightly ambiguous. Look at the method. What does it seem to spend the most time doing? It seems to be updating the user’s profile, and along the way it will at least create the user if they don’t exist and send email if they haven’t logged on for a while. And maybe something else. There’s some more code that doesn’t seem to have anything to do with users or profiles. So give it a name like UpdateOrCreateUserProfile_and_SendEmailToInfrequentUsers_and_somethingElse. It’s a mouthful, but it lets you, and anyone else looking at the code, know what you know it does, at least one of its side effects, and that you’re pretty sure it does something else as well. It’s not great, but at least it’s honest.

Now start extracting things. Run the process a few more times until you’ve got a set of methods, each doing one thing, called by a controller that understands what to do. Give each method an honest name that explains what it does, like AddUserIfNeeded, SendInfrequentUserEmail, and UpdateProfileWithLastAccessTime. That’s going to be pretty helpful to the next person who looks at it, something like the sketch below. Just by reading the names, they know what’s happening.
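Continuing the hypothetical sketch from above (the controller name here is made up; the extracted names are the post’s):

```go
// Each extracted method does one thing and says so honestly. The
// controller understands the order things need to happen in.
func (s *Service) UpdateUserRecords(u User) error {
	if err := s.AddUserIfNeeded(u); err != nil {
		return err
	}
	if err := s.UpdateProfileWithLastAccessTime(u); err != nil {
		return err
	}
	return s.SendInfrequentUserEmail(u)
}

func (s *Service) AddUserIfNeeded(u User) error {
	existing, err := s.store.Find(u.ID)
	if err != nil || existing != nil {
		return err
	}
	return s.store.Create(u)
}

func (s *Service) UpdateProfileWithLastAccessTime(u User) error {
	// In the real method this would also stamp the last-access time.
	return s.store.UpdateProfile(u.ID, u.Profile)
}

func (s *Service) SendInfrequentUserEmail(u User) error {
	if time.Since(u.LastLogin) > 90*24*time.Hour {
		return s.mailer.Send(u.Email, "We miss you!")
	}
	return nil
}
```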

Unfortunately, you still don’t know why. That’s the next step: make the names intentional and domain-relevant. Move some logic around so the controller decides what happens and the other methods just do things. Names like CreateNewWebsiteUser, SendUserEngagementEmail, and RecordUserInteraction, all inside a controlling method called TrackAndEncourageUserAccess.
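The final stage of the sketch. In practice you would simply rename the helpers; here they delegate to the methods above so the sketch stays self-contained:

```go
// TrackAndEncourageUserAccess says why this work happens; the methods
// it calls just do things.
func (s *Service) TrackAndEncourageUserAccess(u User) error {
	if err := s.CreateNewWebsiteUser(u); err != nil {
		return err
	}
	if err := s.RecordUserInteraction(u); err != nil {
		return err
	}
	return s.SendUserEngagementEmail(u)
}

// Renamed for intent; same single-purpose bodies as before.
func (s *Service) CreateNewWebsiteUser(u User) error    { return s.AddUserIfNeeded(u) }
func (s *Service) RecordUserInteraction(u User) error   { return s.UpdateProfileWithLastAccessTime(u) }
func (s *Service) SendUserEngagementEmail(u User) error { return s.SendInfrequentUserEmail(u) }
```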

The next time you (or anyone else on the team) step into the code, you’ll see not only what is happening, but why it should be happening. You can trust that any side effects are made clear. You’ll all be happier. Especially if you’re the Maintainer.

by Leon Rosenshein

Policies

I’ve talked about Chesterton’s Fence before. It’s the idea that you have to understand why something was done in the first place before you decide to undo it. You buy a vacation house with a fence and a gate across the driveway, and all you ever do is stop, open the gate, drive through, and then close the gate behind you. To save time and trouble, you remove the gate, because all it does is slow you down. You come back a few weeks later and find that the wild goats have not only eaten your grass but, as they are wont to do, turned it into a goat desert, eating not just the grass but the trees and shrubs as well. Now you know why the fence (Chesterton’s Fence) was there. It wasn’t to slow you down; it was to protect your landscaping.

As I mentioned before, you can see things like that in code: input validation for things that should never happen; error handling even when the input is validated and the call should never fail; an extra watchdog timer wrapped around an event handler. You could take them out and things would be fine for a while. Probably for a long time. But remember, saying something hardly ever happens is the same as saying that it happens, so eventually there will be a problem. Until you can be sure the thing can’t happen, you need to be ready for when it does.
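A hypothetical sketch of what those guards can look like (the names and the timeout are made up):

```go
package events

import (
	"context"
	"fmt"
	"time"
)

type Event struct{ ID string }

// process stands in for the downstream call that "should never fail".
func process(ctx context.Context, e Event) error { return nil }

// HandleEvent keeps all three fences: validation that "should never"
// trip, a watchdog timeout in case the always-fast downstream someday
// isn't, and error handling for a call that "can't fail".
func HandleEvent(ctx context.Context, e Event) error {
	// Input validation for something that should never happen.
	if e.ID == "" {
		return fmt.Errorf("event with empty ID: %+v", e)
	}

	// Watchdog: bound how long we're willing to wait.
	ctx, cancel := context.WithTimeout(ctx, 5*time.Second)
	defer cancel()

	// Error handling even though, with validated input, this
	// "should never fail".
	if err := process(ctx, e); err != nil {
		return fmt.Errorf("processing event %s: %w", e.ID, err)
	}
	return nil
}
```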

Which brings me to policies. Policies are rules about when and how to do things. Sometimes they’re written down, like in an employee handbook, or even enforced by the system (like requiring tests to run before landing a PR). Sometimes they’re part of the team or org’s minhag, the custom handed down as tribal knowledge, where you get told how the team lets downstream users know about planned changes and outages by using a certain format in a specific Slack channel. And sometimes you only find them when you violate them, like the policy that says that while you technically could just order new hardware yourself, you’d better ask the admin first and let them take care of it. Regardless of how you learn about them, they’re there.

The thing is, as redundant or arbitrary as they seem to be, they were almost certainly put in place as a response to something that happened. As Jason Fried said,

Policies are organizational scar tissue.

However, just because a policy exists, and might have made sense at the time, doesn’t mean it makes sense now. That’s where the rest of that quote comes in:

They are codified overreactions to situations that are unlikely to happen again. They are collective punishment for the misdeeds of an individual. This is how bureaucracies are born. No one sets out to create a bureaucracy. They sneak up on companies slowly. They are created one policy—one scar—at a time. So don’t scar on the first cut. Don’t create a policy because one person did something wrong once. Policies are only meant for situations that come up over and over again.

The problem is that, in general, policies don’t have expiration dates. In fact, it’s the opposite: the longer they’ve been around, the harder they are to change. Which can be a problem, because policies are set in isolation from each other. And they accumulate. They can even conflict with each other. So you have to be careful when setting policies.

You probably don’t need a policy the first time something happens. You need to think about how likely it is to happen again, what the cost of having and living with the policy is, and what the cost of it happening again is. Unless the cost of it happening again is greater than the cost of living with the policy, treat it as a teaching moment and remind people of the goals and consequences. Just send an email to the right group of people.

Instead, use policies for things that happen multiple times and have a very high cost that you can’t (easily) put a mechanism in place to prevent. If you want to make sure tests are run on every PR/commit, don’t make it a policy and hope folks do it, make it a part of the system so they don’t have to think about it. On the other hand, if you want to let your customers know about upcoming downtime or service interruptions, make a policy, write it down, and make sure everyone knows.

Finally, put a re-evaluation date on your policies. Write down why the policy is in place, what its goal is, and ideally, what criteria need to be met to remove it. For instance, you might have a policy to run unit tests today, then re-evaluate it every week while you build the mechanism to run them automatically. Once the mechanism is in place you can remove the policy.

And if you find a policy you don’t understand the reason for, remember Chesterton’s Fence. Don’t just remove it because you don’t know why it’s there. Figure out why it’s there, decide if it’s still needed, for that or some other reason, and then make a decision to keep, modify, or remove it.

by Leon Rosenshein

This Is The Way

I’ve talked about The Unix Way before. I recently came across a really good description of it. It doesn’t matter if you’re writing a tool or deciding what goes into your commit. It shows up in architecture, in Domain Driven Design and Bounded Contexts. It shows up in non-code places as well. Reducing Work In Progress is just another way of implementing the Unix Way.

The best definition of the Unix Way I’ve ever seen comes down to two points:

Do one thing, and do it well.
Expect the output of your program to become the input of another.

That’s it. Two things. Simple and straightforward. Of course, the devil is in the details.

First, what is the one thing? Where are the boundaries of that one thing? Are you writing a program, a library, a service, or a platform? Where the customer value comes from drives the boundaries. Where the boundaries are drives what you’re building. But it’s not a one-way flow done in isolation. It’s actually a system with feedback loops, so boundaries influence what the “one thing” is, which then ends up influencing how it gets used, which influences customer value. And now we’re talking about levels of abstraction. You can always add another level of abstraction to help solve a problem (except the problem of too many levels of abstraction). The trick is to figure out if you should or not.

A really good way to approach this one is to recognize that it is an iterative process. Understand the domain and the context. Figure out what you need to do. Figure out what you shouldn’t do. Then look at everything you think you need to do and decide if you really need to do it. At any given level you can probably do less, then use composition at a higher level. Even if you never share that lower level with the end user, tighter domains and less coupling will help you, now and in the future. At the same time, don’t be afraid to merge two contexts or rearrange a group of them into things that make more sense. As you design, your understanding grows. As your understanding grows, the boundaries will shift. Use that to your advantage.

Second, how do you tie things together? In classic Unix, the text output of one program is used as the text input to the second. For CLI tools that still makes sense. But lots of things don’t run from the command line. They’re driven by user clicks/taps on a screen, and they might not even run on the same computer. Which means you can’t use your favorite shell’s pipe mechanism. In fact, you need to think about how to tie things together across computers. Data size has grown as well. By multiple orders of magnitude. On Falcon 4 we shipped our entire game on a single CD. Less than 680 MB, including all of the art assets. More recently, for the Global Ortho project each individual image was ~550 MB, and we had over 2 million of them between the US and Western Europe. Even one of those images is a lot of data to pass via a pipe, let alone hundreds or thousands of them. And since you don’t know what’s going to be reading your data you don’t know what language the next thing is going to be written in, so your access needs to be cross-language.

The key to solving this issue is to add structure. But only as much structure as you need. And document it. Cleanly, clearly, and with examples. Consider using a formal IDL like Protobuf or Apache Thrift. They’re great for ensuring you have a single source of truth for the definition while letting you have language-specific implementations. They can be used in distributed systems (across the wire) or through persistent stores like file systems or databases. You can even use them as part of your API in a library to keep things consistent. Even if you’re using text for a CLI tool, a little structure will make your life easier. YAML and JSON are good choices. And remember that it’s often useful to have both text output that’s formatted in a way that makes it easy for people to understand (columns, colors, complete sentences, etc.) and a machine-parseable format. With clean, clear documentation.
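As a sketch (the flag name, data, and Result type are all made up), a small Go CLI that offers both a human-friendly table and machine-parseable JSON:

```go
package main

import (
	"encoding/json"
	"flag"
	"fmt"
	"os"
	"text/tabwriter"
)

type Result struct {
	Name  string `json:"name"`
	Count int    `json:"count"`
}

func main() {
	asJSON := flag.Bool("json", false, "emit machine-parseable JSON")
	flag.Parse()

	results := []Result{{"alpha", 3}, {"beta", 7}}

	if *asJSON {
		// Structured output for the next program in the pipeline.
		json.NewEncoder(os.Stdout).Encode(results)
		return
	}
	// Columns for the humans.
	w := tabwriter.NewWriter(os.Stdout, 0, 4, 2, ' ', 0)
	fmt.Fprintln(w, "NAME\tCOUNT")
	for _, r := range results {
		fmt.Fprintf(w, "%s\t%d\n", r.Name, r.Count)
	}
	w.Flush()
}
```

With both formats documented, people get the table, and the next program in the pipeline gets something like `mytool -json | jq` composition, whatever language it happens to be written in.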

The Unix Way has been around for almost 60 years. Despite the advances in processor, storage, language, and network design and implementation, it’s still as relevant today as it was back then.

by Leon Rosenshein

TDD Anti-patterns

I’ve talked about the difference between Test After Design and Test Driven Design before. I’m a big proponent of writing tests that prove your design before you write the code. Not all the tests, because as your knowledge of what the design should be improves, you’ll learn about new tests that need to be written as well. So, as you learn, you write more tests to drive the design.

Regardless of when you write the tests, before or after the code, there are a bunch of test anti-patterns to watch out for. Anti-patterns, or code smells, as they’re sometimes known, aren’t necessarily wrong. They are, however, signals that you should look a little bit deeper and check your assumptions.

TDD anti-patterns are definitely not a new idea. A slightly repetitive list was published over 15 years ago and consolidated about 10 years ago. The shorter, consolidated list is still valid today.

Check out the list yourself for the full details, but here are a few of the ones that I see implemented pretty often:

The Liar

The liar is a test that doesn’t do what it says it’s doing. It’s testing something, and it fails when that something isn’t working; it’s just not the thing the name claims. An example would be a test called “ValidateInput” that, instead of checking that the method properly validates its input parameters, actually checks that the result is correct.
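A hypothetical Go sketch (Classify is a stand-in for the function under test):

```go
package classify_test

import "testing"

// Classify stands in for the hypothetical function under test.
func Classify(s string) (string, error) { return "expected", nil }

// TestValidateInput is a liar: the name promises input validation,
// but the body only checks the happy-path result. Invalid input is
// never exercised.
func TestValidateInput(t *testing.T) {
	got, err := Classify("valid input")
	if err != nil {
		t.Fatalf("unexpected error: %v", err)
	}
	if got != "expected" {
		t.Errorf("got %q, want %q", got, "expected")
	}
}
```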

The Giant

The giant is a test that does too much. It doesn’t just test that invalid input parameters are properly handled. It also checks that valid input gives the correct answer. Or it might really be an integration test, something that requires way too many things to be working properly for the result of the test to tell you anything. One important note here: a table-driven test that covers all possible combinations of invalid inputs in a single test function is not a giant (as long as each entry in the table is its own test). It’s just a better way to write small tests.
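For contrast, a sketch of that kind of small table-driven test, with each table entry as its own subtest (again using the hypothetical Classify; a real implementation would reject these inputs):

```go
func TestClassifyRejectsInvalidInput(t *testing.T) {
	cases := []struct {
		name  string
		input string
	}{
		{"empty", ""},
		{"whitespace only", "   "},
		{"embedded null", "abc\x00def"},
	}
	for _, tc := range cases {
		// Each entry is its own named test, reported independently.
		t.Run(tc.name, func(t *testing.T) {
			if _, err := Classify(tc.input); err == nil {
				t.Errorf("Classify(%q): expected an error, got none", tc.input)
			}
		})
	}
}
```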

The Slow Poke

The slow poke is just that: slow. It takes a long time to run. It might be waiting for the OS to do something, or for a real-world timeout, or it might just be doing too much. Tests need to be fast because entire test suites need to be fast. If your test suite takes minutes to complete, it’s not going to be run. And a test that doesn’t get run might as well not have been written.

Success Against All Odds

The success against all odds test is the worst of all. It takes time to write. It takes time to maintain. It executes your code. Then it says all tests pass, regardless of what happens. Maybe you’re testing your mock. Maybe, through a series of unfortunate events, you’re comparing the expected output to itself. Or, even worse, you forgot to compare the output to the expected output at all. Regardless of the reason, you end up with Success Against All Odds. You feel good, but you haven’t actually tested anything. You haven’t increased anyone’s confidence, which is Why We Test in the first place.
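A sketch of one classic way this sneaks in, deriving the “expected” value from the actual one:

```go
// TestClassifySuccessAgainstAllOdds can never fail: want is computed
// from got, so the comparison is always true no matter what Classify
// actually returns.
func TestClassifySuccessAgainstAllOdds(t *testing.T) {
	got, _ := Classify("some input")
	want := got // oops: the "expected" value is the actual value
	if got != want {
		t.Errorf("got %q, want %q", got, want)
	}
}
```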

So regardless of when you write your tests, make sure you’re not falling into any of the anti-patterns by mistake or by not paying attention.

by Leon Rosenshein

Monitoring vs. Observability

Pop quiz. What’s the difference between monitoring and observability? First of all, monitoring is a verb. It’s something you do. You monitor a system. Observability, on the other hand, is a noun, one of the -ilities. Observability describes a capability of a system.

Monitoring is collecting logs and metrics. Metrics like throughput, latency, failure rates, and system usage. Monitoring is going over logs and looking for issues, patterns, and keywords, among other things. More advanced monitoring includes some way of tracing through a system, whether it’s a distributed system or not. And in many cases, you compare the results to a threshold and fire some kind of alert if that threshold is crossed.

Observability is the ability to know and understand the system’s current state by looking at the metrics. Not just the explicit bits of state you get from the logs and metrics, but the overall system state. Not just whether some threshold of a particular metric has been crossed, like how much RAM is being used, but why that much RAM is being used.
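As a sketch of the difference, using Go’s standard structured logger (the field names and values here are hypothetical):

```go
package main

import "log/slog"

func main() {
	// Monitoring tells you *that* a threshold was crossed...
	slog.Warn("memory high", "rss_mb", 7200, "threshold_mb", 6144)

	// ...while an observability-minded event carries enough context
	// to start answering *why*.
	slog.Warn("cache grew past memory threshold",
		"rss_mb", 7200,
		"threshold_mb", 6144,
		"cause", "hot-key fanout",
		"tenant", "acme",
		"keys_promoted", 12488,
	)
}
```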

In other words, good observability includes monitoring, but just having monitoring doesn’t mean you have good observability.

In a system without good observability, if you’ve got a good understanding of your system, you can look at the metrics and deduce whether there’s a problem. You can look at the logs, work back from the problem or error, and identify the log message that shows where the problem started. In that case you might be able to use the metrics to find the root cause of why a threshold was crossed (and an alert fired that woke you up at 2 o’clock in the morning).

If you don’t have that understanding, then what? You start reading code. And tracing it. Looking for code paths that might have caused the issue. You might search for where log messages are generated and try to understand why. You’re trying to build an understanding of the system and reconstruct its internal state at the same time. Meanwhile, the fire rages.

If your system has good observability then you don’t need that deep understanding. The metrics (and logs and traces) point you to where the problem is, not where the symptom surfaced. So when your p95 response latency crosses your threshold you know not only what happened, but what caused it.

And you usually want that ability. Because at first, before you fully understand a system, you need help figuring out why something went wrong. Later, you understand the system. You’ve put in mechanisms to deal with the issues you’ve learned about. You’ve got good metrics to tell you about that state, just in case something does happen. So what happens next? Instead of the problem you understand, something you don’t understand happens. Something where your deep knowledge of the system isn’t deep enough. And you’re right back in the not-understanding state.

Which is why good observability is so important.

by Leon Rosenshein

Traditions

Just what is a tradition? More importantly, what do we get from traditions? According to Merriam-Webster, tradition is

  1. a: An inherited, established, or customary pattern of thought, action, or behavior (such as a religious practice or a social custom) b: A belief or story or a body of beliefs or stories relating to the past that are commonly accepted as historical though not verifiable
  2. The handing down of information, beliefs, and customs by word of mouth or by example from one generation to another without written instruction
  3. Cultural continuity in social attitudes, customs, and institutions
  4. Characteristic manner, method, or style

Lots of words that describe what a tradition is. Basically it comes down to “something we’ve gotten from the past”, often without proof. And nothing about the point of a tradition, or its benefits.

There can be lots of benefits to learning from the past. If something worked once then maybe it will work again. If it didn’t work then figuring out why and avoiding that failure mode is a good idea.

Tradition can also reduce your cognitive load. If people share a tradition then you have shared context. It makes communication easier and can reduce transmission errors.

Tradition can reduce the number of decisions you need to make. i, j, and k are often used as simple loop variables or counters. Not because there’s any particular magic to them, but because that’s the way it’s taught and what we expect. Maybe i because it’s the first letter of index, and then j and k are the next letters, or maybe not. It’s a tradition and we do it.

Which touches on the problem with tradition. When we follow tradition because it’s tradition then we’re not thinking. Any time we’re not thinking we’re making assumptions and we open ourselves up to problems.

Consider that loop variable again. Is i really the best choice? If we’re counting, then maybe count is a better name. Or some other name that describes intent. Sure, I know i is some kind of loop variable, but what does it represent? Blindly following tradition means losing that context.
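A tiny Go example of the difference (the loop bodies are trivial on purpose):

```go
package main

import "fmt"

func main() {
	lines := []string{"alpha", "beta", "gamma"}

	// Tradition: everyone recognizes i, but it tells you nothing.
	for i := 0; i < len(lines); i++ {
		fmt.Println(i, lines[i])
	}

	// Intent: the same loop, with a name that keeps its context.
	for lineNumber := 0; lineNumber < len(lines); lineNumber++ {
		fmt.Println(lineNumber, lines[lineNumber])
	}
}
```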

Which is why I like describing tradition this way.

Tradition is not the worship of ashes, but the preservation of fire

                Gustav Mahler

The point of tradition is to pass on the information, the knowledge, the learnings that we have come up with. To avoid replicating problems. To make things easier and more shareable. To not forget where we have come from. To be a guide to doing things better in the future. It’s not to do something the same way just because we’ve always done it that way.

Blindly doing what was done in the past without understanding why it was done that way does us all a disservice. It leads to cargo culting. But that’s a topic for another day.

by Leon Rosenshein

Design Is Iterative

I’ve talked about rework avoidance and the advantage of taking smaller steps. I’ve talked about how it’s different from the older BDUF process. How a sea chart gives you a better chance of getting a successful result.

Given that, and the fact that the Agile Manifesto was written in 2001, you would think that the concept of iterative design is a new one.

However, you’d be wrong.

Way back in 1968, in the Software Engineering Report by the NATO Science Committee, you’ll find this quote by H. A. Kinslow.

The design process is an iterative one. I will tell you one thing which can go wrong with it if you are not in the laboratory. In my terms design consists of:

  1. Flowchart until you think you understand the problem.
  2. Write code until you realize that you don’t.
  3. Go back and re-do the flowchart.
  4. Write some more code and iterate to what you feel is the correct solution

There are a lot of good ideas in that report, but Kinslow’s comment really resonates with me. It plays well with my personal bias for action. Do some planning. Write some code. Repeat. Or put another way: learn from the domain to write some code, then learn from the code to better understand the domain. Then do it again. By learning from both, iteratively, you come up with a better answer than by learning from either one alone. And you’ll probably come to that answer sooner.

It’s a system. A system with at least 2 loops. The questioning/understanding loop and the building loop. At first you don’t even know what questions to ask. You start with a problem statement. You think you understand it well enough to write a solution, but then you have questions. So you dig deeper and get your questions answered. Which leads to more understanding and a better solution, which leads to more questions. The cycle repeats until you have enough understanding to build a solution that actually solves the original problem.

So remember, whether it’s your APIs, your class design, your domain boundaries, or how your systems work together, iteration is your friend. Because without iterating you can’t have enough understanding to come up with a good solution.

by Leon Rosenshein

Good Habits

In the past I’ve talked about Why We Test. We test to increase our stakeholders’ confidence. So let’s add all the confidence we can. If a little confidence is good, more confidence is better, right?

Maybe, but maybe not. Because Speed is also important. We want to get things done Sooner, not Faster. It’s not confidence at any cost. It’s enough confidence to move forward.

I think Charity Majors said it really well.

Or you could think of it like running tests is the equivalent of eating well, not smoking, and wearing sunscreen. You get diminishing returns after that.

And if you’re too paranoid to go outside or see friends, it becomes counterproductive to shipping a good healthy life. ☺️🌴

How much confidence do you need? What’s the cost of something going badly? What are your operational metrics (MTTD, MTTF, MTBF, MTTR)? Do you have leading metrics that let you know a problem is coming before customers/users notice?

The more notice you have before your customers see a problem, the faster you can detect and recover from it, and the rarer problems occur, the less up-front confidence you need. If you have a system that detects a problem starting and auto-recovers in 200 milliseconds, and your customers never notice, the required confidence is low.

On the other hand, if you’re going to lose $10K/second, a 2-minute outage has a direct cost of $1.2M. And that’s without including the intangible cost of the hit to your brand. You’re going to want a bit more confidence when making that change.

So do your unit testing. Do your integration testing. Track your code coverage. Run in your test environment. Shadow your production environment. Until you have the appropriate amount of confidence.

Because all the time you’re building confidence, you’re not getting feedback. You might have confidence that it’s going to work, but you have no confidence that it’s the right thing to do.