
by Leon Rosenshein

This Is The Way

I’ve talked about The Unix Way before. I recently came across a really good description of it. It doesn’t matter if you’re writing a tool or deciding what goes into your commit. It shows up in architecture, in Domain Driven Design and Bounded Contexts. It shows up in non-code places as well. Reducing Work In Progress is just another way of implementing the Unix way.

The best definition of the Unix way I’ve ever seen is:

  Do one thing, and do it well.
  Expect the output of your program to become the input to another.

That’s it. Two things. Simple and straightforward. Of course, the devil is in the details.

First, what is the one thing? Where are the boundaries to that one thing? Are you writing a program, a library, a service, or a platform? Where the customer value comes from drives the boundaries. Where the boundaries are drives what you’re building. But it’s not a one-way flow done in isolation. It’s actually a system with feedback loops, so boundaries influence what the “one thing” is, which then influences how it gets used, which influences customer value. And now we’re talking about levels of abstraction. You can always add another level of abstraction to help solve a problem (except the problem of too many levels of abstraction). The trick is to figure out if you should or not.

A really good way to approach this one is to recognize that it is an iterative process. Understand the domain and the context. Figure out what you need to do. Figure out what you shouldn’t do. Then look at everything you think you need to do and decide if you really need to do it. At any given level you can probably do less, then use composition at a higher level. Even if you never share that lower level with the end user, tighter domains and less coupling will help you. Now and in the future. At the same time, don’t be afraid to merge two contexts or rearrange a group of them into things that make more sense. As you design, your understanding grows. As your understanding grows the boundaries will shift. Use that to your advantage.

Second, how do you tie things together? In classic Unix, the text output of one program is used as the text input to the next. For CLI tools that still makes sense. But lots of things don’t run from the command line. They’re driven by user clicks/taps on a screen, and they might not even run on the same computer. Which means you can’t use your favorite shell’s pipe mechanism. In fact, you need to think about how to tie things together across computers. Data size has grown as well. By multiple orders of magnitude. On Falcon 4 we shipped our entire game on a single CD. Less than 680 MB, including all of the art assets. More recently, for the Global Ortho project each individual image was ~550 MB, and we had over 2 million of them between the US and Western Europe. Even one of those images is a lot of data to pass via a pipe, let alone hundreds or thousands of them. And since you don’t know what’s going to be reading your data, you don’t know what language the next thing will be written in, so your access needs to be cross-language.

The key to solving this issue is to add structure. But only as little structure as you need. And document it. Cleanly, clearly, and with examples. Consider using a formal IDL like Protobuf or Apache Thrift. They’re great for ensuring you have a single source of truth for the definition while letting you have language-specific implementations. They can be used in distributed systems (across the wire) or through persistent stores like file systems or databases. You can even use them as part of your API in a library to keep things consistent. Even if you’re using text for a CLI tool, a little structure will make your life easier. YAML and JSON are good choices. And remember that it’s often useful to have both text output that’s formatted in a way that makes it easy for people to understand (columns, colors, complete sentences, etc.) and a machine-parseable format. With clean, clear documentation.
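
For the CLI case, here’s a minimal sketch of that dual-output idea in Go. The report type, its fields, and the -json flag are all invented for the example; the point is that the same data gets both a human-friendly view and a machine-parseable one.

```go
package main

import (
	"encoding/json"
	"flag"
	"fmt"
	"os"
	"text/tabwriter"
)

// report is a hypothetical result record; the fields are just for illustration.
type report struct {
	Name  string `json:"name"`
	Count int    `json:"count"`
}

func main() {
	asJSON := flag.Bool("json", false, "emit machine-parseable JSON instead of columns")
	flag.Parse()

	results := []report{{"images", 42}, {"tiles", 7}}

	if *asJSON {
		// Structured output for the next program in the pipeline.
		json.NewEncoder(os.Stdout).Encode(results)
		return
	}

	// Human-friendly output: aligned columns with a header.
	w := tabwriter.NewWriter(os.Stdout, 0, 4, 2, ' ', 0)
	fmt.Fprintln(w, "NAME\tCOUNT")
	for _, r := range results {
		fmt.Fprintf(w, "%s\t%d\n", r.Name, r.Count)
	}
	w.Flush()
}
```

The -json output can be piped into the next tool in the chain, while the default view stays pleasant for people. Same data, two consumers, both documented.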

The Unix Way has been around for more than 50 years. Despite the advances in processor, storage, language, and network design and implementation, it’s still as relevant today as it was back then.

by Leon Rosenshein

TDD Anti-patterns

I’ve talked about the difference between Test After Design and Test Driven Design before. I’m a big proponent of writing tests that prove your design before you write code. Not all the tests, because as your knowledge of what the design should be improves, you’ll learn about new tests that need to be written as well. So, as you learn you write more tests to drive the design.

Regardless of when you write the tests, before or after the code, there are a bunch of test anti-patterns to watch out for. Anti-patterns, or code smells, as they’re sometimes known, aren’t necessarily wrong. They are, however, signals that you should look a little bit deeper and check your assumptions.

TDD anti-patterns are definitely not a new idea. A slightly repetitive list was published over 15 years ago, and consolidated about 10 years ago. The shorter list is still valid today.

Check out the list yourself for the full details, but here are a few of the ones that I see getting implemented pretty often:

The Liar

The liar is a test that doesn’t do what it says it’s doing. It’s testing something, and when the thing it’s actually testing isn’t working it fails. An example would be a test called “ValidateInput” that, instead of checking that the method properly validates its input parameters, is actually checking to make sure the result is correct.
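
Here’s what that might look like in Go. ParseAge is a made-up stand-in for the code under test; the first test is the liar, the second does what the first one’s name promises.

```go
package validate

import (
	"strconv"
	"testing"
)

// ParseAge is a stand-in so the tests compile; the real code under test
// would live elsewhere.
func ParseAge(s string) (int, error) {
	return strconv.Atoi(s)
}

// The Liar: the name promises input validation, but the assertion only
// checks the happy-path result.
func TestValidateInput(t *testing.T) {
	got, _ := ParseAge("42") // the error is silently ignored
	if got != 42 {
		t.Errorf("got %d, want 42", got)
	}
}

// What the name actually promises: invalid input is rejected.
func TestParseAgeRejectsInvalidInput(t *testing.T) {
	if _, err := ParseAge("not-a-number"); err == nil {
		t.Error("expected an error for non-numeric input")
	}
}
```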

The Giant

The giant is a test that does too much. It doesn’t just test that invalid input parameters are properly handled. It also checks that valid input gives the correct answer. Or it might be an integration test. Something that requires way too many things to be working properly for the result of the test to tell you anything. One important note here is that a table-driven test that tests all possible combinations of invalid inputs in a single test function is not a giant (as long as each entry in the table is its own test). It’s just a better way to write small tests.
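
For example, here’s a standalone Go sketch of that table-driven style, with one subtest per entry (ParseAge is again a hypothetical function under test):

```go
package validate

import (
	"strconv"
	"testing"
)

// ParseAge is a stand-in for the real code under test, same idea as before.
func ParseAge(s string) (int, error) { return strconv.Atoi(s) }

// One test function, many cases; each table entry runs as its own subtest,
// so a failure names the exact case that broke.
func TestParseAgeInvalidInputs(t *testing.T) {
	cases := []struct {
		name  string
		input string
	}{
		{"empty string", ""},
		{"not a number", "forty-two"},
		{"trailing junk", "42abc"},
	}
	for _, tc := range cases {
		t.Run(tc.name, func(t *testing.T) {
			if _, err := ParseAge(tc.input); err == nil {
				t.Errorf("ParseAge(%q): expected an error", tc.input)
			}
		})
	}
}
```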

The Slow Poke

The slow poke is just that, slow. It takes a long time to run. It might be waiting for the OS to do something, or for a real-world timeout, or it might just be doing too much. Tests need to be fast because entire test suites need to be fast. If your test suite takes minutes to complete, it’s not going to be run. And a test that doesn’t get run might as well have not been written.
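
One common fix is to stop depending on real time at all. Here’s a hedged Go sketch with a made-up cache entry type: the expiry check takes the “current” time as a parameter, so the test that would otherwise sleep for an hour runs in microseconds.

```go
package cache

import (
	"testing"
	"time"
)

// entry expires after ttl; "now" is passed in so tests never have to wait.
type entry struct {
	created time.Time
	ttl     time.Duration
}

func (e entry) expired(now time.Time) bool {
	return now.Sub(e.created) >= e.ttl
}

// The slow poke would call time.Sleep(time.Hour) here. Injecting the
// current time makes the same check instantaneous.
func TestEntryExpires(t *testing.T) {
	start := time.Now()
	e := entry{created: start, ttl: time.Hour}

	if e.expired(start.Add(30 * time.Minute)) {
		t.Error("entry expired too early")
	}
	if !e.expired(start.Add(2 * time.Hour)) {
		t.Error("entry should have expired")
	}
}
```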

Success Against All Odds

The success against all odds test is the worst of all. It takes time to write. It takes time to maintain. It executes your code. Then it says all tests pass. Regardless of what happens. It could be that you’re testing your mock, or, through a series of unfortunate events, you’re comparing the expected output to itself, or, even worse, you forgot to compare the output to the expected output at all. Regardless of the reason, you end up with Success Against All Odds, and you feel good, but you haven’t actually tested anything. You haven’t increased anyone’s confidence, which is Why We Test in the first place.
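
A hedged Go sketch of how that happens in practice; Normalize is a made-up function, and the bug is in the first test, not in the code.

```go
package normalize

import (
	"strings"
	"testing"
)

// Normalize is a stand-in for the code under test.
func Normalize(s string) string {
	return strings.ToLower(strings.TrimSpace(s))
}

// Success Against All Odds: "want" is derived from the actual output,
// so the comparison can never fail, no matter what Normalize does.
func TestNormalizeAlwaysPasses(t *testing.T) {
	got := Normalize("  Hello ")
	want := got // oops: expected computed from actual
	if got != want {
		t.Errorf("got %q, want %q", got, want)
	}
}

// The fix: state the expectation independently of the code under test.
func TestNormalize(t *testing.T) {
	got := Normalize("  Hello ")
	if want := "hello"; got != want {
		t.Errorf("got %q, want %q", got, want)
	}
}
```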

So regardless of when you write your tests, make sure you’re not falling into any of the anti-patterns by mistake or by not paying attention.

by Leon Rosenshein

Monitoring vs. Observability

Pop quiz. What’s the difference between monitoring and observability? First of all, monitoring is something you do. You monitor a system. Observability, on the other hand, is one of the -ilities. It describes a capability, a property, of a system.

Monitoring is collecting logs and metrics. Metrics like throughput, latency, failure rates, and system usage. Monitoring is going over logs and looking for issues, patterns, and keywords, among other things. More advanced monitoring includes some way of tracing requests through a system, whether it’s a distributed system or not. And in many cases, you compare the results to a threshold and fire some kind of alert if that threshold is crossed.

Observability is the ability to know and understand the system’s current state by looking at its outputs: the metrics, logs, and traces. Not just the explicit bits of state you get from the logs and metrics, but the overall system state. Not just whether some threshold of a particular metric has been crossed, like how much RAM is being used, but why that amount of RAM is being used.

In other words, good observability includes monitoring, but just having monitoring doesn’t mean you have good observability.
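
One way to see the difference is in what you emit. This sketch uses an invented emit helper and made-up metric names; the point is that the observable version carries the dimensions that explain the number, not just the number.

```go
package metrics

// emit is a stand-in for whatever metrics client you actually use.
// The metric names and labels below are invented for the example.
func emit(name string, value float64, labels map[string]string) {
	// send to your metrics pipeline
}

func reportMemory(rssBytes, cacheBytes, queueDepth float64, tenant string) {
	// Monitoring: one number. You can alert on it, but you can't explain it.
	emit("process_rss_bytes", rssBytes, nil)

	// Observability: the same signal broken down by the things that drive
	// it, so "why is RAM high?" can be answered from the outside.
	emit("cache_bytes", cacheBytes, map[string]string{"tenant": tenant})
	emit("work_queue_depth", queueDepth, map[string]string{"tenant": tenant})
}
```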

In a system without good observability, if you’ve got a good understanding of your system, you can look at the metrics and deduce if there’s a problem. You can look at the logs, work back from the problem/error, and identify the log message that shows where the problem started. In that case you might be able to use metrics to find the root cause of why a threshold was crossed (and an alert fired that woke you up at 2 o’clock in the morning).

If you don’t have that understanding then what? You start reading code. And tracing it. Looking for code paths that might have caused the issue. You might search for where log messages are being generated and try to understand why. You’re trying to build understanding of the system and rebuild the internal state of the system at the same time. Meanwhile the fire rages.

If your system has good observability then you don’t need that deep understanding. The metrics (and logs and traces) point you to where the problem is, not where the symptom surfaced. So when your p95 response latency crosses your threshold you know not only what happened, but what caused it.

And you usually want that ability. Because at first, before you fully understand a system, you need help figuring out why something went wrong. Later, you understand the system. You’ve put in mechanisms to deal with issues you’ve learned about. You’ve got good metrics to tell you about that state just in case something does happen. So what happens next? Instead of the problem you understand, something you don’t understand happens. Something where your deep knowledge of the system isn’t deep enough. And you’re right back in the not-understanding state.

Which is why good observability is so important.

by Leon Rosenshein

Traditions

Just what is a tradition? More importantly, what do we get from traditions? According to Merriam-Webster, tradition is

  1. a: An inherited, established, or customary pattern of thought, action, or behavior (such as a religious practice or a social custom) b: A belief or story or a body of beliefs or stories relating to the past that are commonly accepted as historical though not verifiable
  2. The handing down of information, beliefs, and customs by word of mouth or by example from one generation to another without written instruction
  3. Cultural continuity in social attitudes, customs, and institutions
  4. Characteristic manner, method, or style

Lots of words that describe what a tradition is. Basically it comes down to “something we’ve gotten from the past”, often without proof. And nothing about the point of a tradition, or its benefits.

There can be lots of benefits to learning from the past. If something worked once then maybe it will work again. If it didn’t work then figuring out why and avoiding that failure mode is a good idea.

Tradition can also reduce your cognitive load. If people share a tradition then you have shared context. It makes communication easier and can reduce transmission errors.

Tradition can reduce the number of decisions you need to make. The variables i, j, and k are often used as simple loop variables or counters. Not because there’s any particular magic to them, but because that’s the way it’s taught and what we expect. Maybe ‘i’ because it’s the first letter in index, and then j and k are the next letters, or maybe not. It’s a tradition and we do it.

Which touches on the problem with tradition. When we follow tradition because it’s tradition then we’re not thinking. Any time we’re not thinking we’re making assumptions and we open ourselves up to problems.

Consider that loop variable again. Is ‘i’ really the best choice? If we’re counting then maybe using count is better. Or some other name that describes intent. Sure, I know ‘i’ is some kind of loop variable, but what does it represent? Blindly following tradition means losing that context.
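
A tiny Go example of the difference; the loop bodies are throwaway, the names are the point.

```go
package main

import "fmt"

func main() {
	rows := []string{"a", "b", "c"}

	// Tradition: i is fine for a throwaway index...
	for i := 0; i < len(rows); i++ {
		fmt.Println(rows[i])
	}

	// ...but a name that states intent carries the context that
	// blindly following the tradition drops.
	const maxRetries = 3
	for retryCount := 0; retryCount < maxRetries; retryCount++ {
		fmt.Println("attempt", retryCount+1)
	}
}
```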

Which is why I like describing tradition this way.

Tradition is not the worship of ashes, but the preservation of fire

                Gustav Mahler

The point of tradition is to pass on the information, the knowledge, the learnings that we have come up with. To avoid replicating problems. To make things easier and more shareable. To not forget where we have come from. To be a guide to doing things better in the future. It’s not to do something the same way just because we’ve always done it that way.

Blindly doing what was done in the past without understanding why it was done that way does us all a disservice. It leads to cargo culting. But that’s a topic for another day.

by Leon Rosenshein

Design Is Iterative

I’ve talked about rework avoidance and the advantage of taking smaller steps. I’ve talked about how it’s different from the older BDUF process. How a sea chart gives you a better chance of getting a successful result.

Given that, and the fact that the Agile Manifesto was written in 2001, you would think that the concept of iterative design is a new one.

However, you’d be wrong.

Way back in 1968, in the Software Engineering Report by the NATO Science Committee, you’ll find this quote by H. A. Kinslow.

The design process is an iterative one. I will tell you one thing which can go wrong with it if you are not in the laboratory. In my terms design consists of:

  1. Flowchart until you think you understand the problem.
  2. Write code until you realize that you don’t.
  3. Go back and re-do the flowchart.
  4. Write some more code and iterate to what you feel is the correct solution

There are a lot of good ideas in that report, but Kinslow’s comment really resonates with me. It plays well with my personal bias for action. Do some planning. Write some code. Repeat. Or put another way, learn from the domain to write some code. Then learn from the code to better understand the domain. Then do it again. By learning from both, iteratively, you come up with a better answer than by learning from only one. And you’ll probably come to that answer sooner.

It’s a system. A system with at least 2 loops. The questioning/understanding loop and the building loop. At first you don’t even know what questions to ask. You start with a problem statement. You think you understand it well enough to write a solution, but then you have questions. So you dig deeper and get your questions answered. Which leads to more understanding and a better solution, which leads to more questions. The cycle repeats until you have enough understanding to build a solution that actually solves the original problem.

So remember, whether it’s your APIs, your class design, your domain boundaries, or how your systems work together, iteration is your friend. Because without iterating you can’t have enough understanding to come up with a good solution.

by Leon Rosenshein

Good Habits

In the past I’ve talked about Why We Test. We test to increase our stakeholders’ confidence. So let’s add all the confidence we can. If a little confidence is good, more confidence is better, right?

Maybe, but maybe not. Because Speed is also important. We want to get things done Sooner, not Faster. It’s not confidence at any cost. It’s enough confidence to move forward.

I think Charity Majors said it really well.

Or you could think of it like running tests is the equivalent of eating well, not smoking, and wearing sunscreen. You get diminishing returns after that.

And if you’re too paranoid to go outside or see friends, it becomes counterproductive to shipping a good healthy life. ☺️🌴

How much confidence do you need? What’s the cost of something going badly? What are your operational metrics (MTTD, MTTF, MTBF, MTTR)? Do you have leading metrics that let you know a problem is coming before customers/users notice?

The more notice you have before your customers notice, the faster you can detect and recover from a problem, and the rarer problems occur, the less confidence you need before shipping. If you have a system that detects a problem starting and auto-recovers in 200 milliseconds, and your customers never notice, the required confidence is low.

On the other hand, if you’re going to lose $10K/second, a 2 minute outage has a direct cost of $1.2M. And that’s without including the intangible expense of the hit to your brand. You’re going to want a bit more confidence when making that change.

So do your unit testing. Do your integration testing. Track your code coverage. Run in your test environment. Shadow your production environment. Until you have the appropriate amount of confidence.

Because all the time you’re building confidence you’re not getting feedback. You might have confidence that it’s going to work, but you have no confidence that it’s the right thing to do.

by Leon Rosenshein

Continuous Delivery

Continuous delivery is the idea that every change, or at least every change a developer thinks is done, gets built into the product and into customers’ hands. It’s a great goal. Reducing the latency between making a change and getting feedback on it means you can make smaller changes and be more confident that the effects you’re seeing are due to the specific change you’ve made.

Over the course of my career I’ve seen a lot of deployment frequencies. One of my first jobs was making concrete blocks. The mold/kiln combo could deploy a pallet of blocks about every 15 seconds, with each pallet having between 1 and 20 blocks on it. And it could do that 24 hours a day, 7 days a week, as long as we fed materials in one end and took the blocks away at the other. That’s continuous deployment.

Or sort of continuous deployment. While we could “deploy” cubes of blocks in minutes, it was 24 hours from the time the material went into the mold until the block was ready to be sold. Which means while we could deploy continuously, the deployment latency was at least 24 hours.

I say at least because there was a big mixer at the front end of the process. It took in sand, dyes, water, and cement and mixed them together into multiple cubic yard lots that would get fed into the mold and eventually the kiln. It took about 20 minutes to mix each batch, and a few hours to use it up. Which meant we could only change block color every few hours. In most cases we could just change the mixture of sand/dye to get the new color, but when we wanted to make white blocks we had to wash out the mixer and all the transport equipment to make sure the new blocks stayed white. And that took a few hours itself. So in practice, the latency to change the color of the blocks wasn’t 24 hours, it was more like 26.

But even that isn’t the whole story. We made different shape blocks as well. Changing molds was a big deal. It took most of a day to do. And there were only a couple of people who could do it efficiently. If any of the rest of us did it the delay was a couple of days. And the mixer and concrete transport system had to be empty or the concrete would set and shut the machine down for days while we chipped the concrete out of all the nooks and crannies. Which meant that depending on who was around, the time to get a new shape of brick out the door was 30+ hours from when we started to make the change.

So here you have a system which can produce a set of blocks every 15 seconds. That’s delivering pretty continuously. Yet even with that kind of continuous delivery it could still take almost 2 days to start delivering something different.

But what does making concrete blocks have to do with delivering software? It’s this. The key factor to continuous delivery is the latency between a change being made and that change being available to the customer. The fact that your compiler takes 10 minutes, or someone can download your app in 2 minutes, doesn’t mean you’ve got continuous delivery. You need to account for all of the time between “it’s done” and “customers can use it”. And keep that number as small as possible. That’s when you see the benefits. You get fast feedback. You can roll back a change that didn’t work quickly and easily. You can try things with confidence.

When you do that, you’ve got continuous delivery.

by Leon Rosenshein

Scream Tests

The other day I mentioned to someone that we had a scream test scheduled for a few weeks from now and they gave me a very confused look. Since I know that it’s my responsibility to make sure my messages are received properly, I asked if they knew what a scream test was. Turns out they didn’t. That day they were one of the lucky 10,000. If you don’t know, today’s your day.

In short, a scream test is what you do when you think something is unused, but you’re not sure, and you’re not sure how to find out. One way to be sure is to simply turn it off. If no one screams, then it was unused. If there are screams, you’ve identified the users. It’s a pretty simple and accurate test.

I’m honestly not sure when I first heard the term, but it was probably sometime in the late 1990s. That’s when I was first running servers and services that other folks used on a regular basis. They were file shares, and back then storage was both expensive and hard to get, so dealing with hoarders was a full-time job. But we never knew for sure who was using the data. We’d ask folks to clean up. We’d tell people it was going to go away on a given day. We’d remind them. Sometimes we’d get a couple of folks saying they didn’t care, but generally, silence. So eventually we’d make the data unavailable. Usually nothing happened. So we’d wait a week, archive to tape, do a restore (this is a critical step in backup planning), then remove it from the server. Space was recovered and everyone was happy.

But every once in a while, about 30 seconds after we took the folder offline, someone would do the office gopher and scream “Where’s my data?” Then we knew who was using that data. We’d either keep it, or figure out exactly what was needed and only keep that. Not perfect, but not a bad outcome.

Even more rarely, something completely unrelated would stop working. Then the real fun began. Why would the monthly accounting run care that the asset share for a 3 year old game was offline? It cared because the person who wrote the script had originally used their desktop for temporary storage as part of the process, but stopped doing that when a memo went out to turn machines off at night to save power and the script failed. So they found some temp storage they could use and started using it. That’s wrong for many reasons. Not the least of which was that everyone at the company had access to that share and could have seen the data if they knew it was there. In that case we identified not only an unknown user, but we also closed a security hole that we didn’t even know we had.

Over the years since, I’ve done similar things with file systems, internal and external services, and once even an entire data center. Turns out turning off a DC can break all sorts of things. Starting with DNS and routing 😊

And now you know what a scream test is.

by Leon Rosenshein

Automation

If you need to do something millions of times a day it needs to be repeatable and, ideally, idempotent. While it’s technically possible to put enough people into your system, Mechanical Turk style, in most cases what you want to do is truly automate the task.

The trick though, as Kelsey Hightower said the other day at The Art of Agile Development book club, is that “automation is a byproduct of understanding.”

You should listen to him explain it, but the basic idea is that first you understand what you need to do, then you figure out how to automate it. And the hard part is getting the understanding. Not just the knowledge of the steps you took. That’s the easy part. You also need to do the hard part: understanding why those steps, and why in that specific order.

It’s only after you have that understanding that you can really automate something. Without it you can execute the steps, and probably even notice and acknowledge failures along the way. But you can’t really understand the reason behind the failures. Which means you can’t do anything about them. Instead of automation, you’ve got a self-executing runbook. That’s a great place to start, and it might be enough for a while, but it’s not really an automated system.

An automated system understands your intent. It understands the goal you’re trying to achieve and helps you achieve it. It adapts around errors and delays. It figures out how to get from the current state to the desired state. It notices deviations and corrects for them as well. And that gets into promise theory, which is a story for another day.
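
Here’s a minimal sketch of that idea in Go. Everything in it, the System interface, the state strings, the tick interval, is invented for the example; the shape to notice is that the automation is told the desired state, observes the current state, and keeps converging, instead of replaying a fixed list of steps.

```go
package reconcile

import (
	"context"
	"time"
)

// System is a hypothetical interface over whatever you're automating.
type System interface {
	Current(ctx context.Context) (state string, err error)
	MoveToward(ctx context.Context, desired string) error
}

// Reconcile expresses intent ("make the state be desired") rather than a
// script of steps. It observes, corrects deviations, and retries around
// transient errors until the goal is reached or the context ends.
func Reconcile(ctx context.Context, sys System, desired string) error {
	ticker := time.NewTicker(5 * time.Second)
	defer ticker.Stop()

	for {
		current, err := sys.Current(ctx)
		if err == nil && current == desired {
			return nil // goal reached
		}
		if err == nil {
			// Deviation detected; take a step toward the goal. Errors here
			// are treated as transient: we just try again on the next tick.
			_ = sys.MoveToward(ctx, desired)
		}

		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-ticker.C:
		}
	}
}
```

That’s the difference between a self-executing runbook and automation that understands intent: the loop doesn’t care which step failed, it only cares whether the system has reached the goal yet.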

by Leon Rosenshein

TAD vs. TDD

Test After Development (TAD) or Test Driven Development (TDD)? What’s the difference? They sound pretty similar, and the Levenshtein distance between them is only 5, but the difference is pretty profound.

In TAD you write some code, try it out with some test cases, and write some tests to cover the complicated or confusing parts. You want to have some confidence that the code does what you want it to, so you write some tests. Maybe you write some other tests to make sure some of the likely errors are handled. Simple and straightforward.

In TDD, on the other hand, you know what you want to do. But instead of figuring out how to do it, you write tests. You write tests that are examples of doing it. You write tests that demonstrate how you want people to use it. You write some showing how you think they’ll try to use it. You write tests with the kinds of mistakes you think they’ll make. Of course, none of those tests pass. How could they? There’s no code to do it yet.

Then you look at all of those tests. And the APIs and function calls you’ll need to make it happen. And you think about them. Individually and as a group. How they fit together. If they’re consistent, in their inputs, outputs, error reporting, and “style”. If they solve someone’s problem, or are just a bunch of building blocks.

Then you go back and adjust the API. You make it consistent. Logically and stylistically. You make sure it fits together. You make sure that it can be used to solve the problem you’re trying to solve. And you re-write the tests to show it. But they still fail, since you haven’t written the code yet.

After all your tests are written, your APIs are updated, and all the tests compile but fail, you start to write the code. As you write more code, more of the tests pass. Finally all the tests pass. Then you realize you’ve got an internal design problem. Something is in the wrong domain or you find some coupling you don’t want. So you refactor it. Some of the tests fail, but that’s OK. You find and fix the problem. All the tests pass and you’re done. Simple.

It took 1 paragraph to describe TAD and 4 to describe TDD. Based on that TAD is better, right? Maybe in some cases, but generally, no. And here’s some of the reasons why.

Looking at things from the outside

With TAD you generally don’t get a chance to look at your API as a whole. You don’t look for the inconsistencies, you just look at functionality. You might not upset your users, but you’re unlikely to delight them. It works, but it doesn’t feel smooth.

With TDD you have a ready-made source of documentation. Your tests are a great way to show your users how you expect the API to be used. And when your users tell you how they’re really using your API you can easily add that to your behavioral test suite to make sure your APIs are for life.
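
For example, a behavioral test like this Go sketch (the Shortener API is invented, and the implementation is stubbed in only so the example compiles) is exactly the kind of test you’d write before the code exists, and it keeps working as documentation afterwards.

```go
package shorten

import "testing"

// Shortener is sketched here only so the example compiles; under TDD the
// test below is written first and this type grows to make it pass.
type Shortener struct {
	base string
	urls map[string]string
}

func NewShortener(base string) *Shortener {
	return &Shortener{base: base, urls: map[string]string{}}
}

func (s *Shortener) Shorten(long string) (string, error) {
	short := s.base + "/k0" // placeholder key; a real one would be unique
	s.urls[short] = long
	return short, nil
}

func (s *Shortener) Resolve(short string) (string, error) {
	return s.urls[short], nil
}

// The test doubles as documentation: this is how we expect callers to
// use the API, written down before the implementation was.
func TestShortenThenResolveRoundTrips(t *testing.T) {
	s := NewShortener("https://example.test")

	short, err := s.Shorten("https://example.test/some/very/long/path")
	if err != nil {
		t.Fatalf("Shorten: %v", err)
	}

	long, err := s.Resolve(short)
	if err != nil {
		t.Fatalf("Resolve(%q): %v", short, err)
	}
	if long != "https://example.test/some/very/long/path" {
		t.Errorf("round trip changed the URL: got %q", long)
	}
}
```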

Looking at things from the inside

You will need to make changes. Your changes need to fit in with the existing system. With TDD you can see how the changes fit with the existing API. You can see if they are (or aren’t) working well together. You can ensure that your API stays logically consistent.

You’re always learning about the domain you’re modeling. As your understanding grows the boundaries between the domains will shift. Functionality and data will want to move between domains. There will be refactoring. TDD helps you know that what you’re changing doesn’t change the observable behavior. Because that’s what your users care about.

That’s why TDD generally leads to better results than TAD.

NB

One final caveat. While TDD makes your code and APIs better, it’s not the entire solution. No one methodology is. TDD is a great foundation, but for your own peace of mind you’ll also want other methodologies. Like unit tests for specific functionality and validating complex internal logic that doesn’t directly map to externally visible behavior. Or integration tests. Or testing on large data sets. Or whatever else your specific context demands.