Recent Posts (page 4 / 71)

by Leon Rosenshein

Emergency Procedures

The other day I ran into a quote on the internet about the problem with emergency procedures. I generally agree with it. The quote went like this:

If you wouldn’t use your emergency process to deliver normal changes because it’s too risky, why the hell would you use it in an emergency?

But, as always, It Depends. It’s about the risk/reward ratio. You want the ratio to be low. If the system is down, the risk of breaking the system is low. If it’s down, you can’t crash it.

In general, you want one, and only one, deployment process. You want it to be easy, automated, idempotent, and recoverable. You want it to be well exercised, well documented, well tested, and fail safe. And in almost all cases, you should use it. All of the checks, validations, and audit trails are there for a reason (see Chesterton’s Fence).

The main goal of any deployment process is to make sure the user experience is not degraded. Or at least only temporarily degraded by a very small, broadly agreed upon amount. That means making sure that nothing happens by mistake and without an appropriate amount of validation. There can be tests (unit, integration, or system) that need to pass. There can be configuration validators. There can be business, legal, and communications sign-off. All in service of making sure no one has a bad interaction. Actually deploying the new thing is often just a tiny part of the whole process. There’s a high risk of something going wrong, so you need to be careful to keep the overall risk/reward ratio down.

In an emergency situation though, the constraints are different. If the system is down, you can’t crash it. You can’t reduce the throughput rate. You can’t make the experience worse for users1. The risk of making things worse is low, so the risk/reward ratio is biased lower.

In fact, many of the things you normally do to make sure you don’t have an outage are unneeded. You don’t need to keep in-flight operations going (because there are no in-flight operations). Instead, you can skip the step of your process that drains running instances. You don’t need to do a phased update to maintain your throughput. When nothing is happening, getting anything running is a step forward. Because nothing is running, you don’t need to do a phased roll-out to check for performance deltas or emergent behavior or edge cases. After all, things can’t get much worse. Those are just a few of the things you don’t have to worry about when you’re trying to mitigate an outage.
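To make that concrete, here’s a minimal sketch of the idea in Go, with made-up step names (this isn’t any particular deploy tool): one pipeline, where the steps that only exist to protect live traffic get skipped when there’s no live traffic to protect.

```go
package main

import "fmt"

// Step is an illustrative pipeline step, not a real deploy tool's schema.
type Step struct {
	Name          string
	OnlyIfServing bool // only matters when the system is actually serving traffic
}

var pipeline = []Step{
	{Name: "validate config", OnlyIfServing: false},
	{Name: "run smoke tests", OnlyIfServing: false},
	{Name: "drain in-flight requests", OnlyIfServing: true},
	{Name: "phased roll-out with canary checks", OnlyIfServing: true},
	{Name: "deploy everywhere", OnlyIfServing: false},
}

// deploy runs the same pipeline either way; the emergency path just skips the
// steps whose only purpose is to protect traffic that isn't flowing.
func deploy(systemIsServing bool) {
	for _, s := range pipeline {
		if s.OnlyIfServing && !systemIsServing {
			fmt.Println("skipping:", s.Name)
			continue
		}
		fmt.Println("running:", s.Name)
	}
}

func main() {
	deploy(false) // outage: the checks that protect live traffic buy you nothing
}
```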

Magic wand behind glass labeled 'In case of emergency, break glass'

Or to put it more simply, outage recovery is a different operation than system upgrade. When dealing with an outage, the first step is to mitigate the problem. When doing an upgrade, the most important goal is to have customers/users only see the good changes. There should be no negative changes to any user. Many steps can (and should) be shared between the two processes. But the goals are different, so the process is going to be different.


  1. Ok, there are things you can do to make it worse. Like lose data. Or expose personal data. But generally speaking, if your system is down, you can’t make the user experience worse. ↩︎

by Leon Rosenshein

Careers are Non-Linear

Hiring has been on my mind lately. I’ve been looking for an entry level developer. Someone just starting out in their career. I’ve described the arc of my career before. In fact, I came up with what I think is a pretty novel (and useful) way to describe the arc of a career. It’s also good for helping you visualize where you are at any given point compared to your company’s (or more specifically your manager’s) expectations. Anything that helps you do that gap analysis with your manager and guide the discussion on how you’re going to close those gaps is good for your career.

One important thing to keep in mind, though, is that while we think of careers as always being “up and to the right”1, that’s not really the case. In fact, careers, as experienced by the person living them, are very non-linear. The slope changes. It can even be negative, particularly in one aspect or another, even when the overall arc of the career trends towards more scope of influence.

Every career change, whether it’s role on a team, changing teams, promotion to a new level, or changing companies, changes your context. Everything you learned about where you were is still true, in that context. And much of it is still true in your new context. But not all of it.

That’s why your career is non-linear. What you’ve done got you to where you are. It was the right thing at the right time, in the right place. And while I don’t believe that the Peter Principle is generally true, when you start that new role, you definitely know less about it than you did about the role you just left. It’s not that you get promoted until you can’t do the job. You get promoted until you can’t learn/grow enough to do the next job. And you didn’t get the role change you just made because you can’t do the new job. You got it because they think you can.

Think about it. If you really were ready for that next role (and people knew it and it was available), you would have gotten it. Since you didn’t, you’ve gone from being at the top of your old role, one of the best around, to being OK at your new role. Not bad, but nowhere near ready for promotion. So compared to other folks in the new role, you’re closer to the bottom than you are to the top. And that can feel like a move backward2.

All of that is when you’re staying in the same basic job/role. Moving from IC to Manager has all of those issues, and a whole set of its own. The same applies when transitioning between Product/Project/Program Management and Development, or really any other “discipline” (using the term very loosely here).

The important thing to remember is that all of these steps are advances in your career. Even if they don’t feel like it to you at the time. Even (especially?) if they feel like more work. When you’re challenged and succeed, you grow.

As has been said, “If you rest, you rust”.


  1. Progress being up and to the right is a metaphor. Knowing how we use metaphors, where they come from, and how they can subtly influence things even when you’re not trying, is an important topic, but for another day. Meanwhile, consider Metaphors We Live By and Darmok as tokens of the importance of metaphor in our lives. ↩︎

  2. Back in the day at Microsoft, the Principal band was levels 65-67. A three level span doesn’t seem like much, but in fact it was huge. L64, senior engineer, was considered a terminal level. If you reached level 64 there was no longer any expectation that you would get promoted or eventually be asked to leave. L59-L63 was considered up or out, the only difference was the time. Moving from L64 to L65 (Senior to Principal) was a big deal. It was an inflection point in your career. From L65 on, even if you had no direct reports, you were expected to show results through others. You still had to do your work, but the big expectation was around how you impacted others. That’s fine and makes sense. The problem was that back then everyone in the Principal band was compared to everyone else in the same band. And newly promoted folks at L65 were being compared to folks at L67 who were being considered for promotion. L68 was Vice President. So in your first review cycle as a new L65, you were suddenly compared to someone about to become a Vice President. Unsurprisingly, L65s didn’t come out well in that comparison. It certainly felt like a step backwards. Talk about imposter syndrome. ↩︎

by Leon Rosenshein

People Over Process

As seen on the internet

People over process.

Why?

Because systems can’t fix problems with people, but people can fix problems with systems.

People Over Process is shorthand for the Agile Manifesto’s first value, “Individuals and interactions over processes and tools.” There’s a lot to unpack there. It starts by acknowledging that software development is a socio-technical endeavor. There are people (that’s the socio part). But there are also tools and rules and processes, which makes it technical.

First, and foremost, it’s over, not or. It’s not a Boolean choice. You get to have some of each. If you choose to focus only on the people, making them safe and happy, you can’t organize. You can’t even self-organize. Because without some norms, some process, you can’t communicate. And if you can’t communicate, you can’t coordinate. Not because no one cares, and not because no one wants to listen, but because anarchy is the opposite of coordination. Even the most committed libertarian knows that there needs to be some structure. Or you end up with the tragedy of the commons.

And if you choose process only, the first time something happens that your process doesn’t cover then you get stuck. Unless/until you can come up with a new process. Which takes a while, because there’s a well-developed process for changing the process. You did remember to add that to your set of processes, right?

So don’t ever let yourself be tricked into turning an analog choice into a Boolean one. Trust me, it won’t end well.

Second, it’s not quite accurate. Systems, particularly feedback systems, can fix, or at least minimize, problems with people. Processes, just like those warning signs on ladders, are there for a reason. They’re there because at some point in the past not just one person, but enough people did things in a way they thought was right, but was actually dangerous, and got hurt or killed. Processes are institutional scar tissue. Something bad happened, and the process is there to make sure it never happens again. The process is there for a reason, and that reason is so that the system can heal from a person’s mistake.

The trick is to have the right balance between the two. The agile manifesto says people over process, so at least 51%/49%, and less than 100%/0%, but that’s a pretty big range. Where you land in that range depends on lots of things. The context. The people. The familiarity of the people with the context. And some trial and error, because you’re unlikely to get the balance right the first time.

And there you have it. People over process. Because people can fix issues in your system. And your system needs to have processes to protect it from the people.

by Leon Rosenshein

The Power Of Examples

I’ve subscribed to Kent Beck’s Tidy First substack, and there’s lots of useful info there. He just posted a piece on Why TDD doesn’t Lead to Dumb Code. As usual, it’s a really good entry.

But what really stood out to me in that post was not what he was saying, but how he was saying it. In particular, his use of an example. Beck is trying to answer why TDD doesn’t lead to overly specific code. The task at hand is to use TDD to write a function called factorial. As a software developer, figuring out the factorial of a number is something I’m very familiar with. So the amount of cognitive overhead to understand the problem space was approximately zero. That left all of my bandwidth free to understand the message about TDD and generalization that he was really trying to get across.
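If you haven’t read the post, the shape of the exercise looks something like this (a Go-flavored sketch of the idea, not Beck’s actual code): the first test could pass with a hard-coded return 1, and the second test is what forces the general solution.

```go
// factorial_test.go — implementation and tests kept in one file just to make
// the sketch self-contained.
package factorial

import "testing"

// Factorial is where the second test pushes you: a general loop instead of a
// hard-coded answer.
func Factorial(n int) int {
	result := 1
	for i := 2; i <= n; i++ {
		result *= i
	}
	return result
}

func TestFactorialOfZeroIsOne(t *testing.T) {
	if got := Factorial(0); got != 1 {
		t.Errorf("Factorial(0) = %d, want 1", got)
	}
}

func TestFactorialOfFiveIsOneHundredTwenty(t *testing.T) {
	if got := Factorial(5); got != 120 {
		t.Errorf("Factorial(5) = %d, want 120", got)
	}
}
```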

That’s the beauty of good examples. They help the reader/listener understand the problem and the solution. And good examples don’t burden them with additional things they need to learn before they can get to the information you’re trying to impart.

One good way to do that is by knowing your audience and understanding their context. It’s great if you share the same context, but the key is speaking in their context. As the presenter of information, it’s on you to find the right example for the group you’re speaking to.

That’s why a lot of software stories use car analogies. Cars are ubiquitous. The classic agile incremental build image

works so well because you don’t need to think very hard to understand that if you value easier transportation, you get nothing until the end in the top half, while there’s value added at each step of the bottom half. That’s a great example.

Besides tailoring your example to your audience, Hillel Wayne goes a step further and talks about the difference between instructive and persuasive examples. More importantly, he notes that while an example might be good at one or the other, you still need to use the right example depending on what you’re trying to do. A good instructive example is often not persuasive, and an example that’s very persuasive might not be good at teaching something. Like everything else about software, and engineering in general, It Depends.

All of this is just to say that good examples are hard to find. And they’re also very important. And worth the effort to find.

Because if you do, you’re much more likely to get your point across. Which is your goal in any communication. Hopefully my examples here have helped me get mine across.

by Leon Rosenshein

Biases: The Tyranny Of Or

I’ve mentioned the Tyranny of Or many times. I talked directly about it four years ago, and I stand by what I said. However, because OR is such a loaded word, there’s more to say. And I get to use the Agile Manifesto, and how it’s often mistakenly applied, as an example.

To recap, the Tyranny of Or is any time you are forced into, or have convinced yourself that you are in, a situation where you need to make a choice between two options. For example, you’re standing at the checkout counter and there are Snickers bars and Milky Way bars. They cost the same, and you only have enough cash for one of them. So you force yourself to choose. Snickers OR Milky Way. You must choose one. There is no other option.

The thing is, there are other options. It’s often just a bias that we use to fool ourselves. To avoid having to spend the mental energy to make a more thoughtful, nuanced, decision. Yes, sometimes the choice truly is “or”, but very often it’s not.

Consider the candy choice. There are also Kit Kat, Almond Joy, and M&M’s (both plain and peanut) options. You could buy one of them instead. Or you could buy nothing. You might only be able to buy one thing, but you weren’t actually limited to the original OR. There might even be a smaller version of each, where you could buy both. Maybe you can turn the or into an and. Here’s another option. Put something else back and get both full-size candy bars. You only think you must choose one. It’s a false dichotomy.

Then there are things that we simplify into an OR. Consider the principles of the Agile Manifesto.

  • Individuals and interactions over processes and tools
  • Working software over comprehensive documentation
  • Customer collaboration over contract negotiation
  • Responding to change over following a plan

The people who came up with that list were very careful about how they worded it. They didn’t say Left good. Right bad. They didn’t say instead of. They very clearly said

We are uncovering better ways of developing software by doing it and helping others do it. Through this work we have come to value:

They value the things on the left over the things on the right. They didn’t say working software instead of documentation. They didn’t say responding to change instead of following a plan. You can do both. Make things work, and document the why’s and the why not’s. Respond to change, but have a sea chart with a clear goal. You might not know how you’re going to get there, but you do need to know where you’re going. Even if the goal might change over time.

When you’re faced with a value choice, it’s very easy to simplify it into an or. Doing only A or B is easier to explain, discuss, and reason about than 70% of A and 30% of B. Is that the right percentage? Are you actually hitting that exact percentage? How can you make sure a group agrees on those things? That’s way more cognitive load than “We’re doing A and only A”.

Software engineering is Engineering. The art of the possible. The compromise. Because It Depends. It’s always context dependent and nuanced.

So next time you think you’re forced to make a choice between A and B, check your assumptions and check your biases. It might really be a choice between A and B. But it might not be.

by Leon Rosenshein

Simple or Easy?

Here’s a question for you. What’s the difference between simple and easy? Are they different, are they the same, or is one a superset of the other? If you had to choose one, which would you choose? And why?

First, let’s see what Sir Merriam-Webster’s reliable book has to say.

SIMPLE

readily understood or performed
    simple directions
    the adjustment was simple to make

EASY

requiring or indicating little effort, thought, or reflection
    easy clichés

Similar, but not quite the same. It’s kind of like the difference between complex and complicated. Complicated processes are often made up of many steps, each of which is easy to do. You just need to do them correctly and in the correct order.

Complex things, on the other hand, are often simple. A Foucault Pendulum is simple. It’s just a weight on the end of a string. Describing its motion, on the other hand, is very complex. It’s a function of the weight, the length of the string, and the Earth itself, including the pendulum’s position on the Earth. Once you know about it, it’s simple, but figuring it out from just watching the pendulum swing is very hard.
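For a sense of the gap, here’s the textbook small-angle description (an illustration, not something from the original post), where L is the string length, g is gravity, ω⊕ is the Earth’s rotation rate, and φ is the latitude:

```latex
T \approx 2\pi\sqrt{\frac{L}{g}}
\qquad\qquad
\Omega_{\text{precession}} = \omega_{\oplus}\sin\varphi
```

The period of each swing is set by the string, but the slow rotation of the swing plane is set by where on Earth the pendulum sits: at 45° latitude it takes roughly 34 hours to come full circle, at the pole about one sidereal day, and at the equator it doesn’t rotate at all. A weight on a string, and yet the answer depends on the whole planet.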

It’s often like that in software. Complicated systems, tools, and pipelines are tedious to build, and hard to get right, because they’re so exacting. Bash scripts are very often complicated, but because they’re scripts, each step is easy. Distributed systems are simple. Just do things in parallel on different machines. Do the exact same things you would do on one machine, just do half of them on one machine and half on the other. Simple. But complex, because the failure modes are many, and the differences between them are subtle.

To answer the questions:

Are simple and easy the same thing?

No, they’re not. Simple is easy to understand. Easy takes little effort to do. But beware, not everyone will agree on whether something is simple or easy. Or why.

Is one a superset of the other?

No. They’re related, and something can be both simple and easy. In fact, many things that are simple are easy, and many things that are easy are simple, but being simple does not imply being easy, nor does being easy imply being simple. There’s not a causal relationship between them.

If you had to choose one, which would you choose?

Of course, It Depends.

When I’m starting out on something I’m unfamiliar with, I want to make it easy. There might be lots of steps, and it might be complicated, but until I understand the system, I want it to be easy. Every part should do one thing and one thing only. Interactions between parts should be minimized. If something is not right, I want there to be only one place to look for the problem, and I want to be able to work on that one thing and not have to worry about breaking anything else. Of course, doing things that way often ends up with a brittle system that is easy to use, if you use it just right.

Later, for the same project, when I’m more familiar with the domain, I’m going to want to make it simple to use. The internals will become more complex, with interactions that require deeper understanding to deal with problems. It won’t be as easy to understand the intricacies, but the surface will be simple. Using it will be robust and easy to understand. It won’t ever surprise you, even if you misuse it, and it will be hard to misuse.

But what if you REALLY had to choose just one?

In that case, I’d choose easy. Because easy takes little effort right now. So I can add value. And if I can keep it easy, I’ll always be able to add value. And when you do that, you usually find that keeping it easy eventually makes it simple.

by Leon Rosenshein

Governing the Commons

Adding to my book reviews, consider Governing the Commons by Elinor Ostrom. You might wonder what a book about Turkish fisheries, Swiss grazing pastures, Japanese forests, and Spanish and Philippine water systems has to do with software development. It does seem to be a bit of a stretch.

Some background first. The common part is all about what the Commons actually are. In this case, it’s the same commons talked about in The Tragedy of the Commons. A shared resource. Traditionally, that’s used to denote a shared, finite, natural resource. Like a pasture, a forest, or water. That’s the commons or, as Ostrom says, the Common Pool Resource (CPR).

In software development, particularly large projects with multiple teams, but really, any project with multiple developers, there are many shared finite resources. The most obvious are time, IO bandwidth, and storage (RAM and long term). You can probably list others, but as you can see, software has its own CPRs.

Those tangible things aren’t the only CPRs though. There are also more intangible things. Things that make up Internal Software Quality (ISQ). Like architecture, naming conventions, coding styles, and domains and boundaries. Those things may not be finite, but they are common and public, and they sit on a continuum that is influenced by local actions with regard to the whole.

That’s how software development is like the Commons, and Ostrom’s findings and conclusions can tell us something about how we might do things differently to encourage better results.

Conventional wisdom tells us that when you have a group of actors, each looking out for their best interests, who must share a finite resource, they quickly use it all up, because each actor is working toward their own local maximum. They each want to get as much of the resource as they can, before anyone else can get it. Or, one actor ends up controlling the resource, to the detriment of everyone else. That’s the tragedy. And the most common “solution” to the problem is heavy-handed, external (often governmental) regulation. It sort of works, but leaves everyone looking for a way to game the system and get the most for themselves, even if it makes things overall worse for everyone.

What Ostrom found, after looking at the various successfully managed CPRs, was a list of 8 guiding principles:

Group boundaries are clearly defined

While the shared owners and individual teams may be changing, everyone agrees on who is in the overall group, and which sub-group they’re in. It’s self-organized and dynamic, and often based on unwritten tribal knowledge, but the organization can be clearly seen.

Rules of use are matched to local conditions

Since the resources under consideration are different, the rules around their use are specific to those resources. It’s not as simple as saying “There are 1000 users. You each get 0.1%.”

Most (all?) actors are involved in modifying the rules

If you’re in the community, you’re automatically part of governing the community and the resource. If you’re not in the community, you’re kept out of the resource, until you join. Joining may be complicated, but it is doable.

The community is allowed to set and enforce the rules

The converse of the above. If you’re not in the community, you have little (or no) say. External influences (like a government) are influences only, and only have influence on group members, not the group itself.

The community monitors itself

The community tracks itself and identifies and “grades” infractions. Again, this is often known but unwritten, but for larger communities it may be written and public.

There are sanctions for breaking the rules

If you break the rules, there are consequences. The bigger the offense, the bigger the consequence. Up to and including being excluded from the community.

The community manages conflict resolution

The community handles its own disputes. You can bring up a dispute you’re involved in, and the community can step in where there are unresolved disputes, but either way, it stays internal.

It’s a system, so communities and CPRs can be nested

The only way to scale to large numbers of actors or groups of actors is to have a hierarchy and nest groups and portions of the commons inside larger groups. The challenge here is to maintain the other 7 principles while building and operating the hierarchy.

Sure, when Ostrom wrote this she was talking about natural resources, but software development is a socio-technical endeavor. As such, there are lots of CPRs that need to be managed. Applying these principles can help us maintain those resources at an appropriate level while ensuring that everyone, both individually and collectively, has the right amount of those resources.

Finally, there’s a download of the book available on archive.org. The formatting is pretty bad, but the words seem to be correct.

by Leon Rosenshein

More Error Types

I’ve talked about Type I (False Positive) and Type II (False Negative) errors before. While it would have been so much better if they just called them False Positive and False Negative cases, they only cover part of the problem. A more complete list would include the Type III (the right answer to the wrong problem) and Type IV (the right answer for the wrong reason) errors.

The Type III error ought to be innocuous. After all, you may have wasted some time getting to an answer, but you got to a correct answer. Using that answer is going to be good, right? As usual, It Depends.

To use a car analogy, you’re driving down the road and you hear a thumping noise. Nothing seems obvious, so you keep going. A few miles later, the engine dies, and you safely get the car to the side of the road. After taking a few minutes to calm down, you get out of the car and walk towards the front of it. You notice that the hood isn’t fully latched down. So you open the hood, slam it properly, and confidently get back in the car to continue down the road. Unfortunately, when you go to start the engine, it won’t start. Yes, the hood wasn’t latched, and it may have been causing the noise, but fixing it didn’t help get the car running at all.

Since that didn’t work, you look around some more. You notice that there’s a loose spark plug wire. That’s surely a problem. You fix it, but the car still doesn’t start. You keep finding and fixing things, but you can’t get the engine to start. It turns out that the answer to the question of why the engine died is that you ran out of gas. All those other things, which are related to the car’s operation, don’t help. They are problems, and you came up with solutions to them. They answered the question of why there was a thumping noise. They answered the question of why the car was running roughly. But they didn’t answer the question of why the engine stopped running. You’re still sitting on the side of the road.

So before you go accepting whatever answer you got, make sure it’s the answer to the question you asked. Otherwise, it’s not going to solve your problem. And, might make things worse.

The Type IV error can be even harder to spot, and can do even more damage later. But why? You got the right answer, and it solved your problem. Yes, it did, but it also taught you something that isn’t true. It’s now one of the things you know that just ain’t so. And that’s one of the hardest biases to fight.

Going back to the car analogy, one day your car is pulling slightly to the right. You remember from watching NASCAR races that you can change the handling of your car by changing tire pressure, so you check, and the right front tire was a bit low. You add some more air to the tire. The problem goes away. You’re happy. Over the next few months, it happens a few more times because there’s a small leak in the tire. When the car starts pulling you know it’s time to add some air. When you get new tires there’s no leak, and you don’t have any more pulling. Then one day the car starts pulling to the left. You use your new knowledge and add air to the left front tire. The problem goes away, and you’re happy. It doesn’t come back and you forget about it. Months later you find that the left front tire is completely worn out in the center because it’s been overinflated. The car wasn’t pulling to the left because of a low tire, it was because of that pothole you hit, which slightly bent the tie rod. So now you have two problems. A bald tire and a bent tie rod.

If it hadn’t been for that Type IV error, the right answer for the wrong reason, you wouldn’t need to get at least 2 new tires and a tie rod. If you hadn’t known that inflating the tire would solve your pulling problem, the right answer for the wrong reason, you probably would have dug a bit deeper and solved the real problem.

So before you go and use the solution you already know works, make sure it’s the solution for the current problem, not the solution for another problem with the same symptoms.

by Leon Rosenshein

Strong Opinions, Loosely Held - Part 2

A while ago I talked about Strong Opinions, Loosely Held. I talked about the problem with people with structural power (the HiPPO) having strong opinions and how those opinions often get too much weight.

Picture of a hippo at a desk making a proclamation

The HiPPO Timo Elliott

As Jim Barksdale of Netscape once said,

“If we have data, let’s look at data. If all we have are opinions, let’s go with mine.”

Doing that can easily get you into trouble, because being a HiPPO doesn’t make your opinion right. In that post I only teased the second part, so it’s time to talk about that.

After making sure other opinions get sufficient space to be investigated and thought about, the most important thing you can do with your strong opinions is to hold on loosely, but don’t let go too soon. You have that opinion for a reason. It might be a good one. It’s probably based on some experience you had or data you’ve seen.

But that doesn’t mean that it’s still the right opinion. You need to make sure that it’s still the right opinion. You need to check your biases.

Like confirmation bias, where you only look at data that supports your opinion. You need to look at that data, but you also need to look at contrary data. Especially when it’s brought by someone with a different opinion.

Or recency bias. Something like this just happened, so obviously it’s going to be like that next time, right? Possibly, but not definitely. Maybe what just happened was a black swan event. Or the circumstances were different.

Optimism bias is another one. Sure, it didn’t work last time, but we’ve learned a lot and we can be better/faster/stronger than last time. We can make it work now. Maybe, but maybe not. You need to be honest with yourself about your capabilities.

Ego bias is a big one. Especially for HiPPOs. After all, if they’re the highest paid, they must be the smartest/most valuable/most correct, so their opinion must be the best. That might be true for some set of conditions/areas, but no single person is the smartest in the room at everything. Again, honesty is the antidote to bias.

There are many other biases and reasons why your opinion might be wrong. The trick is to not be so tied to your opinions that you’re not able to listen to reason. To listen to another opinion.

Indecision can cripple a project/team. But so can slavish devotion to an opinion that has been proven to be incorrect. So have a strong opinion. Present it with confidence. But don’t force it on others. Encourage a discussion. And when you learn new facts, or you see that your opinion is not helping you, allow yourself to change your opinion.

Then work from that opinion/decision. Until there’s a good reason to change it.

by Leon Rosenshein

Virtuous tests

I’ve talked about the different kinds of tests. I’ve talked about when to run the different kinds of tests. I’ve said that your tests need to be good tests. All of that is true. What I haven’t done is talk about what makes a good test.

Of course, the answer to the question of what makes a test good is, It Depends. Mostly it depends on what kind of test you’re writing and when you’re running it. System tests are different from integration tests, which are different from unit tests. They do, however, have a few things in common. While your code should be SOLID (or CUPID), your tests should be FIRST. Your tests should follow the Arrange, Act, Assert pattern.

Those are all structural guidelines though. What helps separate great tests from just good tests? Tests are code, so they should have those virtues. They should also have a few more that only apply to tests.

Have a good name

Regardless of what you’re testing, your test should start with a good name. You want to be able to look at a list of failed tests and understand exactly what failed. The name of the test needs to tell you what’s being tested. If your tests are called TestA, TestB, and TestC, you might know what a failure of TestC means 10 minutes after you wrote the test, but when it fails in 3 months due to a bad refactor, you (or whoever is running the test) will have no idea what failed. Even more specifically, a good test name describes what is being tested, and tells you about the test case and the expected result.
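For example (hypothetical names, Go-style), compare what each of these tells you when it shows up in a failure report:

```go
package config_test

import "testing"

// Tells you nothing when it fails.
func TestC(t *testing.T) {
	// ... test body elided ...
}

// Names the unit under test, the case, and the expected result.
func TestParseConfig_MissingFile_ReturnsNotFoundError(t *testing.T) {
	// ... test body elided ...
}
```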

Test one thing at a time

This makes the first thing easier. A test should only validate one thing at a time. Through good use of fakes, mocks, and dummy components, you ensure that the result of the test only reflects what happens with the thing you care about. Everything other than the thing under test should not be able to fail. (Yes, I know there are cases where this is impossible. Especially when there are noisy neighbors, but do the best you can).
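Here’s a minimal Go-flavored sketch of that isolation (Charger, Checkout, and fakeCharger are all made-up names for illustration): the fake can’t flake, time out, or hit the network, so a failure can only mean the code under test is wrong. It also shows the next virtue, validating the actual result rather than just the absence of an error.

```go
// billing_test.go — code under test and its test kept in one file for brevity.
package billing

import (
	"errors"
	"testing"
)

// Checkout depends on an interface, not a real payment gateway.
type Charger interface {
	Charge(cents int) error
}

func Checkout(c Charger, cents int) error {
	if cents <= 0 {
		return errors.New("nothing to charge")
	}
	return c.Charge(cents)
}

// fakeCharger just records what it was asked to do.
type fakeCharger struct{ got int }

func (f *fakeCharger) Charge(cents int) error { f.got = cents; return nil }

func TestCheckout_ChargesTheFullAmount(t *testing.T) {
	fake := &fakeCharger{}
	if err := Checkout(fake, 1299); err != nil {
		t.Fatalf("Checkout returned %v, want nil", err)
	}
	if fake.got != 1299 {
		t.Errorf("charged %d cents, want 1299", fake.got)
	}
}
```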

Control the environment

You need to control the environment. Things like time and date, random number generators, environment variables, and disk and memory status. Also, you can’t rely on timing (unless the purpose of the test is to make sure things take less than a specified amount of time). Depending on the language and test framework, you might need to have a shim around your date/time/random library so you can control things. For single threaded code, you can rely on ordering, but for any multithreaded code, good tests handle the asynchronous nature of threads. And make sure they’re all done/cleaned up.
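One common way to get that control, sketched in Go with hypothetical names (nowFunc, IsStale): the code under test takes a clock instead of calling time.Now() itself, so the test can pin “now” to a known instant.

```go
package report

import (
	"testing"
	"time"
)

// nowFunc is the shim: production code passes time.Now, tests pass a frozen clock.
type nowFunc func() time.Time

// IsStale reports whether something hasn't been updated in over a day.
func IsStale(updated time.Time, now nowFunc) bool {
	return now().Sub(updated) > 24*time.Hour
}

func TestIsStale_JustUnderOneDay_IsFresh(t *testing.T) {
	// Freeze "now" so the test result doesn't depend on when it runs.
	frozen := time.Date(2024, 6, 1, 12, 0, 0, 0, time.UTC)
	now := func() time.Time { return frozen }

	updated := frozen.Add(-23 * time.Hour)
	if IsStale(updated, now) {
		t.Errorf("IsStale(%v) = true, want false", updated)
	}
}
```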

Validate the response

This should probably go without saying, but it needs to be said anyway: the results/side effects should be checked. In almost all cases, just checking that no error was thrown/returned isn’t actually checking anything and isn’t a good test. Whatever you do, check something. assert(TRUE) unless there’s a panic/crash is almost always an invalid test.

Group different variations into a larger test

Tests should be grouped together based on what they’re testing. A good rule of thumb (for unit tests) is to pair a test file with each source file. It keeps things close together when you need to work on both of them (write the test, then write/update the code that makes it pass). But it’s not just about where the tests live, it’s also about how they’re grouped.

For unit tests, for example, you want to test not only the happy path for a method, but also the various unhappy and edge condition paths. You could write a special test for each one, but that’s likely to end up with lots of duplicated code that’s hard to maintain. A better choice would be a table-driven test. Various languages/frameworks have tools to help, so use the one that makes sense in your case, but in general, instead of multiple tests, write a single test that calls another method with a set of parameters and does the actual checking. Then, when you discover a new test case, you just update the table with the parameters and expected result (or error). This also helps with the first point. You can usually concatenate the base function’s name with the test case’s name to define the test you’re running. That makes it very easy to understand what went wrong and where to look for the code/data that’s failing and the code being tested.
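Here’s what that pattern looks like in Go (using the standard library’s strconv.Atoi as the thing under test, just to keep the sketch self-contained): the table holds the cases, the loop does the work, and each case’s name gets appended to the test’s name in the output, so a failure reads like TestAtoi/empty_string.

```go
package parse

import (
	"strconv"
	"testing"
)

func TestAtoi(t *testing.T) {
	cases := []struct {
		name    string
		in      string
		want    int
		wantErr bool
	}{
		{name: "simple number", in: "42", want: 42},
		{name: "negative number", in: "-7", want: -7},
		{name: "empty string", in: "", wantErr: true},
		{name: "not a number", in: "forty-two", wantErr: true},
	}

	for _, tc := range cases {
		// A new test case is just a new row; the checking code is shared.
		t.Run(tc.name, func(t *testing.T) {
			got, err := strconv.Atoi(tc.in)
			if tc.wantErr {
				if err == nil {
					t.Fatalf("Atoi(%q) = %d, want an error", tc.in, got)
				}
				return
			}
			if err != nil {
				t.Fatalf("Atoi(%q) returned error %v", tc.in, err)
			}
			if got != tc.want {
				t.Errorf("Atoi(%q) = %d, want %d", tc.in, got, tc.want)
			}
		})
	}
}
```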

For integration or system tests, this might end up being a set of scenario definitions that are called as a single system test, then each scenario gets its own result. Again, the base name tells you what overall thing you’re testing, and the scenario name tells you the specifics.

Clean up after yourself

Since your tests could run in any order, any number of times, you need to make sure the results of the last run don’t pollute the results of a new run. The most common way this happens is when tests leave things lying around on disk, but extra rows in a database are almost as common. So spin up a new storage system (disk folder, in-memory file system, database, etc.) for each test, or make sure that you clean up the shared one, regardless of how the test ends.

Not just the data, but connections, file descriptors, ports, memory; anything your test grabs on to needs to be put back. Otherwise, sometime, when you least expect it, your tests will start to fail because some resource is no longer available. And that will happen occasionally, at the least favorable time, making it very hard to reproduce and debug.
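In Go, for example, the testing package has t.TempDir and t.Cleanup for exactly this (the file name and contents below are made up): every test gets its own directory that’s removed when the test ends, pass or fail, and registered cleanups run however the test exits.

```go
package store

import (
	"os"
	"path/filepath"
	"testing"
)

func TestSaveWritesTheFile(t *testing.T) {
	// t.TempDir gives this test its own directory and deletes it afterwards,
	// so runs can't pollute each other or a shared location on disk.
	dir := t.TempDir()
	path := filepath.Join(dir, "state.json")

	if err := os.WriteFile(path, []byte(`{"ok":true}`), 0o600); err != nil {
		t.Fatalf("writing %s: %v", path, err)
	}

	f, err := os.Open(path)
	if err != nil {
		t.Fatalf("opening %s: %v", path, err)
	}
	// t.Cleanup registers teardown that runs no matter how the test exits;
	// the same idea works for connections, ports, and other grabbed resources.
	t.Cleanup(func() { f.Close() })

	info, err := f.Stat()
	if err != nil {
		t.Fatalf("stat %s: %v", path, err)
	}
	if info.Size() == 0 {
		t.Error("saved file is empty, want content")
	}
}
```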

So next time you’re writing some tests, whether TAD or TDD, make sure your tests are not only correct, but also virtuous.