Recent Posts (page 8 / 65)

by Leon Rosenshein

Another Year In Review

snowman with falling snow animation
It’s the last day of the year and time for my annual look back. It’s been a busy year. At home, at work, and on the blog.

On the work front, at the beginning of the year I had just started working from home again because of the Marshall Fire. A few weeks later the cleaning was finished and I started working from the office again. There weren’t many people there, but it was good to not work alone. Two weeks later I started a new job. Luckily it was just down the street, and I could still walk to work. There was (and still is) a lot to learn, but I’ve come up to speed and I’m enjoying the work. About a month ago I got a new manager. Same role, same job, same everything, just a new manager. And now, at the end of the year, our local office is being closed. We all still have jobs. Nothing changed with that. I’ll just be working from home until the powers that be decide which of the other 2 nearby offices I’ll be working from.

On the home front, we were temporarily living with our daughter because of the Marshall Fire. Luckily all we had to do was spend a few nights away from home. We did do some smoke remediation, but that didn’t have much impact. On the other hand, we did a bunch of renovations this year. New master bathroom. Hardwood floors all around. New coat of paint everywhere, except the kitchen/family room, which are still to be done. That meant living with my daughter again, this time for a month. But we’re back in the house now, and loving what we did, so it was worth the trouble.

On the blog front, it was a mixed bag. At the beginning of this year my blog was a set of Confluence pages on Aurora’s internal network. It was easy to publish to, but not very visible, and I was about to lose access to it, so something had to change. I looked around at options and settled on Hugo as the site generator, with S3 as the backing store and CloudFront as the web server. The raw data is stored in GitHub and I use GitHub Actions to build and copy to S3 when things change. Once I got it set up it’s just worked. Since you’re reading this it’s clearly working, and at ~$0.25/month I’d say it’s worth it. Along the way I made some updates to the theme I’m using to add support for the OpenGraph protocol and Mastodon as a social media provider. If you’re interested in learning how it all works let me know.

From a content production standpoint though, it was not as big a year as some. I only added 90 entries this year, so my average (~1.7/week) is a little lower than the 2/week I wanted. On the other hand, I did get the Engineering Stories part of the site up and running, with some archetypes and my own story there for viewing. That’s something I’m pretty proud of.

Speaking of Engineering Stories and the related articles, they’re right there at the top of this year’s most viewed articles. For your viewing pleasure, here are the top 10 articles for this year:

And a few others I really liked:

Hopefully you’ve found those, and other, articles interesting and useful. I know I found writing them helpful in clarifying and cementing my own understanding of the topics. Until next year, Happy New Year, and remember to enjoy whatever you’re doing. We only get one shot at this life, so make it worthwhile.

by Leon Rosenshein

Beware Of Bugs

“Beware of bugs in the above code; I have only proved it correct, not tried it.” – Donald Knuth

What does that statement even mean? Assuming the proof is correct, and I have no reason to doubt it, how can something be proven correct and still have bugs? There’s a lot to unpack in that statement.

First of all, how do you prove something is correct? There are formal/logical proofs that start with axioms and then prove a statement based on those axioms. You can sort of prove something, but really, you’re only proving the statement is correct if the axioms are really true. Or, you could use symbolic logic and a set of requirements/constraints and something like Leslie Lamport’s TLA+ to prove that your logic and sequences are correct. Those are at least two ways to make a formal proof, and there are other similar tools/mechanisms.

But even if you have that formal proof, have you really proved there are no bugs? No, you haven’t. There are even more ways for bugs to survive than there are ways to formally prove something.

The biggest reason is that any proof you have is comparing some set of requirements to some specific logical representation of an algorithm. Once you have that you make two huge assumptions. First, you assume that the requirements you’re using for the proof are an exact and complete representation of the problem at hand. Second, you assume that the logical representation of the algorithm is a complete and accurate representation of the code that is/was/will be written. Both of those assumptions are fraught with peril.

Reality has shown us, time and time again, that our initial understanding of a problem is at best incomplete, if not wrong. There are always things we learn. As the Agile Maxims tell us,

It is in the doing of the work that we discover the work that we must do. Doing exposes reality.

We don’t know what we don’t know until we find out we don’t know it. Which means any proof of correctness you’ve done is incomplete. So your proof doesn’t prove anything, or at least doesn’t prove that your code does what it needs to.

You also don’t know that any code written is functionally equivalent to the logic in the proof you used. It’s translation on top of translation, and language translations are notoriously flaky. So we write unit tests to help increase our confidence that we’ve matched the requirements and expectations. And they do. But they can’t prove there are no bugs. All they can do is help us get our confidence high enough.
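As a concrete (and entirely hypothetical) example, here’s a tiny Go function and a passing test. The names and the test are mine, made up for illustration, and the bug survives the green test:

package abs

import "testing"

// abs looks obviously correct, and the test below passes.
func abs(x int32) int32 {
    if x < 0 {
        return -x
    }
    return x
}

// The bug the test misses: negating math.MinInt32 (-2147483648) overflows
// int32 and wraps back to itself, so abs returns a negative number.
// The test raises confidence; it doesn't prove the absence of bugs.
func TestAbs(t *testing.T) {
    for _, x := range []int32{-5, 0, 5} {
        if got := abs(x); got < 0 {
            t.Errorf("abs(%d) = %d, want non-negative", x, got)
        }
    }
}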

But even that isn’t enough. You could have perfect knowledge of the requirements. Create a perfect logical representation of the code. Perfectly translate that model into code. And then a stray neutrino hits the CPU at just the right (wrong?) time and the answer comes out wrong. Or you could have one of the Intel chips that we had in our Bing Maps datacenter. It couldn’t take the inverse cosecant of an angle. Well, it could, in that it gave you an answer, but it was wrong. Eventually we got a firmware update and at least it stopped lying to us and just gave an error if you tried to use the broken function.

As bad as that is, it could be even worse. Almost 40 years ago, in 1984, Ken Thompson gave a talk on trusting trust. There’s a long chain of hardware, tools, and libraries between the code you write and the program a user eventually runs. He talked about a compiler that looked at what it was compiling. If it was compiling a compiler, it put a copy of itself inside the new compiler. If it was compiling an operating system, it added a hidden login for him in the operating system itself. With that compiler you’d never know what it was really doing. It was mostly doing what you wanted, but sometimes it did more. That’s a pretty benign hidden ability, but if it can do that, what else can it do? Introducing bugs would be easy if you’ve already gotten that far.

Which brings us back to the opening quote. You can prove something is correct, but you can’t be sure that it will work and that there are no bugs. But that doesn’t mean things are hopeless. We shouldn’t just give up. There are lots of things we can and should do, from software supply chain management to hermetic builds to exploratory testing. All of these things increase your confidence that things are correct. And we should never stop doing that.

by Leon Rosenshein

Happy Winter Solstice

Today is the Winter Solstice in the northern hemisphere. The shortest day of the year. And here in Colorado, if the weather prognosticators can be trusted, both a warm and pretty cold day. The high is listed as 51°F and the low as -10°F. Last year on this day it was warmer (63°F), but it hasn’t been that cold since 1998, when it was -17°F. That’s also the biggest swing between high and low temps I could find in the records, which go back to 1928.

The thing is, all of that involves time. And the English language. English, however, is slippery and imprecise. As Humpty Dumpty said,

“When I use a word,” Humpty Dumpty said, in rather a scornful tone, “it means just what I choose it to mean—neither more nor less.”

Time is hard. The winter solstice isn’t really a day. It’s an instant in time. We just decided to label the day by that instant. We say the winter solstice is the shortest day of the year, but it’s not. It’s (almost) the same as every other day. What we really mean is that it’s the shortest time between sunrise and sunset, which means it’s related to the solar day. Unless you’re close to the north pole, in which case the sun set in early October and won’t rise again until early March, so the “day” on the solstice is no shorter than it has been for a while.

Dates are hard too. The high for the day is predicted to be 51°F, between 1300 and 1400 local time. That makes sense. The low temperature, on the other hand, is expected somewhere between 0800 and 0900 tomorrow, Dec. 22nd. That’s probably somehow historically related to the solar day, from one local noon to the next. Unfortunately, it’s completely disconnected from our calendar, which marks days as running from 0000 on the clock to 2359.99999 on the same clock. That’s not the solar day or the sidereal day. Which leads to even more confusion.

All of that is why computers are bad at handling dates and times. Durations between a start tick and an end tick aren’t too bad. We can measure the number of seconds between them pretty accurately. But try to give the start or end tick an unambiguous label, or try to find the duration between two arbitrary labels, and all you find are edge cases. So unless you’re one of the maintainers of the time zone database, leave it to the experts and use the official library in your language of choice. If you try to roll your own you WILL get it wrong at the edges.
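In Go, for example, that means leaning on the standard time package, which ships with the IANA time zone database. A minimal sketch (the 21:48 UTC figure is the approximate instant of the 2022 December solstice):

package main

import (
    "fmt"
    "time"
)

func main() {
    // The library knows UTC offsets and DST transitions; you don't have to.
    denver, err := time.LoadLocation("America/Denver")
    if err != nil {
        panic(err)
    }

    // One unambiguous instant, two labels.
    solstice := time.Date(2022, time.December, 21, 21, 48, 0, 0, time.UTC)
    fmt.Println(solstice)            // the UTC label
    fmt.Println(solstice.In(denver)) // the same tick, labeled locally

    // Durations between two ticks are the easy part.
    fmt.Println(solstice.Sub(solstice.Add(-24 * time.Hour))) // 24h0m0s
}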

Regardless of the difficulty in knowing what day/time it is, Happy Solstice everyone.

by Leon Rosenshein

Latkes

A plate of latkes with sour cream

Yesterday was our 14th annual Latke Fry. Not just any latkes, but vegan, gluten free, Rosenshein latkes. 30+ hours after the cooking began, and I still smell a little like the short order cook I was one summer many years ago.

After last year’s disappointing 25 pounds of potatoes, this year’s 9 hours of cooking and 40 pounds of potatoes helped me realize something. The limiting factor in how many latkes get eaten and how many pounds of potatoes get used isn’t how hungry folks are, the number of guests, how hard I work, or how pipelined the latke assembly line is. It’s the size of my frying pans.

Or really, the size of my stove/counter. My stove and counter are big enough for me to have two 18-inch frying pans going at once. That means that at any given moment I have 8 latkes frying. If a batch of latkes takes 10 minutes (round numbers), I can do 50 latkes an hour. 50 latkes an hour for 9 hours is ~450 latkes. That works out to be 80-100 happy people, but it also means that somewhere around 2:00 I have folks lined up by the stove, taking the latkes as soon as they’re ready.

And there’s not much I can do about that with my current setup. I’m already starting to cook early so I have a cached supply of latkes warming in the oven to help with peak demand. Outside of that, there’s not much more I can do other than starting to cook even earlier and increasing the size of the cache.

Shredding the potatoes and onions faster won’t help. Other folks are already doing that and there are always shredded potatoes ready for me. They’re already sitting in bowls waiting, and I want the potatoes to be as freshly shredded as possible. Pre-mixing the dry ingredients (salt, pepper, flour) might save a little time, like 10 seconds a batch, maybe 10 minutes over the whole day at best. Plus, the mixing of a batch happens while there are latkes frying, when I’m mostly idle (other than schmoozing with my friends), so that’s not going to help with peak demand. I can’t just cook them faster by turning up the heat. The oil starts to burn and smoke, which leads to latkes with burnt outsides and raw insides. It also sets off the smoke detectors. And no-one wants that.

I could invite fewer folks over and reduce demand, but that goes directly against the vision of having as many happy guests get as many latkes as possible. That translates into more latkes. Which means higher throughput. So what’s a latke cook to do? My current plan is that next year I’ll try adding a third, smaller, frying pan, to increase batch size.

Latke Frys are great and we all had a good time, but what has any of this got to do with software? It’s the classic struggle of a batch system. How do you increase throughput when you can’t do anything about latency? It takes 10 minutes to cook a latke, whether you’re making 1 or 10 at a time. All you can do is do more at once. Which means finding the bottleneck and working on that. In my case it’s cooking area.

I know I can’t get my throughput up to meet peak demand. Latency (cooking time) is too high, and bandwidth (cooking surface area) is too limited. Instead, I maximize use of available bandwidth (adding another frying pan) to increase my throughput from the beginning, which lets me increase the size of my cache (more latkes in the oven staying warm). The increased cache size then lets me reduce apparent latency for the customer (latke eaters) by letting them pull from the cache (pre-cooked latkes), which has almost zero latency.

Just like any other distributed, pipelined, batch processing system.
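For the curious, the back-of-the-envelope math looks like this. A sketch using the post’s round numbers (the per-batch count of 8 comes from the two pans above; the third pan’s extra capacity is a guess):

package main

import "fmt"

func main() {
    const (
        latkesPerBatch = 8  // two 18-inch pans
        batchMinutes   = 10 // frying latency; fixed, since turning up the heat burns them
        cookingHours   = 9
    )

    perHour := latkesPerBatch * 60 / batchMinutes
    fmt.Printf("throughput: %d latkes/hour\n", perHour)                  // 48
    fmt.Printf("total: ~%d latkes over the day\n", perHour*cookingHours) // ~432

    // More bandwidth (a third pan holding, say, 3 more latkes) raises
    // throughput without touching the 10-minute latency.
    fmt.Printf("with a third pan: %d latkes/hour\n", (latkesPerBatch+3)*60/batchMinutes) // 66
}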

by Leon Rosenshein

What Are You Testing?

When writing tests, whether they’re TDD tests, Unit tests, Integration tests, or Canary tests, the first thing you need to be sure of is what you’re testing. If you don’t know what you’re testing then how can you know if your test has succeeded? The next thing to do is make sure that what you’re testing is really the thing you want to test. I’ve seen way too many tests that say they’re testing some business logic, but what they really end up testing is either the storage system or some cache along the way.

Tests typically follow the Arrange, Act, Assert pattern. That means you set things up, you try the thing you want to test, then you assert that the response is what you expect. It sounds simple, and conceptually, it is. It looks something like this:

Arrange

The arrange step is pretty straightforward. Create the thing you want to test. Create any needed resources. Create your inputs. Identify the expected result. It sounds simple, and often is, but there are hidden details you need to be sure of.

Act

The act step is the easiest. Call the method. Capture the result(s). Simple. As long as your code is testable. As long as the method does one thing. One logical thing. The more logical things the method does the harder it is to make it do the act, the whole act, and nothing but the act.

Assert

The assert step is more subtle. Conventional wisdom says one assert per test. At its simplest, sure. If you have a method that adds two integers your test should assert that the result you get is equal to the sum you’ve calculated. But if you’re doing some kind of merge of two complex objects then a single assert is probably not enough. Logically, though, you’re asserting that the merged object is what you expect, and that’s one logical assert.

In theory it’s easy to Arrange/Act/Assert. In practice though, it can be hard. For a lot of different reasons. Reasons you need to understand not just when writing the tests, but when writing the code too.

In the arrange step it means setting things up so there’s no variability. Does the thing you’re testing have any dependencies? How do you know they’re working? Are you sure you’re controlling all of the inputs? Not just the function parameters, but also any calls the thing you’re testing might make to get external data. Random numbers. Date/time. Environmental information. All of these things (and more) need to be controlled. If you’re not controlling them you’re probably testing the wrong thing. At best you’re writing a flaky test that will fail at the worst possible moment. Dependency injection and mocks/fakes are your friends here. Don’t leave the response of a dependency to chance. Making something testable often means adding interfaces and using them so you can replace the implementation behind the interface in your tests without changing any other operational characteristics.
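Here’s a minimal sketch of that idea in Go, assuming a hypothetical report generator whose hidden input is the current time. All the names (Clock, ReportGenerator, fakeClock) are made up for illustration:

package report

import (
    "testing"
    "time"
)

// Clock is the injected dependency. The production implementation would
// wrap time.Now; tests substitute their own.
type Clock interface {
    Now() time.Time
}

type ReportGenerator struct {
    clock Clock
}

func (r *ReportGenerator) Header() string {
    return "Generated " + r.clock.Now().Format("2006-01-02")
}

// fakeClock removes the variability: "now" is whatever the test says it is.
type fakeClock struct{ t time.Time }

func (f fakeClock) Now() time.Time { return f.t }

func TestHeaderIsDeterministic(t *testing.T) {
    r := &ReportGenerator{clock: fakeClock{t: time.Date(2022, 1, 1, 0, 0, 0, 0, time.UTC)}}
    if got, want := r.Header(), "Generated 2022-01-01"; got != want {
        t.Errorf("got %q, want %q", got, want)
    }
}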

In the act step it means arranging things so you can actually call the thing you want to test. Is it a method? Is it public? Is it just a branch in some other method where you need to carefully craft the setup to force that branch to be called? If it’s hard to run the thing you want to test then you’ve probably got some refactoring to do. Make the code in the branch a method, then call it. Another common one is separating the code that calls some external thing from the error handling. A common code pattern looks like

retval, err := dbManager.Insert(newUser)
if err != nil {
    switch err {
    case ALREADY_EXISTS:
        // Specific error handling here

    case INVALID_OBJECT:
        // Other specific error handling here
    }
    return
}

// Happy path code here

It can be hard to get an external dependency (or its mock) to return a specific error just so you can test the error handling. A better choice might be to write it something like

retval, err := dbManager.Insert(newUser)
if err != nil {
    HandleError(err)
} else {
    HandleInsert(retval)
}

That way, HandleError and HandleInsert are both easy to test. Arrange the one input variable, call the method, assert what you want. And the wrapper is trivially verifiable by inspection. Add some dependency injection and it’s trivial to test too.

Which brings us to the assert step. There are those that say one assert per test, and that might be how it’s implemented, but the important thing is one logical test per test. And it should match the name. If there are 20 different types of invalid input then there should be 20 tests, each for one type of invalid input. You don’t want one test with 20 different ways to fail. It adds too much cognitive load when you need to figure out why your tests fail. What failed and why should be obvious when you see the name of the failed test.
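In Go, table-driven subtests are one way to get a test per failure mode without 20 copies of the boilerplate. A sketch, where validateInput and the failure cases are hypothetical:

package validate

import "testing"

// Each case is one logical test, and the subtest name says exactly
// which type of invalid input failed.
func TestValidateInputRejectsBadInput(t *testing.T) {
    cases := []struct {
        name  string
        input string
    }{
        {"empty string", ""},
        {"negative length prefix", "-1:abc"},
        {"embedded null byte", "a\x00b"},
    }
    for _, tc := range cases {
        t.Run(tc.name, func(t *testing.T) {
            if err := validateInput(tc.input); err == nil {
                t.Errorf("validateInput(%q) = nil, want error", tc.input)
            }
        })
    }
}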

And again, if the thing you’re testing does so much or has so many side effects that your test name is “validateInputsAndComputeResult” it’s probably time to refactor. In fact, any time your test name has been to Conjunction Junction, it’s probably time to refactor your code to do less.

So when you write your tests (and your code), think about what you’re testing.

by Leon Rosenshein

FFS

Once upon a time, in the before times, when people worked in offices and went to conferences in person, there was a conference called FFS Tech Conference. The format was simple. Every talk title started with FFS, and every talk was a 15 minute rant by the speaker followed by 15 minutes of questions/discussion. Unfortunately, the conference series never took off, but there are some real gems in that first one.

Let’s start with the idea behind the conference. It’s about clarity and transparency. It’s about taking what many folks say quietly to themselves and saying it out loud in front of others, then discussing it. Which brings me to the conference title. There are lots of possibilities. Full Flight Simulator. Fee For Service. Film Forming Substances. Field Fiscal Services. But none of those are right. In this case it’s For F@$ks Sake. And in this case, Urban Dictionary actually has the right answer (potentially NSFW, you decide). It’s an expression of disappointment. It’s a cry of disbelief. And it’s a plea for change.

Take a look at the titles of some of the talks. You’ve probably all felt the frustration that led to them. I know I have.

  • FFS: Get Outside
  • FFS: Fix the small things. Some are bigger than they appear.
  • FFS: FFS just f’ing talk to each other!

Consider Kent’s Fix the Small Things talk. It’s entirely possible, likely even, that you aren’t going to be able to convince management (whatever that means) that you should take 4 weeks and refactor the code you’re working on before you fix a single bug or add a feature. So don’t. Fix the most important bug. And do a little tidying of the code to make fixing the bug easier. Make a variable or function name match what it represents. Group things together for easier readability. If you’re adding some functionality, then find related functionality and group them together. At least physically. And if you can do some semantic reshuffling to make things more coherent and easier to change do that too. In fact, do it first. Don’t tell anyone, but that’s refactoring and you just got it done.

It’s a pretty amazing group of talks, and I wish I had been there. I’ve wanted to say almost every one of those things to multiple groups multiple times. Just knowing that someone else has said them out loud is comforting and affirming.

FFS

by Leon Rosenshein

Hello Mother

Speaking of change and staying the same, today is the anniversary of the mother of all demos.

Poster announcing the fall joint computer conference on Dec 9th

54 years ago Douglas Engelbart got on stage at the ACM/IEEE Fall Joint Computer Conference and defined the environment we’re living in today.

In 90 minutes he showed, on one system, things that had never been seen together before. Things like graphical interfaces, windows, real-time collaboration, hypertext, mice, version control, video conferencing and more.

A lot of people say that Steve Jobs stole the ideas for the Mac from Xerox’s PARC, and that might be true, but the folks at PARC got their ideas from Engelbart and the demo. Pretty much everything we take for granted about computers (and smartphones) can be traced right back to that demo. Everything from Amazon to Zoom. The world wide web is there in hypertext. Click on something and see it. Chat with your friends and co-workers. Video, audio, and shared media. You can see the basis for communities like Facebook, Twitter, Mastodon, and TikTok in the demo. You can see the beginning of Napster and file sharing. The source for Spotify and streaming. And to top it off the demo was done remotely. Engelbart had the terminal with him, but the computer was back at the office.

About the only two things the demo didn’t foreshadow were touchscreens and the size of today’s smartphones. While they weren’t in the demo, there were signs of what was coming. In 1965 the first touchscreen systems were developed, and that’s the same year Gordon Moore first mentioned his eponymous Moore’s Law, so if you were aware of them you could see the writing on the wall.

So thanks, Doug, for the demo and the world you wished into being. You should watch it when you get a chance.

by Leon Rosenshein

Properties

Properties are more than squares on a Monopoly™ board. According to dictionary.com, a property is

  1. that which a person owns; the possession or possessions of a particular owner:
  2. goods, land, etc., considered as possessions:
  3. a piece of land or real estate:
  4. ownership; right of possession, enjoyment, or disposal of anything, especially of something tangible:
  5. something at the disposal of a person, a group of persons, or the community or public:
  6. an essential or distinctive attribute or quality of a thing:

Monopoly uses the first 5 definitions, but I’ve been thinking about the sixth one. Properties as attributes, and how they relate to software in general and testing specifically. I’m a believer in unit testing. I like the confidence it gives me that what I’ve written does what I think it does. I like that it tells me that the public surface of the thing I’m writing makes sense. That it doesn’t surprise anyone or make them work too hard. And I try to test first as much as I can, but sometimes I end up testing after.

Regardless of when I test though, my first inclination is to think of edge cases and write specific cases for them. That works, but it leaves you vulnerable to the Enterprise Developer From Hell (EDFH). The EDFH is the kind of person who will write code designed to pass your tests, but not actually be correct. The kind of person who would run your tests, look at the inputs, and write code that sets the output based on pattern matching. At some point you might convince the EDFH to actually write the code you want, but if you only have a handful of inputs then it’s easier and quicker for them to just pattern match.

One way around this is to use property-based testing. Take something simple like classifying triangles given the lengths of the three sides. One set of classifications of triangles is this set of 4 types:

  • equilateral: All three sides are the same length
  • isosceles: Two sides are the same length
  • right: The sum of the square of the length of 2 sides is equal to the square of the length of the third
  • regular: Any other triangle

A conscientious developer would do some analysis of the requirements, write some if statements that compare the sides, and then return the correct value. The EDFH, on the other hand, would run your tests, see that your inputs/expected results are

  • 2, 4, 6 -> regular
  • 1, 1, 2 -> isosceles
  • 3, 4, 5 -> right
  • 7, 7, 7 -> equilateral

And write a function whose pseudo-code looks a lot like

// a, b, and c are the side lengths passed in
switch {
case a == 1 && b == 1 && c == 2:
    return "isosceles"
case a == 3 && b == 4 && c == 5:
    return "right"
case a == 7 && b == 7 && c == 7:
    return "equilateral"
}
return "regular"

Sure, the tests will pass, but it’s not right. The question is, is there a way to get around the EDFH? One way is to use property-based testing.

You know the properties of the different types of triangles. You also know a couple of other properties of triangles in general you can use to your advantage. You know the order of the sides doesn’t matter. You know that the sum of the lengths of any two sides must be greater than the length of the third side. So, instead of hard-coding the lengths, pick two of them randomly and then set the 3rd based on the known properties of triangles. Then try all the different permutations of the order of the lengths you’re validating. They should all give the same correct results.

When you use the properties to validate the function you make things much harder for the EDFH. They can’t do specific pattern matching. They’ll have to fall back to writing code that actually uses the properties, which is much more likely to result in code that actually does what you want, not just passes all the tests.
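Here’s a sketch of what that could look like in Go, assuming a hypothetical classifyTriangle(a, b, c) function under test. The sides are generated so the triangle inequality always holds, and the property checked is that the classification can’t depend on the order of the sides:

package triangle

import (
    "math/rand"
    "testing"
)

func TestClassifyIgnoresSideOrder(t *testing.T) {
    for i := 0; i < 1000; i++ {
        // Pick two sides at random, then derive a third that
        // satisfies the triangle inequality.
        a := rand.Intn(100) + 1
        b := rand.Intn(100) + 1
        c := a + b - 1 // a+b > c, so it's always a valid triangle

        want := classifyTriangle(a, b, c)
        perms := [][3]int{
            {a, b, c}, {a, c, b}, {b, a, c},
            {b, c, a}, {c, a, b}, {c, b, a},
        }
        for _, p := range perms {
            if got := classifyTriangle(p[0], p[1], p[2]); got != want {
                t.Errorf("classifyTriangle(%v) = %q, want %q", p, got, want)
            }
        }
    }
}

The EDFH can’t pattern-match their way past a thousand random, permuted inputs.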

Property-based testing has value even if you’re not dealing with the EDFH. It can find gaps in your logic. There’s one case we haven’t talked about yet. What happens when your input breaks the rules? Sure, you said “Tell me what kind of triangle this is”, and defined what 3 of the types were rigorously, but the 4th type was an imprecise catch-all, and the input is pretty open-ended. You can use the properties to write a test that breaks the side-length rule. Inputs that aren’t even a triangle. If you pass in 3, 4, and 10, the developer should come back to you and ask what you want done when the sides don’t represent a triangle.

So when you write tests, even if you know the developer isn’t the EDFH, because you’re not just the test writer, but the developer as well, think about the properties you’re really testing.

by Leon Rosenshein

The More Things Change

November seems to have slipped away, but I’m back now. This piece has been brewing for a while and it’s time to share.

The machine learning lifecycle, with loops

Things sure have changed for developers. The first piece of software I got paid to develop was written in Fortran77 on an Apollo DN4000. The development environment was anything but integrated. I used vim to edit the code, f77 to compile the individual source files into object files, then f77 again, this time with all of the object files, to generate an actual executable file. All manually. Or at least that’s what I was supposed to do.

At the best of times it took a while to do all those steps. Throw in debug by print and things were really slow. Write some code. Compile it. Link it. Run it. Look at the output. The cycle time could be hours or even days depending on how long it took me to get the syntax and logic to both be correct at the same time.

To make matters worse, I would regularly forget one of those steps. Usually, the linking step. Unsurprisingly, if you compile your code, but don’t link the object files into a new executable nothing changes. And instead of remembering what I forgot to do, I’d edit and compile again, and still nothing would happen. Eventually we figured out how to use Make, which solved most of the problems, but not all of them. The thing that Make didn’t solve was remembering to actually save the file. Especially when I learned how to use multiple windows so I could keep the editor open while I compiled, linked, and ran things.

Today modern IDEs make this much easier and less error prone. They know what files you’ve changed. They understand the languages. They understand and can, in many cases, figure out the internal dependencies of your code and build only the things that need to be built. And for those situations that the IDE itself can’t handle there are external tools like bazel to handle the dependencies and manage the build.

At least they usually do. I was helping my daughter with programming homework the other day and we were doing some front-end development. Doing the usual console.log debugging, we weren’t seeing the output we expected. It turns out that she didn’t have autosave enabled in her IDE, so there we were, 30+ years later, still having problems keeping our development velocity up because we forgot to save a file.

Of course, that’s an outlier. The typical compile/test cycle is seconds to minutes. Some systems even complain if your tests take too long to run. Which means that for a typical developer you can get (nearly) instantaneous and continuous feedback on whether your changes are having a negative impact on your unit tests. You can feel confident that you haven’t broken anything, and if you’re a person who writes tests first rather than after, you know when you’re done.

Some folks are still dealing with that long, drawn-out, error-prone process. Two such groups that I deal with regularly are Data Scientists and ML Researchers. To do something they often need to not just make a change to their analysis code, but make one or more upstream changes to the data itself, which involves a completely different code change in a different code path. And then they must regenerate the data. And point their analysis/training code at the new data. All of those steps take time. They’re hard to track. They’re easy to miss. There’s not great tooling around it. Even worse, for some things, like the data regeneration, you can’t do it locally. There’s too much data, or there are access controls around it, or you’re in the wrong country and the data needs to stay where it is, so you need to submit a request to have some process run to regenerate the data. And that can fail for lots of reasons too.

Put together, doing data science/ML work often feels like development used to. Which isn’t surprising. The more things change, the more they stay the same. It’s just the frame of reference/scale that changes.

by Leon Rosenshein

Omission vs Commission

Some bugs are bugs of commission. In my experience, it’s most of them. You get an equation wrong. You compare against the wrong, but similarly named, field. Or maybe it’s as simple as flipping a sign in a test. The nice thing about bugs like that is that they’re usually easy to spot. Ideally with your test suite, but even if not, the response is generally wrong, so you notice, and then you fix it.

Other bugs are bugs of omission. The code you’ve written is 100% correct. It has the right logic in the right place. The algorithms and calculations you’re doing are correct. The code usually, almost always, works. Those are the ones that are hard to find.

They’re the ones that aren’t obvious. They only happen when conditions are just right (or maybe, just wrong). They happen because of assumptions and biases used when the code was being written. And they creep in at various points in the development process.

At the lowest level it happens when you get caught by your assumptions of how a language works. Back in the old days of C/C++ memory management was critical. Not just being sure to free everything you allocated, but avoiding buffer overruns and using uninitialized memory. Both of those are less likely (or nominally impossible) with more recently developed languages, but there are similar things. Even if there’s garbage collection to clean up after you, you can forget to allocate something or allocate it twice. In Go, the differences between a slice and an array are subtle. Unless you are careful when passing/returning them you’ll end up with copies of copies. Your changes will be visible in some places and not in others.
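A quick illustration of that Go subtlety. Arrays are values and get copied on assignment; slices are views onto a shared backing array:

package main

import "fmt"

func main() {
    arr := [3]int{1, 2, 3}
    arrCopy := arr // arrays are values: all three elements are copied
    arrCopy[0] = 99
    fmt.Println(arr[0], arrCopy[0]) // 1 99: the original never sees the change

    sl := []int{1, 2, 3}
    slAlias := sl // only the slice header is copied; elements are shared
    slAlias[0] = 99
    fmt.Println(sl[0], slAlias[0]) // 99 99: the change is visible through both
}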

The next level is dealing with more abstract things. Consider a search method. There are lots of things you could omit when using a collection. What if the thing you’re looking for isn’t there? What if there’s more than one of them? What if the collection is empty (or worse, not defined)? How does your code handle that? Of course, it’s obvious when you use a method called Find() that you’re doing a search. But sometimes what you’re doing is logically a search operation, but you don’t necessarily recognize it. Create something. Store it. Update it. Do some other stuff, then update it again. What happens if things change while doing that other stuff? Like someone else deleted the thing you created. What do you do then?
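Go’s comma-ok idiom makes the “not there” case hard to omit. A sketch, with a hypothetical User type and lookup:

package user

import "fmt"

type User struct {
    Name string
}

// lookupUser makes the missing case explicit. Skipping the !ok branch
// and silently using the zero value is exactly the kind of omission
// that only shows up when conditions are just wrong.
func lookupUser(users map[string]User, id string) (User, error) {
    u, ok := users[id]
    if !ok {
        return User{}, fmt.Errorf("user %q not found", id)
    }
    return u, nil
}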

Or in a distributed system, what if the remote call you’re making works, but the success response is lost? Did you handle that case? Do you just try to create it again, or do you do a search and then try to create it again? Or do you just ignore the response in the first place since “it always worked before”? That’s a big sin of omission.
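One common defense is making the call idempotent, so a retry after a lost response is harmless. A sketch in the same fragment style as above; Client, CreateUser, and ErrAlreadyExists are hypothetical names:

// ensureUser treats "already exists" on a retry as success, because the
// first attempt may have worked even though its response was lost.
// errors.Is is from the standard errors package.
func ensureUser(c *Client, u User) error {
    err := c.CreateUser(u)
    if errors.Is(err, ErrAlreadyExists) {
        return nil
    }
    return err
}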

Then there’s the meta level of omission. You’ve got an API that can do multiple things. Some of those things are permanent. Consider the lowly rm command. Back in the day it was possible to run rm -rf /. The command, being the good command that it was, assumed you knew what you were doing and began deleting everything. After all, computers are good at doing what you tell them, not what you want. In case you’re wondering, even with good backups, it’s really hard to recover from that. Since that’s not something you would want to do (there are better ways to actually erase everything on a disk) there was no code to prevent you from running a valid, but probably unintended, command. Now, on Linux at least, there’s the --no-preserve-root option and a check to make sure you don’t make that mistake. But it’s indicative of a whole class of errors. The “Why would anyone do that?” error. And if you don’t check for it then at some point, it will happen.

The trick is to test for them. But how? Part of it is a mindset. Think of ways that things could fail, and test for those, not just the happy path. Think about ways your system could be misused, or at least misunderstood. Make sure you’ve guarded against those things too.

This is also where another set of eyes is good. Someone without your biases and assumptions. That person will think about things in a way you’re not. After all, if you thought about what would happen if someone tried to delete the entire filesystem you would have added the flag in the first place.