Recent Posts (page 16 / 70)

June 23, 2022 by Leon Rosenshein

Design Is Iterative

agile mmmss systems thinking

I’ve talked about rework avoidance and the advantage of taking smaller steps. I’ve talked about how it’s different from the older BDUF process. How a sea chart gives you a better chance of getting a successful result.

Given that, and the fact that the Agile Manifesto was written in 2001, you would think that the concept of iterative design is a new one.

However, you’d be wrong.

Way back in 1968, in the Software Engineering Report by the NATO Science Committee, you’ll find this quote by H. A. Kinslow.

The design process is an iterative one. I will tell you one thing which can go wrong with it if you are not in the laboratory. In my terms design consists of:

Flowchart until you think you understand the problem.
Write code until you realize that you don’t.
Go back and re-do the flowchart.
Write some more code and iterate to what you feel is the correct solution.

There are a lot of good ideas in that report, but Kinslow’s comment really resonates with me. It plays well with my personal bias for action. Do some planning. Write some code. Repeat. Or put another way, learn from the domain to write some code. Then learn from the code to better understand the domain. Then do it again. By learning from both, iteratively, you come up with a better answer than only learning from either one. And, you’ll probably come to that answer sooner.

It’s a system. A system with at least 2 loops. The questioning/understanding loop and the building loop. At first you don’t even know what questions to ask. You start with a problem statement. You think you understand it well enough to write a solution, but then you have questions. So you dig deeper and get your questions answered. Which leads to more understanding and a better solution, which leads to more questions. The cycle repeats until you have enough understanding to build a solution that actually solves the original problem.

So remember, whether it’s your APIs, your class design, your domain boundaries, or how your systems work together, iteration is your friend. Because without iterating you can’t have enough understanding to come up with a good solution.

June 20, 2022 by Leon Rosenshein

Good Habits

testing tdd

In the past I’ve talked about Why We Test. We test to increase our stakeholders’ confidence. So let’s add all the confidence we can. If a little confidence is good, more confidence is better, right?

Maybe, but maybe not. Because Speed is also important. We want to get things done Sooner, not Faster. It’s not confidence at any cost. It’s enough confidence to move forward.

I think Charity Majors said it really well.

Or you could think of it like running tests is the equivalent of eating well, not smoking, and wearing sunscreen. You get diminishing returns after that.

And if you’re too paranoid to go outside or see friends, it becomes counterproductive to shipping a good healthy life. ☺️🌴

How much confidence do you need? What’s the cost of something going badly? What are your operational metrics (MTTD, MTTF, MTBF, MTTR)? Do you have leading metrics that let you know a problem is coming before customers/users notice?

The more notice you have before your customers notice, the faster you can detect and recover from a problem, and the rarer problems are, the lower the confidence you can accept. If you have a system that detects the problem starting and auto-recovers in 200 milliseconds, and your customers never notice, the required confidence is low.

On the other hand, if you’re going to lose $10K/second, a 2 minute outage has a direct cost of $1.2M. And that’s without including the intangible expense of the hit to your brand. You’re going to want a bit more confidence when making that change.
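As a rough sketch of that arithmetic, the direct cost is just the loss rate multiplied by how long the outage lasts. The function name and the extra durations are illustrative assumptions, and the model ignores brand damage entirely.

```python
def outage_cost(loss_per_second: float, outage_seconds: float) -> float:
    """Direct cost of an outage: loss rate multiplied by duration."""
    return loss_per_second * outage_seconds


LOSS_PER_SECOND = 10_000  # the $10K/second figure from the text

# The 2 minute case is from the text; the shorter durations are assumed
# values that show how quickly faster recovery shrinks the bill.
for seconds in (120, 30, 5):
    cost = outage_cost(LOSS_PER_SECOND, seconds)
    print(f"{seconds:>4} s outage -> ${cost:,.0f}")
```

The time to detect and recover is the multiplier here, which is why it drives how much confidence you need before making the change.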

So do your unit testing. Do your integration testing. Track your code coverage. Run in your test environment. Shadow your production environment. Until you have the appropriate amount of confidence.

Because all the time you’re building confidence you’re not getting feedback. You might have confidence that it’s going to work, but you have no confidence that it’s the right thing to do.

June 17, 2022 by Leon Rosenshein

Continuous Delivery

devops ci/cd

Continuous delivery is the idea that every change, or at least every change a developer thinks is done, gets built into the product and into customers’ hands. It’s a great goal. Reducing latency between a change and getting feedback on the change means you can make smaller changes and be more confident that changes you’re seeing are due to the specific change you’ve made.

Over the course of my career I’ve seen a lot of deployment frequencies. One of my first jobs was making concrete blocks. The Mold/Kiln combo could deploy a pallet of blocks about every 15 seconds, with each pallet having between 1 and 20 blocks on it. And do it 24 hours a day, 7 days a week, if we fed materials in one end and took the blocks away on the other. That’s continuous deployment.

Or sort of continuous deployment. While we could “deploy” cubes of blocks in minutes, it was 24 hours from the time the material went into the mold until the block was ready to be sold. Which means while we could deploy continuously, the deployment latency was at least 24 hours.

I say at least because there was a big mixer at the front end of the process. It took in sand, dyes, water, and cement and mixed them together into multiple cubic yard lots that would get fed into the mold and eventually the kiln. It took about 20 minutes to mix each batch, and a few hours to use it up. Which meant we could only change block color every few hours. In most cases we could just change the mixture of sand/dye to get the new color, but when we wanted to make white blocks we had to wash out the mixer and all the transport equipment to make sure the new blocks stayed white. And that took a few hours itself. So in practice, the latency to change the color of the blocks wasn’t 24 hours, it was more like 26.

But even that isn’t the whole story. We made different shape blocks as well. Changing molds was a big deal. It took most of a day to do. And there were only a couple of people who could do it efficiently. If any of the rest of us did it the delay was a couple of days. And the mixer and concrete transport system had to be empty or the concrete would set and shut the machine down for days while we chipped the concrete out of all the nooks and crannies. Which meant that depending on who was around, the time to get a new shape of brick out the door was 30+ hours from when we started to make the change.

So here you have a system which can produce a set of blocks every 15 seconds. That’s delivering pretty continuously. Yet even with that kind of continuous delivery it could still take almost 2 days to start delivering something different.

But what does making concrete blocks have to do with delivering software? It’s this. The key factor to continuous delivery is the latency between a change being made and that change being available to the customer. The fact that your compiler takes 10 minutes, or someone can download your app in 2 minutes, doesn’t mean you’ve got continuous delivery. You need to account for all of the time between “it’s done” and “customers can use it”. And keep that number as small as possible. That’s when you see the benefits. You get fast feedback. You can roll back a change that didn’t work quickly and easily. You can try things with confidence.
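To make the difference between throughput and latency concrete, here is a small sketch using the block plant numbers from the story. The step names are made up for illustration, and the 2 hour washout is an assumed value for “a few hours”, chosen to match the “more like 26” figure.

```python
from typing import Dict

SECONDS_PER_PALLET = 15  # from the story: one pallet about every 15 seconds

# Hours each step adds between "change requested" and "customer can buy it".
# The 24 hours is from the story; the 2 hour washout is an assumption.
color_change_steps: Dict[str, float] = {
    "wash out mixer and transport": 2.0,
    "mold to sellable block": 24.0,
}


def pallets_per_hour(seconds_per_pallet: float) -> float:
    """Throughput: how often something comes out the far end."""
    return 3600 / seconds_per_pallet


def lead_time_hours(steps: Dict[str, float]) -> float:
    """Latency: total time from making a change to a customer seeing it."""
    return sum(steps.values())


print(f"throughput: {pallets_per_hour(SECONDS_PER_PALLET):.0f} pallets/hour")
print(f"latency for a color change: {lead_time_hours(color_change_steps):.0f} hours")
```

High throughput and long lead time can coexist. Continuous delivery is about shrinking the second number.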

When you do that, you’ve got continuous delivery.

June 15, 2022 by Leon Rosenshein

Scream Tests

devops backups hoarders

The other day I mentioned to someone that we had a scream test scheduled for a few weeks from now and they gave me a very confused look. Since I know that it’s my responsibility to make sure my messages are received properly I asked if they knew what a scream test was. Turns out they didn’t. That day they were one of the lucky 10,000. If you don’t know, today’s your day.

In short, a scream test is when you think something is unused, but you’re not sure and not sure how to find out. One way to be sure is to simply turn it off. If no-one screams then it was unused. If there are screams you’ve identified the users. It’s a pretty simple and accurate test.

I’m honestly not sure when I first heard the term, but it was probably sometime in the late 1990s. That’s when I was first running servers and services that other folks used on a regular basis. They were file shares, and back then storage was both expensive and hard to get, so dealing with hoarders was a full time job. But we never knew for sure who was using the data. We’d ask folks to clean up. We’d tell people it was going to go away on a given day. We’d remind them. Sometimes we’d get a couple of folks saying they didn’t care, but generally, silence. So eventually we’d make the data unavailable. Usually nothing happened. So we’d wait a week, archive to tape, do a restore (this is a critical step in backup planning), then remove it from the server. Space was recovered and everyone was happy.

But every once in a while, about 30 seconds after we took the folder offline, someone would do the office gopher and scream “Where’s my data?” Then we knew who was using that data. We’d either keep it or figure out exactly what was needed and only keep that. Not perfect, but not a bad outcome.

Even more rarely, something completely unrelated would stop working. Then the real fun began. Why would the monthly accounting run care if the asset share for a 3 year old game was offline? It cared because the person who wrote the script had originally used their desktop for temporary storage as part of the process, but stopped doing that because they got a memo to turn off their machines at night to save power and the script failed. So they found some temp storage they could use and started using it. That’s wrong for many reasons. Not the least of which was that everyone at the company had access to that share and could have seen the data if they knew it was there. In that case we identified not only an unknown user, but we also closed a security hole that we didn’t even know we had.

Over the years since I’ve done similar things with file systems, internal and external services, and once even an entire data center. Turns out turning off a DC can break all sorts of things. Starting with DNS and routing 😊

And now you know what a scream test is.

June 13, 2022 by Leon Rosenshein

Automation

devops productivity

If you need to do something millions of times a day it needs to be repeatable and, ideally, idempotent. While it’s technically possible to get enough people in your system in a Mechanical Turk mechanism, in most cases what you want to do is truly automate the task.

The trick though, as @Kelsey Hightower said the other day on the Art of Agile Development book club, is that “automation is a byproduct of understanding.”

You should listen to him explaining, but the basic idea is that first you understand what you need to do, then you figure out how to automate it. And the hard part is getting the understanding. Not just the knowledge of the steps you took. That’s the easy part. You also need to do the hard part, understanding why those steps, and why in that specific order.

It’s only after you have that understanding that you can really automate something. Without it you can execute the steps, and probably even notice and acknowledge failures along the way. But you can’t really understand the reason behind the failures. Which means you can’t do anything about them. Instead of automation, you’ve got a self-executing runbook. That’s a great place to start, and it might be enough for a while, but it’s not really an automated system.

An automated system understands your intent. It understands the goal you’re trying to achieve and helps you achieve it. It adapts around errors and delays. It figures out how to get from current state to the desired state. It notices deviations and corrects for them as well. And that gets into promise theory, which is a story for another day.
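One common way to express “get from current state to desired state” is a reconcile loop: observe, diff against the goal, apply the difference, repeat. What follows is a minimal sketch of that idea, not any particular tool’s implementation. `FakeCluster`, `plan`, and `reconcile` are hypothetical names, and the single injected failure stands in for the errors and delays a real system would see.

```python
from typing import Callable, Dict


def plan(current: Dict[str, int], desired: Dict[str, int]) -> Dict[str, int]:
    """Diff two states: the replica-count change each service still needs."""
    deltas = {}
    for name in sorted(set(current) | set(desired)):
        delta = desired.get(name, 0) - current.get(name, 0)
        if delta != 0:
            deltas[name] = delta
    return deltas


def reconcile(
    observe: Callable[[], Dict[str, int]],
    apply_change: Callable[[str, int], None],
    desired: Dict[str, int],
    max_rounds: int = 10,
) -> bool:
    """Observe, diff, and apply until the observed state matches the goal."""
    for _ in range(max_rounds):
        deltas = plan(observe(), desired)
        if not deltas:
            return True
        for name, delta in deltas.items():
            try:
                apply_change(name, delta)
            except RuntimeError:
                pass  # do not abort; the next round re-observes and retries
    return not plan(observe(), desired)


class FakeCluster:
    """Hypothetical stand-in for a real system; its first change fails."""

    def __init__(self, replicas: Dict[str, int]) -> None:
        self.replicas = dict(replicas)
        self.failures_left = 1

    def observe(self) -> Dict[str, int]:
        return dict(self.replicas)

    def apply_change(self, name: str, delta: int) -> None:
        if self.failures_left > 0:
            self.failures_left -= 1
            raise RuntimeError("transient failure")
        self.replicas[name] = self.replicas.get(name, 0) + delta


cluster = FakeCluster({"web": 1, "worker": 3})
desired = {"web": 3, "worker": 2}
converged = reconcile(cluster.observe, cluster.apply_change, desired)
print(converged, cluster.replicas)
```

The difference from a self-executing runbook is that nothing here is a fixed list of steps. The loop only knows the goal, so a failed step just means the next observation still shows a gap to close.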

June 6, 2022 by Leon Rosenshein

TAD vs. TDD

tdd context code quality testing domain driven design refactoring

Test After Development (TAD) or Test Driven Development (TDD)? What’s the difference? They sound pretty similar, and the Levenshtein distance between the full names is only 5, but the difference is pretty profound.
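If you want to check that number, here is a small sketch of the standard dynamic-programming edit distance. The 5 is for the spelled-out names; the acronyms themselves are only one edit apart.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character inserts, deletes, or substitutions."""
    previous = list(range(len(b) + 1))  # distance from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute, or free if equal
            ))
        previous = current
    return previous[-1]


print(levenshtein("Test After Development", "Test Driven Development"))  # 5
print(levenshtein("TAD", "TDD"))  # 1
```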

In TAD you write some code, try it out with some test cases, and write some tests to cover the complicated or confusing parts. You want to make sure that you have some confidence that the code does what you want it to so you write some tests. Maybe you write some other tests to make sure some of the likely errors are handled. Simple and straightforward.

In TDD, on the other hand, you know what you want to do. But instead of figuring out how to do it, you write tests. You write tests that are examples of doing it. You write tests that demonstrate how you want people to use it. You write some showing how you think they’ll try to use it. You write tests with the kinds of mistakes you think they’ll make. Of course, none of those tests pass. How could they? There’s no code to do it yet.

Then you look at all of those tests. And the APIs and function calls you’ll need to make it happen. And you think about them. Individually and as a group. How they fit together. If they’re consistent, in their inputs, outputs, error reporting, and “style”. If they solve someone’s problem, or are just a bunch of building blocks.

Then you go back and adjust the API. You make it consistent. Logically and stylistically. You make sure it fits together. You make sure that it can be used to solve the problem you’re trying to solve. And you re-write the tests to show it. But they still fail, since you haven’t written the code yet.

After all your tests are written, your APIs updated, all the tests are compiling but failing to run, you start to write the code. As you write more code more of the tests pass. Finally all the tests pass. Then you realize you’ve got an internal design problem. Something is in the wrong domain or you find some coupling you don’t want. So you refactor it. Some of the tests fail, but that’s OK. You find and fix the problem. All the tests pass and you’re done. Simple.

It took 1 paragraph to describe TAD and 4 to describe TDD. Based on that TAD is better, right? Maybe in some cases, but generally, no. And here are some of the reasons why.

Looking at things from the outside

With TAD you generally don’t get a chance to look at your API as a whole. You don’t look for the inconsistencies, you just look at functionality. You might not upset your users, but you’re unlikely to delight them. It works, but it doesn’t feel smooth.

With TDD you have a ready-made source of documentation. Your tests are a great way to show your users how you expect the API to be used. And when your users tell you how they’re really using your API you can easily add that to your behavioral test suite to make sure your APIs are for life.
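Here is a small sketch of what tests-as-documentation can look like. `slugify` is a hypothetical function invented for this example, not something from a real library. In TDD the three tests would be written first and fail until the function existed; the finished function is included only so the example runs.

```python
def slugify(title: str) -> str:
    """Hypothetical function under test: turn a post title into a URL slug."""
    words = "".join(c.lower() if c.isalnum() else " " for c in title).split()
    return "-".join(words)


# Each test name states a behavior, and each body shows the intended usage.
def test_title_becomes_lowercase_words_joined_by_hyphens():
    assert slugify("Design Is Iterative") == "design-is-iterative"


def test_punctuation_is_dropped():
    assert slugify("TAD vs. TDD") == "tad-vs-tdd"


def test_empty_title_gives_empty_slug():
    assert slugify("") == ""


# Tiny runner so the file works on its own. A runner such as pytest would
# discover the test_ functions without it.
for test in (
    test_title_becomes_lowercase_words_joined_by_hyphens,
    test_punctuation_is_dropped,
    test_empty_title_gives_empty_slug,
):
    test()
    print(f"ok: {test.__name__}")
```

When a user reports a way they actually call the API, it becomes one more test with a descriptive name, and the documentation grows with it.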

Looking at things from the inside

You will need to make changes. Your changes need to fit in with the existing system. With TDD you can see how the changes fit with the existing API. You can see if they are (or aren’t) working well together. You can ensure that your API stays logically consistent.

You’re always learning about the domain you’re modeling. As your understanding grows the boundaries between the domains will shift. Functionality and data will want to move between domains. There will be refactoring. TDD helps you know that what you’re changing doesn’t change the observable behavior. Because that’s what your users care about.

That’s why TDD generally leads to better results than TAD.

NB

One final caveat. While TDD makes your code and APIs better, it’s not the entire solution. No one methodology is. TDD is a great foundation, but for your own peace of mind you’ll also want other methodologies. Like unit tests for specific functionality and validating complex internal logic that doesn’t directly map to externally visible behavior. Or integration tests. Or testing on large data sets. Or whatever else your specific context demands.

June 3, 2022 by Leon Rosenshein

Lint Filters And Testing

tdd testing code quality

I don’t know about the lint filter on your dryer, but our current dryer, and every dryer I’ve used before this, had a message on or near the lint filter something like this.

[Picture of a dryer lint filter with the message CLEAN BEFORE EACH LOAD]

That’s pretty good advice. Sure, you should clean the filter after you use it. It’s polite and you want to be helpful to the next person who uses the dryer, but that doesn’t always happen. As a user of the washer and dryer, it’s your responsibility to make sure that they are ready for your use. That means enough soap, softener (if you use it), bleach (if you want), and cleaning the lint filter.

It’s the same way with software. You should always endeavor to leave things the way you found them. Return unused resources. Clean up any temporary files. Reset shared state. In short, you want to eliminate any unintended side effects and make your system as idempotent as possible. You don’t want the current run to be unduly influenced by the previous run. The thing is, no matter how hard you try, you can’t be sure the cleanup code you have at the end runs successfully to completion.

Which means, before you do something, you need to make sure things are set up the way you need them to be. Especially if it’s a repetitive task. Like a unit or TDD test. You run them all the time. Potentially multiple times an hour. And sometimes they fail in the middle. So they need to be idempotent and side-effect free.

There’s a name for this pattern. It’s the Arrange-Act-Assert (AAA) pattern. Do all your setup. Everything from ensuring the (virtual) file system is in the right state to setting up the environment to building and seeding a database. Then make the ONE call that does the thing you’re testing. Finally, do as many asserts as you need to ensure that things went the way you expected.

The AAA pattern is that simple. Arrange things how they should be. Act on the system. Assert everything came out as expected. Like everything else though, in practice there are nuances and complexities. In many cases there’s a part of Arrange that you want to run once before any test, but not before every test. You might need to assert that the Arrange step actually arranged things correctly. You’ll find that a lot of the Arrange code should be re-used. You probably want to clean up after the Assert. And there might be a cleanup that runs after all tests are done.

Then there are the efficiency temptations. If you’re doing a CRUD test you’ll probably be tempted to do it in one test. After all, the Arrange for Read, Update, and Delete is Create, so why do it separately? It’s faster and less code to do it in one test, right? Probably not. It might be faster on initial write, but over time, as the code changes, that unnatural coupling is going to slow you down. So don’t do it. Remember, Act is doing the one call that you’re going to test. If you have multiple Acts in your test you should split it up into multiple tests.
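Here is a minimal sketch of Arrange-Act-Assert using Python’s built-in unittest module. `NoteStore` is a hypothetical class invented for this example. The shared part of Arrange lives in `setUp`, cleanup is registered up front so it still runs when a test fails, and each test makes exactly one Act call.

```python
import shutil
import tempfile
import unittest
from pathlib import Path


class NoteStore:
    """Hypothetical system under test: one text file per note."""

    def __init__(self, root: Path) -> None:
        self.root = root

    def create(self, name: str, text: str) -> None:
        (self.root / name).write_text(text)

    def read(self, name: str) -> str:
        return (self.root / name).read_text()

    def delete(self, name: str) -> None:
        (self.root / name).unlink()


class NoteStoreTests(unittest.TestCase):
    def setUp(self) -> None:
        # Arrange (shared): a fresh, empty directory before every test,
        # so no test depends on what an earlier run left behind.
        self.root = Path(tempfile.mkdtemp())
        self.addCleanup(shutil.rmtree, self.root, ignore_errors=True)
        self.store = NoteStore(self.root)

    def test_read_returns_stored_text(self) -> None:
        # Arrange
        self.store.create("todo", "clean the lint filter")
        # Act: the one call under test
        text = self.store.read("todo")
        # Assert
        self.assertEqual(text, "clean the lint filter")

    def test_delete_removes_note(self) -> None:
        # Arrange
        self.store.create("todo", "clean the lint filter")
        # Act: the one call under test
        self.store.delete("todo")
        # Assert
        self.assertFalse((self.root / "todo").exists())
```

Both tests repeat the Create step as part of their own Arrange instead of sharing one long CRUD test. That costs a line of duplication, but a broken `delete` can never make the `read` test fail.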

And don’t forget to clean the dryer lint filter before each load.

June 1, 2022 by Leon Rosenshein

Measuring and Managing

ooda deming agile

You’ve probably heard the quote from W. Edwards Deming that

if you can’t measure it, you can’t manage it

Which is an accurate quote. It comes right out of The New Economics. And on the surface, it makes sense. Whether you call it OODA, Plan-Do-Check-Act, or just a feedback loop, how can you control (manage) something if you don’t know what’s going on? And the more accurately you know, the better you can control it. That’s just a basic feedback system.

The thing is, that’s just part of the story. Or really, part of the quote, which is hinted at by that leading lower case i. The full quotation is actually

It is wrong to suppose that if you can’t measure it, you can’t manage it – a costly myth.

Think about the difference between those two quotes. The first is Taylorism. Scientific management at its most basic and most rigid. Measure everything. Then use those measurements to force results. And if you can’t measure it, figure out how or ignore it.

The second is the complete opposite. It’s about being able to manage the situation based on what it is, qualitatively. It’s about knowing where you want to get to and working towards it. Even if you don’t know where you are. It’s about dealing with ambiguity and making progress from a place of ambiguity. It’s about working with the situation we have, not the situation we wish we had.

That doesn’t mean you can’t or shouldn’t use accurate measurements when you have them. Far from it. It’s almost always better when you have data to guide you. The more accurate your feedback signal the better you can craft your response. And you should. You should, whenever possible, measure things. Collect metrics from your apps, libraries, and services. You should monitor those metrics and respond to changes. Ideally you measure them well enough to use them not just as indicators of problems, but as predictors of problems so you can respond to them before the problem starts.

But you should also recognize that you need to manage things even without those metrics. It might be because you don’t know how to measure something, or because measuring it would take too much time/effort/resources that would be better spent on something else. For example, you might not be able to tell how much better option A or option B will be, before or after trying them. You might just have a feel for it. But you still have to make a choice. You have to manage the situation. And if anyone tells you Deming says you can’t, tell them The Rest Of The Story.

May 27, 2022 by Leon Rosenshein

Software Development Is A Social Activity

matrix context systems thinking development cognitive load

I’ve had a bunch of discussions lately around the nature of software development. Silos. Boundaries. Contexts. Cognitive load. Team Topologies. And how things flow and work together. Wheels within wheels. Loops within loops. Patterns within patterns.

Those discussions were all around how teams can and should work together. Where does one team’s responsibility stop and another’s begin? What’s the difference between a customer and a partner? How does value flow through the system? Platforms and Programs, Products, or Feature factories. The common thread in all of these discussions is that the relationships between the teams are more important than the code. That software development is a social endeavor.

Despite the myth of the lone wolf developer, for all but the tiniest projects/products on all but the tiniest embedded devices, you can’t do it alone. We build things with others. We build things for others. We don’t do our work in a vacuum, and our work does not exist in one.

On the other hand, no-one can know the details of everything. No one person can have enough information to make all the decisions needed. The cognitive load of that would be too high. No one can be in the right place to make all those decisions. Physics and time keep us from being in multiple places at the same time. So where does that leave us?

It leaves us with process and methodology. We come up with boundaries. Domain driven design. Bounded contexts. Discrete APIs. We come up with contracts and SLAs. Things that let us share knowledge and compartmentalize responsibility. Things that let us manage the complexity. Keep the cognitive load manageable. That let us focus on the things close to us and not worry about the things further away.

Or at least we aspire to have those things. But, honestly, we don’t. Instead, we have approximations. Leaky approximations. We have APIs and interfaces and microservices, which mostly let us not worry about the details. Except for the ones we do need to know.

Which means we’re back to people. And how they interact. People who need to talk to each other. Need to work with each other. And talk across all those leaky barriers. So they can make decisions across those leaky barriers. And use the fact that the barriers are leaky to help each other across those leaky barriers. And at the same time make the barriers less leaky.

As I write this I’m sitting on an airplane watching The Matrix: Resurrections. Talk about leaky barriers. There are the (semi) obvious barriers between the Matrix and the “real” world. There are the barriers between the world inside the (semi-real) Matrix and the Matrix computer game which Thomas Anderson has written. There’s the barrier between incarnations of the (semi) real Matrix. And that’s all inside the universe the movie takes place in.

On top of that there are the barriers between the different Matrix movies. And The Animatrix series. And the changes the writers/directors have gone through. It’s almost like the directors and characters in the movie know that there’s an audience watching.

The leakiest barrier of them all is the audience. The audience knows many things about the in-movie universe that the people in the universe don’t. But there’s no guarantee that they’re not actually in a (the?) Matrix. After all, how would you know? Which is about as meta as it gets.

The thing is, what makes the caper in The Matrix work is the same thing that makes software work. The connections between groups. Within and across the leaky barriers. The understanding. The shared goals. The trust.

Because in the end, software development is a social activity. The choice is yours.

May 25, 2022 by Leon Rosenshein

Where do defects come from?

tdd code quality engineering excellence tyranny of or

For those that know me, it’s no secret that while I like a lot of Uncle Bob’s specific advice (clean code, unit tests), I have some problems with his overall approach. Consider his response to the question “Where do defects come from?”.

According to him, the cause is:

Too many programmers take sloppy short-cuts under schedule pressure.
Too many other programmers think it’s fine, and provide cover.

And the obvious solution is

Raise the level of software discipline and professionalism.
Never make excuses for sloppy work.

It’s true that as software engineers we need to do a better job, have more discipline, and not make excuses for those that don’t. Raising the bar on how we do our jobs is vitally important to eliminating defects.

But it’s not enough. You can always find someone who typed the line of code and blame them for the problem. There’s even a git command, blame, to do just that. But it’s kind of disingenuous to stop there.

I’ve been doing this a long time. In all my years as a software engineer, I’ve never run into a co-worker who didn’t care. Some had more information than others, and some had different experience and viewpoints that let them see things others missed. But no-one thought sloppy short-cuts were OK to take or that it was OK if someone else took one.

I’m much more partial to Hillel Wayne’s approach to things. Not just in this case, but in many others. He takes a much more nuanced approach. He avoids the Tyranny of Or. Yes, someone typed that line, but the defect has a deeper cause than that. Why was that line allowed to be typed? Could we have added guardrails to at least prevent it from getting into users’ hands?

It’s because for all but the simplest bits of software, running in all but the simplest environments, it’s not just one developer, or even one team of developers that wrote the code. The code uses libraries and OS functionality from somewhere else. Someone else built the environment. The hardware the code runs on and the network it’s connected to are managed by someone else. It’s part of a much bigger, interconnected, and interdependent system. To say that one person is responsible for an error in such a system isn’t accurate.

Which is not to absolve developers of responsibility for defects. They need to have discipline and be professional. There is no excuse for sloppy work. But it’s not enough. We also need to ensure that we have, and use, all of the tools at our disposal to not just find defects, but even better, keep them from happening in the first place. Because the goal is to make sure the defects don’t reach the customer, not figure out who to blame for them.
