
by Leon Rosenshein

Discipline

I cannot emphasize too much that architecture is as much about programmer discipline as any technical consideration. Without programmer discipline, all systems, no matter how well designed, degrade quickly into gray goo at the hands of people who don’t understand the "why." -- Allen Holub

Back in the stone age (actually the late 80s/early 90s) I was working for a 3rd tier aerospace company in southern California. One of the things we built was a tool we called ARENA. Basically a squadron level air combat simulation system. We handled flight modeling of aircraft and missiles, Air Combat AI, and a really fancy (for the time) display and replay system. And, I think, a pretty good domain driven architecture.

It helped that the domain was pretty simple (vehicles moving in 3D space) and that the entities didn’t really communicate with each other. At least not actively. Each object did its own physical modeling and put its location in a distributed shared memory system. After that each object was on its own to detect and respond to the other things in the world. And for each object we broke things down like the real world. Propulsion, aerodynamics, sensors, and either an AI or a set of input devices. Interfaces between systems were defined up front and we all knew what they were and respected them. We had external customers and internal users doing research into different flight modes and doing requirements tradeoffs.

Then we got a new customer. Someone wanted to use our system as a test bench for their mission computer (MC). Seemed reasonable at first. We already had a simulated world for the MC to live in, and we had well defined models of an individual aircraft, so how hard could it be to add a little more hardware to the loop? Turns out that it’s approximately impossible. At least with the architecture we had. Because our idea of the interfaces inside an aircraft were purely logical, while the MC expected distinct physical components talking to it over a single bus. So we wrote some adapters. That worked for some things, like the engine, because there was one input (throttle) and 3 outputs (fuel burn, thrust, and ingest drag). But it didn’t work for some of the more complex systems, like the radar. It had lots of inputs, including vehicle state, pilot commands, world state, and mission computer commands. And to get all of them we needed our adapter to reach into multiple systems. The timeline was tight so, instead of refactoring, we did the expedient thing and reached around our interfaces and directly coupled things. And it almost worked.

Everything was very close to correct, and often was, but the timing just didn’t work out. Things would drift. Or miss a simulation frame and stop, or worse, go backwards. So we added some double and triple buffers. That got rid of the backwards motion, and the pauses were usually better, but sometimes worse. So we added some extrapolation to keep things moving. Then we added another adjustment. And another. What was supposed to be a 4 week installation turned into a 3 month death march and resulted in a system that worked well enough to pass acceptance tests, but really wasn’t very good at what it was supposed to do. It went from a clean distributed system to a distributed ball of mud.

And that happened with the same team that built the initial simulation doing the mods. We knew why we had built ARENA the way we did. The reasons we put the boundaries and interfaces where we did. And why we shouldn’t have violated those boundaries. But we did. Not all of them, but enough of them. And we paid the price. And the system suffered because of it. Because we didn’t have the discipline to do the right thing.

Now imagine what would have happened if we didn’t know why things were the way they were. And there wasn’t any documentation of why. Which of course there wasn’t because hey, we all knew what was going on, so why write it down? Any semblance of structure would have fallen apart that much faster. And we probably would have not just had problems with the interface, but likely broken other things as well. It’s Chesterton’s Fence all over again. That’s where Architectural Decision Records (ADRs) come in. Writing down why you made the decisions you did. The code documents what the decision was, and the ADR documents why.

So two takeaways. First, next time you get into a situation where you have a choice between the “right” thing and the “expedient” thing, remember the long term cost of that decision. Remember that once you start down that slippery slope of expediency, it just gets easier and easier to make those decisions. Because once the boundaries and abstractions are broken, why bother to keep them almost correct? I’ll tell you why. Because that way lies madness. Otherwise known as the big ball of mud.

Second, write down the why. The next person to work on that area won’t know all the background and the experiments that led to the current architecture/design. Having the why documented lets them avoid making those same mistakes all over again. Even (especially?) if that person is future you.

by Leon Rosenshein

NFRs and You

NFRs are non-functional requirements. I’ve talked about them before, but lately the name has been bothering me. It’s that non in the name. Unlike non-goals, which are the things you’re not going to do or worry about, your NFRs are not things that your project won’t do, and they’re not requirements you can ignore. NFRs are things you do need to worry about.

I mean, look at the phrase non-functional requirement. You mean you don’t want it to work? Or are you saying it doesn’t do anything, but it’s required anyway? Why would you put any effort into something that has no function?

Requirements without function or benefit should be pushed back against, but what we typically call NFRs are neither. NFRs are usually drawn from the list of -ilities. Things like burstability, extendability, reliability, and durability. For any given project you will likely need one or more of them, so ignoring them is a bad idea.

That said, the -ilities aren’t your typical functional requirements either. “The method shall return a (nil, error code) tuple when receiving invalid input.” That’s a pretty straightforward requirement that defines how the code should function. It’s also pretty easy to unit test. Write up a set of test cases that covers the combinatorial explosion of possible inputs to whatever level you need, and call the method. As long as you get the expected answer you’ve met the functional requirement.

No, I think what we call NFRs are really operational requirements. Requirements around the way things behave when they’re operational. Things that are hard (impossible?) to test with unit testing. Consider burstability. Sure, you could write a test that, in parallel, calls your method hundreds of times per second. But does that give you any confidence that the system overall can handle it? To really get some confidence on burstability you need to actually try it. Maybe not in production, but in a production like environment, with all of the vagaries that implies.

And like any other requirement, operational requirements need to be part of the plan. You can’t just bolt extensibility on at the end. Or at least you can’t do it in a sustainable way. Which means you need to think about operational requirements from the beginning. They’re requirements, just like any other requirement. They’re just usually a little harder to figure out. Especially if you’re just looking at the business logic, because they’re not in the business logic, they’re in the business case.

The business logic will tell you what to do when a customer places an order, and what should happen if they do something wrong. It won’t tell you what to do on the Friday after Thanksgiving. It’s the business case that will tell you that while your typical order rate is X, on that day it will be 10x, and for the subsequent 6 weeks it will average 5x. And you better be ready for it, or else.

So next time you hear someone asking you about your NFRs, think about what they’re really asking you. Think about the scope of the question, and plan for it.

by Leon Rosenshein

Toasters

Toasters are pretty simple. Put in a slice of bread. Push a knob/lever and a few minutes later your toast pops up, with some approximation of the level of doneness you want. If you’ve got a commercial toaster with a conveyor belt you don’t even need to push the knob.

But it’s only an approximation of the level of doneness you want. And it seems to drift over time. And with different breads and slice thicknesses. And bagels, which should only be toasted on one side, are a whole different kettle of fish. Which leads to this story, which is almost 30 years old.

But even the simple toaster is hard to build, if you have to go back to first principles. I don’t mean being a maker in the modern sense, with a shop complete with 3D printers, CNC mills, and laser cutters. I mean first principles. Consider this attempt to make a toaster from scratch. It didn’t exactly work, but it didn’t exactly fail either.

But what has that got to do with us? Just this. On any given day we make lots of choices. Choices about what to build, what to reuse, and what to extend. Choices about building what we need right now vs what we know we’ll need next week/month vs what we think we might want in the future. So we need a framework to make those choices in.

A good place to start that framework is integrated value over time. Which usually means building on existing things instead of building from scratch. Improving things instead of replacing them. But not always. Sometimes a new requirement comes along and the cost of adapting current systems exceeds the cost of making something new. So do it. Carefully. Within the existing framework. And then iterate on that thing as well.

And that’s where it gets really interesting. You find out how well your mental model of the system, as translated into architecture and then code, matches the problem domain. If they match well it’s easy to add something new. If they don’t, the new thing will expose the gaps between your code and the problem space. Which brings you right back to deciding whether to adapt what you’ve got or build from scratch.

Because in the end, just like building a toaster, the right answer depends on figuring out the right place to start and the right place to go to.

by Leon Rosenshein

Pointers and Errors

Go is pass by value. But it has pointers. And multiple return values. So here’s a question. Let’s say you’ve got a Builder of some kind, but there are combinations of parameters that are invalid, so your builder returns a struct and an error. Simple enough. The caller just checks the returned error, and if it’s non-nil does whatever is needed. If it is nil, just continue on and use the struct.

But what if the user forgets to check the returned error? Then what? They’ve got a struct that might or might not be usable, but it’s certainly not what they asked for. What’s going to happen if they try to use it? Even worse, what happens if they store it for a while, pass it down some long call tree, and only then does it get used. And fail. Very far from where it was created. That can be a nightmare to debug.

One way to prevent that is by returning a pointer to a struct, instead of a struct itself. Then in the error case you can return a nil pointer and the error. There’s no chance of that nil pointer working. As soon as the code tries to use it you’ll get a panic. Panics aren’t good, but they’re often better than something random that almost works. So let’s always do that. Right?

Not so fast, because now you’re using a pointer to things, and suddenly you're exposed to spooky action at a distance. If one of those functions deep in the call tree changes something in the struct via pointer, it’s going to impact everyone who uses the struct from then on. That might or might not be what you want.

Consider this playground. When you create a Stopwatch you get one back, regardless of whether your offset string is valid (s1) or not (s2). So even if you get an error .Elapsed() will return a value. You don’t get a panic, but you do get an answer you weren’t expecting.

On the other hand, if you create a *Stopwatch with a valid offset you get one back (s3), but if your offset is invalid (s4, s5) you don’t get one back. As long as you check your error value (s3, s4) you can do the right thing with a *Stopwatch, but if you don’t (s5), boom - panic.

Finally, consider what happens in the timeIt methods. With both the Stopwatch (s1) and *Stopwatch (s3) you get the right sub-time, but when you get back to the main function and check the overall time the Stopwatch (s1) gives you the right elapsed time, but the *Stopwatch (s3) has been reset and shows only 1 second of elapsed time.

So what’s the right answer? Like everything else, it depends. Don’t be dogmatic. I lean towards pointer returns because it’s harder to use an invalid one, and not pointers for method parameters unless I want to mutate the object. But that can change based on use case and efficiency. What doesn’t change is the need to think about the actual use case and make an informed decision.

by Leon Rosenshein

Understanding

Quick. What happens when someone passes a poorly formatted string to your interface? The interface notices and responds with the correct error code. Great. What if they set both the colorize_output and the no_output flags? Do you colorize nothing? Arguably correct, but also arguably wrong, and probably indicative of a confused user. You can, and should, be testing all those cases, and others, and having well defined answers. Unit testing can help you avoid lots of customer issues. But that's just based on your interpretation, not your customer's.

Then there's the whole "You gave me what I asked for, not what I wanted" situation. Again, you're arguably correct, but you don't have a happy customer. Remember, your customers want a problem solved, not a bag of features. So unit testing isn't enough. You need to do more, which leads to:

"Testing leads to failure, and failure leads to understanding." - Burt Rutan

Understanding what? Not just the specific feature on the ticket you're implementing. Understanding your customer's problems. Their workflow. Their environment. And their expectations. The kind of understanding you get from testing the thing you're building in the environment it will live in. That doesn't mean that you don't do unit testing or integration testing. Instead, it's another kind of testing you need to do.

And it's not just throw it over the wall testing, although that has value too. At least at first, you need to work with your customer to understand how they're using it. You need to help them understand how you expect it to be used. And together you come to understand the differences between the two. Then you can close the understanding gap.

You need to interact with your customer. You need to interact frequently to keep cycle times low. You need to interact deeply because you need to understand not just the surface issues, but why they're issues. And that kind of understanding requires a lot of information to be transferred. So you want high bandwidth communications. Bug reports might tell you what, but they're not too good at the why. So in person if possible, or at least screen share.

Because you can't solve a problem you don't understand, and you can't understand your customer without really seeing how they work. And that applies to all customers, internal and external.

by Leon Rosenshein

Best Practices?

Design patterns are best practices. Best practices are, by definition, the best way to do things. It’s right there in the name. Therefore we should use design patterns for everything. Taken to its logical conclusion, you end up turning the interview FizzBuzz question into the Fizz Buzz Enterprise Edition. That’s perfect, right? It’s got, among other things, frameworks, abstractions, factories, strategies, Travis for CI, and it’s even got an implementation. Which means if you need a scalable, enterprise-ready version of FizzBuzz, you’re set. End of story.

Or not. There are lots of design patterns there. But the simple usage of a pattern doesn’t mean it’s a good thing. After all, you can always add another level of abstraction.

More seriously, all of those things are important. When used at the right time. You could use a Builder to set up a State Machine every time you need to get the current state, but is that the right thing? Probably not. 

There are two times when those best practices are almost certainly wrong. At the very start of a project, and when you’ve pushed your project into areas where most folks haven’t gone. Because that’s not what the Best Practices are best at.

When we did the O’Reilly kata last year we almost certainly over-designed it. We had microservices, an authentication/authorization module, and provisions for adding a machine-learning component. For a client with hundreds of customers at tens of locations in one city. In reality, you could probably have managed it with a spreadsheet, a few web connections, and some Excel macros. The winning architecture was a monolith with some clear internal separation for if/when more scale was needed.

At the other end of the spectrum is getting the most performance out of a system. Which for me was epitomized by the Apple ][ hi res graphics, with its weird modes and loading order, because it saved a chip in the design and was cheaper/faster. There’s no Best Practice that will tell you that it’s a good idea to load things like that. It’s just something needed in that particular time/place. The StackOverflow folks have a good description of how they traded best practices for scale/speed, and how now, with more power and different goals they’re changing some things.

Which is to say that best practices are best practices when the situation you’re applying them to matches the conditions the best practices were developed for.

by Leon Rosenshein

Leading Vs. Trailing

Indicators and metrics, that is. Or maybe we should call them drivers and outcomes. Leading indicators are predictors. Done right they help you know what’s going to happen in the future. Trailing indicators, on the other hand, are all about what happened. Consider a car without a fuel gauge. From experience you know you’ve got 6 hours of fuel. After 5 hours driving down the highway you can predict that you’re low on fuel. Hours driven is a leading indicator. But since you’re on vacation and the car has a lot of luggage in it after 5 ½ hours the engine sputters and dies. You’ve run out of fuel. RPM going to zero is a trailing indicator.

Let’s start with a reality. Anything you measure has happened. In that respect all metrics are trailing indicators. The hours you drove that car are a measure of what happened. On the other hand, you can also make predictions based on any metric, so they’re also leading indicators. If your car engine’s RPMs stay high you probably haven’t run out of fuel, so you can make a short term prediction.

It’s really about what you want and what you need to do to achieve it. Your key performance indicators (KPIs) tell you if you’re achieving the desired results. Meet them and you did. Don’t meet them and you’re not. But past performance is not a guarantee of future results. Failing to meet your KPIs won’t tell you what you need to do to fix the problem. That’s just the way trailing indicators are. When you’ve run out of gas, you don’t know why. Maybe you were going faster than usual. Maybe your watch is slow. Maybe you’ve got a rooftop cargo box on your car now. Maybe your fuel tank has a leak.

In this case, hours driven (or hours remaining) is one way to predict if you’ll meet the KPI (have enough fuel). But it’s not enough. Or accurate enough. A better one is fuel burn rate. Better if you combine it with distance traveled. Average miles per gallon is a pretty good leading indicator of how many more miles you can drive. But it’s not perfect. You might have a leak. Or the conditions you’ve averaged over are different enough from your current situation. So you want trailing indicators on your leading indicators to improve confidence.

Obviously this kind of thing has a lot to do with writing code to predict things. But what has this got to do with development in general? Consider the development process. A couple of the big things we measure and report on are time for a build and time to fix a broken build. Critical things to keep low. But they’re trailing indicators, and as such are great for telling us there’s a problem, but not so good at telling us what to change. For that we need to measure a bunch of other things. Like time from diff creation to approval. Number of changes to a diff before approval. Time from diff submission to test start time. The number of tests run. The number of tests that can be run at once. The reliability of the tests. The time that it takes to run all the tests.

There’s no silver bullet in that list. No one thing that will always make it better. What it points to is that you need to understand the drivers of the system (leading indicators) before you can get the results (trailing indicators) you want.

So what’s driving the result you’re looking for?

by Leon Rosenshein

Dogma

Dogma (n): a principle or set of principles laid down by an authority as incontrovertibly true.

        – Oxford Languages

That’s one definition of dogma. The problem I have with dogma is the incontrovertible part. Because even if something was absolutely true once, that doesn’t mean it’s true now. Or will be in the future.

And what happens when your dogma disagrees with itself? It’s not usually existence ending, but it can be confusing. Consider this bit of code

package main

import (
    "fmt"
)

const TWO = 2

func scaleList(numbers []int, scaleFactor int) []int {
    scaled := make([]int, len(numbers))
    for i, e := range numbers {
        scaled[i] = e * scaleFactor
    }
    return scaled
}

func doubleList(numbers []int) []int {
    return scaleList(numbers, TWO)
}

func main() {
    n1 := []int{1, 2, 3, 4}
    d := doubleList(n1)
    fmt.Println(d)
    n2 := []int{4, 2, 1, 3}
    fmt.Println(doubleList(n2))
    fmt.Println(doubleList(d))
}

It’s DRY. There’s no repetition of code. There are no magic hard-coded constants in functions. Functions do what they say they do. But is it YAGNI? KISS? Not really. Of course it’s a contrived example, but there’s no need for the scaleList function. There’s barely a need for doubleList, and a constant called TWO with a value of 2 is just added cognitive load.

Which is not to say that DRY, KISS, and YAGNI can’t live together. They can and they should. But it’s a balance. And the only bit of dogma I have is “It Depends.”

by Leon Rosenshein

Flow vs. Utilization

What should you optimize, utilization (how busy everyone is) or flow (how tasks are moving through the system)? How do you measure those things? What does it even mean if they’re high or low?

What does utilization mean for a developer? What does it include? Of course, time spent typing in code counts. And time spent writing tests. But what about time building/running code and tests? Or time documenting the design decisions made? And the time spent making those decisions? Those should probably count. What about time spent tracking the time spent? Or time planning what work to do? Is time spent helping a co-worker understand something utilization? What about training, or research? All important things so how do you count them?

What does it mean to optimize flow? Is it as simple as increasing the number of Jira task state changes? Or maybe just increasing state changes to the right? Probably not because all Jira state changes aren’t equal. Is popping into existence a state change? Can you optimize development flow like an assembly line?

Or should we be optimizing something else? Like maybe value? Value as in customer value. Whoever your customer is, internal or external. And that value doesn’t exist until it’s delivered. Because working on your machine, or in staging, or some other test harness isn’t value in your customer’s hands.

Which might push you towards optimizing flow. After all, getting a task done adds value, right? Sometimes, but not always. Because, for tracking and dependency reasons, not every task in Jira adds value. In fact, most of them don’t. They’re necessary, but not sufficient. Until all of the required tasks are done, we can’t do the one task that adds value. You could spend a year doing precursor tasks and not add any value. So optimizing flow doesn’t seem right.

So how do you optimize value? First, figure out what value means. What use cases and user stories add the most value? Add new benefits to the user’s workflows? Then figure out which tasks are really required to provide those benefits. Once you know which tasks, focus on those specific tasks, not just any task. Look for bottlenecks and work to reduce them. Concentrate utilization in those areas. Think broadly. If you need domain expertise, get it. Ideally on the team, with the team if needed. If you need broad agreement between customers and suppliers, get that. Make sure everyone involved knows what you’re doing and why.

Or, to put it in the terms we started with, keep individual utilization high on the flow of tasks which roll up to adding value.

by Leon Rosenshein

We Like Ike

Dwight David Eisenhower. 34th president of the United States. Supreme Commander of the Allied Forces in Europe. Brought the term military-industrial complex into the vocabulary. All in all, a pretty busy guy for much of his career. He did a lot of things and made a lot of decisions. And regardless of whether you agree with those things or those decisions, one of the other things he gave us was a framework for deciding what needs to get done when.

It’s not perfect, and of course, like all decision frameworks, there’s some judgement needed, but what it does have is simplicity. Just take each problem/task/decision and map it on two axes. Urgency and Importance. Because, “What is important is seldom urgent and what is urgent is seldom important.”

Urgency is about timeliness and deadlines. Given the delta-V of your rocket, there’s a launch window in 24 hours. Use it or not? That window will close whether you launch or not. The next opportunity will be in a few days/weeks/months/years, but there will be one. In this case, to quote Geddy Lee, “If you choose not to decide, you still have made a choice.”

Importance, on the other hand, is about the long term impact of the thing, and how much time/effort/money it will take to change later. Getting the hardware design right before you make a million of something is important. Back when software was delivered in a box, making sure the gold master you sent to the duplicator was about as important a choice as you could make.

So once you’ve got the two axes, plot them out.

Importance and Urgency matrix

The highest priority things are up and to the right. Urgent and Important. Do those first. And make sure they’re done. These are the crises. If you don’t at least mitigate them, something really bad will happen.

Next are the important ones that aren’t urgent, the top left. You want to get those things scheduled so you have time to give them the attention they need, before they end up in the top right. And by scheduled, I don’t mean schedule the crisis. I mean schedule the time you’re going to work on them and make sure they get done.

Then there’s the urgent things that aren’t as important. The key is less important, not unimportant. Can you get help on them? Is there someone better to make sure they get done? Find someone to help you work them in parallel with the other things on your list.

Finally, there are the non-urgent, unimportant things. These are the things that you try not to do. Requests that aren’t yours to resolve. Just say “No.”

One important thing to remember though. Just because you don’t think it’s important or urgent doesn’t mean it isn’t. The classification isn’t just about you and your wants/desires. You need to keep others’ priorities and your own biases in mind while classifying things. If you don’t, you end up in a silo. And you probably don’t want that.