Recent Posts (page 17 / 65)

by Leon Rosenshein

Back It Up There

Back in the house now and mostly back to normal. Heat and electricity are on, but we still need to boil our water. Turns out most of our processes and preparations worked, but not all of them. The biggest failure appears to be the emergency alert system. The city/county uses Everbridge to share alerts. That’s fine and dandy, and they’re highly available and distributed. The problem is, they’re OPT IN, so if you don’t register for alerts you don’t get them. And of course, if you don’t know about the system you don’t register. Well, we’re registered now, so we’ve got that going for us.

Which got me thinking about systems testing and assumptions. When something absolutely, positively, has to work right, how do you verify it? Do you believe that if your car’s “Check Engine” light is off everything is fine? That’s a pretty bold statement. Especially if you don’t have a test to tell you if that light even works. Even if everything is working as planned, is there something that can go wrong that the light doesn’t tell you about?

Having a little green light that says everything is OK is a little better. At least then if the bulb burns out you know something happened and you can check. But even that leaves you at the mercy of what the system behind the light checks.

The only way to really know if your system can handle a particular fault is to try it out. Back when I was working on Falcon 4.0 the team down the hall was working on Top Gun, and they were trying out a new source control system, CVS. It seemed great. It understood merges. Multiple people could work concurrently on the same file and deal with it later. And our IT team took good care of the server. RAID array for the disks, redundant power supplies, daily backups, the whole 9 yards. And of course, the server died. It was the Pepsi Syndrome. Someone knocked over the server. Disks crashed. Motherboards broke. Network ports got ripped out.

Short story, it wasn’t coming back. But that’s ok. We’ve got backups. Weekly full backups and daily incrementals. Pick up a new server, restore the backup, and keep developing. Just a couple of days delay. Until we tried to restore the backup. Turns out we had write-only backups. The little green lights came on every week. The scripts returned successfully. The tapes were rotated off site for physical safety as planned. We just couldn’t restore the server from them.

Luckily we were able to piece together a close enough approximation of the current state from everyone’s machines and the project moved forward. We also added a new step to the weekly tape rotation process. Every Monday we’d restore from tape and apply an incremental on top of it to make sure the entire process worked. And of course, after that we never needed the backups again.
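If you want to do something similar today, the shape of that check is the same regardless of tooling. Here's a rough sketch, assuming tar-style archives and made-up paths, of a restore test that looks at the restored content instead of trusting the backup job's exit code:

```python
import hashlib
import subprocess
import tempfile
from pathlib import Path

# Hypothetical archive locations; swap in whatever your backup tooling produces.
FULL_BACKUP = "/backups/weekly-full.tar.gz"
INCREMENTAL = "/backups/daily-incremental.tar.gz"
# Files that must exist, non-empty, after a successful restore (also hypothetical).
SPOT_CHECKS = ["repo/HEAD", "repo/config"]


def restore_and_verify() -> bool:
    """Restore the full backup plus an incremental into scratch space and check the result."""
    with tempfile.TemporaryDirectory() as scratch:
        for archive in (FULL_BACKUP, INCREMENTAL):
            # A non-zero exit here means the archive itself is unreadable.
            subprocess.run(["tar", "-xzf", archive, "-C", scratch], check=True)
        # "tar exited 0" is just another little green light. Look at the restored content.
        for rel in SPOT_CHECKS:
            path = Path(scratch) / rel
            if not path.is_file() or path.stat().st_size == 0:
                print(f"missing or empty after restore: {rel}")
                return False
            digest = hashlib.sha256(path.read_bytes()).hexdigest()[:12]
            print(f"ok {rel} sha256={digest}")
    return True


if __name__ == "__main__":
    raise SystemExit(0 if restore_and_verify() else 1)
```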

So next time you trust the absence of a red light, or the existence of a little green light, make sure you know what you’re really trusting.

by Leon Rosenshein

2021 In Review

And what a year it’s been. Just some personal reflections. I worked from home for 11 1/2 months. 1 business trip, which was the first in ~19 months (and that hasn’t happened in about 12 years). 1 vacation. Became empty nesters as our youngest moved out. Restarted our latke fry. 1 office evacuation. 1 company change. 2 role changes. 3 different managers. And 153 blog posts.

Which was the biggest? That’s easy. Becoming empty nesters. Especially after working from home for over 20 months. It’s amazing how big a house can feel with just 2 adults and one dog. A lot quieter too. But what really made me notice was when I was on that business trip. Both my wife and I have traveled, together and separately, since we got married, but for the first time in 30 years my wife spent those 6 nights at home alone. That’s a big change.

After that, it’s probably the change in roles. The move from ATG to Aurora, for as much churn as there was, really wasn’t that big of a change for me at first. My direct team and the larger group around it moved as a unit. We merged in some really sharp folks, but the work didn’t change much, nor did the environment. Picked up a few more responsibilities. Dropped a few as well. But over a few months I went from tech lead on the compute team to tech lead for Tech Foundations. Big change in scope. More indirect. More influence. But still exciting and I’m enjoying the work.

< Interlude for office evacuation >
[Image: SmokeFilledRoad]
It’s a little smokey around here
< /Interlude for office evacuation>

And with 153 blog posts for the year, still sharing. Since we moved to Confluence I’ve got a little better access to history, so I decided to play around with the Confluence API and see what folks like.

Entry                      Likes
Artificial Stupidity       8
Drive                      6
Tenure                     6
Ignorance As A Service     5
Exploitation               5
Another 4 Questions        5
Hyrum’s Law                5
Experimental Results       5

Alternatively, there are the ones with 30+ viewers (since August, when we started keeping track).

Entry                      Viewers
Cloth Or Die               41
More Wat?                  36
Ignorance As A Service     35
Language Matters           35
PR Comments                33
Continuous Education       32
Scrape To Taste            31
Core Services              30

 

by Leon Rosenshein

Language Matters

Hyrum Wright, of Hyrum’s Law, posted what is, in my view, a valid, reasonable, and realistic question. Where do the responsibilities lie when a change to a core library/service/tool that is demonstrably good for many breaks tests for some? Is it with the person/team that made the fix? The person/team that wrote the test? The person/team that set things up so that nothing can be released with a failing test? And what should the next steps, immediate and long term, be anyway? Because in an automated, coupled system with gates to prevent bad things from happening, this good thing either brings the system to a halt or gets delayed, which means value isn’t being delivered. The right answer, as is often the case, is “It depends”.

In a small (whatever that means) project it’s a non-question. It’s the same team, so just fix it. If it’s a periodically released, externally maintained thing, then fixing the issue is just part of the periodic upgrade tax. It’s only when you have a large project, with a team of teams, that the issue of responsibility becomes something to solve for. In theory, in the Team of Teams everyone’s responsible. In practice, when everyone’s responsible, it will often appear that no-one is responsible. Dealing with that conundrum is a topic for next year though.

The thing about language though is that how you ask the question is as important as the question itself. In the original post Wright asked whose fault the broken tests were. Based on later discussion, and the way I took it, “fault” was shorthand for “responsible for fixing”. But that’s not what the question, as written, asked.

The literal question was about blame. That took the discussion in the team dynamics direction: talking about dysfunctional teams and how getting into that situation is even possible. Blame is not productive or conducive to fixing problems and preventing them from happening again in the future. It might have a short term impact, but all it does is push the problem further underground and make finding/fixing it harder next time.

Our jobs are to deliver as much value as possible as quickly as possible. Playing the blame game makes that harder. That’s why language matters.

by Leon Rosenshein

Core Services

Ran across an article that says core services are a bad idea. Given my background in infrastructure and things like BatchAPI I got a little upset at the title. But I’m a tolerant person and open to learning, so I decided to read the article.

Turns out I don’t disagree with the article, but I’d probably give it a different title. What the author is calling Core Services is what I’d call shared global data. It’s the difference between platforms that services are built upon and services that are built into other services. BatchAPI, and platforms in general, come under the heading of Domain Expertise. The author says, and I agree with them, that it does make sense to have a centralized, core team to provide that expertise as a core platform.

On the other hand, consider the “User” database. Sure, there’s a service in front of the database, but really it’s just global data masquerading as an RPC call. It’s not that the data shouldn’t be shared, it’s that direct, synchronous sharing like that adds all sorts of complexity and cross-domain coupling that’s not always needed. Sometimes it is needed; the biggest reason is when you need immediate consistency. You’ll need to design for those cases, but in most cases it’s not really needed. RPCs and user interactions are already asynchronous, and there’s nothing you can do to prevent the user from changing their mailing address for their account milliseconds after you print the address on the monthly billing statement. You need to handle that some other way anyway.
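For illustration (the names and shapes here are invented, not anything from the article), the asynchronous version looks something like this: the billing domain keeps its own copy of the addresses it cares about, fed by change events, instead of making a synchronous call into the user domain for every statement.

```python
from dataclasses import dataclass


@dataclass
class AddressChanged:
    """Event published by the user domain whenever an address changes."""
    user_id: str
    new_address: str


class BillingAddressBook:
    """Billing's own copy of the data it needs; no synchronous cross-domain call."""

    def __init__(self) -> None:
        self._addresses: dict[str, str] = {}

    def apply(self, event: AddressChanged) -> None:
        # Consumed asynchronously (queue, log, whatever); billing converges on its own schedule.
        self._addresses[event.user_id] = event.new_address

    def address_for_statement(self, user_id: str) -> str:
        # Might be milliseconds stale, but so is any answer from a synchronous RPC
        # by the time the statement comes off the printer.
        return self._addresses.get(user_id, "<address unknown>")


book = BillingAddressBook()
book.apply(AddressChanged("u123", "42 Main St"))
print(book.address_for_statement("u123"))
```

The copy can be slightly stale, but as the billing statement example shows, the synchronous answer was already stale by the time you used it, so you have to handle that case either way.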

The idea is that by defining strong bounded contexts and only crossing the boundary when it's truly needed you decrease design and development time coupling. Reducing that coupling increases velocity, which means producing more value sooner.

by Leon Rosenshein

Constraints and Creativity

“A puzzle is a problem we usually cannot solve because we make an incorrect assumption or self-imposed constraint that precludes a solution.”

“Creativity is the ability to identify self-imposed constraints, remove them, and explore the consequences of their removal.”

    -- Russell Ackoff

Speaking of systems and systems thinking, what you don’t think about is just as important as what you do think about. Sometimes more so. Our experience puts lots of constraints on our thinking and in most cases it makes things simpler. Sometimes though, what you know can hurt you.

It’s called the Einstellung effect. In the classic water jug experiment, people were asked to measure out a set amount of water using 3 different sized jugs. The problems are presented so that a consistent pattern works for the first set; then, in later problems, there’s a more efficient way, but most people stick with the pattern. It’s not wrong, but it’s not the best answer.
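To make it concrete, here's a later-round problem of that sort as a quick worked example: jugs of 23, 49, and 3 units, with a goal of 20. The trained pattern still works, but so does a much shorter one.

```python
# Jug capacities and target from a later-round water jug problem of the Luchins type.
A, B, C = 23, 49, 3
TARGET = 20

# The pattern learned on the earlier problems: fill B, pour off A once and C twice.
trained = B - A - 2 * C
# The simpler answer most people miss once the pattern has set in: fill A, pour off C once.
simple = A - C

print(f"trained pattern B - A - 2C = {trained}")  # 20
print(f"simpler answer  A - C      = {simple}")   # 20
assert trained == TARGET and simple == TARGET
```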

Another place you might run into it is the XY Problem. You have a problem you want to solve and start down a solution. You know almost everything you need and are pretty sure your solution will work. You get most of the way there, but run into an issue. Now you’ve got a new problem, so you ask for help with that. Solving that problem might be the right answer, but very often I get to ask my favorite question, “What are you really trying to do?” and that leads to a better solution to the original problem.

Being aware of the Einstellung effect is nice, but awareness isn’t a solution. Avoiding it takes work. Work to notice that it’s happening and work to do something different.

Sometimes it’s easy to notice. If you’re having trouble finding a solution to a problem that seems like it should be straightforward, that's a good sign. Another is when you get stuck implementing a solution on something that doesn’t seem to be connected to the real problem. The most subtle though, and the one to be most alert for, is patterning. If you’re applying the same solution to all of your problems that might be a sign you’re stuck.

Once you realize that you might be stuck, there are things you can do to break the cycle. The first is to simply step back. Take a break from that problem and do something else that lets your mind handle things in the background. Take a walk. Take a shower. Anything but pound your head against the same spot on the wall again.

A different option is to do something to break the pattern. Maybe brainstorming. Maybe think about all the weird Rube Goldberg ways you might solve the problem. How would you do it brute force? How would you do it in a completely different environment? What if you needed to solve the problem without whatever tool you’re currently thinking about using?

Finally, you can check your assumptions. What previous problems is the current one like? How is this problem different from all others? What are you assuming about the problem space? The solution space? The available options? Did someone tell you how you need to solve the problem or were they just offering a suggestion?

Bottom line. Removing self-imposed constraints won’t solve the problem, but it will open up new possibilities for you.

by Leon Rosenshein

Beware the RAT

Not rodentia, although Pizza Rat should be avoided, but Rework Avoidance Theory (RAT).

Back in the before times you shipped a product in a box, marketed the hell out of it, and hoped people bought it. What was in the box was it. There was never going to be more to it. If enough people liked it you might make a new version in a couple of years and try the same thing. If not, you moved on to some other idea, put that in a box, and tried to sell it. The cycle continued.

At the heart of this process was a thing called Milestone 0. Take a big chunk of the available time and dedicate it to planning. Not just figure out where you wanted to be or what the end result was, but detailed planning. What things to do. How long they would take. The order to do them in. Who would do them. When everything would be done. Good planners even accounted for vacation time.

The resulting plans were a marvel. Everything needed was accounted for. Nothing extra was scheduled. Nothing would need to be changed. The end result would be exactly what was expected at the beginning. It’s a wonderful idea. No wasted effort. No missteps. No rework. That’s what Big Design Up Front (BDUF) is all about: being as efficient as possible.

Of course, about the time the first line of code was written the plans were out of date. By the time the product shipped the difference between the plan and the actual was usually bigger than the plan itself. There was lots of wasted effort. Lots of missteps. Lots of rework. Because we didn’t know what we didn’t know. The only way to learn was by trying, and then we’d apply the lessons learned. By doing rework.

It turns out that trying to avoid rework without perfect knowledge isn’t a way to avoid rework, it’s a way to push your rework off to the future. To a time when it’s harder to do, costs more to do, and has far greater impact on things.

Instead, accept that there will be rework, and use that acceptance to guide your path. Make decisions as late as possible. Make them as changeable as possible. Don’t make a plan, make a sea chart.

One of the best ways to do that is to take many much smaller steps. But that’s a topic for another time.

by Leon Rosenshein

OODA and Systems Thinking

Lately I’ve been thinking about systems thinking a lot. The difference between complicated systems and complex systems. How you approach them and how you work with them. You can control a complicated system. It might take a while to find the right lever to pull or knob to turn, but once you figure it out you can control it pretty tightly with a feedback loop. Because a complicated system’s responses are (generally) linear and predictable.

Complex systems are different. You work with a complex system. You don’t control it. You can characterize its responses to small changes in constrained environments, but they’re hard to generalize about. Because complex systems have feedback loops in them already. And those feedback loops aren’t linear. They can be exponential. They have inflection points. They can change direction. They’re dependent not only on the initial conditions and the change applied, but also on the rate of change.

Or to put it more concretely, with a complicated system, you know how the system will react to a change, and a person with lots of experience with the system has a high likelihood of knowing what to change, and how much to change it, to get a desired response. On the other hand, with a complex system, having lots of experience with a given set of conditions and making small changes doesn’t give you high confidence about what will happen if you make large changes, nor does it tell you how small changes will impact the system if you start from a different state.

A complicated system, no matter how Rube Goldberg it is, can be boiled down to a series of runbooks, checklists, or decision trees. If this, then that. After A, always B. When the next step doesn’t happen you know exactly where to look to find the problem, and the solution is usually obvious. Executing the fix might be hard, but knowing what it is, not so much.

For complex systems it’s different. You might have an idea about how the system will respond to a change, but unless you’ve seen the exact same situation before then you don’t really know what will happen. That’s where the OODA loop comes in. I first learned about the OODA loop from fighter pilots back in my flight simulator days, but it really applies to any situation where you can’t predict exactly what the response to an input will be. 

OODA is short for Observe, Orient, Decide, and Act.

[Figure: simple Observe, Orient, Decide, and Act loop]

But that simple loop just glosses over the critical details of the OODA loop. In its full form the OODA loop is a complex system in and of itself.

[Figure: complicated Observe, Orient, Decide, and Act loop, with feedback]

That’s because it takes a complex system to deal with a complex system. OODA is all about interacting with the environment, understanding how things are changing, both from the system’s own inherent tendencies and in reaction to your actions, and then modifying those actions to drive the system to the state you’re looking for.

Luckily, while OODA is a complex system, it’s really not that complicated of a system. While it has lots of feedback and feedforward loops, the process itself is straightforward. Where it gets hard is in the observe and orient steps. It takes effort and intellectual honesty to avoid biases. To see things as they are and as they are evolving, not how we want or expect them to be.
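In code form, the loop itself really is short. Here's a minimal sketch; the hooks are placeholders you'd supply, and honest orientation is where all the hard, bias-fighting work lives:

```python
import time


def ooda_loop(observe, orient, decide, act, done, interval_s=1.0):
    """Minimal OODA skeleton: each pass feeds what just happened back into the next orientation."""
    model = None  # our current understanding of the system
    while not done(model):
        observations = observe()             # what is the system actually doing right now?
        model = orient(model, observations)  # fold the new facts into our picture, biases and all
        action = decide(model)               # pick the next nudge based on that picture
        act(action)                          # apply it; the result shows up in the next observe
        time.sleep(interval_s)
    return model
```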

So next time you run into a complex system that you’re trying to manage, Observe, Orient, Decide, and Act. And repeat as necessary.

by Leon Rosenshein

Scrape To Taste

our system of make-and-inspect, which if applied to making toast would be expressed: “You burn, I’ll scrape.”

-- W. Edwards Deming

After 2 weeks at sea, just a quick thought on testing and quality. You can’t test quality into a product or system. I think testing is critical. Unit tests, integration tests, exploratory tests, visual tests, and sometimes, fun-ness tests. But as important as they are, all you can do with after-the-fact testing is make sure that you don’t inflict known defects on your users. That’s critical, and we need to make sure we do enough of it, but we can’t rely on it.

I’m not sure when Deming first said it, but I’ve seen it quoted as far back as 1991. If you want consistent quality output you need to design the system to produce it, not just throw out things that don’t meet the requirements. If you want to consistently and quickly deliver a quality product you need a culture that values that quality. Otherwise you end up moving faster and delivering value later.

by Leon Rosenshein

Better Code

Ran into a really interesting video the other day by Woody Zuill (of Mob Programming) and Llewellyn Falco. Lots of good things to learn there. I’ll leave it to you to watch for yourself. It’s the meta point that I find really interesting.

The meta point is that you don’t need to understand code to refactor it, and that by refactoring you gain understanding. Then, with understanding, it’s easier, and overall faster, to do the thing you want to do.

It comes down to what Martin Fowler said many years ago:

“Any fool can write code that a computer can understand. Good programmers write code that humans can understand.”

We’ve all been there. We run into code, often code we wrote ourselves, that, for any number of good reasons, like lack of understanding, shifting requirements, or tight deadlines, isn’t as understandable in retrospect as we thought it was when it was written. That’s when you’ve got a choice.

You can beat the code into submission or you can use all of the tools at your disposal to make the code understandable and then make the change. It might make things take a little longer, but probably not. It will make it easier to move forward next time. Because that’s what we’re really trying to do. Reach the finish line sooner, not go as fast as possible right now.
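A tiny, contrived example of what that can look like. Nothing about the behavior changes, just the names and the shape, and the understanding comes from doing the renames and extractions:

```python
# Before: technically correct, but you have to decode it every time you read it.
def calc(d, r, n):
    return d * (1 + r / 12) ** (12 * n)


# After a few safe, mechanical refactorings (rename, extract), the intent is visible.
MONTHS_PER_YEAR = 12


def future_value_monthly_compounding(deposit: float, annual_rate: float, years: int) -> float:
    periods = MONTHS_PER_YEAR * years
    monthly_rate = annual_rate / MONTHS_PER_YEAR
    return deposit * (1 + monthly_rate) ** periods


# Same inputs, same answer; only the readability changed.
assert calc(1000, 0.05, 10) == future_value_monthly_compounding(1000, 0.05, 10)
```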

As we head into the holiday weekend, I’m thankful for all of you who put in that little bit of extra effort in the past or will do it in the future to help us get to the finish line (and make my life easier) that much sooner.

Happy Turkey Day and see you all in 2.5 weeks.

by Leon Rosenshein

Walking Through The Code

I’ve been thinking about code quality a bunch lately and the kinds of metrics that exist for how “clean” code is. Things like cyclomatic complexity, maintainability index, nesting depth, method length, and Halstead volume. They all provide some info, but they’ve also got issues. While the source code is the ultimate source of truth for what is, it’s by no means the full story. It can’t tell you what problem it’s solving, why it is the way it is, why it isn’t some other way, or what the foreseen but not yet handled problems are. And it’s particularly bad at helping someone understand the gestalt of it.

That’s where documentation comes in. The typical README.md can be good for the whys. Why is it this way? Why isn’t it that way? Why are the domains split the way they are? There’s also usually some background on the problem being solved. Maybe even something about how the component can or should fit into the bigger picture. These are all great things and we should write them down for all of our components.

One thing I rarely see though, and wish I saw more of, is a walkthrough. How are control and data flow embodied in the actual code? A walkthrough combines the details of class definitions and control logic with the high level design by following a specific call or command from the user’s request until the work is done. You’ve done this as part of debugging. Before you can understand what went wrong with a bug, you have to step through the code to understand how the happy path works. Only then can you try to figure out where things went wrong and do something about it.

You’ve certainly done this by yourself, and probably sometimes with someone more familiar with the code. It’s almost always easier to find your way through the code with a guide. The guide can help you stay on the main path and point out interesting and relevant things nearby along the way. They can provide context that will help you understand the specific situation you’re looking into. The problem is, there might not be a guide available when you need one.

That’s where the walkthrough comes in. It’s not just a list of facts and details. Just like a museum self-guided tour, the walkthrough gives you an overview and context but lets you dig into the thing that is most important to you. Which makes things easier for you, the maintainer. Even if you were the one who wrote the walkthrough in the first place.
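As a sketch of what that can look like (everything here is invented for illustration), a walkthrough can be as simple as a short, ordered trail that sits right next to the code it describes:

```python
"""
Walkthrough: what happens when a user submits an order (hypothetical service).

1. create_order()  parses and validates the request at the edge.
2. place()         applies the business rules; this is where the interesting work is.
3. persistence and notification hang off of place() in the real system.

Start at step 1 and follow the calls. Anything not on this path is a detour.
"""


def create_order(request: dict) -> str:
    order = validate(request)  # step 1: reject bad input before it goes any further
    return place(order)        # step 2: hand off to the business logic


def validate(request: dict) -> dict:
    if "sku" not in request or request.get("quantity", 0) <= 0:
        raise ValueError("a sku and a positive quantity are required")
    return request


def place(order: dict) -> str:
    # step 3 (persistence, events) would be called from here in the real system
    return f"order-{order['sku']}-x{order['quantity']}"


print(create_order({"sku": "widget", "quantity": 2}))
```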