Recent Posts (page 32 / 65)

by Leon Rosenshein

Hindsight

Things go wrong. We make mistakes. We don't always consider all the possibilities. New, different things can happen. The goal is to do better in the future. That's where Root Cause Analysis comes in. And hindsight is 20/20, so once we've done the RCA we're done, right?

First, as I talked about last year, there's a difference between the proximate cause and the root cause. We need to avoid the trap of thinking that the root cause is the first time a person could have done something different to prevent the issue. It almost never is, so we need to keep digging.

Second, while our view of the past may be 20/20 (it probably isn't, but that's a different issue), it also tends to suffer from target fixation. As part of mitigating/solving the problem and then preventing it, we (hopefully) wrote down all the steps taken and noted problems along the way. And if we're really on top of things at the post-incident review we even created tasks on the backlog for the longer-term things that need to happen to prevent a recurrence.

But what we almost never do is document the struggles we had diagnosing the problem and getting to the root cause. We don't document/remember the dead ends we followed, the tools we built (or wish someone else had built), or the side learnings. Even when doing RCA, we follow the happy path through the solution and forget about many of the possible error cases.

And that's bad. For a bunch of reasons. Those dead ends are probably incidents waiting to happen. Yes, this time the problem wasn't bad input from a sensor we weren't properly filtering, but it can still happen. Unless we fix the underlying issue, by capturing it in the backlog and then completing it, in the fullness of time it will happen. And then we'll go through the whole process again.

At which point we'll reach for the same tool we wished we had but didn't, and find that it's still missing. So we'll throw something together. Which will do the trick, but slow us down. Which means that we should have put building that tool on the backlog as well. As a side benefit, I'll bet you'll find that there are other uses for that tool too.

When you're dealing with an incident and doing RCA you're an explorer. While it's possible that you got lucky and your path to the root cause was simple and straightforward, more likely it wasn't. And each of those bends and twists is a part of the map that currently says "terra incognita", but you've got an opportunity to update it and make the blank spot smaller. Next time someone reaches that spot and has a working map they'll be happy you did. And that person might even be you.

So remember. Even when dealing with errors there's more to life than the happy path, and there are lots of learnings to be had there if you keep your focus broad enough to see them.

by Leon Rosenshein

Naming Is Hard

There are only two hard things in Computer Science: cache invalidation and naming things.
-- Phil Karlton

We all know that. So what do you do? First, be consistent. A foolish consistency may be the hobgoblin of little minds, but this kind of consistency is not foolish.

Names, method or member, internal or external, form the basis of a ubiquitous language. Language lets us communicate what’s inside our heads with others, and having a consistent language reduces cognitive load. And as I’ve mentioned before, reducing cognitive load is a good thing. For example, if you decide to have a method called Length() that gives you the number of entries in an array, then you should have the same for sets, lists, maps, and any other collection. Calling it Length() in some, Count() in others, and having a member variable called NumberOfItems for one will drive your users insane.
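A minimal sketch of that kind of consistency, in Go. The collection types here (Stack, IDSet) are made up for illustration; the point is that every collection answers “how many entries?” with the same name, so callers never have to check each type’s spelling of the same idea.

```go
package main

import "fmt"

// Counter is the one, consistent answer to "how many entries?".
type Counter interface {
	Count() int
}

// Two hypothetical collections, each implementing Count the same way.
type Stack struct{ items []int }

func (s Stack) Count() int { return len(s.items) }

type IDSet struct{ ids map[int]struct{} }

func (s IDSet) Count() int { return len(s.ids) }

func main() {
	collections := []Counter{
		Stack{items: []int{1, 2, 3}},
		IDSet{ids: map[int]struct{}{7: {}, 9: {}}},
	}
	// Any collection can be asked its size without consulting the docs.
	for _, c := range collections {
		fmt.Println(c.Count())
	}
}
```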

And not just consistency in names, but in style. Again, whether it’s CamelCase, SCREAMING_SNAKE_CASE, or kebab-case, pick one and stick with it. And when I say pick one, I don’t mean pick one for yourself and use it everywhere, I mean pick one for something with very clear boundaries, like a team, or a company, and then stick with it. Consistency is your friend.

That’s just the first thing though. If you are consistent your customers will eventually figure out your pattern/style and be able to follow along. Having a good name is also important. Something that lets the reader know immediately what the thing is/does. So what goes into a good name?

It should be Short, Intuitive, and Descriptive. Something like documentPageCount, sectionPageCount, and chapterPageCount. Just from looking you can tell that they’re page counts for different parts of a document. It’s a Count, so it measures things. You don’t expect it to change as you move around in the document like pageNumber might.

Manage your context. Think about those different counts. The first word sets the basic context, and each word after gets more precise. By the time you get to Count you know exactly what you’re talking about. And always be consistent (there’s that word again) with which way the context increases. Order matters.

Speaking of context, avoid stuttering. If you have a context, use it to disambiguate. Don’t make the name longer to re-add the context. golint even has a warning just for that case: type name will be used as pkg.PkgType by other packages, and that stutters.

And finally, if you’re doing something, be very clear about what you’re doing. Put the verb up front. And be precise. If you’re updating an element in a collection by overwriting it, call it overwriteElement or updateElement. If you’re replacing an element in a collection then call it replaceElement. And whatever you do, don’t call it createElement if the element already exists.
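Here’s a sketch of what those precise, verb-first names promise. The Store type and its methods are hypothetical, but each name commits to a different contract about whether the element already exists:

```go
package main

import (
	"errors"
	"fmt"
)

// Store is a hypothetical keyed collection used to show the contracts.
type Store struct{ data map[string]string }

// CreateElement promises a *new* element: it fails if the key exists.
func (s *Store) CreateElement(key, val string) error {
	if _, ok := s.data[key]; ok {
		return errors.New("element already exists")
	}
	s.data[key] = val
	return nil
}

// OverwriteElement makes no promise about prior state: it always writes.
func (s *Store) OverwriteElement(key, val string) {
	s.data[key] = val
}

// ReplaceElement promises something was there: it fails if the key is missing.
func (s *Store) ReplaceElement(key, val string) error {
	if _, ok := s.data[key]; !ok {
		return errors.New("no element to replace")
	}
	s.data[key] = val
	return nil
}

func main() {
	s := &Store{data: map[string]string{}}
	fmt.Println(s.CreateElement("a", "1")) // succeeds: "a" is new
	fmt.Println(s.CreateElement("a", "2")) // fails: "a" already exists
	s.OverwriteElement("b", "3")           // always fine
	fmt.Println(s.ReplaceElement("c", "4")) // fails: nothing to replace
}
```

A reader of the call site now knows the behavior without opening the implementation, which is exactly the cognitive-load reduction consistency buys you.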

But again, and most importantly, be consistent. This is another example of when not to surprise your customer.

by Leon Rosenshein

Another Year In Review

At least I hope the year ends tonight, but I'll have to check covidstandardtime.com tomorrow to be sure. Going with the assumption that the year does end, some thoughts on the year that was.

It started like any other year. Lots of interesting challenges, what with shutting down one DC and turning up another. Lots of new tech to learn about and fit together. Then things changed. And they didn't change. Life went on. Work went on. We still turned down one DC and turned up another. Learned a lot about Authentication, AWS, Kubernetes and Prometheus, *and* how to fit them all together into a coherent whole. And I wrote a bunch of daily dev topics, over 225 of them. Here are a few of them, presented in chronological order, that either I liked or that got a lot of interest. Enjoy, and see you next year.

Jan 15th Naming Things

Feb 25th Breaking New Ground

Apr 29th Fear, Courage, and Professionalism

Jun 30th Vorlons vs. Shadows

Aug 3rd Careers

Sep 30th OKRs and KPIs

Oct 13th Silos vs Ownership

Oct 21st Stories vs Tasks

Oct 27th Commits vs PRs

Nov 20th Implicit vs Explicit

Nov 25th Take Time


by Leon Rosenshein

Unhappy Customers

Back at the beginning of the year, in the before times, I wrote that APIs are for life, and that if you change something, even if you make it better, you're likely to have unhappy customers. And that's true even if what you've changed isn't a public API.

One of the things we did with Combat Flight Simulator 3 (CFS3) was improve the graphics. We took advantage of newer hardware to increase resolution and fidelity. We did this in two main areas. The aircraft themselves and the terrain engine. We knew that the FlightSim community had already created hundreds of aircraft, and we wanted to allow them to be used, so while we improved rendering and animations on the aircraft we made sure to handle last year's formats as well. Since CFS3 had a lot more air-to-ground action than earlier versions we also wanted to make sure that we could get sufficient detail to get a sense of speed and action everywhere. Because we started with the FlightSim codebase and data set we already had the entire world covered, so we didn't worry about add-on scenery.

According to our customers, we chose poorly. Even though we didn't officially support 3rd party updates to the scenery, a small portion of the customer base had figured out how to add scenery in earlier versions. They made changes not only for themselves, they shared those mods with others. There were even a few companies that had commercial products with that scenery. Which meant that when those add-ons didn't work, we didn't just have a few hard-core fans who were upset. All of their customers were upset with us too.

This was in the days of box products, so it was too late to do anything for CFS3, and that bad taste certainly impacted sales. Because of the low sales and reduced profit (CFS still made money, but not as much as expected) CFS4 was cancelled. Was that entirely because we changed a private API? Of course not. The entire market was down, flight sims were losing favor, and Microsoft was focusing on higher return on investment. But if we hadn't made that change and had happier customers and better press things might have turned out differently.

We did take the lesson to heart though, and while we incorporated many of the improvements to the terrain engine into the next version of Flight Simulator we took the extra time to ensure that the improvements worked with the old formats instead of replacing them. Flight Simulator was much better received, and managed to survive for 2 more versions and some add-ons over 5 years before it too was shut down for not having a high enough profit.

So remember, surprising your customers isn't always a good thing.


by Leon Rosenshein

Java Isn't Dying

Today's entry in the Betteridge file: Is Java Dying? True to Betteridge's law, no. It's one of those perennial questions, and the answer has always been the same. It fills a need, and it's been deployed to production in countless places. And as COBOL reminded us yet again this year, if it's in production it will never die.

And if you look at the Java ecosystem, it's actually growing. JVM languages continue to grow and evolve. Scala is a fairly functional language on the JVM, and Apache Spark is written in Scala. For you LISP fans there's Clojure. If you're developing for Android there's Kotlin. And of course, if you want COBOL, Micro Focus has Visual COBOL. So developing in/on/with Java is far from dead.

And with all those different languages, there are some deeply ingrained Java/JVM use cases. For example, a lot of distributed processing, from general MapReduce to ML training to Log processing uses PySpark. Yes, there's a lot of Python written to process the data, but under the hood it's Spark and the JVM (Scala) doing the heavy lifting.

Even if you don't use PySpark you're using Java every day at work, even if you don't realize it. Our monorepo is on the large side, and keeping track of dependencies and what's up to date vs what needs to be built is rather complicated. So we use Bazel. And Bazel is written in Java.

Yes, Java is verbose. So what? You've got an IDE to help you. A bigger issue is dependency management. Yes, transitive dependencies can be hard to manage, and sometimes they can even lead to incompatibilities, where one dependency you have transitively includes version X of something and another includes version Y, but there's no version that makes them both happy (I'm looking at you, guava). But you know what? Java at least has a way around the problem: the shaded JAR. If you need to, you can include both versions and the correct one will be used by each part. If you think that's ugly, try looking at npm and what gets dragged into the simplest JavaScript web app. Now that's scary.
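For the curious, shading is usually done at build time by relocating one copy of the conflicting library into your own namespace. A hypothetical Maven sketch (the shadedPattern namespace is illustrative, not a real package):

```xml
<!-- Relocate ("shade") the bundled guava classes into our own namespace so
     they can't collide with whatever guava version other dependencies need. -->
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <executions>
    <execution>
      <phase>package</phase>
      <goals><goal>shade</goal></goals>
      <configuration>
        <relocations>
          <relocation>
            <pattern>com.google.common</pattern>
            <shadedPattern>com.example.shaded.com.google.common</shadedPattern>
          </relocation>
        </relocations>
      </configuration>
    </execution>
  </executions>
</plugin>
```

The bytecode references get rewritten along with the class files, so each part of your app quietly uses the version it was built against.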

But really, none of this matters. Because the programming language you use is a tool. A means to an end, not the end itself. I'm not using Java right now for my current work, and I probably wouldn't choose it if I found myself in a situation that really was a blank slate and there was nothing pushing me in a direction, but really, how often does that happen? I'm working with Kubernetes, which pushes me towards Go. There's also a lot of libraries/functionality around Kubernetes and gRPC which makes Go a good choice. When we built/ran Opus, which grew up in a Spark/Hadoop world, we used Java and Scala because that's what the rest of the tools we needed to talk to used. And it was very successful.

So don't worry if the language is alive or dead. Look at what's being used to solve the problems you're solving by the people you'll be solving them with. And use that language. It's good to have the discussion about what language/framework to use, but if there's something in use, you should probably use it.

by Leon Rosenshein

On Branching

There are lots of different branching strategies, and they all have their strengths and weaknesses. There is no one right strategy. Don't let anyone kid you. There are, however, strategies that are wrong, especially strategies that are wrong for a specific time and place.

Given that there is no consistent best strategy, how do you pick which strategy to use? The answer is highly dependent on why you're making the branch. Unless you know that, you can't make a good choice. The other thing you need to keep in mind when deciding how to branch is your exit strategy. How (if ever) are you going to merge the branch back in, and what happens to the branch after that first merge?

Are you optimizing for individual developer velocity? Product velocity? Size of development team? Stability? Each of these will push you towards a different strategy.

Long lived feature branches make it really easy for a developer to get a feature to a stable point for a demo, and that developer will feel really good about their progress. Right up until they need to share their work. Because the longer they've worked, the further their codebase has diverged from the shared view. If they're lucky their work is isolated enough that conflicts, physical and logical, are small, but that's usually not the case. So you pay for that individual velocity at the end when it comes time to bring things back together and someone, usually a build team, has to deal with those conflicts.

The most common way to avoid that divergence is to keep branches short lived, on the order of days or less. After all, how far can two branches diverge in a couple days? The corollary to that is of course, how much progress can you make in a couple of days? So lots of tension there. But there are ways to make it easier. Clear boundaries/interfaces. Feature Flags. Small changes. Lots of automated tests. At the limit this is Continuous Integration/Continuous Deployment.
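Feature flags are what make those short-lived branches practical: unfinished work can merge to trunk and ship dark. A minimal sketch in Go (the flag names and Flags type are assumptions, not a real library):

```go
package main

import "fmt"

// Flags is a toy feature-flag store; real systems read these from config
// or a flag service so behavior can change without a redeploy.
type Flags map[string]bool

func (f Flags) Enabled(name string) bool { return f[name] }

// greetingBanner has the new code path merged to trunk, but it only runs
// where the flag is on. The old path keeps working everywhere else.
func greetingBanner(f Flags) string {
	if f.Enabled("new-banner") {
		return "Welcome to the new experience!"
	}
	return "Welcome!"
}

func main() {
	prod := Flags{"new-banner": false} // merged, still dark in production
	canary := Flags{"new-banner": true}
	fmt.Println(greetingBanner(prod))
	fmt.Println(greetingBanner(canary))
}
```

The branch can merge the moment the code compiles and passes tests, because turning the feature on is a separate, reversible decision.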

On the other hand, consider the commercially released product, nominally shipped in a box. Or maybe the firmware inside a physical LIDAR product. It's going to be out in the field for years. And you need to support it. And probably make some fixes/modifications. But the hardware itself won't change, and you want to be able to support new hardware/functionality going forward. So you create a release/support branch for that work. That branch is very static. And it's got a simple merge strategy. Never. Things that happen in that branch never go anywhere else directly. You might make the same logical change, but you don't do it as a merge. That lets you continue to do new cool things while supporting the old. In that case you've got a clearly defined reason for the branch and an end state. Go ahead and branch with confidence.

As an organization we've chosen, rightly I believe, to use trunk-based development and optimize for overall product velocity. That lets us do lots of good things. Like minimizing the time between getting a change approved and the time it's in production. Like minimizing the time between a change being done and it being available to others to build on. Like alerting us to problems early.

Taking full advantage of it requires some discipline on the part of developers. To write and maintain good tests. To have and maintain clear boundaries. To use feature flags. To keep changes small. To reduce side effects of changes. All things that are straightforward to do.

We just need to do them.

by Leon Rosenshein

RCA

RCA can be a lot of things. Most of the usage of RCA I’ve seen has something to do with the Radio Corporation of America and all of the inventions and spin-offs from it. But more recently, RCA means root cause analysis. As in “Why did that really happen?”

As in “Why did the program crash?” Easy. Like it says right there in the stack trace:

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x4a52c6]

goroutine 1 [running]:
main.log(0x0, 0xc0000a9f68, 0x1, 0x1)
	/tmp/sandbox483371455/prog.go:28 +0x66
main.main()
	/tmp/sandbox483371455/prog.go:23 +0x74

Program exited: status 2.

So the fix is to test if format is nil and drop the message or log an error. That will always work. But is that the root cause? That’s certainly the proximate, or immediate, cause, but it doesn’t tell the whole story. Why did it really panic? I mean, the program usually works (outside the playground, anyway; it always works there, but that’s a different issue). But is that the right fix? To know that you need to know why it really failed.

Figuring that out is doing root cause analysis. And figuring out the root cause is crucial to making sure it doesn’t happen again. One way to figure out the root cause is to keep asking why.

Why was there a nil pointer reference on line 23? Because nil was passed in. Why was nil passed in? Because it wasn’t set in the conditional. Why wasn’t it set in the conditional? Because the conditional checks for before/after noon, and before/after 5PM, but ignores those two specific times. Why are those two skipped? Because English is a slippery language and computers are very literal. The requirements say “Greet with good morning before noon and good afternoon after noon”. The code does that. The requirements don’t say what to do AT noon, and the code does nothing. Now depending on how you define noon, the chances of the test happening exactly at noon go down, but because computers are discrete, there’s always that chance of testing exactly at noon.

In this case the right answer is probably to check for hour <= 12 instead of hour < 12, but you’d have to go back to the user to be sure. Or you could go for the snarky developer solution and set the default format to ‘Good <indeterminate time of day because the specs were incomplete> %s\n’.
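A minimal reconstruction of the kind of code being described. The function names and exact conditionals are assumptions, but it reproduces the chain: exactly noon falls through every case, the format stays nil, and the *string parameter carries the nil all the way to the panic.

```go
package main

import "fmt"

// log takes a *string, so a nil format panics at the Printf call — the
// proximate cause the stack trace points at.
func log(format *string, name string) {
	fmt.Printf(*format, name)
}

// greetingFormat has the root-cause bug: before noon and after noon are
// handled, but exactly noon matches neither case and leaves f nil.
func greetingFormat(hour int) *string {
	var f *string
	morning := "Good morning, %s\n"
	afternoon := "Good afternoon, %s\n"
	switch {
	case hour < 12:
		f = &morning
	case hour > 12:
		f = &afternoon
	}
	return f // nil when hour == 12
}

// greetingFormatFixed applies the probable fix (hour <= 12), though you'd
// confirm with the user how they define noon.
func greetingFormatFixed(hour int) string {
	if hour <= 12 {
		return "Good morning, %s\n"
	}
	return "Good afternoon, %s\n"
}

func main() {
	fmt.Println(greetingFormat(12) == nil) // the latent panic, waiting
	log(greetingFormat(11), "reader")      // works fine before noon
	fmt.Printf(greetingFormatFixed(12), "reader") // noon is now handled
}
```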

There’s one more why you should probably have asked in this case. Why does log take a *string as a parameter instead of a string? If you fix that problem the compiler will make it much harder to make the mistake. But before you go fix that, you need to understand Chesterton’s Fence. That, though, is a topic for another day.

by Leon Rosenshein

This Would Never Occur To Me

I've written code for Apple ][ hi-res graphics with its odd interleaving and understood what I was doing. I've done (a little) Fortran77 on a Cray Y-MP. I've done low-level graphics, from 2D lines to 3D projections to mip-mapped textures on a CPU, and bashed the results into a manually double-buffered framebuffer. I've written shaders for procedural texture mapping and Phong shading (don't remember much about it, but I know I did it). I've written code to use various versions of SSE on Intel chips, with runtime detection of capabilities. So I'm reasonably familiar with graphics and parallel operations.

I'm also familiar with id Software. I downloaded a version of Wolfenstein 3D from a General Atomics server back in 1992. I've played multiple versions of Doom, Heretic, and Quake. I've followed Armadillo Aerospace.

Which is to say that I've not only experienced the amazing things that John Carmack has put together, but I've done similar things (albeit without the commercial success he's had). But you know what? Some people just think differently. I think I follow what he's saying here, but it never would have occurred to me to do texture mapping that way.

by Leon Rosenshein

ISQ vs ESQ

There's a couple of TLAs for you. But what do they mean, and what do you do with them? Let's start with some definitions. ISQ is Internal Software Quality and ESQ is External Software Quality.

External software quality is what your customers see. How easy/hard is it for them to do the things you promised them they could do with your software. Is it easy to use? Consistent? Reliable? Full of features? Provides useful benefits? Software with high ESQ surprises and delights your users.

Internal software quality is harder to measure and talked about much less often. A good way to think about ISQ is how easy or hard it is to change the software. When you need to add a feature or fix a bug, do you know where to start? How much confidence do you have that when you've finished making the change the change does what you expect, and has no other, unexpected impact? Software with high ISQ may delight developers, but it will never surprise them. 

But which is more important? We often talk about the software triangle, trading off features, quality, and time. And there's more than a little truth to that. But the important thing to remember is that that triangle is, in general, describing ESQ. And you can reduce ESQ overall to get even more time.

ISQ is a little different. Because increasing ISQ makes it easier to improve ESQ. From clear domains and separation of concerns to reduce cognitive load to unit tests and validation to help you be sure something hasn't broken, high ISQ code lets you get more done. Lets you add more value. Lets you be more agile as goals change.

So how do you get high ISQ? That's a question for another day.

by Leon Rosenshein

Long Live The CLI

I've talked about the command line before. How important it is, its history, and I've made some recommendations. And it's still important.

My recommendations center mostly around usability and predictability. Usability, because if you build something and it's not usable, what was the point? Predictability, because while you have one or more use cases in mind, the street finds its own uses for things, and you never know how someone else might want to use it. So let them.

All of that still applies. And it's nice to find someone who agrees with you. It's even better when that person or group takes the time to make it easily consumable. And the Command Line Interface Guidelines (CLIG) are just that. So check them out and keep them in mind next time you write a CLI.