
by Leon Rosenshein

Chaos

Distributed systems are hard. And they're hard in lots of ways. One of them is emergent behaviour, or the idea that simple rule changes can have big impacts. Sometimes good, sometimes bad, but often surprising.

Consider the simple Nginx load balancer. Let's say you've got 5 stateless backends behind a single address. By default Nginx does round-robin load balancing, passing the same number of requests to each backend. Great. You'll end up with consistent, even load. Or not. You only get consistent, even load when all of the requests are consistent. What happens if 20% of your load takes 3x the time to handle? Well, if that ⅕ of your requests arrives in a regular pattern and you have 5 servers, round-robin means one server is going to get most of the slow ones, become overloaded, and fall over. Now you only have 4 servers, which may not be enough to handle the load, so they start failing, and suddenly you have users at the gate with pitchforks and torches.
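To make the failure mode concrete, here's a small simulation (the numbers and the every-fifth-request pattern are invented for illustration, not taken from real traffic) showing how round-robin can concentrate the expensive requests on one backend:

```python
# Simulate 5 backends behind round-robin when every 5th request is slow.
# All numbers here are illustrative, not from a real system.

NUM_BACKENDS = 5
NUM_REQUESTS = 1000
NORMAL_COST = 1   # time units to handle a normal request
SLOW_COST = 3     # 20% of requests take 3x as long

work = [0] * NUM_BACKENDS
for i in range(NUM_REQUESTS):
    backend = i % NUM_BACKENDS                      # round-robin assignment
    cost = SLOW_COST if i % 5 == 0 else NORMAL_COST # slow requests arrive in lockstep
    work[backend] += cost

print(work)  # [600, 200, 200, 200, 200]
```

Because the slow requests arrive in lockstep with the rotation, backend 0 ends up doing 3x the work of its peers while the request counts stay perfectly "balanced."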

But you can't think of everything. So what can you do? That's where chaos engineering comes in. Chaos engineering is the idea that instead of waiting until oh-dark-30 for something weird to happen, you build systems that cause those weird things to happen while you're awake and watching. Then you notice them, figure out how to mitigate and prevent them, and stop worrying about them happening in the middle of the night.

There are lots of things that your chaos injector can do. It can add latency, remove instances, fuzz data, increase load, eat memory or disk, fail a sensor or otherwise muck with inputs and capacity. And it might do them one at a time, or it might do them in combination, because if you run low on memory you can just swap to disk, unless the disk is also full. Then what? It falls over.
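As a sketch of what the simplest kind of injector might look like (the function names, probabilities, and delays here are all made up for illustration), consider a wrapper that randomly adds latency to, or outright fails, a call:

```python
import random
import time

def chaos(fn, latency_prob=0.1, fail_prob=0.05, max_delay=2.0, rng=random):
    """Wrap fn so some calls get extra latency and some fail outright.

    Illustrative sketch only, not a production fault injector.
    """
    def wrapped(*args, **kwargs):
        if rng.random() < fail_prob:
            raise RuntimeError("chaos: injected failure")
        if rng.random() < latency_prob:
            time.sleep(rng.uniform(0, max_delay))  # injected latency
        return fn(*args, **kwargs)
    return wrapped

# Usage: wrap a backend call, then watch how the rest of the system copes.
fetch = chaos(lambda key: f"value-for-{key}", latency_prob=0.0, fail_prob=0.0)
print(fetch("user42"))  # value-for-user42 (nothing injected at these settings)
```

Real injectors do the same thing at the network, host, or orchestrator level, and the interesting findings usually come from combinations rather than single faults.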

For us it's software, but really it's just another application of systems failure analysis. In aerospace we had the "iron bird": hook up as many real parts as you can, simulate or work around the rest, and add an external environment. Then break something and see what happens. More advanced rigs break things with switches and valves, but I've talked to people who have simulated hydraulic system failures by taking an axe to a hydraulic line. Kind of messy, but very realistic.

About 10 years ago Netflix popularized the concept in the software world. Things have come a long way since the original chaos monkey turned off a server just to see what would happen. So think about what adding that kind of testing to your systems might show you.

And in case you were wondering, the solution to the Nginx problem we came up with was to change the distribution policy from round-robin to least-connections. It's a little bit harder for Nginx to keep track of, but it does a better job of balancing the time spent handling requests instead of balancing the count of requests handled. In our case that made a big difference. YMMV
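The change itself is small. Assuming a standard upstream block (the backend names here are placeholders), it's one directive:

```nginx
upstream backends {
    least_conn;   # route each request to the server with the fewest active connections
    server backend1.example.com;
    server backend2.example.com;
    server backend3.example.com;
    server backend4.example.com;
    server backend5.example.com;
}
```

Because a slow request holds its connection open longer, connection count is a decent proxy for time spent, which is what round-robin ignores.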

by Leon Rosenshein

Capabilities

Back in the days of PCs and "PC Clones", Flight Simulator was the de facto standard for compatibility. If your clone ran FlightSim it was golden and everything would work. The FS team took compatibility seriously, and we worked really hard to run on as much hardware as possible, both from a sales perspective (the bigger the addressable market, the more sales) and from a brand-marketing standpoint.

Being compatible with lots of different hardware got much easier when DirectX rolled out. First came DirectDraw for video cards, then DirectSound for audio and DirectInput for input devices. It became Windows' problem, and as long as the Windows hardware folks did their job it was much easier for us game developers. Now, just because all video cards responded to the same API didn't mean they were all the same. They had different amounts of memory, different processors, and, generally speaking, different capabilities. And of course as developers we were supposed to take advantage of all of them AND give the user the ability to turn things on/off if they wanted to trade visual realism for frame rate.

DirectX had a feature that helped us out: capability bits. Basically you could query the driver and it would return a list of the things it could do and the features it supported. Great idea, right? Check what the card can do, then only do those things. Simple.

Or not. Every bit of information in that structure was true. The OEM support team made sure of that. But they didn't validate all the possible combinations. On some cards you could have stencil buffers and you could have depth buffers, but you couldn't have both. Maximum texture memory was accurately reported, but only achievable if you didn't use a depth buffer. Double buffering rarely caused any other degradations, but if you wanted to triple buffer then all bets were off.

So what did we do? We built our own compatibility lab and started to document the interactions between the capability bits. And then we used that matrix to define recommended and possible settings for different graphics cards.
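The compatibility matrix boiled down to a lookup: every bit is true in isolation, but some combinations are known to fail together. A minimal sketch of that idea (the card names, feature names, and bad combinations below are invented for illustration):

```python
# Each capability is true in isolation; the matrix records the combinations
# a given card claims but can't actually deliver together.
BAD_COMBOS = {
    "CardA": [
        {"stencil_buffer", "depth_buffer"},       # fine alone, broken together
        {"triple_buffer", "high_res_textures"},
    ],
}

def supported(card, requested):
    """Return True if the requested feature set avoids known-bad combos."""
    requested = set(requested)
    return not any(combo <= requested for combo in BAD_COMBOS.get(card, []))

print(supported("CardA", {"stencil_buffer"}))                  # True
print(supported("CardA", {"stencil_buffer", "depth_buffer"}))  # False
```

The hard part wasn't the lookup, it was populating the matrix, which is exactly what the compatibility lab was for.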

So in the words of the old Russian proverb, Trust, but verify. Just because you know something is true in isolation doesn't mean it's true in combination with other things.

by Leon Rosenshein

Vorlons vs. Shadows

It was the dawn of the third age of mankind, ten years after the Earth/Minbari war. The Babylon Project was a dream given form. Its goal: to prevent another war by creating a place where humans and aliens could work out their differences peacefully. It's a port of call, home away from home for diplomats, hustlers, entrepreneurs, and wanderers. Humans and aliens wrapped in two million, five hundred thousand tons of spinning metal, all alone in the night. It can be a dangerous place, but it's our last best hope for peace. This is the story of the last of the Babylon stations. The year is 2258. The name of the place is Babylon 5.

Over 25 years ago JMS gave us Babylon 5. One of the first, if not the first, TV series with a pre-plotted beginning, middle, and end. There were lots of low-budget special effects, extensive use of CGI, cheesy costumes and some really interesting haircuts. There was also some pretty deep introspection.

At its heart, B5 was about a group of races coming of age together and telling the previous generation to get out of the way. In the show the old generation was primarily represented by two races, the Vorlons and the Shadows. The Vorlons framed the world in terms of "Who are you?" A place for everything and everything in its place. The Shadows on the other hand framed things around "What do you want?" If you wanted something then just do it. Do whatever you want and let the chips fall where they may.

The thing is, you can't have a sustainable system with either one of those frameworks. You need both, working in concert, to create a dynamic, evolving system. But what does a 25+ year old space opera have to tell us about software development?

One place where you can see the interdependence of those frameworks is in security. You'll often hear security folks talking about AuthN and AuthZ, but what are they, and what's the difference? AuthN is authentication, or "Who are you?" AuthZ is authorization, or "What do you want (to do)?" And it's the interplay of those two things that gets challenging.

Consider the Hadoop ecosystem, HDFS in particular. The HDFS folks did a really good job of separating the two. A flag in a config file on the namenode enables authentication, and a separate flag enables authorization. HDFS authorization is modeled after traditional Unix file perms, and HDFS authentication is Kerberos based. What this means in practice is that almost every HDFS cluster you encounter has AuthZ, but no AuthN. And mostly you never know.

Because as long as everyone is honest and never lies about who they are things work fine. You can only read/write/modify the things your user has access to. It's easy for frameworks and middleware to act on behalf of the user with the authorization of the user. You get a lot of the benefits of security, or at least you think you do. Because with a little bit of research you can pretend to be anyone and do anything you want.

The opposite is AuthN without AuthZ. In that case, again, if your users are trustworthy and never make a mistake you're golden. Plus, you get a reliable audit trail. So when someone makes a mistake you know exactly who did it. You just have no way to prevent it, because everyone can do everything.

If you want real security you need both. AuthN and AuthZ. And I'll contend that AuthN is harder and more important. Because if you can't be sure who you're talking to, and the other side can't be sure who they're listening to, it doesn't matter how carefully you've checked to make sure the caller really can do what it wants.
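As a toy illustration of why you need both (the tokens, users, and permissions here are all made up), consider a request handler that authenticates first and authorizes second:

```python
# Hypothetical credential store and ACL -- illustration only.
TOKENS = {"secret-token-1": "alice"}                  # AuthN: token -> identity
ACL = {"alice": {"read"}, "bob": {"read", "write"}}   # AuthZ: identity -> allowed actions

def handle(token, action):
    user = TOKENS.get(token)         # AuthN: who are you?
    if user is None:
        return "401 Unauthorized"    # we don't know who you are
    if action not in ACL.get(user, set()):
        return "403 Forbidden"       # we know you, but you can't do that
    return f"200 OK: {user} did {action}"

print(handle("secret-token-1", "read"))   # 200 OK: alice did read
print(handle("secret-token-1", "write"))  # 403 Forbidden
print(handle("bogus-token", "read"))      # 401 Unauthorized
```

Drop the first check and anyone can claim to be alice; drop the second and every authenticated user can do everything. You need both gates, in that order.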

by Leon Rosenshein

POC vs MVP

So what's the difference between a POC and an MVP? Let's start with definitions. POC is proof of concept, while MVP is minimum viable product. A POC is something you show your boss or PM to demonstrate that something is possible, while an MVP is something the business team thinks should be put in front of customers.

A POC could be lots of things. It could be a technology demonstrator. It could be re-running a log with a new model, or it could be doing A/B testing with FLAGR. The important thing to remember is that the goal of a POC is to prove that something works, or at least has the potential to.

An MVP, on the other hand, is a product. It's fully supported. It has monitoring and alerting. It has a deployment process, a scale up/out plan, on-call support, and an SLO/SLA.

The trick is to keep them separate. There's always a push to turn a POC into an MVP. After all, you got something that works, how hard can it be to give it to customers? It turns out that it can be very hard, but there are things you can do to make it easier.

The first thing is to make your POC less. The less it looks like an MVP, the less pressure you'll get to release it. CLIs are great ways to drive a POC and make it clear that it's not a product.

The other thing to do is to think about what you'd really need to do to turn your POC into a product. You shouldn't do those things as part of the POC, but think about them and have an idea of how to do them. What kinds of metrics do you want to monitor and alert on? What will need to change for scale? What tests/gates would the CI/CD pipeline need?

Even if you don't get a lot of pressure to turn your POC into an MVP, having a plan will help you move faster and add more business value.

by Leon Rosenshein

Under Construction


A couple of months ago I talked about the builder pattern for object instantiation. Another option is the factory pattern. It certainly has its place. Factories let you isolate the logic and flow of construction from both your code and the thing being constructed. Factories can also give you a kind of dependency injection, which can make testing easier.

The ability to isolate and extend leads me to an article I ran across a couple of weeks ago. In the article the author talks about a Score class that initially can be Low, Medium, or High, and takes an int as the parameter to the constructor. As the author notes, there are oh so many problems with that. What's the high score? What if you try an invalid number? How do you know what number to use? So the author suggests a factory with three methods, CreateHighScore, CreateMediumScore, and CreateLowScore. That certainly solves the "What number should I pass?" problem, but that's about it.

The extensible option adds an enum to the factory and then uses that to create a Score with the right value. Better, but still, WAT? There's got to be a better way.

What I would have done is skipped the factory and switched the Score class from using an int to using an enum. Then there's never any question about what value to use. You just use the enum with the name you want. Of course, over time more and more enum values get added and pretty soon you end up with Low, Medium, High, MediumHigh, VeryHigh, ExtremelyHigh, Highest, and everyone's favorite, EvenHigher.

Of course, that brings up its own set of issues. What's higher, VeryHigh or ExtremelyHigh? How do you deal with that? By making Score comparable (or the equivalent in your language of choice) you can then find the highest of a set or sort by Score as needed.
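In Python, for instance, an IntEnum gives you both the named values and the comparability in one shot (the value names here just follow the article's Low/Medium/High example):

```python
from enum import IntEnum

class Score(IntEnum):
    LOW = 1
    MEDIUM = 2
    HIGH = 3

# No magic ints, no factory: the intent is right at the call site,
# and comparisons come for free because IntEnum orders by value.
print(max(Score.LOW, Score.HIGH))       # Score.HIGH
print(sorted([Score.HIGH, Score.LOW]))  # [<Score.LOW: 1>, <Score.HIGH: 3>]
```

Adding a new level is one line in one place, and every comparison and sort keeps working.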

Doing it that way brings intent to the foreground and reduces cognitive load, which, as I've said before, is always a good thing.

https://uberatg.slack.com/archives/CLVTB4W20/p1586442600018500

https://t.co/jaESC9MurT?amp=1

by Leon Rosenshein

One Step At A Time

You've probably heard of The Phoenix Project. Considered as a novel it ranges from boringly one-dimensional to almost OK. Considered as allegory, though, it's got a lot going for it. Archetypes, broad statements, and simply worded lessons.

One of the simplest of those lessons is Kaizen, or continuous improvement. The idea that the most efficient thing you can do is the thing that makes your daily work more efficient. That's kind of circular, and deserves to be unpacked. At its core Kaizen takes the long view. Do the right thing now to make things work better in the future. The challenge is how to balance the present and the future.

As an infrastructure team we could take all the lessons we've learned over years of interactions and do a green-field build of the perfect system for our needs. Toil away in the back room for a couple of years and then emerge with a fully functional system that scales to meet all current and expected needs. The big bang approach. You can probably guess what the result would be. By the time it was released the world would have changed enough that it didn't meet the actual needs, and we'd spend more time trying to get it right. At best we'd eventually get there. And over those 2+ years we would have delivered no value.

Or, kaizen. Make it better every day. Work towards the end goal of a system that does exactly what we need the way we want it. Take feedback along the way, adjusting the goal to meet customer needs. And add a little value every week or so. And it doesn't have to be a big thing. Even the smallest step along the way will help. Automate something. Make it easier to mark a task as done. 

Unlike technical debt, where compound interest hurts you, value added early compounds over time and ends up making a big difference. So take that small step when you see it.

by Leon Rosenshein

How Old Is Old

Not only is what's old new again, what was once new gets old. And it seems like it gets old faster and faster. When I got my first paying job writing software 30+ years ago I was using FORTRAN77, and it was a 10 year old standard that was in the process of being updated. When I left that job Fortran90 was the new hotness, and we were just starting the conversion process. We were just starting to play around with ANSI C, and C99 was just a gleam in someone's eye. There were some commercially available FORTRAN libraries for complex math and statistics that we used and got occasional updates, but big changes were rare.

Fast-forward a few years to Microsoft. Boxed products with a 2(ish)-year update cycle, and updates were evolutionary. As I've mentioned, Flight Simulator was backward compatible to the beginning of time, as were Windows and Office. Yes, the UI would change and people would get upset, but there weren't a lot of fundamental changes, and no one needed to throw out all the work they'd done and translate to a new system.

Things are a little different now. Frameworks for all sorts of things keep popping up. UI, Deep Learning, Parallelization, Datacenter Management (on prem and cloud). 5 years ago when I came to Uber Mesos/Aurora was the new hotness, and was going to replace a bespoke in-house service placement and management system. Here we are 5 years later and CoreBiz is hoping to finish that transition to Mesos/Peloton so they can switch to Kubernetes.

Not my usual area, but sometimes I get pulled into working on UI stuff. Angular, React, React-Native, Node, Deno. Which one should you use? If you're using one, should you switch to a different one? Maybe. Or maybe you should stick with whatever you're using.

Just because something is "old", whatever that means, doesn't mean you should stop using it. Working is a feature. A really important one. The thing is, it's easy to say "Time to change", but making the change is hard, takes time, and blocks other changes.

That doesn't mean you should never change, just that you need to count the cost before you do. And you need to measure not just the cost of making the change now, but also the cost of making the change later or never. And what the benefits of making the change now or later are. Because very often the short term cost pushes you one way, but the long term cost pushes you the other. And knowing which is more important can play a big part in your decision.

by Leon Rosenshein

Zip It

Tales from the trenches.

Geometry is hard. Let's say you needed to take pictures of every street in the US, or at least every address. Because street-level imagery goes stale fairly quickly, you wanted to cover each area fast, so you had "squads" of vehicles in an area. And for technical reasons you needed to keep the vehicles from operating in the same area. How would you break the country into workable areas? That was our challenge gathering street-level imagery for Bing Maps.

The answer we came up with was Zip Codes. Seemed like a reasonable idea. If a small number of mail-people are supposed to be able to reach all of the addresses in a ZipCode every day then a single vehicle should be able to cover the streets in a day or two. Every address is in one (and only one) ZipCode, so that will keep the vehicles apart, which is also good.

So we went to implement it. And automate it. There were a bunch of startup problems, like the fact that the Post Office doesn't provide ZipCode data, but we figured out ways around them. It mostly worked, but oh the edge cases.

It turns out that ZipCodes aren't areas. They're sets of points, or more correctly, every address is associated with a ZipCode. But that's not too bad. ArcGIS can take sets of points and turn them into non-overlapping polygons. Except… While every address is in one ZipCode, the polygons covering all the addresses aren't contiguous. Unless you're ArcGIS, which makes them contiguous by creating tiny tendrils that connect what would be islands. Except where it can't. So you have islands there too. So we did the usual thing. We automated everything, including validation that the automation worked, then took the parts that failed validation and sent them to Marcus (thanks @eisen) to fix. Because Marcus is a wiz at ArcGIS and can make sense out of anything.

And if you're wondering why we needed to keep the vehicles far apart, the problem was that the original firmware in the SICK LIDAR units we used (along with cameras, INS, and GPS systems) was written in a way that assumed two units would never be pointed at each other. Because if they were they would quickly burn out each other's sensors, rendering them useless. We eventually got SICK to fix that problem, but that's a whole different story

by Leon Rosenshein

The First Year

It's been just over a year since I started writing these notes. It started out as just sharing some links with my team, then with all of the infra org, and finally with all of ATG. Just over 230 notes. Some more humorous than others, but hopefully all interesting and informative.

First off, what a year it's been. When I started sending those links our team was the infra team for the Maps organization in "Prime". Since then we've moved to the Infra Org in ATG/Core Platforms, switched from Java/Scala and Mesos/Peloton to Go and Kubernetes, shut down a DC (WBU2), learned AWS, turned up a DC (DCA19), moved from rAV to rNA, are still dealing with a pandemic and have been WfH for almost ⅓ of the time (100 days). Even for Uber that's a busy year :)

Second, we don't know what the next year will hold, but we have some ideas. It will have at least returning to the office, doubling the size of DCA19, and rNA on the road at scale. Who knows what other challenges the world will throw at us.

Third, a request. While I've got opinions on just about everything and am not too shy to share them, if there's something you're interested in or curious about let me know. Send me a DM or add it to a thread. I read all of them and I'm happy to take requests.

Finally, I want to say thank you. I've learned a bunch along the way. Partly from needing to be sure I understood something well enough to explain it simply, and mostly from your comments. The links and experiences you shared have taught me new things, changed my understanding of some things, and overall helped me get better at my job.

Here's to another year of working and learning together.

by Leon Rosenshein

Inheriting Code

One of the things we've all had to do is learn how to work with an established system. Whether it's your first internship, switching teams, going to a new company, or just helping someone use something you've built, you have to orient yourself. So what do you do?

The first thing to do is ask questions. If you're lucky, there are folks around who can give you an overview, tell you where the problems are, and give you some history on why things are the way they are. If so, take advantage of that opportunity. 

Think of it like moving to a new city. You've got a specific destination and someone told you to check out a cool place while you're there, but that's it. The first thing you need is to get the utilities turned on, then you need a broad map to get you into the right areas. Then you start exploring and adding detail to the map. At first the details are only in a couple of places, maybe close to home and your job. The rest of the map has some general labels like industrial, residential, and downtown. Over time though, partly out of need, and partly out of curiosity, you start to fill in those blank spots. Eventually you end up with a pretty good internal map of the city, with a lot of detail for areas you spend most of your time in. And then, when you're the "expert" you share your map with friends and family.

Codebases are the same thing. Around here getting the utilities set up is called on-boarding. Getting your tools set up and getting access to what you'll need. Depending on how far you're coming from (new intern vs. another project on your current team) you'll need to do more or less on-boarding. You'll need a goal. A bug fix or a small feature. Maybe you need to add some validation to some input. You find (or build) a rough architecture diagram so you can figure out where the input is read. You look around the neighborhood and see how folks have done similar things in the past. You follow the data around in the debugger and start to understand some of the other blocks in that diagram.

When I joined the Combat Flight Sim team my first job was to figure out why the AI would sometimes just dive for the deck in the middle of a dogfight. Before I could fix it I needed to build my map of how things worked. There was already a tool that we could use to monitor the state of the simulation in real-time, but it was inherited from the main FlightSim team and focused on the flight models and graphics engine. So my first task turned into making better tools to understand the system. I started out simple by plumbing the AI's current target into the monitor, which taught me about how the AI stored its state, how data flowed through the system and how the display system worked. With that rough map I was able to dive deeper into the AI itself, figure out what state was important, and where it came from. Once I had that understanding I could look into the problem I was there to fix.

Which brings us to the last part of learning a new system, sharing what you learned. Very often when learning a new system you'll come up with new tools or enhancements to existing ones. Building those tools is a really good way to understand those systems, and sharing them helps you and the team along the way.

For those that care, it turned out that the AI was paying attention to energy management, like all good fighter pilots do. The problem was that instead of comparing its energy to the highest threat or some weighted average of multiple threats, it was comparing to the total energy of those threats. Of course, in that case it came up comparatively very low on energy and would trade altitude for airspeed in an attempt to even the score. I switched the comparison to a weighted average and solved the problem.
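A reconstruction of that bug in miniature (the energy numbers and equal weights below are invented; the real code was in the sim's AI, not Python):

```python
threat_energies = [900.0, 850.0, 800.0]  # three nearby threats
my_energy = 875.0

# Buggy comparison: summing makes any group of threats look overwhelming,
# so the AI always thinks it's low on energy and trades altitude for airspeed.
looks_outmatched = my_energy < sum(threat_energies)   # True against any 2+ threats

# Fixed comparison: weight each threat (equal weights shown) and compare
# against the weighted average instead of the total.
weights = [1 / len(threat_energies)] * len(threat_energies)
weighted_avg = sum(w * e for w, e in zip(weights, threat_energies))
print(looks_outmatched)            # True  (the bug)
print(my_energy >= weighted_avg)   # True  (the fix: actually holding its own)
```

Against the sum, a healthy fighter always looks hopelessly outmatched; against the weighted average it makes a sensible energy comparison, which is why the dive-for-the-deck behavior disappeared.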