Recent Posts (page 41 / 65)

by Leon Rosenshein

Ownership

In software development, as in many professions, way up near the top of the importance list is getting things done. That's why defining the problem you're solving is so important. How do you know you're done if you don't know what problem you're solving?

And knowing who's responsible for getting things done is just as important as knowing what done is. This is more than having a task list with names attached to each item, although that's critical too. Because solving a problem is more than completing a list of tasks. Especially now, with agile development and shorter design/planning cycles. Many design docs specify the verbs/functions that will be supported, but not the exact parameters, and that's good. It's good because at design time you almost certainly don't know the implementation details well enough to specify what those parameters will be.

Which leads to the issue. How can multiple people collaborate on building a thing, each with enough ownership and responsibility for their own parts to make their own decisions, and still get the bigger thing to done? Many moons ago in an MBA class the instructor told us that our job was going to be less about the architecture diagram itself, and more about the white space between all the boxes in it.

And that's where ownership and scope come into play. Diffusion of responsibility is a real thing. One way to keep your cognitive load down is to know what you're responsible for and not worry about the rest. And task lists with names on them are very clear. It's easy to say "I'm responsible for these specific things" and not worry about the rest.

It's also problematic. Because your customers want solutions to problems, not features. Each of those tasks in the list, by itself, is a feature. They need to be integrated with each other and the existing systems to form a solution.

Back in the early VirtualEarth days we had a very manual process to turn stereo image pairs into 2.5D extruded models. We needed to get our production rate up, so we were automating everything we could. Each bit of automation made sense, but the pieces weren't exactly integrated with each other. After a few months of everyone throwing together tools and scripts, the manager of our production team came to me and told me that we could keep unleashing all the tools we wanted against his team, but until we got our act together and figured out a way to solve his problems, his team was just going to ignore us.

That was a bit of an eye-opener, but he was right. We had taken the list of pain points, split it up into bite-sized chunks, and built tooling for a bunch of them. But we had lost sight of the actual problem we were solving. No one was specifically responsible for making production faster/easier, so we hadn't done that.

So I took his feedback, took a step back, and looked at his team's problems. All it took was a little bit of workflow framework, better error reporting/management, and some written SOPs (on top of all the tools/scripts we had built) and suddenly we made significant improvements to the production throughput and acceptance rate.

Whether it's the PM, the EM, the TL, or the IC, if you want to solve a problem, someone needs to own the problem itself, not just the parts.

by Leon Rosenshein

Hello. Who Are You?

There are lots of times in logging and observability when you write a nice standard function to handle the boilerplate part(s), making it easier to focus on what you want to say/observe. And that's a good thing. Being able to focus on what you want to say and not worry about the details of getting the message out reduces cognitive load and increases the likelihood that things will be said.

And one thing you often want to know with logging/observability is which function is making the call. You might be using structured logging and want to fill out the Function member, or you might want to set a tag on your metric. The easiest way to do that is to add a string parameter to the function and pass the name in. That's a good min bar.

But there are problems with that. If you have multiple calls in a function you need to spell it the same every time. Do some refactoring and it's wrong. Rename the method and it's wrong. A little bit of copy-pasta and it's wrong.

You can avoid that with good use of constants and local variables, but it's still error-prone and there's no way to catch it automatically. There is, however, a better way, and it's not subject to those limitations. Let the compiler do it for you.

You're probably familiar with the predefined C macros __FILE__ and __LINE__, but since C99 (and C++11) there's also the predefined identifier __func__. That's a good place to start in C/C++. Different languages have different specifics, but there are also (almost) always ways to access this info as well as the current call stack, so if you really need the function name, getting it automatically is often a better approach.

One thing to keep in mind is that while the compile-time methods are fast, any of the runtime methods may incur significant time penalties, so make sure you're not blowing your time budget :)
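
In Go, for instance, the runtime package can tell you who called you. Here's a minimal sketch (the callerName helper and the log format are my own, not anything standard):

```go
package main

import (
	"fmt"
	"runtime"
)

// callerName asks the Go runtime for the name of the function that
// called it, so the name can never drift out of sync with the code.
func callerName() string {
	// Skip 1 frame: frame 0 is callerName itself, frame 1 is its caller.
	pc, _, _, ok := runtime.Caller(1)
	if !ok {
		return "unknown"
	}
	return runtime.FuncForPC(pc).Name()
}

func processOrder() {
	// Logs func=main.processOrder even if the function gets renamed
	// or the body gets copy-pasted somewhere else.
	fmt.Printf("func=%s msg=%q\n", callerName(), "starting work")
}

func main() {
	processOrder()
}
```

runtime.Caller inspects the call stack at runtime, which is exactly the kind of time penalty mentioned above, so keep it out of your hottest paths.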

https://golang.org/pkg/runtime/

https://docs.python.org/3/library/inspect.html#inspect.getframeinfo

https://www.nongnu.org/libunwind/

https://docs.oracle.com/javase/7/docs/api/java/lang/Thread.html#getStackTrace()

by Leon Rosenshein

A Scrolling We Will Go

Many of our datasets are big. Thousands or millions of rows. That's fine. We can search them, we can order them, we can get a subset. And getting a subset is good because UIs like small datasets so they can be responsive. Quick scrolling. Short data query times. And that leads to pagination, so users can scroll through the data without ever having to have the entire set available.

The simplest way to paginate is to use the equivalent of SQL's OFFSET and LIMIT. To get any particular page, set OFFSET to pageSize * (pageNumber - 1) and LIMIT to pageSize and go. Simple. Easy to implement on the back end, easy to add to a URL as a query parameter or a gRPC request. And for small datasets that are mostly read-only it works fine. But as you scale up and the write-to-read ratio gets higher, not so much.

The first problem is simple scan time. If you want pages 19,000 and 19,001 of a 1,000,000 element resultset (assuming page size of 50) then each time you make the query you need to build the resultset, skip the first 950,000 results, then return something. That can take a bunch of time, and most of the time is spent dealing with information you're just going to drop on the floor since it's not part of the returned rows.

The second is the cost of the query. The more complicated your query, the longer it takes. If you're joining multiple tables the time can go exponential quickly. Exponentials and a large dataset are a bad combination if you're trying to be fast.

The third problem comes from changes to the datasource. If you've got 200 entries in your table and you want to get page 4, that would be entries 151-200. Simple. But what happens if the first 10 entries get deleted between the time you ask for page 3 and page 4? Now you get entries 151-190, which isn't a problem by itself, but you never returned the original elements 151-160, since now they're elements 141-150. And you have the inverse problem if things get added before the current page. Instead of missing entries you'll get the same entries at the beginning of the next page.

So what's a poor developer to do? First and foremost, indices are your friends. If you've got a natural key in your data, or you can create one from multiple columns, make an index and use it. Then you can use that for your pagination. Instead of using the OFFSET of the last row, use the last key value. Then you're doing a lookup on the index (much faster), and you know you're starting right after the last row you returned, regardless of any changes to the data.
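
Here's what that keyset approach looks like as a sketch in Go, assuming a hypothetical items table with an indexed id column (the table, columns, and driver are illustrative, not any particular schema):

```go
package pagination

import (
	"database/sql"

	_ "github.com/lib/pq" // illustrative Postgres driver; any database/sql driver works
)

// fetchPage returns the next pageSize names after lastSeenID using
// keyset pagination: an indexed range scan instead of OFFSET/LIMIT.
func fetchPage(db *sql.DB, lastSeenID int64, pageSize int) (names []string, lastID int64, err error) {
	rows, err := db.Query(
		`SELECT id, name FROM items WHERE id > $1 ORDER BY id LIMIT $2`,
		lastSeenID, pageSize)
	if err != nil {
		return nil, lastSeenID, err
	}
	defer rows.Close()

	lastID = lastSeenID
	for rows.Next() {
		var id int64
		var name string
		if err := rows.Scan(&id, &name); err != nil {
			return nil, lastID, err
		}
		names = append(names, name)
		lastID = id // the cursor the client hands back to get the next page
	}
	return names, lastID, rows.Err()
}
```

The client passes back the last id it saw instead of a page number, so the next page always starts right after the last row it actually received, no matter how the rows in front of it have shifted.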

Another option, if it fits your use case, is to build a temporary table with the results of your query, give it a simple key, like row number, and index on it. Then just return pages from the temporary table. This one is a little more complicated since you need to somehow associate the query with the temp table and manage the table's lifetime. If your query is complicated/takes a long time, results in a big table, and can stand to be a little out of date, this can turn a slow UI into one that's lightning fast.

by Leon Rosenshein

Data, Information, Knowledge, And Wisdom

We like to say we're a data driven company and we have a data driven culture, and that's good. But what does it mean? Ideally, being data driven means using current and past information to make knowledgeable decisions and plans to encourage a desired future outcome.

There's a lot to unpack there, so let's just dive into 3 of those words: data, information, and knowledge.

The first thing to realize is that while on the surface they're similar, they are not the same thing. And you have to understand those differences before you can go deeper. So here's my take on those 3.

Data - The raw numbers. Units produced, latency, number of riders. That sort of thing. It is explicitly measurable. The important part here is to keep it accurate. The data forms the foundation of the hierarchy, so if you put garbage in you get garbage out. Your data needs to account for everything. If your units produced only counts accepted units, you better have another datastream that includes the units that were attempted but not produced.

Information - The data in context. How do the data points relate to each other? One of the key contexts is time. The same measurement (that units produced again), but over the last day, week, month, year, and compared to other time periods. Also how one datastream relates to another. Like units produced / units attempted. Because how the data relates to itself is important.

Knowledge - Combining experience and information to learn. Knowledge lets us separate the driving forces from the results. Knowledge shows you what levers you have to pull and gives you some indication of what the results will be if you pull the lever. By itself, knowledge doesn't tell you which levers to pull or when.

Wisdom - Here's a bonus term. Wisdom is having the knowledge of how things that are not captured in your data will impact what you're trying to do. Things like culture, both company and external. Or what your competitors might do. 

Like many other things, it's a continuum, from the small scale point of data to the strategic wisdom, and not everything fits exactly into one box.

I'll have more to say about what being data driven means at some point in the future, but meanwhile, how do you define data, information, knowledge, and wisdom?


by Leon Rosenshein

Don't Always Minimize WIP

One of the questions I often ask during interviews is some variation of "If you take this job, what will your current (or last, if appropriate) manager not miss about you?" Once we get past them realizing that I'm looking for something their manager doesn't like, and that this is partly about self-awareness and honesty, they often ask for an example. So what I tell them is that my manager is a big proponent of minimizing WIP, and I'm usually working on multiple projects at once.

Over time we've come to an understanding. In general, I agree with him. When I'm fixing a bug or implementing a feature, I'm only working on that one bug/feature. If the feature has some clear deliverables along the way I structure the work so that they become available in order and start adding value instead of a big-bang reveal at the end. Small, bite-sized work, and finish one bite before moving on to the next.

The place where he gets annoyed and I'm working on multiple projects at once is somewhat different. It's at a higher/broader scope. At this point in my career I'm more interested in breadth than depth, and I do a lot of cross group/cross team/cross org work. And that work has longer timelines, lots of communication delays, and times when I'm waiting for other work to be done. 

To put it in programming terms, I've got multiple threads in the thread table, and many of them are blocked on IO. When it's time to switch to a new thread I look at what work is available, make a priority decision, do the context switch, and start working on the new thread. And of course, there are the non-maskable interrupts that need to get handled as they occur (unbreak now, pages, customer issues, etc). Just like your typical OS.

The big difference though is that, outside of NMIs, I decide when it's time to switch. It's not based on the clock or automatic if another thread becomes unblocked. That way I get to focus on what I'm doing and not spend too much time dealing with interrupts and context switches.

So, at one timescale I do minimize WIP, but at a larger scale I maximize throughput and value-add, which is not quite the same thing.

by Leon Rosenshein

NO

As important as what we do as developers is, what we don't do is equally important. There are a fixed number of seconds in the day, and you can't spend them on two things at once. The corollary of that is that you need to guard your time.

I've talked about the fact that a backlog should be an ordered list of work, and how to use Priority and Severity to help you order it, so no need to rehash that. Today's topic is the cost of the size of that backlog.

Some of the cost is very measurable. Periodically you go through your backlog and make sure that things are in the right order. Let's say it takes 30 seconds per item for the small, well-defined ones and 5 minutes for each vague idea. You quickly get through the front of the backlog, then spend 2.5 hours shuffling the last 30 entries. If those items weren't on the backlog you'd save 2.5 hours every time.

Other costs aren't so measurable, but very real. Having hundreds of items in your backlog, many of which everyone knows will never get done, is depressing. Talking about them repeatedly even more so.

So what can you do about it? One simple thing is to say No and keep them off the backlog. Figure out what the scope/ROI limit of your backlog is and if a new item doesn't meet it then don't add it. If the idea becomes more important later it will come up again. At that time you can revisit if it meets the threshold. If it does you add it, if not, it will come up again. Until then you can ignore it. And if it doesn't come up again then you haven't wasted time and have kept the cognitive load low.

Of course, TANSTAAFL, and this does impose some rigor on your process. You need to make sure you understand things when they come up. You need to understand their cost and their ROI. You need to understand the timescales of both. You need to think about how you would approach the work and whether there are steps along the way that are above the threshold even if the end goal isn't.

by Leon Rosenshein

Natural Science

Wheels within wheels
In a spiral array
A pattern so grand and complex
Time after time
We lose sight of the way
Our causes can't see their effects

    -- Neil Peart

Software development is about patterns. Design patterns, data patterns, and execution patterns. Flow graphs, timing graphs, and relationship graphs. Project structure, team structure, and organizational patterns. All fitting together to make an even bigger pattern.

We've all heard about the GoF design patterns. We know when to use them and when not to. A couple of days ago I talked about branching patterns. Again, there are guidelines about when to use which pattern and when to avoid it.

But, the really interesting part is that those patterns scale just like everything else we do. And what makes the most sense locally might not be the right choice when looked at from a larger scale. Or maybe it does *now*, but won't later.

I've mentioned the Global Ortho program before. One of the things we did as part of it was color balancing and blending. We went through _lots_ of iterations, looking at seam lines between images and between flightlines. And we were able to do a really good job of making those lines disappear. So we rolled it out. And we found that while things looked really good up close, there were still abrupt changes across collection boundaries, image delivery boundaries, and processing unit boundaries. Eventually we realized that in addition to all of the local knowledge and optimizations we used for blending adjacent images into a mosaic, we needed to bring in high level knowledge, in this case Blue Marble imagery, to provide top-down, globally consistent, artistically color balanced data for each mosaic, so that the mosaics worked well not only individually, but with each other.

And that's our challenge as engineers. To see the big picture. To make the right tradeoffs between local and global optimizations. Between building for today and preparing for the future. To see the "wheels within wheels", the "pattern so grand and complex". Plus, I get to quote Neil Peart, RIP

by Leon Rosenshein

Tag, You're It

Docker and immutability have an interesting relationship. Docker images are immutable. They even have a SHA, and when you pull an image with its associated SHA you always get the same one. Docker tags, on the other hand, are associations between a string (the tag) and a <repository>:<sha> pair. And that association is mutable. Or at least it's mutable according to the spec. That means that you could pull an image today, maybe docker pull debian:stretch-20200607, and get an image with a SHA256 of 554173b67d0976d38093a073f73c848dfc667d3b1b40b6df4adf4110f8a3e4ce. That's great. But tomorrow you could issue the same command and get an image with a different SHA. Because the author can change the tag to be associated with a different image while you weren't looking.

We like to think of docker images as a way to ensure that we always get the same environment. And that's sort of true. For a given image it's always the same. But a <repository>:<tag> can change. If you really want to be sure you need to pull by sha. So the correct command to ensure you always get the same image (assuming it hasn't been deleted) is docker pull debian@sha256:554173b67d0976d38093a073f73c848dfc667d3b1b40b6df4adf4110f8a3e4ce.

Then there's the magic latest tag. latest is especially fun when used with docker run. In this case it's not the latest image pushed. It's not the latest image:tag you ran. It's the latest image without a tag you ran. And unless you've got a very odd build/deploy/run process it's almost certainly not what you want.

If you think that's bad, throw Kubernetes into the mix and it gets weirder. Because Kubernetes acts like the tag is immutable, and will, by default, assume that if there's an image with the right tag on the machine it must be the one you want. Yes, you can tell it to always pull, but what happens if the tag changes and then your pod dies? When the pod restarts you're going to get a different version of the image.

And to make matters worse, Kraken, which is very good at supplying images to 100s of nodes at once, has sort of mutable tags. When you create a new tag Kraken correctly writes down the SHA it refers to. And if you update the tag it will remember, in memory, what the new SHA is, but it never writes it down. So if the Kraken node ever restarts, it reads the old SHA from disk and starts serving the old one.

So the moral of the story is to treat your tags as immutable, even if they're not. Don't use latest, especially in production. If you feel the urge to use latest locally during development you're probably going to be OK, but don't push it to a remote docker registry. And don't ever re-use a tag. You, and everyone around you, will be happier.

FYI, here on the infra compute team we like to use <user>_<git branch>_<date>_<time> which is really hard to make collide, so we don't have to worry about mutability. We have that format baked into our bazel rules/makefiles so it just works. Yes, the tags are a bit long, but you don't type them that often, and unlike a box of chocolates, you always know what you're going to get.

by Leon Rosenshein

Branching

I've mentioned Martin Fowler before, and he recently finished his series on branching. While there's way too much info to share here, it can be summarized with this quote, so I'll just send you all over to his page to read about it there.

I start with: "what if there was no branching". Everybody would be editing the live code, half-baked changes would bork the system, people would be stepping all over each other. And so we give individuals the illusion of frozen time, that they are the only ones changing the system and those changes can wait until they are fully baked before risking the system. But this is an illusion and eventually the price for it comes due. Who pays? When? How much? That's what these patterns are discussing: alternatives for paying the piper.

    -- Kent Beck

While you're there you should look around and see what else you find interesting. There's almost certainly something that will open your mind on something you thought was settled.

by Leon Rosenshein

Why Are We Doing This

ERDs, PRDs, RFCs. Regardless of what you call it, a lot of what these documents do is define what you're doing. Often in excruciating detail. And that detail can be important, for lots of reasons.

First, it forces you to think through what you intend to do. It forces you to think of corner cases and happy paths, and ways things can go wrong. It forces you to understand what you're doing well enough to describe it to others.

Second, it lets others know what you're doing. This lets them plan for or around it. It lets the group identify duplication. It lets others provide their experience so you can avoid past mistakes.

Third, it lets others know what you're not doing. Again, this lets them plan for or around it. They know what they'll have to do themselves, or at least find some other solution. And if you've added reasons why you're not doing it you might even help them avoid needing that thing too.

Fourth, it defines done. If you write down all the things you're going to do you have a checklist you can refer to to know when you've done them all. It's always good to know when you're done.

But what all of that doesn't do is say WHY you're doing it. And that's at least as important, if not more so. What problem are you solving? Who does it benefit? How does it benefit them? What's going to be better for the user/customer because of what you're doing?

There are two big reasons that why is important. First is that we can't do everything, or at least we can't do everything right now. That's why we have priority and severity discussions. Since we have to, at least to a certain extent, serialize things, the order of operations is important. The why is the input to the prioritization.

The second reason goes back to defining done. If you haven't defined the goal, the problem to be solved or the benefit to be enabled, you've just written a task list. You can complete the task list, but that doesn't mean you've solved the problem or enabled the benefit. If you have defined why then you've defined done in a much more impactful way. 

And that's really why we're here. To have an impact.