
by Leon Rosenshein

Ubiquitous Language

I like bounded contexts. More generally, I like both bounds and contexts. Together they're really good at defining scope, both "in scope" and "out of scope". Knowing what you're talking about goes a long way to reducing cognitive load. If it's in scope you think about it. If it's out then you don't need to worry about it.

But that's only part of the battle. Even when you know what's in scope and everyone is in agreement on the scope, you still need to use the same language to discuss things. And you need to use the same language everywhere. Because when it comes to the Venn diagram of what everyone knows, what you know, and what your customers know, the overlap can be surprisingly small.

Ubiquitous language comes from the Domain Driven Design world. And language in this case isn't English vs German vs Spanish, or even C++ vs Go vs Python. It's about terminology. It's about naming. And naming is hard. It's one of the really hard problems in programming. DDD's solution is to ask the expert in the field (the Domain Expert) and use that language. And use it everywhere. In the user docs. In the design docs. In the code. Especially in the code. Because the developers are not the domain experts, and the developers need to be able to communicate with the users and domain experts. If you're already using the same terms for the same things then there's no translation needed and no additional cognitive load. On the other hand, if you're not using the same term, every time you have a conversation the terms need to be translated. In real time. With loss of accuracy. And additional cognitive load.
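Here's what that looks like in code. A toy Go sketch (the insurance domain and every name in it are invented for illustration, not from any real system):

```go
package main

import "fmt"

// Developer jargon: generic names that force a translation step every time
// you talk to the domain expert.
type Record struct {
	Amount int64
	Flag   bool
}

// Ubiquitous language: the underwriter's own terms, used verbatim in the
// code, the docs, and the conversation.
type Policy struct {
	PremiumCents int64
	Lapsed       bool
}

func main() {
	p := Policy{PremiumCents: 12500}
	fmt.Printf("premium: $%d.%02d lapsed: %v\n",
		p.PremiumCents/100, p.PremiumCents%100, p.Lapsed)
}
```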

In the batch processing world we deal with jobs a lot. Customers want to run jobs. They want to start them. They want to know if they're finished. They want to know what the errors were. But what exactly is a job? To Kubernetes it's a structure/API. To Amazon it's a feature in the console. To ML model developers it's a bunch of GPUs running TensorFlow on a dataset. To others it's a directed acyclic graph (DAG) of tasks. So what's a "job"? We all need to agree at all times or we're setting ourselves up for problems later.

We decided a job is a DAG of tasks (the definition of a task is another part of the ubiquitous language). It could be a simple DAG of one thing, it could be a DAG of N things that need to happen together, or it could be a complicated DAG. In general customers don't need to care. The model they use lets them do everything they want regardless of the complexity of the DAG. As developers we don't care because it's all a DAG and we can treat it as such. And we all agree to ignore Amazon's and Kubernetes' definitions. Because they just get in the way and add ambiguity. If we (the developers) need to care about those definitions we wall them off carefully behind facades so that they don't interfere with our ubiquitous language.
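As a sketch (these types are illustrative, not our actual API), the whole model fits in a few lines of Go:

```go
package main

import "fmt"

// Task is a node in the DAG. Edges point at the tasks that must finish first.
type Task struct {
	Name      string
	DependsOn []*Task
}

// Job is just a DAG of tasks. One task, N parallel tasks, or a deep graph
// are all the same shape to the code.
type Job struct {
	Tasks []*Task
}

func main() {
	extract := &Task{Name: "extract"}
	transform := &Task{Name: "transform", DependsOn: []*Task{extract}}
	load := &Task{Name: "load", DependsOn: []*Task{transform}}
	job := Job{Tasks: []*Task{extract, transform, load}}
	fmt.Println("tasks in job:", len(job.Tasks))
}
```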

And of course, every domain/bounded context can (should?) have its own ubiquitous language. And the interfaces/translators need to be clear. But at least inside a domain/scope/context you have a language that's shared ubiquitously.

So next time you need to abstract something, think about the language you use, because language matters.

by Leon Rosenshein

Just One More Thing

There are almost as many reasons to ask questions as there are questions to ask, but many of them fit into a few categories. This is by no means an exhaustive list, but some of the bigger ones are:

  1. You're looking for context to help identify a problem
  2. You know what your problem is and you're looking for a solution
  3. You've got a solution and you're looking for an implementation
  4. You know the answer and you're leading the witness (Columbo style)

When you have the first kind of question you ask open-ended questions, follow-up questions, and why questions. You're trying to find a gap in your knowledge or understanding, so you need to start, at least internally, by defining what you do know and where the boundaries are. You're an explorer and you're working with a guide. The guide might not have the answer, but together you can find it. And you need to be comfortable with silence while the guide thinks about the question.

A good question of the second kind takes some up front thought. Asking your peer or Stack Overflow a good question requires that you understand your problem. That you know what you've tried and why it didn't work. And you need to be able to explain that well so you don't waste anyone's time.

The third type of question is when I get to ask my favorite question, "What is it you're really trying to do?" It's the kind of question you get when someone knows just what they want to do, but can't figure out how to bend the system to their will. If you find yourself asking that kind of question, take a step back and think about what you're trying to do and if your approach makes sense.

That last kind of question is when you don't really have a question. You're trying to make a point or get buy-in. After all, if you can get the other person to convince themselves of something, you don't have to. And that kind of convincing is much more likely to last.

Just one more thing. If you've never watched Columbo, take some time and watch an episode or two. You'll be glad you did.


by Leon Rosenshein

A Little Architecture

A couple of months ago I mentioned that O'Reilly was holding an Architectural Kata competition. A few of us spent some time a couple of weeks ago putting together a response. It was a lot of fun and we all learned a lot about software architecture, what it is, and what it isn't. If you're interested in getting some architectural experience I think the idea of a Kata is a great place to start.

Another place to start is looking at what people who have been there and done that have to say about architecture. Ralph Johnson famously defined software architecture as “the important stuff (whatever that is).” The architect’s job is to understand and balance all of those important things (whatever they are).

And what are the important things? I'll give you a hint: it's NOT the choice of framework, language, or cloud provider. And it's not about a 6-month planning cycle where you define everything up front. One of the problems with that is that you don't know what you don't know, so how can you make any decisions about it? The decisions you need to make up front are the ones that let you make the other decisions as late as possible, when you have as much information as possible.

And that's good architecture.


by Leon Rosenshein

GC At Scale

Many of us have used garbage collected languages, like Golang, Java, or C# (or maybe you've avoided the whole thing by experimenting with Rust). When we ran the HDFS cluster in WBU2 we routinely had heap sizes of 250+ GB, and GC would sometimes reap 50+ GB and take seconds. That caused all sorts of runtime problems for our customers as the NameNode would go unresponsive and clients would time out. We solved that one by switching garbage collectors and tuning it better, as one does. But that's not the biggest or longest GC I've had to deal with.

Back in the Bing Maps days we managed a 330 PB distributed file system that was intimately connected to our distributed processing engine. Every file it tracked was associated with the job that created it, from the smallest temporary file to the biggest output file. There are lots of benefits to that. Since we also tracked all the inputs and outputs of each task, it was easy to trace the provenance of any given file. So if we found a problem somewhere in the middle of a processing chain it was easy to fix the problem, then go find everything that depended on it, re-run those tasks with the same input labels, and get everything up to date.

Now disk space, just like memory, is a precious resource. And we were always running out of available disk space. One of the features we had was the ability to mark every file generated by processing as task scope or job scope. Task scope files went away when the overall job was marked as done or cancelled. Job scope files, on the other hand, only went away when the job was cancelled. And by go away I mean they were marked as deleted, and the cleaner would come along periodically and delete them out of band. So far, so good. 
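A minimal Go sketch of that lifecycle (the names and types are invented; the real system looked nothing this simple):

```go
package main

import "fmt"

type Scope int

const (
	TaskScope Scope = iota // reclaimed when the job is closed or cancelled
	JobScope               // reclaimed only when the job is cancelled
)

type File struct {
	Path    string
	Scope   Scope
	Deleted bool // a soft delete; the cleaner removes the bytes out of band
}

// closeJob marks a job done: only task-scope files become garbage.
func closeJob(files []*File) {
	for _, f := range files {
		if f.Scope == TaskScope {
			f.Deleted = true
		}
	}
}

// cancelJob marks a job cancelled: everything it created becomes garbage.
func cancelJob(files []*File) {
	for _, f := range files {
		f.Deleted = true
	}
}

func main() {
	files := []*File{
		{Path: "/tmp/scratch", Scope: TaskScope},
		{Path: "/output/imagery", Scope: JobScope},
	}
	closeJob(files)
	fmt.Println(files[0].Deleted, files[1].Deleted) // true false: outputs survive a close
}
```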

And of course, whenever we got close to running out of disk space we'd remind users to close or cancel any old jobs so the space would be reclaimed. And they would and we would move on. But remember when I said every file was associated with the job that created it? That included the files that were imported as part of bootstrapping the lab. Those files included the original source imagery for pretty much everything we had when the cluster came online, and they were added with a bunch of jobs called import_XXX.

You can probably guess where this is going. We had never marked those import jobs as done, and as part of one of those periodic cleanups, someone cancelled those import jobs instead of closing them. And the system then happily GC'd all of that data. About 30 PB worth of data. Now that's garbage collection at scale :)

by Leon Rosenshein

Thesaurus In Use

Software Architecture is bricks made up of the recursive application of sequence, selection, and iteration, emplaced within a superstructure of object oriented design, and plumbed with pipelines immutable data governed by functional programming.

    -- Uncle Bob Martin

There's a lot to unpack there. I think there's a typo in there, but the two changes I've come up with give the statement different meanings. And both are correct. So who knows. Let's break it down.

Software architecture is bricks: A group of self-contained blocks that are assembled in some particular fashion to produce something. I can get behind that.

Made up of the recursive application of sequence, selection, and iteration: Each of those bricks operates on the things inside. It puts them in some defined order, it picks the ones that matter, and it cycles through them all. When you get down to it, computers can only do simple things. They just do lots of them very quickly, giving the illusion of something big happening.
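If you want to see all three primitives at once, here's a toy Go example with all of them in a dozen lines:

```go
package main

import "fmt"

func main() {
	total := 0
	nums := []int{3, -1, 4, -1, 5}
	for _, n := range nums { // iteration
		if n > 0 { // selection
			total += n // sequence: one simple step after another
		}
	}
	fmt.Println(total) // 12
}
```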

Emplaced within a superstructure of object oriented design: Placed within a framework of decoupled systems communicating via message passing (Alan Kay's definition, NOT C with classes).

Plumbed with pipelines immutable data: Here's where I think there's a typo. It could be pipelined immutable data or pipelines of immutable data. I like the former AND its Levenshtein distance is lower, so I'm going with that. In essence, the data flowing through the system shouldn't change. Sometimes it does, but we try to keep that under control.

Governed by functional programming: The key word here is governed. As in "control, influence, or regulate (a person, action, or course of events)." Functional programming is a guideline. An aspirational goal, tempered by the reality of things like data size and storage. Functional means things never change value, so you don't need to worry about whether they have. The more functional, the less cognitive load. And that's always a good thing.

So what do you all think?


by Leon Rosenshein

Inclusiveness

We all deal with lots of hierarchies. There's your organization hierarchy. There's the class structure in code. There's the ownership hierarchy in code. There's the ownership hierarchy in uOwn. There's a hierarchy to resource budgets. It would be wonderful if the shape of these trees was all the same, but while they're similar, there are subtle differences to them. And those differences grow over time as priorities shift, roles and scopes change, and re-orgs occur. Dealing with those differences is a challenge, and something we all need to work on minimizing where possible. But that's not really the point of today's post, just some context.

Today's post is about regular expressions and avoiding customer pain. The connection between hierarchies and regular expressions is that while we have hierarchies in our resource budgets, Kubernetes explicitly doesn't have hierarchies in namespaces, and it uses namespaces to manage resource budgets and access to resources. The problem is how to bridge the gap between the two.

The solution we've implemented involves YAML files and nesting. The YAML file lets us define a set of resource constraints that (at least closely) match the existing hierarchies, but can be flattened into something Kubernetes can understand. We do this by building the flattened name for a given namespace from all of its parents. It's a neat trick, but what has that got to do with regular expressions?
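A minimal Go sketch of that flattening (the node names and the dash separator are invented for illustration):

```go
package main

import "fmt"

// Node is one level of the logical hierarchy, roughly what the YAML declares.
type Node struct {
	Name     string
	Children []Node
}

// flatten builds each namespace's flat name from all of its parents, which is
// what Kubernetes actually sees.
func flatten(prefix string, n Node, out *[]string) {
	name := n.Name
	if prefix != "" {
		name = prefix + "-" + n.Name
	}
	*out = append(*out, name)
	for _, c := range n.Children {
		flatten(name, c, out)
	}
}

func main() {
	root := Node{Name: "maps", Children: []Node{
		{Name: "imagery", Children: []Node{{Name: "ingest"}}},
	}}
	var namespaces []string
	flatten("", root, &namespaces)
	fmt.Println(namespaces) // [maps maps-imagery maps-imagery-ingest]
}
```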

Like I said a month ago, regular expressions are dangerous things, but here we need them. Because another Kubernetes feature is resource isolation by namespace. Which means that if we create a resource, say an NFS mount, inside a given namespace it can only be used inside that namespace. Which means if we want to make a resource available to all of the namespaces in a sub-tree we need to create it in the sub-tree. Or what would be the sub-tree if we had one.

This is important because when we create a new namespace that is logically in a tree we want it to have access to things that the parent has. And we got that wrong a few times. We would create the new namespace, but not make things available to it because we would forget to add them to the allow list. And our customers would complain. And we would feel bad.

And that's where the regular expression comes in. Because we build the name of a namespace from its parents, we can look at the name and decide whether a given namespace belongs to a logical sub-tree. It's easy to define a regex that looks at the start of a string, decides if the beginning matches a pattern, and doesn't care about the rest. Something like `/^<ThingToLookFor>.*/`. Instead of checking for string equality, treat the desired string as a regex, then look for a match. Simple. Easy.

Mostly. First of all we need to be backward compatible. In this case that was pretty easy. `/<ThingToLookFor>/` is a perfectly valid regex. It's the regex equivalent of string equals, so we've got that going for us. Second, what happens if the template isn't a valid regex? These things are in a config file, so we can't fail the compilation. In this case we've chosen to validate at build time AND validate when reading the config file. This should keep us safe from runtime panics. We've still got to worry about extra or missed matches, but at least it will run.
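In Go, the match plus the validation might look something like this (a sketch, not our actual code; `allowed` is an invented helper):

```go
package main

import (
	"fmt"
	"regexp"
)

// allowed reports whether namespace falls under the configured entry, treating
// the entry as a regex anchored at the start of the name. A plain entry like
// "maps-imagery" keeps working, so existing configs stay backward compatible.
func allowed(entry, namespace string) (bool, error) {
	re, err := regexp.Compile("^" + entry)
	if err != nil {
		// Validate when the config is read, so a bad entry is a clean
		// error instead of a runtime panic later.
		return false, fmt.Errorf("invalid allow-list entry %q: %w", entry, err)
	}
	return re.MatchString(namespace), nil
}

func main() {
	ok, _ := allowed("maps-imagery", "maps-imagery-ingest")
	fmt.Println(ok) // true: new leaves under the sub-tree inherit access
}
```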

And as long as we don't mess up the regex, when we add new leaves to the virtual tree they will automatically inherit the things they should from the parent.

by Leon Rosenshein

Time Passes

I've mentioned before that time is hard. From leap days/seconds to time zone definition changes to GPS time resetting to 0 every Sunday, it's just not that simple.

But it's even worse than that. Consider functional languages such as Haskell, Erlang, and OCaml. In the purest of them, functions don't have side effects, because they can't. Except… One side effect you can't prevent, even in the most functional of languages, is the passage of time. The result of calling time.Now() before and after the function call will be different. And that's a side effect.

Of course, you can't change time. Unless you're running a simulation. Then things are different. If you're running a simulation then one of the first things you need to simulate is time. And then you need to rigidly control it. If you're running a distributed simulation you also need to make sure that all of the nodes in your system always agree on what time it is. And that's hard. Doable, but hard. If you don't, someone will figure out a way to deal with it or come up with a plausible (but wrong) explanation.

Back when I was doing aerospace work I was working with some folks doing simulations of the Northrop YF-23, and a big part of the testing was M vs. N air combat. This was a hard real-time system, with most of the effort going into accurately simulating the performance and response of the YF-23s. Everything else in the simulation was just there to provide a realistic environment. And the pilots of the other aircraft complained that at the beginning of the simulation their aircraft were "heavy" and unresponsive, but after a while they started to respond better. Turns out the aircraft model wasn't changing at all. Because it was a hard real-time system and everything had to happen within the fixed frame time, the scheduler, to make everything fit, would take the things at the end of the task list, put some of them in the even frames and some in the odd frames, and tell them to simulate twice the elapsed time. The physics were still accurate, but it doubled the transport delay and latency, which the pilots perceived as the aircraft being "heavy" and unresponsive. As the number of simulated objects shrank, everything eventually ran every frame, the latency dropped, and the pilots felt better. Whether this was a bug or a feature is a different question, but it was only possible because the system had complete control over time in the simulation and could do whatever it needed to keep it in sync with wall time.

Another place that time as a side-effect can bite you is in unit tests. Luckily, unit tests are kind of like simulations, in that you can control time there as well. Or you can if you access time as an injected dependency. Then you can mock it and get time to do what you want, not what it ends up doing on its own. Which makes it possible to test all of those edge cases without writing long tests with random sleeps or ending up with flaky tests. And we all want to eliminate flaky tests.
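A minimal Go sketch of the pattern (the `Clock` interface and names are invented, though many real codebases define something just like it):

```go
package main

import (
	"fmt"
	"time"
)

// Clock is the injected dependency: production code gets the real clock,
// tests get one they control.
type Clock interface {
	Now() time.Time
}

type realClock struct{}

func (realClock) Now() time.Time { return time.Now() }

// fakeClock is fully controlled by the test: no sleeps, no flakiness.
type fakeClock struct{ t time.Time }

func (f *fakeClock) Now() time.Time           { return f.t }
func (f *fakeClock) Advance(d time.Duration)  { f.t = f.t.Add(d) }

// Expired is the code under test: it only ever asks the Clock for the time.
func Expired(c Clock, deadline time.Time) bool {
	return c.Now().After(deadline)
}

func main() {
	clk := &fakeClock{t: time.Unix(0, 0)}
	deadline := clk.Now().Add(time.Hour)
	fmt.Println(Expired(clk, deadline)) // false
	clk.Advance(2 * time.Hour)          // "time passes", instantly
	fmt.Println(Expired(clk, deadline)) // true
}
```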

So remember, time is hard, so be careful where you get it.

by Leon Rosenshein

Radar Time

The new Thoughtworks radar is out. There are a handful of things that I'm really interested in, and high on the list are Security Policy as Code and Run Cost as an Architectural Fitness Function (since we're cloud first). Also interesting are Data Meshes and Kustomize (which we are already using for a lot of our Spinnaker deployments). And of course there's the perennial Zero-Trust and Rust. What's on your radar?

by Leon Rosenshein

Commits vs. PRs

For each desired change, make the change easy (warning: this may be hard), then make the easy change

    -- Kent Beck

Pretty good rule of thumb. Similar to the Unix way: do one thing and do it well, then pipe things together to reach the end goal. But what has that got to do with commits and pull requests?

This. For small, additive changes, a single commit (modulo bug fixes/typos) is often all you need to implement your change, so you have a 1:1 ratio between commits and pull requests. On the other hand, if you've been avoiding unneeded generalization and layers of abstraction there will come a time when you do need that generalization/abstraction. And that's the time to implement it. Now you could do the big bang approach, and have a single commit that refactors the code into the new layout, implements the current functionality with the new layers, and also adds the new bits.

Most would agree that a single commit that does all that would be sub-optimal. You can't go back a step. You can't test the refactor to make sure that everything still works the same before you add anything. It's hard to know where the refactor stops and the new change begins. So let's not do it that way.

Instead, let's break the change down into steps. First remove any cruft. Make sure everything still works exactly the same. Bug for bug compatibility. That's a nice commit right there. Then do the basic refactor. Again, make sure everything still works exactly the same. That's another commit. There's probably some nearby code that should change to take advantage of the new refactor. Verify you still haven't actually changed results. Another commit. That's three commits and you've verified that there are no external changes. Have you done any work? Yes. You've reduced entropy. You've made it easy to make the change you started out to make. You've made it possible to actually determine the delta for the change you want, and to test the change you want to make.

Now you can make the change that started this whole journey. And test that the only thing that changed is the result you're looking for. So you've got at least 4 commits, plus however many you made along the way for checkpoints, experimentation, and "just in case".

So here's the question. How many PRs are there? For a long time I would do that as a single PR. But lately I've been thinking that's not the best approach. Think about what the world would be like if each of those 4 commits was also a PR. Think about it as a reviewer. Each of those PRs is isolated. The purpose of the PR is clear, and for 3 of them you can verify that nothing external changed. You can start using those changes in your own future work. And when the change shows up in that last PR you can see just what the change is and how it interacts with existing code. That's much easier to review, and you're less likely to miss something than in a single 75-file mixed PR.

And, it's also the agile way. Preferring working code over docs, responding to change, continuously adding value. Think about it.

by Leon Rosenshein

To Float Or Not To Float

They say money makes the world go 'round. And these days computers are a big part of how the money moves. From accounting to checkbook management to online purchases, it's all about the Benjamins. Unfortunately, computers are really bad with money.

Which is odd, because what computers are really good at is arithmetic. They can add, subtract, multiply, and, except for some early Pentiums, divide. Pretty much everything else they do is just doing that really quickly in complex patterns. So why are they so bad with money?

It comes down to the fact that computers do everything in base 2, and with a fixed number of binary digits the only fractions they can represent exactly are the ones whose denominators are powers of two. One tenth, the humble dime, has no exact binary representation, just like one third has no exact decimal one. And there are an infinite number of real numbers between any two integers, so almost all the programming languages we use model them with approximations. Floats, Doubles, Longs, Long Longs, and, when we eventually start talking about them, Double Plus Long Longs, are just more and more accurate attempts.

But no matter how much precision you have, it's still an approximation. Your typical Float has 7 digits of precision. When you're dealing with money that can be a problem. Not for the kind of transactions that people typically deal with, but when the numbers get large, things happen. 
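Here's a Go illustration (the balance is made up, the arithmetic isn't). Near $1,000,000 a float32 can only represent values 1/16 of a dollar apart, so a penny just disappears:

```go
package main

import "fmt"

func main() {
	// 1,000,000.10 is not representable as a float32; the nearest value
	// is 1,000,000.125, and adjacent values are 1/16 of a dollar apart.
	var balance float32 = 1_000_000.10
	fmt.Println(balance+0.01 == balance) // true: deposit a penny, nothing changes
}
```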

We ran into this when we were working on Falcon 4.0. We modeled the entire Korean peninsula to handle a dynamic simulation of the air-land battle. Everything worked fine below the DMZ, but as the front line moved north the land part of the battle ground to a halt. Our tanks and infantry literally couldn't move. It turned out that the combination of velocity and simulation step size meant that each time we advanced the simulation, the forward distance the vehicles traveled was less than the minimum quantum of a floating point number that far from the origin. So nothing moved. To keep things moving we ended up moving the origin of the grid from the southwest corner of the map to the center. Then things could move at both extremes.
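You can watch the quantum grow with distance from the origin. A Go sketch (the coordinates and per-frame step are invented, the arithmetic is real):

```go
package main

import (
	"fmt"
	"math"
)

func main() {
	near := float32(100)      // 100 units from the origin
	far := float32(3_000_000) // the far corner of a big map

	// The gap between a float32 and the next representable value:
	fmt.Println(math.Nextafter32(near, math.MaxFloat32) - near) // ~7.6e-06
	fmt.Println(math.Nextafter32(far, math.MaxFloat32) - far)   // 0.25

	// A small per-frame move works near the origin but vanishes far away.
	step := float32(0.05)
	fmt.Println(near+step == near) // false: the tank moves
	fmt.Println(far+step == far)   // true: the tank is stuck
}
```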

The same thing can (and will, if you're not careful) happen with money. If you have over $1,000,000 and you try to calculate interest over time then your interest calculations are going to be wrong. And the more money you have, the more inaccurate they'll be. You can switch from floats to longs to long longs, but it just buys you time, not a real solution.

So what can you do? One option might be to model pennies instead of dollars. Then you're dealing with integers instead of approximations, but that doesn't really work either. Your percentages are still non-integers. So you multiply them by some factor to make them whole numbers. Which works for a while, until someone wants to use pennies per thousand dollars instead of pennies per hundred dollars for a mill levy.

SQL has a solution: the Decimal type. Which makes sense, since the early databases were all about tracking money. Instead of approximating, you specify how many digits you want before and after the decimal point, and it does the right thing in the background. But it's slow, and only works in SQL, so it's not a general solution. Java's BigDecimal works the same way.
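Go doesn't have a built-in decimal type, but as a sketch of the same idea, the standard library's math/big.Rat does exact rational arithmetic and only rounds when you format for display (a real system would more likely use a dedicated decimal library):

```go
package main

import (
	"fmt"
	"math/big"
)

func main() {
	// Exact rational arithmetic: no binary approximation anywhere.
	balance := new(big.Rat).SetInt64(1_000_000) // $1,000,000.00
	penny := big.NewRat(1, 100)                 // $0.01
	rate := big.NewRat(5, 100)                  // 5% interest

	balance.Add(balance, penny)
	interest := new(big.Rat).Mul(balance, rate)

	// FloatString(2) rounds to 2 decimal places only at display time.
	fmt.Println(balance.FloatString(2))  // 1000000.01 — the penny survives
	fmt.Println(interest.FloatString(2)) // 50000.00 (exact value is 50000.0005)
}
```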

So, before you go stuffing your dollars and cents into a Float, think about your boundary cases. Maybe a double or long long is enough. Maybe you can switch to integers, or maybe you need to find a Decimal library for your language. It depends.