Recent Posts (page 40 / 70)

by Leon Rosenshein

Thesaurus In Use

Software Architecture is bricks made up of the recursive application of sequence, selection, and iteration, emplaced within a superstructure of object oriented design, and plumbed with pipelines immutable data governed by functional programming.

    -- Uncle Bob Martin

There's a lot to unpack there. I think there's a typo in there, but the two changes I've come up with give the statement different meanings. And both are correct. So who knows. Let's break it down.

Software architecture is bricks: A group of self-contained blocks that are assembled in some particular fashion to produce something. I can get behind that.

Made up of the recursive application of sequence, selection, and iteration: Each of those bricks operates on the things inside. It puts them in some defined order, it picks the ones that matter, and it cycles through them all. When you get down to it, computers can only do simple things. They just do lots of them very quickly, giving the illusion of something big happening.
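To make that concrete, here's a minimal Go sketch (the function is made up) of one such brick, built from nothing but those three primitives:

```go
// sumOfEvens is a hypothetical brick: sequence (one statement after
// another), selection (the if), and iteration (the for), and nothing else.
func sumOfEvens(nums []int) int {
	total := 0               // sequence
	for _, n := range nums { // iteration
		if n%2 == 0 { // selection
			total += n
		}
	}
	return total
}
```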

Emplaced within a superstructure of object oriented design: Placed within a framework of decoupled systems communicating via message passing (Alan Kay's definition, NOT C with classes)

Plumbed with pipelines immutable data: Here's where I think there's a typo. It could be pipelined immutable data or pipelines of immutable data. I like the former AND its Levenshtein distance is lower, so I'm going with that. In essence, the data flowing through the system shouldn't change. Sometimes it does, but we try to keep it under control.
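As a sketch of what pipelined immutable data might look like (stage names made up for illustration), each stage takes data in and returns new data, never modifying its input:

```go
package pipeline

// double returns a new slice; the input is never modified.
func double(in []int) []int {
	out := make([]int, 0, len(in))
	for _, n := range in {
		out = append(out, n*2)
	}
	return out
}

// keepPositive also returns a new slice.
func keepPositive(in []int) []int {
	out := make([]int, 0, len(in))
	for _, n := range in {
		if n > 0 {
			out = append(out, n)
		}
	}
	return out
}

// run just plumbs the stages together; the original data flows through unchanged.
func run(nums []int) []int {
	return keepPositive(double(nums))
}
```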

Governed by functional programming: The key word here is governed. As in "control, influence, or regulate (a person, action, or course of events)." Functional programming is a guideline. An aspirational goal, tempered by the reality of things like data size and storage. Functional means things never change value, so you don't need to worry about whether they have. The more functional, the less cognitive load. And that's always a good thing.

So what do you all think?


by Leon Rosenshein

Inclusiveness

We all deal with lots of hierarchies. There's your organization hierarchy. There's the class structure in code. There's the ownership hierarchy in code. There's the ownership hierarchy in uOwn. There's a hierarchy to resource budgets. It would be wonderful if the shapes of these trees were all the same, but while they're similar, there are subtle differences between them. And those differences grow over time as priorities shift, roles and scopes change, and re-orgs occur. Dealing with those differences is a challenge, and something we all need to work on minimizing where possible. But that's not really the point of today's post, just some context.

Today's about regular expressions and avoiding customer pain. The connection between hierarchies and regular expressions is that while we have hierarchies in our resources budgets, Kubernetes explicitly doesn't have hierarchies in namespaces, and it uses namespaces to manage resource budgets and access to resources. The problem is how to bridge the gap between the two.

The solution we've implemented involves YAML files and nesting. The YAML file lets us define a set of resource constraints that (at least closely) match the existing hierarchies, but can be flattened into something Kubernetes can understand. We do this by building the flattened name for a given namespace from all of its parents. It's a neat trick, but what has that got to do with regular expressions?
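Something like this sketch (the types, separator, and names here are assumptions for illustration, not our actual config): walk the tree, building each namespace's flattened name from all of its parents.

```go
package namespaces

// Node is a hypothetical stand-in for one entry in the YAML hierarchy.
type Node struct {
	Name     string
	Children []Node
}

// flatten produces one flat, Kubernetes-friendly name per node by
// prefixing each node's name with the names of all of its parents.
func flatten(n Node, prefix string) []string {
	name := n.Name
	if prefix != "" {
		name = prefix + "-" + n.Name
	}
	names := []string{name}
	for _, c := range n.Children {
		names = append(names, flatten(c, name)...)
	}
	return names
}
```

So an org → team → project hierarchy flattens to `org`, `org-team`, and `org-team-project`, and the tree lives on in the names.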

Like I said a month ago, regular expressions are dangerous things, but here we need them. Because another Kubernetes feature is resource isolation by namespace. Which means that if we create a resource, say an NFS mount, inside a given namespace it can only be used inside that namespace. Which means if we want to make a resource available to all of the namespaces in a sub-tree we need to create it in the sub-tree. Or what would be the sub-tree if we had one.

This is important because when we create a new namespace that is logically in a tree we want it to have access to things that the parent has. And we got that wrong a few times. We would create the new namespace, but not make things available to it because we would forget to add them to the allow list. And our customers would complain. And we would feel bad.

And that's where the regular expression comes in. Because we build the name of a namespace from its parents, we can look at the name and decide if a given namespace belongs to a sub-tree. It's easy to define a regex that looks at the start of a string, decides if the beginning matches a pattern, and doesn't care about the rest. Something like `/^<ThingToLookFor>.*/`. Instead of looking for a string equal, treat the desired string as a regex, then look for a match. Simple. Easy.

Mostly. First of all, we need to be backward compatible. In this case that was pretty easy. `/<ThingToLookFor>/` is a perfectly valid regex. It's the regex equivalent of string equals, so we've got that going for us. Second, what happens if the template isn't a valid regex? These things are in a config file, so we can't fail the compilation. In this case we've chosen to validate at build time AND validate when reading the config file. This should keep us safe from runtime panics. We've still got to worry about extra or missed matches, but at least it will run.
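Here's a sketch of both halves of that in Go (the function and variable names are made up): compile the template when the config is read, so a bad pattern fails loudly up front, then use the anchored pattern instead of string equality.

```go
package nsmatch

import (
	"fmt"
	"regexp"
)

// namespacesMatching validates the configured pattern and returns every
// namespace whose name starts with a match. A plain string like
// "org-team" is a valid regex, so the old exact-match configs still work.
func namespacesMatching(pattern string, namespaces []string) ([]string, error) {
	re, err := regexp.Compile("^" + pattern) // fails at config-read time, not at runtime
	if err != nil {
		return nil, fmt.Errorf("invalid namespace pattern %q: %w", pattern, err)
	}
	var matches []string
	for _, ns := range namespaces {
		if re.MatchString(ns) {
			matches = append(matches, ns)
		}
	}
	return matches, nil
}
```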

And as long as we don't mess up the regex, when we add new leaves to the virtual tree they will automatically inherit the things they should from the parent.

by Leon Rosenshein

Time Passes

I've mentioned before that time is hard. From leap days/seconds to time zone definition changes to GPS time resetting to 0 every Sunday, it's just not that simple.

But it's even worse than that. Consider the more functional languages, such as Haskell, Erlang, and OCaml. In the purest of them, functions don't have side effects, because they can't. Except… One side effect you can't prevent, even in the most functional of languages, is the passage of time. The result of calling time.Now() before and after the function call will be different. And that's a side effect.

Of course, you can't change time. Unless you're running a simulation. Then things are different. If you're running a simulation then one of the first things you need to simulate is time. And then you need to rigidly control it. If you're running a distributed simulation you also need to make sure that all of the nodes in your system always agree on what time it is. And that's hard. Doable, but hard. If you don't, someone will figure out a way to deal with it or come up with a plausible (but wrong) explanation.

Back when I was doing aerospace work I was working with some folks doing simulations of the Northrop YF-23, and a big part of the testing was M vs. N air combat. This was a hard real-time system, with most of the effort going into accurately simulating the performance and response of the YF-23s. Everything else in the simulation was just there to provide a realistic environment. And the pilots of the other aircraft complained that at the beginning of the simulation their aircraft were "heavy" and unresponsive, but after a while they started to respond better.

Turns out that wasn't the case. Because it was a hard real-time system and everything had to happen within the fixed frame time, the scheduler, to make everything fit, would take the things at the end of the list of tasks, put some of them in the even frames and some in the odd frames, and tell them to simulate twice the time. While that's numerically accurate, it's not quite right, because it doubled the transport delay and latency, which the pilots felt as the aircraft being "heavy" and unresponsive. In reality, as the number of simulated objects shrank, eventually everything ran every frame, and the pilots felt better. Whether this was a bug or a feature is a different question, but it was only possible because the system had complete control over time in the simulation and could do whatever it needed to keep it in sync with wall time.

Another place that time as a side-effect can bite you is in unit tests. Luckily, unit tests are kind of like simulations, in that you can control time there as well. Or you can if you access time as an injected dependency. Then you can mock it and get time to do what you want, not what it ends up doing on its own. Which makes it possible to test all of those edge cases without writing long tests with random sleeps or ending up with flaky tests. And we all want to eliminate flaky tests.
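A minimal sketch of what that injection might look like in Go (the names are made up); production code asks the interface, not the system clock, what time it is:

```go
package clock

import "time"

// Clock is the injected dependency.
type Clock interface {
	Now() time.Time
}

// RealClock is what production wires in.
type RealClock struct{}

func (RealClock) Now() time.Time { return time.Now() }

// FakeClock is what tests wire in. Time only moves when the test says so.
type FakeClock struct{ T time.Time }

func (f *FakeClock) Now() time.Time { return f.T }

// Advance moves the fake clock forward deterministically.
func (f *FakeClock) Advance(d time.Duration) { f.T = f.T.Add(d) }
```

In a test you hand the code under test a FakeClock and Advance() it to hit the edge cases. No sleeps, no flakes.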

So remember, time is hard, so be careful where you get it.

by Leon Rosenshein

Radar Time

The new Thoughtworks radar is out. There are a handful of things that I'm really interested in, and high on the list are Security Policy as Code and Run Cost as an Architectural Fitness Function (since we're cloud first). Also interesting are Data Meshes and Kustomize (which we are already using for a lot of our Spinnaker deployments). And of course there's the perennial Zero-Trust and Rust. What's on your radar?

by Leon Rosenshein

Commits vs. PRs

For each desired change, make the change easy (warning: this may be hard), then make the easy change

    -- Kent Beck

Pretty good rule of thumb. Similar to the Unix way: do one thing and do it well. Then pipe them together to reach the end goal. But what has that got to do with commits and pull requests?

This. For small, additive changes, a single commit (modulo bug fixes/typos) is often all you need to implement your change, so you have a 1:1 ratio between commits and pull requests. On the other hand, if you've been avoiding unneeded generalization and layers of abstraction there will come a time when you do need that generalization/abstraction. And that's the time to implement it. Now you could do the big bang approach, and have a single commit that refactors the code into the new layout, implements the current functionality with the new layers, and also adds the new bits.

Most would agree that a single commit that does all that would be sub-optimal. You can't go back a step. You can't test the refactor to make sure that everything still works the same before you add anything. It's hard to know where the refactor stops and the new change begins. So let's not do it that way.

Instead, let's break the change down into steps. First remove any cruft. Make sure everything still works exactly the same. Bug for bug compatibility. That's a nice commit right there. Then do the basic refactor. Again, make sure everything still works exactly the same. That's another commit. There's probably some nearby code that should change to take advantage of the new refactor. Verify you still haven't actually changed results. Another commit. That's three commits and you've verified that there are no external changes. Have you done any work? Yes. You've reduced entropy. You've made it easy to make the change you started out to make. You've made it possible to actually determine the delta for the change you want, and to test the change you want to make.

Now you can make the change that started this whole journey. And test that the only thing that changed is the result you're looking for. So you've got at least 4 commits, plus however many you made along the way for checkpoints, experimentation, and "just in case".

So here's the question. How many PRs are there? For a long time I would do that as a single PR. But lately I've been thinking that's not the best approach. Think about what the world would be like if each of those 4 commits was also a PR. Think about it as a reviewer. Each of those PRs is isolated. The purpose of the PR is clear, and for 3 of them you can verify that nothing external changed. You can start using those changes in your own future work. And when the change shows up in that last PR you can see just what the change is and how it interacts with existing code. That's much easier to review and you're less likely to miss something than in a single 75-file mixed PR.

And, it's also the agile way. Preferring working code over docs, responding to change, continuously adding value. Think about it.

by Leon Rosenshein

To Float Or Not To Float

They say money makes the world go 'round. And these days computers are a big part of how the money moves. From accounting to checkbook management to online purchases, it's all about the Benjamins. Unfortunately, computers are really bad with money.

Which is odd, because what computers are really good at is arithmetic. They can add, subtract, multiply, and, except for some early Pentiums, divide. Pretty much everything else they do is just doing that really quickly in complex patterns. So why are they so bad with money?

It comes down to the fact that computers do everything in base 2, and the only values a binary fraction can represent exactly are sums of powers of two. Integers are fine, but most decimal fractions, like 0.1, repeat forever in binary, so they can only be approximated. Unfortunately, there are an infinite number of real numbers between any two integers, and almost all the programming languages we use attempt to model them with approximations. Floats, Doubles, and whatever we eventually call the thing after Doubles, are just more and more accurate attempts.

But no matter how much precision you have, it's still an approximation. Your typical Float has 7 digits of precision. When you're dealing with money that can be a problem. Not for the kind of transactions that people typically deal with, but when the numbers get large, things happen. 

We ran into this when we were working on Falcon 4.0. We modeled the entire Korean peninsula to handle a dynamic simulation of the air-land battle. Everything worked fine below the DMZ, but as the front line moved north the land part of the battle ground to a halt. Our tanks and infantry literally couldn't move. It turned out that the combination of velocity and simulation step size meant that each time we advanced the simulation, the forward distance the vehicles traveled was less than the minimum quantum of a floating point number that far from the origin. So nothing moved. To keep things moving we ended up moving the origin of the grid from the southwest corner of the map to the center. Then things could move at both extremes.

The same thing can (and will, if you're not careful) happen with money. If you have over $1,000,000 and you try to calculate interest over time then your interest calculations are going to be wrong. And the more money you have, the more inaccurate they'll be. You can switch from floats to doubles to long doubles, but that just buys you time, not a real solution.
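You can watch the failure happen with a few lines of Go. At a million dollars a float32's quantum is already 6.25 cents, so a penny of interest just vanishes:

```go
package main

import "fmt"

func main() {
	var balance float32 = 1_000_000.00
	balance += 0.01               // credit one penny of interest
	fmt.Printf("%.2f\n", balance) // prints 1000000.00 — the penny is gone
}
```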

So what can you do? One option might be to model pennies instead of dollars. Then you're dealing with integers instead of approximations, but that doesn't really work either. Your percentages are still non-integers. So you multiply them by some factor to make them whole numbers. Which works for a while, until someone wants to use pennies per thousand dollars instead of pennies per hundred dollars for a mill levy.
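A quick sketch of why integer pennies only delays the problem (the numbers are made up): as soon as the rate isn't a whole number of your chosen units, integer division quietly throws away the remainder.

```go
package main

import "fmt"

func main() {
	principalCents := int64(1050)               // $10.50, exact
	interestCents := principalCents * 45 / 1000 // 4.5% expressed as 45/1000
	fmt.Println(interestCents)                  // 47 — the real answer is 47.25 cents
}
```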

SQL has a solution: the Decimal type. Which makes sense, since the early databases were all about tracking money. Instead of approximating, you specify how many digits you want before and after the decimal point and it does the right thing in the background. But it's slow, and it's not available everywhere, so it's not a general solution. Java's BigDecimal works the same way.
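If you're in Go, the standard library's math/big can play the same role. A sketch (the amounts are made up): keep money as exact rationals and only round when you print.

```go
package main

import (
	"fmt"
	"math/big"
)

func main() {
	principal := big.NewRat(100_000_000, 100) // $1,000,000.00, exact
	rate := big.NewRat(45, 1000)              // 4.5%, exact
	interest := new(big.Rat).Mul(principal, rate)
	fmt.Println(interest.FloatString(2)) // 45000.00 — no drift, no lost pennies
}
```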

So, before you go stuffing your dollars and cents into a Float, think about your boundary cases. Maybe a double or long long is enough. Maybe you can switch to integers, or maybe you need to find a Decimal library for your language. It depends.

by Leon Rosenshein

Your Personal Board

Companies have a board of directors to provide oversight, guidance, and advice. They're generally not involved in the day to day operations of the company, and are composed of people with different experiences and backgrounds, often from other industries. They work to advance the company, but are not beholden to it.

That's nice, but how is it relevant to a developer? It's relevant because the main role of a company's board is to help the company progress and grow. As a developer you want the same thing. You want career progression and growth. And that's why you should have a personal board of directors.

Now a personal board is much less formal than a corporate board. There are no articles of incorporation, the positions are unpaid, and often the people on the board don't know they're on it. Your board is a set of people who you trust to give you honest advice and help open doors for you.

So who goes on your personal board? The specific people are up to you, but there are a few roles you want to fill. You'll want at least one traditional mentor. Someone you can go to and actively seek advice and guidance. The person you go to when you have a particularly thorny problem you need to work through. And you might have more than one if you have different types of problems.

You'll want someone who knows people and can open doors for you. Someone who can provide introductions to the people you really need to talk to. Whatever it is you're doing, it probably impacts and involves other people/teams, and this person can get you in touch with the appropriate counterpart on the other side and facilitate the initial discussion. Again, you might have more than one of these depending on what you're doing.

You'll probably need someone who can get things done for you. Someone who, if you get them on board, will bring others with them. A force multiplier so that after you get them on board you can focus on something else. This board member is the most situational, as that kind of influence is often specific.

Finally, and surprisingly, you're going to want an honest critic. Someone who can be critical of what you're saying/doing, and do it in a way that provides clear, actionable feedback. This is the person who helps you say things not just in a way that can be understood, but in a way that can't be misunderstood. They point out the holes and flaws in your arguments and help you fix them.

Remember, you're responsible for your career, so you should do what you can to improve it. That starts by getting the help and advice you need.

by Leon Rosenshein

Stories vs. Tasks

We Scrum. Or at least we Sprint. And our sprints come complete with Stories and Tasks. Just ask Jira. Stories and tasks are both important. Together, along with Epics and Initiatives, they make up our backlogs, both Sprint and Product.

Stories have phrases like "As <persona>",  "I should be able to <X>", and "so that I can <Y>". They have Acceptance Criteria that describe what done looks like. They often (almost always) require work on multiple components. And above all, when they're done the user/customer gets added value.

Tasks, on the other hand, are things that need to be done. They are defined by imperative statements. Like "Do <X>", "Automate <Y>", or "Write postmortem for <Z>". The definition of done for a task is implicit in its imperative nature. Just follow the instructions (along with all of their implied NFRs) and you're done.

What's important is not confusing the two. A Story will encompass multiple tasks, but the individual tasks are not stories. It's always a temptation, when a Story is too large, to break it down into smaller Stories. And we should. Being Agile, at its heart, is about continuously delivering value to the customer. If you can, deliver some value sooner.

What we need to avoid is breaking a Story down into tasks, and then treating the tasks like stories. Some Stories are just that big, or touch too many components, or are serialized in a way that keeps them from being done in parallel. In those cases, avoid the temptation to promote the tasks into Stories. To help planning and set expectations, create the tasks and add them to the backlog. Consider them, and the time it takes to do them, when doing Sprint planning. But remember that tasks, in and of themselves, don't add customer value.

Remember. The goal is to add customer value, and to do that you need to know which things on the backlog add customer value and which things are steps. Because it's easy to maximize items closed/velocity without maximizing value add. And that leads to the dreaded feature factory, which we all want to avoid. But that's a subject for another day.

by Leon Rosenshein

A Fiddler On The Roof

Have you ever been on a high performing team? Not just a good team, with a good manager, where things mostly get done on time and people enjoy their work and team, but a truly high performing team? A team where you can't wait to start the day. Where problems seem to evaporate before your eyes and the team accomplishes more than anyone expected? The team has an outsized impact and other folks look at the team and want to join?

Most of the teams I've been on have been good, solid teams, doing good work and making a difference. I've also been on a few high performing teams, and that is a wondrous experience. You learn, you have fun, and you have an impact. The question is, what turns a good team into a high performing team?

It's not just having the smartest people on the team. How many all-star teams in whatever your favorite sport is could beat that year's champion? Very few, if any. Because while those players might be the best at their position, they haven't spent enough time as a team to play together. The reverse is also true. Back when I was building my first ethernet network, none of us knew how to do it, only that it could be done. So we built a multi-cockpit man-in-the-loop simulator out of the tools we had.

It's not just having good management. Bad management can destroy a team. Micro-managing, information hoarding, and favoritism are all things that will keep a team from being high performing. On the other hand, the absence of those doesn't mean you'll be high performing.

So what makes a high performing team? I think there are a couple of enabling things. The first is a good mix of skills and levels. The team needs to have people who are willing to lead in all of the areas that need leadership, and willing to follow in the areas where there is leadership. Ideally you have people who have experience in the problem space and know some of the pitfalls, and people who are willing to learn and/or challenge the status quo and figure out new ways as well.

Second, you need to understand that performance takes time to develop, and there are stages people and teams go through on their journey to high performance. The best models I've found for these are Hersey and Blanchard's Situational Leadership for individuals and Tuckman's Stages of Group Development for teams. And if you look at them you'll see that individuals and teams go through mostly the same stages.

But what turns a group of D3s (from Hersey and Blanchard), capable, cautious performers who are Norming (Tuckman) as a good team, into a high performing team? That, as Tevye said, I can answer in one word. Tradition. Ok, not tradition as Tevye described it, but a tradition of TRUST.

Trust within the team. Trust that the entire team is going in the same direction. Trust that you all have the same goals. Trust that the team has your back. Trust that the little things won't fall through the cracks. Trust that if someone says they'll do something it will get done, or you'll know it won't with plenty of time to adjust. Basically, the folks on the team trust each other to work together in support of each other and the team.

Trust of the team. Trust outside the team that the team knows when it can make its own decisions, trust that it knows when it needs external inputs to make those decisions, and trust that it knows when to provide options to others for them to make decisions. In short, the organization trusts the team to do the right thing, not micro-managing or constraining the team.

Trust of the organization by the team. Trust that there will be support from the org. Trust that the overall goals and direction are not going to be changed on a whim. Trust that people are listening to the team. Trust that no one is trying to undermine the team. Or, in other words, trust that what the team is doing is important and that the team will be allowed to do it.

And maybe, when you think about it, that tradition of trust isn't all that different from the tradition that Tevye was talking about.

by Leon Rosenshein

Uncle Bob Says

Though we may try, and try, there are times when we cannot find a way to make the code express our intent; and so we resort, in the end, to a comment. 

We acknowledge that each comment we write is finally due to our failure to express ourselves in code.

    -- Uncle Bob Martin

I sort of disagree. Not with the first part. We should always try to write code that clearly expresses our intent. We use clear names. We break things down into digestible pieces. We try not to rely on side-effects. But sometimes it happens. Too many edge cases, the API that provides the information doesn't match the use case, or maybe it's just that complex. In that case we write comments to help the maintainer understand what is happening.

The second part though, I do disagree with. There are many comments that are not there to make up for failure to express in code, but to reduce cognitive load and to provide information that shouldn't be expressed in code.

Wait, why do I need information in my code that shouldn't be expressed in the code? The most common thing is "Why". Your code can, and should, be as expressive as possible, and make it clear that you're setting the ignore_duplicates flag on your DB insert sometimes, but not others. What the next person wants to know is why you're setting (or not setting) it. That requires a comment. If the API you're using has some unusual quirks, let the maintainer know. Sure, they might be able to find the source for the API and hope it's expressive enough to shed light, but don't subject them to that.
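A made-up example of the difference (hypothetical function and flag, not a real API): the code says what it's doing; only the comment can say why.

```go
package events

type insertOptions struct {
	IgnoreDuplicates bool
}

// saveEvent stands in for a real DB insert.
func saveEvent(eventID string, opts insertOptions) error {
	// ... insert into the events table ...
	return nil
}

func handleEvent(eventID string) error {
	// Why IgnoreDuplicates: upstream retries can redeliver the same event,
	// and the table has a unique key on event_id, so a duplicate here is
	// expected, not an error. Don't "fix" this by removing the flag.
	return saveEvent(eventID, insertOptions{IgnoreDuplicates: true})
}
```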

Another good use of a comment would be to explain why you're NOT doing something. If there's an API that appears to do what you want and you choose not to use it, say why. Otherwise the next person to come along will replace that carefully thought-out block of code with a call to the API, and introduce the bug you were trying to avoid.

However, there's a responsibility there as well. You need to make sure that the comment is correct. And stays correct. If the code changes then the comment needs to be evaluated and potentially changed. There are few things in development that will make you want to hunt down the last person to change the code more than relying on a comment that's wrong. It feels even worse when you find out that the last person is you.