
by Leon Rosenshein

Pictures Or It Didn't Happen

One of the oddities of software is that the vast majority of the work can't be seen. It's just a series of 0s and 1s stored somewhere. You can spend man-years paying down technical debt or setting yourself up for the future and have nothing visible to show for it.

When I was working on the Global Ortho project we had something we called the MBT spreadsheet, the Master Block Tracking list. It started life as a PM's way of tracking the status of the 1-degree cells (blocks) through the system, from planning to acquisition, ingest, initial processing, color balancing, 3D model generation, then delivery and going live to the public. There were only 5-10 blocks in progress at any given time, and it was all manual.

Then we scaled up. There were about 1500 blocks in total to process and we usually had more than 100 in the pipe at any time. It got to be too much for one person to handle. So our production staff did what they usually do: a bunch of people did a bunch of manual work outside the system and just stuffed the results in. No one knew how it worked or trusted it. But the powers that be liked the idea, and they really liked the overview that let them know the overall state and let them drill down for details when they wanted.

So that gave us our boundaries. A bunch of databases and tables with truth, and a display format we needed to match. We just needed to connect the two. The first thing we did was talk to the production team to figure out what they did. We got lots of words, but even the people who wrote down what they did weren't sure if that was right. So we went to the trusty whiteboard. We followed the data and diagrammed everything. We figured out how things were actually flowing. That was one set of architectural diagrams. But they were only for reference.

Then we took the requirements and diagrammed how we wanted the data to flow, including error cases and loops. It was really high level, more of a logic diagram. All of the "truth" was in a magic database. All of the rules were in another magic box. The state machine was a box. Using those few boxes, plus a few others (input, display, processing, etc.), we figured out how to handle all of the known use cases and some others we came up with along the way.
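
If you sketched that state machine box in code, it might look something like this. A minimal Python sketch: the state names come from the pipeline stages above, but the transitions, the rework loop, and the block id are all illustrative, not the real rules.

```python
from enum import Enum, auto

class BlockState(Enum):
    PLANNING = auto()
    ACQUISITION = auto()
    INGEST = auto()
    INITIAL_PROCESSING = auto()
    COLOR_BALANCING = auto()
    MODEL_GENERATION = auto()
    DELIVERY = auto()
    LIVE = auto()
    NEEDS_REWORK = auto()  # the error/loop cases we had to diagram

# Allowed transitions between states. Illustrative only.
TRANSITIONS = {
    BlockState.PLANNING: {BlockState.ACQUISITION},
    BlockState.ACQUISITION: {BlockState.INGEST, BlockState.NEEDS_REWORK},
    BlockState.INGEST: {BlockState.INITIAL_PROCESSING, BlockState.NEEDS_REWORK},
    BlockState.INITIAL_PROCESSING: {BlockState.COLOR_BALANCING, BlockState.NEEDS_REWORK},
    BlockState.COLOR_BALANCING: {BlockState.MODEL_GENERATION, BlockState.NEEDS_REWORK},
    BlockState.MODEL_GENERATION: {BlockState.DELIVERY, BlockState.NEEDS_REWORK},
    BlockState.DELIVERY: {BlockState.LIVE},
    BlockState.NEEDS_REWORK: {BlockState.ACQUISITION, BlockState.INGEST},
}

def advance(block_id: str, current: BlockState, target: BlockState) -> BlockState:
    """Move a block to a new state, refusing transitions the rules don't allow."""
    if target not in TRANSITIONS.get(current, set()):
        raise ValueError(f"{block_id}: illegal transition {current.name} -> {target.name}")
    return target

state = advance("N47W122", BlockState.PLANNING, BlockState.ACQUISITION)
```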

The next step was to map those magic ideal boxes onto things we had or could build: database tables and accessors, processing pipelines, state machines, manual tools, etc. That led to some changes in the higher level pictures because there was coupling we didn't have time to separate, so we adjusted along the way. But we still couldn't do anything, because this was a distributed system and we needed to figure out how the parts scaled and worked together.

And that led to a whole new round of diagrams that described how things scaled (vertically or horizontally) and how we would manage manual processes inside a mostly automated system. How we would handle errors, inconsistencies, and monitoring. How we would be able to be sure we could trust the results.

Once we had all those diagrams (and some even more detailed ones), we did it. We ended up with the same spreadsheet, but with another 10 tabs, connected to a bunch of tables across the different databases, with lots of VBA code downloading data then massaging it and holding it all together. Then we went back to the powers that be and showed them the same thing they were used to looking at and told them they could trust it.

Understandably they were a little surprised. We had about 3 man-months into the project and everything looked the same. They wanted to know what we had been doing all that time and why this was any different. Luckily, we expected that response and were prepared. We had all those architectural diagrams explaining what we did and why. And by starting at the high level and adding detail as needed for the audience we were able to convince them to trust the new MBT.


by Leon Rosenshein

Optimization

Premature optimization may or may not be the root of all evil, but optimizing the wrong thing certainly doesn't help. When we talk about optimization we're usually talking about minimizing time, and the classic example says that if you have two sub-processes and one takes 9.9 seconds and the other takes 0.1 seconds, you should spend your time on the first one, because even if you manage to eliminate the second entirely no one will notice. And that's an important thing to remember.
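
To put numbers on it, here's a quick back-of-the-envelope sketch in Python using that 9.9 s / 0.1 s split:

```python
def overall_speedup(total_before: float, part_before: float, part_after: float) -> float:
    """How much faster the whole process gets when one sub-process changes."""
    total_after = total_before - part_before + part_after
    return total_before / total_after

total = 9.9 + 0.1  # seconds

# Eliminate the 0.1 s sub-process entirely: the whole thing gets ~1% faster.
print(overall_speedup(total, 0.1, 0.0))    # ~1.01x

# Merely halve the 9.9 s sub-process: almost a 2x improvement.
print(overall_speedup(total, 9.9, 4.95))   # ~1.98x
```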

But optimization goes way beyond that. What if that process was part of a system that was only used by 1% of your user base, or the system had been retired? What if the question isn't what function should I make faster, but which bug should I fix or feature should I implement next? It goes back to one of my favorite questions. "What are you really trying to do?"

Because in almost all cases, what you're really trying to optimize is where to spend your time to provide maximum user value, and when. Before you can do that you have to define those four terms: you, user, value, and when.

And that's where scope comes in. "You" could be yourself, your team, your organization, ATG, or Uber Technologies. "User" has the same scope or larger. Are you working on something for internal consumption only, or does it accrue to an externally facing feature? Or is it something even larger, like public safety or the environment?

What does value mean? There's dollar value, but that might not be the right way to measure it. It could be time, or ease of use (customer delight), or more intangible, like brand recognition. And when will that value be realized? Will it be immediate, or will it be much further down the road?

Those are a lot of questions that you need to answer before you can optimize, and sometimes we don't have all the answers. The answers also impact each other. And we have a finite amount of resources; we can't do everything.

So what do we do? We optimize what we can see and understand with what we can do and make the optimal choices based on what we know. Then we observe the situation again and make some more optimizations. Then we do it again. And again. Kind of like a robot navigating down the road in an uncertain environment :)

by Leon Rosenshein

Cost Of Control

The other day I ran across a presentation with what appeared to be a very different take on incident management. It touched on some of my hot topics (hidden complexity, cognitive load, and their costs), so I was looking forward to seeing the recommendations and how I might be able to apply them. But the presentation kind of fizzled out, ending with some observations of how things were and identifying some issues and bottlenecks, but not taking the next step.

Thinking about it, I realized that what the presentation described was a C3 (command, control, and communication) system, and the difference between the examples that worked well and the ones that didn't wasn't the process or procedure; it was the way the teams worked together. Not the individuals on the team, not the incident commander (IC) or the person in the corner who came up with the answer alone, but how those individuals saw each other and worked together.

The example of a well-handled incident was the Apollo 13 emergency, and the counterexample was a typical service outage and its Zoom call. That took me back to something I learned about early in my Microsoft career: the four stages of team development. Forming, Storming, Norming, and Performing. There's no question that how the IC does things will impact the team's performance, but generally the IC isn't the one who does the technical work to solve the problem.

And that's how Gene Kranz brought the Apollo 13 astronauts home safely. Not by solving the problem or rigidly controlling the process and information flow, but by making sure the team had what it needed, sharing information as it became available, providing high level guidance, and above all, setting the tone. And he was able to do that because he had a high performing team: people who had worked together before, knew what the others were responsible for and capable of, and had earned each other's trust.

Compare that with a typical service outage. You get paged, open an incident, and then you try to figure out which upstream provider had an issue. You pull in that oncall, then someone adds the routing team, and then neteng gets involved. Pretty soon you've got a large team of experts who know everything they might need to know about their own system, but nothing about the other people and systems involved. It's a brand new team. And when a team is put together it goes through the forming/storming phases. The team needs to figure out their roles, so you end up with duplicate work and work that falls through the cracks. You get both rigid adherence to process and people quietly trying random things to see what sticks. We've got good people and basic trust, so we get through that quickly and solve the problem. Then we tear the team down, and the next time there's an incident the cycle starts again.

So what can we do about it? One thing Core Business is doing is the Ring0 idea: a group of people who have worked together and form the core of incident response, so the forming/storming is minimized. Another thing you can do is get to know the teams responsible for the tools and functionality you rely on. They don't need to be your best friends, but having that shared context really helps. Be aware of the stages of team development. Whether you're the IC or just a member of the team, understanding how others are feeling and how things look to them helps you communicate. Because when you get down to it, it's not command and control that will solve the problem, but communication.

Finally, if you're interested in talking about the 4 stages of teams let me know. There's a lot more to talk about there.

by Leon Rosenshein

The Art Of Computer Programming

A little history lesson today. Have you ever used a compiler? Written an algorithm? Used the search function inside a document? Then chances are you've used work by Donald Knuth. For the last 60-odd years he's been learning, teaching, doing, and writing about computer science and its fundamentals.

By far his greatest contribution to the field is his magnum opus, The Art of Computer Programming. It was originally going to be one book, but it's up to 4 volumes now and growing. It covers everything from the fundamentals to searching and sorting, numerics, combinatorics, data structures, and more.

by Leon Rosenshein

Scale Out, Not Up

If you've spent much time around folks who work on large scale data centers you've probably heard them talking about pets and cattle and wondered why that was. Yes, data centers are sometimes called server farms, but that's not why. Or at least it's not that simple.

It all goes back to the idea of scale and how that has changed over the years. 50 years ago computers were big. Room sized big. And they took a lot of specialized knowledge just to keep running, let alone doing anything useful. So every computer had a team of folks dedicated to keeping it happy and healthy. Then computers got smaller and more powerful and more numerous. So the team of specialists grew. But that can only scale so far. Eventually we ended up with more computers than maintainers, but they were still expensive, and they were all critical.

Every computer had a name. Main characters from books or movies, or the 7 dwarfs. Really big data centers used things like city or star names. It was nice. Use a name and everyone knew what you were talking about, and there just weren't that many of them. We had a render farm for Flight Sim that built the art assets for the game every night. They had simple names with a number, renderer-xx. But we only had 10 of them, so they all had to be working. Even the FlightSim team had a limited number of machines.

Eventually machines got cheap enough, and the workloads got large enough, that we ended up with thousands of nodes. And the fact is that you can't manage 5000 nodes the way you manage 5, or even 50. Instead, you figure out which ones are really important and which ones aren't. You treat the important ones carefully, and the rest are treated as replaceable. Or, to use the analogy, you have pets and cattle. Pets are special. They have names. You watch them closely, and when they're unwell you nurse them back to health. Cattle, on the other hand, have numbers. If one gets sick or lost or confused, who cares? There are 5000 more just like it and it can be replaced.

That concept, coupled with some good automation, is what allowed a small team to manage the 5000-node WBU2 data center. Out of those 5000 nodes we had 28 pets, 28 semi-pets, and 4900+ cattle. 8 of those 28 pets were the namenodes I talked about in our HDFS cluster. We really cared about them and made sure they were happy. The other 20 we watched carefully and made sure they were working, but re-imaging them was an easy fallback if they got sick. And we really didn't care about the cattle. At the first sign of trouble we'd reimage. If that didn't work we'd take the node offline and have someone fix the bad hardware. And our users never knew. Which was the whole point of pets and cattle.
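
The cattle-handling side of that automation boils down to surprisingly little logic. A hedged sketch; the helper functions are stand-ins for whatever reimaging, paging, and ticketing machinery a real fleet has, not our actual tools:

```python
def page_oncall(node: str) -> None:
    print(f"paging oncall: pet {node} is unhealthy")

def reimage(node: str) -> bool:
    print(f"reimaging {node}")
    return False  # pretend the reimage didn't fix it, to show the next step

def take_offline_for_repair(node: str) -> None:
    print(f"pulling {node} out of the pool for a hardware fix")

def handle_unhealthy_node(node: str, is_pet: bool) -> None:
    """Pets get a human right away; cattle get reimaged, then pulled if that fails."""
    if is_pet:
        page_oncall(node)
        return
    if not reimage(node):
        take_offline_for_repair(node)

handle_unhealthy_node("wbu2-worker-1234", is_pet=False)
```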

by Leon Rosenshein

IP Tables To The Rescue

HDFS is great for storing large amounts of data reliably, and in WBU2 we had the largest HDFS cluster at Uber and one of the largest in the world: 100 PB, 1000 nodes, 25,000 disks. HDFS manages all those resources, and it does so in a reasonably performant manner. It does it by keeping track of every directory, file, and file block. You can think of it like the master file table on a single disk drive. It knows the name of the file, where it is in the directory hierarchy, and its metadata (owner, perms, timing, the blocks that make it up, their location in the cluster, etc.). The datanodes, on the other hand, just keep track of the state and correctness of each block they have. For perf reasons the datanodes keep track of what info they've sent to the namenode and what the namenode has acknowledged seeing. From there they only send deltas. Much less information, so less traffic on the wire and fewer changes to process. Goodness all around. And to keep the datanodes from all hitting the namenode at once, the updates are splayed out over time.
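
The delta idea is just set arithmetic. A hedged sketch of what a datanode-side report might compute; this is the concept, not the actual HDFS implementation:

```python
def build_report(current_blocks: set[int], acked_blocks: set[int]) -> dict:
    """Report only what changed since the namenode last acknowledged our state."""
    return {
        "added": current_blocks - acked_blocks,    # blocks received since the last ack
        "removed": acked_blocks - current_blocks,  # blocks deleted/invalidated since then
    }

acked = {1, 2, 3}       # what the namenode has confirmed it knows about
current = {2, 3, 4, 5}  # what's actually on this datanode's disks
print(build_report(current, acked))   # {'added': {4, 5}, 'removed': {1}}
# Only after the namenode acks a report does the baseline advance; if the ack
# never arrives, the datanode has to resend, or fall back to a full report.
```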

The problem we had was that our users were creating lots of little files, and lots of very deep directory structures with those tiny files sprinkled through them. So the number of directories, files, and blocks grew. Quickly. To well over a billion.

At that point startup became a problem. We had a 1000-node thundering herd sending full updates to a single namenode. 1000 nodes splayed out over 20 minutes gives you just over 1 second per node. With 100 TB of disk per machine, transmitting and processing that data took well over a second. So the system backed up, and connections timed out, and the namenode forgot about the datanode, or the datanodes forgot about the namenode, and the cycle started all over again. So we had to do something.
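
The arithmetic behind that backup is worth spelling out. The 1000 nodes and 20-minute splay are from above; the per-report processing time is illustrative:

```python
nodes = 1000
splay = 20 * 60                   # seconds: full reports spread over 20 minutes
arrival_interval = splay / nodes  # 1.2 s between incoming full block reports
service_time = 1.5                # illustrative: "well over a second" to process one

arrivals_per_hour = 3600 / arrival_interval  # 3000 reports offered per hour
handled_per_hour = 3600 / service_time       # only 2400 can be processed
print(f"backlog grows by ~{arrivals_per_hour - handled_per_hour:.0f} reports per hour")
# And once reports start timing out and getting resent, the offered load only goes up.
```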

We could have turned everything off and then turned the datanodes back on slowly, but hardware really doesn't like that. Turning the nodes in a DC off/on is a really good way to break things, so we couldn't do that. And we didn't want to build/maintain a multi-host distributed state machine. We wanted a solution that ran in one place where we could watch and control it.

So what runs on one machine and lets you control what kind of network traffic gets in? IPTables, of course. Now IPTables is a powerful and dangerous tool, and this was right around the time someone misconfigured IPTables at Core Business and took an entire data center offline, requiring re-imaging to get it back, so we wanted to be careful. Luckily we knew which hosts would be contacting us and which port they would be contacting us on, so we could write very specific rules.

So that's what we did. Before we brought up the namenode we applied a bunch of IPTable rules. Then we watched the metrics and slowly let the datanodes talk. That got things working, but it still took 6+ hours and lots of attention. So we took it to the next level. Instead of doing things manually, we wrote a bash script that applied the initial rules, monitored the live metrics until the namenode was feeling OK, then released another set of nodes. That got us down to ~2.5 hours, fully automated. We still worried about it, since we were either down or had no redundancy during that time, but it was reliable and predictable.
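
Something with the shape of that automation might look like this. A Python sketch standing in for the bash script described above; the port, addresses, batch size, and health check are all placeholders:

```python
import subprocess
import time

NAMENODE_RPC_PORT = 8020   # assumption: the stock HDFS RPC port; ours differed
BATCH_SIZE = 50            # how many datanodes to let in per round
DATANODES = [f"10.0.{i // 256}.{i % 256}" for i in range(1000)]  # placeholder IPs

def iptables(*args: str) -> None:
    """Run one iptables command on the namenode host (needs root)."""
    subprocess.run(["iptables", *args], check=True)

def block_all_datanodes() -> None:
    # Drop everything aimed at the namenode RPC port before the namenode starts.
    iptables("-I", "INPUT", "-p", "tcp", "--dport", str(NAMENODE_RPC_PORT), "-j", "DROP")

def release(ip: str) -> None:
    # Insert an ACCEPT ahead of the DROP for one specific datanode.
    iptables("-I", "INPUT", "-p", "tcp", "-s", ip,
             "--dport", str(NAMENODE_RPC_PORT), "-j", "ACCEPT")

def namenode_is_healthy() -> bool:
    """Placeholder: poll whatever metrics you trust (RPC queue length, heap, GC pauses)."""
    return True

block_all_datanodes()
# ...start the namenode here...
for i in range(0, len(DATANODES), BATCH_SIZE):
    while not namenode_is_healthy():
        time.sleep(30)                   # let the namenode catch its breath
    for ip in DATANODES[i:i + BATCH_SIZE]:
        release(ip)
```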

And that's how we used IPTables to calm the thundering herd.

by Leon Rosenshein

The 3 Tells

I'm not talking about how to tell when the person across the poker table from you is bluffing. I'm talking about how you can effectively get your point across in a limited amount of time. You've got good ideas. Knowing how to get your point across, whether it's formal with a slide deck or a quick hallway conversation, makes it much more likely that those good ideas will become realities. And it starts with the three tells:

  1. Tell them what you're going to tell them and why they care
  2. Tell them
  3. Tell them what you told them and why it's important


That doesn't mean start every presentation with an agenda slide and end with a conclusion slide, both of which you read verbatim, although that's not a bad place to start. It means make sure you know and share the destination up front. And not just the destination, but why your audience wants to get there. Call it a hook or a vision, or just "What's in it for them". Make sure they understand why they should care about what you're saying.

Then you tell them. Tell them what they need to know. You know the subject. It's important to you. Let that enthusiasm show. But don't get carried away. Make sure you stay at the right level. Stay on message. Knowing what not to say is as important as knowing what to say. That doesn't mean glossing over problems, or that you don't need to know the details. Think of common questions and have the answers available, but wait for someone to let you know they're needed before you provide them.

It's like teaching physics. You use the appropriate level for your students. In middle school F = ma and s = ½at². Simple. Straightforward. Throw a baseball straight up at X ft/s. How high does it go? When does it get back to your hand? In high school you might mention air resistance. In college you throw things farther and faster. The curvature of the earth matters. There's friction, and gravity becomes a function of position. Suddenly you're talking about satellites in orbit. Relativistic effects and the changing shape of the Earth and the atmosphere become important. All those effects are real and important, but you don't want to generalize too much or get lost in the weeds. Freshman college physics should include friction, but probably doesn't need relativistic effects.
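
For the middle-school version, the whole answer fits in a few lines (metric units for convenience, and no air resistance, naturally):

```python
g = 9.8      # m/s^2
v = 20.0     # throw the ball straight up at 20 m/s

max_height = v ** 2 / (2 * g)  # ~20.4 m
time_to_return = 2 * v / g     # ~4.1 s until it's back in your hand
print(max_height, time_to_return)
```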

Finally, close it out. Go back to the high points and tie them in to why the audience cares. Don't repeat all the details or the related anecdotes. Don't rabbit hole. Stick to your hook/vision and how it can help your audience in whatever it is you're talking about. Make sure they know why everything you just said is important to them.

So there you have it. The three tells and how using them can help you get your point across and your ideas implemented. All by telling them what you're going to tell them, telling them, and finally, telling them what you told them.

See what I did there?

by Leon Rosenshein

How Much Config Is Enough?

Pop quiz: you're building an application. What parameters do you put in your config file? The answer, like the answer to any other architectural question, is "It depends." At one extreme you spend a lot of time with your customers and bake everything into the application. Changing anything, from the name on the splash screen to the fully qualified on-disk path to the splash screen image, means compiling a new binary. At the other extreme, the only thing the application knows how to do is read a configuration file that tells it where to find the dynamic libraries and custom DSL files full of business logic that actually do anything.

Like any other tradeoff, the right answer lives somewhere between those two extremes, and it is tightly coupled to your customer(s). As a rule of thumb, the bigger the spread of customers, whether it's scale (whatever that means), domain, expertise, commonality of use cases, etc., the more configurable you need to be. A simple single-user personal checkbook application doesn't need much configuration. Some user info, some financial institution info, some categorization, and maybe a couple of other things. A US tax preparation application, on the other hand, needs all that plus a way to handle the different states and rules, including ones that change or get clarified after you ship.

So how do you approach the configuration challenge? First, think about your customer and what they need to do. What do they need to customize, and how often does it change? Can you come up with a sensible default for your majority case? Is there a convention you can follow for that default? Maybe the right answer is a wizard that runs the first time (and maybe on demand) that walks the user through the setup. One nice thing about a wizard is that you can validate that the configuration chosen makes sense.

Another thing to think about is whether the defaults live in the config file or in the application. If they're in the file it's at least obvious what the knobs are, but then you get folks poking at them just to see what they do. And what if your customers don't make backups? They think they know what they want and change the default. How do they get it back? Maybe a better idea is a base configuration file they can't change, with an override file. Safer, and probably easier to recover, but each level of override makes it that much harder to know what the actual configuration is.
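
A minimal sketch of the base-plus-override idea (the file names and JSON format are made up; any format with a merge step works the same way):

```python
import json
from pathlib import Path

def load_config(base_path: str, override_path: str) -> dict:
    """Defaults ship with the app and stay read-only; user changes live in the override."""
    config = json.loads(Path(base_path).read_text())      # e.g. base.json
    override = Path(override_path)                        # e.g. user.json, optional
    if override.exists():
        config.update(json.loads(override.read_text()))   # user values win
    return config

# "Restoring the defaults" is just deleting or ignoring the override file. The cost:
# the effective configuration is no longer visible in any single file.
```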

But is a local configuration file the right place to store that information? In most cases, probably, but what about the enterprise? How do you centralize configuration? How do you keep things in sync? Central databases, Flagr, Puppet/Chef and pushed configuration? Semi-random user triggered pulls (`update-uber-home.sh` anyone)? All options, and unless you really understand the user problem you're trying to solve, you can't make the tradeoffs, let alone come up with the best answer.

When it comes time to figure out how to configure your configuration, it depends.

by Leon Rosenshein

Your Bug, Your Customer's Feature

It's not a bug, it's a feature. We've all heard it. We've probably all said it. But think about it. What exactly is a feature? It's the application doing something the customer wants, when the customer wants it done. It doesn't matter if that's what you expected. It doesn't matter if that's what the spec called for or what the PM really wanted. It's fulfilling a customer need. And to the customer, that's a feature.

Back when I was working on Falcon 4.0 we kept a lot of internal display information in a couple of buffers. To make runtime debugging easier we turned those buffers into a shared memory segment and ran external tools to display it. It was so useful that we created some additional structures and added other internal state. Really handy; it let us do things like write to a memory-mapped bitmap display (a Hercules graphics card) while DirectX had the main monitor.
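
The original lived in a Windows shared memory segment. Here's a hedged sketch of the same idea in Python using a memory-mapped file; the three-float layout is invented for illustration and is nothing like the real FlightData structure (that's in the link at the bottom of the post):

```python
import mmap
import struct

# Invented layout: airspeed, altitude, heading as three little-endian floats.
LAYOUT = struct.Struct("<3f")

# Writer side (inside the "sim"): publish internal state where any tool can see it.
with open("flightdata.shm", "wb") as f:
    f.truncate(LAYOUT.size)
with open("flightdata.shm", "r+b") as f:
    shm = mmap.mmap(f.fileno(), LAYOUT.size)
    shm[:LAYOUT.size] = LAYOUT.pack(250.0, 12000.0, 270.0)
    shm.flush()

# Reader side (an external gauge or display tool): map the same region and decode it.
with open("flightdata.shm", "r+b") as f:
    view = mmap.mmap(f.fileno(), LAYOUT.size)
    airspeed, altitude, heading = LAYOUT.unpack(view[:LAYOUT.size])
    print(airspeed, altitude, heading)
```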

The "bug" was leaving it on when we sent out beta versions to testers. Our testers were a creative bunch. Many of them had homemade cockpits. And multiple displays. One of them even had physical gauges that they used with someone else's product. And they found our shared memory. Then they decoded part of it. They used it to drive their own displays and gauges. Then, they thanked us for building such a useful feature and asked if the documentation of the memory layout was ready yet. And suddenly an internal debugging aid that slipped out to our beta testers became a feature. In this case it wasn't a big deal and didn't add a lot of support burden, but it could have.

When I worked on MS FlightSim, backward compatibility was a major feature of every release, and not just for the "supported" features. As a "platform" we chose to support all add-ons developed with the last 2 or 3 versions, and most of the rest, regardless of which bug, corner case, or side effect the add-on relied on. It made for great PR, but it was a lot of work.

So be careful what you allow to slip out. It will come back to haunt you.

NB: This is the kind of thing your users can do with your "bugs"

https://github.com/lightningviper/lightningstools/blob/master/src/F4SharedMem/FlightData.cs (Lines 177-229 are definitely from my original shared mem structure. I think the next 30 or so are. After that it's aftermarket stuff)

by Leon Rosenshein

Disk Errors

I've been working on datacenters and parallel computing for over 10 years now. Back in 2007 we had about 100PB of spinning disk and had our own home-grown distributed file system. Roughly equivalent to HDFS, but of course Windows (NTFS) based since we were part of Microsoft. We had filesets (roughly equivalent to an HDFS file) composed of individual NTFS files (blocks in HDFS terms), and instead of a namenode/journalnode we had a monstrous MS-SQL server that kept track of everything. And, like HDFS we had redundancy (nominally 3 copies) and that usually worked. But sometimes it didn't. And we couldn't figure out why.

Luckily, as part of Microsoft we had access to the folks who actually wrote NTFS and the drivers. So we talked to them. We explained what we were doing and why. Number of hosts. Number of disk drives. Number of files and filesets. Average file size. What our data rates were (read and write). We talked about our pipelines and validation. We listened to best practices. We learned that no one else was pushing Windows Server that hard with that many disks per node. And all through the discussion there was one guy in the back of the room scribbling away furiously.

After about 30 minutes of this, he picks his head up and says: "You have at least two problems. First, at those data rates, at least 7 times a day you're going to write a zero or a one to the disk and it's not going to stick. Second, in your quest for speed in validation, you're validating the disk's on-board write-through cache, not the actual data on disk."

We'd run into one of the wonders of statistics. All disk writes are probabilistic, but we were pushing so much data that we were pretty much guaranteed to have multiple write errors a day. And then we compounded the problem by doing a bad job of validating. So we fixed the validation problem (not much we could do about the physics) and things got a lot better. After that, over 10 years and scaling up to 330 PB of data, we only lost one file that wasn't caused by someone deleting it by mistake. Human error is a story for another day.
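
That "at least 7 times a day" falls straight out of the law of large numbers. A rough reconstruction; both numbers below are assumptions picked to make the arithmetic match the anecdote, not the real figures:

```python
# Drive spec sheets quote error rates on the order of 1 per 10^15 bits;
# treat this as an order-of-magnitude stand-in, not the number we were quoted.
bit_error_rate = 1e-15
bytes_written_per_day = 875e12    # assume ~875 TB/day of writes across the fleet

bits_per_day = bytes_written_per_day * 8
expected_bad_writes_per_day = bits_per_day * bit_error_rate
print(f"~{expected_bad_writes_per_day:.0f} bits per day written that don't stick")  # ~7
```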

But that didn't solve all of our problems. We solved the data loss problem, but not the availability problem. That was a whole different problem, caused because we were too clever for our own good. For safety we distributed our files randomly across the cluster. And many of our filesets contained hundreds of files, so we made sure we didn't put them together. That way, when we lost a node we didn't take out all of the replicas. Sounds good so far. But that caused its own problems.

We had about 2000 nodes. How do you do a rolling reboot of 2000 nodes when you can only take down 2 at a time? You do it 2 at a time, 1000 times. At 10 minutes a reboot that's 10,000 minutes, 166+ hours if everything goes smoothly. 20+ days if you do it during working hours. That sucks. Also, disks popped like popcorn. Power supplies failed. Motherboards died. On any given day we'd have 10-20 nodes down. You wouldn't think that was a problem, but the way we did things, filesets were the unit of availability, not files, so if one file was broken then that instance of the fileset was down. And three broken files on three different machines could, and often did, make all 3 copies of a fileset unavailable.

So how do you solve those problems? With intelligent placement. Since we had 3 copies of every file, we didn't spread things out randomly; we made triplets: 3 nodes that looked exactly like each other. You could reboot ⅓ of the nodes at once and do the whole lab in a morning. When a node went down we fixed it, and the other two carried the load. If we ever had 2 nodes in a triplet down, ops knew and jumped all over it. And that's how we solved the availability problem.
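
A toy simulation shows why the triplets fixed availability. All the numbers here are illustrative, roughly matching the ones in the post:

```python
import random

NODES = 1998              # ~2000 nodes, grouped evenly into triplets of 3
DOWN = set(random.sample(range(NODES), 15))   # a typical day: 10-20 nodes out
FILES_PER_INSTANCE = 300  # a fileset instance contains a few hundred files
FILESETS = 1000

def unavailable_random_placement() -> int:
    """Each of a fileset's 3 instances scatters its files across random nodes.
    One file on a down node takes out that instance; all 3 instances down
    makes the fileset unavailable."""
    count = 0
    for _ in range(FILESETS):
        instances_down = sum(
            any(n in DOWN for n in random.sample(range(NODES), FILES_PER_INSTANCE))
            for _ in range(3)
        )
        if instances_down == 3:
            count += 1
    return count

def unavailable_triplet_placement() -> int:
    """All 3 copies of a fileset live on the 3 nodes of one triplet, so the
    fileset is unavailable only when the whole triplet is down."""
    triplets = [range(i, i + 3) for i in range(0, NODES, 3)]
    count = 0
    for _ in range(FILESETS):
        if all(n in DOWN for n in random.choice(triplets)):
            count += 1
    return count

print("random placement: ", unavailable_random_placement(), "of", FILESETS, "filesets down")
print("triplet placement:", unavailable_triplet_placement(), "of", FILESETS, "filesets down")
```

With random placement most filesets end up with every copy touching at least one down node; with triplets, essentially none do.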

So what's the lesson here? We ran into issues no one had ever seen. Issues at all sorts of levels, from the firmware on the hard drives to the macro level of how you manage fleets of that size. Could we have foreseen them? Possibly, but we didn't. We had to recognize them as they occurred and deal with them. Because scale is a problem all its own. The law of large numbers turns the possible or occasional problem into a daily event that you just have to deal with. It's no longer an exception. It's just another state in the graph.