
by Leon Rosenshein

Pictures Or It Didn't Happen

One of the oddities of software is that the vast majority of the work can't be seen. It's just a series of 0s and 1s stored somewhere. You can spend man-years paying down technical debt or setting yourself up for the future and have nothing visible to show for it.

When I was working on the Global Ortho project we had something we called the MBT spreadsheet, the Master Block Tracking list. It started life as a PM's way of tracking the status of the 1-degree cells (blocks) through the system, from planning to acquisition, ingest, initial processing, color balancing, 3D model generation, then delivery and going live to the public. There were only 5-10 blocks in progress at any given time, and it was all manual.

Then we scaled up. There were about 1500 blocks in total to process and we usually had more than 100 in the pipe at any time. It got to be too much for one person to handle. So our production staff did what they usually do: a bunch of people did a bunch of manual work outside the system and just stuffed the results in. No one knew how it worked or trusted it. But the powers that be liked the idea, and they really liked the overview that let them know the overall state and let them drill down for details when they wanted.

So that gave us our boundaries. A bunch of databases and tables with truth, and a display format we needed to match. We just needed to connect the two. The first thing we did was talk to the production team to figure out what they did. We got lots of words, but even the people who wrote down what they did weren't sure if that was right. So we went to the trusty whiteboard. We followed the data and diagrammed everything. We figured out how things were actually flowing. That was one set of architectural diagrams. But they were only for reference.

Then we took the requirements and diagrammed how we wanted the data to flow, including error cases and loops. It was really high level, more of a logic diagram. All of the "truth" was in a magic database. All of the rules were in another magic box. The state machine was a box. Using those few boxes, plus a few others (input, display, processing, etc.), we figured out how to handle all of the known use cases and some others we came up with along the way.
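
If you sketched that state machine box in code, it might look something like this. A minimal Python sketch: the state names come from the pipeline stages above, but the transitions, the rework loop, and the block id are all illustrative, not the real rules.

```python
from enum import Enum, auto

class BlockState(Enum):
    PLANNING = auto()
    ACQUISITION = auto()
    INGEST = auto()
    INITIAL_PROCESSING = auto()
    COLOR_BALANCING = auto()
    MODEL_GENERATION = auto()
    DELIVERY = auto()
    LIVE = auto()
    NEEDS_REWORK = auto()  # the error/loop cases we had to diagram

# Allowed transitions between states. Illustrative only.
TRANSITIONS = {
    BlockState.PLANNING: {BlockState.ACQUISITION},
    BlockState.ACQUISITION: {BlockState.INGEST, BlockState.NEEDS_REWORK},
    BlockState.INGEST: {BlockState.INITIAL_PROCESSING, BlockState.NEEDS_REWORK},
    BlockState.INITIAL_PROCESSING: {BlockState.COLOR_BALANCING, BlockState.NEEDS_REWORK},
    BlockState.COLOR_BALANCING: {BlockState.MODEL_GENERATION, BlockState.NEEDS_REWORK},
    BlockState.MODEL_GENERATION: {BlockState.DELIVERY, BlockState.NEEDS_REWORK},
    BlockState.DELIVERY: {BlockState.LIVE},
    BlockState.NEEDS_REWORK: {BlockState.ACQUISITION, BlockState.INGEST},
}

def advance(block_id: str, current: BlockState, target: BlockState) -> BlockState:
    """Move a block to a new state, refusing transitions the rules don't allow."""
    if target not in TRANSITIONS.get(current, set()):
        raise ValueError(f"{block_id}: illegal transition {current.name} -> {target.name}")
    return target

state = advance("N47W122", BlockState.PLANNING, BlockState.ACQUISITION)
```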

The next step was to map those magic ideal boxes onto things we had or could build: database tables and accessors, processing pipelines, state machines, manual tools, etc. That led to some changes in the higher level pictures because there was coupling we didn't have time to separate, so we adjusted along the way. But we still couldn't do anything, because this was a distributed system and we needed to figure out how the parts scaled and worked together.

And that led to a whole new round of diagrams that described how things scaled (vertically or horizontally) and how we would manage manual processes inside a mostly automated system. How we would handle errors, inconsistencies, and monitoring. How we would be able to be sure we could trust the results.

Once we had all those diagrams (and some even more detailed ones), we did it. We ended up with the same spreadsheet, but with another 10 tabs, connected to a bunch of tables across the different databases, with lots of VBA code downloading data then massaging it and holding it all together. Then we went back to the powers that be and showed them the same thing they were used to looking at and told them they could trust it.

Understandably they were a little surprised. We had about 3 man-months into the project and everything looked the same. They wanted to know what we had been doing all that time and why this was any different. Luckily, we expected that response and were prepared. We had all those architectural diagrams explaining what we did and why. And by starting at the high level and adding detail as needed for the audience we were able to convince them to trust the new MBT.


by Leon Rosenshein

Optimization

Premature optimization may or may not be the root of all evil, but optimizing the wrong thing certainly doesn't help. When we talk about optimization we're usually talking about minimizing time, and the classic example says that if you have two sub-processes and one takes 9.9 seconds and the other takes 0.1 seconds, you should spend your time on the first one, because even if you manage to eliminate the second entirely no one will notice. And that's an important thing to remember.
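
To put numbers on it, here's a quick back-of-the-envelope sketch in Python using that 9.9 s / 0.1 s split:

```python
def overall_speedup(total_before: float, part_before: float, part_after: float) -> float:
    """How much faster the whole process gets when one sub-process changes."""
    total_after = total_before - part_before + part_after
    return total_before / total_after

total = 9.9 + 0.1  # seconds

# Eliminate the 0.1 s sub-process entirely: the whole thing gets ~1% faster.
print(overall_speedup(total, 0.1, 0.0))    # ~1.01x

# Merely halve the 9.9 s sub-process: almost a 2x improvement.
print(overall_speedup(total, 9.9, 4.95))   # ~1.98x
```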

But optimization goes way beyond that. What if that process was part of a system that was only used by 1% of your user base, or the system had been retired? What if the question isn't what function should I make faster, but which bug should I fix or feature should I implement next? It goes back to one of my favorite questions. "What are you really trying to do?"

Because in almost all cases, what you're really trying to optimize is where to spend your time to provide maximum user value, and when. Before you can do that you have to define those four terms: you, user, value, and when.

And that's where scope comes in. "You" could be yourself, your team, your organization, ATG, or Uber Technologies. "User" has the same scope or larger. Are you working on something for internal consumption only, or does it accrue to an externally facing feature? Or is it something even larger, like public safety or the environment?

What does value mean? There's dollar value, but that might not be the right way to measure it. It could be time, or ease of use (customer delight), or more intangible, like brand recognition. And when will that value be realized? Will it be immediate, or will it be much further down the road?

Those are a lot of questions that you need to answer before you can optimize, and sometimes we don't have all the answers. The answers also impact each other. And we have a finite amount of resources; we can't do everything.

So what do we do? We optimize what we can see and understand with what we can do and make the optimal choices based on what we know. Then we observe the situation again and make some more optimizations. Then we do it again. And again. Kind of like a robot navigating down the road in an uncertain environment :)

by Leon Rosenshein

Cost Of Control

The other day I ran across a presentation with what appeared to be a very different take on incident management. It touched on some of my hot topics (hidden complexity, cognitive load, and their costs), so I was looking forward to seeing the recommendations and how I might be able to apply them. But the presentation kind of fizzled out, ending with some observations of how things were and identifying some issues and bottlenecks, but not taking the next step.

Thinking about it, I realized that what the presentation described was a C3 (command, control, and communication) system, and the difference between the examples that worked well and the ones that didn't wasn't the process or procedure; it was the way the teams worked together. Not the individuals on the team, not the incident commander (IC) or the person in the corner who came up with the answer alone, but how those individuals saw each other and worked together.

The example of a well-handled incident was the Apollo 13 emergency, and the counterexample was a typical service outage and its Zoom call. That took me back to something I learned about early in my Microsoft career: the four stages of team development. Forming, Storming, Norming, and Performing. There's no question that how the IC does things will impact the team's performance, but generally the IC isn't the one who does the technical work to solve the problem.

And that's how Gene Kranz brought the Apollo 13 astronauts home safely. Not by solving the problem or rigidly controlling the process and information flow, but by making sure the team had what it needed, sharing information as it became available, providing high level guidance, and above all, setting the tone. And he was able to do that because he had a high performing team: people who had worked together before, knew what the others were responsible for and capable of, and had earned each other's trust.

Compare that with a typical service outage. You get paged, open an incident, and then you try to figure out which upstream provider had an issue. You pull in that oncall, then someone adds the routing team, and then neteng gets involved. Pretty soon you've got a large team of experts who know everything they might need to know about their own system, but nothing about the other people and systems involved. It's a brand new team. And when a team is put together it goes through the forming/storming phases. The team needs to figure out their roles, so you end up with duplicate work and work that falls through the cracks. You get both rigid adherence to process and people quietly trying random things to see what sticks. We've got good people and basic trust, so we get through that quickly and solve the problem. Then we tear the team down, and the next time there's an incident the cycle starts again.

So what can we do about it? One thing Core Business is doing is the Ring0 idea: a group of people who have worked together and form the core of incident response, so the forming/storming is minimized. Another thing you can do is get to know the teams responsible for the tools and functionality you rely on. They don't need to be your best friends, but having that shared context really helps. Be aware of the stages of team development. Whether you're the IC or just a member of the team, understanding how others are feeling and how things look to them helps you communicate. Because when you get down to it, it's not command and control that will solve the problem, but communication.

Finally, if you're interested in talking about the 4 stages of teams let me know. There's a lot more to talk about there.

by Leon Rosenshein

The Art Of Computer Programming

A little history lesson today. Have you ever used a compiler? Written an algorithm? Used the search function inside a document? Then chances are you've used work by Donald Knuth. For the last 60-odd years he's been learning, teaching, doing, and writing about computer science and its fundamentals.

By far his greatest contribution to the field is his magnum opus, The Art of Computer Programming. It was originally going to be one book, but it's up to 4 volumes now and growing. It covers everything from the fundamentals to searching and sorting, numerics, combinatorics, data structures, and more.

by Leon Rosenshein

Scale Out, Not Up

If you've spent much time around folks who work on large scale data centers you've probably heard them talking about pets and cattle and wondered why that was. Yes, data centers are sometimes called server farms, but that's not why. Or at least it's not that simple.

It all goes back to the idea of scale and how that has changed over the years. 50 years ago computers were big. Room sized big. And they took a lot of specialized knowledge just to keep running, let alone doing anything useful. So every computer had a team of folks dedicated to keeping it happy and healthy. Then computers got smaller and more powerful and more numerous. So the team of specialists grew. But that can only scale so far. Eventually we ended up with more computers than maintainers, but they were still expensive, and they were all critical.

Every computer had a name. Main characters from books or movies, or the 7 dwarfs. Really big data centers used things like city or star names. It was nice. Use a name and everyone knew what you were talking about, and there just weren't that many of them. We had a render farm for Flight Sim that built the art assets for the game every night. They had simple names with a number, renderer-xx. But we only had 10 of them, so they all had to be working. Even the FlightSim team had a limited number of machines.

Eventually machines got cheap enough, and the workloads got large enough, that we ended up with thousands of nodes. And the fact is that you can't manage 5000 nodes the way you manage 5, or even 50. Instead, you figure out which ones are really important and which ones aren't. You treat the important ones carefully, and the rest are treated as replaceable. Or, to use the analogy, you have pets and cattle. Pets are special. They have names. You watch them closely, and when they're unwell you nurse them back to health. Cattle, on the other hand, have numbers. If one gets sick or lost or confused, who cares? There are 5000 more just like it and it can be replaced.

That concept, coupled with some good automation, is what allowed a small team to manage the 5000-node WBU2 data center. Out of those 5000 nodes we had 28 pets, 28 semi-pets, and 4900+ cattle. 8 of those 28 pets were the namenodes I talked about in our HDFS cluster. We really cared about them and made sure they were happy. The other 20 we watched carefully and made sure they were working, but re-imaging them was an easy fallback if they got sick. And we really didn't care about the cattle. At the first sign of trouble we'd reimage. If that didn't work we'd take the node offline and have someone fix the bad hardware. And our users never knew. Which was the whole point of pets and cattle.
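
The cattle-handling side of that automation boils down to surprisingly little logic. A hedged sketch; the helper functions are stand-ins for whatever reimaging, paging, and ticketing machinery a real fleet has, not our actual tools:

```python
def page_oncall(node: str) -> None:
    print(f"paging oncall: pet {node} is unhealthy")

def reimage(node: str) -> bool:
    print(f"reimaging {node}")
    return False  # pretend the reimage didn't fix it, to show the next step

def take_offline_for_repair(node: str) -> None:
    print(f"pulling {node} out of the pool for a hardware fix")

def handle_unhealthy_node(node: str, is_pet: bool) -> None:
    """Pets get a human right away; cattle get reimaged, then pulled if that fails."""
    if is_pet:
        page_oncall(node)
        return
    if not reimage(node):
        take_offline_for_repair(node)

handle_unhealthy_node("wbu2-worker-1234", is_pet=False)
```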

by Leon Rosenshein

IP Tables To The Rescue

HDFS is great for storing large amounts of data reliably, and in WBU2 we had the largest HDFS cluster at Uber and one of the largest in the world: 100 PB, 1000 nodes, 25,000 disks. HDFS manages all those resources, and it does so in a reasonably performant manner. It does it by keeping track of every directory, file, and file block. You can think of it like the master file table on a single disk drive. It knows the name of the file, where it is in the directory hierarchy, and its metadata (owner, perms, timing, the blocks that make it up, their location in the cluster, etc.). The datanodes, on the other hand, just keep track of the state and correctness of each block they have. For perf reasons the datanodes keep track of what info they've sent to the namenode and what the namenode has acknowledged seeing. From there they only send deltas. Much less information, so less traffic on the wire and fewer changes to process. Goodness all around. And to keep the datanodes from all hitting the namenode at once, the updates are splayed out over time.
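
The delta idea is just set arithmetic. A hedged sketch of what a datanode-side report might compute; this is the concept, not the actual HDFS implementation:

```python
def build_report(current_blocks: set[int], acked_blocks: set[int]) -> dict:
    """Report only what changed since the namenode last acknowledged our state."""
    return {
        "added": current_blocks - acked_blocks,    # blocks received since the last ack
        "removed": acked_blocks - current_blocks,  # blocks deleted/invalidated since then
    }

acked = {1, 2, 3}       # what the namenode has confirmed it knows about
current = {2, 3, 4, 5}  # what's actually on this datanode's disks
print(build_report(current, acked))   # {'added': {4, 5}, 'removed': {1}}
# Only after the namenode acks a report does the baseline advance; if the ack
# never arrives, the datanode has to resend, or fall back to a full report.
```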

The problem we had was that our users were creating lots of little files, and lots of very deep directory structures with those tiny files sprinkled through them. So the number of directories, files, and blocks grew. Quickly. To well over a billion.

At that point startup became a problem. We had a 1000-node thundering herd sending full updates to a single namenode. 1000 nodes splayed out over 20 minutes gives you just over 1 second per node. With 100 TB of disk per machine, transmitting and processing that data took well over a second. So the system backed up, and connections timed out, and the namenode forgot about the datanode, or the datanodes forgot about the namenode, and the cycle started all over again. So we had to do something.
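
The arithmetic behind that backup is worth spelling out. The 1000 nodes and 20-minute splay are from above; the per-report processing time is illustrative:

```python
nodes = 1000
splay = 20 * 60                   # seconds: full reports spread over 20 minutes
arrival_interval = splay / nodes  # 1.2 s between incoming full block reports
service_time = 1.5                # illustrative: "well over a second" to process one

arrivals_per_hour = 3600 / arrival_interval  # 3000 reports offered per hour
handled_per_hour = 3600 / service_time       # only 2400 can be processed
print(f"backlog grows by ~{arrivals_per_hour - handled_per_hour:.0f} reports per hour")
# And once reports start timing out and getting resent, the offered load only goes up.
```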

We could have turned everything off and then turned the datanodes back on slowly, but hardware really doesn't like that. Turning the nodes in a DC off/on is a really good way to break things, so we couldn't do that. And we didn't want to build/maintain a multi-host distributed state machine. We wanted a solution that ran in one place where we could watch and control it.

So what runs on one machine and lets you control what kind of network traffic gets in? IPTables, of course. Now IPTables is a powerful and dangerous tool, and this was right around the time someone misconfigured IPTables at Core Business and took an entire data center offline, requiring re-imaging to get it back, so we wanted to be careful. Luckily we knew which hosts would be contacting us and which port they would be contacting us on, so we could write very specific rules.

So that's what we did. Before we brought up the namenode we applied a bunch of IPTable rules. Then we watched the metrics and slowly let the datanodes talk. That got things working, but it still took 6+ hours and lots of attention. So we took it to the next level. Instead of doing things manually, we wrote a bash script that applied the initial rules, monitored the live metrics until the namenode was feeling OK, then released another set of nodes. That got us down to ~2.5 hours, fully automated. We still worried about it, since we were either down or had no redundancy during that time, but it was reliable and predictable.
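
Something with the shape of that automation might look like this. A Python sketch standing in for the bash script described above; the port, addresses, batch size, and health check are all placeholders:

```python
import subprocess
import time

NAMENODE_RPC_PORT = 8020   # assumption: the stock HDFS RPC port; ours differed
BATCH_SIZE = 50            # how many datanodes to let in per round
DATANODES = [f"10.0.{i // 256}.{i % 256}" for i in range(1000)]  # placeholder IPs

def iptables(*args: str) -> None:
    """Run one iptables command on the namenode host (needs root)."""
    subprocess.run(["iptables", *args], check=True)

def block_all_datanodes() -> None:
    # Drop everything aimed at the namenode RPC port before the namenode starts.
    iptables("-I", "INPUT", "-p", "tcp", "--dport", str(NAMENODE_RPC_PORT), "-j", "DROP")

def release(ip: str) -> None:
    # Insert an ACCEPT ahead of the DROP for one specific datanode.
    iptables("-I", "INPUT", "-p", "tcp", "-s", ip,
             "--dport", str(NAMENODE_RPC_PORT), "-j", "ACCEPT")

def namenode_is_healthy() -> bool:
    """Placeholder: poll whatever metrics you trust (RPC queue length, heap, GC pauses)."""
    return True

block_all_datanodes()
# ...start the namenode here...
for i in range(0, len(DATANODES), BATCH_SIZE):
    while not namenode_is_healthy():
        time.sleep(30)                   # let the namenode catch its breath
    for ip in DATANODES[i:i + BATCH_SIZE]:
        release(ip)
```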

And that's how we used IPTables to calm the thundering herd.

by Leon Rosenshein

The 3 Tells

I'm not talking about how to tell when the person across the poker table from you is bluffing. I'm talking about how you can effectively get your point across in a limited amount of time. You've got good ideas. Knowing how to get your point across, whether it's formal with a slide deck or a quick hallway conversation, makes it much more likely that those good ideas will become realities. And it starts with the three tells:

  1. Tell them what you're going to tell them and why they care
  2. Tell them
  3. Tell them what you told them and why it's important


That doesn't mean start every presentation with an agenda slide and end with a conclusion slide, both of which you read verbatim, although that's not a bad place to start. It means make sure you know and share the destination up front. And not just the destination, but why your audience wants to get there. Call it a hook or a vision, or just "What's in it for them". Make sure they understand why they should care about what you're saying.

Then you tell them. Tell them what they need to know. You know the subject. It's important to you. Let that enthusiasm show. But don't get carried away. Make sure you stay at the right level. Stay on message. Knowing what not to say is as important as knowing what to say. That doesn't mean glossing over problems, or that you don't need to know the details. Think of common questions and have the answers available, but wait for someone to let you know they're needed before you provide them.

It's like teaching physics. You use the appropriate level for your students. In middle school F = ma and s = ½at². Simple. Straightforward. Throw a baseball straight up at X ft/s. How high does it go? When does it get back to your hand? In high school you might mention air resistance. In college you throw things farther and faster. The curvature of the earth matters. There's friction, and gravity becomes a function of position. Suddenly you're talking about satellites in orbit. Relativistic effects and the changing shape of the Earth and the atmosphere become important. All those effects are real and important, but you don't want to generalize too much or get lost in the weeds. Freshman college physics should include friction, but probably doesn't need relativistic effects.
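
For the middle-school version, the whole answer fits in a few lines (metric units for convenience, and no air resistance, naturally):

```python
g = 9.8      # m/s^2
v = 20.0     # throw the ball straight up at 20 m/s

max_height = v ** 2 / (2 * g)  # ~20.4 m
time_to_return = 2 * v / g     # ~4.1 s until it's back in your hand
print(max_height, time_to_return)
```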

Finally, close it out. Go back to the high points and tie them in to why the audience cares. Don't repeat all the details or the related anecdotes. Don't rabbit hole. Stick to your hook/vision and how it can help your audience in whatever it is you're talking about. Make sure they know why everything you just said is important to them.

So there you have it. The three tells and how using them can help you get your point across and your ideas implemented. All by telling them what you're going to tell them, telling them, and finally, telling them what you told them.

See what I did there?

by Leon Rosenshein

How Much Config Is Enough?

Pop quiz: you're building an application. What parameters do you put in your config file? The answer, like the answer to any other architectural question, is "It depends." At one extreme you spend a lot of time with your customers and bake everything into the application. Changing anything, from the name on the splash screen to the fully qualified on-disk path to the splash screen image, means compiling a new binary. At the other extreme, the only thing the application knows how to do is read a configuration file that tells it where to find the dynamic libraries and custom DSL files full of business logic that actually do anything.

Like any other tradeoff, the right answer lives somewhere between those two extremes, and it is tightly coupled to your customer(s). As a rule of thumb, the bigger the spread of customers, whether it's scale (whatever that means), domain, expertise, commonality of use cases, etc., the more configurable you need to be. A simple single-user personal checkbook application doesn't need much configuration. Some user info, some financial institution info, some categorization, and maybe a couple of other things. A US tax preparation application, on the other hand, needs all that plus a way to handle the different states and rules, including ones that change or get clarified after you ship.

So how do you approach the configuration challenge? First, think about your customer and what they need to do. What do they need to customize, and how often does it change? Can you come up with a sensible default for your majority case? Is there a convention you can follow for that default? Maybe the right answer is a wizard that runs the first time (and maybe on demand) that walks the user through the setup. One nice thing about a wizard is that you can validate that the configuration chosen makes sense.

Another thing to think about is whether the defaults live in the config file or in the application. If they're in the file it's at least obvious what the knobs are, but then you get folks poking at them just to see what they do. And what if your customers don't make backups? They think they know what they want and change the default. How do they get it back? Maybe a better idea is a base configuration file they can't change, with an override file. Safer, and probably easier to recover, but each level of override makes it that much harder to know what the actual configuration is.
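
A minimal sketch of the base-plus-override idea (the file names and JSON format are made up; any format with a merge step works the same way):

```python
import json
from pathlib import Path

def load_config(base_path: str, override_path: str) -> dict:
    """Defaults ship with the app and stay read-only; user changes live in the override."""
    config = json.loads(Path(base_path).read_text())      # e.g. base.json
    override = Path(override_path)                        # e.g. user.json, optional
    if override.exists():
        config.update(json.loads(override.read_text()))   # user values win
    return config

# "Restoring the defaults" is just deleting or ignoring the override file. The cost:
# the effective configuration is no longer visible in any single file.
```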

But is a local configuration file the right place to store that information? In most cases, probably, but what about the enterprise? How do you centralize configuration? How do you keep things in sync? Central databases, Flagr, Puppet/Chef and pushed configuration? Semi-random user triggered pulls (`update-uber-home.sh` anyone)? All options, and unless you really understand the user problem you're trying to solve, you can't make the tradeoffs, let alone come up with the best answer.

When it comes time to figure out how to configure your configuration, it depends.

by Leon Rosenshein

Your Bug, Your Customer's Feature

It's not a bug, it's a feature. We've all heard it. We've probably all said it. But think about it. What exactly is a feature? It's the application doing something the customer wants, when the customer wants it done. It doesn't matter if that's what you expected. It doesn't matter if that's what the spec called for or what the PM really wanted. It's fulfilling a customer need. And to the customer, that's a feature.

Back when I was working on Falcon 4.0 we kept a lot of internal display information in a couple of buffers. To make runtime debugging easier we turned those buffers into a shared memory segment and ran external tools to display it. It was so useful that we created some additional structures and added other internal state. Really handy; it let us do things like write to a memory-mapped bitmap display (a Hercules graphics card) while DirectX had the main monitor.
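
The original lived in a Windows shared memory segment. Here's a hedged sketch of the same idea in Python using a memory-mapped file; the three-float layout is invented for illustration and is nothing like the real FlightData structure (that's in the link at the bottom of the post):

```python
import mmap
import struct

# Invented layout: airspeed, altitude, heading as three little-endian floats.
LAYOUT = struct.Struct("<3f")

# Writer side (inside the "sim"): publish internal state where any tool can see it.
with open("flightdata.shm", "wb") as f:
    f.truncate(LAYOUT.size)
with open("flightdata.shm", "r+b") as f:
    shm = mmap.mmap(f.fileno(), LAYOUT.size)
    shm[:LAYOUT.size] = LAYOUT.pack(250.0, 12000.0, 270.0)
    shm.flush()

# Reader side (an external gauge or display tool): map the same region and decode it.
with open("flightdata.shm", "r+b") as f:
    view = mmap.mmap(f.fileno(), LAYOUT.size)
    airspeed, altitude, heading = LAYOUT.unpack(view[:LAYOUT.size])
    print(airspeed, altitude, heading)
```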

The "bug" was leaving it on when we sent out beta versions to testers. Our testers were a creative bunch. Many of them had homemade cockpits. And multiple displays. One of them even had physical gauges that they used with someone else's product. And they found our shared memory. Then they decoded part of it. They used it to drive their own displays and gauges. Then, they thanked us for building such a useful feature and asked if the documentation of the memory layout was ready yet. And suddenly an internal debugging aid that slipped out to our beta testers became a feature. In this case it wasn't a big deal and didn't add a lot of support burden, but it could have.

When I worked on MS FlightSim, backward compatibility was a major feature of every release, and not just for the "supported" features. As a "platform" we chose to support all add-ons developed with the last 2 or 3 versions, and most of the rest, regardless of which bug, corner case, or side effect the add-on relied on. It made for great PR, but it was a lot of work.

So be careful what you allow to slip out. It will come back to haunt you.

NB: This is the kind of thing your users can do with your "bugs"

https://github.com/lightningviper/lightningstools/blob/master/src/F4SharedMem/FlightData.cs (Lines 177-229 are definitely from my original shared mem structure. I think the next 30 or so are. After that it's aftermarket stuff)

by Leon Rosenshein

Disk Errors

I've been working on datacenters and parallel computing for over 10 years now. Back in 2007 we had about 100PB of spinning disk and had our own home-grown distributed file system. Roughly equivalent to HDFS, but of course Windows (NTFS) based since we were part of Microsoft. We had filesets (roughly equivalent to an HDFS file) composed of individual NTFS files (blocks in HDFS terms), and instead of a namenode/journalnode we had a monstrous MS-SQL server that kept track of everything. And, like HDFS we had redundancy (nominally 3 copies) and that usually worked. But sometimes it didn't. And we couldn't figure out why.

Luckily, as part of Microsoft we had access to the folks who actually wrote NTFS and the drivers. So we talked to them. We explained what we were doing and why. Number of hosts. Number of disk drives. Number of files and filesets. Average file size. What our data rates were (read and write). We talked about our pipelines and validation. We listened to best practices. We learned that no one else was pushing Windows Server that hard with that many disks per node. And all through the discussion there was one guy in the back of the room scribbling away furiously.

After about 30 minutes of this, he picks his head up and says: "You have at least two problems. First, at those data rates, at least 7 times a day you're going to write a zero or a one to the disk and it's not going to stick. Second, in your quest for speed in validation, you're validating the disk's on-board write-through cache, not the actual data on disk."

We'd run into one of the wonders of statistics. All disk writes are probabilistic, but we were pushing so much data that we were pretty much guaranteed to have multiple write errors a day. And then we compounded the problem by doing a bad job of validating. So we fixed the validation problem (not much we could do about the physics) and things got a lot better. After that, over 10 years and scaling up to 330 PB of data, we only lost one file that wasn't caused by someone deleting it by mistake. Human error is a story for another day.
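
That "at least 7 times a day" falls straight out of the law of large numbers. A rough reconstruction; both numbers below are assumptions picked to make the arithmetic match the anecdote, not the real figures:

```python
# Drive spec sheets quote error rates on the order of 1 per 10^15 bits;
# treat this as an order-of-magnitude stand-in, not the number we were quoted.
bit_error_rate = 1e-15
bytes_written_per_day = 875e12    # assume ~875 TB/day of writes across the fleet

bits_per_day = bytes_written_per_day * 8
expected_bad_writes_per_day = bits_per_day * bit_error_rate
print(f"~{expected_bad_writes_per_day:.0f} bits per day written that don't stick")  # ~7
```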

But that didn't solve all of our problems. We solved the data loss problem, but not the availability problem. That was a whole different problem, caused because we were too clever for our own good. For safety we distributed our files randomly across the cluster. And many of our filesets contained hundreds of files, so we made sure we didn't put them together. That way, when we lost a node we didn't take out all of the replicas. Sounds good so far. But that caused its own problems.

We had about 2000 nodes. How do you do a rolling reboot of 2000 nodes when you can only take down 2 at a time? You do it 2 at a time, 1000 times. At 10 minutes a reboot that's 10,000 minutes, 166+ hours if everything goes smoothly. 20+ days if you do it during working hours. That sucks. Also, disks popped like popcorn. Power supplies failed. Motherboards died. On any given day we'd have 10-20 nodes down. You wouldn't think that was a problem, but the way we did things, filesets were the unit of availability, not files, so if one file was broken then that instance of the fileset was down. And three broken files on three different machines could, and often did, make all 3 copies of a fileset unavailable.

So how do you solve those problems? With intelligent placement. Since we had 3 copies of every file, we didn't spread things out randomly; we made triplets: 3 nodes that looked exactly like each other. You could reboot ⅓ of the nodes at once and do the whole lab in a morning. When a node went down we fixed it, and the other two carried the load. If we ever had 2 nodes in a triplet down, ops knew and jumped all over it. And that's how we solved the availability problem.
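
A toy simulation shows why the triplets fixed availability. All the numbers here are illustrative, roughly matching the ones in the post:

```python
import random

NODES = 1998              # ~2000 nodes, grouped evenly into triplets of 3
DOWN = set(random.sample(range(NODES), 15))   # a typical day: 10-20 nodes out
FILES_PER_INSTANCE = 300  # a fileset instance contains a few hundred files
FILESETS = 1000

def unavailable_random_placement() -> int:
    """Each of a fileset's 3 instances scatters its files across random nodes.
    One file on a down node takes out that instance; all 3 instances down
    makes the fileset unavailable."""
    count = 0
    for _ in range(FILESETS):
        instances_down = sum(
            any(n in DOWN for n in random.sample(range(NODES), FILES_PER_INSTANCE))
            for _ in range(3)
        )
        if instances_down == 3:
            count += 1
    return count

def unavailable_triplet_placement() -> int:
    """All 3 copies of a fileset live on the 3 nodes of one triplet, so the
    fileset is unavailable only when the whole triplet is down."""
    triplets = [range(i, i + 3) for i in range(0, NODES, 3)]
    count = 0
    for _ in range(FILESETS):
        if all(n in DOWN for n in random.choice(triplets)):
            count += 1
    return count

print("random placement: ", unavailable_random_placement(), "of", FILESETS, "filesets down")
print("triplet placement:", unavailable_triplet_placement(), "of", FILESETS, "filesets down")
```

With random placement most filesets end up with every copy touching at least one down node; with triplets, essentially none do.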

So what's the lesson here? We ran into issues no one had ever seen. Issues at all sorts of levels, from the firmware on the hard drives to the macro level of how you manage fleets of that size. Could we have foreseen them? Possibly, but we didn't. We had to recognize them as they occurred and deal with them. Because scale is a problem all its own. The law of large numbers turns the possible or occasional problem into a daily event that you just have to deal with. It's no longer an exception. It's just another state in the graph.