
by Leon Rosenshein

Solving Problem Solving

I've mentioned this before, and it's still true. They don't pay us to write code. They pay us to further the mission, and the ATG mission statement is _Our mission is to bring safe, reliable self-driving transportation to everyone, everywhere_. Nothing in there about writing software, or building AI models or more secure hardware. Nope. We write code (or do all the other things we're doing) in support of the mission. To solve a problem. It's a very big problem, and it's going to require software and models and hardware and all those other things to solve. So how do we get better at solving problems?

Problem solving is a skill, and like any other skill, to get better you improve your technique and practice. So what does an improved technique look like? It starts with understanding the problem. Don't just write the function, but look at the underlying problem you're trying to solve. The 5 whys are a good place to start for that. Then think about edge cases. What are they? What are the failure modes? What's the happy path? Understand your boundaries and constraints, both technical and business-wise.

Once you understand what you're really trying to do, do some research. See how others have solved similar problems. Think about analogies. What problems are like yours, but in a different field? Decompose the problem into smaller parts and research them. Look for inspiration and ideas there. Think about how to put those smaller parts together into a coherent whole.

Next, make a decision. Look at the pros and cons for the different options. Think about long and short term solutions. Think about hidden costs and future benefits. Then, just do it.

What about practice? What does practice mean in this case, and when can you do it? Start small. There are multiple times a week when we need to solve a problem. Even if you think you know the solution, go through the steps. Be mindful about it and notice how you're going through the steps. Each time you go through it, it gets easier and more natural. That's practice.

Remember, practice is about the process, not the result. The goal here is not to change your mind, but to get used to a new way of solving problems and making that method second nature. As a band teacher once said, don't practice until you get it right, practice until you don't get it wrong.

by Leon Rosenshein

Velocity Is Internal

Story points and burndown charts. Two staples of the agile planning process. They're what let you have confidence that what you're planning for any given sprint is close to what will be delivered. And they're pretty good at that. They let you compare projections and actuals over time for your team. But that's pretty much all they're good for.

One of the things they're very bad at is comparing across teams. You can make all the claims you want about common definitions for story points and capacity, but the reality is that every team goes through the 4 stages of development (remember forming/storming/norming/performing?) and gets to a point where the team consistently agrees on what a story point means to them. And once you have a consistent value for a story point and historical data on how many of those story points get done in a sprint, you can use it to predict, with some accuracy, how much work to put in a sprint and expect to get done. But that only applies to that team. Other teams have different definitions of story points and different capacities, so comparing velocity across teams is like comparing the efficiency of a regional bus and a taxi by the number of trips they make a day. With enough context you might be able to compare them, but just looking at the two numbers doesn't tell you much. If you want to compare teams using some of that information then maybe look at comparing the accuracy of their predictions. If the ratio of story points predicted to story points accomplished gets close to 1 and stays there, the team is making reliable predictions.
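
To make that concrete, here's a minimal sketch of the idea in Python. The sprint numbers are made up; the point is that the ratio, not the raw velocity, is the part worth watching over time.

```python
# Predicted-vs-accomplished ratio per sprint. The numbers are made up;
# what matters is whether the ratio trends toward 1.0 and stays there.

sprints = [
    {"predicted": 30, "completed": 22},
    {"predicted": 28, "completed": 25},
    {"predicted": 26, "completed": 25},
    {"predicted": 27, "completed": 27},
]

for i, s in enumerate(sprints, start=1):
    ratio = s["predicted"] / s["completed"]
    print(f"Sprint {i}: predicted {s['predicted']}, "
          f"completed {s['completed']}, ratio {ratio:.2f}")

# A ratio that settles near 1.0 means the team's predictions are
# reliable, whatever a "story point" means to that particular team.
```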

Another thing velocity is bad at is predicting the project completion date. And that's for similar reasons. The task list, let alone the story point estimates, made early in the cycle just isn't that accurate. So until you get close to the end and really know how many points are left, as defined by the team that is going to be working on them, you can't just divide by velocity and trust the answer.

Bottom line, historical velocity is a short term leading predictor, not a long term performance measurement. So don't treat it as one.

by Leon Rosenshein

Everything Can't Be P1

A few weeks ago I talked about priority inflation in bugs and tasks. It really happens, but it also kind of doesn't matter. You could have P0 - P3, P1 - P4, or P[RED|YELLOW|GREEN|BLUE|BLACK|ORANGE] or any other way of distinguishing priorities. That's because the priorities themselves don't mean anything. They're just a ranking. They have no intrinsic value/meaning.

And once you get past the fact that all priority is relative, and in fact only relative to the other things it was prioritized with, you realize that with a finite set of priorities, your prioritized list of bugs/tasks becomes far less useful. Consider this situation. You've got a deadline tomorrow and time to work on one more thing. You check the bug list and find there are 3 Pri 1 bugs, 10 Pri 2, and about 50 Pri 3. You can only fix one of them. What do you do?

You can quickly eliminate the Pri 2 and Pri 3 items because you know the Pri 1 items are more important. But which one? You've certainly got your opinion, and you've probably heard other people's, but you really don't know. So you pick your favorite one, or the easiest one, or the one that you think is most important. Meanwhile, the one that's really most important (as defined by the product owner/PM/whomever is in charge) languishes. And this can go on for days as new issues are found, added to the list, and prioritized.

So how do you know what's really the most important and make sure that it gets worked on? And what about the case where a series of small tasks, each not very important on its own, provides more value as a group than any other single task? That's where an ordered list of tasks comes in.

The most important thing, the next thing that should be worked on, is at the top of the list. Do that first. If you need roll-up tasks to show the value, put them on the list, not the sub-tasks. And keep the list ordered. How often you re-order depends on lots of things: where you are in the cycle, how fast the team is working through the list, and how fast things are being added to it, at least.

Your list doesn't need to be perfect, and as long as the next day or two's items are at the top, the rest doesn't matter all that much. Under the principle of "Don't make any decisions before you need to", make those decisions later, when you have more information to inform them. That way you spend less time making and remaking decisions and more time acting on them.
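
If it helps to see the shape of it, here's a toy sketch in Python. The Task class and the task names are hypothetical; the point is that "what do I do next?" becomes a lookup instead of a debate between equal P1s.

```python
# A minimal sketch of "ordered list beats priority buckets". Names and
# tasks here are made up for illustration.

from dataclasses import dataclass

@dataclass
class Task:
    title: str

# The backlog is simply kept in rank order by whoever owns prioritization.
# Roll-up tasks go on the list; their sub-tasks don't.
backlog = [
    Task("Fix crash on login"),          # rank 1: the next thing to do
    Task("Roll-up: small perf fixes"),   # rank 2: valuable as a group
    Task("Update onboarding docs"),      # rank 3
]

def next_task(backlog):
    """The most important thing is, by definition, the top of the list."""
    return backlog[0] if backlog else None

print(next_task(backlog).title)
```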

There's another dimension orthogonal to priority: severity. Severity helps inform your decision by telling you what happens if you don't address the issue, but that's a topic for a different day.

by Leon Rosenshein

Swallowing The Fish

Back when I joined Microsoft as the dev lead for the Combat Flight Sim team, Bill Gates was CEO and the mission statement was pretty simple, "A computer on every desk and in every home", with the unspoken ending, running Microsoft software. By the time I joined Uber, Satya Nadella was CEO and the mission statement had gone through a few iterations and was "To empower every person and every organization on the planet to achieve more." Since then the company has changed even more. And that means that Microsoft was able to do something very few companies of its size and market position have been able to do.

When I started, pretty much everything was in service to Windows or Office. SQL Server and Developer Division were there, but mostly to get folks to buy Windows and develop software for it, or, in today's terms, to drive the flywheel. I started out in the games group, and we were there to give people a reason to buy Windows based computers. So much so that FlightSim was canceled because its CM was only in the millions.

And that's the Innovator's Dilemma. How does a company handle a small competitor when it can make more money focusing on high end customers and ignoring those small threats? That was the position Microsoft found itself in about 10 years ago. Windows/Office was making money hand over fist and enterprise sales for DevDiv and SQL were up. And that's where the investments were being made.

Meanwhile, online, subscription based delivery and cheaper, "good enough" options were showing up. Sure, Word and Excel could do more than the Google equivalents, but for many (most?) folks they were enough. Companies have run into these problems before. The history of tech companies is full of them. From hardware (DEC, Sun, Prime), compilers (Borland, Watcom, Metrowerks), databases (Ashton Tate, Paradox), operating systems (Solaris, CP/M), and browsers (Netscape) to business software (Wordstar, WordPerfect, Visicalc, Lotus 123, Lotus Notes), there were lots of dominant players that disappeared. And in many of those cases it was Microsoft that replaced them.

So what did Microsoft do? They recognized that they needed to take a longer view. That there needed to be a short term hit to the bottom line to shift how things were done and what the goals were. They invested in new products and moving to newer business models. Or in other words, they paid down their technical debt.

And that's an important takeaway. Don't get so tied to your current way of doing things, or your current ideas, that you don't look at new ones.

by Leon Rosenshein

Pictures Or It Didn't Happen

One of the oddities of software is that the vast majority of the work done can't be seen. It's just a series of 0's and 1's stored somewhere. You can spend man years working on paying down technical debt or setting yourself up for the future and have nothing visible to show for it.

When I was working on the Global Ortho project we had something we called the MBT spreadsheet. The Master Block Tracking list. It started life as a PM's way of tracking the status of the project's 1 degree cells (the blocks) through the system, from planning to acquisition, ingest, initial processing, color balancing, 3D model generation, then delivery and going live to the public. There were only 5 - 10 blocks in progress at any given time, and it was all manual.

Then we scaled up. There were about 1500 blocks in total to process and we usually had more than 100 in the pipe at any time. It got to be too much for one person to handle. So our production staff did what they usually do. A bunch of people did a bunch of manual work outside the system and just stuffed the results in. And no one knew how it worked or trusted it. But the powers that be liked the idea, and they really liked the overview that let them know the overall state and let them drill down for details when they wanted.

So that gave us our boundaries. A bunch of databases and tables with truth, and a display format we needed to match. We just needed to connect the two. The first thing we did was talk to the production team to figure out what they did. We got lots of words, but even the people who wrote down what they did weren't sure if that was right. So we went to the trusty whiteboard. We followed the data and diagrammed everything. We figured out how things were actually flowing. That was one set of architectural diagrams. But they were only for reference.

Then we took the requirements and diagrammed how we wanted the data to flow, including error cases and loops. And it was really high level. More of a logic diagram. All of the "truth" was in a magic database. All of the rules were in another magic box. The state machine was a box. Using those few boxes, plus a few others (input, display, processing, etc.), we figured out how to handle all of the known use cases and some others we came up with along the way.
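
For flavor, here's a toy version of that state-machine-in-a-box in Python. The stage names follow the pipeline described above, but the transition table and the block id are purely illustrative, not the real MBT logic.

```python
# A toy block-tracking state machine. Stages match the pipeline above;
# the transitions (including the error loop) are illustrative only.

ALLOWED = {
    "planning":           ["acquisition"],
    "acquisition":        ["ingest"],
    "ingest":             ["initial_processing"],
    "initial_processing": ["color_balancing", "ingest"],        # loop back on error
    "color_balancing":    ["model_generation", "initial_processing"],
    "model_generation":   ["delivery"],
    "delivery":           ["live"],
    "live":               [],
}

def advance(block_states, block_id, new_state):
    """Move a block to a new state, enforcing the allowed transitions."""
    current = block_states[block_id]
    if new_state not in ALLOWED[current]:
        raise ValueError(f"{block_id}: can't go from {current} to {new_state}")
    block_states[block_id] = new_state

blocks = {"N47W122": "ingest"}          # made-up block id
advance(blocks, "N47W122", "initial_processing")
print(blocks)
```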

The next step was to map those magic ideal boxes onto things we had or could build. Database tables and accessors, processing pipelines, state machines, manual tools, etc. That led to some changes in the higher level pictures, because there was coupling we didn't have time to separate, so we adjusted along the way. But we still couldn't do anything, because this was a distributed system and we needed to figure out how the parts scaled and worked together.

And that led to a whole new round of diagrams that described how things scaled (vertically or horizontally) and how we would manage manual processes inside a mostly automated system. How we would handle errors, inconsistencies, and monitoring. How we would be able to be sure we could trust the results.

Once we had all those diagrams (and some even more detailed ones), we did it. We ended up with the same spreadsheet, but with another 10 tabs, connected to a bunch of tables across the different databases, with lots of VBA code downloading data then massaging it and holding it all together. Then we went back to the powers that be and showed them the same thing they were used to looking at and told them they could trust it.

Understandably, they were a little surprised. We had put about 3 man-months into the project and everything looked the same. They wanted to know what we had been doing all that time and why this was any different. Luckily, we expected that response and were prepared. We had all those architectural diagrams explaining what we did and why. And by starting at the high level and adding detail as needed for the audience, we were able to convince them to trust the new MBT.


by Leon Rosenshein

Optimization

Premature optimization may or may not be the root of all evil, but optimizing the wrong thing certainly doesn't help. When we talk about optimization we're usually talking about minimizing time, and the classic example says that if you have two sub-processes, one taking 9.9 seconds and the other taking 0.1 seconds, you should spend your time on the first one, because even if you manage to eliminate the second no one will notice. And that's an important thing to remember.
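
The arithmetic is worth spelling out. A quick sketch in Python:

```python
# With a 9.9 s step and a 0.1 s step, deleting the small step entirely
# barely moves the total, while a modest win on the big step dominates.

slow, fast = 9.9, 0.1
total = slow + fast                      # 10.0 s

after_removing_fast = slow               # 9.9 s
after_halving_slow = slow / 2 + fast     # 5.05 s

print(f"Remove the 0.1 s step: {1 - after_removing_fast / total:.1%} saved")
print(f"Halve the 9.9 s step:  {1 - after_halving_slow / total:.1%} saved")
# Output: 1.0% saved vs 49.5% saved
```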

But optimization goes way beyond that. What if that process was part of a system that was only used by 1% of your user base, or the system had been retired? What if the question isn't what function should I make faster, but which bug should I fix or feature should I implement next? It goes back to one of my favorite questions. "What are you really trying to do?"

Because in almost all cases, what you are really trying to optimize is where to spend your time to provide maximum user value, and when. Before you can do that you have to define those 4 terms.

And that's where scope comes in. "You" could be yourself, your team, your organization, ATG, or Uber Technologies. "User" has the same scope or larger. Are you working on something for internal consumption only, or does it accrue to an externally facing feature? Or is it something even larger, like public safety or the environment?

What does value mean? There's dollar value, but that might not be the right way to measure it. It could be time, or ease of use (customer delight), or more intangible, like brand recognition. And when will that value be realized? Will it be immediate, or will it be much further down the road?

Those are a lot of questions you need to answer before you can optimize, and sometimes we don't have all the answers. The answers also impact each other. And we have a finite amount of resources, so we can't do everything.

So what do we do? We optimize what we can see and understand with what we can do and make the optimal choices based on what we know. Then we observe the situation again and make some more optimizations. Then we do it again. And again. Kind of like a robot navigating down the road in an uncertain environment :)

by Leon Rosenshein

Cost Of Control

The other day I ran across a presentation which appeared to have a very different take on incident management. It talked about some of my hot topics, hidden complexity, cognitive load, and their costs, so I was looking forward to seeing the recommendations and how I might be able to apply them. But the presentation kind of fizzled out, ending up with some observations of how things were and identifying some issues/bottlenecks, but not taking the next step.

Thinking about it, what I realized was that what the presentation described was a C3 (command, control, and communication) system, and the difference between the examples that worked well and the ones that didn't wasn't the process/procedure, it was the way the teams worked together. Not the individuals on the team, not the incident commander (IC) or the person in the corner who came up with the answer alone, but how those individuals saw each other and worked together.

The example of a situation where an incident was well handled was the Apollo 13 emergency, and the counterexample was a typical service outage and its Zoom call. And that took me back to something I learned about early in my Microsoft career: the 4 stages of team development. Forming, Storming, Norming, and Performing. There's no question that how the IC does things will impact the team performance, but generally the IC isn't the one who does the technical work to solve the problem.

And that's how Gene Kranz brought the Apollo 13 astronauts home safely. Not by solving the problem or rigidly controlling the process and information flow, but by making sure the team had what it needed, sharing information as it became available, providing high level guidance, and above all, setting the tone. And he was able to do that because he had a high performing team. People who had worked together before, who knew what the others were responsible for and capable of doing, and who had earned each other's trust.

Compare that with a typical service outage. You get paged, open an incident, and then you try to figure out which upstream provider had an issue. You get that oncall, then someone adds the routing team and neteng gets involved. Pretty soon you've got a large team of experts who know everything they might need to know about their system, but nothing about the other people/systems involved. It's a brand new team. And when a team is put together they go through the forming/storming phases. The team needs to figure out their roles, so you end up with duplicate work and work dropped between the cracks. You get both rigid adherence to process and people quietly trying random things to see what sticks. We've got good people and basic trust, so we get through that quickly and solve the problem. Then we tear the team down, and the next time there's an incident the cycle starts again.

So what can we do about it? One thing that Core Business is doing is the Ring0 idea: a group of people who have worked together forming the core of incident response, so the forming/storming is minimized. Another thing you can do is get to know the teams responsible for the tools/functionality you rely on. They don't need to be your best friends, but having that context about each other really helps. Be aware of the stages of team development. Whether you're the IC or just a member of the team, understanding how others are feeling and how things look to them helps you communicate. Because when you get down to it, it's not command and control that will solve the problem, but communication.

Finally, if you're interested in talking about the 4 stages of teams let me know. There's a lot more to talk about there.

by Leon Rosenshein

The Art Of Computer Programming

A little history lesson today. Have you ever used a compiler? Written an algorithm? Used the search function inside a document? Then chances are you've used work by Donald Knuth. For the last 60-odd years he's been learning, teaching, doing, and writing about computer science and its fundamentals.

By far his greatest contribution to the field is his magnum opus, The Art of Computer Programming. It was originally going to be one book, but it's up to 4 volumes now and growing. It covers everything from the fundamentals of searching and sorting to numerics, combinatorics, data structures, and more.

by Leon Rosenshein

Scale Out, Not Up

If you've spent much time around folks who work on large scale data centers you've probably heard them talking about pets and cattle and wondered why that was. Yes, data centers are sometimes called server farms, but that's not why. Or at least it's not that simple.

It all goes back to the idea of scale and how that has changed over the years. 50 years ago computers were big. Room sized big. And they took a lot of specialized knowledge just to keep running, let alone doing anything useful. So every computer had a team of folks dedicated to keeping it happy and healthy. Then computers got smaller and more powerful and more numerous. So the team of specialists grew. But that can only scale so far. Eventually we ended up with more computers than maintainers, but they were still expensive, and they were all critical.

Every computer had a name. Main characters from books or movies, or the 7 dwarfs. Really big data centers used things like city or star names. It was nice. Use a name and everyone knew what you were talking about, and there just weren't that many of them. We had a render farm for Flight Sim that built the art assets for the game every night. They had simple names with a number, renderer-xx. But we only had 10 of them, so they all had to be working. Even the FlightSim team had a limited number of machines.

Eventually machines got cheap enough, and the workloads got large enough, that we ended up with 1000s of nodes. And the fact is that you can't manage 5000 nodes the way you manage 5, or even 50. Instead, you figure out which ones are really important and which ones aren't. You treat the important ones carefully, and the rest are treated as replaceable. Or, to use the analogy, you have pets and cattle. Pets are special. They have names. You watch them closely and when they're unwell you nurse them back to health. Cattle, on the other hand, have a number. If it gets sick or lost or confused, who cares. There's 5000 more just like it and it can be replaced.

That concept, coupled with some good automation, is what allowed a small team to manage the 5000 node WBU2 data center. Out of those 5000 nodes we had 28 pets, 28 semi-pets, and 4900+ cattle. 8 of those 28 pets were the namenodes in our HDFS cluster that I talked about. We really cared about them and made sure they were happy. The other 20 we watched carefully and made sure they were working, but re-imaging them was an easy fall-back if they got sick. And we really didn't care about the cattle. At the first sign of trouble we'd reimage. If that didn't work we'd take the node offline and have someone fix the bad hardware. And our users never knew. Which was the whole point of pets and cattle.
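
A rough sketch of that policy in Python; the node names and helper functions are hypothetical stand-ins for the real tooling, but the decision flow is the point.

```python
# Pets get a human; cattle get reimaged or pulled from the pool.
# All names and helpers below are made up for illustration.

PETS = {"namenode-01", "namenode-02"}      # the handful of machines that matter individually

def check_health(node):  return False      # stub: pretend the node is unhealthy
def reimage(node):       print(f"reimaging {node}")
def take_offline(node):  print(f"taking {node} out of the pool")
def page_oncall(node):   print(f"paging oncall for {node}")

def remediate(node):
    if check_health(node):
        return "healthy"
    if node in PETS:
        page_oncall(node)                  # pets get nursed back to health by a human
        return "escalated"
    reimage(node)                          # cattle: reimage at the first sign of trouble
    if check_health(node):
        return "reimaged"
    take_offline(node)                     # still sick? pull it and fix the hardware later
    return "offline"

print(remediate("worker-0417"))            # cattle path
print(remediate("namenode-01"))            # pet path
```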

by Leon Rosenshein

IP Tables To The Rescue

HDFS is great for storing large amounts of data reliably, and we had the largest HDFS cluster at Uber, and one of the largest in the world, in WBU2. 100 PB, 1000 nodes, 25,000 disks. HDFS manages all those resources, and it does it in a reasonably performant manner, by keeping track of every directory, file, and file block. You can think of it like the master file table on a single disk drive. It knows the name of the file, where it is in the directory hierarchy, and its metadata (owner, perms, timing, the blocks that make it up, their location in the cluster, etc). The datanodes, on the other hand, just keep track of the state and correctness of each block they have. For perf reasons the datanodes keep track of what info they've sent to the namenode and what the namenode has acknowledged seeing. From there they only send deltas. Much less information, so less traffic on the wire and fewer changes to process. Goodness all around. And to keep the datanodes from all hitting the namenode at once, the updates are splayed out over time.
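
To illustrate just the bookkeeping idea (this is not the actual HDFS block report protocol), a tiny Python sketch:

```python
# "Only send deltas": the datanode remembers what the namenode has
# acknowledged and reports only what changed since then. Block ids
# below are made up.

acked_by_namenode = {"blk_1001", "blk_1002", "blk_1003"}
currently_on_disk = {"blk_1002", "blk_1003", "blk_1004"}

added   = currently_on_disk - acked_by_namenode    # new blocks to report
removed = acked_by_namenode - currently_on_disk    # blocks that went away

print(f"report {len(added)} added, {len(removed)} removed "
      f"instead of all {len(currently_on_disk)} blocks")
```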

The problem we had was that our users were creating lots of little files. And lots of very deep directory structures with those tiny files sprinkled within them. So the number of directories/files/blocks grew. Quickly. To well over a billion.

At that point startup became a problem. We had a 1000 node thundering herd sending full updates to a single namenode. 1000 nodes splayed out over 20 minutes gives you just over 1 second per node. With 100 TB of disk per machine, transmitting and processing that data took well over a second. So the system backed up, connections timed out, the namenode forgot about the datanodes, or the datanodes forgot about the namenode, and the cycle started all over again. So we had to do something.

We could have turned everything off and then turned the datanodes back on slowly, but hardware really doesn't like that. Turning the nodes in a DC off/on is a really good way to break things, so we couldn't do that. And we didn't want to build/maintain a multi-host distributed state machine. We wanted a solution that ran in one place where we could watch and control it.

So what runs on one machine and lets you control what kind of network traffic gets in? IPTables of course. Now IPTables is a powerful and dangerous tool, and this was right around the time someone misconfigured IPTables at Core Business and took an entire data center offline, requiring re-imaging to get it back, so we wanted to be careful. Luckily, we knew which hosts would be contacting us and which port they would be hitting, so we could write very specific rules.

So that's what we did. Before we brought up the namenode we applied a bunch of IPTable rules. Then we watched the metrics and slowly let the datanodes talk. That got things working, but it still took 6+ hours and lots of attention. So we took it to the next level. Instead of doing things manually, we wrote a bash script that applied the initial rules, monitored the live metrics until the namenode was feeling OK, then released another set of nodes. That got us down to ~2.5 hours of automation. We still worried about it, since we were either down or had no redundancy during that time, but it was reliable and predictable.
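
Here's a sketch of that staged release in Python rather than the original bash. The namenode RPC port, the datanode host names, and the health check are all assumptions, placeholders for the real metrics-driven script.

```python
# Block every datanode's path to the namenode RPC port, then release
# them in batches while the namenode is keeping up. The port, host
# names, and health check below are assumptions, not the real setup.

import subprocess
import time

NAMENODE_PORT = "8020"                                    # assumed RPC port
DATANODES = [f"datanode-{i:04d}" for i in range(1000)]    # hypothetical names
BATCH_SIZE = 50

def block(host):
    # Drop this datanode's traffic to the namenode RPC port.
    subprocess.run(["iptables", "-A", "INPUT", "-s", host,
                    "-p", "tcp", "--dport", NAMENODE_PORT, "-j", "DROP"], check=True)

def release(host):
    # Remove the matching rule so the datanode can register and report.
    subprocess.run(["iptables", "-D", "INPUT", "-s", host,
                    "-p", "tcp", "--dport", NAMENODE_PORT, "-j", "DROP"], check=True)

def namenode_is_healthy():
    # Placeholder: the real script watched live namenode metrics
    # (RPC queue length, block report processing time, etc.).
    return True

# Block everyone before the namenode comes up...
for host in DATANODES:
    block(host)

# ...then let batches in as long as the namenode is keeping up.
for start in range(0, len(DATANODES), BATCH_SIZE):
    while not namenode_is_healthy():
        time.sleep(30)
    for host in DATANODES[start:start + BATCH_SIZE]:
        release(host)
```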

And that's how we used IPTables to calm the thundering herd.