Recent Posts (page 49 / 70)

by Leon Rosenshein

Toward A Better PM

I'm not talking about Nyquil or Tylenol PM or [Program|Project|Product] Manager. I'm talking about Post-Mortems, or Post-Incident Reviews as they're starting to be known. Whatever you call them, they're critical to improving the system. And that's the goal. Make the system better. Make it more reliable. More self-healing. To do that you need to figure out two things. What happened, both leading up to and during the incident, and why those things happened. You don't need to figure out where to place blame, and trying to do that makes it less likely that you'll be able to find out the important things.

What and why leading up to the incident. How we got into this situation. The 5 whys. It's so important to know how we got there that it's baked right into the incident response template. But here's the thing. We call it the 5 whys, but it often ends up being the 5 whats.

  1. The server was overloaded.
  2. The server was overloaded because there were 3 instances instead of 5.
  3. There were 3 instances instead of 5 because the autoscaler didn't scale the service up.
  4. The autoscaler didn't scale up because there was no more capacity.
  5. There was no more capacity because growth wasn't in the plan.

That's what happened, not why. Why didn't we know the service was getting close to overloading? Why did we get down to 3 instances without knowing about it? Why didn't we know the autoscaler wasn't able to autoscale? Why didn't we know we were close to capacity? Why didn't we anticipate more growth? You need to answer those questions if you want to make sure to fix that class of problem. Each of those "whys" has a solution that would have prevented the problem, or at least let you know before it got that bad.

A good PM goes over what happened during the incident. And that's good. Understanding the steps is important if you want to make an SOP in case something like that happens again. But there's more to it than that. Unless your incident/solution is such a well-worn path that you know exactly what to do and just do it (in which case, why isn't it automated?), chances are there was some uncertainty at the beginning. Those 6 stages of debugging. So instead of just recording/reporting the steps that were taken to mitigate the problem, think about that uncertainty. Why was it there? Was there an observability gap? A documentation gap? A knowledge gap? Maybe it was permissions, or too many conflicting signals. Ask questions about why there was uncertainty and what could be done to reduce the time spent clearing it up.

We naturally gravitate to the what questions. They typically have very concrete answers and it's easy to point to those answers and the process and say that's why there was an incident. And while we're focusing on the what we tend to ignore the why. However, if you really want to make the system better, to prevent things from happening in the first place, you need to spend less time on what happened, and more time on why it happened.

by Leon Rosenshein

Logging

Ever run into an error dialog on a web page that said something like "An error has occurred. Hit ok to continue"? Not very helpful, was it? What about "An error has occurred and has been logged. Please try again later."? Still not really helpful, but at least you feel better because someone knows about it now.

Well, this isn't about that error dialog (although some of it applies). This is about that log that got mentioned. We've got a decently sized team, and every 7 weeks I'm on call, so I get to deal with those log files. We (and our customers) use lots of 3rd party libraries and tools, so the logs I get to wade through are of varying quality. Lots of information. Lots of status. Some indication of what's happening, and the occasional bit of info about something that's gone wrong. Spark and HDFS in particular are verbose, but not all that informative. There are lots of messages about calls that succeeded and the number of things that are happening. The occasional message about a retry that succeeded. And then the log ends with "The process terminated with a non-zero exit code". Thus starts the search for the error. Somewhere a few pages (or more) up in the log is the call that failed.

So what can we do to make it better? First and foremost, log less. Or at least make it possible to log less by setting verbosity levels. In most cases logging the start, middle, and end of every loop doesn't help, so only do that if someone sets a flag. All that logging does is add noise and hide the signal.

Second, don't go it alone. Use existing structured logging libraries where possible. Your primary consumer is going to be a human, so log messages need to be human readable, but the sheer size of our logs means we need some automated help to winnow them down to a manageable size, so add some machine parsable structure.
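As a minimal sketch of what that might look like, here's Python's standard logging module with a small JSON formatter; the logger name, field names, and the `--verbose` stand-in are all invented for the example, not a house format.

```python
import json
import logging
import sys


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line: human readable, machine parsable."""

    def format(self, record):
        payload = {
            "ts": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "msg": record.getMessage(),
        }
        # Merge any structured fields passed via extra={"fields": {...}}.
        payload.update(getattr(record, "fields", {}))
        if record.exc_info:
            payload["exception"] = self.formatException(record.exc_info)
        return json.dumps(payload)


handler = logging.StreamHandler(sys.stderr)
handler.setFormatter(JsonFormatter())
log = logging.getLogger("ingest")
log.addHandler(handler)

# Verbosity is a runtime choice: default to INFO, drop to DEBUG only when asked.
verbose = "--verbose" in sys.argv  # stand-in for real flag parsing
log.setLevel(logging.DEBUG if verbose else logging.INFO)

log.debug("processed batch %d of %d", 3, 100)  # loop chatter, hidden by default
log.info("ingest complete", extra={"fields": {"rows": 125000, "duration_s": 42}})
```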

Third, when you do detect an error, log as much info as possible. What were you doing? What was the actual error/return code? What was the call stack? What were the inputs? The more specific you can be the better. The one thing you want to be careful about is logging PII/financial info. If you're processing a credit card payment and it fails, logging the order id and the error code is good. Logging the user name, address, credit card number and security code is too much.
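A sketch of that last point; the gateway call, the error type, and the field names are all invented for the example, and the interesting part is what gets logged and what deliberately doesn't:

```python
import logging

log = logging.getLogger("payments")


class GatewayError(Exception):
    """Stand-in for whatever your payment library actually raises."""

    def __init__(self, code):
        super().__init__(code)
        self.code = code


def gateway_charge(card_number, cvv, amount_cents):
    raise GatewayError("card_declined")  # simulated failure, just for the sketch


def charge(order_id, card_number, cvv, amount_cents):
    try:
        gateway_charge(card_number, cvv, amount_cents)
    except GatewayError as err:
        # Enough context to act on: what we were doing, which order, what came back,
        # plus the stack trace. Deliberately no card number, CVV, name, or address.
        log.error(
            "charge failed: order_id=%s amount_cents=%d gateway_code=%s",
            order_id,
            amount_cents,
            err.code,
            exc_info=True,
        )
        raise
```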

Finally, think about the person who's going to be looking at it cold months from now. Is the information actionable? Will the person looking at it know where to start? There's a high probability that the person dealing with it will be you, so instead of setting your future self up for a long search for context, provide it up front. You'll thank yourself later.

by Leon Rosenshein

Avoiding The Wat

In the beginning was the command line. And the command line is still there. Whether you're using sh, bash, csh, tcsh, zsh, fish or something else, you end up running the same commands. Today I want to talk about those commands. Or at least those commands and the principle of least surprise.

Let's start with a simple one. `-?` should print out help for your command. I know someone can come up with a valid case where that doesn't make sense, but unless you've got a really good reason, go with that. Bonus points for also giving the same output for `--help`.

Another easy one. Have a way for the user to get the version of your command. If you have subcommands, use a `version` subcommand; if not, use `-v`.

You should be able to type each flag by itself (in any order), or collect flags into a series of characters preceded by a `-`. If any of those flags needs an argument and doesn't get one, that's an error.

Having more than one instance of a flag that can't be repeated should be an error. If you don't like that, take the last value.

Simple, common options should have simple one-letter ways to specify them. They should also have more verbose forms, and it's OK for some things to only have the verbose form. Always use `--` for your verbose options.

Speaking of --help, if you have sub-commands then -? or --help should work at any level, giving more detailed info about the sub-command when known.
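Most of the above comes almost for free from an argument-parsing library. Here's a rough sketch using Python's argparse (the tool name, subcommand, and flags are invented for the example): help works at every level, `-?` is registered as an alias for it, `--version` reports the version, grouped short flags like `-nv` work, and a repeated single-value flag keeps the last value.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(prog="dtool", description="example data tool")
    # -h/--help come for free; register -? as an extra alias for the same help action.
    parser.add_argument("-?", action="help", help="same as -h/--help")
    parser.add_argument("--version", action="version", version="%(prog)s 1.2.3")

    sub = parser.add_subparsers(dest="command", required=True)

    sync = sub.add_parser("sync", help="synchronize a dataset")
    sync.add_argument("-?", action="help", help="same as -h/--help")
    sync.add_argument("-n", "--dry-run", action="store_true", help="don't write anything")
    sync.add_argument("-v", "--verbose", action="count", default=0, help="more output")
    sync.add_argument("-o", "--output", default="-", help="where to write results")
    return parser


if __name__ == "__main__":
    # `dtool --help`, `dtool sync -?`, and `dtool --version` all do the expected thing.
    # `dtool sync -nv` groups short flags; `dtool sync -o a -o b` keeps the last value.
    print(build_parser().parse_args())
```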

Have sensible defaults. That's going to be very context sensitive, but if you have defaults they need to make sense. An obvious example would be rm using the current directory as a base unless given a fully qualified path.

Use stderr and stdout appropriately. Put the pipeable output of your command on stdout, and send errors to stderr. And once you've released an output format, think 3 or 4 times about changing it. Someone has parsed it and is using it as the input to something else.
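A minimal sketch of that split, assuming a hypothetical filter that reads JSON lines on stdin: results go to stdout where the next tool in the pipe can use them, diagnostics go to stderr, and the exit code says whether it worked.

```python
import json
import sys


def main():
    try:
        records = [json.loads(line) for line in sys.stdin if line.strip()]
    except json.JSONDecodeError as err:
        # Diagnostics go to stderr so they don't pollute whatever is downstream in the pipe.
        print(f"error: invalid input: {err}", file=sys.stderr)
        return 1

    for record in records:
        # The pipeable result goes to stdout, one JSON object per line.
        print(json.dumps(record))
    return 0


if __name__ == "__main__":
    sys.exit(main())
```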

Command completion is your user's friend, so support it. Especially if you have lots of sub-commands and options. Bonus points here for being able to complete options and providing alternatives when the user spells something wrong (like git does).

And above all, never make them say "WAT".

by Leon Rosenshein

6 Stages Of Debugging

  1. That can't happen.
  2. That doesn't happen on my machine.
  3. That shouldn't happen.
  4. Why is that happening?
  5. Oh, I see.
  6. How did that ever work?

Not quite the 5 stages of grief, but surprisingly similar.

First, Denial. It's impossible. The user did something wrong. It's not my code. The sunspots did it. Come back when you've got real proof that this is my bug and there's something for me to fix. To get through it, trust that the bug report is identifying an issue. It might not be what the user thinks, but there is a problem. The user was surprised, which means you need to do a better job of expectation management.

Then on through Anger and Bargaining. My testing didn't show that problem. Whose fault is it? Just fix DNS. Get the user to be more careful. Get someone else to do a better job of scrubbing the inputs. Yes, the system you're in needs to be taken into consideration. The problem could be elsewhere, so it makes sense to step back and get a slightly wider perspective. Is the problem related to other, similar issues? Is there something more systemic that could/should be done? Make sure you're putting the fix in the right place.

Depression. How did we get here? What did I do wrong? How could I have prevented this? It's something that needs to be fixed. So what can you learn from this? Keep track of where you missed the mark or made assumptions about things. Identify what could be different in the future.

Acceptance. It's your problem to fix, so make it better. Take everything you've learned about the issue to this point and apply it. Own the issue, and share the learnings. Help others not miss the mark the same way.

The last stage of debugging, "How did that ever work?", has gone beyond grief. Now you're looking at the entire system and wondering what other bugs/features are lurking about. At this point you're an explorer, learning new and exciting things about your system. When you get to that point the best thing you can do is tell us all what you learned.

by Leon Rosenshein

Just Keep Learning

The only constant is change. That's true for many things, including being a developer and a developer's career. Lots of things change. The computers we use. The tools that we use. The language(s) we write in. The business environment. The goals we work toward. And on top of all of those external changes, we change too. While there's an unbroken line of "me" from that first part-time developer job in school to now, I'm not the same person I was then.

I started my career writing Fortran 4 on an Apollo DN3000 computer and C89 on an SGI 310 GTX, on a 5-computer network using 10Base5 Ethernet. You won't find anyone using any of those outside of a museum these days. Today we're running a 2000 node cluster using Kubernetes, C++, Golang, and Python. Those are a lot of changes to the fundamentals of what I do for a living.

Back then I was a Mechanical and Aerospace Engineer that used computers to simulate aircraft and missile flight. Since then I've done air combat simulation for the USAF and for games, pilot AI, visualization, 2D and 3D modeling and map building, fleet management, project management, data center management, and large scale parallel processing.

That's a lot of changes to go through, and what I'm doing now doesn't look that much like what I went to school for. So how did I get from then to now? By continuing to learn. And not learning in one way, but in many ways. Some was formal education. I got a master's in Aerospace Engineering early on, and then about 15 years later I got an MBA, but that is only a small part of what I learned. There were also conferences, vendor classes, company-sponsored and online classes, and, probably the most important, my coworkers.

My coworkers and I learned from experience. Doing things that needed to be done. Fixing bugs, writing tools, adding functionality, and learning along the way. I learned from them and they learned from me. Languages, IDEs, operating systems, open source and in-house frameworks. We learned the basics together and helped each other out along the way. And I asked questions. Because pretty much no matter what I was doing, there was someone who knew more about it, had more experience, or had solved similar problems. So I took advantage of that and learned. I still do that. When there are things that need to be done around deep learning or security, I work with people who know those areas and learn the details from them.

Another way to keep learning is to teach. Sharing what you've learned forces you to really think about what you've learned and what the essence is. You need to understand something deeply to explain it clearly and succinctly. Writing these little articles has helped me understand things better, but you can get the same benefits from a tech sharing session in your team where you explain the latest feature you've added.

Because the world is changing, and you need to keep up.

by Leon Rosenshein

In Defense Of Laziness

The parentage of invention is something of an open question. It's often been said that necessity is the mother of invention, and that seems reasonable, but who's the father? I've heard a bunch of different answers, including opportunity and ability, but my money's on laziness.

But it's a special kind of laziness. Not the kind that sweeps the dust under the rug instead of cleaning up or the kind that finishes a project and leaves the tools wherever they lay. I'm talking about the kind of laziness that knows the priorities. The kind that realizes that doing things in the right order minimizes rework. The kind that, when digging a ditch, makes it big enough to handle the expected flow and some more, but doesn't build one big enough to handle the Mississippi river.

And that goes for developers as well. Good developers have that special kind of laziness. They know that if your method/library/application doesn't do what it's supposed to it doesn't matter how clean the interface is. They know that when you're dealing with terabyte datasets a few kilobytes of local data isn't something to worry about.

YAGNI (You Ain't Gonna Need It) is laziness. It's not doing work before you need to. You might never need it, so be lazy.

DRY (Don't Repeat Yourself) is laziness. Whether it's using someone else's library/package/service/tool, or refactoring to get rid of duplication, or writing and maintaining less code. Be lazy and free up the time to do something more important.

DevOps is all about laziness. Automate all the things. Make them redundant. Make them resilient. Be lazy and build tools and systems to manage the day-to-day so you don't have to.

To be clear, this doesn't mean skip the important things. Don't skip the unit tests or the code review. Don't skimp on requirements or talking to customers. Don't skimp on being thorough.

So be lazy, but only at the right times.

by Leon Rosenshein

Solving Problem Solving

I've mentioned this before, and it's still true. They don't pay us to write code. They pay us to further the mission, and the ATG mission statement is _Our mission is to bring safe, reliable self-driving transportation to everyone, everywhere_. Nothing in there about writing software, or building AI models or more secure hardware. Nope. We write code (or do all the other things we're doing) in support of the mission. To solve a problem. It's a very big problem, and it's going to require software and models and hardware and all those other things to solve. So how do we get better at solving problems?

Problem solving is a skill, and like any other skill, to get better you improve your technique and practice. So what does an improved technique look like? It starts with understanding the problem. Don't just write the function, but look at the underlying problem you're trying to solve. The 5 whys are a good place to start for that. Then think about edge cases. What are they? What are the failure modes? What's the happy path? Understand your boundaries and constraints, both technical and business-wise.

Once you understand what you're really trying to do, do some research. See how others have solved similar problems. Think about analogies. What problems are like yours, but in a different field? Decompose the problem into smaller parts and research them. Look for inspiration and ideas there. Think about how to put those smaller parts together into a coherent whole.

Next, make a decision. Look at the pros and cons for the different options. Think about long and short term solutions. Think about hidden costs and future benefits. Then, just do it.

What about practice? What does practice mean in this case, and when can you do it? Start small. There are multiple times a week when we need to solve a problem. Even if you think you know the solution, go through the steps. Be mindful about it and notice how you're going through the steps. Each time you go through it, it gets easier and more natural. That's practice.

Remember, practice is about the process, not the result. The goal here is not to change your mind, but to get used to a new way of solving problems and making that method second nature. As a band teacher once said, don't practice until you get it right, practice until you don't get it wrong.

by Leon Rosenshein

Velocity Is Internal

Story points and burndown charts. Two staples of the agile planning process. They're what let you have confidence that what you're planning for any given sprint is close to what will be delivered. And they're pretty good at that. They let you compare projections and actuals over time for your team. But that's pretty much all they're good for.

One of the things they're very bad at is comparing across teams. You can make all the claims you want about common definitions for story points and capacity, but the reality is that every team goes through the 4 stages of development (remember forming/storming/norming/performing?) and gets to a point where the team consistently agrees on what a story point means to them. And once you have a consistent value for a story point and historical data on how many of those story points get done in a sprint, you can use it to predict, with some accuracy, how much work to put in a sprint and expect it to get done. But that only applies to that team. Other teams have different definitions of story points and different capacities, so comparing velocity across teams is like comparing the efficiency of a regional bus and a taxi by the number of trips they make a day. With enough context you might be able to compare them, but just looking at the two numbers doesn't tell you much. If you want to compare teams using some of that information, then maybe look at comparing the accuracy of their predictions. If the ratio of story points predicted to accomplished gets close to 1 and stays there, then the team is making reliable predictions.
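As a rough sketch of that accuracy check (the numbers are made up):

```python
# Predicted vs. completed story points for one team's recent sprints (made-up numbers).
sprints = [
    {"predicted": 30, "completed": 22},
    {"predicted": 28, "completed": 25},
    {"predicted": 26, "completed": 27},
    {"predicted": 27, "completed": 26},
]

for i, s in enumerate(sprints, start=1):
    ratio = s["completed"] / s["predicted"]
    print(f"sprint {i}: predicted={s['predicted']} completed={s['completed']} ratio={ratio:.2f}")

# A ratio that trends toward 1.0 and stays there means the team's predictions are
# reliable. It still says nothing about how this team compares to any other team.
```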

Another thing velocity is bad at is predicting the project completion date. And that's for similar reasons. The task list, let alone the story points, estimated early in the cycle just isn't that accurate. So until you get close to the end and really know how many points are left, as defined by the team that is going to be working on them, you can't just divide by velocity and trust the answer.

Bottom line, historical velocity is a short term leading predictor, not a long term performance measurement. So don't treat it as such.

by Leon Rosenshein

Everything Can't Be P1

A few weeks ago I talked about priority inflation in bugs and tasks. It really happens, but it also kind of doesn't matter. You could have P0 - P3, P1 - P4, or P[RED|YELLOW|GREEN|BLUE|BLACK|ORANGE] or any other way of distinguishing priorities. That's because the priorities themselves don't mean anything. They're just a ranking. They have no intrinsic value/meaning.

And once you get past the fact that all priority is relative, and in fact only relative to the other things it was prioritized with, you realize that with a finite set of priorities, your prioritized list of bugs/tasks becomes far less useful. Consider this situation. You've got a deadline tomorrow and time to fix one more thing. You check the bug list and find there are 3 Pri 1 bugs, 10 Pri 2, and about 50 Pri 3. What do you do?

You can quickly eliminate the Pri 2 and Pri 3 items because you know the Pri 1 items are more important. But which one? You've certainly got your opinion, and you've probably heard other people's, but you really don't know. So you pick your favorite one, or the easiest one, or the one that you think is most important. Meanwhile, the one that's really most important (as defined by the product owner/PM/whoever is in charge) languishes. And this can go on for days as new issues are found, added to the list, and prioritized.

So how do you know what's really the most important thing and make sure that it gets worked on? And what about the case where a series of small tasks, none of them very important on its own, provides more value as a group than other single tasks? That's where an ordered list of tasks comes in.

The most important thing, the next thing that should be worked on, is at the top of the list. Do that first. If you need roll-up tasks to show the value, put them on the list, not the sub-tasks. And keep the list ordered. How often you re-order depends on lots of things: where you are in the cycle, how fast the team is working through the list, and how fast things are being added to it, at least.
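A small sketch of the difference, with an invented backlog: the backlog is just an ordered list, the next thing to work on is whatever is on top, and re-prioritizing means re-ordering rather than re-labeling.

```python
# An ordered backlog: position is the priority, so "what's next?" has exactly one answer.
backlog = [
    "Fix checkout timeout for EU users",   # the single most important thing right now
    "Roll-up: small logging cleanups",     # a group of small tasks worth more together
    "Upgrade the build image",
    "Refactor config loading",
]

print("work on:", backlog[0])  # no scanning a pile of P1s and guessing which one matters most

# When something new and urgent shows up, re-prioritizing is just re-ordering.
backlog.insert(0, "Customer-reported data corruption in exports")
print("work on:", backlog[0])
```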

Your list doesn't need to be perfect, and as long as the next day or two's items are at the top, the rest doesn't matter all that much. Under the principle of "Don't make any decisions before you need to", make those decisions later, when you have more information to inform them. So you can spend less time making and remaking decisions and more time acting on them.

There's another dimension orthogonal to priority: severity. Severity helps inform your decision with information about what happens if you don't address the issue, but that's a topic for a different day.

by Leon Rosenshein

Swallowing The Fish

Back when I joined Microsoft as the dev lead for the Combat Flight Sim team, Bill Gates was CEO and the mission statement was pretty simple, "A computer on every desk and in every home", with the unspoken ending, running Microsoft software. By the time I joined Uber, Satya Nadella was CEO and the mission statement had gone through a few iterations and was "To empower every person and every organization on the planet to achieve more." Since then the company has changed even more. And that means that Microsoft was able to do something very few companies of its size and market position have been able to do.

When I started, pretty much everything was in service to Windows or Office. SQL Server and Developer Division were there, but mostly to get folks to buy Windows and develop software for it, or, in today's terms, to drive the flywheel. I started out in the games group, and we were there to give people a reason to buy Windows-based computers. So much so that FlightSim was canceled because its CM was only in the millions.

And that's the Innovator's Dilemma. How does a company handle a small competitor when it can make more money focusing on high end customers and ignoring those small threats? That was the position Microsoft found itself in about 10 years ago. Windows/Office was making money hand over fist and enterprise sales for DevDiv and SQL were up. And that's where the investments were being made.

Meanwhile, online, subscription-based delivery and cheaper, "good enough" options were showing up. Sure, Word and Excel could do more than the Google equivalents, but for many (most?) folks they were enough. Companies have run into these problems before. The history of tech companies is full of them. From hardware (DEC, Sun, Prime), compilers (Borland, Watcom, Metrowerks), databases (Ashton-Tate, Paradox), operating systems (Solaris, CP/M), and browsers (Netscape) to business software (WordStar, WordPerfect, VisiCalc, Lotus 1-2-3, Lotus Notes), there were lots of dominant players that disappeared. And in many of those cases it was Microsoft that replaced them.

So what did Microsoft do? They recognized that they needed to take a longer view. That there needed to be a short-term hit to the bottom line to shift how things were done and what the goals were. They invested in new products and moved to newer business models. Or, in other words, they paid down their technical debt.

And that's an important takeaway. Don't get so tied to your current way of doing things, or your current idea, that you don't look at new ideas.