Recent Posts (page 35 / 70)

by Leon Rosenshein

Precision vs. Accuracy

Everyone wants to be accurate, and everyone wants to be precise. Those are both great goals, and it’s a wonderful thing when you can be both precise and accurate. On the other hand, it becomes a problem when you trade one for the other or, even worse, mistake one for the other.

What’s the difference between precision and accuracy? The way I think about it, precision is a measure of the “size” of a quantum. An hour has 1/60th the precision of a minute, and a year 1/525600th. Precision is a property of the measurement you’re making and has nothing to do with the thing being measured. If you measure in whole years there was a 1 year time period (525600 minutes) when I was 21 years old. If you were to measure in whole hours there was a 1 hour time period (60 minutes) when my age was 184086 hours. The measurement of 184086 hours old is much more precise than 21 years old. Measure it in minutes or seconds and it’s still more precise.

Accuracy, on the other hand, is a measure of how close the measurement is to the truth, however you want to define truth. Going back to the age example, if I were to tell you I was 54 years old I would be 100% accurate. However, if I told you I was 473364 hours old I would be almost 2% off, because I’m most of a year past my 54th birthday and the extra precision just makes that gap visible. Both 54 years and 473364 hours represent the same timespan, but the accuracy of the two is different.
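To make the distinction concrete, here’s a minimal Go sketch (the birth date and “now” are invented for illustration) that reports the same age at two precisions and shows that the more precise claim isn’t automatically the more accurate one:

package main

import (
  "fmt"
  "time"
)

func main() {
  // Invented dates, standing in for "truth"
  born := time.Date(1967, time.March, 15, 8, 42, 0, 0, time.UTC)
  now := time.Date(2022, time.February, 1, 12, 0, 0, 0, time.UTC)
  age := now.Sub(born)

  // Precision is the size of the quantum you report in
  years := int(age.Hours() / (365.25 * 24)) // coarse: whole years
  hours := int(age.Hours())                 // fine: whole hours

  fmt.Printf("%d years old (1-year precision)\n", years)
  fmt.Printf("%d hours old (1-hour precision)\n", hours)

  // More precision doesn't buy accuracy: converting the whole-year
  // answer to hours just makes the truncation error visible
  claimedHours := float64(years) * 365.25 * 24
  errPct := 100 * (age.Hours() - claimedHours) / age.Hours()
  fmt.Printf("claiming %.0f hours would be %.1f%% off\n", claimedHours, errPct)
}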

Of course the two are intimately related. Consider the birthday example. My birth certificate is considered truth, and it has a time of birth with a precision of 1 minute. But what’s the accuracy? We don’t really know. How much time passed between being born and looking at the clock? Probably not much, but some. And how precise/accurate was the clock? When was it last set? To what standard? It was almost certainly an analog clock, so the angle of view can change the reading as well. In my case it doesn’t make much difference, but consider the person with a birth time of 2359 local. That’s one minute of precision, but an accuracy slip of 2 minutes has the person born on the wrong day. And if it was December 31st it could be the wrong year as well.

Ballistics is another area where the difference between the two is apparent. Back in my earlier days we talked about Circular Error Probable (CEP) a lot. For a given set of initial conditions, how tight a cluster would the bombs land in? How big would a circle need to be to include 90% of the attempts (CEP-90)? The smaller the circle, the better, and the more precise, the system was. But that says nothing about accuracy. The center of the cluster could have been anywhere relative to the bombsight’s aim point and the CEP would be the same. Getting the aim point and the center of the CEP to match is the accuracy. That was my job, and that sticking actuator gave me a lot of grief before I had enough data to stop worrying about the size of the CEP, but that’s another story.

As engineers we know all this. We’ve been taught about it and have dealt with significant figures in math for years. But what does this have to do with development? It’s important when it comes to making estimates. Ideally you can be both accurate and precise, but that’s both hard and rare. In that case I say accuracy is more important. And even more important is to not confuse the two. Just because we estimate in hours or sprints doesn’t mean we know to the nearest hour when something will be done. We need to be careful not to conflate the precision of the answer with its accuracy. It’s an estimate. And it will get more accurate as time goes on and we know more and get closer to the end. But it rarely gets more precise. How to deal with that is a topic for another day.

https://imwrightshardcode.com/2013/07/to-be-precise/

https://en.wikipedia.org/wiki/Accuracy_and_precision

by Leon Rosenshein

Poor Man’s Polymorphism

It’s been said that an if statement is polymorphism at its most basic. It’s also been said that if-else is a code smell. Last year I talked about using data and data structures to eliminate if and switch from your code. Coding to interfaces can make your code cleaner and easier to follow. If reducing if is good then eliminating it is great, right?

Not always. A big part of development is building a mental model of what you want to happen. And there’s a limit to how big those models can be. Sure, any decision logic can be turned into a state machine and used to “hide” the details, but sometimes the simplest state machine is just a set of if statements.

The other day I wrote about cleaning up some complicated decision logic. I did go back and clean it up a little more. The code ended up looking like this:

_, err := os.Stat(path)
if err == nil {
  return true, nil
}

if os.IsNotExist(err) {
  return false, nil
}

return false, err

Which is about as simple as it can get. And it’s obvious. No hidden complexity. No extra layer of abstraction. No deeply nested ifs turning the code into a right arrow. Sure, we could have written a factory that builds an instance of an interface that does the right thing, but that’s just hiding the if behind one more level of abstraction. And no one wants that.

So how do you know when to just if it and when to use the interface? It’s a combination of code size and frequency. Replacing 8 lines in one place with a factory/interface construct doesn’t make sense. On the other hand, building your own vtable to map shape types to their different functions (something the compiler can do for you) is just as wrong, and doing it in one giant dispatch function that only uses if is even worse.
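As a sketch of that contrast (the shape types here are invented, not from any real codebase), compare the dispatch table the Go compiler builds for you with the one you’d have to maintain by hand:

package shapes

import "math"

// Interface dispatch: the compiler builds and maintains the vtable,
// and the type system won't let a shape forget its Area method
type Shape interface {
  Area() float64
}

type Circle struct{ R float64 }
type Square struct{ S float64 }

func (c Circle) Area() float64 { return math.Pi * c.R * c.R }
func (s Square) Area() float64 { return s.S * s.S }

// Hand-rolled dispatch: your own vtable, keyed by strings you have to
// keep in sync by hand. More code, more ways to get it wrong
var areaByKind = map[string]func(interface{}) float64{
  "circle": func(v interface{}) float64 { c := v.(Circle); return math.Pi * c.R * c.R },
  "square": func(v interface{}) float64 { s := v.(Square); return s.S * s.S },
}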

So remember, it depends, and let the data guide you.

by Leon Rosenshein

Expediency

Sometimes the wrong solution is the right solution at a particular moment in time. There are times when less-than-elegant code is appropriate. Back in the day of boxed products and in-store shopping, getting the gold master disk to the manufacturer in time to make your production slot was one of them. Miss your slot and you might miss holiday sales. Today it’s outage mitigation. The longer an outage goes on, the more your customers suffer.

Which is not to say that you can/should write bad code at those times. Code that is full of bugs and edge cases is just as much a problem when you’re in crisis as it is when you’re not. But a quick hard-coded if <weird, well-defined special case> then return at the right time and place can get your system up and running and keep your customers satisfied. And you almost certainly need to add some work to your backlog to go back and develop the long-term fix. But that doesn’t negate the importance of the quick fix.
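Something like this hypothetical guard (the types, names, and scenario are invented for illustration) is often all the quick fix needs to be:

package orders

// Invented types standing in for a real system
type Order struct {
  Source string
  Items  []string
}

func priceAndFulfill(o Order) error { return nil } // stand-in for the normal path

// The expedient fix: one weird, well-defined special case handled up
// front, with a backlog item to remove it once the real fix lands
func processOrder(o Order) error {
  // Empty orders from the legacy importer crash the pricing step.
  // They carry no line items, so dropping them here is safe for now.
  if o.Source == "importer-v1" && len(o.Items) == 0 {
    return nil
  }
  return priceAndFulfill(o)
}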

Shipping/Mitigating is a feature. And it’s probably the most important one. Even if you get everything else perfect, if it never ships, have you really done anything? You’ve expended energy, but have you done any work?

So how do you know when to do the “right” design vs the “expedient” design? Which one adds more customer value today? How much will the expedient solution slow you down tomorrow? How long will you need to get back the time spent on the “right” solution? If the quick solution takes 5 minutes and doesn’t change your overall velocity but the elegant solution will take a couple of man-months then you probably shouldn’t bother being elegant.

A classic example of this is the Aurora scheduler for Mesos. Aurora is a stable, highly available, fault-tolerant, distributed scheduler for long-running services. It has all of the things you’d expect from such a scheduler. It handles healthchecks, rolling restarts/upgrades, cron jobs, log display, and more. If something happens to the lead scheduler, another one is automatically elected. If the scheduling system falls over, when it comes back it looks at where it thinks it left off and reconciles itself with the actual running system. It also has an internal memory leak. There’s something in the code that the lead scheduler runs that will cause it to eventually lose its mind. I don’t know exactly what it is, and neither do the authors/maintainers. But they know it’s there and they have a solution. Every 24 hours the lead scheduler terminates itself and restarts. At that point the rest of the system chooses a new lead scheduler and everything continues happily. I don’t know how much time they spent looking for the problem, but the solution they came up with took a lot less time. It works in this case because the system is designed to be resilient and they had to get all of those features working anyway. They just found a way to take advantage of that work to handle another edge case. With that fix in place there’s no need to spend more time worrying about it. It’s an isolated, testable (and well-tested) bit of code, so it doesn’t impact their overall velocity.
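The heart of that mitigation is tiny. Aurora is written in Java, so this Go version is purely a sketch of the pattern, not their code:

package scheduler

import (
  "log"
  "os"
  "time"
)

// scheduleLeaderRestart arms a planned exit. Leader election and state
// reconciliation were already built, so the restart is just another
// failover the system knows how to absorb, and the leak never has
// time to matter.
func scheduleLeaderRestart() {
  time.AfterFunc(24*time.Hour, func() {
    log.Println("planned restart: stepping down so a new leader is elected")
    os.Exit(0) // the process supervisor starts a fresh instance
  })
}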

On the other hand, adding more and more band-aids and special cases to your event handler will eventually lead to emergent behavior, and your velocity will slow. Back in the Opus days we had a thing called the Flying Spaghetti Monster, or FSM. Its job was to manage the lifespan of your data. Basically you would put a tag on a directory and then the system would delete it when its lifespan was over. It only had 3 modes: TimeSinceCreation, ChildTimeSinceCreation, and Quota. But the logic of how you could nest them, and the time periods on the first two, could lead to some odd corners, so whenever we had to touch that code the first thing we’d do was add more tests for the new case. Because it just kept getting more complex. After about a year I realized that it would take me less time to redo the whole set of logic than to modify the house of cards it had turned into. Luckily we had all the tests to make sure the logic didn’t change.

So when do you do the right thing vs. the expedient thing? It depends on which gives more customer value over time, remembering that real value now is much more valuable than potential value later.

t.co/1xpWptiyBI?amp=1

by Leon Rosenshein

Help Me

I’ve talked before about how to ask for help. How you ask is important for lots of reasons. It lets the other person know what problem you’re trying to solve. It lets them know what didn’t work so they don’t suggest that. It shows that you’ve put in some effort and you’re not just looking for someone to do the work for you.

But just as important as how to ask is when to ask. And the answer, like always, is “It depends”. The easy answer is that you should ask as soon as it becomes apparent that it will cost more to figure it out yourself than it would to get help. Simple, right?

Not really. Because you can’t know either side of the equation. You might have an epiphany in the next 10 seconds, or it might take 3 more days of digging. And you can’t know what the cost is. You might be able to figure out the direct cost, approximated by the hourly rate of the person helping you and the time they spend helping you, but what about the indirect cost? They were doing something important, so that got delayed. They were possibly interrupted, so that cost additional time. All sorts of costs to consider.

On the other hand, think of the benefits. A 2 sentence answer at the right time can save you days of exploring and investigating. Getting done days sooner adds customer value itself, and that time lets you add even more value. So how do you balance it?

One way is to look back. If you’ve spent a day making no progress what’s the likelihood that you will have a breakthrough in the next hour? You know yourself best, but that likelihood is probably pretty low. So ask.

Another way is to look forward. If the destination is across territory that is uncharted (to you) and there are people who’ve already explored it, that might be a good time to ask for help. Or at least ask for a map. And if there isn’t one, make it for the next person.

Finally, be honest with yourself. Don’t let pride or fear keep you from asking. In many cases they are flip sides of the same coin. Everyone has things they don’t know. Don’t be afraid to ask for help. That’s one of the ways we learn. It doesn’t mean you failed, aren’t capable, or don’t care. It’s about doing the most valuable thing for the team and the customer.

And there’s one thing you can be sure of. There was a time when the person helping you didn’t know the answer either.

docs.google.com/document/d/1QzFcWc6TAbI1APOMv-9e8VYZtm08DoLCJyEQ2WUDLKE/edit#heading=h.eglxkhp0bwee
docs.google.com/document/d/1jPl0AzkW8G6QThwuThzz7bZRAkxfpFLTUXzuNq7mzaw/edit#heading=h.orc3ioi5c1dy
imwrightshardcode.com/2019/05/when-to-ask-for-help

by Leon Rosenshein

The Scouting Rule

Try and leave this world a little better than you found it.

        – Lord Baden-Powell

The scouts have as one of their tenets to not just clean up the campsite after themselves, but to make it a little cleaner/better. While we might not be scouts now, that idea applies to us too. It’s also a great way to manage technical debt.

Development velocity is important. The speed at which we can add features/capabilities drives how quickly we can add value. The trick is managing both instantaneous and long-term velocity. It’s easy to always make the choice that maximizes instantaneous velocity, but over time you often end up going slower overall.

One way to combat that is to clean things up while you’re working on them. The other day I was tracking down a bug in the infra tool that caused a crash if there was no pre-existing config file. The fix itself was simple. Just ensure the needed directory structure was in place before trying to create the config file. But while I was debugging I ran into this little bit of code:

func (fileSystem) Exists(path string) (bool, error) {
  _, err := os.Stat(path)
  if err != nil && !os.IsNotExist(err) {
    return false, err
  }
  return err == nil, nil
}

Now that is correct, but it took me 10 minutes to really be confident that it was. So I changed it to be

// If Stat works the file exists. If Stat errors AND the error is an IsNotExist error then the file doesn't
// exist. For any other error we say the file doesn't exist and return the error
func (fileSystem) Exists(path string) (bool, error) {
  _, err := os.Stat(path)
  if err != nil {
    if os.IsNotExist(err) {
      return false, nil
    }
    return false, err
  }
  return true, nil
}

That’s much easier to read, and the next person who has to go in there will thank me. It’s probably going to be me, and I’ll still be thankful. BTW, there’s an even better way to rewrite it, but that will wait until I have a free moment or find myself in there again.

So, next time you’re in the code and you see something not quite right or that should be refactored, you should just do it, right? Well, … it depends. It depends on how big a refactor it is, whether it’s going to make doing what you went into the code to do easier, and what the next known things you’ll need to do are. If it is directly related and will make you more efficient then you probably should. If it will be helpful next week then maybe. If you think you’ll need it in 6 months then you should probably document the issue and not fix it now.

And as an aside, when you’re doing refactoring changes, do your best to keep them separate from the change they’re enabling. It’s much easier to test/review a pure refactoring PR that isn’t supposed to have any runtime impact separately from the PR that changes behavior.

https://docs.google.com/document/d/1QzFcWc6TAbI1APOMv-9e8VYZtm08DoLCJyEQ2WUDLKE/edit#heading=h.xcbfdi4m6jyn
https://www.stepsize.com/blog/how-to-be-an-effective-boy-girl-scout-engineer

by Leon Rosenshein

A Short Pause

Just wanted to let everyone know that I'm going to take a short pause here. I'm working with the Aurora comms team to figure out the best way to move forward with this and share with an even broader audience.

I wanted to say thank you to everyone. Working on these has helped me clarify my thoughts and think deeper about my opinions and viewpoints. So thank you to all those who commented or reacted in Slack or reached out directly.

I've still got plenty to say, but I'm always looking for new topics as well, so if there's something you have questions about feel free to add it as a comment or send it to me at leonr@aurora.tech.

Meanwhile, remember to ask "What is it you're really trying to do?", and that the answer to any question that starts with "How should I …" is probably "It depends".

by Leon Rosenshein

Current State

State is important. And state is hard. It's one of the two hard things in computer science. Even something as simple as checking whether a file exists, or checking the value of a variable before deciding whether to act, is a problem: between the time you check and the time you act the state could change. Unless you use some kind of mutual exclusion to prevent it. A mutex, if you will. Having multiple cores and multiple CPUs in a system running multiple simultaneous processes makes it even harder, but with global locking and integrated test-and-set instructions in the silicon it's doable. Painful and hard on performance, but doable.
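Here's the check-then-act race, and the mutex that closes it, in a minimal Go sketch:

package state

import "sync"

type Counter struct {
  mu sync.Mutex
  n  int
}

// Racy: another goroutine can change n between the check and the act
func (c *Counter) IncrementIfBelowRacy(limit int) {
  if c.n < limit { // check
    c.n++ // act: n may already have passed limit by now
  }
}

// Safe: the lock turns check-then-act into a single atomic step
func (c *Counter) IncrementIfBelow(limit int) {
  c.mu.Lock()
  defer c.mu.Unlock()
  if c.n < limit {
    c.n++
  }
}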

Move to an even more distributed system, say a bunch of computers connected over a mostly reliable network, and it gets even harder. For the times you need distributed, synchronized state there are things like Paxos, Raft, and Zookeeper. But those just provide you with a distributed/available source of truth. Using it is still hard, and can have scaling issues.

One important thing that can help here is knowing how "correct" and "current" your view of the state needs to be. Sure, you can ask the system for everything each time you need to know, but as the size of the state grows and the number of things that need the state increases, simple data transfer will become the choke point.

The traditional way around that is a cache. That's great for data with good persistence. And it matches the usual mental model of pulling data when you need it. In the right situation it works really well. That's all a CDN is really, and the internet lives on them.
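A minimal sketch of the idea: pull on demand, reuse until the entry ages out. A production cache would also bound its size and be safe for concurrent use:

package cache

import "time"

type entry struct {
  value   string
  fetched time.Time
}

// Cache pulls through to fetch on a miss and reuses the result until
// it's older than ttl. Minimal by design: no size bound, no locking
type Cache struct {
  ttl   time.Duration
  fetch func(key string) string
  data  map[string]entry
}

func New(ttl time.Duration, fetch func(string) string) *Cache {
  return &Cache{ttl: ttl, fetch: fetch, data: map[string]entry{}}
}

func (c *Cache) Get(key string) string {
  if e, ok := c.data[key]; ok && time.Since(e.fetched) < c.ttl {
    return e.value // still fresh: no trip back to the source
  }
  v := c.fetch(key)
  c.data[key] = entry{value: v, fetched: time.Now()}
  return v
}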

Another way is to turn the system around and switch from pull to push. Instead of having the user ask for state when needed, it maintains its own view and receives deltas from the source. If your rate of data change is small (relative to the size of the state) AND the changes don't happen at predictable intervals (time-to-live is random) then having the source of truth tell you about them might make sense. It's more complicated, requires changes on both ends of the system, makes startup different from runtime, and adds the requirement of reconciliation (making sure the local view does in fact match reality), but it can greatly reduce the amount of data flying around your system and make the data more timely.
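In code, the push side might look something like this sketch (the Delta shape and names are invented): start from a snapshot, then apply deltas as the source sends them.

package view

// Delta is one change pushed from the source of truth
type Delta struct {
  Key     string
  Value   string
  Deleted bool
}

// LocalView holds this consumer's copy of the state. Startup (the
// initial snapshot) and periodic reconciliation against the source
// are the extra complexity mentioned above
type LocalView struct {
  data map[string]string
}

func NewLocalView(snapshot map[string]string) *LocalView {
  v := &LocalView{data: map[string]string{}}
  for k, val := range snapshot {
    v.data[k] = val
  }
  return v
}

// Run applies pushed deltas until the channel closes
func (v *LocalView) Run(deltas <-chan Delta) {
  for d := range deltas {
    if d.Deleted {
      delete(v.data, d.Key)
      continue
    }
    v.data[d.Key] = d.Value
  }
}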

We've run into this a few times with our Kubernetes clusters. As a batch cluster the size (both in nodes and tasks) is pretty dynamic. Some things run per-node and try to keep track of what's going on locally (I'm looking at you, fluentd), and in steady state it's not too bad, but in rapid scale-ups all of those nodes, asking for lots of data at the same time, can put the system under water. It's partly the thundering herd, which we manage with splaying, but it's also the give-me-everything-every-time problem. That we're solving with watchers. So far it looks promising. There will be some other scale issue we'll hit in the future, but that got us past this one.
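Splaying itself is about as small as a fix gets. A sketch, with fetchFullState standing in for whatever expensive give-me-everything call your nodes make at startup:

package node

import (
  "math/rand"
  "time"
)

// fetchFullState stands in for the expensive startup call
func fetchFullState() map[string]string { return nil }

// startWithSplay waits a random slice of the splay window first, so a
// thousand nodes coming up together don't all hit the source at once
func startWithSplay(window time.Duration) map[string]string {
  time.Sleep(time.Duration(rand.Int63n(int64(window))))
  return fetchFullState()
}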

martinfowler.com/articles/patterns-of-distributed-systems/state-watch.html
martinfowler.com/articles/patterns-of-distributed-systems/leader-follower.html
en.wikipedia.org/wiki/Thundering_herd_problem

by Leon Rosenshein

Moving Day, Part 2

Today is the first day in the new house. Of course many things are the same. This channel hasn't changed (yet). Still in the same office away from the office. Still using the same hardware. Still working on the same projects. Goals are still mostly the same.

But, lots of things have changed. And again, in the spirit of a physical move, some things to keep in mind as we settle into our new home.

Know who to go to when you have questions. Know both the new and old IT and HR links. Know the slack channel(s) for realtime tech questions.

First, make sure you're in the right place. Get out your map and read over the directions before you get started. It's always a good idea to understand the bigger picture before you dive into the details.

When you've gotten the all-clear and you're ready, unpack your stuff. For some (Windows) it's trivial. For some (linux) it's easy, but make sure you don't miss a step. For others (MacOS) it's a complete reset. It's not trivial, and it's not instant. Don't expect it to be fast and you won't find yourself disappointed.

If you're on MacOS you'll have the opportunity to fix those little things that were annoying you. While you reinstall everything and reconfigure things take the opportunity to make adjustments you never quite got to. If nothing else you'll get rid of a lot of cruft and be on the latest version of things :)

Once you have things mostly settled, get the list of TODOs you put together and get started. Don't be surprised if you realize there is more setup and adjustment needed. That's a normal part of getting back to work after a move. It's going to take a while to get moving again, and that's OK.

Finally, remember to stop occasionally and take a deep breath. There's a lot going on, and change is stressful. Instead of denying it, acknowledge it and work with it. It will make it easier for everyone. Don't expect to get too much done today, or for the next couple of days. Work back up to it and you'll see things beginning to work smoothly again.

by Leon Rosenshein

Moving Day, Part 1

In case you haven’t been paying attention, today is our last day in our old house. Which means it’s part 1 of moving day, the packing. Of course this is a virtual move, since there's been an acquisition and we're working for a new company, not really changing jobs. Also, we’re all pretty much working from home already, and the commute isn’t going to change much.

But lots of things are going to change. And today is the last chance to get ready. So, in the spirit of a physical move, here’s some things to keep in mind as you prepare:

Know who to go to when you have questions. Know both the new and old IT and HR links. Know the slack channel(s) for realtime tech questions.

Pack your stuff. Yes, yes, yes, it’s all digital, so there’s no real packing, but there are changes coming. Certificates will be changing. Macs will be wiped. Identities will change. You don’t want to lose anything along the way, so back it up. Google Drive is a good choice for work artifacts and config files. So is Git. Make sure you push any branches and stashes.

Close out work in progress. The more things you can close out, the fewer things you’ll have to keep track of/pick back up after. Get your PRs out and queued. If you have PRs to review, get it done. Don’t start something that you know will get interrupted. And don’t forget to write down what you were in the middle of and what should be done next. That’s a good idea every day, and especially helpful now.

Make sure you have a map of where you’re going. Sure, you’re not actually moving physically, but there are enough things changing that having a map makes sense. Or at least a printed page of the instructions you’ll need to follow. Stored in a place you KNOW you’ll have access to in the middle of the move. I know someone who didn’t have their driver’s license with them for a move because they left it in the bathroom when the packers came. They were good packers and asked him if he had his wallet with him, and he said yes, but he didn’t. That night when he got to the temporary housing he realized he had left his wallet in the bathroom and it was now in a box on a truck heading for Seattle. Luckily his wife had credit cards and they all had passports with them, but it could have been a problem.

Finally, remember to stop occasionally and take a deep breath. There’s a lot going on, and change is stressful. Instead of denying it, acknowledge it and work with it. It will make it easier for everyone.

See you on the other side.

by Leon Rosenshein

Captain Crunch

This morning I ran across an article on crunch time in the game development industry. As many of you know, I’ve been doing the development thing for a while now. And much of the early years was spent on gaming. Mostly flight simulations, but also other simulations and general gaming. And I can tell you from first-hand experience and countless discussions with peers at other companies that crunch time is real. And yes, it’s not just the last 18 months of a 12-month project. There’s always pressure. It just reaches ridiculous levels as the end-game approaches.

There are lots of reasons for it. The two biggest, though, don’t apply any more. Back in the day games were sold in boxes, in retail stores. And what you sold was what people played. For years. No downloadable content. No user-generated content. No patches or bug fixes. Just what was in the box. So you had to get enough in the box to make people happy. So the pressure to “make it work” at the end was high.

Another sign of the times was that about 50% of sales happened in the 4 weeks between Thanksgiving and Christmas. And most of those sales were of the hot new games. Which meant that your box had to be on the shelves by Black Friday. Getting on the shelves before Christmas was so important that when we shipped Falcon 4 in England we put a person on a plane from San Francisco to NY to London (on the Concorde) to pick up a few hours of manufacturing time. If you missed the season your game was sunk. By next year it was dated and no one cared. So slipping the schedule by 3 weeks wasn’t an option from a business perspective.

Things are different now. Patches are expected. DLC is expected. Back then it was 4 or 5 developers and 1 artist. Now it’s 50+ artists and world builders and 10 developers using a game engine. Multiplayer has gone from turn based play by mail to split screen to MMORPG. Development and ad budgets have gone up by orders of magnitude.

But apparently some things haven’t changed. Crunch time is still around. Talk of unionization is still around. Burnout is still around. And lots of fresh new faces who want to get into gaming show up every year. Which, in my opinion, is the reason nothing changes. As long as there’s a steady stream of developers showing up to make the games there’s not a lot of incentive for things to change. And that makes me sad.

Because we know better. The constraints have changed. Boxes on shelves are no longer the driving force. Dates are still important, but much less so. And we know a lot more about the development process. We have better ways to get from idea to product that take into account the knowns, the known unknowns, and the unknown unknowns.

I’m not in the game industry any more. In large part because I wanted to get away from crunch time. Things are much better now. Work/Life balance is better. And I think the work is better too. Sure, for any given week I can get more things done by crunching, but over the longer development period a slower, steadier pace gives better results.


Bringing this back to the article I was talking about, I believe it to be an accurate depiction of the current state of affairs. Because I’m pretty sure I was living it before the author was born. I think there’s a better way. And I’ve been talking about it here for almost 2 years.