by Leon Rosenshein

How Old Is Old

Not only is what's old new again, what was once new gets old. And it seems like it gets old faster and faster. When I got my first paying job writing software 30+ years ago I was using FORTRAN77, and it was a 10 year old standard that was in the process of being updated. When I left that job Fortran90 was the new hotness, and we were just starting the conversion process. We were just starting to play around with ANSI C, and C99 was just a gleam in someone's eye. We used some commercially available FORTRAN libraries for complex math and statistics, and they got occasional updates, but big changes were rare.

Fast-forward a few years to Microsoft. Box products with a 2(ish) year update cycle, and updates were evolutionary. As I've mentioned, Flight Simulator was backward compatible to the beginning of time, as were Windows and Office. Yes, the UI would change and people would get upset, but there weren't a lot of fundamental changes, and no-one needed to throw out all the work they'd done and translate to a new system.

Things are a little different now. Frameworks for all sorts of things keep popping up. UI, Deep Learning, Parallelization, Datacenter Management (on prem and cloud). 5 years ago when I came to Uber Mesos/Aurora was the new hotness, and was going to replace a bespoke in-house service placement and management system. Here we are 5 years later and CoreBiz is hoping to finish that transition to Mesos/Peloton so they can switch to Kubernetes.

Not my usual area, but sometimes I get pulled into working on UI stuff. Angular, React, React-Native, Node, Deno. Which one should you use? If you're using one, should you switch to a different one? Maybe. Or maybe you should stick with whatever you're using.

Just because something is "old", whatever that means, doesn't mean you should stop using it. Working is a feature. A really important one. The thing is, it's easy to say "Time to change", but making the change is hard, takes time, and blocks other changes.

That doesn't mean you should never change, just that you need to count the cost before you do. And you need to measure not just the cost of making the change now, but also the cost of making the change later or never. And what the benefits of making the change now or later are. Because very often the short term cost pushes you one way, but the long term cost pushes you the other. And knowing which is more important can play a big part in your decision.

by Leon Rosenshein

Zip It

Tales from the trenches.

Geometry is hard. Let's say you needed to take pictures of every street in the US, or at least every address. Because street-level imagery is fairly temporal you wanted to cover areas quickly, so you had "squads" of vehicles in an area. And for technical reasons you needed to keep the vehicles from operating in the same area. How would you break the country into workable areas? That was our challenge gathering street-level imagery for Bing Maps.

The answer we came up with was Zip Codes. Seemed like a reasonable idea. If a small number of mail-people are supposed to be able to reach all of the addresses in a ZipCode every day then a single vehicle should be able to cover the streets in a day or two. Every address is in one (and only one) ZipCode, so that will keep the vehicles apart, which is also good.

So we went to implement it. And automate it. There were a bunch of startup problems, like the fact that the Post Office doesn't provide ZipCode data, but we figured out ways around them. It mostly worked, but oh the edge cases.

It turns out that ZipCodes aren't areas. They're sets of points, or more correctly, every address is associated with a ZipCode. But that's not too bad. ArcGIS can take sets of points and turn them into non-overlapping polygons. Except… While every address is in one ZipCode, the set of polygons covering all the addresses in a ZipCode is not contiguous. Unless you're ArcGIS, which makes them contiguous. By creating tiny tendrils that connect what would be islands. Except where it can't. So you have islands there too. So we did the usual solution. We automated everything, including validation that the automation worked. Then we took the parts that failed validation and sent them to Marcus (thanks @eisen) to fix. Because Marcus is a wiz at ArcGIS and can make sense out of anything.

And if you're wondering why we needed to keep the vehicles far apart, the problem was that the original firmware in the SICK LIDAR units we used (along with cameras, INS, and GPS systems) was written in a way that assumed two units would never be pointed at each other. Because if they were they would quickly burn out each other's sensors, rendering them useless. We eventually got SICK to fix that problem, but that's a whole different story.

by Leon Rosenshein

The First Year

It's been just over a year since I started writing these notes. It started out as just sharing some links with my team, then with all of the infra org, and finally with all of ATG. Just over 230 notes. Some more humorous than others, but hopefully all interesting and informative.

First off, what a year it's been. When I started sending those links our team was the infra team for the Maps organization in "Prime". Since then we've moved to the Infra Org in ATG/Core Platforms, switched from Java/Scala and Mesos/Peloton to Go and Kubernetes, shut down a DC (WBU2), learned AWS, turned up a DC (DCA19), moved from rAV to rNA, are still dealing with a pandemic and have been WfH for almost ⅓ of the time (100 days). Even for Uber that's a busy year :)

Second, we don't know what the next year will hold, but we have some ideas. It will have at least returning to the office, doubling the size of DCA19, and rNA on the road at scale. Who knows what other challenges the world will throw at us.

Third, a request. While I've got opinions on just about everything and am not too shy to share them, if there's something you're interested in or curious about let me know. Send me a DM or add it to a thread. I read all of them and I'm happy to take requests.

Finally, I want to say thank you. I've learned a bunch along the way. Partly from needing to be sure I understood something well enough to explain it simply, and mostly from your comments. The links and experiences you shared have taught me new things, changed my understanding of some things, and overall helped me get better at my job.

Here's to another year of working and learning together.

by Leon Rosenshein

Inheriting Code

One of the things we've all had to do is learn how to work with an established system. Whether it's your first internship, switching teams, going to a new company, or just helping someone use something you've built, you have to orient yourself. So what do you do?

The first thing to do is ask questions. If you're lucky, there are folks around who can give you an overview, tell you where the problems are, and give you some history on why things are the way they are. If so, take advantage of that opportunity. 

Think of it like moving to a new city. You've got a specific destination and someone told you to check out a cool place while you're there, but that's it. The first thing you need is to get the utilities turned on, then you need a broad map to get you into the right areas. Then you start exploring and adding detail to the map. At first the details are only in a couple of places, maybe close to home and your job. The rest of the map has some general labels like industrial, residential, and downtown. Over time though, partly out of need, and partly out of curiosity, you start to fill in those blank spots. Eventually you end up with a pretty good internal map of the city, with a lot of detail for areas you spend most of your time in. And then, when you're the "expert" you share your map with friends and family.

Codebases are the same thing. Around here getting the utilities set up is called on-boarding. Getting your tools set up and getting access to what you'll need. Depending on how far you're coming from (new intern vs. another project on your current team) you'll need to do more or less on-boarding. You'll need a goal. A bug fix or a small feature. Maybe you need to add some validation to some input. You find (or build) a rough architecture diagram so you can figure out where the input is read. You look around the neighborhood and see how folks have done similar things in the past. You follow the data around in the debugger and start to understand some of the other blocks in that diagram.

When I joined the Combat Flight Sim team my first job was to figure out why the AI would sometimes just dive for the deck in the middle of a dogfight. Before I could fix it I needed to build my map of how things worked. There was already a tool that we could use to monitor the state of the simulation in real-time, but it was inherited from the main FlightSim team and focused on the flight models and graphics engine. So my first task turned into making better tools to understand the system. I started out simple by plumbing the AI's current target into the monitor, which taught me about how the AI stored its state, how data flowed through the system and how the display system worked. With that rough map I was able to dive deeper into the AI itself, figure out what state was important, and where it came from. Once I had that understanding I could look into the problem I was there to fix.

Which brings us to the last part of learning a new system, sharing what you learned. Very often when learning a new system you'll come up with new tools or enhancements to existing ones. Building those tools is a really good way to understand those systems, and sharing them helps you and the team along the way.

For those that care, it turned out that the AI was paying attention to energy management, like all good fighter pilots do. The problem was that instead of comparing its energy to the highest threat or some weighted average of multiple threats, it was comparing to the total energy of those threats. Of course, in that case it came up comparatively very low on energy and would trade altitude for airspeed in an attempt to even the score. I switched the comparison to a weighted average and solved the problem.
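To make the bug concrete, here is a minimal sketch in Python. It is not the original code: the `Aircraft` class, the energy formula (energy height, altitude plus speed squared over 2g), and the weights are all assumptions chosen for illustration.

```python
from dataclasses import dataclass

G = 9.81  # gravitational acceleration, m/s^2


@dataclass
class Aircraft:
    altitude_m: float
    speed_mps: float

    def specific_energy(self) -> float:
        # Energy height: potential plus kinetic energy per unit weight, in metres.
        return self.altitude_m + self.speed_mps ** 2 / (2 * G)


def threat_energy_total(threats):
    # The buggy comparison: the sum grows with every extra threat.
    return sum(t.specific_energy() for t in threats)


def threat_energy_weighted(threats, weights):
    # The fix: a weighted average stays on the same scale as a single aircraft.
    weighted = sum(w * t.specific_energy() for t, w in zip(threats, weights))
    return weighted / sum(weights)


me = Aircraft(altitude_m=3000, speed_mps=150)
threats = [Aircraft(2500, 140), Aircraft(2800, 130), Aircraft(2600, 145)]
weights = [3.0, 1.0, 2.0]  # hypothetical: nearer threats count for more

print(me.specific_energy() > threat_energy_total(threats))              # False
print(me.specific_energy() > threat_energy_weighted(threats, weights))  # True
```

Against the total, the AI always looks badly out-energized as soon as there is more than one threat, which is exactly when it would dive. Against the weighted average, the same aircraft correctly sees that it has the edge.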

by Leon Rosenshein

How Severe Is Severe

Last month I talked about how a set of priorities is nice, but whenever the size of a set of priorities is less than the number of items being prioritized you have a problem and an ordered list of items is needed. Priority is just a way to order the operations. So how do you come up with the ordered list? That's where severity comes in.

Severity is a measure of the impact of a bug or work item. It lets you know how important an item is, and should be measured in terms of user impact. What does this item, or the user level item it rolls up into, make possible? For simplicity let's talk about severity in terms of bugs or defects. Using the `infra` bug template as a starting point, severity is how big the impact is. What can't the user do because of the bug?

  • Sev 1: Causes a system crash or data loss.
  • Sev 2: Causes major functionality loss or other severe problems; product crashes in obscure cases.
  • Sev 3: Causes minor functionality problems, may affect "fit and finish".
  • Sev 4: Contains typos, unclear wording or error messages in low visibility fields.


And that's really all there is to severity. It gets interesting because severity alone isn't enough to build an ordered list. If it were, then why would you need more fields? You need it because there's more to the prioritized list than just severity. The other two big things that go into prioritization are cost and number of users.

It's the combination of severity, cost, and impacted user count that lets you order the list. Severity and number of impacted users define the business impact of fixing an issue. And given a fixed amount of "points" to spend on the cost, you can maximize your *business impact*. That's how you decide if you should fix 2 or 3 big ticket items that a small percentage of your users will run into or fix all of those little niggling details that annoy every user multiple times a day. It is almost certainly more important to fix the bug in the method that occasionally overlaps street names on a map where 5 or more roads merge than to get the "OK" and "CANCEL" buttons to line up and be centered, but what if you could fix 100 layout issues that everyone sees every day, eliminate a click or two from a workflow, and fix all of the spelling errors for the same cost? Is that more important than making sure the road labels on the Swindon Magic Roundabout are correct? Probably.
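One way to turn that into an ordered list is sketched below in Python. Everything specific here is an assumption for illustration: the severity weights, the backlog items and their numbers, and the choice of a simple greedy "impact per point" ranking (which is easy to reason about but is not guaranteed to find the best possible combination for a budget).

```python
from dataclasses import dataclass

# Hypothetical weights: a lower Sev number means more impact per affected user.
SEVERITY_WEIGHT = {1: 100, 2: 25, 3: 5, 4: 1}


@dataclass
class Item:
    name: str
    severity: int        # 1 (worst) to 4
    users_affected: int
    cost_points: int

    def business_impact(self) -> int:
        return SEVERITY_WEIGHT[self.severity] * self.users_affected


def prioritize(items, budget):
    """Greedy pick: highest impact per point first, until the budget runs out."""
    ranked = sorted(items, key=lambda i: i.business_impact() / i.cost_points, reverse=True)
    chosen, remaining = [], budget
    for item in ranked:
        if item.cost_points <= remaining:
            chosen.append(item)
            remaining -= item.cost_points
    return chosen


backlog = [
    Item("overlapping street labels at 5-way merges", 2, 500, 8),
    Item("misaligned OK/CANCEL buttons", 3, 100_000, 1),
    Item("spelling errors in dialogs", 4, 100_000, 2),
    Item("extra click in a common workflow", 3, 80_000, 3),
]

for item in prioritize(backlog, budget=8):
    print(item.name)
```

With these made-up numbers the three small, widely seen fixes win the 8-point budget and the higher-severity label bug waits. Change the weights or the user counts and the order changes, which is the point: severity alone doesn't decide it.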

And the same logic applies to new features as well. Add one big feature for a handful of power users, or something 98% of your users want? It depends, and to make the decision you need the impact of a feature, how many users want the feature, and the cost.

And that's why severity matters.

by Leon Rosenshein

What's Old Is New Again

Did you know that using GPUs to make processing faster is old technology? Back in the late 80s/early 90s I was working with the flight simulators at Wright-Patterson AFB. They ran on Gould SELs, and right next to the main CPU were a couple of other cabinets, about the same size. They were the co-processors, the I(nteger)PU and the F(loating point)PU. And together they could do math orders of magnitude faster than the CPU. And just like with today's GPUs, feeding them data and getting the results back properly was key to getting the most performance from them.

Similarly, the QUIC standard is looking into header compression to improve network performance. 30 something years ago, using a 286 CPU reading data from a local hard drive, we found that to get maximum performance we needed to compress the data on disk and then decompress after reading. For text files the combination of lower disk IO and more processing gave us ~25% improvement in response time.
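The same trade (less IO, more CPU) is easy to try today. The sketch below is only an illustration in Python using `gzip` as a stand-in; it is not what we ran on the 286, the sample text is made up, and whether the compressed read is actually faster depends entirely on your disk and CPU.

```python
import gzip
import os
import tempfile

# Highly repetitive sample text, so it compresses well.
text = ("The quick brown fox jumps over the lazy dog.\n" * 20000).encode("utf-8")

with tempfile.TemporaryDirectory() as tmp:
    raw_path = os.path.join(tmp, "data.txt")
    gz_path = os.path.join(tmp, "data.txt.gz")

    # Write the same data twice: once as-is, once compressed.
    with open(raw_path, "wb") as f:
        f.write(text)
    with gzip.open(gz_path, "wb") as f:
        f.write(text)

    raw_size = os.path.getsize(raw_path)
    gz_size = os.path.getsize(gz_path)

    # Reading the compressed file moves fewer bytes off disk,
    # then spends CPU time decompressing them.
    with gzip.open(gz_path, "rb") as f:
        round_trip = f.read()

print(f"raw: {raw_size} bytes, compressed: {gz_size} bytes")
print("round trip ok:", round_trip == text)
```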

The other place you'll see something that appears to be new, but isn't, is when scale gets involved. What's the difference between Kubernetes and a thread manager? Or an operating system managing tasks across physical and hyper-threaded CPUs inside a single computer? Conceptually, not much. Sure the implementation is different, and we've learned a lot about how to make a good API and framework, but at its heart, it's just "run this over there" and "Hey, it stopped. Put it over there instead".

So the next time someone talks to you about the new hotness, think about its past, what's changed, and what we can learn from people solving that problem before.

by Leon Rosenshein

Backing Up

I ran across an article on how to not ding your rented semi-trailer when backing up. Don't ask why. That's not important to this conversation. What is important is how much of the article actually applies to development.

The first and most important rule for backing up safely is "Don't back up". For software development the corollary is some combination of *Y*ou *A*in't *G*onna *N*eed *I*t, using the tools you already have, and putting in guardrails earlier so you never end up in a situation where you need to write that piece of code.

Practice - The more code you write the more situations you'll learn to understand and the more tools you'll have in your toolbox.

Watch others - See how other people solve similar problems and what issues (if any) those solutions caused. Learning from your mistakes is good. Learning from others' mistakes is even better.

Look at the situation - What are the requirements? What are the constraints? What are the problems you might encounter? What kind of edge cases? The best time to handle all of those is _before_ you write the first line of code, not after things fail.

Re-examine the situation - After you've been working for a while take time to look around again. Things may have changed, and you certainly have a better understanding of what you need to do, so maybe you'll see something you missed the first time you looked.

Tell others what you're about to do - They might be about to do the same thing, or maybe the opposite for some reason you haven't thought of. They might have good ideas or gotchas from earlier work.

Ask an expert - If you're about to start working on a Neural Net to train a model to identify geese crossing the road don't ask me for help. That's not my area, but we have lots of folks who _are_ experts in that area. Ask them. On the other hand, if you find that your training is running too slowly and you want to take advantage of distributed processing in our GPU cluster(s) then ask me. I know lots about distributed/parallel processing.

Go slowly / Start again - Take your time. Test as you go. Validate your intermediates. Git is your friend. A branch keeps track of the work you've done and lets you roll back to a known good point and try again.

Say no - Or at least say you don't know how/don't feel comfortable given the constraints. It's much better to say that at the beginning rather than reach the finish date with nothing to show. You'd be surprised how often that's enough to get help or change expectations.

Good tools working correctly - For us that means bazel, docker, git, your IDE, compilers, linkers, etc. The job is hard enough when your tools are working. Don't make things harder by using a broken tool or the wrong tool for the job.

So there you go. How to safely back up a trailer to a loading dock. Or build a robot. Either one.

by Leon Rosenshein

Getting Started

It's summer, and welcome to our new interns. You're bright-eyed and eager to learn. You've got an amazing opportunity to learn from some of the sharpest minds in the business. So what should you learn?

Here are my top 6 things, in no particular order.

Sharing/Asking/Working on a team and email/Slack/ZenHub/Jira/Zoom: This is a big one. In school most of what you do will be on your own, and even when you're working on a team the projects are (generally) small enough that you can hold all of the details of the entire project in your head at once. Here, that's just not true. In school projects you'll spend lots of time with your teammates and the folks you're depending on but now, even skipping the WfH, you probably depend on folks in other states/timezones, and communication is much more asynchronous. Learning how to work with people you'll never meet or talk to, who are not nearby, and who have different priorities is a skill. The different communication methods have different bandwidths and transport delays. Knowing when to use each is a topic unto itself.

Taking notes and gDocs/Confluence/Readme.md: Similar to interpersonal communication, you need to be able to capture and share information with others and with future you. Again, different methods have their pros and cons. Where does your audience normally look? How hard is it to change the docs relative to how hard it is to change the thing being documented? If it's significantly harder to update the docs than the thing itself then your docs will be out-of-date very quickly.

The command line and Bash: Wysiwyg is great, and some things are best consumed visually (RobotStudio anyone?), but let's be honest with ourselves. Most of what we do is keyboard centric, not mouse centric. The amount of data we deal with is staggering. `ls`, `find`, `grep`, `awk`, `sed`, `xargs`, `jq`, and their ilk, along with `bash` for flow control, will make your life a lot easier. Knowing when to use the simple tools and chain them together instead of building something bespoke will save lots of time.

Version control and git: There's too much information in the codebase to keep in your head, and we all make mistakes, so having a way to get back to what works will save you a lot of time. We've all heard save early, save often, and version control lets you save steps along the way and get back to them as needed.

Building and deploying and bazel/docker: Repeatability. Along with version control you need to be sure that you do things the same way every time, or you'll quickly get lost. Bazel lets you (almost forces you to) define what you need up front so it can keep things up to date for you. Docker lets you have the exact same environment every time, which can avoid the dreaded "Well it worked on my machine".

Your day job: Whatever project you're working on, you're working with folks who really know it. Not just what works, but what doesn't and why. Take advantage of all that knowledge.

Because when you get down to it, the more you know, the further you'll go.

by Leon Rosenshein

Know Your Tools

Learning how to use a tool is usually pretty easy. Learning when not to use a tool is harder. Especially when your "tool" is a Turing complete language. If a program can be written, it can be written in any Turing complete language, but that doesn't mean it's a good idea. In theory you can use sed to implement your workflow, but that's a very bad idea. Conversely, if you need to loop over a list of numbers and add them up you could write a bespoke program in Rust to read a file, add them up, and spit out the answer, but again, probably not the right choice.

In those cases it's relatively easy to see, but not always. Should you write your new tool in python, Go, or C++? Well, it depends. It depends on what you're familiar with, what your users are familiar with, the ecosystem the tool will live in, any additional tools/libraries you need, and, of course, the problem you're trying to solve.

And like most things software, these choices occur at all scales. Once you pick a language you have patterns and styles. Idiomatic forms differ across languages. What is your class structure/hierarchy?

At the other end of the scale is your architectural style. From layered monoliths to space-based event driven architecture, you have lots of choices.

The goal is to use the right tool for the job. We've all done, seen, and had to deal with the "I've got a hammer, so the problem must be a nail" syndrome, and its close relative, the "I learned about a new technique so I'm going to use it everywhere" approach.

For me it was singletons. When I first learned about singletons they were the answer to all of my questions and problems. It was great. I had a place to store global state and I could access it anywhere. Then I started to use more of them. And they got more complicated. When I saw I had written a singleton that contained a couple of singletons and had internal logic I realized I had gone too far.
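For anyone who hasn't lived it, here is a minimal sketch in Python of what that looks like. The class names (`Config`, `Services`) and the `retries` setting are hypothetical, not the code I actually wrote.

```python
class Singleton:
    """Base class: each subclass gets exactly one shared instance."""
    _instances = {}

    def __new__(cls):
        if cls not in cls._instances:
            cls._instances[cls] = super().__new__(cls)
        return cls._instances[cls]


class Config(Singleton):
    def __init__(self):
        # __init__ runs on every Config() call, so only set defaults once.
        if not hasattr(self, "values"):
            self.values = {"retries": 3}


class Services(Singleton):
    """A singleton holding another singleton, plus its own logic: the warning sign."""

    def __init__(self):
        self.config = Config()

    def retries(self):
        return self.config.values["retries"]


# Any code, anywhere, can change what every other caller sees.
Config().values["retries"] = 0
print(Services().retries())  # 0


# The alternative: pass state in explicitly, so the dependency is visible.
def retries_from(config_values):
    return config_values["retries"]


print(retries_from({"retries": 3}))  # 3
```

The singleton version works, which is what makes it tempting. The cost shows up later, when you can't tell who changed the global state or test one piece without dragging in all the others.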

So before you pick up that hammer and nail in a screw, think about more than just if it will work. Make sure you're not using the wrong tool for the job.

by Leon Rosenshein

Why Are An Elephant's Ears So Big?

RTAPI or 3000+ services? The architect says "It depends". It depends on a lot of things. And a lot of them come down to why elephants have big ears. The cube/square law. Generally speaking, surface area goes up with the square of size, while volume goes up with its cube. Consider the sphere. Surface area is 4πr^2, while volume is 4/3 πr^3. Comparing the titular elephant to a mouse, the elephant has a lot more body (volume) generating heat per square inch of skin (surface area) so to radiate that extra heat it has big ears to dramatically increase the surface area.
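A quick worked example in Python, using the standard sphere formulas (surface area 4πr^2, volume 4/3 πr^3). The radii are made-up numbers for "spherical" animals, chosen only to show the trend.

```python
import math


def sphere_surface_area(r):
    return 4 * math.pi * r ** 2


def sphere_volume(r):
    return 4 / 3 * math.pi * r ** 3


def surface_to_volume(r):
    # Simplifies to 3 / r: the bigger the sphere, the less skin per unit of body.
    return sphere_surface_area(r) / sphere_volume(r)


mouse_r, elephant_r = 0.03, 1.5  # hypothetical radii in metres

print(f"mouse:    {surface_to_volume(mouse_r):.1f} m^2 of skin per m^3 of body")
print(f"elephant: {surface_to_volume(elephant_r):.1f} m^2 of skin per m^3 of body")
```

With these numbers the mouse has about 100 m^2 of skin per m^3 of body and the elephant about 2, a factor of 50, which is the gap the ears are there to help close.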

But what does that have to do with software development? Think of it this way. The surface area of your unit (function, class, package, library, executable, etc) is the collection of interfaces it presents to the outside world. The more interfaces you have, functions, arguments, config files, data files, etc, the more volume (code) you need to support it. More code means you have more opportunities for bugs. The chances for coupling and side effects are higher. You have higher cognitive load because the mental model you have to keep is that much bigger.

On the other hand, the surface area of your unit also includes how it's shared/deployed and tested. There's a certain amount of overhead to deploy or test anything, regardless of its size. Deploying/testing 1 thing to get the same functionality of 4 individual things takes the same amount of functional work, but ¼ the overhead. So you get economies of scale there. That mouse might be able to move a small piece of cheese faster than an elephant, but the elephant can move 10 wheels of cheese faster.
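The arithmetic, sketched in Python with made-up numbers (minutes of functional work and a fixed per-deployment overhead are both assumptions):

```python
def total_cost(functional_work, units, overhead_per_unit):
    # Functional work is the same however it is packaged;
    # the fixed overhead is paid once per deployed/tested unit.
    return functional_work + units * overhead_per_unit


work, overhead = 40, 5  # hypothetical, in minutes

separate = total_cost(work, units=4, overhead_per_unit=overhead)
combined = total_cost(work, units=1, overhead_per_unit=overhead)

print(separate, combined)  # 60 45
```

Same functional work either way, but the four small units pay the overhead four times. Whether that saving is worth the bigger surface area of the combined unit is the "it depends" part.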

So what's better, a monolith or 3000+ microservices? It depends.