
by Leon Rosenshein

Chaos

Distributed systems are hard. And they're hard in lots of ways. One of them is emergent behaviour, or the idea that simple rule changes can have big impacts. Sometimes good, sometimes bad, but often surprising.

Consider the simple Nginx load balancer. Let's say you've got 5 stateless backends behind a single address. By default Nginx does round-robin load balancing, passing the same number of requests to each backend. Great. You'll end up with consistent, even load. Or not. You only get consistent, even load when all of the requests are consistent. What happens if 20% of your load takes 3x the time to handle? Well, if that ⅕ of your requests arrives in a regular pattern and you have 5 servers, round-robin means one server is going to get most of the slow ones, become overloaded, and fall over. Now you only have 4 servers, which may not be enough to handle the load, so they start failing, and suddenly you have users at the gate with pitchforks and torches.
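To make the failure mode concrete, here's a small simulation (the numbers and the every-fifth-request pattern are invented for illustration, not taken from real traffic) showing how round-robin can concentrate the expensive requests on one backend:

```python
# Simulate 5 backends behind round-robin when every 5th request is slow.
# All numbers here are illustrative, not from a real system.

NUM_BACKENDS = 5
NUM_REQUESTS = 1000
NORMAL_COST = 1   # time units to handle a normal request
SLOW_COST = 3     # 20% of requests take 3x as long

work = [0] * NUM_BACKENDS
for i in range(NUM_REQUESTS):
    backend = i % NUM_BACKENDS                      # round-robin assignment
    cost = SLOW_COST if i % 5 == 0 else NORMAL_COST # slow requests arrive in lockstep
    work[backend] += cost

print(work)  # [600, 200, 200, 200, 200]
```

Because the slow requests arrive in lockstep with the rotation, backend 0 ends up doing 3x the work of its peers while the request counts stay perfectly "balanced."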

But you can't think of everything. So what can you do? That's where chaos engineering comes in. Chaos engineering is the idea that instead of waiting until oh-dark-30 for something weird to happen, you build systems that cause those weird things to happen while you're awake and watching. Then you notice them, figure out how to mitigate and prevent them, and stop worrying about them happening in the middle of the night.

There are lots of things that your chaos injector can do. It can add latency, remove instances, fuzz data, increase load, eat memory or disk, fail a sensor or otherwise muck with inputs and capacity. And it might do them one at a time, or it might do them in combination, because if you run low on memory you can just swap to disk, unless the disk is also full. Then what? It falls over.
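As a sketch of what the simplest kind of injector might look like (the function names, probabilities, and delays here are all made up for illustration), consider a wrapper that randomly adds latency to, or outright fails, a call:

```python
import random
import time

def chaos(fn, latency_prob=0.1, fail_prob=0.05, max_delay=2.0, rng=random):
    """Wrap fn so some calls get extra latency and some fail outright.

    Illustrative sketch only, not a production fault injector.
    """
    def wrapped(*args, **kwargs):
        if rng.random() < fail_prob:
            raise RuntimeError("chaos: injected failure")
        if rng.random() < latency_prob:
            time.sleep(rng.uniform(0, max_delay))  # injected latency
        return fn(*args, **kwargs)
    return wrapped

# Usage: wrap a backend call, then watch how the rest of the system copes.
fetch = chaos(lambda key: f"value-for-{key}", latency_prob=0.0, fail_prob=0.0)
print(fetch("user42"))  # value-for-user42 (nothing injected at these settings)
```

Real injectors do the same thing at the network, host, or orchestrator level, and the interesting findings usually come from combinations rather than single faults.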

For us it's software, but really it's just another application of systems failure analysis. In aerospace we had the "iron bird": hook up as many real parts as you can, simulate or work around the rest, and add an external environment. Then break something and see what happens. More advanced rigs break things with switches and valves, but I've talked to people who have simulated hydraulic system failures by taking an axe to a hydraulic line. Kind of messy, but very realistic.

About 10 years ago Netflix popularized the concept in the software world. Things have come a long way since the original chaos monkey turned off a server just to see what would happen. So think about what adding that kind of testing to your systems might show you.

And in case you were wondering, the solution to the Nginx problem we came up with was to change the distribution policy from round-robin to least-connections. It's a little bit harder for Nginx to keep track of, but it does a better job of balancing the time spent handling requests instead of balancing the count of requests handled. In our case that made a big difference. YMMV
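The change itself is small. Assuming a standard upstream block (the backend names here are placeholders), it's one directive:

```nginx
upstream backends {
    least_conn;   # route each request to the server with the fewest active connections
    server backend1.example.com;
    server backend2.example.com;
    server backend3.example.com;
    server backend4.example.com;
    server backend5.example.com;
}
```

Because a slow request holds its connection open longer, connection count is a decent proxy for time spent, which is what round-robin ignores.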

by Leon Rosenshein

Capabilities

Back in the days of PCs and "PC Clones", Flight Simulator was the de facto standard for compatibility. If your clone ran FlightSim it was golden and everything would work. The FS team took compatibility seriously, and we worked really hard to run on as much hardware as possible, both from a sales perspective (the bigger the addressable market, the more sales) and from a brand-marketing standpoint.

Being compatible with lots of different hardware got much easier when DirectX rolled out. First came DirectDraw for video cards, then DirectSound for audio and DirectInput for input devices. It became Windows' problem, and as long as the Windows hardware folks did their job it was much easier for us game developers. Now, just because all video cards responded to the same API didn't mean they were all the same. They had different amounts of memory, different processors, and, generally speaking, different capabilities. And of course as developers we were supposed to take advantage of all of them AND give the user the ability to turn things on/off if they wanted to trade visual realism for frame rate.

DirectX had a feature that helped us out: capability bits. Basically you could query the driver and it would return a list of the things it could do and the features it supported. Great idea, right? Check what the card can do, then only do those things. Simple.

Or not. Every bit of information in that structure was true. The OEM support team made sure of that. But they didn't validate all the possible combinations. On some cards you could have stencil buffers and you could have depth buffers, but you couldn't have both. Maximum texture memory was accurately reported, but only achievable if you didn't use a depth buffer. Double buffering rarely caused any other degradations, but if you wanted to triple buffer then all bets were off.

So what did we do? We built our own compatibility lab and started to document the interactions between the capability bits. And then we used that matrix to define recommended and possible settings for different graphics cards.
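The compatibility matrix boiled down to a lookup: every bit is true in isolation, but some combinations are known to fail together. A minimal sketch of that idea (the card names, feature names, and bad combinations below are invented for illustration):

```python
# Each capability is true in isolation; the matrix records the combinations
# a given card claims but can't actually deliver together.
BAD_COMBOS = {
    "CardA": [
        {"stencil_buffer", "depth_buffer"},       # fine alone, broken together
        {"triple_buffer", "high_res_textures"},
    ],
}

def supported(card, requested):
    """Return True if the requested feature set avoids known-bad combos."""
    requested = set(requested)
    return not any(combo <= requested for combo in BAD_COMBOS.get(card, []))

print(supported("CardA", {"stencil_buffer"}))                  # True
print(supported("CardA", {"stencil_buffer", "depth_buffer"}))  # False
```

The hard part wasn't the lookup, it was populating the matrix, which is exactly what the compatibility lab was for.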

So in the words of the old Russian proverb, Trust, but verify. Just because you know something is true in isolation doesn't mean it's true in combination with other things.

by Leon Rosenshein

Vorlons vs. Shadows

It was the dawn of the third age of mankind, ten years after the Earth/Minbari war. The Babylon Project was a dream given form. Its goal: to prevent another war by creating a place where humans and aliens could work out their differences peacefully. It's a port of call, home away from home for diplomats, hustlers, entrepreneurs, and wanderers. Humans and aliens wrapped in two million, five hundred thousand tons of spinning metal, all alone in the night. It can be a dangerous place, but it's our last best hope for peace. This is the story of the last of the Babylon stations. The year is 2258. The name of the place is Babylon 5.

Over 25 years ago JMS gave us Babylon 5. One of the first, if not the first, TV series with a pre-plotted beginning, middle, and end. There were lots of low-budget special effects, extensive use of CGI, cheesy costumes and some really interesting haircuts. There was also some pretty deep introspection.

At its heart, B5 was about a group of races coming of age together and telling the previous generation to get out of the way. In the show the old generation was primarily represented by two races, the Vorlons and the Shadows. The Vorlons framed the world in terms of "Who are you?" A place for everything and everything in its place. The Shadows on the other hand framed things around "What do you want?" If you wanted something then just do it. Do whatever you want and let the chips fall where they may.

The thing is, you can't have a sustainable system with either one of those frameworks. You need both, working in concert, to create a dynamic, evolving system. But what does a 25+ year old space opera have to tell us about software development?

One place where you can see the interdependence of those frameworks is in security. You'll often hear security folks talking about AuthN and AuthZ, but what are they, and what's the difference? AuthN is authentication, or "Who are you?" AuthZ is authorization, or "What do you want (to do)?" And it's the interplay of those two things that gets challenging.

Consider the Hadoop ecosystem, HDFS in particular. The HDFS folks did a really good job of separating the two. A flag in a config file on the namenode enables authentication, and a separate flag enables authorization. HDFS authorization is modeled after traditional Unix file perms, and HDFS authentication is Kerberos based. What this means in practice is that almost every HDFS cluster you encounter has AuthZ, but no AuthN. And mostly you never know.

Because as long as everyone is honest and never lies about who they are things work fine. You can only read/write/modify the things your user has access to. It's easy for frameworks and middleware to act on behalf of the user with the authorization of the user. You get a lot of the benefits of security, or at least you think you do. Because with a little bit of research you can pretend to be anyone and do anything you want.

The opposite is AuthN without AuthZ. In that case, again, if your users are trustworthy and never make a mistake you're golden. Plus, you get a reliable audit trail. So when someone makes a mistake you know exactly who did it. You just have no way to prevent it, because everyone can do everything.

If you want real security you need both. AuthN and AuthZ. And I'll contend that AuthN is harder and more important. Because if you can't be sure who you're talking to, and the other side can't be sure who they're listening to, it doesn't matter how carefully you've checked to make sure the caller really can do what it wants.
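As a toy illustration of why you need both (the tokens, users, and permissions here are all made up), consider a request handler that authenticates first and authorizes second:

```python
# Hypothetical credential store and ACL -- illustration only.
TOKENS = {"secret-token-1": "alice"}                  # AuthN: token -> identity
ACL = {"alice": {"read"}, "bob": {"read", "write"}}   # AuthZ: identity -> allowed actions

def handle(token, action):
    user = TOKENS.get(token)         # AuthN: who are you?
    if user is None:
        return "401 Unauthorized"    # we don't know who you are
    if action not in ACL.get(user, set()):
        return "403 Forbidden"       # we know you, but you can't do that
    return f"200 OK: {user} did {action}"

print(handle("secret-token-1", "read"))   # 200 OK: alice did read
print(handle("secret-token-1", "write"))  # 403 Forbidden
print(handle("bogus-token", "read"))      # 401 Unauthorized
```

Drop the first check and anyone can claim to be alice; drop the second and every authenticated user can do everything. You need both gates, in that order.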

by Leon Rosenshein

POC vs MVP

So what's the difference between a POC and an MVP? Let's start with definitions. POC is proof of concept, while MVP is minimum viable product. A POC is something you show your boss or PM to demonstrate that something is possible, while an MVP is something the business team thinks should be put in front of customers.

A POC could be lots of things. It could be a technology demonstrator. It could be re-running a log with a new model, or it could be doing A/B testing with FLAGR. The important thing to remember is that the goal of a POC is to prove that something works, or at least has the potential to.

An MVP, on the other hand, is a product. It's fully supported. It has monitoring and alerting. It has a deployment process, a scale up/out plan, on-call support, and an SLO/SLA.

The trick is to keep them separate. There's always a push to turn a POC into an MVP. After all, you got something that works, how hard can it be to give it to customers? It turns out that it can be very hard, but there are things you can do to make it easier.

The first thing is to make your POC less. The less it looks like an MVP, the less pressure you'll get to release it. CLIs are great ways to drive a POC and make it clear that it's not a product.

The other thing to do is to think about what you'd really need to do to turn your POC into a product. You shouldn't do those things as part of the POC, but think about them and have an idea of how to do them. What kinds of metrics do you want to monitor and alert on? What will need to change for scale? What tests/gates would the CI/CD pipeline need?

Even if you don't get a lot of pressure to turn your POC into an MVP, having a plan will help you move faster and add more business value.

by Leon Rosenshein

Under Construction


A couple of months ago I talked about the builder pattern for object instantiation. Another option is the factory pattern. It certainly has its place. Factories let you isolate the logic and flow of construction from both your code and the thing being constructed. Factories can also give you a kind of dependency injection, which can make testing easier.

The ability to isolate and extend leads me to an article I ran across a couple of weeks ago. In the article the author talks about a Score class that initially can be Low, Medium, or High, and takes an int as the parameter to the constructor. As the author notes, there are oh so many problems with that. What's the high score? What if you try an invalid number? How do you know what number to use? So the author suggests a factory with three methods, CreateHighScore, CreateMediumScore, and CreateLowScore. That certainly solves the "What number should I pass?" problem, but that's about it.

The extensible option adds an enum to the factory and then uses that to create a Score with the right value. Better, but still, WAT? There's got to be a better way.

What I would have done is skipped the factory and switched the Score class from using an int to using an enum. Then there's never any question about what value to use. You just use the enum with the name you want. Of course, over time more and more enum values get added and pretty soon you end up with Low, Medium, High, MediumHigh, VeryHigh, ExtremelyHigh, Highest, and everyone's favorite, EvenHigher.

Of course, that brings up its own set of issues. What's higher, VeryHigh or ExtremelyHigh? How do you deal with that? By making Score comparable (or the equivalent in your language of choice) you can then find the highest of a set or sort by Score as needed.
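In Python, for instance, an IntEnum gives you both the named values and the comparability in one shot (the value names here just follow the article's Low/Medium/High example):

```python
from enum import IntEnum

class Score(IntEnum):
    LOW = 1
    MEDIUM = 2
    HIGH = 3

# No magic ints, no factory: the intent is right at the call site,
# and comparisons come for free because IntEnum orders by value.
print(max(Score.LOW, Score.HIGH))       # Score.HIGH
print(sorted([Score.HIGH, Score.LOW]))  # [<Score.LOW: 1>, <Score.HIGH: 3>]
```

Adding a new level is one line in one place, and every comparison and sort keeps working.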

Doing it that way brings intent to the foreground and reduces cognitive load, which, as I've said before, is always a good thing.

https://uberatg.slack.com/archives/CLVTB4W20/p1586442600018500

https://t.co/jaESC9MurT?amp=1

by Leon Rosenshein

One Step At A Time

You've probably heard of The Phoenix Project. Considered as a novel it ranges from boringly one-dimensional to almost OK. Considered as allegory, though, it's got a lot going for it. Archetypes, broad statements, and simply worded lessons.

One of the simplest of those lessons is Kaizen, or continuous improvement. The idea that the most efficient thing you can do is the thing that makes your daily work more efficient. That's kind of circular, and deserves to be unpacked. At its core Kaizen takes the long view. Do the right thing now to make things work better in the future. The challenge is how to balance the present and the future.

As an infrastructure team we could take all the lessons we've learned over years of interactions and do a green-field build of the perfect system for our needs. Toil away in the back room for a couple of years and then emerge with a fully functional system that scales to meet all current and expected needs. The big bang approach. You can probably guess what the result would be. By the time it was released the world would have changed enough that it didn't meet the actual needs, and we'd spend more time trying to get it right. At best we'd eventually get there. And over those 2+ years we would have delivered no value.

Or, kaizen. Make it better every day. Work towards the end goal of a system that does exactly what we need the way we want it. Take feedback along the way, adjusting the goal to meet customer needs. And add a little value every week or so. And it doesn't have to be a big thing. Even the smallest step along the way will help. Automate something. Make it easier to mark a task as done. 

Unlike technical debt, where compound interest hurts you, value added early compounds over time and ends up making a big difference. So take that small step when you see it.

by Leon Rosenshein

How Old Is Old

Not only is what's old new again, what was once new gets old. And it seems like it gets old faster and faster. When I got my first paying job writing software 30+ years ago I was using FORTRAN77, and it was a 10 year old standard that was in the process of being updated. When I left that job Fortran90 was the new hotness, and we were just starting the conversion process. We were just starting to play around with ANSI C, and C99 was just a gleam in someone's eye. There were some commercially available FORTRAN libraries for complex math and statistics that we used and got occasional updates, but big changes were rare.

Fast-forward a few years to Microsoft. Boxed products with a 2(ish)-year update cycle, and updates were evolutionary. As I've mentioned, Flight Simulator was backward compatible to the beginning of time, as were Windows and Office. Yes, the UI would change and people would get upset, but there weren't a lot of fundamental changes, and no one needed to throw out all the work they'd done and translate to a new system.

Things are a little different now. Frameworks for all sorts of things keep popping up. UI, Deep Learning, Parallelization, Datacenter Management (on prem and cloud). 5 years ago when I came to Uber Mesos/Aurora was the new hotness, and was going to replace a bespoke in-house service placement and management system. Here we are 5 years later and CoreBiz is hoping to finish that transition to Mesos/Peloton so they can switch to Kubernetes.

Not my usual area, but sometimes I get pulled into working on UI stuff. Angular, React, React-Native, Node, Deno. Which one should you use? If you're using one, should you switch to a different one? Maybe. Or maybe you should stick with whatever you're using.

Just because something is "old", whatever that means, doesn't mean you should stop using it. Working is a feature. A really important one. The thing is, it's easy to say "Time to change", but making the change is hard, takes time, and blocks other changes.

That doesn't mean you should never change, just that you need to count the cost before you do. And you need to measure not just the cost of making the change now, but also the cost of making the change later or never. And what the benefits of making the change now or later are. Because very often the short term cost pushes you one way, but the long term cost pushes you the other. And knowing which is more important can play a big part in your decision.

by Leon Rosenshein

Zip It

Tales from the trenches.

Geometry is hard. Let's say you needed to take pictures of every street in the US, or at least every address. Because street-level imagery goes stale fairly quickly, you wanted to cover each area fast, so you had "squads" of vehicles in an area. And for technical reasons you needed to keep the vehicles from operating in the same area. How would you break the country into workable areas? That was our challenge gathering street-level imagery for Bing Maps.

The answer we came up with was Zip Codes. Seemed like a reasonable idea. If a small number of mail-people are supposed to be able to reach all of the addresses in a ZipCode every day then a single vehicle should be able to cover the streets in a day or two. Every address is in one (and only one) ZipCode, so that will keep the vehicles apart, which is also good.

So we went to implement it. And automate it. There were a bunch of startup problems, like the fact that the Post Office doesn't provide ZipCode data, but we figured out ways around them. It mostly worked, but oh the edge cases.

It turns out that ZipCodes aren't areas. They're sets of points, or more correctly, every address is associated with a ZipCode. But that's not too bad. ArcGIS can take sets of points and turn them into non-overlapping polygons. Except… While every address is in one ZipCode, the polygons covering all the addresses aren't contiguous. Unless you're ArcGIS, which makes them contiguous by creating tiny tendrils that connect what would be islands. Except where it can't. So you have islands there too. So we did the usual thing. We automated everything, including validation that the automation worked, then took the parts that failed validation and sent them to Marcus (thanks @eisen) to fix. Because Marcus is a wiz at ArcGIS and can make sense out of anything.

And if you're wondering why we needed to keep the vehicles far apart, the problem was that the original firmware in the SICK LIDAR units we used (along with cameras, INS, and GPS systems) was written in a way that assumed two units would never be pointed at each other. Because if they were they would quickly burn out each other's sensors, rendering them useless. We eventually got SICK to fix that problem, but that's a whole different story

by Leon Rosenshein

The First Year

It's been just over a year since I started writing these notes. It started out as just sharing some links with my team, then with all of the infra org, and finally with all of ATG. Just over 230 notes. Some more humorous than others, but hopefully all interesting and informative.

First off, what a year it's been. When I started sending those links our team was the infra team for the Maps organization in "Prime". Since then we've moved to the Infra Org in ATG/Core Platforms, switched from Java/Scala and Mesos/Peloton to Go and Kubernetes, shut down a DC (WBU2), learned AWS, turned up a DC (DCA19), moved from rAV to rNA, are still dealing with a pandemic and have been WfH for almost ⅓ of the time (100 days). Even for Uber that's a busy year :)

Second, we don't know what the next year will hold, but we have some ideas. It will have at least returning to the office, doubling the size of DCA19, and rNA on the road at scale. Who knows what other challenges the world will throw at us.

Third, a request. While I've got opinions on just about everything and am not too shy to share them, if there's something you're interested in or curious about let me know. Send me a DM or add it to a thread. I read all of them and I'm happy to take requests.

Finally, I want to say thank you. I've learned a bunch along the way. Partly from needing to be sure I understood something well enough to explain it simply, and mostly from your comments. The links and experiences you shared have taught me new things, changed my understanding of some things, and overall helped me get better at my job.

Here's to another year of working and learning together.

by Leon Rosenshein

Inheriting Code

One of the things we've all had to do is learn how to work with an established system. Whether it's your first internship, switching teams, going to a new company, or just helping someone use something you've built, you have to orient yourself. So what do you do?

The first thing to do is ask questions. If you're lucky, there are folks around who can give you an overview, tell you where the problems are, and give you some history on why things are the way they are. If so, take advantage of that opportunity. 

Think of it like moving to a new city. You've got a specific destination and someone told you to check out a cool place while you're there, but that's it. The first thing you need is to get the utilities turned on, then you need a broad map to get you into the right areas. Then you start exploring and adding detail to the map. At first the details are only in a couple of places, maybe close to home and your job. The rest of the map has some general labels like industrial, residential, and downtown. Over time though, partly out of need, and partly out of curiosity, you start to fill in those blank spots. Eventually you end up with a pretty good internal map of the city, with a lot of detail for areas you spend most of your time in. And then, when you're the "expert" you share your map with friends and family.

Codebases are the same thing. Around here getting the utilities set up is called on-boarding. Getting your tools set up and getting access to what you'll need. Depending on how far you're coming from (new intern vs. another project on your current team) you'll need to do more or less on-boarding. You'll need a goal. A bug fix or a small feature. Maybe you need to add some validation to some input. You find (or build) a rough architecture diagram so you can figure out where the input is read. You look around the neighborhood and see how folks have done similar things in the past. You follow the data around in the debugger and start to understand some of the other blocks in that diagram.

When I joined the Combat Flight Sim team my first job was to figure out why the AI would sometimes just dive for the deck in the middle of a dogfight. Before I could fix it I needed to build my map of how things worked. There was already a tool that we could use to monitor the state of the simulation in real-time, but it was inherited from the main FlightSim team and focused on the flight models and graphics engine. So my first task turned into making better tools to understand the system. I started out simple by plumbing the AI's current target into the monitor, which taught me about how the AI stored its state, how data flowed through the system and how the display system worked. With that rough map I was able to dive deeper into the AI itself, figure out what state was important, and where it came from. Once I had that understanding I could look into the problem I was there to fix.

Which brings us to the last part of learning a new system, sharing what you learned. Very often when learning a new system you'll come up with new tools or enhancements to existing ones. Building those tools is a really good way to understand those systems, and sharing them helps you and the team along the way.

For those that care, it turned out that the AI was paying attention to energy management, like all good fighter pilots do. The problem was that instead of comparing its energy to the highest threat or some weighted average of multiple threats, it was comparing to the total energy of those threats. Of course, in that case it came up comparatively very low on energy and would trade altitude for airspeed in an attempt to even the score. I switched the comparison to a weighted average and solved the problem.
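A reconstruction of that bug in miniature (the energy numbers and equal weights below are invented; the real code was in the sim's AI, not Python):

```python
threat_energies = [900.0, 850.0, 800.0]  # three nearby threats
my_energy = 875.0

# Buggy comparison: summing makes any group of threats look overwhelming,
# so the AI always thinks it's low on energy and trades altitude for airspeed.
looks_outmatched = my_energy < sum(threat_energies)   # True against any 2+ threats

# Fixed comparison: weight each threat (equal weights shown) and compare
# against the weighted average instead of the total.
weights = [1 / len(threat_energies)] * len(threat_energies)
weighted_avg = sum(w * e for w, e in zip(weights, threat_energies))
print(looks_outmatched)            # True  (the bug)
print(my_energy >= weighted_avg)   # True  (the fix: actually holding its own)
```

Against the sum, a healthy fighter always looks hopelessly outmatched; against the weighted average it makes a sensible energy comparison, which is why the dive-for-the-deck behavior disappeared.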