
by Leon Rosenshein

What's Old Is New Again

Did you know that using GPUs to make processing faster is old technology? Back in the late 80s/early 90s I was working with the flight simulators at Wright-Patterson AFB. They ran on Gould SELs, and right next to the main CPU were a couple of other cabinets, about the same size. They were the co-processors, the I(nteger)PU and the F(loating point)PU, and together they could do math orders of magnitude faster than the CPU. And just like with today's GPUs, feeding them data and getting the results back properly was key to getting the most performance from them.

Similarly, the QUIC standard is looking into header compression to improve network performance. 30-something years ago, using a 286 CPU reading data from a local hard drive, we found that to get maximum performance we needed to compress the data on disk and then decompress it after reading. For text files the combination of lower disk I/O and more processing gave us a ~25% improvement in response time.

The other place you'll see something that appears to be new, but isn't, is when scale gets involved. What's the difference between Kubernetes and a thread manager? Or an operating system managing tasks across physical and hyper-threaded CPUs inside a single computer? Conceptually, not much. Sure, the implementation is different, and we've learned a lot about how to make a good API and framework, but at its heart, it's just "run this over there" and "hey, that one stopped, put it over there instead".
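To make that concrete, here's a minimal sketch in Scala (task names and pool size are made up) of the "run this over there" idea at the smallest scale: a thread pool spreading work across cores, which is conceptually the same job Kubernetes does across machines.

```scala
import java.util.concurrent.Executors
import scala.concurrent.{ExecutionContext, Future}

// "The cluster" is just four worker threads here.
val pool = Executors.newFixedThreadPool(4)
implicit val ec: ExecutionContext = ExecutionContext.fromExecutorService(pool)

// "Run this over there" -- the scheduler decides which worker runs each task.
val results = (1 to 16).map { id =>
  Future {
    s"task $id ran on ${Thread.currentThread().getName}"
  }
}
```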

So the next time someone talks to you about the new hotness, think about its past, what's changed, and what we can learn from people solving that problem before.

by Leon Rosenshein

Backing Up

I ran across an article on how to not ding your rented semi-trailer when backing up. Don't ask why. That's not important to this conversation. What is important is how much of the article actually applies to development.

The first and most important rule for backing up safely is "Don't back up". For software development the corollary is some combination of *Y*ou *A*in't *G*onna *N*eed *I*t, using the tools you already have, and not putting yourself in a situation where you need to write that piece of code in the first place by putting in guardrails earlier.

Practice - The more code you write the more situations you'll learn to understand and the more tools you'll have in your toolbox.

Watch others - See how other people solve similar problems and what issues (if any) those solutions caused. Learning from your mistakes is good. Learning from others' mistakes is even better.

Look at the situation - What are the requirements? What are the constraints? What are the problems you might encounter? What kind of edge cases? The best time to handle all of those is _before_ you write the first line of code, not after things fail.

Re-examine the situation - After you've been working for a while, take time to look around again. Things may have changed, and you certainly have a better understanding of what you need to do, so maybe you'll see something you missed the first time you looked.

Tell others what you're about to do - They might be about to do the same thing, or maybe the opposite for some reason you haven't thought of. They might have good ideas or gotchas from earlier work.

Ask an expert - If you're about to start working on a Neural Net to train a model to identify geese crossing the road, don't ask me for help. That's not my area, but we have lots of folks who _are_ experts in that area. Ask them. On the other hand, if you find that your training is running too slowly and you want to take advantage of distributed processing in our GPU cluster(s), then ask me. I know lots about distributed/parallel processing.

Go slowly / Start again - Take your time. Test as you go. Validate your intermediates. Git is your friend. Use a branch to keep track of the work you've done so you can roll back to a known good point and try again.

Say no - Or at least say you don't know how/don't feel comfortable given the constraints. It's much better to say that at the beginning rather than reach the finish date with nothing to show. You'd be surprised how often that's enough to get help or change expectations.

Good tools working correctly - For us that means bazel, docker, git, your IDE, compilers, linkers, etc. The job is hard enough when your tools are working. Don't make things harder by using a broken tool or the wrong tool for the job.

So there you go. How to safely back up a trailer to a loading dock. Or build a robot. Either one.

by Leon Rosenshein

Getting Started

It's summer, and welcome to our new interns. You're bright-eyed and eager to learn. You've got an amazing opportunity to learn from some of the sharpest minds in the business. So what should you learn?

Here are my top 6 things, in no particular order.

Sharing/Asking/Working on a team and email/Slack/ZenHub/Jira/Zoom: This is a big one. In school most of what you do will be on your own, and even when you're working on a team the projects are (generally) small enough that you can hold all of the details of the entire project in your head at once. Here, that's just not true. In school projects you'll spend lots of time with your teammates and the folks you're depending on, but now, even setting aside WFH, you probably depend on folks in other states/timezones, and communication is much more asynchronous. Learning how to work with people you'll never meet or talk to, who are not nearby, and who have different priorities is a skill. The different communication methods have different bandwidths and transport delays. Knowing when to use each is a topic unto itself.

Taking notes and gDocs/Confluence/Readme.md: Similar to interpersonal communication, this is about being able to capture and share information with others and with future you. Again, different methods have their pros and cons. Where does your audience normally look? How hard is it to change the docs relative to how hard it is to change the thing being documented? If it's significantly harder to update the docs than the thing itself then your docs will be out-of-date very quickly.

The command line and Bash: WYSIWYG is great, and some things are best consumed visually (RobotStudio anyone?), but let's be honest with ourselves. Most of what we do is keyboard centric, not mouse centric. The amount of data we deal with is staggering. `ls`, `find`, `grep`, `awk`, `sed`, `xargs`, `jq`, and their ilk, along with `bash` for flow control, will make your life a lot easier. Knowing when to use the simple tools and chain them together instead of building something bespoke will save lots of time.

Version control and git: There's too much information in the codebase to keep in your head, and we all make mistakes, so having a way to get back to what works will save you a lot of time. We've all heard save early, save often, and version control lets you save steps along the way and get back to them as needed.

Building and deploying and bazel/docker: Repeatability. Along with version control you need to be sure that you do things the same way every time, or you'll quickly get lost. Bazel lets (almost forces) you to define what you need up front so it can keep things up to date for you. Docker lets you have the exact same environment every time, which can avoid the dreaded "Well it worked on my machine".

Your day job: Whatever project you're working on, you're working with folks who really know it. Not just what works, but what doesn't and why. Take advantage of all that knowledge.

Because when you get down to it, the more you know, the further you'll go.

by Leon Rosenshein

Know Your Tools

Learning how to use a tool is usually pretty easy. Learning when not to use a tool is harder. Especially when your "tool" is a Turing-complete language. If a program can be written, it can be written in any Turing-complete language, but that doesn't mean it's a good idea. In theory you can use sed to implement your workflow, but that's a very bad idea. Conversely, if you need to loop over a list of numbers and add them up you could write a bespoke program in Rust to read a file, add them up, and spit out the answer, but again, probably not the right choice.

In those cases it's relatively easy to see, but not always. Should you write your new tool in Python, Go, or C++? Well, it depends. It depends on what you're familiar with, what your users are familiar with, the ecosystem the tool will live in, any additional tools/libraries you need, and, of course, the problem you're trying to solve.

And like most things in software, these choices occur at all scales. Once you pick a language you have patterns and styles. Idiomatic forms differ across languages. What is your class structure/hierarchy?

At the other end of the scale is your architectural style. From layered monoliths to space-based event driven architecture, you have lots of choices.

The goal is to use the right tool for the job. We've all done, seen, and had to deal with the "I've got a hammer, so the problem must be a nail" syndrome, and its close relative, the "I learned about a new technique so I'm going to use it everywhere" approach.

For me it was singletons. When I first learned about singletons they were the answer to all of my questions and problems. It was great. I had a place to store global state and I could access it anywhere. Then I started to use more of them. And they got more complicated. When I saw I had written a singleton that contained a couple of singletons and had internal logic I realized I had gone too far.
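For flavor, here's a minimal sketch in Scala of how that usually starts (the names are hypothetical): one object holding global state that anything can reach. Convenient on day one, and exactly the kind of thing that quietly grows internal logic and starts holding other singletons.

```scala
// A bare-bones singleton: a Scala `object` holding mutable global state.
object AppConfig {
  @volatile private var settings: Map[String, String] = Map.empty

  def set(key: String, value: String): Unit =
    synchronized { settings += (key -> value) }

  def get(key: String): Option[String] = settings.get(key)
}

// Anywhere in the codebase, with no visible dependency:
//   AppConfig.set("region", "us-west-2")
//   AppConfig.get("region")  // Some("us-west-2")
```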

So before you pick up that hammer and nail in a screw, think about more than just if it will work. Make sure you're not using the wrong tool for the job.

by Leon Rosenshein

Why Are An Elephant's Ears So Big?

RTAPI or 3000+ services? The architect says "It depends". It depends on a lot of things. And a lot of them come down to why elephants have big ears. The square/cube law. Generally speaking, surface area goes up with the square of size, while volume goes up with its cube. Consider the sphere. Surface area is 4πr^2, while volume is 4/3 πr^3, so the surface-to-volume ratio is 3/r and it shrinks as things get bigger. Comparing the titular elephant to a mouse, the elephant has a lot more body (volume) generating heat per square inch of skin (surface area), so to radiate that extra heat it has big ears to dramatically increase the surface area.
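A quick back-of-the-envelope check in Scala, with made-up radii, shows how fast that ratio falls off:

```scala
// Sphere surface area and volume, and the surface-to-volume ratio (3 / r).
def surfaceArea(r: Double): Double = 4.0 * math.Pi * r * r
def volume(r: Double): Double      = 4.0 / 3.0 * math.Pi * r * r * r

val mouse    = 0.03 // a ~3 cm "radius"
val elephant = 1.5  // a ~1.5 m "radius"

val mouseRatio    = surfaceArea(mouse) / volume(mouse)       // 100.0 -- lots of skin per unit of body
val elephantRatio = surfaceArea(elephant) / volume(elephant) // 2.0   -- not much skin per unit of body
```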

But what does that have to do with software development? Think of it this way. The surface area of your unit (function, class, package, library, executable, etc) is the collection of interfaces it presents to the outside world. The more interfaces you have (functions, arguments, config files, data files, etc), the more volume (code) you need to support it. More code means you have more opportunities for bugs. The chances for coupling and side effects are higher. You have higher cognitive load because the mental model you have to keep is that much bigger.

On the other hand, the surface area of your unit also includes how it's shared/deployed and tested. There's a certain amount of overhead to deploy or test anything, regardless of its size. Deploying/testing 1 thing to get the same functionality as 4 individual things takes the same amount of functional work, but ¼ the overhead. So you get economies of scale there. That mouse might be able to move a small piece of cheese faster than an elephant, but the elephant can move 10 wheels of cheese faster.

So what's better, a monolith or 3000+ microservices? It depends.

by Leon Rosenshein

Crawl, Walk, Run

Or, as Kent Beck said, “Make it work. Make it right. Make it fast”. But what does that mean? What's the difference between crawling, walking, and running or working, right, and fast? What goes in each phase?

Crawling or making it work is the happy path proof of concept. As long as the inputs make sense (whatever that means in a given context) then the output is good enough to use. Bad input might crash or give undefined results. Error handling and reporting is minimal to non-existent. The idea is to get enough signal to know if you really want to invest. Back in the aerial image processing days we used to have visible seams where two images touched, and we had artists and modelers fix things by hand. We started to use Poisson blending for the textures and some line-following heuristics to hide the seams along natural image boundaries. In many cases it worked well, so we rolled it out and made accepting or sending the seam off for manual alteration part of the QC process. At that point we got to find more edge cases and understood the system we were working with much better, so we could do a better job going forward.

Walking (making it right) is productionizing. Hardening the system so that if there are problems you know early and can deal with them. It's handling all of the weird edge cases that come up. It's degrading gracefully. It's making sure you have the right answer (or know that you don't) all the time. It also means making sure the code itself is well written. That it's "clean" and extensible (if needed). That it's well factored. For that seam line processing it meant handling fields and forests and cities. It meant high confidence annotations on which seams still needed manual work and which didn't. It meant not blending colors across seam lines that ran down the side of a road. By this time, even with the much longer automated processing we were able to have higher overall throughput and more consistent quality.

Finally, running or making it fast. For me, this part is about making it literally fast and making it scale out. Being able to get more done with less. Eliminating bottlenecks. And not just in the code. Making it easier and faster to build, test, and deploy. Adding automation. Extending the use cases. For the seam lines most of it was really making it faster. Getting the process down from days to hours. We were able to reduce memory usage (we were memory bound, like many things) so we could do more in parallel. And because we had it "right" from phase 2 it was easy to make sure we didn't have any regressions.

So take things in steps, checking along the way to make sure you're still on track. You, and your customers, will be happier.

by Leon Rosenshein

Eventual Consistency

Consistency, availability, and partition tolerance. CAP theorem. If you're working on a distributed database then a lot of what you do is guided by CAP theorem. The world is an imperfect place, and networks break. So you have to deal with network partitions. And in the face of a network partition you can't guarantee consistency and availability. That's the CAP theorem in a nutshell. But what if there were a way to, if not break the rules, at least ignore them in some cases?

Consider the lowly status API. Is this flag set? Has this job finished? Is the car door open? Things like that. Assuming good intent and a working network, _at best_ those answers are just snapshots in time. The answer you got was true when it was sent, but by the time you got it, it may no longer be. So you make the call again, looking to see if anything changed, and behold, something did. You get the response you're looking for and get to move on.
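A minimal sketch of that "ask again later" loop in Scala, where `checkStatus` and the `Status` type are stand-ins for whatever your real status API looks like:

```scala
import scala.annotation.tailrec

sealed trait Status
case object Pending extends Status
case object Done    extends Status

// Stand-in for the real status call; each response is just a snapshot in time.
def checkStatus(jobId: String): Status = ???

@tailrec
def waitUntilDone(jobId: String, retriesLeft: Int, delayMs: Long = 500): Boolean =
  checkStatus(jobId) match {
    case Done => true // the snapshot finally agrees with reality
    case Pending if retriesLeft > 0 =>
      Thread.sleep(delayMs)
      waitUntilDone(jobId, retriesLeft - 1, delayMs * 2) // back off and ask again
    case _ => false // give up for now; "eventually" just hasn't arrived yet
  }
```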

That's eventual consistency in action. Ask later and eventually you'll get the correct answer. And you can take advantage of that to make the issues of CAP theorem less of a burden. Because different operations have different requirements and make different promises to the user.

For database writes you might reasonably decide to require consistency at all times. If you can't be sure everyone knows, then don't accept the change. On the other hand, in many (most?) situations availability is key. Getting the most recent answer you can is better than not getting any answer. Recognizing this opens the door to lots of opportunity, from working through network partitions to read-only status copies of your database.

So when you're thinking about the requirements of your distributed database, remember that the requirements for data accessibility and consistency are not consistent for all accesses.

Side note: Both ACID and CAP have consistency in them, but they're not the same thing. That's a topic for another time.

by Leon Rosenshein

TIL ...

TIL what TIL means. Not really, but that is one of the TLAs I seem to have trouble remembering. Kind of like I know that PCMCIA doesn't mean People Can't Memorize Computer Industry Acronyms, but I can never remember what it really means.

Actually, what I really learned was more about Optional vs Nullable. I gained a deeper understanding of the advantages of Optional types. Not having to worry about NULL pointers can make things cleaner and easier to comprehend, and, of course, avoids NPEs, which is always a good thing. Back when I was using Scala everything turned into a chain of operations on collections, and without Optional that would have been a nightmare. With Optional things just seemed to flow.
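Here's a small sketch of that kind of chain in Scala (where the type is called `Option`), with made-up data. A missing value simply drops out of the pipeline instead of turning into a null check, or an NPE, somewhere downstream.

```scala
// Hypothetical data: not every user has an email address.
case class User(name: String, email: Option[String])

val users = List(
  User("ada", Some("ada@example.com")),
  User("bob", None)
)

// Collect the email domains of the users that actually have one.
val domains: List[String] =
  users
    .flatMap(_.email)           // None just disappears here
    .map(_.split("@").last)
    .distinct

// domains == List("example.com")
```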

by Leon Rosenshein

Abstract vs Interface

Whatever your language of choice, at some point you're going to need to think about inheritance, composition, and hierarchy. That's when you start thinking about abstract classes vs interface classes. Regardless of what your language calls them, pretty much all modern languages (C++, Golang, Javascript, Python, etc) include the functionality directly or indirectly.

The difference between the two is kind of subtle, but for the sake of simplicity let's say an interface is the definition of the contract between caller and provider, but has no implementation, while an abstract class provides at least one overrideable concrete implementation of part of that contract. In neither case can you use them without providing your own implementation. In both cases though, you can just act against the contract and not care about the implementation.
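As a sketch of the distinction in Scala terms (the types here are hypothetical), a trait with only abstract members plays the interface role, while an abstract class can ship part of the implementation:

```scala
// The "interface": pure contract, no implementation.
trait Storage {
  def read(key: String): Array[Byte]
  def write(key: String, value: Array[Byte]): Unit
}

// The abstract class: part of the contract is already implemented,
// built on top of the members a subclass still has to provide.
abstract class LoggingStorage extends Storage {
  protected def log(msg: String): Unit = println(s"[storage] $msg")

  def copy(from: String, to: String): Unit = {
    log(s"copy $from -> $to")
    write(to, read(from))
  }
}
```

Either way, callers can program against `Storage` and never care which concrete implementation they were handed.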

The place where it gets interesting is when you start talking about composition vs inheritance, IsA vs HasA, and the future. Generally speaking, you can inherit (IsA) one thing, but you can implement (HasA) many. So what happens when you want to add some new functionality or capabilities? What if you only want to add it to some things and not others?

A great example is Comparable. You could put it on the most basic element in your hierarchy. You'd have to give it a default implementation that failed with NotImplemented, or everything would have to provide its own implementation that did that, but that would break the DRY rule. Then you'd have to implement it on everything that wanted to be Comparable. That's basically the same amount of work, so no loss there. What would be a loss though is that now your compiler can't know if Compare actually works or just fails, so you're forced to try it out, hope it works, and then deal with the fallout. On the other hand, you could make Comparable an interface, define it rigorously, and then add it to the things that you plan on making comparable. Then when some developer writes instanceofthing.Compare(otherinstance) they get a warning up front that the capability doesn't exist. At that point the developer can implement it or do something else. And if you inherit from something that implements an interface you get that as well.
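A minimal Scala sketch of that second option, with hypothetical types. Only the things that can actually be compared opt into the contract, and the compiler catches the rest:

```scala
// The contract, as an interface-style trait.
trait Comparable[A] {
  def compareTo(other: A): Int
}

// Temperatures can be compared, so they opt in.
final case class Temperature(celsius: Double) extends Comparable[Temperature] {
  def compareTo(other: Temperature): Int =
    java.lang.Double.compare(celsius, other.celsius)
}

// Photos can't be meaningfully compared, so they simply don't implement it.
final case class Photo(path: String)

// Compiles fine:
val warmer = Temperature(21.0).compareTo(Temperature(18.5)) > 0

// Fails at compile time instead of throwing NotImplemented at runtime:
// Photo("a.jpg").compareTo(Photo("b.jpg"))
```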

So, is an SDV a Car, which is a MotorizedVehicle, which is a Transport, or is an SDV a thing which implements the interfaces of a MovingPlatform, SensorPackage, and DecisionMaker? Discuss in the thread.

by Leon Rosenshein

Toward A Better PM

I'm not talking about Nyquil or Tylenol PM or [Program|Project|Product] Manager. I'm talking about Post-Mortems, or Post-Incident Reviews as they're starting to be known. Whatever you call them, they're critical to improving the system. And that's the goal. Make the system better. Make it more reliable. More self-healing. To do that you need to figure out two things. What happened, both leading up to and during the incident, and why those things happened. You don't need to figure out where to place blame, and trying to do that makes it less likely that you'll be able to find out the important things.

What and why leading up to the incident. How we got into this situation. The 5 whys. It's so important to know how we got there that it's baked right into the incident response template. But here's the thing. We call it the 5 whys, but it often ends up being the 5 whats.

  1. The server was overloaded.
  2. The server was overloaded because there were 3 instances instead of 5.
  3. There were 3 instances instead of 5 because the autoscaler didn't scale the service up.
  4. The autoscaler didn't scale up because there was no more capacity
  5. There was no more capacity because growth wasn't in the plan.

That's what happened, not why. Why didn't we know the service was getting close to overloading? Why did we get down to 3 instances without knowing about it? Why didn't we know the autoscaler wasn't able to autoscale? Why didn't we know we were close to capacity? Why didn't we anticipate more growth? You need to answer those questions if you want to make sure to fix that class of problem. Each of those "whys" has a solution that would have prevented the problem, or at least let you know before it got that bad.

A good PM goes over what happened during the incident. And that's good. Understanding the steps is important if you want to make an SOP in case something like that happens again. But there's more to it than that. Unless your incident/solution is such a well worn path that you know exactly what to do and just do it (in which case, why isn't it automated?), chances are there was some uncertainty at the beginning. Those 6 stages of debugging. So instead of just recording/reporting the steps that were taken to mitigate the problem, think about that uncertainty. Why did it happen? Was there an observability gap? A documentation gap? A knowledge gap? Maybe it was permissions, or too many conflicting signals. Ask questions about why there was uncertainty and what could be done to reduce the time spent clearing up that uncertainty.

We naturally gravitate to the what questions. They typically have very concrete answers and it's easy to point to those answers and the process and say that's why there was an incident. And while we're focusing on the what we tend to ignore the why. However, if you really want to make the system better, to prevent things from happening in the first place, you need to spend less time on what happened, and more time on why it happened.