by Leon Rosenshein

The More Things Change

November seems to have slipped away, but I’m back now. This piece has been brewing for a while, and it’s time to share.

The machine learning lifecycle, with loops

Things sure have changed for developers. The first piece of software I got paid to develop was written in Fortran77 on an Apollo DN4000. The development environment was anything but integrated. I used vim to edit the code, f77 to compile the individual source files into object files, then f77 again, this time with all of the object files, to generate an actual executable file. All manually. Or at least that’s what I was supposed to do.

At the best of times it took a while to do all those steps. Throw in debugging by print statement and things were really slow. Write some code. Compile it. Link it. Run it. Look at the output. The cycle time could be hours or even days, depending on how long it took me to get the syntax and the logic correct at the same time.

To make matters worse, I would regularly forget one of those steps. Usually the linking step. Unsurprisingly, if you compile your code but don’t link the object files into a new executable, nothing changes. And instead of remembering what I’d forgotten to do, I’d edit and compile again, and still nothing would happen. Eventually we figured out how to use Make, which solved most of the problems, but not all of them. The thing Make didn’t solve was remembering to actually save the file, especially once I learned how to use multiple windows so I could keep the editor open while I compiled, linked, and ran things.
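For anyone who never had to live through it, a minimal Makefile of the sort that solved most of the problem might look something like this. The file names (main.f, util.f) and the target name are hypothetical, just a sketch of a two-file Fortran 77 project:

```make
# Hypothetical two-file Fortran 77 project.
# `make` recompiles only the object files whose sources changed,
# then relinks whenever any object file is newer than the executable.
FC = f77

prog: main.o util.o
	$(FC) main.o util.o -o prog

main.o: main.f
	$(FC) -c main.f

util.o: util.f
	$(FC) -c util.f
```

Because the link rule declares the object files as prerequisites, forgetting to relink stops being possible. The one thing Make can’t see, of course, is an unsaved editor buffer.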

Today, modern IDEs make this much easier and less error-prone. They know which files you’ve changed. They understand the languages. They can, in many cases, figure out the internal dependencies of your code and build only the things that need to be built. And for the situations the IDE itself can’t handle, there are external tools like Bazel to manage the dependencies and the build.

At least they usually do. I was helping my daughter with programming homework the other day, doing some front-end development. Doing the usual console.log debugging, we weren’t seeing the output we expected. It turned out she didn’t have autosave enabled in her IDE. So there we were, 30+ years later, still losing development velocity because we forgot to save a file.

Of course, that’s an outlier. The typical compile/test cycle is seconds to minutes. Some systems even complain if your tests take too long to run. Which means that a typical developer gets (nearly) instantaneous, continuous feedback on whether a change is having a negative impact on the unit tests. You can feel confident that you haven’t broken anything, and if you’re a person who writes tests first rather than after, you know when you’re done.

Some folks are still dealing with that long, drawn-out, error-prone process. Two such groups that I deal with regularly are data scientists and ML researchers. To do something, they often need to not just change their analysis code, but make one or more upstream changes to the data itself, which means a completely different code change in a different code path. Then they must regenerate the data. And point their analysis/training code at the new data. All of those steps take time. They’re hard to track. They’re easy to miss. There’s no great tooling around them. Even worse, for some things, like data regeneration, you can’t do it locally. There’s too much data, or there are access controls around it, or you’re in the wrong country and the data needs to stay where it is, so you have to submit a request to have some process regenerate the data. And that can fail for lots of reasons too.

Put together, doing data science/ML work often feels like development used to. Which isn’t surprising. The more things change, the more they stay the same. It’s just the frame of reference and the scale that change.