
by Leon Rosenshein

Remote Extreme Programming

You've probably heard of pair programming. That's where two programmers work on one screen with one keyboard, taking turns being the typist and the navigator/observer. Not my favorite way to work long term, but I've definitely taken advantage of it during debugging or exploring a new area with another developer. It's really easy to do in person, and not too hard even now with WDP. Zoom and screen sharing are almost like being there in person.

At the same time, we've got CodeSignal, which we use for phone screens, Zoom interviews, and even some in-person interviews. That's (usually) not pair programming, just following along, but the interviewer can edit at the same time if desired. In my experience it works great with simple audio, and even better when there's video.

What if there were a way to take that experience and use it for live development of code in our codebase, but inside a fully featured IDE? Turns out such a thing exists, at least if you use VSCode. It's a full co-editing session: both parties editing the same file with visible cursors, plus live debugging where each person can look at and inspect whatever they want. Picture that. Working with someone, and if you want to inspect a variable, just check. No more saying "Set a breakpoint on line X" or "Hover over that variable for me." You could just do those things yourself and explain what you're looking for. I haven't tried it out myself yet, but I'm going to Real Soon Now™. Hopefully someone reading this already has and can let us all know if it's as cool and useful as it looks.

by Leon Rosenshein

Bob The Builder

There are lots of design patterns. The most famous are probably the ones in the Gang of Four's book, Design Patterns. There are lots of good patterns in there. Each pattern has its strengths and weaknesses. Of course, like any good tool in your toolbox, you can use any particular pattern for multiple things. But as important as it is to know when to use a pattern, knowing when NOT to use one is even more important. Just like you can use a screwdriver as a prybar when you need to, you shouldn't reach for a screwdriver when you need a prybar.

One such pattern is the Builder pattern for constructing new instances. There's a fairly narrow set of use cases, so most of the time it doesn't apply. The Builder pattern is most useful when you need to create an immutable object with lots of overridable defaults. You could create a set of overloaded constructors with the common sets of options, and then another with every option, but that's hard to write, hard to use, and hard to maintain. You could write single-use setters, but how do you ensure they're all called before anything is used? What about validation? How do you know that enough has been set to validate? How do you make it readable? Extendable?

Enter the Builder pattern. Basically, it's a collection of chained setters on a Builder that's then used to create the object with a final `.Build()` method. There are a few advantages. The selection and order of the setters is up to the user. There's no partially constructed object lying around to be misused. You get an explicit indication of when all the parameters are set and it's time to do validation. Your immutable object springs into existence fully formed and ready to use. Got a new parameter with a sensible default? No problem: add it to the Builder, and your users won't know until they need it and find it's already there. Need more than one of something? Call `.Build()` multiple times. Need a bunch of things that are mostly the same? Instead of a single chain of builders, bifurcate it at the right point.
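
To make that concrete, here's a minimal sketch in Go (all the type and method names are made up for illustration):

```go
package main

import (
	"errors"
	"fmt"
)

// server is the immutable result; its fields are unexported, so the
// only way to get one is through the builder.
type server struct {
	host string
	port int
	tls  bool
}

// ServerBuilder collects settings. Nothing is validated until Build().
type ServerBuilder struct {
	host string
	port int
	tls  bool
}

// NewServerBuilder starts from sensible defaults; callers only
// override the ones they care about.
func NewServerBuilder() *ServerBuilder {
	return &ServerBuilder{host: "localhost", port: 8080}
}

func (b *ServerBuilder) Host(h string) *ServerBuilder { b.host = h; return b }
func (b *ServerBuilder) Port(p int) *ServerBuilder    { b.port = p; return b }
func (b *ServerBuilder) TLS(t bool) *ServerBuilder    { b.tls = t; return b }

// Build is the explicit "everything is set" signal, so validation
// lives here, and the object springs into existence fully formed.
func (b *ServerBuilder) Build() (*server, error) {
	if b.port < 1 || b.port > 65535 {
		return nil, errors.New("port out of range")
	}
	return &server{host: b.host, port: b.port, tls: b.tls}, nil
}

func main() {
	s, err := NewServerBuilder().Host("example.com").TLS(true).Build()
	if err != nil {
		panic(err)
	}
	fmt.Printf("%+v\n", *s)
}
```

Note that adding a new defaulted field touches only the builder; every existing call site keeps compiling unchanged.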

Of course, nothing is free. You still need to set all those parameters on your object, so that code doesn't go away. You still need to do validation. Now you need to create a whole new "friend/embedded" thing called the builder. And your builder needs a getter/setter for every parameter, with some validation. So there's a bunch of code you wouldn't otherwise need. If you only have a handful of parameters and they're always required, that's a lot of overhead you should avoid.

But when it's appropriate, Builders make things much easier to read and maintain and can help reduce the cognitive load, so next time you find yourself in that situation, consider using one.

by Leon Rosenshein

Alles Ist In Ordnung

Everything is in order. That's what coding standards are all about. Making sure everything looks right. Especially the ones we have linters for. Correct?

Sort of. Yes, coding standards include formatting and layout. And that's the part linters are best at finding/fixing and annoying us with. And it's important too. Especially in languages that are _mostly_ agnostic to the amount and location of whitespace, like C/C++. You can pretty much throw spaces and tabs anywhere you want (outside of string literals) and the compiler won't care. Of course, taken to the extreme that leads to the Obfuscated C Contest, and no one calls that code readable. And that's led to lots of "standards" for formatting.

Golang is similar, in that it doesn't care much about whitespace, but there's also `go fmt`, which *will* format your code the correct way, as defined by the language. So you have two options: do it your own way, or do it the official way. Other languages have more rigid spacing requirements. Python makes you pick an indentation for a block and stick with it, but doesn't care what you pick. Some languages let you specify the type for any variable; some have a default type based on the first letter of the variable's name (FORTRAN).
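
As a tiny illustration, here's what `go fmt` does with a legal-but-arbitrary layout (the "before" is in the comment):

```go
// One legal-but-arbitrary way to write it:  func add(a ,b int )int{return a+b}
// And what `gofmt` turns that into:
package main

import "fmt"

func add(a, b int) int { return a + b }

func main() {
	fmt.Println(add(1, 2)) // prints 3
}
```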

So formatting is kind of arbitrary, but consistency helps us all. Whether it's conventions for variable/method/class/constant names or spacing, indentation, or one-true-brace, the less mental gymnastics you need to go through when you see a new piece of code, the lower the cognitive load, and as mentioned many times, that's a good thing. Especially when your code-base and programming team(s) are large. What works for 5 people doesn't work for 500.

But coding standards are about more than just raw formatting. That's just the most visible part. The meat of a standard is really the set of recommendations about how to structure your code. When to use classes, methods, interfaces, libraries, etc. Those are much harder to quantify and lint for, but to me they're even more important. And it goes back to cognitive load. Making things digestible. Making things clear. Being up front about side effects and possible error cases. Making sure the code clearly identifies the programmer's intent.

And that's where it gets hard. Because there are going to be exceptions. And if you've mandated strict adherence to a standard then there's no clean way to express what you want. So how do you handle that?

First and foremost, by making sure the "standards" are recommendations. Very strict recommendations, but not absolute laws. You need to have a good reason to override them, but there needs to be a way to silence the linters when needed. Without that, code gets obfuscated just to placate a tool. And we should never let formatting define us.
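
For example, assuming golangci-lint (a common Go linter aggregator; other tools have their own escape hatches), an override can be targeted to one rule and carry its justification inline:

```go
package main

import "fmt"

// Suppose this function has to be long because it mirrors a report
// layout one-to-one, and splitting it up would hurt readability
// (body elided here). The directive below silences exactly one
// linter, for exactly this function, and says why.
//
//nolint:funlen // structure intentionally tracks the report spec
func generateReport() {
	fmt.Println("header")
	fmt.Println("body")
	fmt.Println("footer")
}

func main() {
	generateReport()
}
```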

Second, by making sure everyone knows why the standard includes that rule in the first place. What's it really there for? Is there a better way to achieve that goal than by strictly applying the rule?

Third, by reexamining the standards periodically. Not changing them on a whim, but looking to see which parts are helping, which are hurting, and what areas need better coverage.

Of course, not everyone agrees with me. I've included a link to someone who feels almost entirely the other way around. It's an interesting read, and I encourage all of you to check it out. And then share what you think in the thread. I'll start it off, then go make some popcorn. This should be fun.

by Leon Rosenshein

Time For A Walk

I don't know about you, but I've been spending a lot of time in one spot for the past few weeks. It's great that I've got a workspace with a stand-up desk and enough pixels to keep all the text I need to read big enough to see, but I don't even get to walk to the various conference rooms anymore. And that's a problem for me. No changes of scenery. No break in the pattern. Less exercise. So what's a developer to do?

One option is to take an older idea and update it for today's WDP times. The walking meeting. We've probably all had them. Sometimes it was a 1:1 with the boss, sometimes it was walking down the block to grab a cup of coffee that isn't from the office kitchen. Obviously we can't do things quite the same way now, but in some ways it's gotten easier to have a walking meeting. The meeting is already on zoom, so location doesn't matter. Especially if it's a brainstorming session or a 1:1, it might not matter where you are. Just grab your headphones and phone and take a walk during the meeting.

And there are real benefits. A change in scenery. A little exercise. Fresh air. Some direct sunlight and vitamin D. And there's evidence that people are more creative when they're moving. Even back in the office, when I got stuck on a problem, a few circles around the floor often got me out of my rut and moving toward the solution again. Give it a shot. It can't hurt.

by Leon Rosenshein

Staying In Sync

Technology changes. A lot. All the time. So how do you stay on top of it? How do you know what’s coming so you can be prepared?

One important thing to remember is that you can’t be an expert in everything, so don’t try to be. You’ll have your strengths, things that you’re better at, or enjoy working on more, and you should always play to them. There are going to be things that are “outside your wheelhouse”, and that’s fine. Unless you’re working completely isolated, you probably shouldn’t work to make those things your strengths. But that doesn’t mean you should ignore those things.

It starts with knowing the tools you use. Knowing the capabilities of the standard libraries and what the common extensions are. What their strengths and limitations are, and when to use them and when to avoid them. That includes not just the built-in libraries, but also the open source and internal libraries. Don’t forget about the other dev tools we use, such as simple editors (vim vs emacs), IDEs (VSCode, JetBrains, emacs), command shells (bash, zsh, fish?), Git, GitHub, and Buildkite, to name a few. Sometimes a bash one-liner is the right solution; sometimes you need PySpark or Zeppelin.

Then there’s the ecosystem we work in. AWS and all its offerings. DevEx/Infrastructure tools (bonsai, sq, infra, batch-api, HDFS, Piper, atlantis, etc). Core Business infrastructure (M3, umonitor, usso/ussh, uOwn, Querybuilder/Queryrunner, etc). To keep track of all that, we have the Product Catalog, which lets our internal customers know what technologies are supported and should be used.

But what about new things? The up-and-coming tools/technologies that might not be ready yet, but that you should be thinking about so you’re ready for them when they’re ready for you. For general software development/architecture, Thoughtworks is a good place to start, particularly their Technology Radar. InfoQ and DZone are other good resources, and they’ll send you a daily list of articles for topics you sign up for if you want.

And however you learn about new things, share them. That way everyone benefits.

by Leon Rosenshein

Sharding Shouldn't Be Hard

When you get down to it, most of what we do is keep track of things. Objects in motion. Objects at rest. Files. Data in files. Where we put the files. Metadata about the data. Who wrote the data. Who read the data. For source control systems (Git, TFS, Perforce, VSS, etc), that's all they do. And when that's your business you keep track of a lot of data. Typically that means storing the data in a database.

And when you're starting out, you keep everything in the same database to make joins and queries faster and management easier. As your business grows, you grow your DB. That's what we did when we were making maps for Bing. We used Microsoft SQL Server to track every file, ingested or created, in the system. At its peak we were managing billions of files across 3K machines and 300 PB of spinning disk. All managed by a single SQL Server instance. Just keeping that beast happy was a full-time job for a DBA, a SQL architect, and a couple of folks who worked on the client. We had a dedicated high-speed SAN backing it and 128 CPU cores running things.

When Uber acquired us we were in the process of figuring out how to split things up into something we could scale horizontally, because even without needing to pay SQL Server's per-core license fees, we had pretty much hit the ceiling on vertical scaling. Every once in a while someone would run a bad query, or a query would be larger than expected, and everything would come crashing to a halt because of table locks, memory limits, or just lack of CPU cycles. We knew we had to do something.

And it wasn't going to be easy. We had hundreds of cars out collecting data, dozens of folks doing the manual part of map building, and of course those 3K nodes (~100K cores) doing the automated part. We couldn't just stop doing all that for a few days/weeks while we migrated, and we didn't have the hardware to set up a parallel system, so we had to migrate in place. So we thought about it. We came up with our desired system. Isolated, scalable, reliable. And we girded our loins and prepared to make the switch.
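
We never shipped it, but the core idea behind that kind of horizontal split is simple enough to sketch: route every record to one of N databases by hashing a stable key, so each instance holds a manageable slice and every client independently agrees on where things live. (This isn't our actual design; rebalancing when N changes is the hard part it glosses over.)

```go
package main

import (
	"fmt"
	"hash/fnv"
)

const numShards = 16

// shardFor maps a stable key (a file ID, say) onto one of numShards
// databases. Every client computes the same answer, so there's no
// central lookup table and no single giant instance to melt down.
func shardFor(fileID string) int {
	h := fnv.New32a()
	h.Write([]byte(fileID))
	return int(h.Sum32()) % numShards
}

func main() {
	for _, id := range []string{"imagery/tile_00417.jpg", "lidar/run_42.bin"} {
		fmt.Printf("%s -> shard %d\n", id, shardFor(id))
	}
}
```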

Then we got lucky. We got acquired and never had to go through with our plans. Not everyone is that lucky. GitHub ran into many of the same problems, but they had to implement their fix. Read on to see how that worked out and why there have been so many GitHub incidents recently.

by Leon Rosenshein

To Test Or Not To Test

The NA repo has a requirement on code coverage. And not just a requirement, a high one. And that makes sense. The Autonomy code is mission-critical and man-rated, and people's lives depend on it not going wrong. But that requirement applies not just to the Autonomy code. It applies to everything in the repo: frameworks, web services, and command line tools, among other things. And that makes sense too. Yes, unit tests add some up-front cost, but overall they speed things up. They do that by giving you instant feedback and confidence.

I made a change to one of our CLI tools a couple of months ago. Just a small change to allow the user to use a YubiKey for two-factor auth when getting a usso certificate. Simple. Add a flag, check the flag's value, act on it. So I implemented it, did some simple testing to make sure that it did what I wanted, then got ready to do the PR. Now, our PR template has a checkbox to indicate that tests have been run. The PR build gates on passing all the tests, so we don't block PRs that haven't checked the box, but still, running the tests is a good idea and doesn't take much time, so I did. And they failed. Seems there was another entry point I hadn't considered. Luckily there were tests, and it was a quick fix to get them passing. That also reminded me to add some tests for my new flag. And those tests found some more edge cases.
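
The actual tool and flag are internal, so here's just the shape of those tests, as a hypothetical table-driven Go test (every name below is made up):

```go
package main

import "testing"

// resolveSecondFactor is a stand-in for the real flag handling: pick
// the 2FA method based on a hypothetical --use-yubikey flag and how
// many phones are on the account.
func resolveSecondFactor(useYubikey bool, phones int) string {
	if useYubikey {
		return "yubikey"
	}
	if phones == 1 {
		return "push"
	}
	return "prompt" // multiple phones: ask which one to use
}

// Each edge case the tests surface becomes one more row in the table.
func TestResolveSecondFactor(t *testing.T) {
	cases := []struct {
		name       string
		useYubikey bool
		phones     int
		want       string
	}{
		{"yubikey flag wins", true, 2, "yubikey"},
		{"single phone pushes", false, 1, "push"},
		{"multiple phones prompt", false, 3, "prompt"},
	}
	for _, c := range cases {
		if got := resolveSecondFactor(c.useYubikey, c.phones); got != c.want {
			t.Errorf("%s: got %q, want %q", c.name, got, c.want)
		}
	}
}
```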

So yes, the process from thinking I was done to actually submitting the PR was about 2 hours longer than it would have been without the tests. But that feature has been out for a while now, and people have used it. There haven't been any issues, and it's solved not only the YubiKey problem but also saved folks who have multiple phones associated with their account. If those tests hadn't been there, something would have been interrupted to get a fix out, it would have taken more time to get it working, and our users would have had to deal with bugs for a while. So those couple of extra hours up front saved a couple of days across the org. And that's just for a simple CLI. Think about tools and services that are core to our workflow. If they're down for an hour, that's thousands of man-hours lost.

by Leon Rosenshein

Go Big, Go Blue

ATG isn't a small company anymore. ZenHub was a good idea, but it didn't scale as far as we had hoped, and Phab wikis are deprecated. AWS identities are confusing and we're already reaching scale limits on accounts, roles, and cross-connects. Google drive is great for storing and collaborating on documents, but discovery is nigh-on impossible. Yet work must continue. We still need to plan and execute. We need to track defects. We need to build and store artifacts at scale. We need monitoring, alerting, analytics, and dashboards. We need to share documentation. We need to meet all of those needs (and more) with an integrated solution that extends our current usage of GitHub.

Therefore, we're moving to Azure DevOps. With Azure we start with the source control system we're already using and then extend it with compute, storage, identity management, work item tracking, document management, analytics, communications, and a CI/CD system designed and built, from the ground up, for the enterprise customer.

Regardless of what you think of the quality of the OS, the Azure DevOps system scales to tens of thousands of users working on multi-million-line codebases like Windows and Office 365, with distributed build and test systems that can scale up and down as needed to meet demand. Work item tracking that lets you search the entire corpus of items, build complex queries, handle fan-out/fan-in dependencies, and do bulk item entry/update either online or offline.

Identity management that scales to include not just every user, but every computer, desktop and DC, integrates location and status with email, voice/video chat, and fine grained access control. Identity management that manages not only email, documents, and chat, but also includes every process running on any node, desktop to service, service to service, and on-behalf-of authentication (batch API for example).

Best in class document editing capabilities. Online and offline editing with full markup and review capabilities. Cross-linked pivot tables that pull directly from source of truth databases. Project plans that include resource availability, Gantt charts, critical path analysis, and full programmatic access and linking so they're always up to date with respect to work item tracking.

Time series databases? SQL databases? NoSQL databases? Integrated queries? Analytics? Dashboarding? Alerting? All part of the system.

Migration began this morning and should be done by EOD. Enjoy.

by Leon Rosenshein

Concurrency And Parallelism

Getting computers to do what you want is hard. Getting them to do more than one thing at a time is even harder. In fact, for a given CPU core, it's impossible. But they can do things so fast, and switch between things so fast, that it looks like they're doing more than one thing at a time. That's concurrency. Add in multiple cores and in fact they are doing multiple things at once. That's parallelism. Or at least they might be doing multiple things at once, but you won't know until after it happens. So how do you manage that?

The easiest way to manage it is to not manage it. Make a bunch of independent tasks, then throw them at the threading model/library built into your favorite language/package. In almost all cases it will be better than what you can write. As long as you remember the two key things. First, the tasks really need to be independent. As long as they are, you don't need to worry about them getting interrupted or having side effects. Second, if your tasks really are independent and you have more tasks than cores, it's going to be slower than doing them serially per core.

Why will it be slower? Because switching between those tasks takes time. Time that could have been spent doing useful work. Or, more likely, the tasks aren't independent, so you end up doing more work to make them independent or putting in some kind of guard code (mutexes, semaphores, etc.) to make it safe.
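
Here's what "truly independent" looks like in Go, for instance. Each goroutine writes only to its own slot, so there's no guard code at all:

```go
package main

import (
	"fmt"
	"sync"
)

func main() {
	inputs := []int{1, 2, 3, 4, 5}
	results := make([]int, len(inputs))

	var wg sync.WaitGroup
	for i, n := range inputs {
		wg.Add(1)
		go func(i, n int) { // pass loop variables explicitly
			defer wg.Done()
			// Purely local work touching only results[i]: no locks needed.
			results[i] = n * n
		}(i, n)
	}
	wg.Wait()
	fmt.Println(results) // [1 4 9 16 25]
}
```

(And per the point above: for CPU-bound work this small, the goroutine overhead makes this slower than a plain loop.)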

So, when does it make sense to be concurrent? The simplest case is when you would otherwise be waiting for an external resource. If you make a web request, it could take seconds to get a response. Rather than block, do the wait concurrently. If you're reading from disk, it's probably faster than a web call, but milliseconds are still longer than microseconds, so you might be able to get some work done in the meantime.
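
A sketch of that overlap, using plain goroutines and a channel. Each request spends most of its time blocked on the network, so the total is roughly one round trip instead of three:

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

func main() {
	urls := []string{
		"https://example.com/",
		"https://example.org/",
		"https://example.net/",
	}

	type result struct {
		url, status string
		err         error
	}
	ch := make(chan result, len(urls))
	start := time.Now()

	for _, u := range urls {
		go func(u string) {
			resp, err := http.Get(u) // the slow, blocking part
			if err != nil {
				ch <- result{u, "", err}
				return
			}
			resp.Body.Close()
			ch <- result{u, resp.Status, nil}
		}(u)
	}

	for range urls {
		r := <-ch // collect in completion order
		fmt.Println(r.url, r.status, r.err)
	}
	fmt.Println("elapsed:", time.Since(start))
}
```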

And this is all on a single computer, where you have a good chance of everything actually happening and not losing any messages. Make it a distributed concurrent system and it gets even harder, but that's a story for another time.

by Leon Rosenshein

No UML

You've probably heard of the Unified Modeling Language (UML). You've probably even drawn up a few system diagrams with it. It's so popular that tools like Visio and Google Draw have the templates built in. But does it work? It's certainly good for analyzing an existing system and understanding the relationships between the parts, but what about up-front design? Doing UML right often requires you to know, or make decisions about, things that aren't done yet. Things where the requirements aren't finished yet, let alone the detailed design.

That's where NoUML comes in. If UML is supposed to be an abstraction of your system, NoUML is an abstraction of that. It's even more simplified, and it focuses not on the components, but on the _relationships_ between them. Everything else is left as an implementation detail.

Even the possible relationships are limited. There are only 3 kinds of relationships:

  • Is
  • Has
  • Uses


and they're very generic. For example, "Is" doesn't imply any language inheritance. It's duck typing without any types. Similarly, "Has" doesn't imply containment, just an association. "Uses" is for the things something needs but isn't associated with in any other way. And of course there are rules for heritage and ownership, which help break things down into discrete components.
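
Here's one loose way to picture those three relationships in Go terms (NoUML itself is language-agnostic; the names below are invented):

```go
package main

import "fmt"

// "Is": a Car is a Vehicle. Duck typing; no inheritance implied.
type Vehicle interface {
	Move()
}

// "Has": a Car has an Engine. An association, not necessarily containment.
type Engine struct {
	HP int
}

// "Uses": a Car uses a FuelPump but isn't otherwise related to it.
type FuelPump interface {
	Pump() error
}

type Car struct {
	Engine Engine // Has
}

func (c Car) Move() { fmt.Println("vroom") } // satisfies Vehicle: Is

func (c Car) Refuel(p FuelPump) error { // Uses: passed in, never owned
	return p.Pump()
}

func main() {
	var v Vehicle = Car{Engine: Engine{HP: 300}}
	v.Move()
}
```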

But where NoUML shines is in showing how those discrete components relate to each other. It helps you see where the transitive dependencies are and keeps you from turning those transitive dependencies into direct ones. It helps you keep your abstractions from leaking out into neighboring components. It keeps your bounded contexts bounded. And when you're building large-scale distributed systems, anything you can do to keep your boundaries clean and clear will pay big dividends. If not immediately, then soon.