by Leon Rosenshein

Sharding Shouldn't Be Hard

When you get down to it, most of what we do is keep track of things. Objects in motion. Objects at rest. Files. Data in files. Where we put the files. Metadata about the data. Who wrote the data. Who read the data. For source control systems (Git, TFS, Perforce, VSS, etc), that's all they do. And when that's your business you keep track of a lot of data. Typically that means storing the data in a database.

And when you're starting out you keep everything in the same database to make joins and queries faster and management easier. As your business grows you grow your DB. That's what we did when we were making maps for Bing. We used Microsoft SQL Server to track every file, ingested or created, in the system. At its peak we were managing billions of files across 3K machines and 300 PB of spinning disk. All managed by a single SQL Server instance. Just keeping that beast happy was a full-time job for a DBA, a SQL architect, and a couple of folks who worked on the client. We had a dedicated high-speed SAN backing it and 128 CPU cores running things.

When Uber acquired us we were in the process of figuring out how to split things up into something we could scale horizontally, because even without needing to pay SQL Server's per-core license fees we had pretty much hit the ceiling on vertical scaling. Every once in a while someone would run a bad query, or the query would be larger than expected, and everything would come crashing to a halt because of table locks, memory limits, or just lack of CPU cycles. We knew we had to do something.
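We never shipped our plan, but the usual starting point for that kind of horizontal split is hash sharding: route each record to a database by hashing a stable key. A minimal sketch of the idea (the shard count and file paths here are invented for illustration):

```python
import hashlib

NUM_SHARDS = 16  # hypothetical shard count

def shard_for(file_path: str) -> int:
    """Route a file's metadata row to a shard by hashing a stable key.

    Hashing the path (rather than an auto-increment id) keeps the
    mapping deterministic, so any client can find the right shard
    without asking a central lookup service.
    """
    digest = hashlib.sha256(file_path.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_SHARDS

# All rows for one file land on the same shard, so per-file queries
# never fan out across databases.
print(shard_for("/ingest/2015/06/run-0042/frame-000123.jpg"))
```

The hard parts, of course, are everything this sketch hides: cross-shard queries, rebalancing when you change the shard count, and migrating in place while the system is live.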

And it wasn't going to be easy. We had hundreds of cars out collecting data, dozens of folks doing the manual part of map building, and of course those 3K nodes (~100K cores) doing the automated part of map building. We couldn't just stop doing that for a few days/weeks while we migrated, and we didn't have the hardware to set up a parallel system, so we had to migrate in place. So we thought about it. We came up with our desired system. Isolated, scalable, reliable. And we girded our loins and prepared to make the switch.

Then we got lucky. We got acquired and never had to go through with our plans. Not everyone is that lucky. GitHub ran into many of the same problems, but they had to implement their fix. Read on to see how that worked out and why there have been so many GitHub incidents recently.

by Leon Rosenshein

To Test Or Not To Test

The NA repo has a requirement on code coverage. And not just a requirement, a high one. And that makes sense. The Autonomy code is mission-critical, man-rated, and people's lives depend on it not going wrong. But that requirement applies not just to the Autonomy code. It applies to everything in the repo. Frameworks, web services, and command line tools, among other things. And that makes sense too. Yes, unit tests add some up-front cost, but overall they speed things up. They do that by giving you instant feedback and confidence.

I made a change to one of our CLI tools a couple of months ago. Just a small change to allow the user to use a yubikey for two-factor auth when getting a usso certificate. Simple. Add a flag, check the flag's value, act on it. So I implemented it. Did some simple testing to make sure that it did what I wanted, then got ready to do the PR. Now our PR template has a checkbox to indicate that tests have been run. The PR build gates on passing all the tests, so we don't block PRs that haven't set the flag, but still, it's a good idea and doesn't take much time, so I ran them. And they failed. Seems there was another entry point I hadn't considered. Luckily there were tests and it was a quick fix to get the tests passing. That also reminded me to add some tests for my new flag. And those tests found some more edge cases.
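The flag itself is the easy part; the tests are what catch the entry points you didn't consider. A hedged sketch of what testing a flag like that might look like (the tool name and flag name are stand-ins, not the actual code):

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    # Hypothetical reconstruction of the change described above: a flag
    # that switches two-factor auth over to a hardware key.
    parser = argparse.ArgumentParser(prog="get-cert")
    parser.add_argument("--yubikey", action="store_true",
                        help="use a yubikey for two-factor auth")
    return parser

def test_default_is_off():
    # Existing users who don't pass the flag keep the old behavior.
    assert build_parser().parse_args([]).yubikey is False

def test_flag_turns_it_on():
    assert build_parser().parse_args(["--yubikey"]).yubikey is True
```

Two tiny tests, but they pin down both the new behavior and, just as importantly, the unchanged default for everyone else.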

So yes, the process from thinking I was done to actually submitting the PR was about 2 hours longer than it would have been without the tests, but that feature has been out for a while now, and people have used it. There haven't been any issues, and it's solved not only the yubikey problem, but it's saved folks who have multiple phones associated with their account as well. If those tests hadn't been there, something would have been interrupted to get a fix out, it would have taken more time to get it working, and our users would have had to deal with bugs for a while. So those couple of extra hours up front saved a couple of days' time across the org. And that's just for a simple CLI. Think about tools and services that are core to our workflow. If they're down for an hour that's thousands of man-hours lost.

by Leon Rosenshein

Go Big, Go Blue

ATG isn't a small company anymore. ZenHub was a good idea, but it didn't scale as far as we had hoped, and Phab wikis are deprecated. AWS identities are confusing and we're already reaching scale limits on accounts, roles, and cross-connects. Google drive is great for storing and collaborating on documents, but discovery is nigh-on impossible. Yet work must continue. We still need to plan and execute. We need to track defects. We need to build and store artifacts at scale. We need monitoring, alerting, analytics, and dashboards. We need to share documentation. We need to meet all of those needs (and more) with an integrated solution that extends our current usage of GitHub.

Therefore, we're moving to Azure DevOps. With Azure we start with the source control system we're already using and then extend it with compute, storage, identity management, work item tracking, document management, analytics, communications, and a CI/CD system designed and built, from the ground up, for the enterprise customer.

Regardless of what you think of the quality of the OS, the Azure DevOps system scales to tens of thousands of users working on multi-million line codebases like Windows and Office 365, with distributed build and test systems that can scale up and down as needed to meet demand. Work item tracking that lets you search the entire corpus of items, build complex queries, handle fan-out/fan-in dependencies, and do bulk item entry/update either online or offline.

Identity management that scales to include not just every user, but every computer, desktop and DC, integrates location and status with email, voice/video chat, and fine grained access control. Identity management that manages not only email, documents, and chat, but also includes every process running on any node, desktop to service, service to service, and on-behalf-of authentication (batch API for example).

Best in class document editing capabilities. Online and offline editing with full markup and review capabilities. Cross-linked pivot tables that pull directly from source of truth databases. Project plans that include resource availability, Gantt charts, critical path analysis, and full programmatic access and linking so they're always up to date with respect to work item tracking.

Time series databases? SQL databases? NoSQL databases? Integrated queries? Analytics? Dashboarding? Alerting? All part of the system.

Migration began this morning and should be done by EOD. Enjoy.

by Leon Rosenshein

Concurrency And Parallelism

Getting computers to do what you want is hard. Getting them to do more than one thing at a time is even harder. In fact, for a given CPU core, it's impossible. But they can do things so fast, and switch between things so fast, that it looks like they're doing more than one thing at a time. That's concurrency. Add in multiple cores and in fact they are doing multiple things at once. That's parallelism. Or at least they might be doing multiple things at once, but you won't know until after it happens. So how do you manage that?

The easiest way to manage it is to not manage it. Make a bunch of independent tasks, then throw them at the threading model/library built into your favorite language/package. In almost all cases it will be better than what you can write. As long as you remember the 2 key things. First, the tasks really need to be independent. As long as they are you don't need to worry about them getting interrupted or having side effects. Second, if the tasks really are independent and you have more tasks than cores, it's going to be slower than doing them serially per core.

Why will it be slower? Because switching between those tasks takes time. Time that could have been spent doing useful work. Or, more likely the tasks aren't independent, so you end up doing more work to make them independent or put in some kind of guard code (mutexes, semaphores, etc) to make it safe.

So, when does it make sense to be concurrent? The simplest case is when you would need to wait for an external resource. If you make a web request it could take seconds to get a response. Rather than block, do the wait concurrently. Reading from disk is probably faster than a web call, but milliseconds are still longer than microseconds, so you might be able to get some useful work done while you wait.
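A small sketch of that idea: five fake "network" calls that each block for 200ms, overlapped with a thread pool so the total wall-clock time is roughly one round trip instead of five (the timings and worker count are arbitrary stand-ins):

```python
import time
from concurrent.futures import ThreadPoolExecutor

def slow_fetch(request_id: int) -> int:
    # Stand-in for a network round trip: the thread just waits,
    # so other threads can run during the sleep.
    time.sleep(0.2)
    return request_id * 2

start = time.monotonic()
with ThreadPoolExecutor(max_workers=5) as pool:
    # map preserves input order even though the waits overlap.
    results = list(pool.map(slow_fetch, range(5)))
elapsed = time.monotonic() - start

print(results)  # [0, 2, 4, 6, 8]
# Serially this would take ~1.0s; overlapped it takes ~0.2s.
print(f"{elapsed:.2f}s")
```

Note this wins only because the tasks spend their time waiting, not computing. Make `slow_fetch` CPU-bound and the same pattern buys you nothing but context-switch overhead, which is exactly the point of the paragraph above.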

And this is all on a single computer, where you have a good chance of everything actually happening and not losing any messages. Make it a distributed concurrent system and it gets even harder, but that's a story for another time.

by Leon Rosenshein

No UML

You've probably heard of the Unified Modeling Language (UML). You've probably even drawn up a few system diagrams with it. It's so popular that tools like Visio and Google Draw have the templates built in. But does it work? It's certainly good for analyzing an existing system and understanding the relationships between the parts, but what about up-front design? Doing UML right often requires you to know, or make decisions about, things that aren't done yet. Things where the requirements aren't even done yet, let alone the detailed design.

That's where NoUML comes in. If UML is supposed to be an abstraction of your system, NoUML is an abstraction of that. It's even more simplified, and it focuses not on the components, but on the _relationships_ between them. Everything else is left as an implementation detail.

Even the possible relationships are limited. There are only 3 kinds of relationships:

  • Is
  • Has
  • Uses


and they're very generic. For example, "Is" doesn't imply any language inheritance. It's duck typing without any types. Similarly, "Has" doesn't imply contains, just an association. "Uses" is for the things something needs, but that aren't associated in any other way. And of course there are rules for heritage and ownership, which help break things down into discrete components.

But where NoUML shines is in how those discrete components relate to each other. They help you see where the transitive dependencies are and help keep you from turning those transitive dependencies into direct dependencies. It helps you keep your abstractions from leaking out into neighboring components. It keeps your bounded contexts bounded. And when you're building large scale distributed systems anything you can do to keep your boundaries clean and clear will pay big dividends. If not immediately, then soon.
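It takes surprisingly little machinery to get value out of that relationship view. As a sketch (the component names are invented for illustration), represent the "Uses" edges as plain data and compute the transitive closure, so you can see which dependencies exist only transitively and shouldn't quietly become direct ones:

```python
# "Uses" edges for a toy three-component system.
USES = {
    "web": {"api"},
    "api": {"storage"},
    "storage": set(),
}

def transitive_uses(component: str) -> set:
    """Everything a component depends on, directly or transitively."""
    seen = set()
    stack = list(USES[component])
    while stack:
        dep = stack.pop()
        if dep not in seen:
            seen.add(dep)
            stack.extend(USES[dep])
    return seen

# "web" reaches "storage" only through "api". If a direct
# web -> storage edge ever appears, an abstraction has leaked.
print(transitive_uses("web") - USES["web"])  # {'storage'}
```

A check like this can even run in CI against a declared dependency file, failing the build when a transitive dependency gets promoted to a direct one.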

by Leon Rosenshein

Overcommunicating

I'm one of the folks down in the engine room (the infra team). Our job is to make sure that all of our customers are happy and their appetite for processing (batch, long running services, stateful, stateless, etc) and storage (HDFS, Blob, Posix, etc) is met. As such, we're essentially a service team. That doesn't mean we write services (we do that), but that we provide a service to our customers. And our customers use those services to do their work, whatever it is. And like other service teams, when things are going well and our customers' needs are being met they don't think about us too much. And that's the way it should be. Processing and storage as reliable as running water.

Of course, when things go wrong people notice. That's an incident. So we do what's called Incident Management. And that's made up of two things. The first is dealing with the problem. We mitigate the problem as soon as possible, understanding not just what happened, but why, then we do what we can to make sure it doesn't happen again.

The second part, which is at least as important, is communicating. Making sure people know there's a problem, and that we're working on it. Making sure people understand what caused it. This communication helps in lots of ways. It keeps people from wasting time trying to figure out what they're doing wrong when the problem isn't theirs. It keeps people from wondering what's going on and distracting us from fixing the problem. It helps our customers understand how they can make better use of our services without impacting others.

This communicating is such a big part of the process that it's one of the named roles in our SOP. We typically have two people on call for each area. When there's a problem the primary is the lead, responsible for making sure the problem is solved and that the right people are working on it. The secondary is the communicator. Their job is to keep everyone else informed so that the primary can work the problem. This includes posting to slack in #breaking, updating our dashboard with a banner, creating/updating an incident, and making sure that there's a Post-Incident review. They also handle all the ad-hoc queries that come in during the incident.

And while not everyone at ATG is working on services handling 1000s of requests/sec, we all have customers. Some internal, some at Core Business, and some external. And they all have expectations. So, whether you're responsible for a function call, library, REST/gRPC service, website, or a car showing up, what's your plan for when your customer complains? If you take a look at our SOP, it explicitly says it's not about how to solve the problem. It's about how to manage the process of solving the problem and making sure everyone who wants to know what's happening knows.

by Leon Rosenshein

I Feel You Man

Is your code empathetic? Do your users think it is? Do your users think you're empathetic? Is there anything you can do to change that? Does it matter?

All good questions. And of course it does. But before you can answer them you need to understand what empathy is. People often think that having empathy means that you feel what the other person is feeling, and that's part of it, but there's a more important part. Not feeling it, but knowing how they feel, and doing something about it. Being able to predict how someone is going to feel or react, and taking that into consideration before it happens.

And that's a learned ability. Something you can get better at. It's training yourself to think about things from the other side. When you're writing code, thinking about what it would be like for someone who doesn't have the context to see it for the first time. If you're wondering what that person would be feeling just take a look at some code you wrote over 6 months ago. Do it now. I'll wait.

That slightly lost feeling you had? Now you know how they would feel. Think about what would have made it easier for you to understand what you were reading. Now add that to your next PR.

Remember the last time you got an error like "There has been an unexpected error. Operation Failed."? Wasn't that helpful? Of course not. So do better. Think about user expectations. Give pointers and direction when there's a problem.
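A small illustration of the difference (the file names and messages are made up for the example): the same failure, reported two ways.

```python
def load_config_unhelpful(path):
    # The empathy-free version: the user learns nothing about what
    # went wrong or what to do next.
    try:
        with open(path) as f:
            return f.read()
    except OSError:
        raise RuntimeError("There has been an unexpected error.")

def load_config_helpful(path):
    # Same failure, but the message names the missing thing and
    # suggests a likely fix, so the user isn't left guessing.
    try:
        with open(path) as f:
            return f.read()
    except FileNotFoundError:
        raise FileNotFoundError(
            f"Config file {path!r} not found. "
            "Create it, or pass --config with the correct path."
        )
```

The code path is identical; the only extra work is imagining, ahead of time, what the person hitting the error will need to know.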

We can't make foolproof code, users can't read the developer's mind, and the code can't read the user's mind, so there are going to be mistakes. And that's where empathy comes in. Be prepared and make it easier for everyone.

Postscript

Now that we're all WDP, this is even more important. Communication is harder and slower, so mistakes are magnified, and clarity has an outsized impact. There's more than enough stress going around, so taking the time to not add to that stress really helps. And not just in the code. In every form of communication, take a deep breath, assume good intent, and don't overreact.

by Leon Rosenshein

Shiny Happy People

We've been working from home for almost 2 weeks now. How are you feeling about it? There's no question that there's been a huge disruption in the force. It's something we all need to get used to. And as someone pointed out the other day, this isn't really WFH, it's WDP (work during pandemic) which is not exactly the same thing. But there are some similarities. The priorities might have changed, and the relative ranking of taking care of yourself and your family vs work has certainly changed. But for most of us the work itself hasn't changed. The long term goals haven't.

What has changed is the environment. Both the physical and psychological environments are very different than they were 2 weeks ago. We're home, practicing social distancing, and dealing with our teammates over a screen. What used to be turning in your chair and asking a question is a much more involved project. There are new and different demands on our time. Feedback from coworkers is slower, but feedback from the people in the house is immediate.

So what can we, as employees and teams, do to be productive and feel fulfilled? I know for myself, one of the things I need to be productive is pixels. Way more pixels than my MacBook has. So I brought home my 2 external monitors. Those that work with me can attest that I don't sit down much, so I got myself a standing desk. And I'm fortunate enough to have a room at home that can be my office so I can "go to work" in the morning and "come home" at night. I used to walk to work, so now I take the dog for a walk to separate work from home.

Our team has its own little zoom chat that sits in the corner of one of my screens (remember all those pixels) so I'm not alone. But I'm not always there either. Sometimes I just wander off. And that's ok. Sometimes I focus, sometimes I'm checking on my reactor, and sometimes I'm watching some online continuous education video. Some of them are even work related :)

And above all, I've talked to my manager about things. About having the space and time to get things done, but also the time to not be doing things. And I have trust. Trust that I'm doing the right thing, that the company is doing the right thing, and trust that my manager trusts me to do the right thing. And sometimes I just watch silly videos. What works for you?

by Leon Rosenshein

Keeping It Clean

Compilers are really good at turning the code you write into some kind of executable, be it machine code that executes directly or ByteCode that gets run by some virtual machine. It always does what you tell it, and most of the time it does what you want.

But that doesn't make it clean. To write clean code, first you need to know what clean code is. There are lots of lists and acronyms that talk about what clean code is, but I think Martin Fowler summed it up best: "Any fool can write code that a computer can understand. Good programmers write code that humans can understand." It's not enough to make sure the compiler knows what you mean, you need to make sure future you (and your teammates) know what you mean.
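As a tiny illustration (the function is invented for the example), here's the same computation written for the compiler and written for the human:

```python
def f(a, b, c):
    # The compiler is perfectly happy with this. A human reading a
    # call site six months from now is not.
    return a + (b - a) * c

def lerp(start, stop, fraction):
    """Linearly interpolate between start and stop.

    fraction=0.0 returns start, fraction=1.0 returns stop.
    """
    return start + (stop - start) * fraction

print(lerp(0.0, 10.0, 0.5))  # 5.0
```

Identical behavior, identical performance; only the cognitive load on the reader differs.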

The code needs to be simple to understand (minimize cognitive load). It needs to be easy to navigate (although your IDE can help). It needs to do what the reader expects (from the library/class/method name) and nothing else. It should fail in predictable and informative ways.

But truly clean code goes beyond the actual source code. Comments (and doxygen notes in our case) should explain not what choices were made, but why. They should help the reader understand not only when to use something, but also when NOT to use it. And it's not just in the source itself. There are RFCs that give the big picture and the reasoning behind it. There might be presentations that explain some of the details. There might be codelabs that walk a user through use cases. Maybe a reference implementation or two.

In addition to the usual web links and links to other posts, there are 2 links to books on the O'Reilly/Safari website (Clean Code and Code Complete). That website is a treasure trove of books, presentations, and online classes. If you find yourself with some spare time and bandwidth, check out what's there.

by Leon Rosenshein

On Conflict

Conflict is real, inevitable, and everywhere. Every comment on a PR or RFC is conflict. So is every question in a design review. If you're on a team that truly has no visible conflict then take a deep look around and try to figure out why. It's probably not a good sign. I doubt you all really agree about everything.

Unless you never make decisions anyone disagrees with, every decision you make involves some conflict. Everyone who disagrees with your decision puts a cost on the impact of your decision and a cost on the potential conflict (raising the issue). Most people tend to dislike conflict so that cost is relatively high. In many (most?) cases, the cost of the decision appears relatively low, so the conflict is never voiced. And generally that's ok and the team and work progress.

But is that really the best thing? Is conflict bad? Of course it can be. And unvoiced conflict isn't much better. There are two obvious problems with unvoiced conflict. First, the person who isn't saying anything feels stifled, unheard, and diminished. That's never a good thing. And in these days of Zoom and WFH, it's even easier to feel that way.

Second, we're missing out on important points and opportunities. There's a high probability that by not addressing the issue you're taking on unknown technical debt that is just going to bite you later. It might very well be the right thing to do, but we should never take on future work without being aware of it.

So what can we do about it? The part that we all can work on is reducing the cost of conflict. Keep it about the topic. The Cambridge Dictionary says conflict is "an active disagreement between people with opposing opinions or principles." The key here is that it's people with opposing opinions, not opposing people. So talk about the ideas. Be open to new ideas. Listen to the other person. That person has decided to expend significant energy to raise the issue. Respect that commitment and listen to the ideas. Talk about the ideas and how they interact. Talk about use cases and future implications. Don't talk about people.

At its best, conflict is more like improv. The best jam sessions grow not out of agreement, but out of collaboration. Reaching the best possible solution that includes previously missed thoughts and ideas. The kind of conflict that starts with "I don't think that solution handles 'X' well," and ends with it handling 'X', 'Y', and 'Z'.