
by Leon Rosenshein

Time For A Walk

I don't know about you, but I've been spending a lot of time in one spot for the past few weeks. It's great that I've got a workspace with a stand-up desk and enough pixels to keep all the text I need to read big enough to see, but I don't even get to walk to the various conference rooms anymore. And that's a problem for me. No changes of scenery. No break in the pattern. Less exercise. So what's a developer to do?

One option is to take an older idea and update it for today's WDP (work during pandemic) times: the walking meeting. We've probably all had them. Sometimes it was a 1:1 with the boss, sometimes it was walking down the block to grab a cup of coffee that isn't from the office kitchen. Obviously we can't do things quite the same way now, but in some ways it's gotten easier to have a walking meeting. The meeting is already on Zoom, so location doesn't matter. Especially if it's a brainstorming session or a 1:1, it might not matter where you are. Just grab your headphones and phone and take a walk during the meeting.

And there are real benefits. A change in scenery. A little exercise. Fresh air. Some direct sunlight and vitamin D. And there's evidence that people are more creative when they're moving. Even back in the office, when I got stuck on a problem, a few circles around the floor often got me out of my rut and moving toward the solution again. Give it a shot. It can't hurt.

by Leon Rosenshein

Staying In Sync

Technology changes. A lot. All the time. So how do you stay on top of it? How do you know what’s coming so you can be prepared?

One important thing to remember is that you can’t be an expert in everything, so don’t try to be. You’ll have your strengths, things that you’re better at, or enjoy working on more, and you should always play to them. There are going to be things that are “outside your wheelhouse”, and that’s fine. Unless you’re working completely isolated, you probably shouldn’t work to make those things your strengths. But that doesn’t mean you should ignore those things.

It starts with knowing the tools you use. Knowing the capabilities of the standard libraries and what the common extensions are. What their strengths and limitations are, and when to use them and when to avoid them. That includes not just the built-in libraries, but also the open source and internal libraries. Don't forget about the other dev tools we use, such as simple editors (vim vs. emacs), IDEs (VSCode, JetBrains, emacs), command shells (bash, zsh, fish?), Git, GitHub, and Buildkite, to name a few. Sometimes a bash one-liner is the right solution, sometimes you need PySpark or Zeppelin.

Then there’s the ecosystem we work in. AWS and all its offerings. DevEx/Infrastructure tools (bonsai, sq, infra, batch-api, HDFS, Piper, atlantis, etc.). Core Business infrastructure (M3, umonitor, usso/ussh, uOwn, Querybuilder/Queryrunner, etc.). To help navigate all of that, we have the Product Catalog, which lets our internal customers know what technologies are supported and should be used.

But what about new things? The up and coming tools/technologies that might not be ready, but you should be thinking about so you’re ready for them when they’re ready for you. For general software development/architecture Thoughtworks is a good place to start, particularly their Technology Radar. InfoQ and DZone are other good resources, and they’ll send you a daily list of articles for topics you sign up for if you want.

And however you learn about new things, share them. That way everyone benefits.

by Leon Rosenshein

Sharding Shouldn't Be Hard

When you get down to it, most of what we do is keep track of things. Objects in motion. Objects at rest. Files. Data in files. Where we put the files. Metadata about the data. Who wrote the data. Who read the data. For source control systems (Git, TFS, Perforce, VSS, etc), that's all they do. And when that's your business you keep track of a lot of data. Typically that means storing the data in a database.

And when you're starting out you keep everything in the same database to make joins and queries faster and management easier. As your business grows you grow your DB. That's what we did when we were making maps for Bing. We used Microsoft SQL Server to track every file, ingested or created, in the system. At its peak we were managing billions of files across 3K machines and 300 PB of spinning disk. All managed by a single SQL Server instance. Just keeping that beast happy was a full-time job for a DBA, a SQL architect, and a couple of folks who worked on the client. We had a dedicated high-speed SAN backing it and 128 CPU cores running things.

When Uber acquired us we were in the process of figuring out how to split things up into something we could scale horizontally, because even without needing to pay SQL Server's per-core license fees we had pretty much hit the ceiling on vertical scaling. Every once in a while someone would run a bad query, or the query would be larger than expected, and everything would come crashing to a halt because of table locks, memory limits, or just lack of CPU cycles. We knew we had to do something.

And it wasn't going to be easy. We had hundreds of cars out collecting data, dozens of folks doing the manual part of map building, and of course those 3K nodes (~100K cores) doing the automated part of map building. We couldn't just stop doing that for a few days or weeks while we migrated, and we didn't have the hardware to set up a parallel system, so we had to migrate in place. So we thought about it. We came up with our desired system. Isolated, scalable, reliable. And we girded our loins and prepared to make the switch.
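The horizontal split we were designing boils down to routing each key to one of several independent databases. Here's a minimal sketch of hash-based shard routing; the class and keys are hypothetical, not the actual design we drew up:

```python
import hashlib

class ShardedFileIndex:
    """Toy hash-sharded file index: each path routes to one of N
    independent stores so no single instance holds everything."""

    def __init__(self, num_shards=4):
        self.num_shards = num_shards
        # Stand-ins for separate database instances.
        self.shards = [dict() for _ in range(num_shards)]

    def _shard_for(self, key):
        # Stable hash so the same key always lands on the same shard.
        digest = hashlib.sha256(key.encode()).hexdigest()
        return int(digest, 16) % self.num_shards

    def put(self, path, metadata):
        self.shards[self._shard_for(path)][path] = metadata

    def get(self, path):
        return self.shards[self._shard_for(path)].get(path)

index = ShardedFileIndex(num_shards=4)
index.put("/maps/tiles/001.dat", {"size": 1024})
print(index.get("/maps/tiles/001.dat"))
```

The hard part, of course, isn't the routing function; it's moving billions of existing rows onto the shards, in place, without stopping the pipeline. That's what made the migration daunting.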

Then we got lucky. We got acquired and never had to go through with our plans. Not everyone is that lucky. GitHub ran into many of the same problems, but they had to implement their fix. Read on to see how that worked out and why there have been so many GitHub incidents recently.

by Leon Rosenshein

To Test Or Not To Test

The NA repo has a requirement on code coverage. And not just a requirement, a high one. And that makes sense. The Autonomy code is mission critical, man-rated, and people's lives depend on it not going wrong. But that requirement applies not just to the Autonomy code; it applies to everything in the repo: frameworks, web services, and command line tools, among other things. And that makes sense too. Yes, unit tests add some up-front cost, but overall they speed things up. They do that by giving you instant feedback and confidence.

I made a change to one of our CLI tools a couple of months ago. Just a small change to allow the user to use a yubikey for two-factor auth when getting a usso certificate. Simple. Add a flag, check the flag's value, act on it. So I implemented it. Did some simple testing to make sure that it did what I wanted, then got ready to do the PR. Now our PR template has a checkbox to indicate that tests have been run. The PR build gates on passing all the tests, so we don't block PRs that haven't checked the box, but still, it's a good idea and doesn't take much time, so I ran them. And they failed. Seems there was another entry point I hadn't considered. Luckily there were tests and it was a quick fix to get the tests passing. That also reminded me to add some tests for my new flag. And those tests found some more edge cases.
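The "add a flag, check the flag's value, act on it" change and its tests looked something like this. This is a hypothetical reconstruction (the flag name, function names, and auth methods are made up for illustration), not the actual tool's code:

```python
import argparse

def build_parser():
    # Hypothetical sketch of the change: a --yubikey flag that
    # switches the two-factor method used when fetching a cert.
    parser = argparse.ArgumentParser(prog="get-cert")
    parser.add_argument("--yubikey", action="store_true",
                        help="use a YubiKey for two-factor auth")
    return parser

def choose_second_factor(args):
    # The behavior the flag controls.
    return "yubikey" if args.yubikey else "push"

# The kind of cheap unit tests that catch a forgotten entry point:
def test_default_is_push():
    args = build_parser().parse_args([])
    assert choose_second_factor(args) == "push"

def test_yubikey_flag():
    args = build_parser().parse_args(["--yubikey"])
    assert choose_second_factor(args) == "yubikey"
```

Tests like these take minutes to write, and they're what failed when the second entry point didn't thread the flag through.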

So yes, the process from thinking I was done to actually submitting the PR was about 2 hours longer than it would have been without the tests, but that feature has been out for a while now, and people have used it. There haven't been any issues, and it's solved not only the yubikey problem, but it's saved folks who have multiple phones associated with their account as well. If those tests hadn't been there, something would have been interrupted to get a fix out, it would have taken more time to get it working, and our users would have had to deal with bugs for a while. So those couple of extra hours up front saved a couple of days' time across the org. And that's just for a simple CLI. Think about tools and services that are core to our workflow. If they're down for an hour, that's thousands of man-hours lost.

by Leon Rosenshein

Go Big, Go Blue

ATG isn't a small company anymore. ZenHub was a good idea, but it didn't scale as far as we had hoped, and Phab wikis are deprecated. AWS identities are confusing and we're already reaching scale limits on accounts, roles, and cross-connects. Google drive is great for storing and collaborating on documents, but discovery is nigh-on impossible. Yet work must continue. We still need to plan and execute. We need to track defects. We need to build and store artifacts at scale. We need monitoring, alerting, analytics, and dashboards. We need to share documentation. We need to meet all of those needs (and more) with an integrated solution that extends our current usage of GitHub.

Therefore, we're moving to Azure DevOps. With Azure we start with the source control system we're already using and then extend it with compute, storage, identity management, work item tracking, document management, analytics, communications, and a CI/CD system designed and built, from the ground up, for the enterprise customer.

Regardless of what you think of the quality of the OS, the Azure DevOps system scales to tens of thousands of users working on multi-million line codebases like Windows and Office 365, with distributed build and test systems that can scale up and down as needed to meet demand. Work item tracking that lets you search the entire corpus of items, but also lets you build complex queries, handle fan-out/fan-in dependencies, and do bulk item entry/update, either online or offline.

Identity management that scales to include not just every user, but every computer, desktop and DC, integrates location and status with email, voice/video chat, and fine grained access control. Identity management that manages not only email, documents, and chat, but also includes every process running on any node, desktop to service, service to service, and on-behalf-of authentication (batch API for example).

Best in class document editing capabilities. Online and offline editing with full markup and review capabilities. Cross-linked pivot tables that pull directly from source of truth databases. Project plans that include resource availability, Gantt charts, critical path analysis, and full programmatic access and linking so they're always up to date with respect to work item tracking.

Time series databases? SQL databases? NoSQL databases? Integrated queries? Analytics? Dashboarding? Alerting? All part of the system.

Migration began this morning and should be done by EOD. Enjoy.

by Leon Rosenshein

Concurrency And Parallelism

Getting computers to do what you want is hard. Getting them to do more than one thing at a time is even harder. In fact, for a given CPU core, it's impossible. But they can do things so fast, and switch between things so fast, that it looks like they're doing more than one thing at a time. That's concurrency. Add in multiple cores and in fact they are doing multiple things at once. That's parallelism. Or at least they might be doing multiple things at once, but you won't know until after it happens. So how do you manage that?

The easiest way to manage it is to not manage it. Make a bunch of independent tasks, then throw them at the threading model/library built into your favorite language/package. In almost all cases it will be better than what you can write. As long as you remember the two key things. First, the tasks really need to be independent. As long as they are, you don't need to worry about them getting interrupted or having side effects. Second, if the tasks really are independent and you have more tasks than cores, it's going to be slower than doing them serially per core.

Why will it be slower? Because switching between those tasks takes time. Time that could have been spent doing useful work. Or, more likely, the tasks aren't independent, so you end up doing more work to make them independent or putting in some kind of guard code (mutexes, semaphores, etc.) to make it safe.
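Here's a minimal sketch of what that guard code looks like when tasks share state. The counter and thread counts are arbitrary; the point is that the lock is what makes the increments safe to interleave:

```python
import threading

# When tasks share state they're no longer independent; without a
# guard, concurrent read-modify-write updates can interleave and
# lose increments.
counter = 0
lock = threading.Lock()

def increment(n):
    global counter
    for _ in range(n):
        with lock:  # the guard code: serialize the update
            counter += 1

threads = [threading.Thread(target=increment, args=(10_000,))
           for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counter)  # exactly 40000 with the lock; often less without it
```

And that lock is exactly the kind of extra work (and extra serialization) that can make the concurrent version slower than just doing the work in order.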

So, when does it make sense to be concurrent? The simplest case is when you would need to wait for an external resource. If you make a web request it could take seconds to get a response. Rather than block, do the wait concurrently. If you're reading from disk it's probably faster than a web call, but milliseconds are still longer than microseconds, so you might be able to get some work done in the meantime.
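Overlapping those waits is where a stock thread pool earns its keep. A small sketch, with `time.sleep` standing in for a real web request (the URLs and timings are made up):

```python
import concurrent.futures
import time

def fetch(url):
    # Stand-in for a web request: sleep to simulate waiting on I/O.
    time.sleep(0.1)
    return f"response from {url}"

urls = [f"https://example.com/{i}" for i in range(8)]

start = time.time()
# Independent tasks thrown at the library's pool, as described above.
with concurrent.futures.ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(fetch, urls))
elapsed = time.time() - start

# Eight 0.1 s waits overlap, so this finishes in roughly 0.1 s
# instead of the roughly 0.8 s a serial loop would take.
print(f"{len(results)} responses in {elapsed:.2f}s")
```

The tasks here really are independent, which is why handing them to the pool is a pure win; the moment they share state, you're back to the guard-code tradeoff.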

And this is all on a single computer, where you have a good chance of everything actually happening and not losing any messages. Make it a distributed concurrent system and it gets even harder, but that's a story for another time.

by Leon Rosenshein

No UML

You've probably heard of Unified Modeling Language (UML). You've probably even drawn up a few system diagrams with it. It's so popular that tools like Visio and Google Draw have the templates built in. But does it work? It's certainly good for analyzing an existing system and understanding the relationships between the parts, but what about up-front design? Doing UML right often requires you to know, or make decisions about, things that aren't done yet. Things where the requirements aren't even done yet, let alone the detailed design.

That's where NoUML comes in. If UML is supposed to be an abstraction of your system, NoUML is an abstraction of that. It's even more simplified, and it focuses not on the components, but on the _relationships_ between them. Everything else is left as an implementation detail.

Even the possible relationships are limited. There are only three kinds of relationships:

  • Is
  • Has
  • Uses


and they're very generic. For example, "Is" doesn't imply any language inheritance. It's duck typing without any types. Similarly, "Has" doesn't imply containment, just an association. "Uses" is for the things something needs but isn't associated with in any other way. And of course there are rules for heritage and ownership, which help break things down into discrete components.
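The three relationship kinds can be sketched as nothing more than labeled edges between component names. This is my own toy illustration (the components and any tooling implied are hypothetical, not part of NoUML itself), showing how transitive dependencies fall out of a simple walk:

```python
# Components are just names; relationships are (kind, from, to)
# edges using NoUML's three kinds: is, has, uses.
edges = [
    ("is",   "AdminUser",      "User"),
    ("has",  "Order",          "LineItem"),
    ("uses", "Order",          "PricingService"),
    ("uses", "PricingService", "TaxTable"),
]

def direct_deps(component):
    return {dst for kind, src, dst in edges if src == component}

def transitive_deps(component):
    # Depth-first walk to collect everything reachable.
    seen, stack = set(), [component]
    while stack:
        for dep in direct_deps(stack.pop()):
            if dep not in seen:
                seen.add(dep)
                stack.append(dep)
    return seen

# "Order" only touches PricingService directly, but the walk shows
# it also reaches TaxTable: a dependency worth keeping transitive.
print(transitive_deps("Order"))
```

Seeing `TaxTable` show up in `Order`'s reachable set, without any direct edge, is exactly the kind of relationship the diagram is meant to surface.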

But where NoUML shines is in how those discrete components relate to each other. It helps you see where the transitive dependencies are and keeps you from turning those transitive dependencies into direct dependencies. It helps you keep your abstractions from leaking out into neighboring components. It keeps your bounded contexts bounded. And when you're building large-scale distributed systems, anything you can do to keep your boundaries clean and clear will pay big dividends. If not immediately, then soon.

by Leon Rosenshein

Overcommunicating

I'm one of the folks down in the engine room (the infra team). Our job is to make sure that all of our customers are happy and their appetite for processing (batch, long running services, stateful, stateless, etc.) and storage (HDFS, Blob, Posix, etc.) is met. As such, we're essentially a service team. That doesn't mean we write services (we do that), but that we provide a service to our customers. And our customers use those services to do their work, whatever it is. And like other service teams, when things are going well and our customers' needs are being met, they don't think about us too much. And that's the way it should be. Processing and storage as reliable as running water.

Of course, when things go wrong people notice. That's an incident. So we do what's called Incident Management. And it's made up of two things. The first is dealing with the problem. We mitigate the problem as soon as possible, understand not just what happened but why, and then do what we can to make sure it doesn't happen again.

The second part, which is at least as important, is communicating. Making sure people know there's a problem, and that we're working on it. Making sure people understand what caused it. This communication helps in lots of ways. It keeps people from wasting time trying to figure out what they're doing wrong when the problem isn't theirs. It keeps people from wondering what's going on and distracting us from fixing the problem. It helps our customers understand how they can make better use of our services without impacting others.

This communicating is such a big part of the process that it's one of the named roles in our SOP. We typically have two people on call for each area. When there's a problem, the primary is the lead, responsible for making sure the problem is solved and that the right people are working on it. The secondary is the communicator. Their job is to keep everyone else informed so that the primary can work the problem. This includes posting to slack in #breaking, updating our dashboard with a banner, creating/updating an incident, and making sure that there's a Post-Incident review. They also handle all the ad-hoc queries that come in during the incident.

And while not everyone at ATG is working on services handling 1000s of requests/sec, we all have customers. Some internal, some at Core Business, and some external. And they all have expectations. So, whether you're responsible for a function call, library, REST/gRPC service, website, or a car showing up, what's your plan for when your customer complains? If you take a look at our SOP, it explicitly says it's not about how to solve the problem. It's about how to manage the process of solving the problem and making sure everyone who wants to know what's happening knows.

by Leon Rosenshein

I Feel You Man

Is your code empathetic? Do your users think it is? Do your users think you're empathetic? Is there anything you can do to change that? Does it matter?

All good questions. And of course it matters. But before you can answer them you need to understand what empathy is. People often think that having empathy means that you feel what the other person is feeling, and that's part of it, but there's a more important part. Not feeling it, but knowing how they feel, and doing something about it. Being able to predict how someone is going to feel or react, and taking that into consideration before it happens.

And that's a learned ability. Something you can get better at. It's training yourself to think about things from the other side. When you're writing code, thinking about what it would be like for someone who doesn't have the context to see it for the first time. If you're wondering what that person would be feeling just take a look at some code you wrote over 6 months ago. Do it now. I'll wait.

That slightly lost feeling you had? Now you know how they would feel. Think about what would have made it easier for you to understand what you were reading. Now add that to your next PR.

Remember the last time you got an error like "There has been an unexpected error. Operation Failed."? Wasn't that helpful? Of course not. So do better. Think about user expectations. Give pointers and direction when there's a problem.
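One way to "do better" is to make the error itself carry the pointers and direction. A small sketch of the contrast (the config file, search paths, and `CONFIG_PATH` variable are all made up for illustration):

```python
class ConfigError(Exception):
    pass

def load_config(path, known_paths):
    # Unhelpful version: raise ConfigError("Operation Failed.")
    # Helpful version: say what went wrong, where we looked, and
    # what the user can try next.
    if path not in known_paths:
        raise ConfigError(
            f"Could not find config file '{path}'. "
            f"Looked in: {', '.join(sorted(known_paths))}. "
            "Check the path, or set the (hypothetical) CONFIG_PATH "
            "environment variable to point at your config."
        )
    return known_paths[path]

try:
    load_config("/etc/app.yml", {"/home/user/app.yml": {"debug": True}})
except ConfigError as e:
    print(e)
```

Same failure either way, but the second message lets the user fix the problem themselves instead of filing a ticket.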

We can't make foolproof code, users can't read the developer's mind, and the code can't read the user's mind, so there are going to be mistakes. And that's where empathy comes in. Be prepared and make it easier for everyone.

Postscript

Now that we're all WDP, this is even more important. Communication is harder and slower, so mistakes are magnified, and clarity has an outsized impact. There's more than enough stress going around, so taking the time to not add to that stress really helps. And not just in the code. In every form of communication, take a deep breath, assume good intent, and don't overreact.

by Leon Rosenshein

Shiny Happy People

We've been working from home for almost 2 weeks now. How are you feeling about it? There's no question that there's been a huge disruption in the force. It's something we all need to get used to. And as someone pointed out the other day, this isn't really WFH, it's WDP (work during pandemic) which is not exactly the same thing. But there are some similarities. The priorities might have changed, and the relative ranking of taking care of yourself and your family vs work has certainly changed. But for most of us the work itself hasn't changed. The long term goals haven't.

What has changed is the environment. Both the physical and psychological environments are very different than they were 2 weeks ago. We're home, practicing social distancing, and dealing with our teammates over a screen. What used to be turning in your chair and asking a question is a much more involved project. There are new and different demands on our time. Feedback from coworkers is slower, but feedback from the people in the house is immediate.

So what can we, as employees and teams, do to be productive and feel fulfilled? I know for myself, one of the things I need to be productive is pixels. Way more pixels than my MacBook has. So I brought home my 2 external monitors. Those that work with me can attest that I don't sit down much, so I got myself a standing desk. And I'm fortunate enough to have a room at home that can be my office, so I can "go to work" in the morning and "come home" at night. I used to walk to work, so now I take the dog for a walk to separate work from home.

Our team has its own little Zoom chat that sits in the corner of one of my screens (remember all those pixels) so I'm not alone. But I'm not always there either. Sometimes I just wander off. And that's ok. Sometimes I focus, sometimes I'm checking on my reactor, and sometimes I'm watching some online continuous education video. Some of them are even work related :)

And above all, I've talked to my manager about things. About having the space and time to get things done, but also the time to not be doing things. And I have trust. Trust that I'm doing the right thing, that the company is doing the right thing, and trust that my manager trusts me to do the right thing. And sometimes I just watch silly videos. What works for you?