I’m pretty sure this story is true, because I’ve heard it too many times, sometimes from people who could have been there. The people involved and the timeline match up. Also, even if it’s not true, there’s still something to learn.
Amazon has always been about 2-pizza teams. You should be able to feed an entire team lunch with 2 pizzas. The idea is to keep them agile and innovative. To minimize communications delays and bottlenecks. It works pretty well too. It says nothing about the software architecture, only the scope of responsibility of a team.
Back around 2002, Amazon’s internal system was a large monolith. And there were lots of 2-pizza teams trying to work on it at the same time. It was pretty well designed, but with that much going on there were lots of interactions and coupling between teams and the work they were doing. So much that it really started to slow things down. It got so bad that Jeff Bezos issued an ultimatum.
All teams will henceforth expose their data and functionality through service interfaces.
That’s a pretty bold requirement. It meant that everything needed to change, from the code to the tooling to the deployment processes and automation. Over the next 2-3 years, Amazon did it. They changed to a Service-Oriented Architecture that endures to this day. It broke a lot of the direct coupling that had crept into the monolith. And it led directly to the creation of AWS, which went on to become the dominant cloud platform. A cloud platform that can do everything from hosting Netflix’s compute and storage to hosting this blog.
It did that by clearly defining boundaries and letting teams innovate inside those boundaries to their hearts’ content. But it also led to some new problems. Because each team was responsible for everything inside those boundaries, teams started to write their own versions of things that were once shared libraries. And we all know that duplicated code is bad. You duplicate bugs. It makes updates hard. Teams ended up with code they were responsible for that they didn’t necessarily understand.
Enter Brian Valentine. He’d recently joined Amazon (in 2006) as a Senior VP, coming from Microsoft, where he’d led, among other things, the core Windows OS team. That was a huge organization, with thousands of people developing the hundreds of libraries and executables that made up Windows. He looked at what was going on and realized that lots of teams were writing the same code. That there were multiple implementations of the same functionality scattered throughout the codebase. That it was inefficient and wasteful, and that those sorts of functionality should be provided by a set of core 2-pizza teams so that the other teams could focus on their specific parts of the business.
He worked with his team and his peers to define a system where those core teams would be identified and created, then all the other teams would start using them. They wrote the 6-pager that defined the system, how it would be created, and all the benefits. Eventually it got to a meeting with Jeff Bezos, then Amazon CEO. I believe everything up to this point is true. Here’s where it gets apocryphal. I want to believe it’s true, but I just don’t know.
After the required reading, Valentine summarized the point of the meeting by writing a single line on the whiteboard:
1 > 2
Huh? One is definitely not greater than two. One is strictly less than two. What Valentine meant was that having one way to do some shared bit of functionality that is actually shared, not copied/reimplemented, is better. It’s more efficient. It means less duplicated effort. It lets teams focus on their value add instead of doing the same thing everyone else was doing. That makes sense. So that’s what Amazon does now, right?
Nope. After Valentine said that was how things should be done and stepped back, Bezos stepped up and changed it slightly, to:
1 > 2 > 0
What? That makes even less sense. Two is greater than zero, and so is one, but two is not between one and zero. What Bezos was saying was that having one solution might be better than having two, but waiting for some central team to build something new, or to update some existing service to have a new capability, takes time. It adds coupling back in. It makes things more complex. And while you’re waiting for the central team to do its part, the dependent team can’t do anything. So for some, potentially long, period of time, you don’t have 1 solution, you don’t have 2 solutions, you have 0 solutions. And having zero solutions is even worse than having multiple solutions. The plan pretty much ended there. Like they say in the Go proverbs, “A little copying is better than a little dependency.”
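To make that proverb a bit more concrete, here is a minimal sketch of what "a little copying" might look like for one team. Everything in it is made up for illustration: the `chunked` helper, the order IDs, and the idea that a central utilities library would eventually offer the same thing. It is not anything Amazon actually wrote.

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def chunked(items: Sequence[T], size: int) -> Iterator[List[T]]:
    """Yield consecutive batches of at most `size` items.

    Hypothetical local copy: this team owns these few lines today instead of
    waiting on a shared library that does not exist yet (2 > 0).
    """
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


# Hypothetical usage: batch order IDs before calling a downstream service.
order_ids = [101, 102, 103, 104, 105]
for batch in chunked(order_ids, 2):
    print(batch)
```

A helper this small is cheap to own and easy to delete later, which is roughly the trade being made: the team ships now, and swapping in a shared version later is a one-line change at each call site.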
Which is not to say you never go back and centralize common things. Amazon doesn’t expect every team to write their own operating system. They don’t write their own S3 access layer. They use Linux, or the AWS SDK. And when there is a critical mass of people doing something common, then you write a centralized library that is shared. Or write a new service with a defined interface that everyone should call.
The trick is to do it at the right time. After there are enough instances to let you know what to build, but before there are so many versions that you can’t replace them all with the central version.