Recent Posts

by Leon Rosenshein

For Want Of A Nail

We build fault tolerant distributed systems. Not just fault tolerant distributed systems, but fault tolerant systems built on top of fault tolerant distributed systems. And they're connected by a distributed, fault tolerant network, using diverse fibers from different companies. That's a lot of fault tolerance. You'd think something that tolerant couldn't fail. Yet sometimes we get failures.

Most of the time they're simple failures and the system is just degraded. We lose a fiber and traffic gets a little slower. We lose a few nodes and the rest of the system picks up the work. Some bad data slips into the system, so instead of surge pricing the cost stays flat. There are problems, but they're localized and isolated, and the system gets better when the problem is fixed.

But sometimes the problem triggers a feedback loop. That's when things get interesting, and get interesting fast. Think back to that fiber we lost in the earlier example. Generally, no big deal. Traffic is rerouted and things move on. But what if we lost 10% of our capacity and we only had 15% overhead? Everything's fine, right? Usually, but not always. Suddenly things take longer, so we get a few timeouts. But since the system is fault tolerant, instead of throwing an error we retry the message. Now we've increased our network traffic, so we get more timeouts, which increases the traffic. Then some service somewhere gets overloaded because of the traffic increase. But that's ok because we've got auto-scaling that notices and spins up more instances of the service. That spreads the load, but increases the traffic further, and now our healthchecks are impacted, so the proxy decides to stop sending traffic to those "failed" instances. So the load on the remaining instances increases, and they actually fall over. Now we've got even more traffic, and nothing left to respond to it.
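That retry amplification is also why client retries usually come with a cap, exponential backoff, and jitter. Here's a rough sketch of the idea in Go; `callOnce` and all of the numbers are invented for illustration, not taken from any real system:

```go
package main

import (
	"errors"
	"fmt"
	"math/rand"
	"time"
)

// callOnce stands in for a network request. Here it always fails so we can
// watch the retry policy give up instead of hammering the network forever.
func callOnce() error {
	return errors.New("timeout")
}

// callWithBackoff retries a failing call a bounded number of times, with
// exponential backoff and jitter, so a cluster-wide hiccup doesn't turn into
// a synchronized retry storm.
func callWithBackoff(maxAttempts int, baseDelay time.Duration) error {
	var err error
	for attempt := 0; attempt < maxAttempts; attempt++ {
		if err = callOnce(); err == nil {
			return nil
		}
		// Exponential backoff: baseDelay * 2^attempt, plus random jitter so
		// thousands of clients don't all retry at the same instant.
		backoff := baseDelay * time.Duration(1<<attempt)
		jitter := time.Duration(rand.Int63n(int64(backoff)))
		time.Sleep(backoff + jitter)
	}
	return fmt.Errorf("giving up after %d attempts: %w", maxAttempts, err)
}

func main() {
	if err := callWithBackoff(3, 10*time.Millisecond); err != nil {
		fmt.Println(err) // surface the failure instead of retrying forever
	}
}
```

The specific numbers don't matter. What matters is that the client eventually gives up and surfaces an error instead of adding ever more load to a network that's already struggling.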

It's always the little things. Someone with an excavator dug a little too deep or a little too wide and broke a buried fiber-optic cable in Kansas. Some messages were retried to route around the gap. Retries caused healthchecks to fail. Failed healthchecks concentrated traffic onto a few instances of a service in Virginia. Increased load meant the service couldn't respond in time. Jane Public in Cape Town pushed a button and a car didn't show up. And that's a cascading failure. And that's how well-designed, isolated, localized, fault tolerant systems die.

One of the hardest problems when dealing with fault tolerant distributed systems is that the fault tolerance that keeps them working up to a point is the very thing that takes them out when you pass that point. And these failures are very hard to recover from, because all of the things you're doing to recover are suddenly making the problem worse. To the point where sometimes the best solution is turning it off and back on. Sometimes physically, and sometimes you can get away with a virtual reboot.

There are lots of ways to help prevent cascading failures, but they mostly come down to figuring out when to stop trying to be fault tolerant and just fail quickly. And restart quickly. You can read more (and find more links) in these articles.
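One common version of "fail quickly" is a circuit breaker: after enough consecutive failures, stop calling the struggling dependency for a while and return an error immediately. Here's a toy sketch in Go, with made-up thresholds and no concurrency handling, just to show the shape:

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// breaker is a toy circuit breaker: after too many consecutive failures it
// "opens" and fails fast for a cooldown period instead of piling more load
// onto a struggling dependency.
type breaker struct {
	failures    int
	maxFailures int
	openUntil   time.Time
	cooldown    time.Duration
}

var errOpen = errors.New("circuit open: failing fast")

func (b *breaker) call(fn func() error) error {
	if time.Now().Before(b.openUntil) {
		return errOpen // fail fast, don't even try
	}
	if err := fn(); err != nil {
		b.failures++
		if b.failures >= b.maxFailures {
			// Too many failures in a row: stop sending traffic for a while
			// so the dependency has a chance to recover.
			b.openUntil = time.Now().Add(b.cooldown)
			b.failures = 0
		}
		return err
	}
	b.failures = 0 // a success resets the count
	return nil
}

func main() {
	b := &breaker{maxFailures: 3, cooldown: time.Second}
	flaky := func() error { return errors.New("timeout") }

	for i := 0; i < 6; i++ {
		fmt.Println(i, b.call(flaky))
	}
}
```

A real implementation would need locking, a half-open probe state, and metrics, but even this toy version captures the key behavior: past a certain point, the kindest thing you can do for the system is to stop trying.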

by Leon Rosenshein

Generally Speaking

What shape are you? What shape do you want to be? What shape should you be? Is there a right shape? Is there a wrong shape? And what does shape even mean here? All good questions. Let's take them in reverse order.

In this case shape refers to a person's familiarity/ability in a cross-section of technologies. The generalist, jack of all trades, but master of none. The deep specialist, who knows more and more about less and less until they know everything about nothing. The focuser, not as deep as the specialist, but with a broad base. And the serial specialist, who focuses on different things at different times in their career. It's a histogram with N bins and an average/variance across the bins.

For the individual, there is no wrong shape, with the possible exception of the generalist who thinks they're an expert in everything. That's never a good idea. Outside of that, every "shape" has value. It just might not be valuable to a specific team at a specific time. An expert in SQL query optimization isn't the best person when the team is working on the mobile UI. Similarly, for an individual there is no right answer. If you're a UI developer with a focus on Android then you might be the right person for that team I mentioned.

So there's no shape that you should be. There's a place for every shape, and every shape can add value.

That leaves the last two questions, and only you can answer those. Figure out what you want. What makes you happy and fulfilled. Then figure out what shape you are, and close the gap. While you're closing the gap figure out where that shape can add the most value. It will be good for you and whatever team/group/organization you end up with. And win-win outcomes are the best for everyone.

by Leon Rosenshein

Guides

Last year at the Software Architecture conference I saw a presentation by James Thompson on "Beyond Accidental Architecture". Afterward he and I got to talking about the topic and my position that everything is really about scope of work and scope of influence, and that the difference between a senior architect and a junior developer is not in the kinds of problems being solved (time, complexity, interfaces, etc.), but in the scope. Systems, platforms, and frameworks vs classes, functions and interfaces. 3 year roadmaps vs sprint planning. Raising the effectiveness of a team vs a brown bag on a new library. Bridging the gap between disciplines (legal, finance, product) vs between developers on a team.

And not just what, but how. The wider the scope, the less direct control there is and the more influence is needed. You don't get to tell folks what to do; you work with them to find shared solutions and raise the bar for everyone. Which brings me to James' follow-on article about the "Software Architect as Guide". It's an interesting take and one that really resonates with me.

by Leon Rosenshein

Gazinta's and Gazouta's

A long time ago I was working on a simulated environment for a hardware mission computer on a 1553 bus, and my boss told me to make sure to match the gazintas and gazoutas. As a freshly graduated Mechanical and Aerospace engineer who found himself doing hardware-in-the-loop work with computers, I was somewhat confused. He told me that they were the things (data/information) that go into (gazinta) or go out of (gazouta) each device on the bus. That made sense. And for information to flow on a 1553 bus everything had to be defined up front. We had a data dictionary we had to implement, and we had to get it right because we had to talk to real hardware that already existed. And that was a challenge, because our simulation didn't necessarily work the way the real world did.

On the other hand, it was also liberating. As long as we met that requirement we were free to do whatever we wanted. It let us innovate on our side of the line and not worry too much about how it would work.

But that data dictionary also had a bunch of other good points. It didn't just define the data that flowed between the subsystems; it described the overall system. It let us understand not just what we were doing, but the bigger picture as well. It let us see how a change to one system would impact (or not) other systems. It let us reason about the architecture of the system.

Now data dictionaries (ICDs, thrift files, protobuf files, DB schemas) are critical, but they are very detailed and focused, and while it's possible to get the big picture from them, it's challenging to say the least. And when you're describing your system to someone, all the details in the world don't matter without context to put them in. And that's where your software architecture diagram comes in. It's the boxes-on-the-whiteboard diagram. The one with boxes for processes and durable stores and buffers, but no mention of the technology inside them. There are lots of ways to make them: UML, Visio, Rational. No matter how you make them, the goal is to understand the bounded context(s) of your system and how they fit together.
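To make that distinction concrete, here's a hypothetical, heavily stripped-down "data dictionary" entry sketched as Go types. In practice this would more likely be an ICD, a protobuf/thrift definition, or a DB schema, and every name here is invented, but the point is the same: the contract says exactly what crosses the boundary, with units, and nothing about how either side is implemented.

```go
// Package busdict is a made-up example of a data-dictionary-style contract.
package busdict

import "time"

// NavState is a "gazouta" of the navigation subsystem and a "gazinta" of the
// mission computer. The contract names exactly what crosses the boundary,
// with units, and says nothing about how either side is implemented.
type NavState struct {
	Timestamp     time.Time // when the state was sampled
	LatitudeDeg   float64   // degrees, WGS-84
	LongitudeDeg  float64   // degrees, WGS-84
	AltitudeM     float64   // meters above mean sea level
	GroundSpeedMS float64   // meters per second
	HeadingDeg    float64   // degrees clockwise from true north
}

// WaypointCommand flows the other way: out of the mission computer and into
// the navigation subsystem.
type WaypointCommand struct {
	SequenceID   uint32
	LatitudeDeg  float64
	LongitudeDeg float64
	AltitudeM    float64
}
```

The architecture diagram is the complement: it shows that navigation and the mission computer exist and talk to each other, without any of these field-level details.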

So whether you're writing UML, Thrift, gRPC, AXL graphs, or using LucidChart to describe the relationship between various doggos, the better you can document your architecture and data flow, the better you can work with others and be on the same page.

by Leon Rosenshein

Cycles Of History

Or, `The more things change, the more they stay the same`. A long long time ago computers were big. They took up entire rooms and required cadres of experts who would care for them, take your punch cards, and then some time later tell you that you punched the wrong hole on card 97. So you'd redo card 97, resubmit your cards, and get the next error. Rinse and repeat until lo, an answer appeared.

Then computers got smaller. You could put two or three in a room. Someone added a keyboard and a roll of paper, and you could type at it. And get an answer. And be miles away connected by a phone line. Time and Moore's law marched on, and computers got smaller and smaller. Teletypes turned into vt100 terminals, then vector displays, and finally bitmap displays. Thus was born the Mother of all Demos. And it was good. Altair, then Sinclair, Atari, Commodore, and IBM started making personal computers. They got even smaller, and Osborne gave us suitcase-sized luggables. IBM and Apple made them easier to use. Then the Macintosh made them "cool". And all the power was on the desktop. And you could run what you wanted, when you wanted to.

Then Sun came along, and the network was the computer. At least for the enterprise. Centralized services and data, with lots of thin clients on desktops. Solaris. SPARC. X Windows. Display PostScript. Control. The cadre in the datacenter told you what you could run and when. They controlled access to everything.

But what about Microsoft? A computer on every desktop (running Microsoft software). And it became a thing. NAS, SAN, and Samba. File servers were around to share data, but they were just storage. Processing moved back out to the edge. All the pixels and FPS you could ask for. One computer had more memory than the entire world did 20 years earlier. We got Doom, and Quake, and MS Flight Simulator. But all those computers were pretty isolated. LAN parties required rubber chickens and special incantations to get 4 computers in the same room to talk to each other.

Meanwhile, over in the corner, DARPA and BBN had built MILNET, universities joined BITNET, and computers started talking to each other, almost reliably. Email, Usenet, and maybe distributed computing. TCP/IP for reliable routing. Retries. Store and forward. EasySabre. CompuServe. Geocities. Angelfire. AOL, and the September that never ended. The internet was a thing, and the network was the computer again.

And now? The speed of light is a limit. You need things close by. Akamai isn't enough. Just having the data local doesn't cut it when you need to crunch it. And the big new thing is born. Edge computing. Little pockets of local data that do what you need, and occasionally share data back with the home office and get updates. Hybrid cloud and on-prem systems. It's new. It's cool. It's now.

It's the same thing we used to do, writ large. Keep your hot data close to where it's going to be processed. It doesn't matter if it's RAM vs. drum memory, on-chip L1 cache vs. SSD, or SAN vs. cloud. Or I could tell you the same story about graphics cards. Bandwidth limited, geometry-transform limited, fill limited, power limited. Or disk drives. RPM, density, number of tracks, bandwidth, total storage. Or display systems. Raster, vector, refresh rate, pixel density. Or, for a car analogy, internal combustion engines.

In 1968 Douglas Engelbart showed us the future. It took 50+ years and some large number of cycles to get here, and we're still cycling. There are plenty of good lessons to learn from those cycles, so let's not forget about them while we build the future.

by Leon Rosenshein

Committed

What do you do more of, read PR commit messages or write them? Like almost everyone, I read a lot more PR messages than I write. And even more than the code, which expresses your intent to the compiler/interpreter, the PR message is about expressing your intent to the reader. And like good code, the purpose of your PR message is to meet a business need. In this case, the business need is to tell the reader (who is often future you) what the PR does and why it exists in the first place.

So how do you achieve that goal? By remembering it while you're writing the message. And it starts with the title. It needs to be short and imperative. It should say what the PR does. And it probably shouldn't have the word "and" in it. In general, if it's a compound statement you should have split the change into multiple PRs, each with a single focus.

Second, no more than three paragraphs that explain, at a high level, what the PR does, why it's needed, and why it does it that way. It's not a design doc, but you can link to one if needed. It's not a trade study, but it should capture the reasoning. This is the place where you would touch on what you didn't do, and why you didn't do it. If there's a requirements doc, work item, task, or bug that the PR is in service of, this is where you link to it.

Third, describe how you know it works. Again in a paragraph or two, not a full test matrix, explain how you tested it. It could be unit tests, integration tests, A/B tests, or something else. And if there's something you didn't test, note what it is and why you didn't test it.
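Putting those three pieces together, a PR message might look something like the following. Everything in it (the feature, the ticket number, the tradeoffs) is made up purely to show the shape:

```
Cap geocode client retries at 3 with exponential backoff

Geocode lookups currently retry without limit, which amplified load during
a recent incident. This change adds a retry budget of 3 attempts with
exponential backoff and jitter, and emits a metric when the budget is
exhausted. A circuit breaker was considered but deferred; see the linked
design doc for the tradeoffs. Work item: ABC-1234 (made-up ticket).

Testing: added unit tests for the backoff calculation, ran the existing
integration suite, and verified the new metric in staging. No load test;
the budget only reduces traffic, it never adds any.
```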

Finally, remember that we do squashed PRs, so all of the individual commit messages you write prior to submitting the PR won't be seen after your PR lands. I don't put nearly as much effort into them as I do the final PR commit message. I view them as short-term notes to myself to help me write the final PR message. For the interim commits short is good, and unless there's something novel in a commit then one or two lines is enough.

So what do you think is important in a PR comment?

by Leon Rosenshein

Stable Code

Stable code is a good thing, right? Write it, build it, deploy it, and it just keeps working. It just runs and runs. You don't need to worry about it anymore. What more could you ask for? How about repeatability? There's more to a stable service than the first release. There's all the subsequent releases as well. You need to keep doing it. New requirements and requests come in, you write some more code, build it again, then deploy it.  In most cases those releases come on a fairly frequent cadence, and you get pretty good at it. And if you're not that good at it you get better. You practice. You write runbooks. You automate things. You evolve with the rest of your environment.

But sometimes it doesn't work that way. Maybe it was the first release, or maybe it was the 10th, but eventually it happens. You wrote the code so well, with so much forethought and future-proofing that months go by and you don't have to touch the system. Your monitoring and your customers are telling you everything is going great so you stop thinking about it and move on to other things. And inevitably a customer comes to you with a new requirement.

This happened to us the other day. We had a website that had been working great for months. A stateless website that was adaptive enough to handle all the changes in upstream services we threw at it without changes. Then the upstream service we knew would always be there went away, and we needed to change one hard-coded value. So what do you do in that case? First, you remember where you put the code and the runbooks. Then you build it so you can make a simple change and test it. Simple, right?

In this case, not really. Turns out the world had moved on. The runbooks didn't work. Our laptops had upgraded to a new set of tools. The build and deployment systems had changed. A testing dependency wasn't available. But we needed to deploy, so we adapted. We removed the missing dependency. We turned off a bunch of validation tools. We got some bigger hammers and made it work. At the expense of repeatability. Luckily for us we're decommissioning the tool, so hopefully we won't need to ever update it again.

Bottom line, because of our apparent success we took shortcuts and didn't make sure our build system was not only stable and repeatable, but hermetic enough to stand up to the passage of time. We were offline for about 6 hours because our builds relied on external dependencies we had no control over and we had to adjust to work around them.

The lesson? Be hermetic. Own your dependencies. Exercise your systems even if you haven't needed them.

by Leon Rosenshein

Architect's Newsletter

Monoliths, Microservices, Distributed Monoliths, Domain Driven Design (DDD), Service Meshes, Data Modeling, and more. All in the latest Software Architect's Newsletter.

Personally, I like DDD, as it helps with system decomposition, isolation, reducing coupling and cognitive load, and generally gives developers more freedom inside their domain because the boundaries are clearly defined. This is especially important when you're out on a sea of uncertainty. Anything you can do to bound the context you're working in lets you focus more on what you're trying to get done and not have to worry about what others are doing.
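As a toy illustration of what "clearly defined boundaries" can look like in code (all names here are invented), each bounded context exposes a small interface, and the rest of the system depends only on that:

```go
// Package pricing sketches one bounded context; every name here is invented.
package pricing

import "context"

// Quote is the only pricing concept other domains get to see.
type Quote struct {
	RideID    string
	AmountUSD float64
}

// Service is the boundary of the pricing bounded context. Callers in other
// domains (trips, payments, ...) depend on this interface, not on pricing's
// database schema, surge model, or internal services, so the pricing team is
// free to change anything behind the boundary without coordinating with them.
type Service interface {
	QuoteRide(ctx context.Context, rideID string) (Quote, error)
}
```

Everything behind the interface can change without the callers in other domains ever knowing, which is exactly the freedom-inside-the-boundary that makes DDD appealing.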

by Leon Rosenshein

Foundations

I recently ran across an article on "The 10 things every developer should learn", which is a great idea, but kind of hard to specify in practice. I mean, every developer should understand loops, conditionals, and data structures, right? How about discrete math and formal logic? Set theory? Big O notation is something every developer should have a basic understanding of, too. It would be good to know TCP/IP at a high level, and to know what 3NF is. Then there's the toolset. Every developer should know source control, compilers, deployment tools, operating systems, and IDEs. Wait, that's more than 10 things.

Or, you could go the other way. There's the LAMP stack. If you know that you know everything you need to know, right? And that's only 4 things. Simple. Or maybe the list is Linux and C++. It's all just 1's and 0's, and short of assembly language or machine code you don't get much closer to the 1's and 0's than that. At the other end of that spectrum are Lisp and Smalltalk. Those are the best languages ever designed, with rich semantics. Or Ada. If you can get it to compile it's almost certainly correct. Good luck getting much more than "Hello World" to compile though.

So having said all that, I do believe there are things every developer should know. But they're not very concrete. They're more a set of heuristics and guidelines. And there's not a short ordered list. The things on the list are context sensitive and depend on the problem you're trying to solve. So what's on my list? In no particular order:

  • Make sure you're solving the problem you are supposed to be solving
  • You don't know everything
  • You can't retrofit security securely
  • Build for the right scale
  • Use the right tool for the job
  • Text is hard
  • Write for the reader, not the compiler
  • Design for flexibility, but remember, you ain't gonna need it (yet)
  • Having a way to change things without a full build/test/deploy cycle is important to flexibility
  • Save early, save often (both to disk and source control)
  • Don't go alone
  • Don't blindly follow top 10 lists

So what's on your list? Share in the thread.

by Leon Rosenshein

Onboarding

How do people join your team? How long until they can do a build and run some tests? I'm not talking about understanding things well enough to submit a bug-fix PR or add a new feature, just building the code as is and running all the unit tests.

Our repo is hermetic, which is good; it doesn't rely on having the right version of the compilers and libraries installed on dev machines. But what about the rest of the dev tooling? IDEs are pretty common these days. Does your team have a supported one or two? Are there a set of plugins/aliases that are commonly used to improve your daily work? How are they shared/made available? What security groups/lists/permissions are needed? What websites does your team use to share things? What's your on-call policy, if you have one, and how do folks get added?

And this doesn't just apply to new hires or people onboarding to your team. What happens when you get a new laptop or need to re-image your current one? Get pushed to a new version of the OS? Find yourself helping out a coworker on their machine and try to use one of those tools?

And what if something changes? How do you tell the rest of your team? How do you add new capabilities without breaking things?

None of these things are insurmountable, and individually they might not take that much time, but overall they're things that slow you down, take you out of the flow, and generally detract from happiness. So think about how you can make things better for everyone.