
by Leon Rosenshein

Failure Modes

How does your system fail? What are the impacts of the most common/likely modes of failure? What are the workarounds? Does it fail safe? Does it fail degraded?

Consider the lowly moving walkway. The Pittsburgh airport has a bunch of them. They just sit there, going around and around. They make your life easier. But sometimes they fail. They can fail in multiple different ways, but the most common is to simply stop moving. No one likes that, but you know what happens when a moving walkway stops moving? It becomes a floor. It stops reducing your workload, but you can still walk on it. The same goes for an escalator. If your escalator stops moving it just becomes a set of stairs. It might be steeper than you want, and the risers at the top and bottom are likely different heights than the ones in the middle (that's a building code violation in most places), but you can still go up and down. The Denver airport is like that. Yes, every escalator has nearby stairs in case there's a problem, especially when they're being worked on, but if the escalator stops moving, you don't have to. That's fail-degraded.

Elevators have a different failure mode. When they stop moving, at best they're closets. The closets are safe, but there's no useful functionality. Fire doors often have a mechanism to hold them open, but if something happens and the power fails, they close. Not ideal, but safe. Back at the airport again, that luggage cart you rented has a bar you need to hold on to or it won't move. If the linkage breaks the cart doesn't roll away; it's stuck. Truck and train air brakes work on the same principle. With no pressure the brakes are fully on. You need air pressure to release the brakes, then more to apply them again. If the hose/line breaks, you stop. Fail safe. Your car, on the other hand, won't stop if there's a hole in the brake line.

So how does this apply to software? How do you respond to failures of subsystems or bad input? Can you keep operating, store things in some durable fashion, and then catch up? How do you ensure that you don't make things worse by only applying part of a change? If your snappy local cache is down, can you go back to the source of truth yourself? Or do you just say no and let the upstream caller, with more (hopefully enough) knowledge of the situation, do the right thing?
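As a sketch of that cache question: a read-through wrapper that degrades to the source of truth when the cache is unreachable might look like this. The names and the bare-bones error handling are illustrative, not any particular system's API.

```python
class DegradedCache:
    """Read-through cache that degrades to the source of truth.

    If the cache backend is down, fall back to the (slower) source
    instead of failing the request: fail-degraded, not fail-stop.
    """

    def __init__(self, cache_get, source_get):
        self._cache_get = cache_get    # fast path; may raise ConnectionError
        self._source_get = source_get  # slow but authoritative

    def get(self, key):
        try:
            value = self._cache_get(key)
            if value is not None:
                return value           # cache hit: the fast, happy path
        except ConnectionError:
            pass                       # cache is down: degrade, don't die
        return self._source_get(key)   # miss or outage: ask the source


def broken_cache(key):
    """Stand-in for a cache whose backend is unreachable."""
    raise ConnectionError("cache down")


source_of_truth = {"user:42": "leon"}
cache = DegradedCache(broken_cache, source_of_truth.get)
print(cache.get("user:42"))  # still answers, just more slowly
```

Like the stopped walkway, the caller still gets where they're going; they just lose the speed boost.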

by Leon Rosenshein

Wall Of Perl

Back in my Flight Simulator days I ran the build system. The build system supported about 15 developers, 20 testers, and 30 artists. We did daily builds and internal releases of code and content. No check-in builds, no automated unit tests. Code builds were easy. Fire up MSVC from the command line with the right options and wait. If it failed, send email to lots of people.

Content builds were a little different. We used 3DStudio Max for models and custom plugins for model export/processing. We had another custom tool to take textures (images saved in a lossless format) and convert them to pyramided DirectX texture files. And we had lots of models/textures. Around a terabyte of content source, and this was in the late 90's, so large for the time. The only way to make it run in a timely fashion was a distributed build farm (the artists' development machines by day, build machines by night). And holding the whole thing together was a set of Perl scripts.

From the beginning, Perl has surprised me with how terse it is. Working with lists, whether modifying or just acting on them, is easy and (usually) clear. It has pretty close integration with the underlying OS, so acting on system resources (usually files) is straightforward. It's a procedural language, so it's familiar to most developers. It's strongly typed in that there is some enforcement at runtime, but it's loosely typed in that there are no compile-time checks, and you may need your actual data to know if things really work.

Perl gets a bad rap because there is lots of read-only Perl code out there. Particularly since a big use of Perl is string handling, a lot of Perl code is replete with regular expressions, extracting data from strings and then using it to do something else. In the spirit of transparency, I admit that I've written code in Perl, then come back the next day and been unable to figure out what I was doing, even knowing what I was trying to accomplish.

On the other hand, Perl is still amazingly powerful and useful. Larry Wall is the Benevolent Dictator For Life, and does a great job. Unlike other languages that went through a painful schism when updating to a new major version, Perl 5 is still Perl: what might have become Perl 6 spun off into its own language (Raku), and Perl continues to grow with active community support. There's even Perl being used in the NA toolset. So what's your experience with Perl? Share in the thread.

by Leon Rosenshein

On Testing

My title is Senior Software Engineer II. I work on Infrastructure. No mention of customers or testing in there, but I'm a tester. I write and execute tests to make sure my customers always get what they expect, and if something goes wrong *I* know before they do. It wasn't always that way.

Back in the day, there were different roles, and they kept to themselves. PMs talked to business leaders and customers and published the requirements. Devs wrote some code, hopefully related to the requirements, and tossed it over the wall to the testers, who tried to run it based on their understanding of the requirements and told the developers it was broken. Lather, rinse, repeat. Eventually something got sent to customers, and then the patching, which went through the same cycle, began.

Now, not so much. There are lots of PM types, the walls between roles are much more porous, and the test org is largely gone. No STEs (Software Test Engineers), no SDETs (Software Development Engineer in Test), no SEs (Sustained Engineering). Instead we have lots of different kinds of tests and times/places to run them. There are Dev Tests, Unit tests, Functional Tests, Integration Tests, UI Tests, Simulations, Staging Environments, Black Box Tests, Chaos Monkeys, A/B experiments, and HealthChecks. So how do you know which ones to use and when?

That depends on where you are in the product life cycle and what you're trying to accomplish. Unit tests are great for making sure things do the right things when the inputs are defined and give the proper error when expected problems occur. They should be written early in the cycle and executed all the time to make sure unexpected changes don't sneak in. And of course, they should be completely deterministic, carrying all the data they need and not relying on any outside functionality.
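As a minimal sketch of a deterministic unit test: the `parse_port` function here is a made-up example, and the tests carry all their data inline and touch no network, files, or clock.

```python
import unittest


def parse_port(s):
    """Parse a TCP port string; raise ValueError on bad input."""
    port = int(s)  # raises ValueError on non-numeric input
    if not 0 < port < 65536:
        raise ValueError(f"port out of range: {port}")
    return port


class TestParsePort(unittest.TestCase):
    # All inputs are hard-coded, so the results never depend on the
    # environment: run it a thousand times, get the same answer.

    def test_valid_port(self):
        self.assertEqual(parse_port("8080"), 8080)

    def test_out_of_range(self):
        with self.assertRaises(ValueError):
            parse_port("70000")

    def test_not_a_number(self):
        with self.assertRaises(ValueError):
            parse_port("http")
```

Run with `python -m unittest` on every change; if a test flakes, that's a bug in the test, not bad luck.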

At the other end of the cycle is black box testing and health checks. Assuming your system is deployed and running, use it like a customer would and make sure you get the correct responses. If you do this kind of testing regularly you stand a good chance of finding out there's a problem at the same time your customers do, possibly even sooner.
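A black-box health check can be sketched like this. The `/health` endpoint and its `{"status": "ok"}` body are assumptions for illustration, not any particular service's contract; the point is that the checker uses the public interface, just like a customer.

```python
import json
import urllib.request


def is_healthy(status_code, body):
    """Decide whether a health response looks good."""
    if status_code != 200:
        return False
    try:
        data = json.loads(body)
    except ValueError:                 # not JSON at all
        return False
    return isinstance(data, dict) and data.get("status") == "ok"


def health_check(base_url, timeout=5):
    """Hit the service from the outside, like a customer would."""
    try:
        with urllib.request.urlopen(base_url + "/health",
                                    timeout=timeout) as resp:
            return is_healthy(resp.status, resp.read())
    except OSError:                    # refused, DNS failure, timeout...
        return False
```

Run it on a schedule and alert on `False`; a failing check is a page, not a crash.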

In between are all kinds of tests that focus on different things. Chaos monkeys for dealing with random problems. Integration tests to make sure your assumptions about another system are valid. Staging environments for medium scale testing. A/B experiments for large scale testing. 

Martin Fowler has a whole page of articles about testing and when and how to use them.

by Leon Rosenshein

That's The British Spelling Of ...

There are lots of ways to delight your users. Do the right thing. Do what they expect you to do. Figure out what they want even when they don't know or spell it wrong. You're probably familiar with the way search engines will take your search string, fix a misspelling, run the search, and then check that's what you meant. And they do that for more than simple transpositions like teh -> the or nad -> and.

Another place you might see it is in git. If you've ever tried to run git status and typed git stats instead, you've seen it tell you that you meant status. Arc goes one step further. If you run arc diff --no-lint, arc will respond with (Assuming '--no-lint' is the British spelling of '--nolint'.) and then do what you meant. Git can do the same thing, but by default it just tells you the options. Facebook effectively gives you extra passwords by accepting a few predictable mistypings of the real one.

If you've ever wondered how they do that, one common way is to check for valid words, then, for the invalid ones, use something called the Levenshtein distance to find the best alternative. Once you have alternatives you can present them to your user and maybe ask if they want to try one. We do the first part in the `infra` cli.
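Here's a textbook dynamic-programming version of the Levenshtein distance, plus a toy `suggest` helper in the spirit of git's did-you-mean. The vocabulary and the distance threshold are illustrative choices, not what any particular tool uses.

```python
def levenshtein(a, b):
    """Edit distance: minimum inserts/deletes/substitutions to turn a into b."""
    prev = list(range(len(b) + 1))         # distances from "" to prefixes of b
    for i, ca in enumerate(a, 1):
        curr = [i]                         # distance from a[:i] to ""
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,               # delete ca from a
                curr[j - 1] + 1,           # insert cb into a
                prev[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        prev = curr
    return prev[-1]


def suggest(word, vocabulary, max_distance=2):
    """Return the closest known word, did-you-mean style, or None."""
    best = min(vocabulary, key=lambda v: levenshtein(word, v))
    return best if levenshtein(word, best) <= max_distance else None


print(suggest("stats", ["status", "commit", "push"]))  # status
```

The threshold matters: suggest too eagerly and you "correct" things the user meant, too timidly and you never help.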

Or, you can go the sl route (brew install sl on OSX or sudo apt-get install sl on Linux) and just have something that handles a common typo. And don't suggest lots of aliases. That's just no fun.

by Leon Rosenshein

Distributed System Superpowers

We all work on distributed systems. We all like superpowers. If you were Distributed Systems Man what superpowers would you want? As a user of distributed systems I'd want, in no particular order:

*Idempotency*: I want to be able to send the same message multiple times until I get a response and not worry about partial application. If I have to set something and get a connection error then have to check to see if it was set, figure out the new set command, and try again I will get it wrong. Don't make me do that.

*Availability*: I want the system to always work. Period. I should never get a "System unavailable" message. And if I do, retrying immediately should just work.

*Scalability*: I want the system to scale (up/out) when it gets slow. Make it automatic. Make it so I don't have to care whether you scaled up or out. I want to send my message and have it get to the right place magically.

*Durability*: I want the system to remember what I told it. I don't care if the power fails in the DC or someone with a backhoe digs up a fiber. If I get a message that my submission (whatever it was) worked, then the system should never forget it.

*Consistency*: I want to be able to set something and have everyone (including me) immediately get the new value. I don't want to care where the other person is; when I'm told I've updated something, it should be updated for everyone.
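One common way to get that idempotency superpower, sketched here with made-up names: the client attaches a unique request id, and the server records the result of the first application so that retries replay the answer instead of re-applying the change.

```python
class IdempotentStore:
    """Apply each request at most once, keyed by a client-supplied id.

    A client that hits a connection error retries with the *same*
    request id; the server replays the recorded result rather than
    applying the change a second time.
    """

    def __init__(self):
        self._data = {}   # the actual state
        self._seen = {}   # request_id -> result of the first application

    def set(self, request_id, key, value):
        if request_id in self._seen:
            return self._seen[request_id]  # duplicate: replay, don't re-apply
        self._data[key] = value
        result = {"key": key, "value": value}
        self._seen[request_id] = result
        return result


store = IdempotentStore()
first = store.set("req-42", "owner", "leon")
retry = store.set("req-42", "owner", "leon")  # safe: same answer, no re-apply
```

Now the client's retry loop is dumb and correct: on any error, send the exact same message again. No checking whether the first attempt landed, no constructing a new command.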

Of course, providing all of those things at once is hard. According to the CAP theorem it's not even possible. But we can come close. Caches. Queues. Shards. Replication. And SLAs. Don't forget them. Maybe you can't promise everything, but figure out what you can promise, then deliver on that. So what's your superpower? Share in the thread.

by Leon Rosenshein

How Do I Use This Thing?

Speaking of APIs, how do you know the API (or UI) you've lovingly crafted solves a user's problem? I assume you've done your homework and talked to your customers, observed what they're doing, and understand their domain reasonably well, and what you've built works, but that's not enough. Remember, customers (users) want solutions (to their problems), not features and cool technology. So how do you know if you're actually solving their problems?

Ask them, of course. Whether you call it a prototype, a beta test, or "Hey Pat, can you come over here and try this out for me?", getting real feedback is important. Is it as obvious as you think? Do they click on the right button or go searching for one with a different name? Does the workflow make sense? Do they get stuck? Watch them and see. Do they look confused or delighted? Ask people with different experiences. They'll see things differently and have different responses.

You're also checking out any documentation/examples you have. For an API, can your target user modify something and make it their own? Do the names tell them what to expect? Do they use the API the way you expected?

And instrument your system. Your users will often tell you what they don't like, but objectively, how often do they do the wrong thing, click the wrong button, or call an API with incorrect parameters? That you can track yourself.

by Leon Rosenshein

I Need A REST

REST is representational state transfer, and is in truth transport agnostic, but when people talk about RESTful APIs they're usually talking about REST over HTTP. And when you're talking about HTTP and REST, you need to think about your `goesintos` and `goesoutofs`, your verbs and your bodies. HTTP 1.1 defines only 8 verbs, and which one to use for a given operation is pretty well defined. When you're defining your API think about safety and idempotence. Distributed systems live and die by idempotence, and we have a lot of distribution.

Think about your inputs. Are they all required? Are there reasonable defaults? Is there some set of values that are mutually exclusive? That might be OK, but be mindful of it. And don't accept inputs if there shouldn't be any. It's not an error to send a body with a GET request instead of using query parameters, but GET requests are supposed to ignore the body, so from the principle of least astonishment, it's a bad idea. You never know who's going to drop/ignore it along the way.

Think about your outputs. First of all, the HTTP spec defines a set of status codes. Use them, and use them as they're intended. Not all successes are 200, but all 2XXs are successes. If you need to provide more detail, use the body.

And of course, this doesn't apply to only REST over HTTP. Idempotence and not astonishing your users is important regardless of what your API is. If you're using gRPC it's not as clear what is idempotent, but minimize side-effects. Setters clearly change something, and Getters never should.
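As an illustration of those verb and status-code choices, here's a toy dict-backed resource handler (not any particular framework's API): GET on a missing resource is 404, a PUT that creates is 201, DELETE returns 204 and stays idempotent, and an unsupported verb is 405.

```python
def handle_request(method, store, key, body=None):
    """Map HTTP verbs onto a dict-backed resource store.

    Returns (status_code, response_body). Not every success is 200:
    creating via PUT is 201, and DELETE has nothing to say, so 204.
    """
    if method == "GET":                  # safe: never changes anything
        if key not in store:
            return 404, None
        return 200, store[key]
    if method == "PUT":                  # idempotent: repeat = same state
        created = key not in store
        store[key] = body
        return (201 if created else 200), body
    if method == "DELETE":               # idempotent: deleting twice is fine
        store.pop(key, None)
        return 204, None
    return 405, None                     # verb not allowed on this resource


resources = {}
print(handle_request("PUT", resources, "job/7", {"state": "queued"}))
```

Note that the two idempotent verbs are exactly the ones a client can blindly retry after a timeout.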

by Leon Rosenshein

What's Your Charter

I've talked about mission and vision statements, but I haven't talked about charters. You'll hear charters talked about in two contexts, team and project. Personally, I don't believe in team (or org in general) charters. Those are mission and vision statements.

To me, a charter has an end goal (just like a vision), but when the end goal is reached the project is over and work stops. Yes, a project can evolve and where you end up isn't always where you thought you were going, but there is an expected end. A charter is both more and less than a vision and mission.

It's less in that its scope is in some ways smaller. Not only does it have an end, but projects and their charters exist within a larger organization. Self-driving cars that drive more than themselves is a vision. PLT is a project and has a charter.

A charter is also more. Besides having an end state (vision) projects also have roles and responsibilities defined. Who's involved? Who's responsible for what? Who are the pigs and chickens? What are the processes the project will use? Some of these need to be written down, others are assumed/inherited from the org.

Regardless of whether all of those things (and others) are explicitly written down or not, you've got them. The benefit of having them written down and agreed to is that whenever there's a question you can refer back to the doc and get clarification.

by Leon Rosenshein

I'll Tell You Later

In the spirit of YAGNI, sometimes it’s better to put off a decision until a later date. Management books tell you a sub-optimal decision now is better than no decision. And that’s often the case, especially when there are lots of people sitting around doing nothing while they wait for a decision. Sometimes you do need to choose, and in those cases making a choice is better than not.

Sometimes, the reason you need to make a choice now is because of time constraints. For whatever good and valid reason the choice needs to be made, so you make it. You do the best you can with the information you have at the time, then you make the best of it and work from there.

Other times, particularly at the beginning of something, you have lots of time and freedom to make decisions. In those situations it can often be better not to decide. Take the time to investigate. Look at your options. Look at the state of the art. Think about use cases. Build a framework, buy a framework, or just write it directly? If you’ve only got one use case you can’t know yet. If you only have two, a simple if statement is enough. At 3+ you can really decide if you need to build a generic system.

Similarly, if you decide on an architecture before you have all the requirements and the requirements change you now have to either adjust your architecture (expensive), reduce features (unhappy customers), or do unnatural things with your code (tech debt).


So, when it comes time to make your next decision, think about it. Maybe YAGNI applies right now.

by Leon Rosenshein

DSLs and You

Domain Specific Languages are a thing. They let you express what you want in terms relevant to the problem at hand, and can hide a lot of complexity. Consider the lowly SQL select statement. You get to specify a few parameters and the system does yeoman’s work to actually figure out how to optimally read the data. CSS is also a DSL. Our team wrote and used Turtl to process JSON files and it makes templating and data extraction a lot easier. Those are all text based DSLs, and we’re all very familiar with them, but not all DSLs are text based.
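The SQL point can be seen in miniature with SQLite: we declare what rows we want, and the query planner decides how to get them. The flights table here is made up for illustration.

```python
import sqlite3

# A declarative DSL: we say *what* we want, the engine figures out *how*.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE flights (origin TEXT, dest TEXT, minutes INT)")
conn.executemany(
    "INSERT INTO flights VALUES (?, ?, ?)",
    [("PIT", "DEN", 180), ("PIT", "SEA", 300), ("DEN", "SEA", 170)],
)

# One line of domain language; index selection, scan order, and sorting
# strategy are all the planner's problem, not ours.
rows = conn.execute(
    "SELECT dest, minutes FROM flights WHERE origin = ? ORDER BY minutes",
    ("PIT",),
).fetchall()
print(rows)  # [('DEN', 180), ('SEA', 300)]
```

That hidden planner is exactly the "yeoman's work" a good DSL does for you.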

Most form builders and interface layout tools are visual DSLs. Yes, there is a config file behind the UI that you can muck with, but the intent is to use the visual interface. There are even visual DSLs for personal robots. The world’s largest tire manufacturer (LEGO) also produces a visual DSL for its Mindstorms robots. It’s based on LabVIEW, which is another visual DSL that’s widely used for industrial purposes.

But that doesn’t mean that DSLs are the solution to all your problems. For one thing, they’re specific, which means if your problem isn’t covered by the designed use cases you’ll have a very hard time doing what you want, if it’s even possible. It’s also another level of indirection which makes it harder for the end user to know exactly what’s going to happen. And like other languages, how you handle the edge cases is important. And you need to be consistent across versions. You never really know how your customers are going to use it, but you need to support/maintain those use cases going forward.

So, a DSL may be the solution to your problem, but make sure you include all the future costs in your evaluation.