Recent Posts

by Leon Rosenshein

Oracles

One of the things I was taught about driving, particularly night driving, way back when, was that you should never drive faster than you could see. Taken literally, that either makes no sense or says never drive faster than the speed of light. But it does make sense. It means you should always have time to avoid any obstacle you suddenly see, or be able to stop before you hit it.

Simple enough, but there’s more to it than making sure you can identify something farther away than your braking distance. Especially if you define braking distance as how far you travel after the brakes are fully applied, on a track, under ideal conditions. In reality it depends on things like road conditions, brake temperature, gross vehicle weight, lag in the system, and tire condition. And then there’s the person in the loop. How long does it take you to notice that something is in the way? Then decide that you need to stop? Then actually mash the brakes? Depending on your speed, that can take longer than the stopping itself.

The same kind of thing applies to automatic scaling as well. You want to scale up before you run out of whatever you’re scaling so your customers don’t notice, but you don’t want to scale up too much or too soon and pay for unused resources just because you might need them. The same goes for scaling down. You don’t want to hold on to resources too long, but you also don’t want to give them up only to need them again 2 seconds later. Which means your auto-scalers have to be Oracles.

Sure, you could be greedy and just grab the most you might ever need, but while that might work for you, it’s pretty hard on everyone else, and in any kind of multi-tenant system it won’t work if everyone is greedy all the time.

One place we see this a lot is with AWS and scaling our processing capacity. Not only do we need to predict future load, but we also need to take into account the fact that scaling up the system can take 10 minutes or more. Which means if we wait too long to scale up, our customers ping us and ask why their jobs aren’t making progress, and if we don’t scale down, we get asked why utilization is so low and whether we can do more with less. To make matters worse, because we run multiple clusters in multiple accounts in someone else’s multi-tenant system, we’ve seen situations where one cluster has cannibalized another for resources. There are ways around it, but it’s non-trivial.
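
As a purely illustrative sketch (not our actual policy, and every name here is made up), the heart of a lag-aware scale-up check looks something like this: compare the load you predict at the end of the provisioning window against your capacity, rather than the load you see right now.

package autoscale

import "time"

// ShouldScaleUp decides whether to add capacity now, given that new
// capacity won't be usable until provisionLag has passed. It compares
// the load predicted at the end of that window, not the current load,
// against capacity scaled down by a headroom factor (e.g. 0.8).
func ShouldScaleUp(currentLoad, growthPerMinute, capacity, headroom float64, provisionLag time.Duration) bool {
    predictedLoad := currentLoad + growthPerMinute*provisionLag.Minutes()
    return predictedLoad > capacity*headroom
}

With a 10 minute provisioning lag, a cluster at 70% of capacity and growing 3 points per minute is already over the line, even though it looks perfectly healthy at the moment you check.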

Bottom line, prediction is hard, but the benefits are manifold. And the key is understanding what the driving variables are.

by Leon Rosenshein

Containment

I’ve talked about failure modes before, but there was an incident last weekend that reminded me of them again. United 328, shortly after leaving Denver for Honolulu, had a failure. A pretty catastrophic one. Most likely they lost a fan blade, which took out the rest of the engine. But as scary as it was for the folks on the plane and on the ground under it, no-one got hurt. That’s because a group of engineers thought about the safety case and the possible failure modes and planned for it.

And that meant that the aircraft had systems built in to minimize the impact and risk of rapid, unplanned, disassembly. Things like a containment system for the flying fan blades, automated fuel cutoff and fire extinguishers, and sufficient thrust from the other engine to keep making forward progress. Not just mechanical systems, but also operational systems like cockpit resource management systems, ATC procedures, and cabin crew plans. All of which worked together to get the plane safely back on the ground.

What makes this even more impressive is that all of those mechanical safety systems are heavy, and one thing you try to eliminate on planes is weight. Because every pound of an aircraft’s empty weight is a pound of revenue-generating stuff you can’t carry. That’s why the seats are so uncomfortable. And all those processes and training cost time and money, but they do them anyway. Because it’s just that important.

The same thing applies to software in general, not just robot software. What are the potential failure modes? What can you do to make sure you don’t make things worse? What do you need to do to ensure the best possible outcome when something goes wrong?

We can’t ensure 100% safety, but we need to do everything we can to minimize risk before something goes wrong and then do everything we can to get to the best result possible if it does.

by Leon Rosenshein

MBWA vs Gemba

Management By Walking Around (MBWA) is a management style where a manager walks around to the various people/workstations, seeing what’s going on, talking to people, observing and directing as they feel appropriate.

Gemba is Japanese for “the actual place” and is used in the Lean methodology to represent the idea of managers periodically “walking the line”, or going to the workplace to see how things are going, listen to people, and observe what’s happening to better understand value creation and identify waste.

At first glance, MBWA and Gemba seem like the same thing. The manager leaves their office, goes to where the work is happening, looks around, and makes some changes. Sounds equivalent, no?

Actually, no. Because while the physical activity, walking around, is the same, the motivation and the process are pretty different. MBWA is spot checking on the people. Seeing what they’re doing, understanding what their concerns and impediments are, and making sure there’s good alignment on tasks and goals.

Gemba, on the other hand, is about checking on the value stream. Seeing what directly contributes to it and what adds friction and delay, then figuring out how to maximize the former and minimize the latter. And it often doesn’t happen during the Gemba walk. Instead, the walk is used to gather information. Asking questions and listening to answers. Getting the big picture and filling in the details. Then, later, with thoughtful deliberation, making decisions and acting on them.

Both have their place. Gemba focuses on the value stream, which is what we’re here for. Adding customer value. MBWA, on the other hand, focuses on the people. Since people are the ones adding value, we need to focus on them.

We need to be thoughtful and make sure both are getting done, at the right times. And, since we’re all WfH, do it without actually walking.

by Leon Rosenshein

Discoverability vs. Findability

First things first: discoverability and findability are not the same thing. Findability is relatively easy. You know something exists. You saw it once or someone told you about it. You know what you’re looking for, but you don’t know where it is. Like your car keys or that status dashboard. Discoverability, on the other hand, is about learning. You can only discover something once. The second time you see something you haven’t discovered it, you’ve found it. And that’s the key difference.

And that gets to the heart of the issue. Findability is important for the things you know you don’t know. You have a question and you want an answer. What’s the difference between merge sort and bubble sort? When is it better to use a micro-kernel based architecture instead of microservices? Search engines are great for that kind of thing. Put in your keywords and get the answer.

Discoverability, on the other hand, is about things you don’t know that you don’t know. And search engines are notoriously bad at that. Sure, you might get a better spelling or a search result might show you something related, but if you’re trying to find out how to use your microkernel architecture to process millions of queries per second you’re unlikely to find anything about how a distributed microservice might be a better choice. If you know a little bit more about the problem space you can use a search engine, but it’s harder. You need to change your thinking and search for architectural patterns, and then dig in from there. And that’s if you know the domain.

Or consider your IDE. Both VSCode and the various JetBrains IDEs do a good job of making it easy to find the functionality you’re looking for, with hierarchical menus, context menus, and a search mechanism, and of making it easy to discover keyboard shortcuts and related commands/options by advertising them and grouping related things together. Vim, on the other hand, has OK search, but if you don’t know what you’re looking for, it’s almost impossible to discover.

So why should you care? You should care because this applies not just to IDEs and search engines, but also to libraries, APIs, and pub-sub systems. We talk to our customers/users a lot and understand what they want/need. We use that information to build the things that provide the most value to them. If there were more valuable things to build, we’d build those instead. But unless our customers/users know about what we’ve done and actually use it, we’ve wasted that time. Sure, you could assume that since they asked for it 3 months ago they’ll notice when it suddenly appears, but really, they’re busy too and probably won’t. So make it not just findable, but discoverable. How you make something discoverable is a topic for another day.

by Leon Rosenshein

I've got you covered

Code coverage is important. If you haven’t exercised your code then you don’t know if it works. If you don’t know it works, how can you be done?

So code coverage is important. If all your code is covered by your unit tests you can feel pretty confident that your code does what you think it does. Test Driven Development (TDD) is a good way to approach it. Figure out what you want your code to do. Write the tests that validate it does those things. Then write just enough code to make sure all the tests pass. Do that well and you’ve got 100% code coverage.

And that means your code is perfect and you can turn off your alerting, right? Wrong. For one thing, 100% code coverage and passing tests only means that all of your code is executed by the tests. It doesn’t say anything about the correctness of the code. Consider this bit of code:

package shapes

import "testing"

// area is intentionally wrong: it ignores width and returns length squared.
func area(length, width int) int {
    return length * length // bug: should be length * width
}

// TestArea passes even though area is wrong, because with length == width == 2
// the buggy result (4) happens to equal the expected length * width (4).
func TestArea(t *testing.T) {
    length, width := 2, 2
    if got := area(length, width); got != length*width {
        t.Errorf("area(%d, %d) = %d, want %d", length, width, got, length*width)
    }
}

100% coverage, all tests pass, and it’s wrong. So code coverage isn’t everything.

But you could argue that in this case the problem isn’t coverage, it’s that the test is wrong. You’d be correct, but writing tests to check the tests has no end.

Then there’s the issue of side effects. Everything does what you expect in isolation, but does it do so in concert? Do you have enough tests to cover the side effects? Your method to add to the global list does so, but is there another bit of code that has cached a copy of the global list and is now going to do the wrong thing? In this case each of the tests is correct, but there’s an interaction you haven’t tested.
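
As a hypothetical sketch of that kind of interaction (none of these names are real), each piece below is individually correct and individually testable, but a consumer that cached the list before the last registration quietly works from a stale snapshot:

package registry

// handlers is the package-level global list that other code reads.
var handlers = []string{"default"}

// Register appends a handler. A unit test of Register alone passes.
func Register(name string) {
    handlers = append(handlers, name)
}

// Dispatcher caches the list when it is constructed.
// A unit test of Dispatcher alone also passes.
type Dispatcher struct {
    cached []string
}

func NewDispatcher() *Dispatcher {
    return &Dispatcher{cached: handlers} // a snapshot, not a live view
}

// Handlers returns the cached snapshot; anything Registered after
// NewDispatcher was called is silently missing.
func (d *Dispatcher) Handlers() []string {
    return d.cached
}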

Then there’s the whole world of integration. Each piece of code does exactly what it should. No less, and no more. But when you put them together you get some kind of non-linear response or positive feedback loop.

All of which is to say, code coverage is important. You can’t be sure you’re right without it. But don’t confuse a necessary component with a sufficient one.

by Leon Rosenshein

YAGNI, but ...

I’ve said it before, and I’ll probably say it again. Remember YAGNI. You Ain’t Gonna Need It. But, YAGNI, until you do. And if you haven’t planned for it, then what?

We got about 5 inches of snow overnight here in sunny Colorado. It’s all shoveled now, and I used my legs, not my back, so I’m fine. Actually, this is Colorado Powder, so I used a 36” push-broom for most of it. But it reminded me of a teaching moment from my youth in the northeast, where the snow is wet and heavy, more like concrete. And it’s not so sunny, so it kind of sets like concrete too.

After one of those 12” dumps my father sent me out to shovel my grandmother’s sidewalk and plow her driveway. I argued a bit, but hey, I was going to get to drive the tractor (not me, but the same tractor), so I did it. I did it exactly. Nice straight cuts down to the edge of the sidewalk and put the snow right there on the edge. A couple of turns around the driveway with the tractor and if the berm was on the driveway a little, there was still plenty of room for the car, so no problem. I told my dad I was done and asked if I could borrow the tractor to plow the lake off for hockey.

He took a look outside and asked if I was sure I was done because there was going to be more snow later in the week, and it was only November, so … I said yes. Everything is clear. He said OK, you can borrow the tractor, but you have to keep doing your grandmother’s sidewalk and driveway.  I thought that was great and went and plowed the lake. Driving on ice with individual rear wheel braking is fun, btw.

And as it is wont to do in those parts, it got a little warmer (like mid 30s) then got cold and snowed again. So I plowed and shoveled, and if the driveway got a little narrower because the berms crept inward a little it still wasn’t a problem. The car still fit. Rinse and repeat a few times. And then one morning after plowing the car didn’t fit. And the berms were about 2 feet of near solid ice. So out I go w/ the tractor, pick, and shovel and beat the berms into submission. Got my grandmother’s car out and she got to work, so all was well.

At which point my father reminded me that just doing what you need at the moment has long term consequences, and saving a little time then can leave you in a really bad place later. Then he got out the big tractor and pushed the berms back to where they should have been.

That’s when I learned about the exceptions to YAGNI. You might not need it now, but sometimes you know you’re going to need it in the future. Or at least the cost of the preparation work now, weighed against the chance of needing it later and the amount of additional work needed to recover from the omission, makes it worth buying that insurance up front.

And that applies to software just as much as it does to plowing the driveway. It’s certainly easier and faster to access S3 directly than it is to build and use a full POSIX-compliant file system wrapper. So you probably shouldn’t build the full wrapper now, since you don’t need it. But maybe a simple facade, with no semantic differences, that does nothing but funnel your accesses through one place and collect some metrics, is worth doing. The metrics are certainly nice, and they let you see your access patterns better. And if you realize later that you do need that full POSIX-compliant system, or you need to switch to Azure blob store, or even just handle multi-region S3 access efficiently, you’ve got a place to do it.
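
To make that concrete, here’s a minimal sketch of what such a facade might look like (the interface and names are hypothetical, not something we actually ship): callers depend on a tiny interface instead of the S3 SDK, and a metrics wrapper rides along without changing any semantics.

package storage

import (
    "context"
    "io"
    "time"
)

// ObjectStore is the thin facade callers code against, whatever blob
// store happens to back it today.
type ObjectStore interface {
    Get(ctx context.Context, key string) (io.ReadCloser, error)
    Put(ctx context.Context, key string, body io.Reader) error
}

// metered wraps any ObjectStore and reports simple access metrics.
type metered struct {
    next    ObjectStore
    observe func(op, key string, d time.Duration, err error)
}

func WithMetrics(next ObjectStore, observe func(op, key string, d time.Duration, err error)) ObjectStore {
    return &metered{next: next, observe: observe}
}

func (m *metered) Get(ctx context.Context, key string) (io.ReadCloser, error) {
    start := time.Now()
    r, err := m.next.Get(ctx, key)
    m.observe("get", key, time.Since(start), err)
    return r, err
}

func (m *metered) Put(ctx context.Context, key string, body io.Reader) error {
    start := time.Now()
    err := m.next.Put(ctx, key, body)
    m.observe("put", key, time.Since(start), err)
    return err
}

Swapping the backing store, or adding multi-region routing, then becomes a new implementation of ObjectStore rather than a hunt through every call site.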

So remember, YAGNI, but …

by Leon Rosenshein

Ch-ch-ch-ch-changes

The greatest payoff of domain-driven design is not that the *state* of a business domain maps cleanly onto software but that *changes* to a business domain map cleanly onto software.

   -- Chris Ford

Domain-Driven Design has lots of benefits. Ubiquitous language. Clear Boundaries. It works with the business, not against it. And done right it changes with the business.

Back in my early days I was working with a FORTRAN simulation of an F18. It was designed to see how changes to the flight computer would translate into changes in the handling characteristics. And it was pretty good at doing that. But that’s not what I was doing with it. I was using it as the flight model for a 1v1 simulation.

First, I turned it into a real-time, man-in-the-loop system with a simple graphical display. That was relatively easy. Instead of reading a time series of control inputs, I hooked the inputs up to some external hardware and the keyboard. It already ran faster than real-time at 50 Hz, so no problems there. Just connect the outputs to a simple out-the-window display and voila, a man-in-the-loop simulation. The only thing left to do was add another copy of the aircraft to the world. And that’s where I ran into problems.

You see, the simulation was designed around “ownship”, the vehicle being simulated. And it did a great job modeling the forces acting on ownship. Thrust, drag, lift, gravity, asymmetric forces from control surfaces, inertia. All modeled. And the result was net forces and rates, in all 6 degrees of freedom. But it had no absolute location or orientation. It was always at the center of the universe, and the universe moved around it. And for speed of implementation I put the notion of absolute location and orientation in the display system.

That’s fine if you’re the only thing in the universe, or at least all the other things don’t move, but it’s kind of hard when you want to interact with something and you’re both at the center of your own private universe that moves around you. But still, it’s just math and transformations, so I made it work for 2 aircraft in one visual world. I had it working for 3 as well, when I ran into a problem.

My boss told me we need to do MvN sims as well. Since we don’t have enough pilots let’s make some of the aircraft AI. Skipping the whole “How do I build an air combat simulation (ACS) AI overnight?” issue, I also had a problem in that these simulations needed to interact outside of the graphical display. Unfortunately my design had the display system as the only place everything interacted.

My first thought was to duplicate that logic everywhere things had to interact. It worked after a fashion, but it was hard to keep track of and as more and more things needed to know where everything else was it got even harder. Because I had my domains wrong.

I needed to change. So, back to the drawing board to fix the domains. Start with a consistent world for everything to live in. Move the integration/transformation into each model so that they had an absolute position. Share those around the system and let each system do what it needed to. Just like the real world.
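
A rough sketch of the shape of that redesign (all of the names are hypothetical): each vehicle integrates its own state into an absolute pose in a shared world frame, and everything else, display, AI, recorder, reads from that one place.

package sim

// Pose is an absolute position and orientation in the shared world frame.
type Pose struct {
    X, Y, Z          float64
    Roll, Pitch, Yaw float64
}

// Vehicle owns its own integration: forces and rates in, absolute Pose out.
type Vehicle interface {
    Step(dt float64)
    Pose() Pose
}

// World is the single shared frame. Every consumer, whether it's the
// out-the-window display, an AI opponent, or a recorder, asks the world
// where things are instead of owning that math itself.
type World struct {
    Vehicles []Vehicle
}

func (w *World) Step(dt float64) {
    for _, v := range w.Vehicles {
        v.Step(dt)
    }
}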

And unsurprisingly, after that I found that a lot of the things we had planned became easier. Things like adding new vehicles, weapon systems, upgraded ACS AI, or adding a record/playback system. They matched the domain, so I was able to add them within the system, without doing code gymnastics to get them to fit.

by Leon Rosenshein

MVP

Just what is an MVP anyway? Like everything else, it depends. In this case, on context. The Super Bowl has a Most Valuable Player, but that’s not the MVP I mean. In this case I’m talking about the MVP as Minimum Viable Product.

But even there, MVP isn’t used consistently. I first ran into the term when I was working on the initial release of in-browser 3D mapping for Virtual Earth. We (the dev team) were extremely busy trying to figure out how to build a product that could stream a surface model of the entire globe down to a browser at sufficient level of detail to show the Microsoft campus in the foreground and find our office windows, but not flood the pipe with data for Mount Rainier. While we were busy figuring that out, a different group of people was figuring out what feature set we needed to make a product that fit into the Microsoft online offerings. Two groups trying to figure out what the MVP was. We eventually shipped something, but I’m not sure we ever agreed on what the MVP was. Since then I’ve seen the MVP debate play out the same way multiple times.

There are two big reasons for that. First, MVP is made up of some highly subjective terms. Let’s start with the last one and work backwards. Product is probably the easiest. It’s the thing you’re selling or giving away. Of course, that alone doesn’t say what it is or why someone wants it. That’s where viable comes in. A viable product is one that a customer is willing to pay for. And for a product to be viable, the total of what the customer pays, both direct and indirect, must add up, over time, to more than the expenses. If not, you’ll eventually run out of money and not be selling anything. Which leads us to minimum. What is the absolute least you can do and still have a viable product? The assumption here is that the minimum takes the least amount of time to build, makes the fewest wrong choices, and gets you feedback from the customer soonest, so you can iterate on it and make better choices going forward. That makes sense, and it’s a good definition of MVP.

Second, there is another definition of MVP out there, the one pushed by the Lean Startup movement. It defines an MVP as the minimum thing you can build to test the viability of a product idea. Which is a very different thing. The definition of minimum is about the same, and viability still refers to the product, but now, instead of validating the success of a product, it’s applied to the idea of the product. Which means when you’re building that kind of MVP, you’re not checking to see if you can make money off it, you’re checking to see if anyone else thinks you have a good idea.

And that’s why we never reached agreement on what the MVP for VirtualEarth was. The dev team wanted to build and deploy something to see if anyone cared, and if so, what they cared about. Oh for sure we had ideas and built some of them in the first release, but mostly we wanted to find out from our early adopters what they thought was important. So we could build more of that. Meanwhile, that other group was trying to figure out, without actual data, what people were willing to pay for. We’ll never know for sure why Virtual Earth (Bing Maps) never became the standard online map, despite having 3D models, weather, moving trees, Streetside, BirdsEye, and user generated content first, but building things that people didn’t want or know how to use instead of what they wanted in the order they wanted it probably played a part.

So when you build an MVP, remember why you’re building it.

Stages of an MVP

by Leon Rosenshein

Strict Product Liability

How many times have you heard “That’s not a bug, that’s a feature”? Usually it’s a developer saying that to a user or tester as an excuse for things not working as expected, but sometimes it’s a case of “the street finds its own uses for things”. And how you respond depends on how you approach the situation.

Consider this video showing how a developer might react to a tester not using the product as intended. Who’s responsible here? Is anyone responsible? Is the tester wrong? Was the developer wrong? Was the specification wrong? Was no-one wrong and this is a learning experience for everyone? Should something be done? It depends on what the goal was.

In the US we have the idea of Strict Product Liability (SPL). IANAL, but my basic understanding of SPL is that the maker of a product is liable for all reasonable uses of a product. Which means if you’re using a screwdriver as a scraper or pry bar and something bad happens the manufacturer might be liable. There are lots of other factors, and you should probably consult your own lawyer before you act on that description, but there’s an important lesson in there for us developers too.

And it’s this. You can’t control what your customers are going to do with the capabilities you give them. You can (and should) make it simple to use things the way you intend them to be used. Streamline the happy path and funnel use-cases towards it. But unless you spend way too much time constraining your customer they’re going to use the tools you give them to solve their problems.

So what do you do about it? Two big things. First, while you can’t force your customer to stay on the happy path, you can make it very hard for them to hurt themselves or others. If your schema has a string field in it for a title, then limit it to a reasonable length. If you don’t, you’ll find that someone is storing a 10MB JSON blob in every record and taking out your database. Or maybe they’re using it as temp storage for message passing in a distributed processing system, and suddenly you’ve got 20K concurrent writes/reads.
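
For example (a hypothetical request type, not any particular schema of ours), capping the field at the edge keeps that 10MB blob from ever reaching the database:

package api

import "fmt"

const maxTitleLen = 256 // generous for a real title, hostile to abuse

type CreateDocRequest struct {
    Title string
    Body  string
}

// Validate rejects oversized requests before they get anywhere near storage.
func (r CreateDocRequest) Validate() error {
    if len(r.Title) > maxTitleLen {
        return fmt.Errorf("title too long: %d bytes (max %d)", len(r.Title), maxTitleLen)
    }
    return nil
}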

Second, find out why they’re doing it. They probably have a good reason. And you probably have a solution to their problem, but they don’t know about it. So introduce them to it. And if you don’t, should you? If one customer needs it there’s a good chance there are others who would benefit from that capability as well. So maybe you should build it for them. They’ll be happier, your system will be happier, and the other users of the system will be happier.

by Leon Rosenshein

Kinds Of Tests

There are lots of classes of tests. It all depends on what your goals are. Using the correct test for the situation ensures you’re actually testing what you think you are.

Unit tests are good for testing things in isolation. They let you control the inputs, both direct and dependent, and make sure the result is what you expect. The tighter the control you have the more reproducible the test and the fewer false positives you get. And it’s important to test not just that good inputs produce correct answers. You also need to ensure that all the possible combinations of incorrect inputs produce the correct response, whatever that means in your case. Having good, hermetic, unit tests is crucial to minimizing regressions.
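
As a small illustration (ParsePort is a made-up example, not code from any of our repos), a hermetic, table-driven test controls every input and checks that the bad inputs get the correct response, not just that the happy path works:

package config

import (
    "fmt"
    "strconv"
    "testing"
)

// ParsePort converts a string to a valid TCP port number.
func ParsePort(s string) (int, error) {
    n, err := strconv.Atoi(s)
    if err != nil {
        return 0, fmt.Errorf("not a number: %q", s)
    }
    if n < 1 || n > 65535 {
        return 0, fmt.Errorf("port out of range: %d", n)
    }
    return n, nil
}

// TestParsePort covers good and bad inputs with no external dependencies.
func TestParsePort(t *testing.T) {
    cases := []struct {
        in      string
        want    int
        wantErr bool
    }{
        {"8080", 8080, false},
        {"", 0, true},
        {"http", 0, true},
        {"70000", 0, true},
    }
    for _, c := range cases {
        got, err := ParsePort(c.in)
        if (err != nil) != c.wantErr || got != c.want {
            t.Errorf("ParsePort(%q) = (%d, %v), want (%d, error=%v)", c.in, got, err, c.want, c.wantErr)
        }
    }
}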

Integration tests, on the other hand, have a whole different kind of isolation. You want integration tests to have as few controlled inputs as possible, but you want to be sure that the results of your integration tests are kept out of your production systems, because what if there’s something wrong that you haven’t found yet? You run them carefully, and you know exactly what to expect as a result, so you can look for small changes. Having good integration tests is crucial to finding problems with interactions and emergent behaviors. And if (when) you find some, it’s a good idea to figure out how to add them to the unit tests as controlled dependent inputs.

Scale tests are kind of like integration tests, but you’re looking for different things. Instead of having a small number of hand-crafted input sets and caring about the specific results, scale tests use lots of inputs and see what the impact of demand and concurrency is. The actual results aren’t checked. Instead, the error rates and internal metrics, such as response time, queue sizes, and memory used, are tracked and anomalies are flagged. Scale tests vary not just the number of requests, but the rate of requests and the time spent at scale, to see how a system responds to spikes and to long periods of high demand. Good scale tests need lots of input, but they give you confidence that your production systems will keep running if they get deployed.

Then there are tests you run in production. Some call those experiments or A/B tests, but they’re tests just the same. You’re just testing how something not under your direct control responds to changes. Things can get really dicey here. First, you need a good way to segment the population so only a subset gets the new experience. You need to be able to define the group tightly and repeatably; if subjects drift in and out of the group, the test probably isn’t valid. You need to ensure that not too many subjects are in the test. You need to make sure that the test doesn’t have an unwanted impact on the control group. You need them though, because good experiments let you test things safely in the real world with real users.
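
One common way to get that tight, repeatable group definition (this is a sketch, not our experiment framework) is to assign subjects deterministically by hashing a stable ID, so the same subject always lands in the same arm and the exposed percentage is easy to cap:

package experiment

import "hash/fnv"

// InTreatment returns true if subjectID falls in the treatment group for
// the named experiment. The assignment is deterministic: the same subject
// always gets the same answer, and percent caps how many are exposed.
func InTreatment(experiment, subjectID string, percent uint32) bool {
    h := fnv.New32a()
    h.Write([]byte(experiment + ":" + subjectID))
    return h.Sum32()%100 < percent
}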

And of course you need frameworks to handle all of these different kinds of tests. Sure, everyone could write their own, but that’s a huge duplication of effort. And worse than that, it increases cognitive load, because now you have to worry not only about the tests you’re doing, but how the tests are done as well. And the last thing I want people running tests to worry about is whether the test harness is really running the tests that have been written and returning the correct results.