Recent Posts (page 18 / 67)

by Leon Rosenshein

Roller Coasters

Roller Coasters can be fun. Big drops. High speeds. Adrenaline rushes. And most of our offices have them relatively close. There’s the Steel Curtain at Kennywood, the Demon at Great America, and the Mind Eraser at Elitch Gardens. There’s even Roller Coaster Trail outside Bozeman.

From the outside it’s big and scary looking. You have some glimpses, but no understanding. You’re in line, waiting to get in and give it a try. There are folks in line with you who are very familiar with it. They’re telling you how wonderful it is. They describe the details and how it makes all the other coasters look tame.

There are also folks in line for the first time. They’re laughing nervously. Not quite sure what they’ve gotten themselves into. You can see the tension. The glances at the big drop or the loop. When a car rattles overhead they startle or duck. But they keep slowly moving forward. Nervously settling into the car when it’s their turn.

You start up the first big hill. The experienced folks are cheering. The novices are enjoying the view, but their heart rate and blood pressure are going up. They nervously check pockets to make sure things are secure.

You reach the top of the hill. Speed picks up. Wind in your hair. You float a little (or maybe a lot) in the seat. Let out a scream or two. The track levels out, the G-force builds. A couple of dips and sharp turns and you realize, you’ve got this. It’s fun. Probably worth doing again. You start back up the other big hill. You’re confident. You’ve done this before.

Then the track drops out from under you. You’re in free fall. Then you’re upside down in a loop, a corkscrew, or both. How did we get here? What just happened? About the time you’ve internalized it you’re back at the station. An experienced rider, quite likely ready to do it again.

But there’s another kind of roller coaster we deal with a lot more often. That’s the roller coaster of understanding. It goes something like this.

[Image: the roller coaster of understanding]

New technologies are like that. I understand most of Kubernetes. I’ve been using Golang for a few years now. I’m pretty sure I get Go functions and channels, but I’ve felt that way before, so there might be another hill coming.
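
To make that concrete, here’s a minimal sketch (mine, not from any particular source) of one of those hills: ranging over a channel looks obvious right up until you forget to close the channel and the program deadlocks.

```go
package main

import "fmt"

func main() {
	ch := make(chan int)

	// Producer goroutine: send a few values, then signal "no more"
	// by closing the channel.
	go func() {
		for i := 0; i < 3; i++ {
			ch <- i
		}
		// The hill you don't see coming: without this close, the
		// range loop below blocks forever and the runtime reports
		// "all goroutines are asleep - deadlock!".
		close(ch)
	}()

	// range keeps receiving until the channel is closed.
	for v := range ch {
		fmt.Println(v)
	}
}
```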

Learning patterns is the same way. When you first learn them, you see applications for them everywhere, and you overuse them. But with time and experience, you learn not just when and how to use them, but also when and how not to use them.

And it's not just learning that is like a roller coaster. There are so many parts of the development cycle that feel like that. But those are stories for another time.

Regardless, the key is to push through. With humility as you realize there’s always more to learn.

by Leon Rosenshein

Getting Feedback

It’s feedback season again. There’s lots of advice on how and when to give feedback. The internet is full of it. What you don’t see so much of, and what’s about to become very relevant, is how to receive feedback. Like any other communication, regardless of how the message is sent, if it's not received there’s no information transfer.

So what can you do as the receiver of feedback? What is your role in the process? There are some simple things you can do to make getting feedback as effective as possible. And it doesn’t matter if you’re getting the feedback from a manager, peer, subordinate, or someone you only had incidental contact with.

First and most important, listen. Hear what the feedback is. Make sure you understand what is being said. Take notes. Ask clarifying questions if needed. It doesn’t matter if you think the person giving the feedback misunderstood or misheard or is wrong. Your job as the receiver is to get the information being delivered. That means you don’t control the conversation. Not the timing, and not the direction. Don’t put words in their mouths. You can ask clarifying questions, but not leading questions. 

Second, don’t defend yourself or your action(s). This isn’t the time for that. Just take it in. Write it down. It goes back to listening first. Again, clarifying questions can be ok, but asking the giver to look at it from a different angle or asking if some new information changes their opinion deflects from what they’re trying to say.

Third, look for trends. Did you hear the feedback once, in a specific situation, from one person multiple times, or from many people over many instances? Is it something you only hear at work, or is it something you’ve heard in multiple contexts?

Fourth, don’t make instant promises. Unless the feedback is simple, like “I prefer to be called Tim rather than Timothy.” You’ll need to think about what was said, when and how often the situation arises, what you can do, and what you should do. The time to do that is not when you're supposed to be listening to the other person.

And finally, say thanks. The giver put themselves out there. Especially if it’s a subordinate or junior. They did their part and deserve thanks.

Of course, after you get the feedback comes the hard part. Interpreting it, understanding it, and making changes.

by Leon Rosenshein

Efficiency

What is efficiency? According to Merriam Webster, efficiency is:

Efficiency (noun):

the ability to do something or produce something without wasting materials, time, or energy : the quality or degree of being efficient

Ex: Because of her efficiency, we got all the work done in a few hours.
Ex: The factory was operating at peak efficiency.

But what does that really mean? And how does it change depending on what you’re doing/producing? Do you measure efficiency the same way when you’re building a bridge, running a sprint, running a marathon, creating an original painting, or building software? Of course not. I ran across this distinction by Jessica Kerr.

“Efficiency” is about fewer steps, uniformity, control— only when building the same thing over and over. 

“Efficiency” in building something new is about smaller steps, exploration, quick learning.

There’s something very subtle going on there. When you know exactly what to do and when to do it, being efficient means ensuring that everything happens exactly as planned, and exactly when planned. Anything beyond that minimum is wasted effort. Something to be eliminated when you’re trying to be more efficient.

On the other hand, when you don’t know exactly what to do next, when to do it, or what it will look like when you’re done, you approach it the other way. Take small steps. See if they work. Adjust and make sure you’re going in the right direction. Instead of a detailed roadmap with explicit steps (Taylorism), you have a sea chart. You have a goal, and you have the environment. The goal gives you your strategic direction. The environment helps you make tactical decisions.

We often use the first version because it gives us a sense of control. It’s rework avoidance theory. Define the path, execute the plan, arrive at the goal. No rework required.

Unless there is. If you just follow the plan you reach the last step and look up to see where you are. You might be close, you might be where you started, or you might be even further away. You won’t know until you get there. So you don’t know how much rework there’s going to be. You do know one thing though. The less you know about the environment between you and the goal, the further off you’re going to be when you finish the plan.

So if you want to execute efficiently through the unknown, take many more much smaller steps.

by Leon Rosenshein

SPOFs

A SPOF is a single point of failure. It’s that one little thing that everything else depends on and doesn’t have redundancy. Like building an entire electric grid to make sure you have power available, adding an external generator and battery backup in case the grid fails, and building your house with multiple circuits to each room, then having a single line going from the battery/generator/grid multiplexer to the house. If that single line fails you’re out of power. At least in that case if the line fails you can probably run an extension cord from the working power supply directly to the house.

Now consider the James Webb Space Telescope (JWST). As I write this the JWST is about 500,000 miles from the Earth and is undergoing a complex, complicated process to get things ready for its real work. One of the biggest things it needs to get done is unfold its sunshield so the cold side stays cold. The sunshield, when unfolded, is the size of a tennis court and consists of 5 layers of incredibly thin (< 0.05mm) kapton film sheets. What makes it complicated is that the sunshield is packed for travel; it, along with all the rest of the satellite, needs to fit into a 5x15 meter fairing for launch. What makes it complex is that during the unfolding process everything needs to move together at the right speed with the right force to avoid wrinkles and tears. On top of that, it needs to do it in a vacuum and microgravity. So you end up with 344 SPOFs.

Since it’s 500,000 miles away, in conditions that can’t be replicated reliably for any meaningful duration, testing in real conditions is hard. You can do lots of unit tests. You can do some underwater tests to approximate microgravity. But integration tests, not so much. So you plan. And you plan some more. You consider failure modes and build in contingencies. Then you build contingencies for your contingencies. And after all that, and $10,000,000,000, you launch with 344 SPOFs. In that case you have to have way more than 2 9’s confidence in each of your SPOFs. You need 4 or 5. That’s impressive.
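
To put rough numbers on that (my back-of-the-envelope arithmetic, not NASA’s): if all 344 single points of failure have to work, the overall chance of success is the product of the individual reliabilities.

```go
package main

import (
	"fmt"
	"math"
)

func main() {
	const spofs = 344

	// Per-SPOF reliability at 2, 3, 4, and 5 nines.
	for _, p := range []float64{0.99, 0.999, 0.9999, 0.99999} {
		// Probability that every single point of failure works.
		overall := math.Pow(p, spofs)
		fmt.Printf("per-SPOF reliability %.5f -> overall %.4f\n", p, overall)
	}
}
```

At two nines per SPOF the product is about 3%. At three nines it’s about 71%, at four nines about 97%, and at five nines about 99.7%. Hence 4 or 5.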

So next time someone says it’s too hard to get 4 9’s on a highly available system, remind them of the JWST, which got that kind of reliability with 344 SPOFs. Doing it Earthside, where you can touch things, change them, and have redundancy, should be (relatively) easy.

by Leon Rosenshein

Back It Up There

Back in the house now and mostly back to normal. Heat and electricity are on, but we still need to boil our water. Turns out most of our processes and preparations worked, but not all of them. The biggest failure appears to be the emergency alert system. The city/county uses Everbridge to share alerts. That’s fine and dandy, and they’re highly available and distributed. The problem is, they’re OPT IN, so if you don’t register for alerts you don’t get them. And of course, if you don’t know about the system you don’t register. Well, we’re registered now, so we’ve got that going for us.

Which got me thinking about systems testing and assumptions. When something absolutely, positively, has to work right, how do you verify it? Do you believe that if your car’s “Check Engine” light is off everything is fine? That’s a pretty bold statement. Especially if you don’t have a test to tell you if that light even works. Even if everything is working as planned, is there something that can go wrong that the light doesn’t tell you about?

Having a little green light that says everything is OK is a little better. At least then if the bulb burns out you know something happened and you can check. But even that leaves you at the mercy of what the system behind the light checks.

The only way to really know if your system can handle a particular fault is to try it out. Back when I was working on Falcon 4.0 the team down the hall was working on Top Gun, and they were trying out a new source control system, CVS. It seemed great. It understood merges. Multiple people could work concurrently on the same file and deal with it later. And our IT team took good care of the server. RAID array for the disks, redundant power supplies, daily backups, the whole 9 yards. And of course, the server died. It was the Pepsi Syndrome. Someone knocked over the server. Disks crashed. Motherboards broke. Network ports got ripped out.

Short story, it wasn’t coming back. But that’s ok. We’ve got backups. Weekly full backups and daily incrementals. Pick up a new server, restore the backup, and keep developing. Just a couple of days delay. Until we tried to restore the backup. Turns out we had write-only backups. The little green lights came on every week. The scripts returned successfully. The tapes were rotated off site for physical safety as planned. We just couldn’t restore the server from them.

Luckily we were able to piece together a close enough approximation of the current state from everyone’s machines and the project moved forward. We also added a new step to the weekly tape rotation process. Every Monday we’d restore from tape and apply an incremental on top of it to make sure the entire process worked. And of course, after that we never needed the backups again.
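
For flavor, here’s a minimal sketch of that “actually restore it and check” step in a more modern setting. The manifest format and layout are hypothetical; the point is that the restore itself gets exercised, not just the backup job’s exit code.

```go
// verifyrestore walks a freshly restored tree and checks every file
// against a manifest of SHA-256 checksums captured at backup time.
// Hypothetical manifest format: "<hex digest>  <relative path>" per line,
// the same shape as sha256sum output.
package main

import (
	"bufio"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"io"
	"os"
	"path/filepath"
	"strings"
)

func fileDigest(path string) (string, error) {
	f, err := os.Open(path)
	if err != nil {
		return "", err
	}
	defer f.Close()
	h := sha256.New()
	if _, err := io.Copy(h, f); err != nil {
		return "", err
	}
	return hex.EncodeToString(h.Sum(nil)), nil
}

func main() {
	if len(os.Args) != 3 {
		fmt.Fprintln(os.Stderr, "usage: verifyrestore <manifest> <restored-root>")
		os.Exit(2)
	}
	manifest, root := os.Args[1], os.Args[2]

	f, err := os.Open(manifest)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	defer f.Close()

	bad := 0
	scanner := bufio.NewScanner(f)
	for scanner.Scan() {
		fields := strings.Fields(scanner.Text())
		if len(fields) != 2 {
			continue
		}
		want, rel := fields[0], fields[1]
		got, err := fileDigest(filepath.Join(root, rel))
		if err != nil || got != want {
			fmt.Printf("MISMATCH %s\n", rel)
			bad++
		}
	}
	if bad > 0 {
		fmt.Printf("restore FAILED verification: %d files differ or are missing\n", bad)
		os.Exit(1)
	}
	fmt.Println("restore verified: every file matches the manifest")
}
```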

So next time you trust the absence of a red light, or the existence of a little green light, make sure you know what you’re really trusting.

by Leon Rosenshein

2021 In Review

And what a year it’s been. Just some personal reflections. I worked from home for 11 1/2 months. 1 business trip, which was the first in ~19 months (and that hasn’t happened in about 12 years). 1 vacation. Became empty nesters as our youngest moved out. Restarted our latke fry. 1 office evacuation. 1 company change. 2 role changes. 3 different managers. And 153 blog posts.

Which was the biggest? That’s easy. Becoming empty nesters. Especially after working from home for over 20 months. It’s amazing how big a house can feel with just 2 adults and one dog. A lot quieter too. But what really made me notice was when I was on that business trip. Both my wife and I have traveled, together and separately since we got married, but for the first time in 30 years my wife spent the 6 nights at home alone. That’s a big change.

After that, it’s probably the change in roles. The move from ATG to Aurora, for as much churn as there was, really wasn’t that big of a change for me at first. My direct team and the larger group around it moved as a unit. We merged in some really sharp folks, but the work didn’t change much, nor did the environment. Picked up a few more responsibilities. Dropped a few as well. But over a few months I went from tech lead on the compute team to tech lead for Tech Foundations. Big change in scope. More indirect. More influence. But still exciting and I’m enjoying the work.

< Interlude for office evacuation >
[Image: a smoke-filled road]
It’s a little smokey around here
< /Interlude for office evacuation>

And with 153 blog posts for the year, I’m still sharing. Since we moved to Confluence I’ve got a little better access to history, so I decided to play around with the Confluence API and see what folks like.

Entry | Likes
Artificial Stupidity | 8
Drive | 6
Tenure | 6
Ignorance As A Service | 5
Exploitation | 5
Another 4 Questions | 5
Hyrum’s Law | 5
Experimental Results | 5

Alternatively, there are the ones with 30+ viewers (since August, when we started keeping track).

Entry | Viewers
Cloth Or Die | 41
More Wat? | 36
Ignorance As A Service | 35
Language Matters | 35
PR Comments | 33
Continuous Education | 32
Scrape To Taste | 31
Core Services | 30


by Leon Rosenshein

Language Matters

Hyrum Wright, of Hyrum’s Law, posted what is, in my view, a valid, reasonable, and realistic question: where do the responsibilities lie when a change to a core library/service/tool that is demonstrably good for many breaks tests for some? Is it with the person/team that made the fix? The person/team that wrote the test? The person/team that set things up so that nothing can be released with a failing test? And what should the next steps, immediate and long term, be anyway? Because in an automated, coupled system with gates to prevent bad things from happening, this good thing either brings the system to a halt or gets delayed, which means value isn’t being delivered. The right answer, as is often the case, is “It depends”.

In a small (whatever that means) project it’s a non-question. It’s the same team, so just fix it. If it’s a periodically released, externally maintained thing, then fixing the issue is just part of the periodic upgrade tax. It’s only when you have a large project, with a team of teams, that the issue of responsibility becomes something to solve for. In theory, in the Team of Teams everyone’s responsible. In practice, when everyone’s responsible, it will often appear that no-one is responsible. Dealing with that conundrum is a topic for next year though.

The thing about language, though, is that how you ask the question is as important as the question itself. In the original post Wright asked whose fault the broken tests were. Based on later discussion, and the way I took it, “fault” was shorthand for “responsible for fixing”. But that’s not what the question, as written, asked.

The literal question was about blame. That took the discussion in the team-dynamics direction, talking about dysfunctional teams and how getting into that situation is even possible. Blame is not productive or conducive to fixing problems and preventing them from happening again in the future. It might have a short term impact, but all it does is push the problem further underground and make finding/fixing it harder next time.

Our jobs are to deliver as much value as possible as quickly as possible. Playing the blame game makes that harder. That’s why language matters.

by Leon Rosenshein

Core Services

Ran across an article that says core services are a bad idea. Given my background in infrastructure and things like BatchAPI I got a little upset at the title. But I’m a tolerant person and open to learning, so I decided to read the article.

Turns out I don’t disagree with the article, but I’d probably give it a different title. What the author is calling Core Services is what I’d call shared global data. It’s the difference between platforms that services are built upon and services that are built into other services. BatchAPI, and platforms in general, come under the heading of Domain Expertise. The author says, and I agree with them, that it does make sense to have a centralized, core team to provide that expertise as a core platform.

On the other hand, consider the “User” database. Sure, there’s a service in front of the database, but really it’s just global data masquerading as an RPC call. It’s not that the data shouldn’t be shared, it’s that direct, synchronous sharing like that adds all sorts of complexity and cross domain coupling that’s not always needed. Sometimes it is needed, and needing immediate consistency is the biggest reason. You’ll need to design for those cases, but in most cases it’s not really needed. RPCs and user interactions are already asynchronous, and there’s nothing you can do to prevent the user from changing their mailing address for their account milliseconds after you print the address on the monthly billing statement. You need to handle that some other way anyway.
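
To make the contrast concrete, here’s a rough sketch (all the names are made up) of the event-driven alternative: the billing context keeps its own copy of the address and updates it by consuming change events, instead of calling the user service synchronously every time it prints a statement.

```go
package main

import "fmt"

// AddressChanged is a hypothetical event published by the service that
// owns user data whenever a mailing address is updated.
type AddressChanged struct {
	UserID  string
	Address string
}

// BillingService keeps its own read model of addresses inside its
// bounded context instead of calling the user service on every request.
type BillingService struct {
	addresses map[string]string
}

func NewBillingService() *BillingService {
	return &BillingService{addresses: make(map[string]string)}
}

// Consume applies events from the bus (a channel stands in for it here).
func (b *BillingService) Consume(events <-chan AddressChanged) {
	for e := range events {
		b.addresses[e.UserID] = e.Address
	}
}

// PrintStatement uses whatever address the billing context knows about.
// It's eventually consistent, which is all a monthly statement needs.
func (b *BillingService) PrintStatement(userID string) {
	fmt.Printf("statement for %s mailed to %q\n", userID, b.addresses[userID])
}

func main() {
	events := make(chan AddressChanged, 4)
	billing := NewBillingService()

	// The owning service publishes changes as they happen.
	events <- AddressChanged{UserID: "u42", Address: "123 Main St"}
	events <- AddressChanged{UserID: "u42", Address: "456 Oak Ave"}
	close(events)

	billing.Consume(events)
	billing.PrintStatement("u42")
}
```

The coupling is now to the shape of the event, not to the availability and latency of someone else’s service.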

The idea is that by defining strong bounded contexts and only crossing the boundary when it's truly needed you decrease design and development time coupling. Reducing that coupling increases velocity, which means producing more value sooner.

by Leon Rosenshein

Constraints and Creativity

“A puzzle is a problem we usually cannot solve because we make an incorrect assumption or self-imposed constraint that precludes a solution.”

“Creativity is the ability to identify self-imposed constraints, remove them, and explore the consequences of their removal.”

    -- Russell Ackoff

Speaking of systems and systems thinking, what you don’t think about is just as important as what you do think about. Sometimes more so. Our experience puts lots of constraints on our thinking and in most cases it makes things simpler. Sometimes though, what you know can hurt you.

It’s called the Einstellung effect. In the classic water jug experiment people were asked to measure out a set amount of water using 3 different sized jugs. The problems are presented so that a consistent pattern emerges in the first set; in later problems there’s a more efficient way, but most people stick with the pattern. It’s not wrong, but it’s not the best answer.
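
For a feel of the setup (illustrative numbers; the exact values vary from study to study), the early problems train you on the three-jug formula B - A - 2C, and then a later problem quietly allows a much shorter A - C.

```go
package main

import "fmt"

// jugProblem is one Luchins-style water jug problem: measure target
// units of water using jugs of capacity a, b, and c.
type jugProblem struct {
	a, b, c, target int
}

func main() {
	// Illustrative numbers, not a faithful copy of the original study.
	problems := []jugProblem{
		{21, 127, 3, 100}, // training problem: only the three-jug formula hits the target
		{23, 49, 3, 20},   // later problem: the formula still works, but A - C is simpler
	}

	for _, p := range problems {
		pattern := p.b - p.a - 2*p.c // the practiced three-jug formula
		simple := p.a - p.c          // the two-jug shortcut

		fmt.Printf("jugs %d/%d/%d, target %d: B-A-2C=%d, A-C=%d\n",
			p.a, p.b, p.c, p.target, pattern, simple)
	}
}
```

After a few of the first kind, most people solve the second kind the long way without a second thought. That’s the Einstellung effect at work.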

Another place you might run into it is the XY Problem. You have a problem you want to solve and start down a solution path. You know almost everything you need and are pretty sure your solution will work. You get most of the way there, but run into an issue. Now you’ve got a new problem, so you ask for help with that. Solving that problem might be the right answer, but very often I get to ask my favorite question, “What are you really trying to do?”, and that leads to a better solution to the original problem.

Being aware of the Einstellung effect is nice, but awareness isn’t a solution. Avoiding it takes work. Work to notice that it’s happening and work to do something different.

Sometimes it’s easy to notice. If you’re having trouble finding a solution to a problem that seems like it should be straightforward, that's a good sign. Another is when you get stuck implementing a solution on something that doesn’t seem to be connected to the real problem. The most subtle though, and the one to be most alert for, is patterning. If you’re applying the same solution to all of your problems that might be a sign you’re stuck.

Once you realize that you might be stuck, there are things you can do to break the cycle. The first is to simply step back. Take a break from that problem and do something else that lets your mind handle things in the background. Take a walk. Take a shower. Anything but pound your head against the same spot on the wall again.

A different option is to do something to break the pattern. Maybe brainstorming. Maybe think about all the weird Rube Goldberg ways you might solve the problem. How would you do it brute force? How would you do it in a completely different environment? What if you needed to solve the problem without whatever tool you’re currently thinking about using?

Finally, you can check your assumptions. What previous problems is the current one like? How is this problem different from all others? What are you assuming about the problem space? The solution space? The available options? Did someone tell you how you need to solve the problem or were they just offering a suggestion?

Bottom line. Removing self-imposed constraints won’t solve the problem, but it will open up new possibilities for you.

by Leon Rosenshein

Beware the RAT

Not rodentia, although Pizza Rat should be avoided, but Rework Avoidance Theory (RAT).

Back in the before times you shipped a product in a box, marketed the hell out of it, and hoped people bought it. What was in the box was it. There was never going to be more to it. If enough people liked it you might make a new version in a couple of years and try the same thing. If not, you moved on to some other idea, put that in a box, and tried to sell it. The cycle continued.

At the heart of this process was a thing called Milestone 0. Take a big chunk of the available time and dedicate it to planning. Not just figure out where you wanted to be or what the end result was, but detailed planning. What things to do. How long they would take. The order to do them in. Who would do them. When everything would be done. Good planners even accounted for vacation time.

The resulting plans were a marvel. Everything needed was accounted for. Nothing extra was scheduled. Nothing would need to be changed. The end result would be exactly what was expected at the beginning. It’s a wonderful idea. No wasted effort. No missteps. No rework. That’s what BDUF (Big Design Up Front) is all about. Being as efficient as possible.

Of course, about the time the first line of code was written the plans were out of date. By the time the product shipped the difference between the plan and the actual was usually bigger than the plan itself. There was lots of wasted effort. Lots of missteps. Lots of rework. Because we didn’t know what we didn’t know. The only way to learn was by trying, and then we’d apply the lessons learned. By doing rework.

It turns out that trying to avoid rework without perfect knowledge isn’t a way to avoid rework, it’s a way to push your rework off to the future. To a time when it’s harder to do, costs more to do, and has far greater impact on things.

Instead, accept that there will be rework, and use that acceptance to guide your path. Make decisions as late as possible. Make them as changeable as possible. Don’t make a plan, make a sea chart.

One of the best ways to do that is to take many much smaller steps. But that’s a topic for another time.