
by Leon Rosenshein

Solving Problem Solving

I've mentioned this before, and it's still true. They don't pay us to write code. They pay us to further the mission, and the ATG mission statement is _Our mission is to bring safe, reliable self-driving transportation to everyone, everywhere_. Nothing in there about writing software, or building AI models or more secure hardware. Nope. We write code (or do all the other things we're doing) in support of the mission. To solve a problem. It's a very big problem, and it's going to require software and models and hardware and all those other things to solve. So how do we get better at solving problems?

Problem solving is a skill, and like any other skill, to get better you improve your technique and practice. So what does an improved technique look like? It starts with understanding the problem. Don't just write the function, but look at the underlying problem you're trying to solve. The 5 whys are a good place to start for that. Then think about edge cases. What are they? What are the failure modes? What's the happy path? Understand your boundaries and constraints, both technical and business-wise.

Once you understand what you're really trying to do, do some research. See how others have solved similar problems. Think about analogies. What problems are like yours, but in a different field? Decompose the problem into smaller parts and research them. Look for inspiration and ideas there. Think about how to put those smaller parts together into a coherent whole.

Next, make a decision. Look at the pros and cons for the different options. Think about long and short term solutions. Think about hidden costs and future benefits. Then, just do it.

What about practice? What does practice mean in this case, and when can you do it? Start small. There are multiple times a week when we need to solve a problem. Even if you think you know the solution, go through the steps. Be mindful about it and notice how you're going through the steps. Each time you go through it, it gets easier and more natural. That's practice.

Remember, practice is about the process, not the result. The goal here is not to change your mind, but to get used to a new way of solving problems and making that method second nature. As a band teacher once said, don't practice until you get it right, practice until you don't get it wrong.

by Leon Rosenshein

Velocity Is Internal

Story points and burndown charts. Two staples of the agile planning process. They're what let you have confidence that what you're planning for any given sprint is close to what will be delivered. And they're pretty good at that. They let you compare projections and actuals over time for your team. But that's pretty much all they're good for.

One of the things they're very bad at is comparing across teams. You can make all the claims you want about common definitions for story points and capacity, but the reality is that every team goes through the 4 stages of development (remember forming/storming/norming/performing?) and gets to a point where the team consistently agrees on what a story point means to them. And once you have a consistent value for a story point and historical data on how many of those story points get done in a sprint, you can use it to predict, with some accuracy, how much work to put in a sprint and expect to get done. But that only applies to that team. Other teams have different definitions of story points and different capacities, so comparing velocity across teams is like comparing the efficiency of a regional bus and a taxi by the number of trips they make a day. With enough context you might be able to compare them, but just looking at the two numbers doesn't tell you much. If you want to compare teams using some of that information then maybe look at comparing the accuracy of their predictions. If the ratio of story points predicted to story points accomplished gets close to 1 and stays there, the team is making reliable predictions.
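
To make that concrete, here's a minimal sketch of the idea in Python. The sprint numbers are made up; the point is that the ratio, not the raw velocity, is the part worth watching over time.

```python
# Predicted-vs-accomplished ratio per sprint. The numbers are made up;
# what matters is whether the ratio trends toward 1.0 and stays there.

sprints = [
    {"predicted": 30, "completed": 22},
    {"predicted": 28, "completed": 25},
    {"predicted": 26, "completed": 25},
    {"predicted": 27, "completed": 27},
]

for i, s in enumerate(sprints, start=1):
    ratio = s["predicted"] / s["completed"]
    print(f"Sprint {i}: predicted {s['predicted']}, "
          f"completed {s['completed']}, ratio {ratio:.2f}")

# A ratio that settles near 1.0 means the team's predictions are
# reliable, whatever a "story point" means to that particular team.
```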

Another thing velocity is bad at is predicting the project completion date. And that's for similar reasons. The task list, let alone the story point estimates, made early in the cycle just isn't that accurate. So until you get close to the end and really know how many points are left, as defined by the team that is going to be working on them, you can't just divide by velocity and trust the answer.

Bottom line, historical velocity is a short term leading predictor, not a long term performance measurement. So don't treat it as one.

by Leon Rosenshein

Everything Can't Be P1

A few weeks ago I talked about priority inflation in bugs and tasks. It really happens, but it also kind of doesn't matter. You could have P0 - P3, P1 - P4, or P[RED|YELLOW|GREEN|BLUE|BLACK|ORANGE] or any other way of distinguishing priorities. That's because the priorities themselves don't mean anything. They're just a ranking. They have no intrinsic value/meaning.

And once you get past the fact that all priority is relative, and in fact only relative to the other things it was prioritized with, you realize that with a finite set of priorities, your prioritized list of bugs/tasks becomes far less useful. Consider this situation. You've got a deadline tomorrow and time to work on one more thing. You check the bug list and find there are 3 Pri 1 bugs, 10 Pri 2, and about 50 Pri 3. You can only fix one of them. What do you do?

You can quickly eliminate the Pri 2 and Pri 3 items because you know the Pri 1 items are more important. But which one? You've certainly got your opinion, and you've probably heard other people's, but you really don't know. So you pick your favorite one, or the easiest one, or the one that you think is most important. Meanwhile, the one that's really most important (as defined by the product owner/PM/whomever is in charge) languishes. And this can go on for days as new issues are found, added to the list, and prioritized.

So how do you know what's really the most important and make sure that it gets worked on? And what about the case where a series of small tasks, each not very important on its own, provides more value as a group than any other single task? That's where an ordered list of tasks comes in.

The most important thing, the next thing that should be worked on, is at the top of the list. Do that first. If you need roll-up tasks to show the value, put them on the list, not the sub-tasks. And keep the list ordered. How often you re-order depends on lots of things: where you are in the cycle, how fast the team is working through the list, and how fast things are being added to it, at least.

Your list doesn't need to be perfect, and as long as the next day or two's items are at the top, the rest doesn't matter all that much. Under the principle of "Don't make any decisions before you need to", make those decisions later, when you have more information to inform them. That way you spend less time making and remaking decisions and more time acting on them.
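
If it helps to see the shape of it, here's a toy sketch in Python. The Task class and the task names are hypothetical; the point is that "what do I do next?" becomes a lookup instead of a debate between equal P1s.

```python
# A minimal sketch of "ordered list beats priority buckets". Names and
# tasks here are made up for illustration.

from dataclasses import dataclass

@dataclass
class Task:
    title: str

# The backlog is simply kept in rank order by whoever owns prioritization.
# Roll-up tasks go on the list; their sub-tasks don't.
backlog = [
    Task("Fix crash on login"),          # rank 1: the next thing to do
    Task("Roll-up: small perf fixes"),   # rank 2: valuable as a group
    Task("Update onboarding docs"),      # rank 3
]

def next_task(backlog):
    """The most important thing is, by definition, the top of the list."""
    return backlog[0] if backlog else None

print(next_task(backlog).title)
```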

There's another dimension orthogonal to priority: severity. Severity helps inform your decision by telling you what happens if you don't address the issue, but that's a topic for a different day.

by Leon Rosenshein

Swallowing The Fish

Back when I joined Microsoft as the dev lead for the Combat Flight Sim team, Bill Gates was CEO and the mission statement was pretty simple, "A computer on every desk and in every home", with the unspoken ending, running Microsoft software. By the time I joined Uber, Satya Nadella was CEO and the mission statement had gone through a few iterations and was "To empower every person and every organization on the planet to achieve more." Since then the company has changed even more. And that means that Microsoft was able to do something very few companies of its size and market position have been able to do.

When I started, pretty much everything was in service to Windows or Office. SQL Server and Developer Division were there, but mostly to get folks to buy Windows and develop software for it, or, in today's terms, to drive the flywheel. I started out in the games group, and we were there to give people a reason to buy Windows based computers. So much so that FlightSim was canceled because its CM was only in the millions.

And that's the Innovator's Dilemma. How does a company handle a small competitor when it can make more money focusing on high end customers and ignoring those small threats? That was the position Microsoft found itself in about 10 years ago. Windows/Office was making money hand over fist and enterprise sales for DevDiv and SQL were up. And that's where the investments were being made.

Meanwhile, online, subscription based delivery and cheaper, "good enough" options were showing up. Sure, Word and Excel could do more than the Google equivalents, but for many (most?) folks they were enough. Companies have run into these problems before. The history of tech companies is full of them. From hardware (DEC, Sun, Prime), compilers (Borland, Watcom, Metrowerks), databases (Ashton Tate, Paradox), operating systems (Solaris, CP/M), and browsers (Netscape) to business software (Wordstar, WordPerfect, Visicalc, Lotus 123, Lotus Notes), there were lots of dominant players that disappeared. And in many of those cases it was Microsoft that replaced them.

So what did Microsoft do? They recognized that they needed to take a longer view. That there needed to be a short term hit to the bottom line to shift how things were done and what the goals were. They invested in new products and moving to newer business models. Or in other words, they paid down their technical debt.

And that's an important takeaway. Don't get so tied to your current way of doing things, or your current ideas, that you don't look at new ones.

by Leon Rosenshein

Pictures Or It Didn't Happen

One of the oddities of software is that the vast majority of the work done can't be seen. It's just a series of 0's and 1's stored somewhere. You can spend man years working on paying down technical debt or setting yourself up for the future and have nothing visible to show for it.

When I was working on the Global Ortho project we had something we called the MBT spreadsheet. The Master Block Tracking list. It started life as a PM's way of tracking the status of the project's 1 degree cells (the blocks) through the system, from planning to acquisition, ingest, initial processing, color balancing, 3D model generation, then delivery and going live to the public. There were only 5 - 10 blocks in progress at any given time, and it was all manual.

Then we scaled up. There were about 1500 blocks in total to process and we usually had more than 100 in the pipe at any time. It got to be too much for one person to handle. So our production staff did what they usually do. A bunch of people did a bunch of manual work outside the system and just stuffed the results in. And no one knew how it worked or trusted it. But the powers that be liked the idea, and they really liked the overview that let them know the overall state and let them drill down for details when they wanted.

So that gave us our boundaries. A bunch of databases and tables with truth, and a display format we needed to match. We just needed to connect the two. The first thing we did was talk to the production team to figure out what they did. We got lots of words, but even the people who wrote down what they did weren't sure if that was right. So we went to the trusty whiteboard. We followed the data and diagrammed everything. We figured out how things were actually flowing. That was one set of architectural diagrams. But they were only for reference.

Then we took the requirements and diagrammed how we wanted the data to flow, including error cases and loops. And it was really high level. More of a logic diagram. All of the "truth" was in a magic database. All of the rules were in another magic box. The state machine was a box. Using those few boxes, plus a few others (input, display, processing, etc.), we figured out how to handle all of the known use cases and some others we came up with along the way.
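
For flavor, here's a toy version of that state-machine-in-a-box in Python. The stage names follow the pipeline described above, but the transition table and the block id are purely illustrative, not the real MBT logic.

```python
# A toy block-tracking state machine. Stages match the pipeline above;
# the transitions (including the error loop) are illustrative only.

ALLOWED = {
    "planning":           ["acquisition"],
    "acquisition":        ["ingest"],
    "ingest":             ["initial_processing"],
    "initial_processing": ["color_balancing", "ingest"],        # loop back on error
    "color_balancing":    ["model_generation", "initial_processing"],
    "model_generation":   ["delivery"],
    "delivery":           ["live"],
    "live":               [],
}

def advance(block_states, block_id, new_state):
    """Move a block to a new state, enforcing the allowed transitions."""
    current = block_states[block_id]
    if new_state not in ALLOWED[current]:
        raise ValueError(f"{block_id}: can't go from {current} to {new_state}")
    block_states[block_id] = new_state

blocks = {"N47W122": "ingest"}          # made-up block id
advance(blocks, "N47W122", "initial_processing")
print(blocks)
```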

The next step was to map those magic ideal boxes onto things we had or could build. Database tables and accessors, processing pipelines, state machines, manual tools, etc. That led to some changes in the higher level pictures, because there was coupling we didn't have time to separate, so we adjusted along the way. But we still couldn't do anything, because this was a distributed system and we needed to figure out how the parts scaled and worked together.

And that led to a whole new round of diagrams that described how things scaled (vertically or horizontally) and how we would manage manual processes inside a mostly automated system. How we would handle errors, inconsistencies, and monitoring. How we would be able to be sure we could trust the results.

Once we had all those diagrams (and some even more detailed ones), we did it. We ended up with the same spreadsheet, but with another 10 tabs, connected to a bunch of tables across the different databases, with lots of VBA code downloading data then massaging it and holding it all together. Then we went back to the powers that be and showed them the same thing they were used to looking at and told them they could trust it.

Understandably, they were a little surprised. We had put about 3 man-months into the project and everything looked the same. They wanted to know what we had been doing all that time and why this was any different. Luckily, we expected that response and were prepared. We had all those architectural diagrams explaining what we did and why. And by starting at the high level and adding detail as needed for the audience, we were able to convince them to trust the new MBT.


by Leon Rosenshein

Optimization

Premature optimization may or may not be the root of all evil, but optimizing the wrong thing certainly doesn't help. When we talk about optimization we're usually talking about minimizing time, and the classic example says that if you have two sub-processes, one taking 9.9 seconds and the other taking 0.1 seconds, you should spend your time on the first one, because even if you manage to eliminate the second no one will notice. And that's an important thing to remember.
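
The arithmetic is worth spelling out. A quick sketch in Python:

```python
# With a 9.9 s step and a 0.1 s step, deleting the small step entirely
# barely moves the total, while a modest win on the big step dominates.

slow, fast = 9.9, 0.1
total = slow + fast                      # 10.0 s

after_removing_fast = slow               # 9.9 s
after_halving_slow = slow / 2 + fast     # 5.05 s

print(f"Remove the 0.1 s step: {1 - after_removing_fast / total:.1%} saved")
print(f"Halve the 9.9 s step:  {1 - after_halving_slow / total:.1%} saved")
# Output: 1.0% saved vs 49.5% saved
```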

But optimization goes way beyond that. What if that process was part of a system that was only used by 1% of your user base, or the system had been retired? What if the question isn't what function should I make faster, but which bug should I fix or feature should I implement next? It goes back to one of my favorite questions. "What are you really trying to do?"

Because in almost all cases, what you are really trying to optimize is where to spend your time to provide maximum user value, and when. Before you can do that you have to define those 4 terms.

And that's where scope comes in. "You" could be yourself, your team, your organization, ATG, or Uber Technologies. "User" has the same scope or larger. Are you working on something for internal consumption only, or does it accrue to an externally facing feature? Or is it something even larger, like public safety or the environment?

What does value mean? There's dollar value, but that might not be the right way to measure it. It could be time, or ease of use (customer delight), or more intangible, like brand recognition. And when will that value be realized? Will it be immediate, or will it be much further down the road?

Those are a lot of questions you need to answer before you can optimize, and sometimes we don't have all the answers. The answers also impact each other. And we have a finite amount of resources, so we can't do everything.

So what do we do? We optimize what we can see and understand with what we can do and make the optimal choices based on what we know. Then we observe the situation again and make some more optimizations. Then we do it again. And again. Kind of like a robot navigating down the road in an uncertain environment :)

by Leon Rosenshein

Cost Of Control

The other day I ran across a presentation which appeared to have a very different take on incident management. It talked about some of my hot topics, hidden complexity, cognitive load, and their costs, so I was looking forward to seeing the recommendations and how I might be able to apply them. But the presentation kind of fizzled out, ending up with some observations of how things were and identifying some issues/bottlenecks, but not taking the next step.

Thinking about it, what I realized was that what the presentation described was a C3 (command, control, and communication) system, and the difference between the examples that worked well and the ones that didn't wasn't the process/procedure, it was the way the teams worked together. Not the individuals on the team, not the incident commander (IC) or the person in the corner who came up with the answer alone, but how those individuals saw each other and worked together.

The example of a situation where an incident was well handled was the Apollo 13 emergency, and the counterexample was a typical service outage and its Zoom call. And that took me back to something I learned about early in my Microsoft career: the 4 stages of team development. Forming, Storming, Norming, and Performing. There's no question that how the IC does things will impact the team performance, but generally the IC isn't the one who does the technical work to solve the problem.

And that's how Gene Kranz brought the Apollo 13 astronauts home safely. Not by solving the problem or rigidly controlling the process and information flow, but by making sure the team had what it needed, sharing information as it became available, providing high level guidance, and above all, setting the tone. And he was able to do that because he had a high performing team. People who had worked together before, who knew what the others were responsible for and capable of doing, and who had earned each other's trust.

Compare that with a typical service outage. You get paged, open an incident, and then you try to figure out which upstream provider had an issue. You get that oncall, then someone adds the routing team and neteng gets involved. Pretty soon you've got a large team of experts who know everything they might need to know about their system, but nothing about the other people/systems involved. It's a brand new team. And when a team is put together they go through the forming/storming phases. The team needs to figure out their roles, so you end up with duplicate work and work dropped between the cracks. You get both rigid adherence to process and people quietly trying random things to see what sticks. We've got good people and basic trust, so we get through that quickly and solve the problem. Then we tear the team down, and the next time there's an incident the cycle starts again.

So what can we do about it? One thing that Core Business is doing is the Ring0 idea: a group of people who have worked together forming the core of incident response, so the forming/storming is minimized. Another thing you can do is get to know the teams responsible for the tools/functionality you rely on. They don't need to be your best friends, but having that context about each other really helps. Be aware of the stages of team development. Whether you're the IC or just a member of the team, understanding how others are feeling and how things look to them helps you communicate. Because when you get down to it, it's not command and control that will solve the problem, but communication.

Finally, if you're interested in talking about the 4 stages of teams let me know. There's a lot more to talk about there.

by Leon Rosenshein

The Art Of Computer Programming

A little history lesson today. Have you ever used a compiler? Written an algorithm? Used the search function inside a document? Then chances are you've used work by Donald Knuth. For the last 60-odd years he's been learning, teaching, doing, and writing about computer science and its fundamentals.

By far his greatest contribution to the field is his magnum opus, The Art of Computer Programming. It was originally going to be one book, but it's up to 4 volumes now and growing. It covers everything from the fundamentals of searching and sorting to numerics, combinatorics, data structures, and more.

by Leon Rosenshein

Scale Out, Not Up

If you've spent much time around folks who work on large scale data centers you've probably heard them talking about pets and cattle and wondered why that was. Yes, data centers are sometimes called server farms, but that's not why. Or at least it's not that simple.

It all goes back to the idea of scale and how that has changed over the years. 50 years ago computers were big. Room sized big. And they took a lot of specialized knowledge just to keep running, let alone doing anything useful. So every computer had a team of folks dedicated to keeping it happy and healthy. Then computers got smaller and more powerful and more numerous. So the team of specialists grew. But that can only scale so far. Eventually we ended up with more computers than maintainers, but they were still expensive, and they were all critical.

Every computer had a name. Main characters from books or movies, or the 7 dwarfs. Really big data centers used things like city or star names. It was nice. Use a name and everyone knew what you were talking about, and there just weren't that many of them. We had a render farm for Flight Sim that built the art assets for the game every night. They had simple names with a number, renderer-xx. But we only had 10 of them, so they all had to be working. Even the FlightSim team had a limited number of machines.

Eventually machines got cheap enough, and the workloads got large enough, that we ended up with 1000s of nodes. And the fact is that you can't manage 5000 nodes the way you manage 5, or even 50. Instead, you figure out which ones are really important and which ones aren't. You treat the important ones carefully, and the rest are treated as replaceable. Or, to use the analogy, you have pets and cattle. Pets are special. They have names. You watch them closely and when they're unwell you nurse them back to health. Cattle, on the other hand, have a number. If it gets sick or lost or confused, who cares. There's 5000 more just like it and it can be replaced.

That concept, coupled with some good automation, is what allowed a small team to manage the 5000 node WBU2 data center. Out of those 5000 nodes we had 28 pets, 28 semi-pets, and 4900+ cattle. 8 of those 28 pets were the namenodes in our HDFS cluster that I talked about. We really cared about them and made sure they were happy. The other 20 we watched carefully and made sure they were working, but re-imaging them was an easy fall-back if they got sick. And we really didn't care about the cattle. At the first sign of trouble we'd reimage. If that didn't work we'd take the node offline and have someone fix the bad hardware. And our users never knew. Which was the whole point of pets and cattle.
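
A rough sketch of that policy in Python; the node names and helper functions are hypothetical stand-ins for the real tooling, but the decision flow is the point.

```python
# Pets get a human; cattle get reimaged or pulled from the pool.
# All names and helpers below are made up for illustration.

PETS = {"namenode-01", "namenode-02"}      # the handful of machines that matter individually

def check_health(node):  return False      # stub: pretend the node is unhealthy
def reimage(node):       print(f"reimaging {node}")
def take_offline(node):  print(f"taking {node} out of the pool")
def page_oncall(node):   print(f"paging oncall for {node}")

def remediate(node):
    if check_health(node):
        return "healthy"
    if node in PETS:
        page_oncall(node)                  # pets get nursed back to health by a human
        return "escalated"
    reimage(node)                          # cattle: reimage at the first sign of trouble
    if check_health(node):
        return "reimaged"
    take_offline(node)                     # still sick? pull it and fix the hardware later
    return "offline"

print(remediate("worker-0417"))            # cattle path
print(remediate("namenode-01"))            # pet path
```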

by Leon Rosenshein

IP Tables To The Rescue

HDFS is great for storing large amounts of data reliably, and we had the largest HDFS cluster at Uber, and one of the largest in the world, in WBU2. 100 PB, 1000 nodes, 25,000 disks. HDFS manages all those resources, and it does it in a reasonably performant manner, by keeping track of every directory, file, and file block. You can think of it like the master file table on a single disk drive. It knows the name of the file, where it is in the directory hierarchy, and its metadata (owner, perms, timing, the blocks that make it up, their location in the cluster, etc). The datanodes, on the other hand, just keep track of the state and correctness of each block they have. For perf reasons the datanodes keep track of what info they've sent to the namenode and what the namenode has acknowledged seeing. From there they only send deltas. Much less information, so less traffic on the wire and fewer changes to process. Goodness all around. And to keep the datanodes from all hitting the namenode at once, the updates are splayed out over time.
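
To illustrate just the bookkeeping idea (this is not the actual HDFS block report protocol), a tiny Python sketch:

```python
# "Only send deltas": the datanode remembers what the namenode has
# acknowledged and reports only what changed since then. Block ids
# below are made up.

acked_by_namenode = {"blk_1001", "blk_1002", "blk_1003"}
currently_on_disk = {"blk_1002", "blk_1003", "blk_1004"}

added   = currently_on_disk - acked_by_namenode    # new blocks to report
removed = acked_by_namenode - currently_on_disk    # blocks that went away

print(f"report {len(added)} added, {len(removed)} removed "
      f"instead of all {len(currently_on_disk)} blocks")
```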

The problem we had was that our users were creating lots of little files. And lots of very deep directory structures with those tiny files sprinkled within them. So the number of directories/files/blocks grew. Quickly. To well over a billion.

At that point startup became a problem. We had a 1000 node thundering herd sending full updates to a single namenode. 1000 nodes splayed out over 20 minutes gives you just over 1 second per node. With 100 TB of disk per machine, transmitting and processing that data took well over a second. So the system backed up, connections timed out, the namenode forgot about the datanodes, or the datanodes forgot about the namenode, and the cycle started all over again. So we had to do something.

We could have turned everything off and then turned the datanodes back on slowly, but hardware really doesn't like that. Turning the nodes in a DC off/on is a really good way to break things, so we couldn't do that. And we didn't want to build/maintain a multi-host distributed state machine. We wanted a solution that ran in one place where we could watch and control it.

So what runs on one machine and lets you control what kind of network traffic gets in? IPTables of course. Now IPTables is a powerful and dangerous tool, and this was right around the time someone misconfigured IPTables at Core Business and took an entire data center offline, requiring re-imaging to get it back, so we wanted to be careful. Luckily, we knew which hosts would be contacting us and which port they would be hitting, so we could write very specific rules.

So that's what we did. Before we brought up the namenode we applied a bunch of IPTable rules. Then we watched the metrics and slowly let the datanodes talk. That got things working, but it still took 6+ hours and lots of attention. So we took it to the next level. Instead of doing things manually, we wrote a bash script that applied the initial rules, monitored the live metrics until the namenode was feeling OK, then released another set of nodes. That got us down to ~2.5 hours of automation. We still worried about it, since we were either down or had no redundancy during that time, but it was reliable and predictable.
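
Here's a sketch of that staged release in Python rather than the original bash. The namenode RPC port, the datanode host names, and the health check are all assumptions, placeholders for the real metrics-driven script.

```python
# Block every datanode's path to the namenode RPC port, then release
# them in batches while the namenode is keeping up. The port, host
# names, and health check below are assumptions, not the real setup.

import subprocess
import time

NAMENODE_PORT = "8020"                                    # assumed RPC port
DATANODES = [f"datanode-{i:04d}" for i in range(1000)]    # hypothetical names
BATCH_SIZE = 50

def block(host):
    # Drop this datanode's traffic to the namenode RPC port.
    subprocess.run(["iptables", "-A", "INPUT", "-s", host,
                    "-p", "tcp", "--dport", NAMENODE_PORT, "-j", "DROP"], check=True)

def release(host):
    # Remove the matching rule so the datanode can register and report.
    subprocess.run(["iptables", "-D", "INPUT", "-s", host,
                    "-p", "tcp", "--dport", NAMENODE_PORT, "-j", "DROP"], check=True)

def namenode_is_healthy():
    # Placeholder: the real script watched live namenode metrics
    # (RPC queue length, block report processing time, etc.).
    return True

# Block everyone before the namenode comes up...
for host in DATANODES:
    block(host)

# ...then let batches in as long as the namenode is keeping up.
for start in range(0, len(DATANODES), BATCH_SIZE):
    while not namenode_is_healthy():
        time.sleep(30)
    for host in DATANODES[start:start + BATCH_SIZE]:
        release(host)
```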

And that's how we used IPTables to calm the thundering herd.