Recent Posts (page 58 / 69)

by Leon Rosenshein

Counting From Zero

Whole numbers, counting numbers, natural numbers. Which of those sets has the number 0 in it? You'd think mathematics would be precise, but naming things is hard. Depending on who you talk to, the natural numbers include 0, or they don't. It depends.

What about array indexes? They're zero based, right? Not exactly. Yes, the C family, Lisp, anything JVM based, ECMAScript, Go, and now Rust are 0 based. Other well known languages such as FORTRAN, COBOL, R, and Matlab are 1 based. But why?

The standard reason on the zero-based side is that it's easier for the computer/compiler. Just multiply the size of an element by the index, add it to the base, and you have the address of the element. That makes sense. And back when compilers were big and slow and there were no optimizers, saving time and memory was important. Look at who uses those 0 based languages: if you're using one of them, you're probably either writing frameworks/libraries for others to use or really concerned about performance.
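As a rough sketch (not from the original post; the base address and element size are made-up numbers), here's the arithmetic the compiler gets to do when indexes start at zero:

```go
package main

import "fmt"

// elementAddress shows why zero-based indexing is convenient for the machine:
// the address of element i is just base + i*size, with no correction term.
// With one-based indexing the same formula needs base + (i-1)*size.
func elementAddress(base, size uintptr, index int) uintptr {
	return base + uintptr(index)*size
}

func main() {
	const base = 0x1000 // hypothetical base address of the array
	const size = 8      // hypothetical element size in bytes
	for i := 0; i < 4; i++ {
		fmt.Printf("a[%d] lives at %#x\n", i, elementAddress(base, size, i))
	}
}
```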

Contrast that with who's using the 1 based languages. For many folks using that set the computer is a tool that does simple math quickly. They're concerned with lift and drag or modulus of elasticity or probability or transactional cash flow. And when they think of the first item in a list they're counting things, so the first thing (1st) goes by the number 1. They're optimizing for cognitive load in their own heads while they work on the problems they're trying to solve.

So who's right? As with everything else in engineering, the answer is, it depends. Figure out what you're optimizing for, make a choice, and stick with it. Or, if you're Edsger W. Dijkstra, you could go back to how to represent a sequence of numbers and make a decision based on that.
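For the curious, Dijkstra's note (EWD831, "Why numbering should start at zero") argues for representing a sequence of N elements as the half-open interval [0, N): the length is just the difference of the bounds, and adjacent ranges fit together without overlapping. A minimal sketch of that convention, mine rather than his:

```go
package main

import "fmt"

// A sequence of n elements is the half-open interval [lo, hi):
// hi-lo is the count, and adjacent ranges share no elements.
func sum(xs []int, lo, hi int) int {
	total := 0
	for i := lo; i < hi; i++ { // upper bound excluded
		total += xs[i]
	}
	return total
}

func main() {
	xs := []int{10, 20, 30, 40}
	left := sum(xs, 0, 2)        // elements 0 and 1
	right := sum(xs, 2, len(xs)) // elements 2 and 3
	whole := sum(xs, 0, len(xs))
	fmt.Println(left+right == whole) // true: the two halves tile the whole
}
```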

by Leon Rosenshein

APIs Are For Life

APIs do lots of things. One of the most important things is that they make the work you've been doing available to others. It doesn't matter if your API is HTTP, gRPC, C++, Python, a string of characters, a bunch of 1's and 0's stored in some kind of persistent store, or something else entirely. Without an API it's unusable, locked away where no-one can get at it and you've wasted your time.

If your API is your user interface, then approach it like one. The best user interfaces don't just make it easy to do what you want, they make it easy to understand how you should be using them, and they make it hard to do the wrong thing. If you're using a metaphor for part of your API, use it for the whole thing. Ensure your objects, whatever they are, are consistently used and named. Your API is your contract, so you have to live up to it, but that goes both ways. If the user makes a mistake, you don't have to guess what they meant, just make it obvious what the mistake was.

And that's the easy part. The hard part is future-proofing your API. Once you've released your API and your customers start using it, they're going to expect you to continue to uphold the contract, so how can you change it? You can extend it. You can make it do new things and handle different cases. You can version it. You can (and should) support at least the previous version of your API. The worst thing you can do is make it subtly different, so that it appears to continue to do the same thing, but really it doesn't. You might think it's a bug fix, but does your customer? If they built an entire workflow assuming that the way it works is the right thing and you "fix" it they're going to be upset. But that's a story for another day.
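As a hedged sketch of what versioning while supporting the previous contract can look like (the paths and payloads here are invented for illustration, not from the post):

```go
package main

import (
	"encoding/json"
	"log"
	"net/http"
)

// v1 keeps returning exactly what existing callers already depend on.
func listTripsV1(w http.ResponseWriter, r *http.Request) {
	json.NewEncoder(w).Encode(map[string]interface{}{
		"trips": []string{"trip-1", "trip-2"},
	})
}

// v2 extends the contract; v1 callers never see the change.
func listTripsV2(w http.ResponseWriter, r *http.Request) {
	json.NewEncoder(w).Encode(map[string]interface{}{
		"trips":       []string{"trip-1", "trip-2"},
		"next_cursor": "", // additive field, not a silent behavior change
	})
}

func main() {
	mux := http.NewServeMux()
	mux.HandleFunc("/v1/trips", listTripsV1) // old contract, still supported
	mux.HandleFunc("/v2/trips", listTripsV2) // new contract, opt-in
	log.Fatal(http.ListenAndServe(":8080", mux))
}
```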


by Leon Rosenshein

Random Is Hard

Not only are people bad at understanding random events, seeing patterns where they don't exist and not expecting things that they should, we (and the computers we use) are bad at generating random things. The best a computer can usually do is a pseudo-random number generator: take a seed, calculate a number, and then calculate the next one from there. People have done a good job of making sure the overall distribution of those numbers is uniform, but the word calculate in that description tells you that given the same input you will get the same output. And that doesn't take into account the number of times people try to roll their own. Unless you really know what you're doing and have a good reason to, don't. You'll just get it wrong.

Of course, in many cases reproducibility is a good thing. We do thousands of simulations a day, and while we expect variability, we need controlled variability so we can have reproducibility. On the other hand, if you're writing a blackjack game and someone can figure out the way you shuffle the deck and the order of the cards because they know the seed, it doesn't matter how well distributed your random numbers are, they're not random.
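Here's a minimal sketch of that trade-off using Go's standard math/rand and crypto/rand packages (my example, not from the post): a fixed seed when you want reproducibility, an unpredictable seed when someone knowing it is the problem:

```go
package main

import (
	"crypto/rand"
	"encoding/binary"
	"fmt"
	mrand "math/rand"
)

func main() {
	// Reproducible: same seed, same "random" sequence every run.
	// Exactly what you want when replaying a simulation.
	sim := mrand.New(mrand.NewSource(42))
	fmt.Println(sim.Intn(52), sim.Intn(52), sim.Intn(52))

	// Unpredictable: seed from a cryptographic source so knowing the
	// code doesn't tell you the order of the cards.
	var seed int64
	if err := binary.Read(rand.Reader, binary.LittleEndian, &seed); err != nil {
		panic(err)
	}
	game := mrand.New(mrand.NewSource(seed))
	fmt.Println(game.Intn(52), game.Intn(52), game.Intn(52))
}
```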

Speaking of card games, just shuffling a deck is harder than you think. If you do it wrong, then even if you have a perfectly flat distribution from your generator (or even a truly random generator), you'll end up with a non-random distribution of cards. Kind of annoying if you're playing solitaire online, but imagine if you were a casino operator and someone figured out how to predict what the cards were going to be.
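The classic correct answer is a Fisher-Yates shuffle, which picks each position uniformly from the cards that haven't been placed yet; naive approaches like swapping every element with a fully random index bias the result. A small sketch, with a made-up integer deck:

```go
package main

import (
	"fmt"
	"math/rand"
)

// fisherYates shuffles in place. At step i we pick uniformly from the
// i+1 positions that haven't been finalized yet, which gives every
// permutation the same probability.
func fisherYates(deck []int, rng *rand.Rand) {
	for i := len(deck) - 1; i > 0; i-- {
		j := rng.Intn(i + 1) // uniform over [0, i]
		deck[i], deck[j] = deck[j], deck[i]
	}
}

func main() {
	deck := make([]int, 52)
	for i := range deck {
		deck[i] = i
	}
	fisherYates(deck, rand.New(rand.NewSource(1)))
	fmt.Println(deck[:5]) // first five cards of one particular shuffle
}
```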

Random is hard and applications of random are even harder. Sometimes it really does matter how random your numbers are. So think about what you're doing and how you inject enough (but not too much) entropy into your system. Or, just build a wall of lava lamps and use that as the source.

by Leon Rosenshein

The Paper Of Record

From today's everything old is new again files, Blockchains and The Gray Lady.

While the most widely known use of blockchains today is crypto-currency, all a blockchain really is is an immutable, distributed, verifiable transaction log that lets you verify not only that something is what it says it is, but that it hasn't changed from the original.

Bitcoin has been around for almost 11 years, but what if I told you that almost 15 years earlier a company called Surety started one of (if not the) first public blockchains, and used notices in the NYT's classified section as the public, distributed ledger for their hashes? Send them a document and they'd send you back the timestamped, hashed document, and then they'd fold it into their public hash, which was (and still is) published in the NYT every week.

While I'm pretty sure we don't need to publish the git SHAs of our releases in the NYT, wouldn't it be great if every bit of software we released (internally and externally) intrinsically had its git SHA and creation date immutably embedded? You'd at least be able to get back to the code it was built from. Being able to build the identical (except for creation date) thing and then do testing on it would be great too, but hermetic, reproducible builds are a topic for a different day.
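One common way to get that in Go, sketched here as an illustration rather than a description of our actual builds, is to stamp package-level variables at link time with -ldflags:

```go
// version.go - a hypothetical main package
package main

import "fmt"

// These defaults get overwritten at link time, e.g.:
//   go build -ldflags "-X main.gitSHA=$(git rev-parse HEAD) \
//                      -X main.buildTime=$(date -u +%Y-%m-%dT%H:%M:%SZ)"
var (
	gitSHA    = "unknown"
	buildTime = "unknown"
)

func main() {
	fmt.Printf("built from %s at %s\n", gitSHA, buildTime)
}
```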

by Leon Rosenshein

Programming In The Dark

We all know about serverless programming in general, and AWS Lambda in particular, but have you ever heard of Dark? Dark is a holistic programming language, editor, and backend all in one. It sounds pretty intriguing. Everything is boiled down to HTTP endpoints, datastores, background workers, and scheduled jobs. Everything else is handled by the backend. It autoscales. It's highly available. Deploys in 50ms. Built-in version control. Built-in feature flags. See live traffic in your IDE. Of course, to get those benefits you need to learn a new language and IDE, and you give up all knowledge and control of how and where things actually happen.

I'm not part of the closed Beta, so I have no idea how well it actually works, but there are a lot of good ideas there. It might not be the right answer for something that needs to support starting hundreds of new trips/second and hundreds of thousands of concurrent trips, but there are lots of use cases that don't require that kind of scale. Think about being able to focus on the problem you're trying to solve and not needing to worry about all those pesky things that distributed systems bring to the table like consistency issues, race conditions, latencies, and SPOFs.

The value isn't really in the things that you can do, but in the things you don't have to do. Imagine a world where you don't need to set up databases, servers, routers, repos, hosts, CI/CD pipelines, or test environments. Complexity and cognitive load slow us down, and easing that burden makes us more productive.

Dark isn't ready for PrimeTime yet, and may never scale to our needs, but reducing complexity and cycle time is a good thing. That's what those of us down here in the engine room are trying to do. Our value-add is making it easier for others to focus on their value-add.

by Leon Rosenshein

Lessons From Ancient Rome

A few years ago, when Uber did the first Harvard Business School classes, one of them was called Leadership Lessons From Ancient Rome. The pre-readings were a bit dense, and since they were direct translations from the original Latin they seemed a bit harsh to our way of thinking. When the class was run this year the readings weren't as hard to get through, but the intent was the same.

The basic idea was to look at leadership as a continuum along two axes: strict adherence to rules/standards, and deep devotion to a person/idea. In the class the first axis was discussed around the rule of law. If a law applies to an individual, then how much more does it apply to a leader? Things are very black and white. Either the letter of the law was followed, or it wasn't. The second axis was about following a person or idea, regardless of the cost to you or others. If you have too little of both, everyone does what seems best for themselves. Absolute adherence to the letter of the law is absolute severity: prisoner 24601 goes to jail for taking a loaf of bread to feed a starving family. Blindly following a person or idea leads to demagoguery. It is by balancing adherence to standards and devotion that justice emerges.

The question was, how does that relate to being a good leader, and how do you create an environment where people are free to do the right thing while doing their best work? How do you create a culture with sufficient guidance/direction without stifling people and creativity? What does a healthy organization look like?

While there are lots of details in the linked articles, I think it comes down to a few things. Having clear, well-defined goals and ensuring that everyone really understands what the goals and priorities are. Defining enough process and procedure so that information is shared and the left hand knows what the right is doing. Clarifying ownership and responsibilities, and then helping to resolve conflicts when there are disagreements over them. Finally, making sure that meeting the goals and priorities is what takes precedence, not who is doing it.

And if you're interested in the original readings about standards and severity let me know and I'll share them. Things were different back then.

by Leon Rosenshein

Failure Modes

How does your system fail? What are the impacts of the most common/likely modes of failure? What are the workarounds? Does it fail safe? Does it fail degraded?

Consider the lowly moving walkway. The Pittsburgh airport has a bunch of them. They just sit there, going around and around. They make your life easier. But sometimes they fail. They can fail in multiple different ways, but the most common is to simply stop moving. No-one likes that, but you know what happens when a moving walkway stops moving? It becomes a floor. It stops reducing your workload, but you can still walk on it. The same goes for an escalator. If your escalator stops moving it just becomes a set of stairs. It might be steeper than you want, and the risers at the top and bottom are likely different from the ones in the middle (that's a building code violation in most places), but you can still go up and down. The Denver airport is like that. Yes, every escalator has nearby stairs in case there's a problem, especially when they're being worked on, but if the escalator stops moving, you don't have to. That's fail-degraded.

Elevators have a different failure mode. When they stop moving, at best they're closets. The closets are safe, but there's no useful functionality. Fire doors often have a mechanism to hold them open, but if something happens and the power fails, they close. Not ideal, but safe. Back at the airport again, that luggage cart you rented has a bar you need to hold on to or it won't move. If the linkage breaks the cart doesn't roll away, it's stuck. Truck and train air brakes work on the same principle. If there's no pressure the brakes are full on. You need to apply the baseline pressure to release the brakes, then add more to engage them again. If the hose/line breaks, you stop. Fail-safe. Your car, on the other hand, won't stop if there's a hole in the brake line.

So how does this apply to software? How do you respond to failures of subsystems or bad input? Can you keep operating, store things in some durable fashion, and then catch up? How do you ensure that you don't make things worse by only applying part of a change? If your snappy local cache is down, can you go back to the source of truth yourself? Or do you just say no and let the upstream thing, with more (hopefully enough) knowledge of the situation, do the right thing?
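A hedged sketch of the fail-degraded version of that, with invented cache and store types standing in for the real things:

```go
package main

import (
	"errors"
	"fmt"
)

// getter is anything we can read a value from.
type getter interface {
	Get(key string) (string, error)
}

// mapStore stands in for the slow-but-authoritative source of truth.
type mapStore map[string]string

func (m mapStore) Get(key string) (string, error) {
	if v, ok := m[key]; ok {
		return v, nil
	}
	return "", errors.New("not found")
}

// downCache stands in for a fast path that's currently broken.
type downCache struct{}

func (downCache) Get(string) (string, error) {
	return "", errors.New("cache unavailable")
}

// lookup fails degraded: if the cache is down we fall back to the store.
// Slower, like an escalator that's become stairs, but still working.
func lookup(cache, store getter, key string) (string, error) {
	if v, err := cache.Get(key); err == nil {
		return v, nil
	}
	return store.Get(key)
}

func main() {
	v, err := lookup(downCache{}, mapStore{"trip-1": "active"}, "trip-1")
	fmt.Println(v, err) // "active <nil>": degraded, not dead
}
```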

by Leon Rosenshein

Wall Of Perl

Back in my Flight Simulator days I ran the build system. The build system supported about 15 developers, 20 testers, and 30 artists. We did daily builds and internal releases of code and content. No check-in builds, no automated unit tests. Code builds were easy. Fire up MSVC from the command line with the right options and wait. If it failed, send email to lots of people.

Content builds were a little different. We used 3DStudio Max for models and custom plugins for model export/processing. We had another custom tool to take textures (images saved in a lossless format) and convert them to pyramided DirectX texture files. And we had lots of models/textures. Around a terabyte of content source, and this was in the late '90s, so large for the time. The only way to make it run in a timely fashion was a distributed build farm (which during the day was the artists' development machines). And holding the whole thing together was a set of Perl scripts.

From the beginning, Perl has always surprised me by how terse it is. Working with lists, whether modifying them or just acting on them, is easy and (usually) clear. It has pretty close integration with the underlying OS, so acting on system resources (usually files) is straightforward. It's a procedural language, so it's familiar to most developers. It's strongly typed in that there's some enforcement at runtime, but it's loosely typed in that there are no compile-time checks, and you may need your actual data to know whether things really work.

Perl gets a bad rap because there is a lot of read-only Perl code out there. Particularly since a big use of Perl is string handling, a lot of Perl code is replete with regular expressions, extracting data from strings and then using it to do something else. In the spirit of transparency, I admit that I've written code in Perl, then come back the next day and been unable to figure out what I was doing, even knowing what I was trying to accomplish.

On the other hand, Perl is still amazingly powerful and useful. Larry Wall is the Benevolent Dictator For Life, and does a great job. Unlike other languages that went through a painful schism when updating to a new major version, Perl 5 is still the Perl; what might have become Perl 6 has spun off into its own language (Raku), and Perl continues to grow with active community support. There's even Perl being used in the NA toolset. So what's your experience with Perl? Share in the thread.

by Leon Rosenshein

On Testing

My title is Senior Software Engineer II. I work on Infrastructure. No mention of customers or testing in there, but I'm a tester. I write and execute tests to make sure my customers always get what they expect, and if something goes wrong *I* know before they do. It wasn't always that way.

Back in the day, there were different roles, and they kept to themselves. PMs talked to business leaders and customers and published the requirements. Devs wrote some code, hopefully related to the requirements, and tossed it over the wall to the testers, who tried to run it based on their understanding of the requirements and told the developers it was broken. Lather, rinse, repeat. Eventually something got sent to customers, and then the patching, which went through the same cycle, began.

Now, not so much. There are lots of PM types, the walls between roles are much more porous, and the test org is largely gone. No STEs (Software Test Engineers), no SDETs (Software Development Engineer in Test), no SEs (Sustained Engineering). Instead we have lots of different kinds of tests and times/places to run them. There are Dev Tests, Unit tests, Functional Tests, Integration Tests, UI Tests, Simulations, Staging Environments, Black Box Tests, Chaos Monkeys, A/B experiments, and HealthChecks. So how do you know which ones to use and when?

That depends on where you are in the product life cycle and what you're trying to accomplish. Unit tests are great for making sure things do the right things when the inputs are defined and give the proper error when expected problems occur. They should be written early in the cycle and executed all the time to make sure unexpected changes don't sneak in. And of course, they should be completely deterministic, carrying all the data they need and not relying on any outside functionality.
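For illustration, here's what that kind of test can look like in Go; the function under test is made up, and everything the test needs is carried in the test itself:

```go
package fare

import "testing"

// Surcharge is a hypothetical function under test: the fee for a trip of
// the given distance, with a negative result signaling invalid input.
func Surcharge(km float64) float64 {
	switch {
	case km < 0:
		return -1 // expected problem, reported deterministically
	case km <= 5:
		return 0
	default:
		return (km - 5) * 0.5
	}
}

// TestSurcharge carries all of its data: no clock, no network, no randomness.
// (In a real project the test would live in its own _test.go file.)
func TestSurcharge(t *testing.T) {
	cases := []struct {
		name string
		km   float64
		want float64
	}{
		{"invalid input", -1, -1},
		{"under threshold", 3, 0},
		{"over threshold", 9, 2},
	}
	for _, c := range cases {
		if got := Surcharge(c.km); got != c.want {
			t.Errorf("%s: Surcharge(%v) = %v, want %v", c.name, c.km, got, c.want)
		}
	}
}
```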

At the other end of the cycle is black box testing and health checks. Assuming your system is deployed and running, use it like a customer would and make sure you get the correct responses. If you do this kind of testing regularly you stand a good chance of finding out there's a problem at the same time your customers do, possibly even sooner.
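A minimal sketch of that kind of black-box check, assuming an HTTP service with a /health endpoint (the URL and endpoint are illustrative):

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

// checkHealth exercises the service the way a customer would: over the
// network, against the deployed endpoint, expecting a real 200.
func checkHealth(url string) error {
	client := &http.Client{Timeout: 5 * time.Second}
	resp, err := client.Get(url)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("unexpected status: %s", resp.Status)
	}
	return nil
}

func main() {
	// Run on a schedule, not just once, so you hear about problems no
	// later than your customers do.
	if err := checkHealth("https://service.example.com/health"); err != nil {
		fmt.Println("health check failed:", err)
	}
}
```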

In between are all kinds of tests that focus on different things. Chaos monkeys for dealing with random problems. Integration tests to make sure your assumptions about another system are valid. Staging environments for medium scale testing. A/B experiments for large scale testing. 

Martin Fowler has a whole page of articles about testing and when and how to use them.