Recent Posts (page 53 / 65)

by Leon Rosenshein

Panzer Is NOT German For Tank

Or at least it wasn't originally. When I was working on Combat Flight Sim 2 we released 5 fully localized versions. Not just English, but FIGS (French, Italian, German, and Spanish). Since it was Microsoft, the process of localization was well defined. We did the easy things like using the correct locale for dates and numbers, and all of the UI and in-game text were resource based. We had 3rd party vendors to provide the resources and people trained in testing such things.

Of course, word order can differ between languages, so our strings were tokenized as well. We didn't just localize the words, we localized the format strings so we could handle both word order and gender matching as needed. We'd just pass in enough strings (which were localized resources themselves) to the formatter and it would pick and choose which ones to use. And we did the same thing for our sound UI. Radio communications were done as voice instead of text, so we had to do the same kind of audio buildup. From a programming standpoint that wasn't much different from the text, but it did put a big load on our sound team. They had to make sure all of the snippets were done in a way that they could be strung together in some random order and sound right. That was a huge number of combinations, and my hat's off to them. Somehow they were able to do it.
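For a feel of what localized format strings buy you, here is a minimal Python sketch. It is not the code we shipped; the resource tables, token names, and the render function are all made up for illustration. The point is only that the format string, not the calling code, decides word order and which tokens get used.

```python
# Hypothetical per-locale resources; a real product would load these
# from localized resource files rather than hard-code them.
FORMATS = {
    "en": "{count} {adjective} {noun} destroyed",
    "fr": "{count} {noun} {adjective} détruits",
}
WORDS = {
    "en": {"adjective": "enemy", "noun": "tanks", "unused": "n/a"},
    "fr": {"adjective": "ennemis", "noun": "chars", "unused": "n/a"},
}


def render(locale, count):
    """Fill the locale's format string from the locale's word table.

    The caller passes every token it has; str.format silently ignores
    keyword arguments the format string does not mention, so each
    locale's format picks and orders only the tokens it needs.
    """
    return FORMATS[locale].format(count=count, **WORDS[locale])


print(render("en", 3))  # 3 enemy tanks destroyed
print(render("fr", 3))  # 3 chars ennemis détruits
```

The calling code is identical for both locales. Only the resources change, which is what lets a third-party localizer fix word order without touching the program.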

All of that was relatively straightforward. The place we got bit was more cultural. As is typical, we did our UI wireframes in English. And there were cases where our UI got dense. Orders of battle, post action reports, pilot statistics. And our designers made it look good. There were several places where we had "Tank(s)" in the UI. And knowing that some languages use longer words than English, there was room for some extra characters. Then the localizers got involved. We got back the new resources, plugged them in, and tried it out. It worked, but the visuals, not so much. Turns out that instead of "Panzer", which we expected, we got back "Panzerkampfwagen" (armored combat vehicle). That's correct, but it didn't fit. Germans also don't like acronyms. They tend to make up new words out of groups of words, so we had to deal with that as well. In some cases we had to re-work the UI. In others we used modern words (like Panzer) instead of the more temporally appropriate terms.

So the moral of the story? Be flexible. Be adaptable. And remember, not everyone does things the way you do.

by Leon Rosenshein

Quality Is Job #1

Ford may have stopped using that tag-line over 20 years ago, but it’s still relevant. Quality is more than building something that usually works or ‘meets requirements’. It’s more than that. It’s something that not only works when things are perfect, but degrades gracefully and in a controlled manner when things aren’t. It’s not just done according to spec, it’s something the user thinks is done. It’s something that provides more value than cost. It’s something that delights. And it applies at every level. From the RFC describing what you’re building, to the entire SDVP ecosystem, quality stands out. You can rely on it. Does it cost more? Very often it does, but it provides even more value.

Quality is stable and reliable. As of last week, grep is 45 years old. Yet we still use it every day and you never wonder if it missed something. You might wonder if you got your regex correct, but you know grep did its job correctly. Yes, it’s been updated and new features have been added, but someone who used it 45 years ago would still be comfortable using it today. That’s quality. Similarly, the Douglas DC-3/C-47/Gooney Bird. First flown in 1936, with production ending in 1945, there are still over 1000 flying today. If you bought one you certainly got your money’s worth.

So how does that relate to us? Simple, and it goes back to Ford’s tag-line. Quality isn’t something you bolt on later. It’s something you build in. From the use cases to requirements to implementation to operationalizing and support. Regardless of your function, role, title, or part of the product life cycle you work on, quality is job #1.

by Leon Rosenshein

Mini-Maxing Your Java

"It is by caffeine alone I set my mind in motion.
  It is by the beans of Java that thoughts acquire speed,
  the hands acquire shakes, the shakes become a warning.

  It is by caffeine alone I set my mind in motion."

    - The Programmer's Mantra

I first ran across that line in the mid 90s, and to the best of my sleuthing, it was pretty new then. Now coffee has never been my preferred source of caffeine, but I've used caffeine to great effect. Back in the day software was shipped in boxes, and you had to reserve time in a production facility to print/box/ship things. That was expensive, and there might not be another spot in the schedule for a while, so meeting dates was pretty important. Factor in that I was working for a gaming company, and if you didn't have your boxes on the shelves in time for Black Friday you might as well wait for next year. That also meant that production houses were booked solid and the next available slot was after the deadline, so any slip meant you would miss the holiday. Because of that, I was part of more than one 3-month-long, caffeine-fueled, tech-debt-generating crunch mode.

And you know what, they mostly worked. We shipped, and the company didn't go bankrupt. They weren't without cost though. A big part of the team would quit, we'd spend a few months patching things to fix problems, and then have to figure out how to move forward with a new version or the next great thing. But those were issues for another day.

Nothing beats getting enough sleep for ensuring you're at peak effectiveness, especially over extended periods of time. Chronic sleep deprivation will really mess you up, both short term and long term. The negative impacts range from forgetfulness and lowered alertness to high blood pressure and strokes.

And while I absolutely don't recommend it, especially for extended periods, sometimes it is justified and you do need to spend the extra time and just get things done. In that case, caffeine can be a big help. But like anything else, you want to maximize the impact. So how do you know just how much coffee to drink, and when? Well, the Army has thought about it, and while they're not quite ready to release the app, there's a beta website you can use to get an approximation. And the tech is licensable. Wonder if we should add it to the Driver/Operator console.

by Leon Rosenshein

Shift-Left Is Silly

Ok, the idea isn't silly, but…

When I started in this business it was developers, developers, developers and the ubiquitous Ballmer Monkey Dance (I was there for that). Write some code, throw it over to the testers, ship it over their objections, and then release updates. Maybe there was a PM (Program Manager) in there to mention the customer. Then came agile. We renamed our software test engineers (STE) to software development engineers in test (SDET), asked them to do some automation, called the PM a scrum master, then shipped the code. Then came devops, and the operations folks joined the team. The team charter got broader and included observability and alerting. Now they're talking about the wisdom of DevSecOps, saying we need to get security involved earlier. And again, they're not wrong, but…

I'm a Mechanical Engineer by training. Mechanical engineering is _old_. It's as old as the screw, stick (lever), rope, ramp, and wheel. And over those thousands of years mechanical engineering has learned a few things. Not simply, and not without pain, and not without big mistakes (ever seen the video of the Tacoma Narrows Bridge wiggling?), but like the Farmers guy said, "We know a thing or two because we've seen a thing or two." And one of the most important isn't engineering per se, it's everything that happens *around* the engineering. From building the pyramids to the Manhattan Project to Apollo, it's not just the results that are impressive, the system required to build them is impressive too. Coordinating 10s of thousands of people in hundreds of locations, and when you get done the result works. Learning how to do that is at least as big of an accomplishment as the thing being accomplished. We don't always get it right, but we try.

So what has that got to do with Shift-Left? Well, all Shift-Left is saying is that in software development you can break down a system into components and implement them separately, as long as you think about the entirety at the beginning. You need requirements, designers, engineers (software and hardware), program/project managers, testers, operations, security, supply chain management, and anyone else that might be involved going from an idea to a product.

And that's another way to describe Systems Engineering.

by Leon Rosenshein

The Wisdom Of James Mickens

Ex-Microsoft, Harvard prof, Distributed Systems Guy, Philosopher? James has been in the business of computer science and thinking about computer science for a long time. Besides doing the work he's been writing about the work. Whether it's about the life of a pointer-math-coding systems programmer (The Night Watch) or protecting yourself from the Mossad (This World of Ours) or the dream of the web (To Wash It All Away), he gives cynical, humorous, and honest insights into some of the things we deal with every day. Take a look. There's a lot to see and think about.

by Leon Rosenshein

Hofstadter's Law

"Hofstadter's Law: It always takes longer than you expect, even when you take into account Hofstadter's Law"

Let's be honest. Estimation is hard. It doesn't matter if you're talking about digging a ditch, building a house, doing year end perf feedback, or planning a sprint, let alone a whole software project. We estimate and we get it wrong, over and over again. There are lots of good reasons for it. From rocks, roots, and pipes where you want the ditch to shifting software requirements, you don't know what you don't know until you know it. Then there are the psychological aspects.

The Planning Fallacy, which says we don't consider past examples when we make estimates and that we ignore hidden complexities, can lead us to underestimation. Even when we take that into account, there's Optimism Bias, which says the future will be better than the past. So again, we under-estimate.

So what's a poor software developer to do? First and foremost, be aware of your own biases. Second, make sure you understand what you're estimating. How much like something else is it really? How significant are the differences? What are the external influences that are going to bend the design halfway through? Third, use your history. If you (and your team) are always 25% under, don't change anything (NB: This is hard. You'll want to reduce the estimate so the adjusted number looks right. Resist), just add 25%. Fourth, break things down. If you can't break it down into smaller tasks you don't understand it. If you don't understand it the unknown unknowns will kill you.
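The "use your history" step can be sketched in a few lines of Python. This is one possible way to do it, not a prescribed method: the function names and the sample numbers are made up, and it assumes you have past (estimated, actual) pairs recorded in the same unit.

```python
def historical_ratio(history):
    """Return total actual effort divided by total estimated effort.

    history is a list of (estimated, actual) pairs in the same unit.
    A ratio of 1.25 means work has taken 25% longer than estimated.
    """
    total_estimated = sum(estimated for estimated, _ in history)
    total_actual = sum(actual for _, actual in history)
    return total_actual / total_estimated


def adjusted_estimate(raw_estimate, history):
    """Leave the raw estimate alone and scale it by the historical ratio."""
    return raw_estimate * historical_ratio(history)


# Made-up history in days: every task took 25% longer than estimated.
past = [(4, 5), (8, 10), (12, 15)]
print(historical_ratio(past))       # 1.25
print(adjusted_estimate(20, past))  # 25.0
```

Keeping the raw estimate and the correction separate is the point: you estimate the way you always have, and the history does the adjusting.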

Finally, avoid feature creep, or at least acknowledge it and add some time to the schedule for every addition/change/update to what you're doing. Sure, that new idea is important, and you should probably do it, but it will take time, so account for it.

by Leon Rosenshein

Surprise And Delight

You read about how you should always surprise and delight your customer, but really, you just want them to be delighted. And they really shouldn't be surprised to be delighted. An unexpected new feature they like is OK, but really indicates a failure to properly market the new feature and create demand. The flip side of surprise is when your customer expects something to work and it doesn't.

There are lots of ways that can happen. Some are accidents or outside of our control, like when a worker violates safety protocol, shorts 5 MW to ground, and blows all the transformers for your new DC. Others could have been predicted in retrospect, like when a new dataset is sufficiently different from the usual ones that it causes some processing to fail. The final kind is when you know it's going to happen, but you still surprise your customers.

That last kind happens, but it really never should. I'm not saying that you can't make breaking changes or that you can't do heavy maintenance with downtime, just that your customers shouldn't be surprised. And that's called Change Management. There are lots of ways to do it. In the physical world there are whole departments dedicated to managing the change management process, from the reason for the change through the scheduling and implementation of it. If you need to make a change to a multi-million dollar production line you better have a pretty good idea of what you're doing.

Another kind of change management is a calendar. Just list and publicize changes that might impact others. It's based on the honor system. List the things you're doing and when, and watch the list for changes that might impact you. It's decentralized and can be low impact, but it's not no impact. You need to take the time to look for what might impact you for two reasons. First, there's no central authority that will do it for you, and second, even if there was, they don't know everything you do, so they might miss something.

So how do we do this? Right now we have 2 basic mechanisms: RFCs and blameless postmortems. The RFC process lets you know what's coming and you can think about how it interacts with whatever it is you do. In many cases you can ignore the implementation details as long as you understand the "what" and "why" well enough to know how you interact with it. If there is some interaction, then you can figure out if it's breaking and what you need to do to coordinate.

The second is less for libraries and more for services/tools. If you have a planned outage/upgrade/scream-test let people know. Create a pro-active incident. Send out a few announcements before it happens. Give people time to respond, then remind them when it's about to happen.

Now go and delight your customers without surprising them.

by Leon Rosenshein

Project Euler

One of the nice(?) things about being a developer is that there's always something new to learn. It could be a design pattern, it could be a framework or library, or it could be a whole new language. For all of those, I find the best way to really learn is by doing. The problem with that is you need something to do. It could be a problem from work, it could be part of your hobby, or it could be something that came up as part of daily life (that's where my favorite interview question came from).

Besides using it for interviews, I use it to help figure out new languages and patterns. There are iterative solutions, recursive ones, tree-based, and additive. I've written those solutions in 7 or 8 languages now, and it forms a decent basis for understanding a language syntax and environment, but there could be more.

That's where Project Euler comes in. It's a set of 700+ (and growing) math based problems you can use for lots of things. Learn a new language. Learn some new techniques. Understand some mathematical underpinnings better. Or you could just do it for fun :)
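To give a flavor, the first problem asks for the sum of all the multiples of 3 or 5 below 1000. A brute-force Python sketch is below; the function name is just for illustration, and the fun part is redoing it in a language you don't know yet, or finding the closed-form answer.

```python
def sum_of_multiples(limit):
    """Sum every natural number below limit that is divisible by 3 or 5."""
    return sum(n for n in range(limit) if n % 3 == 0 or n % 5 == 0)


print(sum_of_multiples(10))    # 23 (3 + 5 + 6 + 9)
print(sum_of_multiples(1000))  # 233168
```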

by Leon Rosenshein

Debugging

It's always the little things. Ran into a problem yesterday with the opus tool and pushing to the WBU2 docker registry. Docker is a wonderful thing, and we've made it scale to Terabyte registries serving data to 1000's of nodes simultaneously, but it has its warts. Crossing the CORP/PROD boundary and http/https are a couple of them. On OSX you have the added bonus of docker-machine, so there are lots of moving parts. Add in Kraken for distribution and there are lots of places to look when you can pull but not push.

I'm on-call and we got notification that people were having trouble pushing images. At first it was an isolated incident, so it must be that person's machine, right? Start down the debug path. What OS? What version of the software? What network? How big is the image? Are your tunnels working? Everything seems to be ok there, and look, another user with the same problem.

Keep digging. Hey look, there are error messages in the log. Something about "seek after file closed". But wait, the file referenced in the error message isn't there. In fact the folder path doesn't exist after 4 levels. Something's really not right here. Sounds like a disk error. Must be the hardware. Nope, or at least nothing in the logs, and this is a distributed system, and all of the logs are showing similar messages, so probably not a specific disk.

Keep digging. What's common to all of the nodes? The persistent store, HDFS. Well that's working. Lots of reads/writes. Check the quota. Well below the limit for both space and objects, so that's not it. FSCK says that HDFS is OK. Maybe someone changed perms on a file? Let's check the perms and owner of some files.

Oh, that's interesting. hdfs dfs -ls crashed with a GC out of memory error. That shouldn't happen. The only time I've seen that is when there are too many files in a folder. Let's try counting the files. hdfs dfs -count -h shows 1M files. Now that's an interesting number. Did you know that the default limit in HDFS is 1M files/subfolders per folder? If you try to write more it fails.
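A check like that is easy to automate. The Python sketch below is an assumption-laden illustration, not something we ran during the incident: it parses one line of plain hdfs dfs -count output (without -h, so the columns are the directory count, file count, content size in bytes, and path) and compares the total against 1,048,576, the default for dfs.namenode.fs-limits.max-directory-items. Note that -count is recursive while the limit applies to a folder's direct children, so this is only a rough early warning. The function names, the 80% threshold, and the sample path are made up.

```python
# Default value of dfs.namenode.fs-limits.max-directory-items.
HDFS_DEFAULT_MAX_DIRECTORY_ITEMS = 1_048_576


def parse_count_line(line):
    """Parse one line of `hdfs dfs -count <path>` output (no -h flag).

    Expected columns: DIR_COUNT FILE_COUNT CONTENT_SIZE PATHNAME.
    """
    dirs, files, size, path = line.split(None, 3)
    return {
        "dirs": int(dirs),
        "files": int(files),
        "bytes": int(size),
        "path": path.strip(),
    }


def near_item_limit(counts, limit=HDFS_DEFAULT_MAX_DIRECTORY_ITEMS, threshold=0.8):
    """Return True once dirs + files reaches threshold * limit.

    The counts are recursive, so this overestimates for deep trees.
    For a flat folder of temp files it is close enough to alert on.
    """
    return counts["dirs"] + counts["files"] >= threshold * limit


# Made-up sample line; the path is hypothetical.
sample = "           1      1000000   1700000000000000 /example/registry/_upload"
counts = parse_count_line(sample)
print(counts["files"])          # 1000000
print(near_item_limit(counts))  # True
```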

Time for a workaround. Rename the folder from _upload to _upload_old. Pushes start working. Great. We've mitigated. Now to figure out what happened. First, actually look at the files in that folder. A few magic HDFS env variables later and we've got a list of files. Turns out files that were supposed to be temporary weren't. There are files in there from 18 months ago, and the size of the folder is 1.7 PB. That also explains why we're using so much space in our registry.

At this point, as on-call, I'm done. I've mitigated the issue and gotten our customers moving, opened issues with the Kraken team to better manage their space so it doesn't happen again, and hey, I've even got a post-incident review here.

But there are a few lessons to take out of this. First of all, clean up after yourself, and make sure you do. Second, error messages w/ incorrect info send you down the wrong path. Yes, the file that was being seek'd on was closed, because it didn't exist. It didn't exist, because a write failed, but that error wasn't logged, so we didn't know. We would have saved a lot of time if the error was

"ERROR: Unable to write to hdfs://oppusprodda3e-2/infra/dockerRegistry/_uploads/<filename>"

or something like that. We would have gone right to that folder and seen the problem. Third, don't just log errors to stdout, emit metrics and alert on them. User complaints should never be your alerting mechanism.
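The second and third lessons can be sketched together in Python. Everything here is hypothetical: write_blob, UploadWriteError, and the in-memory METRICS counter are stand-ins, not the registry's or Kraken's actual code, and a real service would use its own metrics client with an alert on the counter.

```python
import logging
from collections import Counter

log = logging.getLogger("registry")

# Stand-in for a real metrics client; alerting would hang off this counter.
METRICS = Counter()


class UploadWriteError(Exception):
    """Raised when a blob cannot be written; the message names the path."""


def write_blob(write_fn, path, data):
    """Write data to path using write_fn, failing loudly and specifically.

    write_fn is any callable taking (path, data) that raises OSError on
    failure. On failure we count it, log the exact path, and re-raise
    with the path in the message, so nobody later trips over a vague
    "seek after file closed" on a file that was never created.
    """
    try:
        write_fn(path, data)
    except OSError as exc:
        METRICS["upload_write_failures"] += 1
        log.error("Unable to write to %s: %s", path, exc)
        raise UploadWriteError(f"Unable to write to {path}") from exc
```

The error surfaces at the point of failure with the path attached, and the counter gives you something to alert on before the first user complains.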

by Leon Rosenshein

Always Learning

Things you know, things you know you don't know, things you don't know you don't know, and things you know that just ain't so. 4 categories of information. You know the first, and I've talked about the 4th. This is about the 2nd and 3rd. Whether it's starting a new job, joining a new team, getting assigned a bug in code you didn't write or something else, we've all been in the situation where we know enough about the situation to know that we aren't experts in it. I'm not talking about imposter syndrome or Dunning-Kruger. I'm talking about having to do something that's at least new to you, or maybe even something no-one has done before.

Back in the early days of my career I was working for a small aerospace company and we had a turn-based air combat simulation where we would have pilots come in and make inputs for every ¼ second (stick, throttle, switch/buttons, etc) and then the computer would advance time for all players and then it would happen again. We also had a nifty SGI 70-GTX that could do 1024x768 graphics at 30 Hz, but it wasn't doing anything. At some point we realized that it took 3-5 minutes for all the inputs and < 1/10th of a second for the turn to play out and that we could do a real-time simulation. My boss told me about this then-new thing called Ethernet that could make computers from different companies talk to each other and asked me to "Make it work". I was fresh out of school, so I said yes. And then the fun started. What's a network? What's double buffering? What's big-endian vs little-endian? Then there were the coding questions. And version control. And planning. But the biggest problem I ran into was getting the Ethernet system working. It was a thicknet backbone with vampire taps and drop cables. And it just wouldn't work. Drivers on both sides seemed to work, but I couldn't ping. The heartbeat light on the tap wouldn't come on. I got a flat spot on my forehead from trying things over and over again but nothing worked.

Then I dropped a screwdriver behind the computer and the heartbeat light came on. Huh? Picked up the screwdriver and it went off. Put it back down and the light came on. Then I realized that it was shorting out the outer and inner conductors on the coaxial cable. And that's when the lightbulb came on in my head. I had read something about a `terminating resistor` in a doc somewhere, but there were no details and I forgot about it. But it was kind of important. To make the two conductors into a bus you need the resistor. Got the parts, made it work and then we had a real-time man in the loop simulation.

A bunch of learnings in there. Be confident. Dig in. Study. Try things and keep notes. And probably the biggest one, ask questions. Once I had it working I told my boss what I did and what the problems were. He said he wished I had come to him sooner. I was a mechanical/aerospace engineer. He was a mechanical/electrical engineer. He knew about electrical buses. He could have saved me days and gotten us to success that much sooner.

So try things. You'll learn a lot that way. But don't forget that you have more resources than Google and Stack Overflow. There are folks around you that have a lot of context, and generally speaking, they're happy to help. So try first, then ask a good question.