We've got lots of them. Almost always there's not one right answer for a given set of constraints, but there are often wrong ones. One of our jobs as engineers is to balance those constraints and make the right trade-offs for now and for what we know about the future. We also need to keep in mind that the constraints can (and will) change over time, and the "right" answer might shift.
Consider joining two text files and doing some simple analysis on the results. You can do it with everything from an awk one-liner to a giant Spark job that uses tens (hundreds?) of nodes. Which one you choose depends, among other things, on how big your data is, where your data lives, how often you want to do the analysis, and how important it is that the analysis happen unattended on a schedule. If it's just you working on a few MB of data, and you know awk, then that one-liner might be the right answer. If, on the other hand, you're dealing with TB of data in an overnight pipeline that needs to be done by 6:00 local time the next morning, then loading the data into Hive on ingestion every day and running a distributed query in a managed workflow system might be the better choice.
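For the small end of that spectrum, here's a minimal sketch of the awk-one-liner approach. The file names and fields are hypothetical: users.txt maps an id to a name, orders.txt maps an id to an order amount, and we join on id to total spend per user.

```shell
# Hypothetical sample data: "id name" and "id amount" pairs.
printf '1 alice\n2 bob\n' > users.txt
printf '1 10\n2 5\n1 7\n' > orders.txt

# First pass (NR==FNR is true only for the first file) loads users.txt
# into an array keyed by id; the second pass accumulates order amounts
# keyed by the joined name, then prints the totals sorted by name.
awk 'NR==FNR { name[$1] = $2; next }
     $1 in name { total[name[$1]] += $2 }
     END { for (n in total) print n, total[n] | "sort" }' users.txt orders.txt
```

The same join-then-aggregate shape shows up at every scale; here it's two awk passes over files that fit in memory, while the Hive version would express it as a JOIN plus GROUP BY over partitioned tables.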