Often I find myself wanting more information about the Python functions I'm running, whether to debug, log, or time them. All of these are relatively well-defined problems (debugging excepted). Unfortunately, from my research, no tool makes it easy enough to see the input, output, time elapsed, errors, warnings,…
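As a rough illustration of the kind of visibility described above, a decorator can record a function's input, output, elapsed time, and any exception in one place. This is only a minimal sketch under my own assumptions, not the tool the post goes on to discuss; the names `trace`, `records`, and `divide` are hypothetical, and warnings are not captured here.

```python
import functools
import time


def trace(records):
    """Build a decorator that appends one dict per call to `records` (hypothetical helper)."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            entry = {"name": func.__name__, "args": args, "kwargs": kwargs,
                     "output": None, "error": None}
            start = time.perf_counter()
            try:
                entry["output"] = func(*args, **kwargs)
                return entry["output"]
            except Exception as exc:
                # Record the error, then let it propagate unchanged.
                entry["error"] = repr(exc)
                raise
            finally:
                # Runs on both success and failure, so every call is logged.
                entry["elapsed"] = time.perf_counter() - start
                records.append(entry)
        return wrapper
    return decorator


records = []


@trace(records)
def divide(a, b):
    return a / b
```

Calling `divide(6, 3)` appends an entry with the arguments, the result, and the elapsed seconds; calling `divide(1, 0)` still raises `ZeroDivisionError`, but the entry's `error` field holds its representation.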

The Power of Hadoop in under 10 lines!

Okay, okay, I may have oversold it a bit, but here are fewer than 10 lines of bash that resemble (if you squint really hard) Hadoop/MapReduce.

    code_to_run=$1
    in_file=$2
    out_file=$3
    split -d -a 5 -l 100000 $in_file $in_file"_" && \
    ls $in_file"_"* | xargs -P8 -n1 -I file $code_to_run file file.out && \
    cat $in_file"_"*.out > $out_file…
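The same split, process-in-parallel, concatenate shape can be sketched in Python. This is a loose analogue rather than a port of the script: it swaps the `xargs -P8` worker processes for a thread pool, and the names `split_file`, `run_chunked`, and `upper_case` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def split_file(in_file, lines_per_chunk):
    """Write consecutive chunks of `in_file` to numbered files, like `split -d -l`."""
    chunk_paths = []
    chunk = None
    with open(in_file) as src:
        for i, line in enumerate(src):
            if i % lines_per_chunk == 0:
                if chunk:
                    chunk.close()
                path = f"{in_file}_{len(chunk_paths):05d}"
                chunk_paths.append(path)
                chunk = open(path, "w")
            chunk.write(line)
    if chunk:
        chunk.close()
    return chunk_paths


def run_chunked(process_chunk, in_file, out_file, lines_per_chunk=100000, workers=8):
    """Split, process each chunk in parallel, then concatenate the outputs in order."""
    chunk_paths = split_file(in_file, lines_per_chunk)
    out_paths = [p + ".out" for p in chunk_paths]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # list() waits for every chunk and re-raises any worker exception.
        list(pool.map(process_chunk, chunk_paths, out_paths))
    with open(out_file, "w") as dst:
        for path in out_paths:
            with open(path) as part:
                dst.write(part.read())


def upper_case(src_path, dst_path):
    """Example per-chunk job (hypothetical): upper-case every line."""
    with open(src_path) as src, open(dst_path, "w") as dst:
        for line in src:
            dst.write(line.upper())
```

A call such as `run_chunked(upper_case, "in.txt", "out.txt")` would leave the chunk files and their `.out` outputs next to the input, as the bash version does. Threads suit I/O-bound jobs; to run an external program per chunk, as the script does with `$code_to_run`, the per-chunk function could call `subprocess.run` instead.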

ELI5 – The Human analogy to Multi-threading (Concurrency) and how it can go wrong

Ever feel like your brain is processing multiple things at once? An example might be when you're reading a book and listening to music. While you may not realize it, your brain is processing both what you're reading and what you're hearing simultaneously. A computer operates similarly; it is constantly working in the background doing…

Apache Pig chokes on many small files [part 2]

As a follow-up to my initial post about Apache Pig's inability to quickly load many small files in Pig 0.10 and newer, I wanted to share a simple fix that worked for me, courtesy of in-depth research by the Amazon Support Team (and engineers). Basically, around Pig 0.10.0, PigStorage builds a hidden schema file in an…