Often I find myself wanting more information about the Python functions I’m running, whether it’s because I want to debug, log or even time their completion. All of these are relatively well-defined problems (debugging excepted). Unfortunately no tool makes it easy enough from my research to truly see the input, output, time elapsed, errors, warnings,… Continue reading→
Okay Okay, I may have oversold it a bit, but here are less than 10 bash lines that resemble (if you squint really hard) Hadoop/MapReduce. code_to_run=$1 in_file=$2 out_file=$3 split -d -a 5 -l 100000 $in_file $in_file”_” && \ ls $in_file”_”* | xargs -P8 -n1 -I file $code_to_run file file.out && \ cat $in_file”_”*.out > $out_file… Continue reading→
Ever feel like your brain is processing multiple things at once? An example might be when you’re reading a book and listening to music. While you may not realize it, your brain is processing both what you’re reading and what you’re hearing simultaneously. A computer operates similarly; it is constantly working in the background doing… Continue reading→
As a followup to my initial post regarding Apache Pig’s inability to quickly load many small files in Pig 0.10 and newer, I wanted to share a simple fix that worked for me courtesy of in-depth research by Amazon Support Team (+ Engineers). Basically around Pig 0.10.0, PigStorage builds a hidden schema file in an… Continue reading→