January 2015 – Garren's [Big] Data Blog

Metadata for Functions | Python Decorator

Posted by Garren on 2015/01/23

Often I find myself wanting more information about the Python functions I’m running, whether it’s because I want to debug, log or even time their completion. All of these are relatively well-defined problems (debugging excepted). Unfortunately, from my research, no tool makes it easy enough to truly see the input, output, time elapsed, errors, warnings,…
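To make the idea concrete, here is a minimal sketch of a decorator that records that kind of metadata for each call. It is not the code from the full post; the name `with_metadata`, the shape of the returned dict and the `divide` example are all assumptions made for illustration.

```python
import functools
import time
import warnings


def with_metadata(func):
    """Hypothetical decorator: each call returns a dict describing the call."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        record = {
            "function": func.__name__,
            "args": args,
            "kwargs": kwargs,
            "output": None,
            "error": None,
            "warnings": [],
        }
        start = time.perf_counter()
        # Capture warnings raised inside the call instead of printing them.
        with warnings.catch_warnings(record=True) as caught:
            warnings.simplefilter("always")
            try:
                record["output"] = func(*args, **kwargs)
            except Exception as exc:
                # Keep the error as data so the caller can inspect it.
                record["error"] = repr(exc)
        record["elapsed_seconds"] = time.perf_counter() - start
        record["warnings"] = [str(w.message) for w in caught]
        return record
    return wrapper


@with_metadata
def divide(a, b):
    # Illustrative function only; it warns in one case so there is something to capture.
    if b == 1:
        warnings.warn("dividing by one is a no-op")
    return a / b
```

Calling `divide(1, 0)` then returns a record whose `error` field holds the `ZeroDivisionError` instead of raising it. Whether to swallow or re-raise the exception is a design choice; this sketch swallows it so that every call yields a record.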


The Power of Hadoop in under 10 lines!

Posted by Garren on 2015/01/16

Okay, okay, I may have oversold it a bit, but here are fewer than 10 lines of bash that resemble (if you squint really hard) Hadoop/MapReduce.

    code_to_run=$1
    in_file=$2
    out_file=$3
    split -d -a 5 -l 100000 $in_file $in_file"_" && \
    ls $in_file"_"* | xargs -P8 -n1 -I file $code_to_run file file.out && \
    cat $in_file"_"*.out > $out_file
    …
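For readers who think more easily in Python, the same split, run-in-parallel, concatenate pattern can be sketched as follows. This is an illustration, not a port of the author's full script: the names `split_file` and `run_job` are hypothetical, and a Python callable stands in for the external `code_to_run` program.

```python
import itertools
import shutil
from concurrent.futures import ThreadPoolExecutor


def split_file(in_file, lines_per_chunk=100000):
    """Write in_file_00000, in_file_00001, ... and return the chunk paths."""
    chunk_paths = []
    with open(in_file) as src:
        for index in itertools.count():
            # Read the next block of lines; an empty block means end of file.
            lines = list(itertools.islice(src, lines_per_chunk))
            if not lines:
                break
            path = f"{in_file}_{index:05d}"
            with open(path, "w") as chunk:
                chunk.writelines(lines)
            chunk_paths.append(path)
    return chunk_paths


def run_job(process_chunk, in_file, out_file, workers=8, lines_per_chunk=100000):
    """Split in_file, run process_chunk(src, dst) on each piece, join the results."""
    chunks = split_file(in_file, lines_per_chunk)
    outputs = [path + ".out" for path in chunks]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # list() waits for every chunk and re-raises any worker exception.
        list(pool.map(process_chunk, chunks, outputs))
    # Concatenate the per-chunk outputs in chunk order.
    with open(out_file, "w") as dst:
        for path in outputs:
            with open(path) as part:
                shutil.copyfileobj(part, dst)
```

To mimic the bash version more closely, `process_chunk` could be a small function that calls `subprocess.run([code_to_run, src, dst], check=True)`. Threads are enough in that case because the real work happens in the child processes.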


ELI5 – The Human analogy to Multi-threading (Concurrency) and how it can go wrong

Posted by Garren on 2015/01/10

Ever feel like your brain is processing multiple things at once? An example might be when you’re reading a book and listening to music. While you may not realize it, your brain is processing both what you’re reading and what you’re hearing simultaneously. A computer operates similarly; it is constantly working in the background doing…


Apache Pig chokes on many small files [part 2]

Posted by Garren on 2015/01/09

As a follow-up to my initial post regarding Apache Pig’s inability to quickly load many small files in Pig 0.10 and newer, I wanted to share a simple fix that worked for me, courtesy of in-depth research by the Amazon Support Team (and engineers). Basically, around Pig 0.10.0, PigStorage builds a hidden schema file in an…
