Files sometimes come in (whether via hadoop or other processes) as big globs of data with inter-related parts. Many times I want to process these globs concurrently but see my dilemma unfolding quickly. I could a) write the code to process it serially and be done with it in 1 hour or b) write code… Continue reading→
Recently I wanted to quickly convert some CSV files to AVRO due to recent logging changes that meant we were receiving AVRO logs instead of CSV. I wanted to have some AVRO files to test my logic on and to get more familiar with AVRO. After looking around for a while and trying a few… Continue reading→
Often I find myself wanting more information about the Python functions I’m running, whether it’s because I want to debug, log or even time their completion. All of these are relatively well-defined problems (debugging excepted). Unfortunately no tool makes it easy enough from my research to truly see the input, output, time elapsed, errors, warnings,… Continue reading→
Okay Okay, I may have oversold it a bit, but here are less than 10 bash lines that resemble (if you squint really hard) Hadoop/MapReduce. code_to_run=$1 in_file=$2 out_file=$3 split -d -a 5 -l 100000 $in_file $in_file”_” && \ ls $in_file”_”* | xargs -P8 -n1 -I file $code_to_run file file.out && \ cat $in_file”_”*.out > $out_file… Continue reading→