Okay, okay, I may have oversold it a bit, but here are fewer than 10 lines of bash that resemble (if you squint really hard) Hadoop/MapReduce.
code_to_run=$1
in_file=$2
out_file=$3

split -d -a 5 -l 100000 $in_file $in_file"_" && \
ls $in_file"_"* | xargs -P8 -n1 -I file $code_to_run file file.out && \
cat $in_file"_"*.out > $out_file && \
rm $in_file"_"*
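If you save those lines as, say, pseudo_mapreduce.sh (a name I'm assuming here) and make it executable, a run looks roughly like this, with mapper.sh and input.tsv standing in for your own executable and data:

chmod +x pseudo_mapreduce.sh
./pseudo_mapreduce.sh ./mapper.sh input.tsv output.tsv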
What will this do?
It takes 3 args:
- code_to_run is just a path to an executable (see the hypothetical example after this list)
- in_file is the path to a single input file
- out_file is the path where the single output file will be written
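To make that concrete, here's a minimal, hypothetical code_to_run; any executable that takes an input path as $1 and an output path as $2 will do. This one just uppercases every line:

#!/usr/bin/env bash
# mapper.sh (hypothetical): read the chunk at $1, write results to $2
tr '[:lower:]' '[:upper:]' < "$1" > "$2"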
split -d -a 5 -l 100000 $in_file $in_file"_"
Splits the in_file into 100,000-line chunks (-l 100000), naming each chunk with the original filename plus an underscore and a zero-padded five-digit numeric suffix (-d -a 5). E.g., for in_file = "file.tsv", the temp files are file.tsv_00000, file.tsv_00001, etc.
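For example, on a hypothetical 250,000-line file.tsv, this step would look something like the following (the last chunk holds the 50,000-line remainder):

$ wc -l < file.tsv
250000
$ split -d -a 5 -l 100000 file.tsv file.tsv_
$ ls file.tsv_*
file.tsv_00000  file.tsv_00001  file.tsv_00002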
ls $in_file"_"* | xargs -P8 -n1 -I file $code_to_run file file.out
Lists all the numbered temp input files and pipes them into xargs, which runs up to 8 parallel instances (-P8) of your code_to_run executable, one chunk per invocation (-n1), passing each chunked in_file as the first argument and a matching chunked out_file (file.out) as the second.
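A handy way to see exactly what xargs will execute is to prefix the command template with echo (dropping -P8 so the dry run prints in order). With the hypothetical mapper.sh from above, it prints something like:

$ ls file.tsv_* | xargs -n1 -I file echo ./mapper.sh file file.out
./mapper.sh file.tsv_00000 file.tsv_00000.out
./mapper.sh file.tsv_00001 file.tsv_00001.out
./mapper.sh file.tsv_00002 file.tsv_00002.out

Note that -I replaces every occurrence of the token file in the template, including inside file.out, which is how each chunk gets its matching output name.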
cat $in_file"_"*.out > $out_file
Then concatenates the chunked output files into the single out_file you expect.
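This preserves line order because the shell expands the glob in sorted order, and the zero-padded suffixes from split -d -a 5 make lexicographic order match numeric order. Continuing the hypothetical example:

$ echo file.tsv_*.out
file.tsv_00000.out file.tsv_00001.out file.tsv_00002.out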
rm $in_file"_"*
Cleanup (read: remove) all temporary files; the glob matches both the temporary input chunks and their .out counterparts, so both are removed.
For the sake of data safety, each command is chained with "&&" so that a command runs only if the one before it succeeded; in particular, the temporary files are never removed (and the output never assembled) unless every earlier step completed cleanly.
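For instance, if in_file doesn't exist, split fails, the chain short-circuits, and crucially the rm at the end never runs (exact error text varies by platform):

$ split -d -a 5 -l 100000 missing.tsv missing.tsv_ && echo 'only on success'
split: cannot open 'missing.tsv' for reading: No such file or directory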