Okay, okay, I may have oversold it a bit, but here are fewer than 10 lines of bash that resemble (if you squint really hard) Hadoop/MapReduce.

code_to_run=$1
in_file=$2
out_file=$3
split -d -a 5 -l 100000 $in_file $in_file"_" && \
ls $in_file"_"* | xargs -P8 -n1 -I file $code_to_run file file.out && \
cat $in_file"_"*.out > $out_file && \
rm $in_file"_"*

What will this do?
The script takes 3 arguments:

  • code_to_run is the path to an executable that accepts two arguments: an input path and an output path
  • in_file is the path to a single input file
  • out_file is the path to a single output file
split -d -a 5 -l 100000 $in_file $in_file"_"

Split in_file into chunks of 100,000 lines each. Every chunk is named after in_file, followed by an underscore and a five-digit numeric suffix (-d -a 5). For example, with in_file = “file.tsv”, the temp files are file.tsv_00000, file.tsv_00001, etc.
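To see the naming scheme on a small scale, here is a minimal sketch that uses the same flags with a chunk size of 10 instead of 100,000. The throwaway directory and the 25-line file.tsv are made up for the demo, and -d assumes GNU split.

```shell
# Work in a throwaway directory so nothing real gets touched.
workdir=$(mktemp -d)
cd "$workdir"

# 25 numbered lines stand in for a large input file.
seq 1 25 > file.tsv

# Same flags as the script, but with 10-line chunks.
split -d -a 5 -l 10 file.tsv file.tsv_

# Show each chunk with its line count.
wc -l file.tsv_*
```

This leaves three chunks: file.tsv_00000 and file.tsv_00001 with 10 lines each, and file.tsv_00002 with the remaining 5.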

ls $in_file"_"* | xargs -P8 -n1 -I file $code_to_run file file.out

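List all the numbered temp input files and pipe them to xargs, which runs up to 8 processes at a time (-P8) of your code_to_run executable. Each invocation receives one chunk as its input path and writes to a matching output path (e.g. file.tsv_00000 in, file.tsv_00000.out out).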
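A small sketch of this step in isolation may help. The mapper upper.sh is hypothetical (it just upper-cases its input), and the two chunk files are created by hand to stand in for the output of split. The placeholder here is {} rather than the word file, and -n1 is left out because -I already runs the command once per input line.

```shell
workdir=$(mktemp -d)
cd "$workdir"

# Hypothetical mapper: upper-cases its input file ($1) into its output file ($2).
cat > upper.sh <<'EOF'
#!/bin/sh
tr 'a-z' 'A-Z' < "$1" > "$2"
EOF
chmod +x upper.sh

# Two small chunks stand in for the files produced by split.
printf 'a\nb\n' > words.txt_00000
printf 'c\n'    > words.txt_00001

# Run the mapper on every chunk, up to 8 at a time; xargs waits for all of them.
ls words.txt_* | xargs -P8 -I {} ./upper.sh {} {}.out

# Each chunk now has a matching .out file.
cat words.txt_*.out
```

The final cat prints A, B and C on separate lines, one mapped line per input line.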

cat $in_file"_"*.out > $out_file

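Then concatenate the chunked out files into a single out_file. The shell expands the glob in sorted order, and because the numeric suffixes are zero-padded, the chunks are joined in the same order they were split.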
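As a quick sanity check of that ordering, the sketch below splits a file and immediately concatenates the chunks again, with no mapper in between. The file names are made up for the demo, and -d again assumes GNU split.

```shell
workdir=$(mktemp -d)
cd "$workdir"

# 25 numbered lines, split into 10-line chunks.
seq 1 25 > file.tsv
split -d -a 5 -l 10 file.tsv file.tsv_

# Join the chunks back together in glob order.
cat file.tsv_* > rejoined.tsv

# cmp exits 0 only when the two files are byte-for-byte equal.
cmp file.tsv rejoined.tsv && echo "identical"
```

If the suffixes were not zero-padded, a chunk ending in _10 would sort before one ending in _2 and the lines would come back out of order.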

rm $in_file"_"*

Clean up (i.e. remove) all temporary files; both the chunked in files and the chunked out files are removed.

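For the sake of data safety, each command is chained to the next with “&&”, so a command runs only if the one before it succeeded. xargs reports failure if any invocation of code_to_run fails, so a failed chunk stops the chain before anything is concatenated or deleted.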
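The effect of “&&” is easy to check by hand. In this sketch, false and true stand in for a failing and a succeeding step, and the chunk_ files are empty placeholders created for the demo.

```shell
workdir=$(mktemp -d)
cd "$workdir"
touch chunk_00000 chunk_00001

# 'false' exits non-zero, so the rm on the right-hand side never runs.
false && rm chunk_*
kept=$(ls chunk_* | wc -l | tr -d ' ')
echo "chunks left after a failed step: $kept"

# 'true' exits zero, so this time the cleanup goes ahead.
true && rm chunk_*
```

After the first chain both chunks are still on disk; only the second chain removes them.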

