Split STDIN to multiple Compressed Files on highly skewed data

Sometimes I have data that is simply too large to store uncompressed on one volume, but I still need to run processes on it. In my recent use case, I had one large gzipped tar file that decompressed into many smaller gzipped tar files, which in turn decompressed into two-column text (both columns strings). I did not have enough space available to both decompress the data and process it down to a smaller size. While I could have processed the files as they were (gzipped), the performance would have been poor for these reasons:

    1) On-the-fly gzip decompression in Python is noticeably slower than reading uncompressed data.
    2) As I was going to run a process on over 1 billion records, I was going to kick it off in concurrent processes using xargs for simple parallelized processing.

Unfortunately, there was a significant skew to the data: a handful of the hundreds of files made up 90%+ of the data. So one process would handle a file with 800,000 lines and then work through another with 400,000,000 lines on its own, while the other processes sat idle because all the small files had been processed quickly.
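To see how bad the skew is before picking a strategy, a per-file line count helps. The following is only a sketch: the demo directory, the file names, and the `line_counts` helper are made up for illustration, and `gzip -dc` is used as a portable stand-in for zcat.

```shell
# Hypothetical demo data: two tiny files and one much larger one.
demo_dir=$(mktemp -d)
seq 1 10   | gzip > "$demo_dir/skewed_file_a.gz"
seq 1 20   | gzip > "$demo_dir/skewed_file_b.gz"
seq 1 5000 | gzip > "$demo_dir/skewed_file_c.gz"

# Print "<line count><TAB><file name>" for every .gz file in a directory,
# largest first. tr strips the padding some wc implementations add.
line_counts() {
  for f in "$1"/*.gz; do
    printf '%s\t%s\n' "$(gzip -dc "$f" | wc -l | tr -d ' ')" "$(basename "$f")"
  done | sort -rn
}

line_counts "$demo_dir"
```

If the top few lines of that listing dwarf the rest, handing out one file per process will leave most of the processes idle.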

So what I wanted was a way to concatenate all the compressed files, then split the result into roughly equally sized *compressed* files. Initially I tried:

time for i in /tmp/files/skewed_file*.gz; do base=/tmp/files/$(basename "$i" .gz)_; zcat "$i" | split -l 1000000 -d -a 5 - "$base" && ls "$base"* | grep -v '\.out' | xargs -P10 -n1 python etl_file.py; done

What that stream of commands does is iterate over each skewed_file (hundreds, ranging in size from 8KB to 1GB+) in the /tmp/files/ directory, zcat it (decompress it to STDOUT), and split that stream into 1,000,000-line chunks, using the hyphen (“-”) in place of a file name to tell split to read STDIN. Each chunk is written to a file named after the original without the .gz extension, plus an underscore and a five-digit number (e.g. skewed_file_a.gz becomes skewed_file_a_00000 and skewed_file_a_00001). Finally, it runs a Python script on the chunks to handle ETL using simple parallelized processing.

With a command that long, you have to wonder whether there is a faster/simpler way to do a similar thing. There is!

zcat *.gz | parallel -l2 --pipe --block 50m gzip ">"all_{#}.gz

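With this one [relatively simple] line, my large group of similarly formatted but skewed files is decompressed into a single stream, split into blocks of roughly 50 MB of uncompressed text, and each block is compressed with gzip and redirected to a sequentially numbered file (all_1.gz, all_2.gz, and so on, since {#} is the job number). Booya!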
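After rebalancing, it is worth confirming that no lines were lost along the way. A minimal sketch, assuming the originals and the rebalanced files sit in the same directory; the `count_lines` helper and the hand-made demo files are hypothetical, and the rebalanced files here are faked with head and tail so the example runs without parallel installed.

```shell
# Count the lines across any number of gzip files.
count_lines() {
  gzip -dc "$@" | wc -l | tr -d ' '
}

# Hypothetical demo: three skewed inputs, "rebalanced" by hand into two outputs.
check_dir=$(mktemp -d)
seq 1 5    | gzip > "$check_dir/skewed_file_a.gz"
seq 6 10   | gzip > "$check_dir/skewed_file_b.gz"
seq 11 400 | gzip > "$check_dir/skewed_file_c.gz"
gzip -dc "$check_dir"/skewed_file_*.gz | head -n 200 | gzip > "$check_dir/all_1.gz"
gzip -dc "$check_dir"/skewed_file_*.gz | tail -n 200 | gzip > "$check_dir/all_2.gz"

# Compare the total line count before and after.
before=$(count_lines "$check_dir"/skewed_file_*.gz)
after=$(count_lines "$check_dir"/all_*.gz)
if [ "$before" -eq "$after" ]; then
  echo "OK: $after lines preserved"
else
  echo "MISMATCH: $before lines in, $after lines out" >&2
fi
```

A matching count does not prove the content is identical, but it catches the common failure of a truncated or missing output file cheaply.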

Relevant Stack Overflow question/answer: http://stackoverflow.com/questions/22628610/split-stdin-to-multiple-files-and-compress-them-if-possible


It is also possible to use parallel to split by number of lines using the -N flag:

zcat *.gz | parallel --pipe -N1000000 gzip ">"all_{#}.gz
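If parallel is not installed, GNU split can produce compressed chunks by itself through its --filter option, which pipes each chunk into a command and exposes the chunk's file name as $FILE. This is a swapped-in alternative, not what I ran above: it assumes GNU coreutils (BSD/macOS split has no --filter) and it compresses the chunks one after another rather than in parallel. The demo files and the 1000-line chunk size are made up for illustration.

```shell
# Hypothetical demo input: one tiny and one larger gzip file (2500 lines total).
work=$(mktemp -d)
mkdir "$work/out"
seq 1 10    | gzip > "$work/in_a.gz"
seq 11 2500 | gzip > "$work/in_b.gz"

# Decompress everything into one stream, cut it every 1000 lines, and gzip
# each chunk as it is written. The single quotes matter: $FILE must be
# expanded by the shell that split starts for the filter, not by this shell.
gzip -dc "$work"/in_*.gz |
  split -l 1000 -d -a 5 --filter='gzip > "$FILE.gz"' - "$work/out/all_"

ls "$work/out"
```

With this demo data the output directory ends up with all_00000.gz, all_00001.gz and all_00002.gz, the last one holding the 500 leftover lines.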

Simple parallelized processing with GIL languages (Python, Ruby, etc)

Sometimes we want parallel processing, but don’t want to pay the cost of ensuring proper multi-threaded handling. After all, who wants to spend an extra 30 minutes setting up threads, ensuring race condition safety, and so on just to save a few minutes?

If you have access to “xargs,” you have access to a built-in utility that can simplify parallel processing. Say you have a few hundred files you want to convert from json to csv, and it’s enough data to warrant parallel processes, but not enough to warrant the extra time spent building a truly parallel process. Do a simple “ls”, pipe it to “xargs -P8 -n1 python my_conversion.py” and now you’ll have 8 separate processes working on 1 file each independently.


ls file.json_* | xargs -P8 -n1 python my_conversion.py

xargs explained:
-P8 means we want 8 separate processes (so no Global Interpreter Lock issues)
-n1 means we want xargs to take the arguments and pass only one (file name) at a time as an argument to the python code
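One caveat with piping ls into xargs is that file names containing spaces or newlines get split into several arguments. A hedged variation is to hand xargs NUL-delimited names with -0, which both GNU and BSD xargs accept. In this sketch the demo files are invented, and a small `sh -c` command stands in for `python my_conversion.py` so the example runs without the real script.

```shell
# Hypothetical demo input: five small JSON files.
demo=$(mktemp -d)
for n in 1 2 3 4 5; do
  echo "{\"id\": $n}" > "$demo/file.json_$n"
done

# printf emits each matching name followed by a NUL byte; xargs -0 splits on
# NUL, -P4 runs up to four workers at once, -n1 passes one file per worker.
# The stand-in worker writes the byte count of its input to <input>.out.
printf '%s\0' "$demo"/file.json_* |
  xargs -0 -P4 -n1 sh -c 'wc -c < "$1" > "$1.out"' worker

ls "$demo"
```

To run the real conversion, replace everything after `-n1` with `python my_conversion.py`; each process still receives exactly one file name as its argument.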