Sometimes I have data that is simply too large to store uncompressed on one volume, but I still need to run processes on it. In my recent use case, I had one large tar-gzipped file that decompressed into many smaller tar-gzipped files, which in turn decompressed into two-column text (both columns strings). I did not have the luxury of enough free space to both decompress the data and process it down to a smaller size. While I could have processed the files as they were (gzipped), the performance would have been substandard, for two reasons:

    1) GZIP on-the-fly decompression in Python is noticeably slower than reading uncompressed data (a rough timing sketch follows this list).
    2) As I was going to run the process on over 1 billion records, I was going to kick it off across concurrent workers with xargs for simple parallelized processing.
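
If you want to see the gap from point 1 for yourself, a quick-and-dirty timing sketch like this will do. It is just a sketch: the paths are placeholders, not my actual data, and it simply counts lines once through gzip.open and once from an uncompressed copy of the same content.

import gzip
import time

def count_lines(opener, path):
    """Count lines with the given opener; wall time is a rough read-throughput proxy."""
    start = time.perf_counter()
    with opener(path, "rt") as f:
        lines = sum(1 for _ in f)
    return lines, time.perf_counter() - start

# Placeholder paths -- point these at a real .gz file and an uncompressed copy of it.
print("gzip: ", count_lines(gzip.open, "/tmp/files/sample.gz"))
print("plain:", count_lines(open, "/tmp/files/sample.txt"))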

Unfortunately there was a significant skew to the data: a handful of the hundreds of files made up 90%+ of it. So one worker would finish a file with 800,000 lines, then grind through another with 400,000,000 lines all on its own, while the rest of the workers sat idle because the small files had already been processed.

So what I wanted was a way to concatenate all the compressed files, then split those into roughly equivalently sized *compressed* files. Initially I tried

time for i in $(ls /tmp/files/skewed_file*.gz); do zcat $i | split -l 1000000 -d -a 5 - /tmp/files/$(basename $i ".gz")_ && ls /tmp/files/$(basename $i ".gz")_* | grep -v ".out" | xargs -P10 -n1 python etl_file.py; done

What that stream of commands does is iterate over each skewed_file (there are hundreds, ranging in size from 8KB to 1GB+) in the /tmp/files/ directory, zcat it (decompress to stdout), pipe that into split (1,000,000 lines per chunk), using a hyphen ("-") in place of a file name so split reads from stdin, write chunks named after the original file minus its .gz extension plus an underscore and a number (e.g. skewed_file_a.gz becomes skewed_file_a_00000 and skewed_file_a_00001), and then run a Python script over each chunk via xargs to handle the ETL with simple parallelized processing.
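
I won't paste etl_file.py here, but for context, a worker along these lines would fit that pipeline. This is only a guessed-at skeleton, not the real script: it takes one split chunk as its argument, transforms the two string columns, and writes a sibling .out file, which would also explain the grep -v ".out" filter above. The tab delimiter and the transform are assumptions.

#!/usr/bin/env python
"""Hypothetical sketch of an etl_file.py worker -- not the actual script."""
import sys

def transform(col_a, col_b):
    # Placeholder for the actual ETL logic on the two string columns.
    return col_a.strip().lower(), col_b.strip()

def main(path):
    # Assumed convention: write results next to the input as <chunk>.out,
    # which is why the pipeline above excludes ".out" files before xargs.
    with open(path) as src, open(path + ".out", "w") as dst:
        for line in src:
            col_a, col_b = line.rstrip("\n").split("\t", 1)
            out_a, out_b = transform(col_a, col_b)
            dst.write(out_a + "\t" + out_b + "\n")

if __name__ == "__main__":
    main(sys.argv[1])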

Now, with a command that long, you have to wonder whether there is a faster or simpler way to do the same thing. There is!

zcat *.gz | parallel -l2 --pipe --block 50m gzip ">"all_{#}.gz

With this one [relatively simple] line, my large group of similarly formatted but skewed files is split into blocks, compressed with gzip, and redirected to properly named files. Booya!
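
A nice property of --pipe is that it splits on whole records (newlines by default), so no line gets chopped in half across chunks. If you want to confirm the skew really is gone before kicking off the ETL, a quick line count per chunk does the trick; the glob path below is just an assumption about where the chunks landed.

import glob
import gzip

# Count lines per re-split chunk to confirm the sizes really are roughly even now.
# The path is an assumption; adjust to wherever the all_{#}.gz files were written.
for path in sorted(glob.glob("/tmp/files/all_*.gz")):
    with gzip.open(path, "rt") as f:
        print(path, sum(1 for _ in f))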

Relevant Stack Overflow question/answer: http://stackoverflow.com/questions/22628610/split-stdin-to-multiple-files-and-compress-them-if-possible

EDIT:

It is also possible to use parallel to split by number of lines using the -N flag:

zcat *.gz | parallel --pipe -N1000000 gzip ">"all_{#}.gz
