{"id":59,"date":"2014-12-31T10:49:55","date_gmt":"2014-12-31T18:49:55","guid":{"rendered":"http:\/\/garrens.com\/blog\/?p=59"},"modified":"2015-06-29T13:49:34","modified_gmt":"2015-06-29T21:49:34","slug":"split-stdin-to-multiple-compressed-files-on-highly-skewed-data","status":"publish","type":"post","link":"https:\/\/garrens.com\/blog\/2014\/12\/31\/split-stdin-to-multiple-compressed-files-on-highly-skewed-data\/","title":{"rendered":"Split STDIN to multiple Compressed Files on highly skewed data"},"content":{"rendered":"<p>Sometimes I have data that is simply too large to store on one volume uncompressed, but I need to run processes on it. In my recent use case, I had one large Tar GZIPPED file that decompressed into many smaller tar gzipped files, which then decompressed into two column text (both strings). I did not have the luxury of having enough space available to both decompress and process the data to a smaller size. While I could have processed the files as they were (gzipped), the performance would have been sub-standard for these reasons:<\/p>\n<ol>1) GZIP on the fly decompression in Python is noticeably slower than reading uncompressed data.<\/ol>\n<ol>\n<ol>2) As I was going to run a process on over 1 Billion records, I was going to kick it off in concurrent threads using xargs<\/ol>\n<\/ol>\n<p><a href=\"http:\/\/garrens.com\/blog\/2014\/12\/22\/simple-parallelized-processing-with-gil-languages-python-ruby-etc\/\">simple parallelized processing.<\/a><\/p>\n<ol>Unfortunately there was a significant skew to the data where a handful of hundreds of files made up 90%+ of the data. So one thread would process a file with 800,000 lines then process another one on its own with 400,000,000 lines while other threads were unused because all the small files were quickly processed.<\/ol>\n<p>So what I wanted was a way to concatenate all the compressed files files, then split those into roughly equivalently sized *compressed* files. 
Initially I tried<\/p>\n<pre>time for i in $(ls \/tmp\/files\/skewed_file*.gz); do zcat $i | split -l 1000000 -d -a 5 - \/tmp\/files\/$(basename $i \".gz\")_ &amp;&amp; ls \/tmp\/files\/$(basename $i \".gz\")_* | grep -v \".out\" | xargs -P10 -n1 python etl_file.py; done\r\n<\/pre>\n<p>This stream of commands iterates over each skewed_file (hundreds of them, ranging in size from 8KB to 1GB+) in the \/tmp\/files\/ directory, zcats it (decompressing the gzipped file to STDOUT), splits that stream into 1,000,000-line pieces (the hyphen &#8220;-&#8221; in place of a file name tells split to read STDIN), writes each piece to a file named after the original without the .gz extension plus an underscore and a number (e.g. skewed_file_a.gz becomes skewed_file_a_00000 and skewed_file_a_00001), and then runs a Python script to handle the ETL using <a href=\"http:\/\/garrens.com\/blog\/2014\/12\/22\/simple-parallelized-processing-with-gil-languages-python-ruby-etc\/\">simple parallelized processing.<\/a><\/p>\n<p>A command that long makes you wonder whether there is a faster\/simpler way to do the same thing. There is!<\/p>\n<pre>zcat *.gz | parallel -l2 --pipe --block 50m gzip \"&gt;\"all_{#}.gz\r\n<\/pre>\n<p>With this one [relatively simple] line, my large group of similarly formatted but skewed files is split into blocks, compressed with gzip, and redirected to properly named files. Booya!<\/p>\n<p>Relevant Stack Overflow question\/answer: http:\/\/stackoverflow.com\/questions\/22628610\/split-stdin-to-multiple-files-and-compress-them-if-possible<\/p>\n<p>EDIT:<\/p>\n<p>It is also possible to use parallel to split by number of lines using the -N flag:<\/p>\n<pre>zcat *.gz | parallel --pipe -N1000000 gzip \"&gt;\"all_{#}<\/pre>\n","protected":false},"excerpt":{"rendered":"<p>Sometimes I have data that is simply too large to store on one volume uncompressed, but I need to run processes on it. 
In my recent use case, I had one large Tar GZIPPED file that decompressed into many smaller tar gzipped files, which then decompressed into two column text (both strings). I did not&hellip; <a href=\"https:\/\/garrens.com\/blog\/2014\/12\/31\/split-stdin-to-multiple-compressed-files-on-highly-skewed-data\/\" title=\"Read More\" class=\"read-more\">Continue reading<span class=\"meta-nav\">&rarr;<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":[],"categories":[1],"tags":[],"_links":{"self":[{"href":"https:\/\/garrens.com\/blog\/wp-json\/wp\/v2\/posts\/59"}],"collection":[{"href":"https:\/\/garrens.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/garrens.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/garrens.com\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/garrens.com\/blog\/wp-json\/wp\/v2\/comments?post=59"}],"version-history":[{"count":9,"href":"https:\/\/garrens.com\/blog\/wp-json\/wp\/v2\/posts\/59\/revisions"}],"predecessor-version":[{"id":101,"href":"https:\/\/garrens.com\/blog\/wp-json\/wp\/v2\/posts\/59\/revisions\/101"}],"wp:attachment":[{"href":"https:\/\/garrens.com\/blog\/wp-json\/wp\/v2\/media?parent=59"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/garrens.com\/blog\/wp-json\/wp\/v2\/categories?post=59"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/garrens.com\/blog\/wp-json\/wp\/v2\/tags?post=59"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}
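The split-and-recompress technique in the post can also be reproduced with GNU coreutils alone: newer versions of `split` (8.13+) accept a `--filter` option that compresses each piece as it is written, so no GNU parallel is required. A minimal sketch; the `/tmp/split_demo` directory and `skewed_file_*.gz` names are illustrative stand-ins, not the author's actual files:

```shell
# Stand-ins for the skewed gzipped inputs (names are illustrative).
mkdir -p /tmp/split_demo && cd /tmp/split_demo
seq 1 250000      | gzip > skewed_file_a.gz
seq 250001 300000 | gzip > skewed_file_b.gz

# Concatenate the decompressed streams and split into 100,000-line
# pieces, gzipping each piece as it is written (GNU coreutils >= 8.13).
# "-" tells split to read STDIN; $FILE is expanded by split itself,
# so the filter command must be single-quoted.
zcat skewed_file_*.gz | split -l 100000 -d --filter='gzip > $FILE.gz' - all_

# Round-trip check: recombining the pieces reproduces the original stream.
diff <(zcat skewed_file_*.gz) <(zcat all_*.gz) && echo OK
```

Like the `parallel --pipe -N` version, this produces equal line-count pieces regardless of how skewed the inputs are; the trade-off is that `--filter` compresses pieces serially, whereas parallel can run one gzip per block concurrently.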