{"id":59,"date":"2014-12-31T10:49:55","date_gmt":"2014-12-31T18:49:55","guid":{"rendered":"http:\/\/garrens.com\/blog\/?p=59"},"modified":"2015-06-29T13:49:34","modified_gmt":"2015-06-29T21:49:34","slug":"split-stdin-to-multiple-compressed-files-on-highly-skewed-data","status":"publish","type":"post","link":"https:\/\/garrens.com\/blog\/2014\/12\/31\/split-stdin-to-multiple-compressed-files-on-highly-skewed-data\/","title":{"rendered":"Split STDIN to multiple Compressed Files on highly skewed data"},"content":{"rendered":"<p>Sometimes I have data that is simply too large to store on one volume uncompressed, but I need to run processes on it. In my recent use case, I had one large Tar GZIPPED file that decompressed into many smaller tar gzipped files, which then decompressed into two column text (both strings). I did not have the luxury of having enough space available to both decompress and process the data to a smaller size. While I could have processed the files as they were (gzipped), the performance would have been sub-standard for these reasons:<\/p>\n<ol>1) GZIP on the fly decompression in Python is noticeably slower than reading uncompressed data.<\/ol>\n<ol>\n<ol>2) As I was going to run a process on over 1 Billion records, I was going to kick it off in concurrent threads using xargs<\/ol>\n<\/ol>\n<p><a href=\"http:\/\/garrens.com\/blog\/2014\/12\/22\/simple-parallelized-processing-with-gil-languages-python-ruby-etc\/\">simple parallelized processing.<\/a><\/p>\n<ol>Unfortunately there was a significant skew to the data where a handful of hundreds of files made up 90%+ of the data. So one thread would process a file with 800,000 lines then process another one on its own with 400,000,000 lines while other threads were unused because all the small files were quickly processed.<\/ol>\n<p>So what I wanted was a way to concatenate all the compressed files files, then split those into roughly equivalently sized *compressed* files. 
Initially I tried<\/p>\n<pre>time for i in $(ls \/tmp\/files\/skewed_file*.gz); do zcat $i | split -l 1000000 -d -a 5 - \/tmp\/files\/$(basename $i \".gz\")_ &amp;&amp; ls \/tmp\/files\/$(basename $i \".gz\")_* | grep -v \".out\" | xargs -P10 -n1 python etl_file.py; done\r\n<\/pre>\n<p>This stream of commands iterates over each skewed_file (hundreds of them, ranging in size from 8KB to 1GB+) in the \/tmp\/files\/ directory, zcats it (decompressing the gzipped file to STDOUT), splits that stream into 1,000,000-line pieces (the hyphen &#8220;-&#8221; in place of a file name tells split to read STDIN), writes each piece to a file named after the original without the .gz extension plus an underscore and a number (e.g. skewed_file_a.gz becomes skewed_file_a_00000 and skewed_file_a_00001), and then runs a Python script to handle the ETL using <a href=\"http:\/\/garrens.com\/blog\/2014\/12\/22\/simple-parallelized-processing-with-gil-languages-python-ruby-etc\/\">simple parallelized processing.<\/a><\/p>\n<p>A command that long makes you wonder whether there is a faster\/simpler way to do the same thing. There is!<\/p>\n<pre>zcat *.gz | parallel -l2 --pipe --block 50m gzip \"&gt;\"all_{#}.gz\r\n<\/pre>\n<p>With this one [relatively simple] line, my large group of similarly formatted but skewed files is split into blocks, compressed with gzip, and redirected to properly named files. Booya!<\/p>\n<p>Relevant Stack Overflow question\/answer: http:\/\/stackoverflow.com\/questions\/22628610\/split-stdin-to-multiple-files-and-compress-them-if-possible<\/p>\n<p>EDIT:<\/p>\n<p>It is also possible to use parallel to split by number of lines using the -N flag:<\/p>\n<pre>zcat *.gz | parallel --pipe -N1000000 gzip \"&gt;\"all_{#}<\/pre>\n","protected":false},"excerpt":{"rendered":"<p>Sometimes I have data that is simply too large to store on one volume uncompressed, but I need to run processes on it. 
In my recent use case, I had one large Tar GZIPPED file that decompressed into many smaller tar gzipped files, which then decompressed into two column text (both strings). I did not&hellip; <a href=\"https:\/\/garrens.com\/blog\/2014\/12\/31\/split-stdin-to-multiple-compressed-files-on-highly-skewed-data\/\" title=\"Read More\" class=\"read-more\">Continue reading<span class=\"meta-nav\">&rarr;<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":[],"categories":[1],"tags":[],"_links":{"self":[{"href":"https:\/\/garrens.com\/blog\/wp-json\/wp\/v2\/posts\/59"}],"collection":[{"href":"https:\/\/garrens.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/garrens.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/garrens.com\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/garrens.com\/blog\/wp-json\/wp\/v2\/comments?post=59"}],"version-history":[{"count":9,"href":"https:\/\/garrens.com\/blog\/wp-json\/wp\/v2\/posts\/59\/revisions"}],"predecessor-version":[{"id":101,"href":"https:\/\/garrens.com\/blog\/wp-json\/wp\/v2\/posts\/59\/revisions\/101"}],"wp:attachment":[{"href":"https:\/\/garrens.com\/blog\/wp-json\/wp\/v2\/media?parent=59"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/garrens.com\/blog\/wp-json\/wp\/v2\/categories?post=59"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/garrens.com\/blog\/wp-json\/wp\/v2\/tags?post=59"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}
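The split-and-recompress technique in the post can also be reproduced with GNU coreutils alone: newer versions of `split` (8.13+) accept a `--filter` option that compresses each piece as it is written, so no GNU parallel is required. A minimal sketch; the `/tmp/split_demo` directory and `skewed_file_*.gz` names are illustrative stand-ins, not the author's actual files:

```shell
# Stand-ins for the skewed gzipped inputs (names are illustrative).
mkdir -p /tmp/split_demo && cd /tmp/split_demo
seq 1 250000      | gzip > skewed_file_a.gz
seq 250001 300000 | gzip > skewed_file_b.gz

# Concatenate the decompressed streams and split into 100,000-line
# pieces, gzipping each piece as it is written (GNU coreutils >= 8.13).
# "-" tells split to read STDIN; $FILE is expanded by split itself,
# so the filter command must be single-quoted.
zcat skewed_file_*.gz | split -l 100000 -d --filter='gzip > $FILE.gz' - all_

# Round-trip check: recombining the pieces reproduces the original stream.
diff <(zcat skewed_file_*.gz) <(zcat all_*.gz) && echo OK
```

Like the `parallel --pipe -N` version, this produces equal line-count pieces regardless of how skewed the inputs are; the trade-off is that `--filter` compresses pieces serially, whereas parallel can run one gzip per block concurrently.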