Oh, do you mean to simulate a stream? That's a pretty cool idea. My approach, which worked, is a little different: I used Python's zipfile module. Since I had the files stored locally, I was able to use Python to append JSON files together into bigger files of at least 128 MB and at most 1 GB. So far it seems to have worked: I went from ~2M mini JSON files down to 28 appropriately sized JSON files (~300 MB each). Your blog was super helpful, thanks!
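For anyone wanting to do something similar, here's a minimal sketch of the append-and-roll-over idea. It's my own illustration, not the commenter's actual script: it assumes the small files are newline-delimited JSON (so byte-level concatenation stays valid), and the `part-NNNNN.json` naming and `combine_json_files` helper are made up for the example.

```python
from pathlib import Path


def combine_json_files(src_dir, dst_dir, max_size=1024 ** 3):
    """Concatenate many small newline-delimited JSON files into larger ones.

    Starts a new output file whenever appending the next input would push
    the current output past ``max_size`` bytes (1 GB by default); with
    enough input, most outputs land between that cap and the previous
    rollover point.
    """
    dst = Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    out_file, out_size, out_index = None, 0, 0
    for path in sorted(Path(src_dir).glob("*.json")):
        size = path.stat().st_size
        # Roll over to a fresh output file if this input would overflow it.
        if out_file is None or out_size + size > max_size:
            if out_file is not None:
                out_file.close()
            out_file = (dst / f"part-{out_index:05d}.json").open("wb")
            out_index += 1
            out_size = 0
        out_file.write(path.read_bytes())
        out_size += size
    if out_file is not None:
        out_file.close()
```

Note this only works as-is for JSON Lines; concatenating complete JSON documents would need a wrapping array or a records-style rewrite instead.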
Mohammad, if you already have an army of tiny files, you can do a batch conversion to Parquet by reading the data set in chunks (e.g. one 1-hour partition of a 24-hour day at a time).
Probably related to your Python version (e.g. 2.7 versus 3.6). Hopefully you got it resolved.