What is Apache Parquet?
It is a compressible, binary, columnar data format used in the Hadoop ecosystem. We'll talk about it primarily in relation to the Hadoop Distributed File System (HDFS) and Spark 2.x.
What role does it fill?
It is a fast, storage-efficient data format well suited to scalable big data analytics. Because data is stored by column, a query can read only the columns it needs.
Optimization Tips
Aim for Parquet output files of around 1 GB, but experiment with other sizes for your use case and cluster setup (source)
Ideally, store files on HDFS at sizes of at least the HDFS block size (default 128 MB)
Storing Parquet files on S3 is also possible (side note: use Amazon Athena, which charges based on data read, if you want on-demand, Presto-style SQL queries at low cost)
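Use snappy compression if storage space is not a concern: it is fast, and Parquet files stay splittable because compression is applied to the data inside the file rather than to the file as a whole. For much better compression at what should be a relatively small performance cost, use gzip (source)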
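As a rough sketch of how the sizing and compression tips above might be applied from PySpark: the helper below works out how many output files give roughly 1 GB each, then repartitions to that count before writing. The names `output_file_count` and `write_parquet` and the `estimated_bytes` argument are hypothetical and exist only for illustration; the sketch also assumes you already have a reasonable estimate of the written size, which Spark does not hand you directly.

```python
# Starting points taken from the tips above; tune both for your cluster.
HDFS_BLOCK_BYTES = 128 * 1024 ** 2   # default HDFS block size (128 MB)
TARGET_FILE_BYTES = 1024 ** 3        # ~1 GB per output file


def output_file_count(total_bytes, target_file_bytes=TARGET_FILE_BYTES):
    """Return how many files keep each one at or below target_file_bytes.

    Hypothetical helper: uses integer ceiling division and always
    returns at least 1 so that small datasets still get one file.
    """
    if total_bytes <= 0:
        return 1
    return max(1, (total_bytes + target_file_bytes - 1) // target_file_bytes)


def write_parquet(df, path, estimated_bytes, codec="snappy"):
    """Write df as Parquet with roughly one file per TARGET_FILE_BYTES.

    Assumptions: df is a pyspark.sql.DataFrame, and estimated_bytes is
    the caller's own estimate of the output size. codec is passed to
    Spark's Parquet writer, e.g. "snappy" or "gzip".
    """
    n = output_file_count(estimated_bytes)
    # Spark writes one file per non-empty partition, so repartitioning
    # to n partitions gives about n output files.
    df.repartition(n).write.parquet(path, compression=codec)
    return n
```

For example, `write_parquet(df, "hdfs:///tmp/events", 10 * 1024 ** 3, codec="gzip")` (the path is made up) would aim for ten gzip-compressed files. Files come out evenly sized only if rows are similar in size, so treat the result as a starting point and check the actual file sizes against the HDFS block size.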
More information:
https://parquet.apache.org/
http://stackoverflow.com/questions/42918663/is-it-better-to-have-one-large-parquet-file-or-lots-of-smaller-parquet-files/42919015#42919015