Tips for using Apache Parquet with Spark 2.x

What is Apache Parquet?
It is a compressible, binary, columnar data format used in the Hadoop ecosystem. We’ll discuss it primarily in the context of the Hadoop Distributed File System (HDFS) and Spark 2.x.

What role does it fill?
It is a fast, efficient data format well suited to scalable big data analytics. Because it is columnar, queries that touch only a subset of columns read far less data than they would from a row-oriented format like CSV.

Optimization Tips

  • Aim for around 1GB parquet output files, but experiment with other sizes for your use case and cluster setup (source)
  • Ideally store on HDFS in file sizes of at least the HDFS block size (default 128MB)
  • Storing Parquet files on S3 also works (side note: if you want Presto-style SQL queries on demand at low cost, use Amazon Athena, which charges based on the amount of data scanned)
  • Use snappy compression if storage space is not a concern, since it compresses and decompresses quickly; for much better compression at what should be a relatively small performance cost, use gzip. Parquet files remain splittable with either codec, because compression is applied per column chunk rather than to the file as a whole (source)
  • More information:
    https://parquet.apache.org/
    http://stackoverflow.com/questions/42918663/is-it-better-to-have-one-large-parquet-file-or-lots-of-smaller-parquet-files/42919015#42919015
