Parquet – Garren's [Big] Data Blog

Spark File Format Showdown – CSV vs JSON vs Parquet

Posted by Garren on 2017/10/09

Apache Spark supports many different data sources, such as the ubiquitous Comma-Separated Values (CSV) format and the web-API-friendly JavaScript Object Notation (JSON) format. A common format used primarily for big data analytics is Apache Parquet. Parquet is a fast columnar data format that you can read more about in two of my… Continue reading→

Apache Spark Best Practices, CSV, JSON, Parquet, s3, spark

Real Time Big Data analytics: Parquet (and Spark) + bonus

Posted by Garren on 2017/06/26

Apache Spark and Parquet (SParquet) are a match made in scalable data analytics and delivery heaven. Spark brings a wide-ranging, powerful computing platform to the equation, while Parquet offers a data format that is purpose-built for high-speed big data analytics. If this sounds like fluffy marketing talk, resist the temptation to close this tab,… Continue reading→

Apache Spark aws, Best Practices, Cloudera, Impala, Parquet, spark 3 Comments

Tips for using Apache Parquet with Spark 2.x

Posted by Garren on 2017/04/08

What is Apache Parquet? It is a compressible binary columnar data format used in the Hadoop ecosystem. We’ll talk about it primarily in relation to the Hadoop Distributed File System (HDFS) and Spark 2.x contexts. What role does it fill? It is a fast and efficient data format well suited to scalable big data analytics. Optimization… Continue reading→

Apache Spark Parquet, spark 2 Comments