Spark File Format Showdown – CSV vs JSON vs Parquet

Apache Spark supports many different data sources, such as the ubiquitous Comma Separated Value (CSV) format and web API friendly JavaScript Object Notation (JSON) format. A common format used primarily for big data analytical purposes is Apache Parquet. Parquet is a fast columnar data format that you can read more about in two of my… Continue reading


Real Time Big Data analytics: Parquet (and Spark) + bonus

Apache Spark and Parquet (SParquet) are a match made in scalable data analytics and delivery heaven. Spark brings a wide ranging, powerful computing platform to the equation while Parquet offers a data format that is purpose-built for high-speed big data analytics. If this sounds like fluffy marketing talk, resist the temptation to close this tab,… Continue reading


Tips for using Apache Parquet with Spark 2.x

What is Apache Parquet? It is a compressable binary columnar data format used in the hadoop ecosystem. We’ll talk about it primarily with relation to the Hadoop Distributed File System (HDFS) and Spark 2.x contexts. What role does it fill? It is a fast and efficient data format great for scalable big data analytics. Optimization… Continue reading