Apache Spark and Parquet (SParquet) are a match made in scalable data analytics and delivery heaven. Spark brings a wide-ranging, powerful computing platform to the equation, while Parquet offers a data format that is purpose-built for high-speed big data analytics. If this sounds like fluffy marketing talk, resist the temptation to close this tab,…
Pre-requisites: AWS S3; Hadoop AWS Jar; AWS Java SDK Jar (note: these AWS jars should not be necessary if you’re using Amazon EMR); Amazon Redshift JDBC Driver; Spark-Redshift package. The Spark-Redshift package provided by Databricks is critical, particularly if you wish to WRITE to Redshift, because it does bulk file operations instead…
What is Apache Parquet? It is a compressible, binary, columnar data format used in the Hadoop ecosystem. We’ll talk about it primarily in relation to the Hadoop Distributed File System (HDFS) and Spark 2.x. What role does it fill? It is a fast and efficient data format well suited to scalable big data analytics. Optimization…