Connecting Apache Spark to External Data sources (e.g. Redshift, S3, MySQL)

Pre-requisites AWS S3 Hadoop AWS Jar AWS Java SDK Jar * Note: These AWS jars should not be necessary if you’re using Amazon EMR. Amazon Redshift JDBC Driver Spark-Redshift package * * The Spark-redshift package provided by Databricks is critical particularly if you wish to WRITE to Redshift, because it does bulk file operations instead… Continue reading


Tips for using Apache Parquet with Spark 2.x

What is Apache Parquet? It is a compressable binary columnar data format used in the hadoop ecosystem. We’ll talk about it primarily with relation to the Hadoop Distributed File System (HDFS) and Spark 2.x contexts. What role does it fill? It is a fast and efficient data format great for scalable big data analytics. Optimization… Continue reading