April 2017 – Garren's [Big] Data Blog

Connecting Apache Spark to External Data sources (e.g. Redshift, S3, MySQL)

Posted by Garren on 2017/04/09

Pre-requisites: for AWS S3, the Hadoop AWS jar and the AWS Java SDK jar (note: these AWS jars should not be necessary if you’re using Amazon EMR); for Amazon Redshift, the JDBC driver and the Spark-Redshift package. The Spark-Redshift package provided by Databricks is critical, particularly if you wish to write to Redshift, because it does bulk file operations instead… Continue reading→
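As a rough sketch of how prerequisites like these are usually handed to Spark: `spark-submit` accepts a comma-separated `--jars` value. The jar paths, the job name `my_job.py`, and the helper `build_spark_submit` below are hypothetical placeholders and are not taken from the post.

```python
from typing import List

# Hypothetical local jar paths; the real file names and versions depend on
# your Spark, Hadoop, and Redshift driver releases.
PREREQUISITE_JARS = [
    "/opt/jars/hadoop-aws.jar",
    "/opt/jars/aws-java-sdk.jar",
    "/opt/jars/redshift-jdbc.jar",
    "/opt/jars/spark-redshift.jar",
]


def build_spark_submit(app: str, jars: List[str]) -> List[str]:
    """Return a spark-submit argument list that ships the given jars."""
    if not jars:
        return ["spark-submit", app]
    # --jars expects a single comma-separated value with no spaces.
    return ["spark-submit", "--jars", ",".join(jars), app]


command = build_spark_submit("my_job.py", PREREQUISITE_JARS)
print(" ".join(command))
```

Building the command as a list keeps it safe to pass to `subprocess.run` without shell quoting.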

Apache Spark aws, mysql, redshift, s3, spark

Tips for using Apache Parquet with Spark 2.x

Posted by Garren on 2017/04/08

What is Apache Parquet? It is a compressible, binary, columnar data format used in the Hadoop ecosystem. We’ll talk about it primarily in relation to the Hadoop Distributed File System (HDFS) and Spark 2.x. What role does it fill? It is a fast and efficient data format, well suited to scalable big data analytics. Optimization… Continue reading→
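To make "columnar" and "compressible" a little more concrete, here is a toy illustration in plain Python. It is not Parquet's actual on-disk layout, whose encodings are more involved; the sample rows and the helpers `to_columns` and `run_length_encode` are made up for the example.

```python
from itertools import groupby
from typing import Dict, List, Tuple

# Made-up sample data in row-oriented form.
rows = [
    {"country": "US", "clicks": 3},
    {"country": "US", "clicks": 5},
    {"country": "US", "clicks": 2},
    {"country": "DE", "clicks": 7},
    {"country": "DE", "clicks": 1},
]


def to_columns(records: List[dict]) -> Dict[str, list]:
    """Pivot row-oriented records into one list per column."""
    columns: Dict[str, list] = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


def run_length_encode(values: list) -> List[Tuple[object, int]]:
    """Collapse consecutive repeats into (value, count) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]


columns = to_columns(rows)
# Values of one column sit together, so repeats compress well.
encoded_country = run_length_encode(columns["country"])
# An aggregate over one column never has to touch the others.
total_clicks = sum(columns["clicks"])

print(encoded_country)  # [('US', 3), ('DE', 2)]
print(total_clicks)     # 18
```

Storing each column contiguously is what lets an analytics query read only the columns it needs and lets similar values be encoded compactly.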

Apache Spark Parquet, spark 2 Comments