s3 – Garren's [Big] Data Blog

Big data [Spark] and its small files problem

Posted by Garren on 2017/11/04

Often we log data in JSON, CSV, or another text format to Amazon’s S3 as compressed files. This pattern is a) accessible and b) effectively unlimited in scale, because the data sits in S3 as common text files. However, this pattern comes with some subtle but critical caveats that can cause quite a bit… Continue reading→

Apache Spark Best Practices, s3, Small Files, spark 6 Comments

Spark File Format Showdown – CSV vs JSON vs Parquet

Posted by Garren on 2017/10/09

Apache Spark supports many different data sources, such as the ubiquitous Comma-Separated Values (CSV) format and the web-API-friendly JavaScript Object Notation (JSON) format. A common format used primarily for big data analytics is Apache Parquet. Parquet is a fast columnar data format that you can read more about in two of my… Continue reading→

Apache Spark Best Practices, CSV, JSON, Parquet, s3, spark

Connecting Apache Spark to External Data sources (e.g. Redshift, S3, MySQL)

Posted by Garren on 2017/04/09

Pre-requisites
AWS S3
    Hadoop AWS Jar
    AWS Java SDK Jar
    * Note: These AWS jars should not be necessary if you’re using Amazon EMR.
Amazon Redshift
    JDBC Driver
    Spark-Redshift package *
    * The Spark-redshift package provided by Databricks is critical particularly if you wish to WRITE to Redshift, because it does bulk file operations instead… Continue reading→

Apache Spark aws, mysql, redshift, s3, spark