Best Practices – Garren's [Big] Data Blog

Scaling Python for Data Science using Spark

Posted by Garren on 2018/01/06

Python is the de facto language of Data Science & Engineering. (IMHO R is grand for statisticians, but Python is for the rest of us.) As a prominent language in the field, it only makes sense that Apache Spark supports it with Python-specific APIs. Spark makes it so easy to use Python that it… Continue reading→

Apache Spark Best Practices, Data Science, PySpark, Python, spark 1 Comment

Big data [Spark] and its small files problem

Posted by Garren on 2017/11/04

Often we log data in JSON, CSV, or another text format to Amazon’s S3 as compressed files. This pattern is a) accessible and b) infinitely scalable by nature of being in S3 as common text files. However, this pattern comes with some subtle but critical caveats that can cause quite a bit… Continue reading→

Apache Spark Best Practices, s3, Small Files, spark 6 Comments

Spark File Format Showdown – CSV vs JSON vs Parquet

Posted by Garren on 2017/10/09

Apache Spark supports many different data sources, such as the ubiquitous Comma-Separated Values (CSV) format and the web-API-friendly JavaScript Object Notation (JSON) format. A common format used primarily for big data analytical purposes is Apache Parquet. Parquet is a fast columnar data format that you can read more about in two of my… Continue reading→

Apache Spark Best Practices, CSV, JSON, Parquet, s3, spark Leave a Comment
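
As a rough illustration of one difference between the two text formats named in the excerpt above, here is a minimal sketch using only the Python standard library. It is not Spark code and not the post's benchmark; the sample records and helper names (`to_csv`, `to_json_lines`) are invented for illustration. Parquet is left out because reading and writing it needs a third-party library.

```python
import csv
import io
import json

# Hypothetical sample records, invented purely for illustration.
records = [
    {"id": 1, "name": "alice", "score": 9.5},
    {"id": 2, "name": "bob", "score": 7.0},
]


def to_csv(rows):
    """Serialize a list of dicts as CSV text with a single header row."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


def to_json_lines(rows):
    """Serialize a list of dicts as one JSON object per line."""
    return "\n".join(json.dumps(row) for row in rows)


csv_text = to_csv(records)
json_text = to_json_lines(records)

# Read both back. CSV carries no type information, so every value
# comes back as a string; JSON keeps numbers as numbers.
csv_rows = list(csv.DictReader(io.StringIO(csv_text)))
json_rows = [json.loads(line) for line in json_text.splitlines()]

print(type(csv_rows[0]["id"]).__name__)   # str
print(type(json_rows[0]["id"]).__name__)  # int

# JSON repeats the field names in every record; CSV states them once.
print(len(csv_text), len(json_text))
```

The round trip shows why a CSV reader has to be told, or has to infer, a schema, while JSON carries basic types at the cost of repeating every key in every record.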

Using Spark Efficiently | Understanding Spark Event 7/29/17

Posted by Garren on 2017/07/29

This page is dedicated to resources related to the 7/29/17 Understanding Spark event presentation in Bellevue, WA.

Slides

Great [FREE!] resources on all things Spark:

https://jaceklaskowski.gitbooks.io/mastering-apache-spark/
https://spark.apache.org/docs/latest/sql-programming-guide.html

Databricks was founded by the original creators of Spark and is currently the largest contributor to Apache Spark. As such, they are a phenomenal resource for information and… Continue reading→

Apache Spark Best Practices, spark 1 Comment