In this Intro to PySpark Workshop, there are five main points:
- About Apache Spark
- Sample PySpark Application walkthrough with explanations
- Custom built Jupyter Azure Notebook to interactively demonstrate fundamental PySpark concepts
- Python-specific Spark advice
- Curated resources to learn more
Slides
PDF Version: Intro to PySpark Workshop
Q&A Options:
Twitter: #PySparkWorkshop
Sample app
from pyspark.sql import SparkSession

# Build SparkSession, gateway to everything Spark 2.x
spark = SparkSession.builder.appName(name="PySpark Intro").master("local[*]").getOrCreate()

# Create PySpark SQL DataFrame from CSV,
# inferring schema from the file and using the header row
green_trips = spark.read\
    .option("header", "true")\
    .option("inferSchema", "true")\
    .csv("green_tripdata_2017-06.csv")

# Create a view to use as if it were a SQL table
green_trips.createOrReplaceTempView("green_trips")

# Run arbitrary SQL to view total revenue by hour
revenue_by_hour = spark.sql("""
    SELECT hour(lpep_pickup_datetime), SUM(total_amount) AS total
    FROM green_trips
    GROUP BY hour(lpep_pickup_datetime)
    ORDER BY hour(lpep_pickup_datetime) ASC""")

# Write out to 25 files (because of 25 partitions) in a directory
revenue_by_hour.write.mode("overwrite").csv("green_revenue_by_hour")
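To preview the aggregation before writing it out, or to get a single CSV file instead of one part file per partition, the DataFrame can be shown and coalesced first. A minimal sketch, assuming the spark session and revenue_by_hour DataFrame from above (the output directory name is just an example):

# Preview the hourly totals in the driver console
revenue_by_hour.show(24)

# Coalesce to a single partition so the output directory contains one CSV part file
revenue_by_hour.coalesce(1)\
    .write.mode("overwrite")\
    .option("header", "true")\
    .csv("green_revenue_by_hour_single")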
This code can be put into a .py file and run using spark-submit at the command line:
> spark-submit sample_app.py
UPDATE: The content for this workshop was live-streamed and recorded for PyLadies Remote; the recording can be viewed here
Resources to learn more
Advice for vetting Spark resources
Newer content is generally much more relevant because of the rapid pace at which Apache Spark is developed. Avoid most content from before July 2016, when Spark 2.0 was released, because it may not reflect critical changes to Spark (such as the DataFrame/Dataset APIs, Structured Streaming, and SparkSession). Content that revolves around Spark 1.x (e.g. Spark 1.6.3) should be avoided as it is effectively obsolete (the last release on the 1.x line was November 2016, while Spark 2.x has had six releases since then).
Databricks is essentially a commercial offshoot of the original project at UC Berkeley's AMPLab; it counts Matei Zaharia, the original author of Spark, as a co-founder and employs the majority of Spark contributors. Basically, if Databricks says something about Spark, it is a good idea to listen.
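As a concrete example of what changed: pre-2.0 tutorials typically build a SparkContext and a separate SQLContext, while Spark 2.x consolidates both into SparkSession. A minimal sketch of the two styles (the file name is just an example):

# Spark 2.x: SparkSession is the single entry point
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("EntryPoints").getOrCreate()
trips = spark.read.option("header", "true").csv("some_trips.csv")

# Spark 1.x style seen in older tutorials (obsolete; CSV support required the external spark-csv package):
# from pyspark import SparkContext
# from pyspark.sql import SQLContext
# sc = SparkContext(appName="EntryPoints")
# sqlContext = SQLContext(sc)
# trips = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").load("some_trips.csv")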
Books
Learning PySpark (Feb 2017) by Tomasz Drabas and Denny Lee
Gentle Introduction to Spark by Databricks
Mastering Apache Spark 2 by Jacek Laskowski – note this is more of a dense, incredibly useful reference than a tutorial or book meant to be read linearly
High Performance Spark: Best Practices for Scaling and Optimizing Apache Spark by Holden Karau and Rachel Warren – highly recommended once you’re more comfortable with Spark
Articles and Blog Posts
Introducing Vectorized UDFs for PySpark by Databricks (a small sketch follows this list)
Jump Start with Apache Spark 2.0 on Databricks by Databricks
Scaling Python for Data Science using Spark by Garren Staubli (me)
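Vectorized (Pandas) UDFs operate on whole pandas Series at a time instead of one row at a time, which avoids most of the Python serialization overhead discussed in the articles above. A minimal sketch, assuming Spark 2.3+ with PyArrow installed and the green_trips DataFrame from the sample app (the tip_amount column name is an assumption about the taxi data):

from pyspark.sql.functions import pandas_udf, PandasUDFType

# Scalar Pandas UDF: tip and total arrive as pandas Series, so the math is vectorized
@pandas_udf("double", PandasUDFType.SCALAR)
def tip_pct(tip, total):
    return tip / total * 100.0

green_trips.select(tip_pct("tip_amount", "total_amount").alias("tip_pct")).show(5)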
Notebooks
Intro to Apache Spark on Databricks by Databricks
Jupyter Azure Notebook by Garren Staubli (me)
Repositories
Spark: The Definitive Guide (WIP) by Bill Chambers and Matei Zaharia (Databricks)
Presentations
Extending Spark ML (+ 2nd video) by Holden Karau
Performance Optimization of Recommendation Training Pipeline at Netflix by DB Tsai
Free notebooks
Jupyter notebook in the Microsoft Azure cloud: Azure Notebooks
Docker image for Jupyter + PySpark