In this Intro to PySpark Workshop, there are five main points:
- About Apache Spark
- Sample PySpark Application walkthrough with explanations
- Custom built Jupyter Azure Notebook to interactively demonstrate fundamental PySpark concepts
- Python-specific Spark advice
- Curated resources to learn more
Slides
PDF Version: Intro to PySpark Workshop
Q&A Options:
Twitter: #PySparkWorkshop
Sample app
from pyspark.sql import SparkSession

# Build SparkSession, gateway to everything Spark 2.x
spark = SparkSession.builder.appName(name="PySpark Intro").master("local[*]").getOrCreate()

# Create PySpark SQL DataFrame from CSV,
# inferring schema from the file and using the header row
green_trips = spark.read\
    .option("header", "true")\
    .option("inferSchema", "true")\
    .csv("green_tripdata_2017-06.csv")

# Create a view to use as if it were a SQL table
green_trips.createOrReplaceTempView("green_trips")

# Run arbitrary SQL to view total revenue by hour
revenue_by_hour = spark.sql("""
    SELECT hour(lpep_pickup_datetime), SUM(total_amount) AS total
    FROM green_trips
    GROUP BY hour(lpep_pickup_datetime)
    ORDER BY hour(lpep_pickup_datetime) ASC""")

# Write out to 25 files (because of 25 partitions) in a directory
revenue_by_hour.write.mode("overwrite").csv("green_revenue_by_hour")
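To preview the aggregation before writing it out, or to get a single CSV file instead of one part file per partition, the DataFrame can be shown and coalesced first. A minimal sketch, assuming the spark session and revenue_by_hour DataFrame from above (the output directory name is just an example):

# Preview the hourly totals in the driver console
revenue_by_hour.show(24)

# Coalesce to a single partition so the output directory contains one CSV part file
revenue_by_hour.coalesce(1)\
    .write.mode("overwrite")\
    .option("header", "true")\
    .csv("green_revenue_by_hour_single")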
This code can be put into a .py file and run using spark-submit at the command line:
> spark-submit sample_app.py
UPDATE: The content for this workshop was live-streamed and recorded for PyLadies Remote; the recording can be viewed here
Resources to learn more
Advice for vetting Spark resources
Newer content is generally much more relevant because of the rapid pace at which Apache Spark is developed. Avoid most content from before July 2016, when Spark 2.0 was released, because it may not reflect critical changes to Spark (such as the DataFrame/Dataset APIs, Structured Streaming, and SparkSession). Content that revolves around Spark 1.x (e.g. Spark 1.6.3) should be avoided as it is effectively obsolete (the last release on the 1.x line was November 2016, while Spark 2.x has had six releases since then).
Databricks is essentially a commercial offshoot of the original project at UC Berkeley's AMPLab; it counts Matei Zaharia, the original author of Spark, as a co-founder and employs the majority of Spark contributors. Basically, if Databricks says something about Spark, it is a good idea to listen.
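As a concrete example of what changed: pre-2.0 tutorials typically build a SparkContext and a separate SQLContext, while Spark 2.x consolidates both into SparkSession. A minimal sketch of the two styles (the file name is just an example):

# Spark 2.x: SparkSession is the single entry point
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("EntryPoints").getOrCreate()
trips = spark.read.option("header", "true").csv("some_trips.csv")

# Spark 1.x style seen in older tutorials (obsolete; CSV support required the external spark-csv package):
# from pyspark import SparkContext
# from pyspark.sql import SQLContext
# sc = SparkContext(appName="EntryPoints")
# sqlContext = SQLContext(sc)
# trips = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").load("some_trips.csv")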
Books
Learning PySpark (Feb 2017) by Tomasz Drabas and Denny Lee
Gentle Introduction to Spark by Databricks
Mastering Apache Spark 2 by Jacek Laskowski – note this is more of a dense, incredibly useful reference than a tutorial or book meant to be read linearly
High Performance Spark: Best Practices for Scaling and Optimizing Apache Spark by Holden Karau and Rachel Warren – highly recommended once you’re more comfortable with Spark
Articles and Blog Posts
Introducing Vectorized UDFs for PySpark by Databricks (a small sketch follows this list)
Jump Start with Apache Spark 2.0 on Databricks by Databricks
Scaling Python for Data Science using Spark by Garren Staubli (me)
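Vectorized (Pandas) UDFs operate on whole pandas Series at a time instead of one row at a time, which avoids most of the Python serialization overhead discussed in the articles above. A minimal sketch, assuming Spark 2.3+ with PyArrow installed and the green_trips DataFrame from the sample app (the tip_amount column name is an assumption about the taxi data):

from pyspark.sql.functions import pandas_udf, PandasUDFType

# Scalar Pandas UDF: tip and total arrive as pandas Series, so the math is vectorized
@pandas_udf("double", PandasUDFType.SCALAR)
def tip_pct(tip, total):
    return tip / total * 100.0

green_trips.select(tip_pct("tip_amount", "total_amount").alias("tip_pct")).show(5)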
Notebooks
Intro to Apache Spark on Databricks by Databricks
Jupyter Azure Notebook by Garren Staubli (me)
Repositories
Spark: The Definitive Guide (WIP) by Bill Chambers and Matei Zaharia (Databricks)
Presentations
Extending Spark ML (+ 2nd video) by Holden Karau
Performance Optimization of Recommendation Training Pipeline at Netflix by DB Tsai
Free notebooks
Jupyter notebook in the Microsoft Azure cloud: Azure Notebooks
Docker image for Jupyter + PySpark