Avoiding Performance Potholes: Scaling Python for Data Science using Spark @ Spark + AI Summit

Python is the de facto language of data science and engineering, which affords it an outsized community of users. However, when many data scientists and engineers come to Spark with a Python background, unexpected performance potholes can stand in the way of progress. These “Performance Potholes” include PySpark’s ease of integration with existing packages (e.g. Pandas, SciPy, Scikit Learn, etc), using Python UDFs, and utilizing the RDD APIs instead of Spark SQL DataFrames without understanding the implications. Additionally, Spark 2.3 changes the game even further with vectorized UDFs. In this talk, we will discuss:

– How PySpark works broadly (& why it matters)
– Integrating popular Python packages with Spark
– Python UDFs (how to [not] use them)
– RDDs vs Spark SQL DataFrames
– Spark 2.3 Vectorized UDFs

Session hashtag: #Py9SAIS

Download full slides here

Spark + AI Summit session page with video

Scaling Python for Data Science using Spark


Real-Time Decision Engine using Spark Structured Streaming + ML

Real-time decision making using ML/AI is the holy grail of customer-facing applications. It’s no longer a long-shot dream; it’s our new reality. The real-time decision engine leverages the latest features in Apache Spark 2.3, including stream-to-stream joins and Spark ML, to directly improve the customer experience. We will discuss the architecture at length, including data source features and technical intricacies, as well as model training and serving dynamics. Critically, real-time decision engines that directly affect customer experience require production-level SLAs and/or reliable fallbacks to avoid meltdowns.

These Slides were put together for Data Platforms 2018 presented by Qubole.

Recorded Video of the talk @ BrightTalk

Using new PySpark 2.3 Vectorized Pandas UDFs: Lessons

Since Spark 2.3 was officially released 2/28/18, I wanted to check the performance of the new Vectorized Pandas UDFs using Apache Arrow.

Following up to my Scaling Python for Data Science using Spark post where I mentioned Spark 2.3 introducing Vectorized UDFs, I’m using the same Data (from NYC yellow cabs) with this code:

from pyspark.sql import functions as F
from pyspark.sql.functions import pandas_udf, PandasUDFType
from pyspark.sql.types import *
import pandas as pd

df = spark.read\
  .option("header", "true")\
  .option("inferSchema", "true")\

def timestamp_to_epoch(t):
 return t.dt.strftime("%s").apply(str) # <-- pandas.Series calls

f_timestamp_copy = pandas_udf(timestamp_to_epoch, returnType=StringType())
df = df.withColumn("timestamp_copy", f_timestamp_copy(F.col("tpep_pickup_datetime")))

# s = pd.Series({'ds': pd.Timestamp('2018-03-03 04:31:19')})
# timestamp_to_epoch(s)
## ds 1520080279
## dtype: object

Pandas Scalar (the default; as opposed to grouped map) UDFs operate on pandas.Series objects for both input and output, hence the .dt call chain as opposed to directly calling strftime on a python datetime object. The entire functionality is dependent on using PyArrow (>= 0.8.0).


Expect errors to crop up as this functionality is new. I have seen a fair share of memory leaks and casting errors causing my jobs to fail during testing.

Running the job above shows some new items in the Spark UI (DAG) and explain plan:

Note the addition of ArrowEvalPython

What’s the performance like?!

To jog your memory, PySpark SQL took 17 seconds to count the distinct epoch timestamps, and regular Python UDFs took over 10 minutes (610 seconds).

Much to my dismay, the performance of my contrived test was in line with Python UDFs, not Spark SQL with a runtime of 9-10 minutes.

I’ll update this post [hopefully] as I get more information.

PySpark ML + NLP Workshop


1. Explore Amazon reviews

2. Sentimentalize the reviews

3. Word frequency by helpfulness

Workshop Resources

Azure Notebooks Library

Sentiment Notebook

Commoners Notebook

More information


http://jmcauley.ucsd.edu/data/amazon/  | Amazon reviews for NLP

http://mpqa.cs.pitt.edu/lexicons/effect_lexicon/ | +/- Effect Lexicon


http://nlp.johnsnowlabs.com/ | Spark Package for NLP

https://spark.apache.org/docs/latest/ml-guide.html | Spark ML guide – focus on DataFrame based, NOT RDD-based