Avoiding Performance Potholes: Scaling Python for Data Science using Spark @ Spark + AI Summit

Python is the de facto language of data science and engineering, which affords it an outsized community of users. However, when many data scientists and engineers come to Spark with a Python background, unexpected performance potholes can stand in the way of progress. These “Performance Potholes” include PySpark’s ease of integration with existing packages (e.g. Pandas, SciPy, Scikit Learn, etc), using Python UDFs, and utilizing the RDD APIs instead of Spark SQL DataFrames without understanding the implications. Additionally, Spark 2.3 changes the game even further with vectorized UDFs. In this talk, we will discuss:

– How PySpark works broadly (& why it matters)
– Integrating popular Python packages with Spark
– Python UDFs (how to [not] use them)
– RDDs vs Spark SQL DataFrames
– Spark 2.3 Vectorized UDFs

Session hashtag: #Py9SAIS

Download full slides here

Spark + AI Summit session page with video

Scaling Python for Data Science using Spark

 

Leave a Reply

Your email address will not be published. Required fields are marked *