Avoiding Performance Potholes: Scaling Python for Data Science using Spark @ Spark + AI Summit

Python is the de facto language of data science and engineering, which affords it an outsized community of users. However, when data scientists and engineers come to Spark with a Python background, unexpected performance potholes can stand in the way of progress. These “Performance Potholes” include relying on PySpark’s ease of integration with existing packages (e.g. Pandas, SciPy, scikit-learn, etc.), using Python UDFs, and using the RDD APIs instead of Spark SQL DataFrames, all without understanding the implications. Additionally, Spark 2.3 changes the game even further with vectorized UDFs. In this talk, we will discuss:

– How PySpark works broadly (& why it matters)
– Integrating popular Python packages with Spark
– Python UDFs (how to [not] use them)
– RDDs vs Spark SQL DataFrames
– Spark 2.3 Vectorized UDFs

Session hashtag: #Py9SAIS

Download full slides here

Spark + AI Summit session page with video


Real-Time Decision Engine using Spark Structured Streaming + ML

Real-time decision making using ML/AI is the holy grail of customer-facing applications. It’s no longer a long-shot dream; it’s our new reality. The real-time decision engine leverages the latest features in Apache Spark 2.3, including stream-to-stream joins and Spark ML, to directly improve the customer experience. We will discuss the architecture at length, including data source features and technical intricacies, as well as model training and serving dynamics. Critically, real-time decision engines that directly affect customer experience require production-level SLAs and/or reliable fallbacks to avoid meltdowns.

These slides were put together for Data Platforms 2018, presented by Qubole.

Runtime Stats for Functions | Python Decorator

In a similar vein to my prior Python decorator for function metadata (“meta_func” => github | PyPI | blog), this decorator is intended to show how many times a function is called and aggregate statistics on the time taken per call.

It keeps track of each function by its uniquely assigned Python object identifier, the total number of calls, the total time taken across all calls to that function, and the min, max, and average time per call.

Sample usage:
def self_mult(n):
    return n*n

print(self_mult(10)) # => 100
print(self_mult(7)) # => 49
print(self_mult.get_func_runtime_stats()) # => {'total_time': 401.668, 'avg': 200.834, 'func_uid': 4302206808, 'func_name': 'self_mult', 'min': 200.445, 'max': 201.223, 'total_calls': 2}
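
For readers curious how such a decorator might be put together, here is a minimal sketch. It is not the published implementation: the name `runtime_stats` is hypothetical, and treating the times as milliseconds is an assumption. Only the keys of the returned dictionary and the `get_func_runtime_stats()` accessor are taken from the sample output above.

```python
import functools
import time


def runtime_stats(func):
    """Hypothetical stand-in for the decorator described in this post."""
    # Running aggregates; times are stored in milliseconds (an assumption).
    stats = {
        "func_name": func.__name__,
        "func_uid": id(func),  # Python's object identifier for the function
        "total_calls": 0,
        "total_time": 0.0,
        "min": None,
        "max": None,
    }

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            # Record the call even if the wrapped function raises.
            elapsed_ms = (time.perf_counter() - start) * 1000.0
            stats["total_calls"] += 1
            stats["total_time"] += elapsed_ms
            stats["min"] = elapsed_ms if stats["min"] is None else min(stats["min"], elapsed_ms)
            stats["max"] = elapsed_ms if stats["max"] is None else max(stats["max"], elapsed_ms)

    def get_func_runtime_stats():
        calls = stats["total_calls"]
        avg = stats["total_time"] / calls if calls else None
        return dict(stats, avg=avg)  # return a copy so callers cannot mutate the aggregates

    wrapper.get_func_runtime_stats = get_func_runtime_stats
    return wrapper


@runtime_stats
def self_mult(n):
    return n * n


print(self_mult(10))  # => 100
print(self_mult(7))   # => 49
print(self_mult.get_func_runtime_stats()["total_calls"])  # => 2
```

Keeping the aggregates in a closure rather than a module-level registry means each decorated function carries its own counters, which is one simple way to get per-function tracking.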

Replace CTRL-A in a file while in a screen session

echo -e "\u0001" | cat -v
# ^A

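# Caveat: cat -v prints CTRL-A as the two characters "^A", so with GNU tr this also turns every literal "^" and "A" into a tab
cat -v 000001 | tr '^A' '\t' | head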

Inspiration: http://stackoverflow.com/questions/31460818/creating-a-ctrl-a-delimiter-file

Note: Within the same day, this strategy both worked and then failed. YMMV.

More reliable would be to leave the screen session and type the literal character with Ctrl-V followed by Ctrl-A.
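
Another option that sidesteps the terminal entirely is to do the replacement on raw bytes, so nothing depends on how CTRL-A is typed or displayed. The sketch below is one possible approach rather than part of the original recipe: the helper name `replace_ctrl_a` and the `.tsv` output name are hypothetical, and the demo writes its own throwaway input file named `000001`.

```python
import os
import tempfile

CTRL_A = b"\x01"  # the single byte that cat -v displays as ^A


def replace_ctrl_a(src_path, dst_path, replacement=b"\t"):
    """Copy src_path to dst_path, swapping every CTRL-A byte for replacement."""
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        for line in src:
            dst.write(line.replace(CTRL_A, replacement))


# Demo on a temporary file; literal "A" and "^" characters must survive untouched.
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "000001")
    dst = os.path.join(tmp, "000001.tsv")
    with open(src, "wb") as f:
        f.write(b"Alice\x0142\x01A^B\n")
    replace_ctrl_a(src, dst)
    with open(dst, "rb") as f:
        result = f.read()

print(result)  # => b'Alice\t42\tA^B\n'
```

Working in binary mode keeps the file's other bytes intact and only touches the delimiter itself.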