Real-Time Decision Engine using Spark Structured Streaming + ML

Real-time decision making using ML/AI is the holy grail of customer-facing applications. It’s no longer a long-shot dream; it’s our new reality. The real-time decision engine leverages the latest features in Apache Spark 2.3, including stream-to-stream joins and Spark ML, to directly improve the customer experience. We will discuss the architecture at length, including data source features and technical intricacies, as well as model training and serving dynamics. Critically, real-time decision engines that directly affect customer experience require production-level SLAs and/or reliable fallbacks to avoid meltdowns.

These Slides were put together for Data Platforms 2018 presented by Qubole.

Runtime Stats for Functions | Python Decorator

In a similar vein to my prior Python decorator metadata for functions (“meta_func” => github | PyPi | blog), this decorator is intended to help illuminate the number of calls and time taken per call aggregates.

It will keep track of each function by its uniquely assigned python object identifier, the total number of function calls, total time taken for all calls to that function, and min, max and average time for the function calls.

Sample usage:
def self_mult(n):
return n*n

print(self_mult(10)) # => 100
print(self_mult(7)) # => 49
print(self_mult.get_func_runtime_stats()) # => {'total_time': 401.668, 'avg': 200.834, 'func_uid': 4302206808, 'func_name': 'self_mult', 'min': 200.445, 'max': 201.223, 'total_calls': 2}

Replace CTRL-A in a file while in a screen session

echo -e "\u0001” | cat -v
# ^A

cat -v 000001 | tr '^A' '\t' | head


Note: Within the same day, this strategy both worked then failed. YMMV

More reliable would be to get into a non screen session and do “ctrl-v then a”

Split file by keys

Files sometimes come in (whether via hadoop or other processes) as big globs of data with inter-related parts. Many times I want to process these globs concurrently but see my dilemma unfolding quickly. I could a) write the code to process it serially and be done with it in 1 hour or b) write code to process it concurrently and be done in 1.5 hours because the added overhead of verifying the output, thread safety, etc exceeds the processing time serially. This made me sad, because concurrent processes are awesome. But self-managed thread safe concurrent processes are even more awesome!

I thought, what if I could split an input file on keys and group those similarly keyed lines into separate files for processing. Aha!

So I naturally first tried finding existing solutions and to be honest, awk has a pretty killer one liner as noted here on Stack Overflow:

awk '{ print >> $5 }' yourfile

This one liner is likely great for many folks (especially when using only small files). But for me, awk threw an error due to having “too many open files” – again, sad face :(.

So… I wrote my own python command line utility to take an input file and split it into any number of output files by unique keys in the file. So all your input is maintained, just in different files sorted/segregated on keys you provide.

While my naming conventions may be lacking panache, they are at least clearly intentioned utilities. (But seriously, if you have a better name, I’m all ears)

Without further ado, I give you