Big data [Spark] and its small files problem

Often we log data in JSON, CSV, or another text format to Amazon's S3 as compressed files. This pattern is a) accessible and b) infinitely scalable by nature of living in S3 as common text files. However, it comes with some subtle but critical caveats that can cause quite a bit of trouble. Here I'm going to discuss the "small files problem" that has existed in big data since Hadoop and MapReduce first came to prominence, along with advice on how to solve it.

Consider the following Spark command:

df = spark.read.json("s3://awesome-bucket/offers/2017/09/07/*")

That statement looks simple and innocuous enough. It tells Spark to read in a day's worth of JSON-formatted files in awesome-bucket under the offers key "directory."*

The "*" glob at the end of the path means we'll be reading in the files in each hour "directory", each of which contains over 30,000 files. In total, there are 24 "directories" holding over 700,000 files and 72M rows, with each file weighing in at roughly 30KB.

But… there's a lot going on under the hood, and none of it is good for performance:

1. S3 is not a file system
2. Metadata on tiny files
3. GZip compressed files
4. JSON

1. S3 is not a file system

Amazon's Simple Storage Service (S3) is a "cloud-based object storage solution" where each object is identified by a bucket and a key.
When you list a directory on your local computer (e.g. "ls /tmp" on *nix systems), information about the directory and its files is
returned immediately. Asking S3 to list all the files in a "directory" (hint: it's not actually a directory), such as with
"s3cmd ls s3://bucket/path/to/files", returns results in seconds at best, possibly minutes. S3 works well when you have a few large files
but is horrendous when you have an army of tiny files, because the more files there are, the longer the listing process takes.
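
If you want to see the listing cost for yourself, here's a rough sketch using the Hadoop FileSystem API that ships with Spark. It assumes the hypothetical bucket from above and a working S3A connector; it is an illustration, not a benchmark:

import java.net.URI
import org.apache.hadoop.fs.{FileSystem, Path}

// Ask the S3A connector to list a single hour "directory" and time it.
val fs = FileSystem.get(new URI("s3a://awesome-bucket"), spark.sparkContext.hadoopConfiguration)

val start    = System.nanoTime()
val statuses = fs.listStatus(new Path("s3a://awesome-bucket/offers/2017/09/07/00/"))
val seconds  = (System.nanoTime() - start) / 1e9

println(s"Listed ${statuses.length} objects in $seconds seconds")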

2. Metadata on tiny files

Spark has to know the exact path of, and how to open, each and every file (e.g. s3://bucket/path/to/objects/object1.gz), even if you just pass it
a path to a "directory", because ultimately that "directory" is just a collection of files (or "objects"). With an army of tiny files, this metadata
gets large, both in number of entries and in memory footprint: 100,000+ records in a hashmap holding location, compression type and
other metadata is not lightweight. Add in the overhead of using S3 plus network latencies and it becomes clearer why this metadata
collection process takes so long.
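
A quick way to get a feel for how much per-file metadata Spark ends up tracking is to ask a DataFrame which files back it. A sketch, again against the hypothetical bucket from above:

// Read a single hour "directory" and count the individual files Spark enumerated.
val df = spark.read.json("s3a://awesome-bucket/offers/2017/09/07/12/*")

// Every one of these ~30,000 paths (plus size, compression codec, etc.) is tracked on the driver.
println(df.inputFiles.length)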

3. GZip Compressed Files

GZip compression is great. Sure, there are other compression formats out there (e.g. brotli, LZO, bzip2), but few if any are as widespread and
accessible as GZip. What GZip has in market share, though, it sacrifices in big-data-friendly features. GZip is not splittable, which
means an entire file must be processed by a single core/executor as a single partition. This is an anti-pattern in a tool like Spark designed
explicitly around parallelism, because serially processing a file is expensive. Small GZip files are actually better to process than large
ones, because multiple cores can work on different small files at the same time rather than sitting idle while one or two cores do all the work. But
don't be tricked into thinking "oh, I'll just have loads of small files", because as you saw above, loads of small files are far worse than just
about any alternative.
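
A minimal sketch of the splittability difference (the file names here are hypothetical):

// A single gzipped file becomes a single partition, no matter how large it is.
val gz = spark.read.text("s3a://awesome-bucket/big-file.json.gz")
println(gz.rdd.getNumPartitions)     // 1

// The same data uncompressed (or in a splittable codec) can be split into many partitions.
val plain = spark.read.text("s3a://awesome-bucket/big-file.json")
println(plain.rdd.getNumPartitions)  // > 1 for a large file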

4. JSON

It's slow, plain and simple. It allows for complex structures and data types, as well as implicitly defined types, all of which make it very
expensive to parse.

* "Directory" on S3 is a misnomer: S3 is an object store where each object is the combination of a bucket (e.g. "awesome-bucket") and
a specific key (e.g. "/path/to/objects/object1.gz") that has the object's contents as its value. Think of it like a key-value store where
"s3://awesome-bucket/offers/2017/09/07/12/2017-09-07-12-59-0.9429827.gz" is the key and the gzipped contents of
that file are the value. The slashes after the bucket name (awesome-bucket) mean nothing to S3; they exist solely for the user's
convenience, and treating them as if they denote true directories is a convenience feature Amazon and the various APIs offer.
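
If you want to poke at this directly, here's a sketch using the AWS SDK for Java against the hypothetical bucket above. Note that the "prefix" is just a string filter on keys, not a directory lookup:

import com.amazonaws.services.s3.AmazonS3ClientBuilder
import com.amazonaws.services.s3.model.ListObjectsV2Request
import scala.collection.JavaConverters._

// "Give me keys starting with this string" -- there is no directory traversal involved.
val s3 = AmazonS3ClientBuilder.defaultClient()
val request = new ListObjectsV2Request()
  .withBucketName("awesome-bucket")
  .withPrefix("offers/2017/09/07/12/")

// Only the first page of up to 1,000 keys; listing 30,000+ files requires paginating.
val result = s3.listObjectsV2(request)
result.getObjectSummaries.asScala.foreach(obj => println(obj.getKey))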

What can we do about it?
Use file formats like Apache Parquet and ORC. If you need to work with armies of small files (ideally converting them in full), there are a few
approaches you can use.
1) S3DistCp (Qubole calls it CloudDistCp)
2) Use Scala's parallel collections with Spark to submit multiple jobs concurrently. See below for an example.
3) Just wait. A long time.

Parallel Job Submission to Consolidate many “directories”

val hours = (0 to 23).map(h => "%02d".format(h)) // zero pad
hours.par.foreach(hour => {
  spark.read.json("s3a://awesome-bucket/offers/2017/09/07/" + hour + "/*")
    .repartition(16)                          // consolidate the tiny inputs into 16 partitions
    .write.parquet("s3://output/" + hour)     // one output "directory" per hour so parallel jobs don't collide
})

If the code and explanation don't have you convinced, see this chart, which shows the outsized performance boost from using Parquet over
armies of tiny files. The chart only looks at one hour's worth of data, and on local HDFS instead of S3, so it actually makes the small files
look better than they should. I expect a minimum of a 20x speedup from using optimal file formats.

For the exceptionally interested parties, here is the chunk of Spark code that I believe handles collecting the metadata on input files:
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala#L380 – note how it’s not parallelized, asynchronous or otherwise optimized for peak performance involving tiny files.

Spark File Format Showdown – CSV vs JSON vs Parquet

Apache Spark supports many different data sources, such as the ubiquitous Comma Separated Value (CSV) format and web API friendly JavaScript Object Notation (JSON) format. A common format used primarily for big data analytical purposes is Apache Parquet. Parquet is a fast columnar data format that you can read more about in two of my other posts: Real Time Big Data analytics: Parquet (and Spark) + bonus and Tips for using Apache Parquet with Spark 2.x

In this post we’re going to cover the attributes of using these 3 formats (CSV, JSON and Parquet) with Apache Spark.

Splittable (definition): Spark likes to split a single input file into multiple chunks (partitions, to be precise) so that it [Spark] can work on many partitions at one time (i.e. concurrently).

* CSV is splittable when it is a raw, uncompressed file or when it uses a splittable compression format such as BZIP2 or LZO (note: LZO needs to be indexed to be splittable!)

** JSON has the same splittability conditions as CSV when compressed, with one extra difference: when the multiLine option (originally named wholeFile, re: SPARK-18352) is set to true, JSON is NOT splittable (see the sketch below).
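
For the curious, a minimal sketch of that last point (the file names are hypothetical):

// JSON Lines input (one object per line): splittable when uncompressed.
val lines = spark.read.json("data.json")

// multiLine treats each file as one JSON document, so every file must be
// consumed whole by a single task and is NOT splittable.
val whole = spark.read.option("multiLine", "true").json("single-document.json")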

CSV should generally be the fastest to write, JSON the easiest for a human to understand, and Parquet the fastest to read.

CSV is the de facto standard for a lot of data, and for fair reasons; it's [relatively] easy to comprehend for both users and computers, and it's made even more accessible via Microsoft Excel.

JSON is the standard for communicating on the web. APIs and websites are constantly communicating in JSON because of its usability properties, such as its self-describing structure.

Parquet is optimized for the Write Once Read Many (WORM) paradigm. It's slow to write but incredibly fast to read, especially when you're only accessing a subset of the total columns. For use cases that operate on entire rows of data, a format like CSV, JSON or even Avro should be used.
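
As a sketch of why that read-side speed shows up (using the data.parquet example from later in this post): selecting a subset of columns means Parquet only has to touch those column chunks on disk.

// Only the "guid" column chunks (plus the Parquet footer metadata) are read from storage;
// the other columns are skipped entirely thanks to the columnar layout.
val guids = spark.read.parquet("data.parquet").select("guid")
guids.show(5)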

Code examples and explanations

CSV

Generic column names | all string types | lazily evaluated

scala> val df = spark.read.option("sep", "\t").csv("data.csv")
scala> df.printSchema
root
 |-- _c0: string (nullable = true)
 |-- _c1: string (nullable = true)
 |-- _c2: string (nullable = true)
 |-- _c3: string (nullable = true)

Header-defined column names | all string types | lazily evaluated

scala> val df = spark.read.option("sep", "\t").option("header","true").csv("data.csv")
scala> df.printSchema
root
 |-- guid: string (nullable = true)
 |-- date: string (nullable = true)
 |-- alphanum: string (nullable = true)
 |-- name: string (nullable = true)

Header-defined column names | inferred types | EAGERLY evaluated (!!!)

scala> val df = spark.read.option("sep", "\t").option("header","true").option("inferSchema","true").csv("data.csv")
scala> df.printSchema
root
 |-- guid: string (nullable = true)
 |-- date: timestamp (nullable = true)
 |-- alphanum: string (nullable = true)
 |-- name: string (nullable = true)

The eager evaluation of this version is critical to understand. In order to determine with certainty the proper data type to assign to each column, Spark has to READ AND PARSE THE ENTIRE DATASET. This can be a very high cost, especially when the number of files/rows/columns is large. Spark also does no other processing while it's inferring the schema, so your actual transformation code won't be running during that pass. Spark therefore has to read your file(s) TWICE instead of ONCE.
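
If you already know the schema, you can hand it to Spark and skip the inference pass entirely. A sketch reusing the column names and types from the inferred example above:

import org.apache.spark.sql.types._

// Declaring the schema up front avoids the extra full read of the dataset.
val csvSchema = StructType(Seq(
  StructField("guid", StringType),
  StructField("date", TimestampType),
  StructField("alphanum", StringType),
  StructField("name", StringType)
))

val df = spark.read
  .option("sep", "\t")
  .option("header", "true")
  .schema(csvSchema)
  .csv("data.csv")   // lazily evaluated again: nothing is read at this point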

JSON

Named columns | inferred types | EAGERLY evaluated
scala> val df = spark.read.json("data.json")
scala> df.printSchema
root
 |-- alphanum: string (nullable = true)
 |-- epoch_date: long (nullable = true)
 |-- guid: string (nullable = true)
 |-- name: string (nullable = true)

Like the eagerly evaluated (for schema inference) CSV above, JSON files are eagerly evaluated.
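
The same explicit-schema trick works for JSON and makes the read lazy again. A sketch reusing the field names from the printSchema output above:

import org.apache.spark.sql.types._

// With a user-supplied schema there is nothing to infer, so Spark doesn't scan the data here.
val jsonSchema = StructType(Seq(
  StructField("alphanum", StringType),
  StructField("epoch_date", LongType),
  StructField("guid", StringType),
  StructField("name", StringType)
))

val df = spark.read.schema(jsonSchema).json("data.json")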

Parquet

Named Columns | Defined types | lazily evaluated
scala> val df = spark.read.parquet("data.parquet")
scala> df.printSchema
root
 |-- alphanum: string (nullable = true)
 |-- date: long (nullable = true)
 |-- guid: string (nullable = true)
 |-- name: string (nullable = true)

Unlike CSV and JSON, Parquet files are binary files that contain metadata about their contents, so without needing to read/parse the content of the file(s), Spark can rely on the metadata stored in each Parquet file's footer to determine column names and data types.

 TL;DR Use Apache Parquet instead of CSV or JSON whenever possible, because it’s faster and better.