Python is the de facto language of data science and engineering, which affords it an outsized community of users. However, when many data scientists and engineers come to Spark with a Python background, unexpected performance potholes can stand in the way of progress. These “Performance Potholes” include PySpark’s ease of integration with existing packages (e.g.…
Objectives:
1. Explore Amazon reviews
2. Sentimentalize the reviews
3. Word frequency by helpfulness
Workshop Resources:
Azure Notebooks Library – Sentiment Notebook – Commoners Notebook
More information:
Datasets
http://jmcauley.ucsd.edu/data/amazon/ | Amazon reviews for NLP
http://mpqa.cs.pitt.edu/lexicons/effect_lexicon/ | +/- Effect Lexicon
Packages
http://nlp.johnsnowlabs.com/ | Spark Package for NLP
https://spark.apache.org/docs/latest/ml-guide.html | Spark ML guide – focus on DataFrame…
This Intro to PySpark Workshop covers five main points:
1. About Apache Spark
2. Sample PySpark Application walkthrough with explanations
3. Custom-built Jupyter Azure Notebook to interactively demonstrate fundamental PySpark concepts
4. Python-specific Spark advice
5. Curated resources to learn more
Slides PDF Version: Intro to PySpark Workshop
Q&A Options: Twitter: #PySparkWorkshop
Sample app:
from pyspark.sql import…
This post is accessible via garrens.com/DataSnowCat and references material covered at Spark + AI Summit (link) 2019.
Real-time decision making using ML/AI is the holy grail of customer-facing applications. It’s no longer a long-shot dream; it’s our new reality. The real-time decision engine leverages the latest features in Apache Spark 2.3, including stream-to-stream joins and Spark ML, to directly improve the customer experience. We will discuss the architecture at length, including data…
Now that Spark 2.3 has been officially released (2/28/18), I wanted to check the performance of the new Vectorized Pandas UDFs, which use Apache Arrow. Following up on my Scaling Python for Data Science using Spark post, where I mentioned that Spark 2.3 introduces Vectorized UDFs, I’m using the same data (from NYC yellow cabs) with this code: from…