Avoiding Performance Potholes: Scaling Python for Data Science using Spark @ Spark + AI Summit

Python is the de facto language of data science and engineering, which affords it an outsized community of users. However, when many data scientists and engineers come to Spark with a Python background, unexpected performance potholes can stand in the way of progress. These “Performance Potholes” include PySpark’s ease of integration with existing packages (e.g.… Continue reading

PySpark ML + NLP Workshop

Objectives: 1. Explore Amazon reviews 2. Sentimentalize the reviews 3. Word frequency by helpfulness Workshop Resources Azure Notebooks Library – Sentiment Notebook – Commoners Notebook More information Datasets http://jmcauley.ucsd.edu/data/amazon/  | Amazon reviews for NLP http://mpqa.cs.pitt.edu/lexicons/effect_lexicon/ | +/- Effect Lexicon Packages http://nlp.johnsnowlabs.com/ | Spark Package for NLP https://spark.apache.org/docs/latest/ml-guide.html | Spark ML guide – focus on DataFrame… Continue reading

Intro to PySpark Workshop 2018-01-24

In this Intro to PySpark Workshop, there are five main points: About Apache Spark Sample PySpark Application walkthrough with explanations Custom built Jupyter Azure Notebook to interactively demonstrate fundamental PySpark concepts Python-specific Spark advice Curated resources to learn more Slides PDF Version: Intro to PySpark Workshop Q&A Options: Twitter: #PySparkWorkshop Sample app from pyspark.sql import… Continue reading