Apache Spark – Garren's [Big] Data Blog

Using new PySpark 2.3 Vectorized Pandas UDFs: Lessons

Posted by Garren on 2018/03/04

Since Spark 2.3 was officially released on 2/28/18, I wanted to check the performance of the new Vectorized Pandas UDFs using Apache Arrow. Following up on my Scaling Python for Data Science using Spark post, where I mentioned that Spark 2.3 introduces Vectorized UDFs, I’m using the same data (from NYC yellow cabs) with this code: from… Continue reading→

Apache Spark 1 Comment

PySpark ML + NLP Workshop

Posted by Garren on 2018/02/22

Objectives:
1. Explore Amazon reviews
2. Sentimentalize the reviews
3. Word frequency by helpfulness
Workshop Resources
Azure Notebooks Library – Sentiment Notebook – Commoners Notebook
More information
Datasets
http://jmcauley.ucsd.edu/data/amazon/ | Amazon reviews for NLP
http://mpqa.cs.pitt.edu/lexicons/effect_lexicon/ | +/- Effect Lexicon
Packages
http://nlp.johnsnowlabs.com/ | Spark Package for NLP
https://spark.apache.org/docs/latest/ml-guide.html | Spark ML guide – focus on DataFrame… Continue reading→

Apache Spark Data Science, IPython Notebook, Jupyter, Machine Learning, ML, Natural Language Processing, NLP, PySpark, Python, spark, Workshop

Intro to PySpark Workshop 2018-01-24

Posted by Garren on 2018/01/24

In this Intro to PySpark Workshop, there are five main points:
1. About Apache Spark
2. Sample PySpark Application walkthrough with explanations
3. Custom built Jupyter Azure Notebook to interactively demonstrate fundamental PySpark concepts
4. Python-specific Spark advice
5. Curated resources to learn more
Slides PDF Version: Intro to PySpark Workshop
Q&A Options: Twitter: #PySparkWorkshop
Sample app from pyspark.sql import… Continue reading→

Apache Spark IPython Notebook, Jupyter, PySpark, Python, spark, Workshop

Scaling Python for Data Science using Spark

Posted by Garren on 2018/01/06

Python is the de facto language of Data Science & Engineering. (IMHO R is grand for statisticians, but Python is for the rest of us.) As a prominent language in the field, it only makes sense that Apache Spark supports it with Python-specific APIs. Spark makes it so easy to use Python that it… Continue reading→

Apache Spark Best Practices, Data Science, PySpark, Python, spark 1 Comment