Garren's [Big] Data Blog – Discussing data, big, small and in between

Avoiding Performance Potholes: Scaling Python for Data Science using Spark @ Spark + AI Summit

Posted by Garren on 2018/06/05

Python is the de facto language of data science and engineering, which affords it an outsized community of users. However, when many data scientists and engineers come to Spark with a Python background, unexpected performance potholes can stand in the way of progress. These “Performance Potholes” include PySpark’s ease of integration with existing packages (e.g.… Continue reading→

Default Leave a Comment

PySpark ML + NLP Workshop

Posted by Garren on 2018/02/22

Objectives: 1. Explore Amazon reviews 2. Sentimentalize the reviews 3. Word frequency by helpfulness Workshop Resources Azure Notebooks Library – Sentiment Notebook – Commoners Notebook More information Datasets http://jmcauley.ucsd.edu/data/amazon/ | Amazon reviews for NLP http://mpqa.cs.pitt.edu/lexicons/effect_lexicon/ | +/- Effect Lexicon Packages http://nlp.johnsnowlabs.com/ | Spark Package for NLP https://spark.apache.org/docs/latest/ml-guide.html | Spark ML guide – focus on DataFrame… Continue reading→

Apache Spark Data Science, IPython Notebook, Jupyter, Machine Learning, ML, Natural Language Processing, NLP, PySpark, Python, spark, Workshop Leave a Comment

Intro to PySpark Workshop 2018-01-24

Posted by Garren on 2018/01/24

In this Intro to PySpark Workshop, there are five main points: About Apache Spark Sample PySpark Application walkthrough with explanations Custom built Jupyter Azure Notebook to interactively demonstrate fundamental PySpark concepts Python-specific Spark advice Curated resources to learn more Slides PDF Version: Intro to PySpark Workshop Q&A Options: Twitter: #PySparkWorkshop Sample app from pyspark.sql import… Continue reading→