spark – Garren's [Big] Data Blog

PySpark ML + NLP Workshop

Posted by Garren on 2018/02/22

Objectives: 1. Explore Amazon reviews 2. Sentimentalize the reviews 3. Word frequency by helpfulness Workshop Resources Azure Notebooks Library – Sentiment Notebook – Commoners Notebook More information Datasets http://jmcauley.ucsd.edu/data/amazon/ | Amazon reviews for NLP http://mpqa.cs.pitt.edu/lexicons/effect_lexicon/ | +/- Effect Lexicon Packages http://nlp.johnsnowlabs.com/ | Spark Package for NLP https://spark.apache.org/docs/latest/ml-guide.html | Spark ML guide – focus on DataFrame… Continue reading→

Apache Spark Data Science, IPython Notebook, Jupyter, Machine Learning, ML, Natural Language Processing, NLP, PySpark, Python, spark, Workshop Leave a Comment

Intro to PySpark Workshop 2018-01-24

Posted by Garren on 2018/01/24

In this Intro to PySpark Workshop, there are five main points: About Apache Spark Sample PySpark Application walkthrough with explanations Custom built Jupyter Azure Notebook to interactively demonstrate fundamental PySpark concepts Python-specific Spark advice Curated resources to learn more Slides PDF Version: Intro to PySpark Workshop Q&A Options: Twitter: #PySparkWorkshop Sample app from pyspark.sql import… Continue reading→

Apache Spark IPython Notebook, Jupyter, PySpark, Python, spark, Workshop Leave a Comment

Scaling Python for Data Science using Spark

Posted by Garren on 2018/01/06

Python is the de facto language of Data Science & Engineering. (IMHO R is grand for statisticians, but Python is for the rest of us.) As a prominent language in the field, it only makes sense that Apache Spark supports it with Python specific APIs. Spark makes it so easy to use Python that it… Continue reading→

Apache Spark Best Practices, Data Science, PySpark, Python, spark 1 Comment

Big data [Spark] and its small files problem

Posted by Garren on 2017/11/04

Often we log data in JSON, CSV or other text format to Amazon’s S3 as compressed files. This pattern is a) accessible and b) infinitely scalable by nature of being in S3 as common text files. However, there are some subtle but critical caveats that come with this pattern that can cause quite a bit… Continue reading→

Apache Spark Best Practices, s3, Small Files, spark 6 Comments