In this Intro to PySpark Workshop, there are five main points:
PDF Version: Intro to PySpark Workshop

## Q&A Options:

Twitter: #PySparkWorkshop
### Sample app
```python
from pyspark.sql import SparkSession

# Build SparkSession, the gateway to everything in Spark 2.x
spark = SparkSession.builder.appName("PySpark Intro").master("local[*]").getOrCreate()

# Create a PySpark SQL DataFrame from CSV,
# inferring the schema from the file
# and using the header row for column names
green_trips = spark.read\
    .option("header", "true")\
    .option("inferSchema", "true")\
    .csv("green_tripdata_2017-06.csv")

# Create a view to use as if it were a SQL table
green_trips.createOrReplaceTempView("green_trips")

# Run arbitrary SQL to view total revenue by hour
revenue_by_hour = spark.sql("""
SELECT hour(lpep_pickup_datetime) AS pickup_hour, SUM(total_amount) AS total
FROM green_trips
GROUP BY hour(lpep_pickup_datetime)
ORDER BY pickup_hour ASC""")

# Write out to 25 files (because of 25 partitions) in a directory
revenue_by_hour.write.mode("overwrite").csv("green_revenue_by_hour")
```
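If you'd rather avoid SQL strings, the same aggregation can be written with the DataFrame API. This is a minimal sketch, assuming the `green_trips` DataFrame from above (`hour` and `sum` come from `pyspark.sql.functions`; `sum` is aliased on import so it doesn't shadow the Python builtin):

```python
from pyspark.sql.functions import hour, sum as sum_

# Equivalent to the SQL query above, expressed with DataFrame methods
revenue_by_hour = green_trips\
    .groupBy(hour("lpep_pickup_datetime").alias("pickup_hour"))\
    .agg(sum_("total_amount").alias("total"))\
    .orderBy("pickup_hour")
```

Both forms compile down to the same query plan, so the choice is mostly about readability.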
The code above can be put into a .py file and run using **spark-submit** at the command line:

```
> spark-submit sample_app.py
```
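As an aside (not part of the original sample), if you remove the `.master("local[*]")` call from the script, you can choose the master at submit time instead, which makes the same file runnable locally or on a cluster. Note that settings made in code take precedence over spark-submit flags, so the flag only applies when the code leaves it unset:

```
> spark-submit --master "local[*]" sample_app.py
```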
## UPDATE: The content for this workshop was live-streamed and recorded for PyLadies Remote, which can be viewed here
## Resources to learn more
### Advice for vetting Spark resources

Newer content is generally much *more* relevant due to the rapid pace at which Apache Spark has been developed. Avoid most content from before July 2016, which is when Spark 2.0 was released, because it may not reflect many critical changes to Spark (such as the data structure APIs like DataFrames/Datasets, Structured Streaming, SparkSession, etc.). Content that revolves around Spark 1.x (e.g. Spark 1.6.3) should be avoided as it's effectively obsolete (*the last release on the 1.x line was Nov '16, while Spark 2.x has had 6 releases since then*). Databricks is essentially a commercial offshoot of the original project at UC Berkeley AMPLab, has Matei Zaharia, the original author of Spark, as a co-founder, and employs the majority of Spark contributors. Basically, if Databricks says something about Spark, it would be a good idea to listen.
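One quick way to date a resource: Spark 1.x examples build separate SparkContext/SQLContext objects, while everything since 2.0 goes through SparkSession. A rough sketch of the two styles for comparison:

```python
# Spark 1.x style -- a sign the resource predates July 2016
# from pyspark import SparkContext
# from pyspark.sql import SQLContext
# sc = SparkContext("local[*]", "old_style")
# sqlContext = SQLContext(sc)

# Spark 2.x style -- a single unified entry point
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("new_style").getOrCreate()
sc = spark.sparkContext  # the old SparkContext is still reachable if needed
```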
### Books