In this Intro to PySpark Workshop, there are five main points:
PDF Version: Intro to PySpark Workshop

## Q&A Options:

Twitter: #PySparkWorkshop
### Sample app
```python
from pyspark.sql import SparkSession

# Build SparkSession, the gateway to everything in Spark 2.x
spark = SparkSession.builder.appName("PySpark Intro").master("local[*]").getOrCreate()

# Create a PySpark SQL DataFrame from CSV,
# inferring the schema from the file
# and using the header row for column names
green_trips = spark.read\
    .option("header", "true")\
    .option("inferSchema", "true")\
    .csv("green_tripdata_2017-06.csv")

# Create a view to use as if it were a SQL table
green_trips.createOrReplaceTempView("green_trips")

# Run arbitrary SQL to view total revenue by hour
revenue_by_hour = spark.sql("""
SELECT hour(lpep_pickup_datetime) AS pickup_hour, SUM(total_amount) AS total
FROM green_trips
GROUP BY hour(lpep_pickup_datetime)
ORDER BY pickup_hour ASC""")

# Write out to 25 files (because of 25 partitions) in a directory
revenue_by_hour.write.mode("overwrite").csv("green_revenue_by_hour")
```
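If you'd rather avoid SQL strings, the same aggregation can be written with the DataFrame API. This is a minimal sketch, assuming the `green_trips` DataFrame from above (`hour` and `sum` come from `pyspark.sql.functions`; `sum` is aliased on import so it doesn't shadow the Python builtin):

```python
from pyspark.sql.functions import hour, sum as sum_

# Equivalent to the SQL query above, expressed with DataFrame methods
revenue_by_hour = green_trips\
    .groupBy(hour("lpep_pickup_datetime").alias("pickup_hour"))\
    .agg(sum_("total_amount").alias("total"))\
    .orderBy("pickup_hour")
```

Both forms compile down to the same query plan, so the choice is mostly about readability.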
The code above can be put into a .py file and run using **spark-submit** at the command line:

```
> spark-submit sample_app.py
```
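As an aside (not part of the original sample), if you remove the `.master("local[*]")` call from the script, you can choose the master at submit time instead, which makes the same file runnable locally or on a cluster. Note that settings made in code take precedence over spark-submit flags, so the flag only applies when the code leaves it unset:

```
> spark-submit --master "local[*]" sample_app.py
```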
## UPDATE: The content for this workshop was live-streamed and recorded for PyLadies Remote, which can be viewed here
## Resources to learn more
### Advice for vetting Spark resources

Newer content is generally much *more* relevant due to the rapid pace at which Apache Spark has been developed. Avoid most content from before July 2016, which is when Spark 2.0 was released, because it may not reflect many critical changes to Spark (such as the data structure APIs like DataFrames/Datasets, Structured Streaming, SparkSession, etc.). Content that revolves around Spark 1.x (e.g. Spark 1.6.3) should be avoided as it's effectively obsolete (*the last release on the 1.x line was Nov '16, while Spark 2.x has had 6 releases since then*). Databricks is essentially a commercial offshoot of the original project at UC Berkeley AMPLab, has Matei Zaharia, the original author of Spark, as a co-founder, and employs the majority of Spark contributors. Basically, if Databricks says something about Spark, it would be a good idea to listen.
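One quick way to date a resource: Spark 1.x examples build separate SparkContext/SQLContext objects, while everything since 2.0 goes through SparkSession. A rough sketch of the two styles for comparison:

```python
# Spark 1.x style -- a sign the resource predates July 2016
# from pyspark import SparkContext
# from pyspark.sql import SQLContext
# sc = SparkContext("local[*]", "old_style")
# sqlContext = SQLContext(sc)

# Spark 2.x style -- a single unified entry point
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("new_style").getOrCreate()
sc = spark.sparkContext  # the old SparkContext is still reachable if needed
```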
### Books