{"id":203,"date":"2018-01-24T16:21:32","date_gmt":"2018-01-25T00:21:32","guid":{"rendered":"http:\/\/garrens.com\/blog\/?p=203"},"modified":"2020-01-24T08:30:36","modified_gmt":"2020-01-24T16:30:36","slug":"intro-to-pyspark-workshop-2018-01-24","status":"publish","type":"post","link":"https:\/\/garrens.com\/blog\/2018\/01\/24\/intro-to-pyspark-workshop-2018-01-24\/","title":{"rendered":"Intro to PySpark Workshop 2018-01-24"},"content":{"rendered":"

In this Intro to PySpark Workshop, there are five main points:

  1. About Apache Spark
  2. Sample PySpark Application walkthrough with explanations
  3. Custom-built Jupyter Azure Notebook to interactively demonstrate fundamental PySpark concepts
  4. Python-specific Spark advice
  5. Curated resources to learn more

    Slides

    PDF Version: Intro to PySpark Workshop

    Q&A Options:

    Twitter: #PySparkWorkshop

    Sample app
    from pyspark.sql import SparkSession

    # Build SparkSession, gateway to everything Spark 2.x
    spark = SparkSession.builder.appName(name="PySpark Intro").master("local[*]").getOrCreate()

    # Create PySpark SQL DataFrame from CSV
    # inferring schema from file
    # and using header
    green_trips = spark.read\
        .option("header", "true")\
        .option("inferSchema", "true")\
        .csv("green_tripdata_2017-06.csv")

    # Create a view to use as if it were a SQL table
    green_trips.createOrReplaceTempView("green_trips")

    # Run arbitrary SQL to view total revenue by hour
    revenue_by_hour = spark.sql("""
    SELECT hour(lpep_pickup_datetime), SUM(total_amount) AS total
    FROM green_trips
    GROUP BY hour(lpep_pickup_datetime)
    ORDER BY hour(lpep_pickup_datetime) ASC""")

    # Write out to 25 files (because of 25 partitions) in a directory
    revenue_by_hour.write.mode("overwrite").csv("green_revenue_by_hour")

    This code can be put into a .py file and run using spark-submit at the command line:

    > spark-submit sample_app.py
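
    If the SQL aggregation in the sample app is unfamiliar, the following plain-Python sketch mimics what the revenue-by-hour query computes, without needing Spark. The rows in sample_trips and the helper name revenue_by_hour are made up for illustration (they are not taken from the real green_tripdata file), and the sketch ignores everything Spark adds on top, such as partitioning and distributed execution.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical sample rows: (lpep_pickup_datetime, total_amount).
# These values are invented for illustration only.
sample_trips = [
    ("2017-06-01 00:15:00", 10.50),
    ("2017-06-01 00:45:00", 7.25),
    ("2017-06-01 13:05:00", 22.00),
    ("2017-06-02 13:30:00", 5.75),
]


def revenue_by_hour(trips):
    """Sum total_amount per pickup hour, sorted by hour ascending.

    Mirrors GROUP BY hour(...) / SUM(...) / ORDER BY hour(...) ASC.
    """
    totals = defaultdict(float)
    for pickup, amount in trips:
        # Equivalent of SQL hour(lpep_pickup_datetime)
        hour = datetime.strptime(pickup, "%Y-%m-%d %H:%M:%S").hour
        totals[hour] += amount
    return sorted(totals.items())


for hour, total in revenue_by_hour(sample_trips):
    print(hour, total)
```

    Each output row pairs an hour of the day (0 to 23) with the summed revenue for trips picked up in that hour, across all days in the input.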

    UPDATE: The content for this workshop was live-streamed and recorded for PyLadies Remote, and the recording can be viewed here

    Resources to learn more

    Advice for vetting Spark resources

    Newer content is generally much more relevant because Apache Spark has been developed at such a rapid pace. Avoid most content from before July 2016, when Spark 2.0 was released, because it may not reflect many critical changes to Spark (such as data structure APIs like DataFrames/Datasets, Structured Streaming, SparkSession, etc.). Content that revolves around Spark 1.x (e.g. Spark 1.6.3) should be avoided, as it’s effectively obsolete (i.e. the last release on the 1.x line was in Nov ’16, while Spark 2.x has had 6 releases since then). Databricks is essentially a commercial offshoot of the original project at UC Berkeley AMPLab; it has Matei Zaharia, the original author of Spark, as a co-founder and employs the majority of Spark contributors. Basically, if Databricks says something about Spark, it would be a good idea to listen.

    Books

    Learning PySpark (Feb 2017) by Tomasz Drabas and Denny Lee

    Gentle Introduction to Spark by Databricks

    Mastering Apache Spark 2 by Jacek Laskowski – note this is more of a dense, incredibly useful reference than a tutorial or book meant to be read linearly

    High Performance Spark: Best Practices for Scaling and Optimizing Apache Spark by Holden Karau and Rachel Warren – highly recommended once you’re more comfortable with Spark

    Articles and Blog Posts

    Introducing Vectorized UDFs for PySpark by Databricks

    Jump Start with Apache Spark 2.0 on Databricks by Databricks

    Scaling Python for Data Science using Spark by Garren Staubli (me)

    Notebooks

    Intro to Apache Spark on Databricks by Databricks

    Jupyter Azure Notebook by Garren Staubli (me)

    Repositories

    Spark: The Definitive Guide (WIP) by Bill Chambers and Matei Zaharia (Databricks)

    Presentations

    Extending Spark ML (+ 2nd video) by Holden Karau

    Performance Optimization of Recommendation Training Pipeline at Netflix by DB Tsai

    Free notebooks

    Jupyter notebook in the Microsoft Azure cloud: Azure Notebooks

    Databricks community edition

    Docker image for Jupyter + PySpark

     <\/p>\n","protected":false},"excerpt":{"rendered":"

    In this Intro to PySpark Workshop, there are five main points: About Apache Spark Sample PySpark Application walkthrough with explanations Custom built Jupyter Azure Notebook to interactively demonstrate fundamental PySpark concepts Python-specific Spark advice Curated resources to learn more Slides PDF Version: Intro to PySpark Workshop Q&A Options: Twitter: #PySparkWorkshop Sample app from pyspark.sql import… Continue reading→<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":true,"template":"","format":"standard","meta":[],"categories":[22],"tags":[15,13,7,14,2,12],"_links":{"self":[{"href":"https:\/\/garrens.com\/blog\/wp-json\/wp\/v2\/posts\/203"}],"collection":[{"href":"https:\/\/garrens.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/garrens.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/garrens.com\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/garrens.com\/blog\/wp-json\/wp\/v2\/comments?post=203"}],"version-history":[{"count":16,"href":"https:\/\/garrens.com\/blog\/wp-json\/wp\/v2\/posts\/203\/revisions"}],"predecessor-version":[{"id":333,"href":"https:\/\/garrens.com\/blog\/wp-json\/wp\/v2\/posts\/203\/revisions\/333"}],"wp:attachment":[{"href":"https:\/\/garrens.com\/blog\/wp-json\/wp\/v2\/media?parent=203"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/garrens.com\/blog\/wp-json\/wp\/v2\/categories?post=203"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/garrens.com\/blog\/wp-json\/wp\/v2\/tags?post=203"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}