<h1>Using new PySpark 2.3 Vectorized Pandas UDFs: Lessons</h1>
<p><em>2018-03-04</em></p>
<p>Since Spark 2.3 was <a href="https://spark.apache.org/releases/spark-release-2-3-0.html">officially released</a> 2/28/18, I wanted to check the performance of the new <a href="https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.pandas_udf">Vectorized Pandas UDFs</a> using <a href="https://arrow.apache.org/">Apache Arrow</a>.</p>
<p>Following up on my <a href="http://garrens.com/blog/2018/01/06/scaling-python-for-data-science-using-spark/">Scaling Python for Data Science using Spark</a> post, where I mentioned Spark 2.3 introducing Vectorized UDFs, I&#8217;m using the same data (from NYC yellow cabs) with this code:</p>
<pre>
from pyspark.sql import functions as F
from pyspark.sql.functions import pandas_udf, PandasUDFType
from pyspark.sql.types import *
import pandas as pd

df = spark.read\
    .option("header", "true")\
    .option("inferSchema", "true")\
    .csv("yellow_tripdata_2017-06.csv")

def timestamp_to_epoch(t):
    return t.dt.strftime("%s").apply(str)  # &lt;-- pandas.Series calls

f_timestamp_copy = pandas_udf(timestamp_to_epoch, returnType=StringType())
df = df.withColumn("timestamp_copy", f_timestamp_copy(F.col("tpep_pickup_datetime")))
df.select('timestamp_copy').distinct().count()

# s = pd.Series({'ds': pd.Timestamp('2018-03-03 04:31:19')})
# timestamp_to_epoch(s)
## ds    1520080279
## dtype: object
</pre>
<p>Pandas scalar UDFs (the default, as opposed to grouped map) operate on <a href="https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.html">pandas.Series</a> objects for both input and output, hence the <span style="text-decoration: underline;">.dt</span> accessor chain rather than calling strftime directly on a Python datetime object. The entire feature depends on PyArrow (&gt;= 0.8.0).</p>
<p>Expect errors to crop up as this functionality is new; during testing I saw my fair share of memory leaks and casting errors that caused jobs to fail.</p>
<p>Running the job above surfaces some new items in the Spark UI (DAG) and in the explain plan:</p>
<p><a href="http://garrens.com/blog/wp-content/uploads/2018/03/Screen-Shot-2018-03-03-at-8.20.20-PM.png"><img loading="lazy" class="alignnone size-full wp-image-234" src="http://garrens.com/blog/wp-content/uploads/2018/03/Screen-Shot-2018-03-03-at-8.20.20-PM.png" alt="Spark UI DAG showing the ArrowEvalPython stage" width="690" height="1324" /></a> <a href="http://garrens.com/blog/wp-content/uploads/2018/03/Screen-Shot-2018-03-03-at-8.20.02-PM.png"><img loading="lazy" class="wp-image-235 aligncenter" src="http://garrens.com/blog/wp-content/uploads/2018/03/Screen-Shot-2018-03-03-at-8.20.02-PM.png" alt="Explain plan showing ArrowEvalPython" width="807" height="121" /></a></p>
<p>Note the addition of ArrowEvalPython.</p>
<h3><strong>What&#8217;s the performance like?!</strong></h3>
<p>To jog your memory: PySpark SQL took 17 seconds to count the distinct epoch timestamps, and regular Python UDFs took over 10 minutes (610 seconds).</p>
<p>Much to my dismay, this contrived test performed in line with Python UDFs rather than Spark SQL, with a runtime of 9-10 minutes.</p>
<p>I&#8217;ll [hopefully] update this post as I get more information.</p>
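<p>One caveat worth noting on the code above: <code>strftime("%s")</code> is a platform-dependent extension (a glibc-ism that applies the host machine's local timezone), so the same input can yield different epoch strings on different hosts. A minimal, Spark-free sketch of the same Series-level logic using portable integer arithmetic instead (the function name is mine, not from the post; naive timestamps are treated as UTC here, so the output differs from the post's local-time example by the host's UTC offset):</p>

```python
import pandas as pd

def timestamp_to_epoch_portable(t: pd.Series) -> pd.Series:
    # datetime64[ns] -> integer nanoseconds -> epoch seconds as strings.
    # Unlike t.dt.strftime("%s"), this does not depend on the host's
    # locale or timezone: naive timestamps are interpreted as UTC.
    return (t.astype("int64") // 1_000_000_000).astype(str)

s = pd.Series([pd.Timestamp("2018-03-03 04:31:19")])
print(timestamp_to_epoch_portable(s)[0])  # 1520051479 (UTC interpretation)
```

<p>Because scalar pandas UDFs receive and return a <code>pandas.Series</code>, a function like this could be wrapped with <code>pandas_udf(..., returnType=StringType())</code> exactly as in the post's example.</p>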