<h1>Using new PySpark 2.3 Vectorized Pandas UDFs: Lessons</h1>
<p><em>2018-03-04</em></p>
<p>Since Spark 2.3 was <a href="https://spark.apache.org/releases/spark-release-2-3-0.html">officially released</a> 2/28/18, I wanted to check the performance of the new <a href="https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.pandas_udf">Vectorized Pandas UDFs</a> using <a href="https://arrow.apache.org/">Apache Arrow</a>.</p>
<p>Following up on my <a href="http://garrens.com/blog/2018/01/06/scaling-python-for-data-science-using-spark/">Scaling Python for Data Science using Spark</a> post, where I mentioned Spark 2.3 introducing Vectorized UDFs, I&#8217;m using the same data (from NYC yellow cabs) with this code:</p>
<pre>
from pyspark.sql import functions as F
from pyspark.sql.functions import pandas_udf, PandasUDFType
from pyspark.sql.types import *
import pandas as pd

df = spark.read\
    .option("header", "true")\
    .option("inferSchema", "true")\
    .csv("yellow_tripdata_2017-06.csv")

def timestamp_to_epoch(t):
    return t.dt.strftime("%s").apply(str)  # &lt;-- pandas.Series calls

f_timestamp_copy = pandas_udf(timestamp_to_epoch, returnType=StringType())
df = df.withColumn("timestamp_copy", f_timestamp_copy(F.col("tpep_pickup_datetime")))
df.select('timestamp_copy').distinct().count()

# s = pd.Series({'ds': pd.Timestamp('2018-03-03 04:31:19')})
# timestamp_to_epoch(s)
## ds    1520080279
## dtype: object
</pre>
<p>Pandas scalar UDFs (the default, as opposed to grouped map) operate on <a href="https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.html">pandas.Series</a> objects for both input and output, hence the <span style="text-decoration: underline;">.dt</span> accessor chain rather than calling strftime directly on a Python datetime object. The entire feature depends on PyArrow (&gt;= 0.8.0).</p>
<p>Expect errors to crop up as this functionality is new; during testing I saw my fair share of memory leaks and casting errors that caused jobs to fail.</p>
<p>Running the job above surfaces some new items in the Spark UI (DAG) and in the explain plan:</p>
<p><a href="http://garrens.com/blog/wp-content/uploads/2018/03/Screen-Shot-2018-03-03-at-8.20.20-PM.png"><img loading="lazy" class="alignnone size-full wp-image-234" src="http://garrens.com/blog/wp-content/uploads/2018/03/Screen-Shot-2018-03-03-at-8.20.20-PM.png" alt="Spark UI DAG showing the ArrowEvalPython stage" width="690" height="1324" /></a> <a href="http://garrens.com/blog/wp-content/uploads/2018/03/Screen-Shot-2018-03-03-at-8.20.02-PM.png"><img loading="lazy" class="wp-image-235 aligncenter" src="http://garrens.com/blog/wp-content/uploads/2018/03/Screen-Shot-2018-03-03-at-8.20.02-PM.png" alt="Explain plan showing ArrowEvalPython" width="807" height="121" /></a></p>
<p>Note the addition of ArrowEvalPython.</p>
<h3><strong>What&#8217;s the performance like?!</strong></h3>
<p>To jog your memory: PySpark SQL took 17 seconds to count the distinct epoch timestamps, and regular Python UDFs took over 10 minutes (610 seconds).</p>
<p>Much to my dismay, this contrived test performed in line with Python UDFs rather than Spark SQL, with a runtime of 9-10 minutes.</p>
<p>I&#8217;ll [hopefully] update this post as I get more information.</p>
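<p>One caveat worth noting on the code above: <code>strftime("%s")</code> is a platform-dependent extension (a glibc-ism that applies the host machine's local timezone), so the same input can yield different epoch strings on different hosts. A minimal, Spark-free sketch of the same Series-level logic using portable integer arithmetic instead (the function name is mine, not from the post; naive timestamps are treated as UTC here, so the output differs from the post's local-time example by the host's UTC offset):</p>

```python
import pandas as pd

def timestamp_to_epoch_portable(t: pd.Series) -> pd.Series:
    # datetime64[ns] -> integer nanoseconds -> epoch seconds as strings.
    # Unlike t.dt.strftime("%s"), this does not depend on the host's
    # locale or timezone: naive timestamps are interpreted as UTC.
    return (t.astype("int64") // 1_000_000_000).astype(str)

s = pd.Series([pd.Timestamp("2018-03-03 04:31:19")])
print(timestamp_to_epoch_portable(s)[0])  # 1520051479 (UTC interpretation)
```

<p>Because scalar pandas UDFs receive and return a <code>pandas.Series</code>, a function like this could be wrapped with <code>pandas_udf(..., returnType=StringType())</code> exactly as in the post's example.</p>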