<h1>Real Time Big Data analytics: Parquet (and Spark) + bonus</h1>
<p><em>June 26, 2017</em></p>
<div class="prose">
<p><a href="http://garrens.com/blog/wp-content/uploads/2017/06/spark-logo-trademark.png"><img loading="lazy" class="size-medium wp-image-142 aligncenter" src="http://garrens.com/blog/wp-content/uploads/2017/06/spark-logo-trademark-300x160.png" alt="Apache Spark logo" width="300" height="160" /></a></p>
<p>Apache Spark and Parquet (SParquet) are a match made in scalable data analytics and delivery heaven. Spark brings a wide-ranging, powerful computing platform to the equation, while Parquet offers a data format purpose-built for <strong>high-speed big data analytics</strong>. If this sounds like fluffy marketing talk, resist the temptation to close this tab: what follows are substantial insights I&#8217;ve gathered firsthand and am sharing here to help others get the most out of Parquet and Spark.</p>
<p><strong>What is Parquet?</strong></p>
<p>Parquet is a binary, compressed, columnar file format available to any project in the Hadoop ecosystem (and even to projects outside it).
It&#8217;s a mouthful, but let&#8217;s break it down.</p>
<p><em>Binary</em> means Parquet files cannot be opened natively by typical text editors (Sublime Text*, vim, etc.).</p>
<p>* My former colleague James Yu wrote a Sublime Text plugin, available <a href="https://github.com/yuj/sublime-parquet" target="_blank" rel="nofollow noopener noreferrer">here</a>, for viewing Parquet files.</p>
<p><em>Columnar</em> means the data is stored as columns rather than as rows, which is how most traditional databases (MySQL, PostgreSQL, etc.) and file formats (CSV, JSON, etc.) store it. This is going to be very important.</p>
<p><a href="http://garrens.com/blog/wp-content/uploads/2017/06/row-columnar.png"><img loading="lazy" class="aligncenter wp-image-143" src="http://garrens.com/blog/wp-content/uploads/2017/06/row-columnar-300x295.png" alt="Row-oriented vs. columnar storage layout" width="729" height="717" /></a></p>
<p><em>Compressed</em> means the file&#8217;s footprint on disk (HDFS, S3, or a local filesystem) is smaller than that of a typical raw, uncompressed file. Parquet handles compression differently than traditional compression of, say, a CSV file, but in a similar vein to Avro.</p>
<p>Now that the basic definitions are out of the way, let&#8217;s get right to it.</p>
<p><strong>How can Parquet help me?</strong></p>
<p>Parquet is exceptionally good at <strong>high-speed big data analytics</strong>.</p>
<p>It can store vast quantities of data and operate on it more quickly than many other solutions. Let&#8217;s say you have CSV files in S3 (20 TB and growing by 250 GiB per day) and a use case that requires reporting on those files in a dashboard. A common approach to this problem is to aggregate the CSV files down to a MySQL-friendly size and build reports on the aggregated data. However, this approach is limited in several ways:</p>
<ol>
<li>CSV is slow to parse because every row in the entire file must be read and each line&#8217;s columns parsed.</li>
<li>MySQL can only handle so much data, especially high-dimensionality data where users may want to pivot on many different attributes. Meeting every pivot requirement becomes impossible, so users&#8217; functionality must be restricted for the sake of technical limitations.</li>
<li>Building many tables to support the various pivot requirements becomes onerous, because each table (and the database itself) has to be limited in both size and scope. This increases database storage costs and complexity.</li>
</ol>
<p>If those limitations had you cringing, I&#8217;ve made my case well :). There is an alternative that utilizes SParquet:</p>
<ol>
<li>Process the CSV files into Parquet files (Snappy- or GZIP-compressed).</li>
<li>Use Spark with those Parquet files to drive a powerful and scalable analytics solution.</li>
</ol>
<p>CSV file for the proof of concept (PoC): <a href="https://s3.amazonaws.com/nyc-tlc/trip+data/green_tripdata_2016-12.csv" target="_blank" rel="nofollow noopener noreferrer">NYC TLC Green Taxi for December 2016</a></p>
<p>The CSV file has 1,224,160 rows and 19 columns, coming in at 107 MB uncompressed.</p>
<p>Here&#8217;s the file schema (using the header and inferSchema options in Spark 2.1.1):</p>
<pre spellcheck="false"> |-- VendorID: integer (nullable = true)
 |-- lpep_pickup_datetime: timestamp (nullable = true)
 |-- lpep_dropoff_datetime: timestamp (nullable = true)
 |-- store_and_fwd_flag: string (nullable = true)
 |-- RatecodeID: integer (nullable = true)
 |-- PULocationID: integer (nullable = true)
 |-- DOLocationID: integer (nullable = true)
 |-- passenger_count: integer (nullable = true)
 |-- trip_distance: double (nullable = true)
 |-- fare_amount: double (nullable = true)
 |-- extra: double (nullable = true)
 |-- mta_tax: double (nullable = true)
 |-- tip_amount: double (nullable = true)
 |-- tolls_amount: double (nullable = true)
 |-- ehail_fee: string (nullable = true)
 |-- improvement_surcharge: double (nullable = true)
 |-- total_amount: double (nullable = true)
 |-- payment_type: integer (nullable = true)
 |-- trip_type: integer (nullable = true)
</pre>
<p>The 107 MB uncompressed CSV was reduced to 24 MB (Snappy Parquet) and 19 MB (GZIP Parquet). But the real power shows once the data (now in Parquet format) is accessed. Parquet is exceptionally fast when accessing specific columns, the opposite of row-based file formats, which thrive when accessing an entire row record.</p>
<p>Here are simple SQL examples to show the differences:</p>
<pre spellcheck="false">--#1
--CSV will read the entire file row by row.
--Parquet will reassemble the rows from its column values.
--Winner: Parquet (by a small margin, because no parsing is needed)
SELECT *
FROM green_tlc

--#2
--CSV will read the entire file row by row,
--filter the PULocation column to rows containing 226,
--and output all rows/columns that match the filter criteria.
--Winner: Parquet (by a small margin, because of no parsing and pushdown filtering)
SELECT *
FROM green_tlc
WHERE PULocation = 226

--#3
--Parquet will first find only the relevant "data blocks" based on the filter criteria,
--then aggregate only the rows/columns that match it.
--Winner: Parquet (by a huge margin, because of no parsing and reading only specific columns)
SELECT PULocation, SUM(total_amount)
FROM green_tlc
WHERE PULocation IN (77, 102, 107, 226)
GROUP BY PULocation
</pre>
<p>#3 above is a great example of where Parquet shines: you&#8217;re using pushdown filtering, operating only on specific columns (the rest are ignored), and never parsing what you don&#8217;t care about (all the other columns and rows).</p>
<p><strong>What implementation strategies can I use?</strong></p>
<p>Some ideas:</p>
<ol>
<li>Spark with Parquet (SParquet) on <a href="http://livy.io/" target="_blank" rel="nofollow noopener noreferrer">Livy</a>, to be able to cache entire datasets/queries.</li>
<li>Bonus: Impala with Parquet-backed Hive tables (also Spark-compatible) to get hyperfast results available via SQL queries.</li>
</ol>
<p>By now, you have hopefully learned that Parquet is a powerful data format that facilitates big data analytics at a scale far greater than many traditional, limited approaches.</p>
<p>Go forth and play with Parquet!</p>
<p>Here&#8217;s my blog post with specific optimization tips: <a href="http://garrens.com/blog/2017/04/08/getting-started-and-tips-for-using-apache-parquet-with-apache-spark-2-x/" target="_blank" rel="nofollow noopener noreferrer">http://garrens.com/blog/2017/04/08/getting-started-and-tips-for-using-apache-parquet-with-apache-spark-2-x/</a></p>
<p><em>Garren Staubli</em> is a Big Data Engineer Consultant at Blueprint Consulting Services, and formerly a big data engineer at iSpot.TV.</p>
</div>