{"id":169,"date":"2017-10-09T06:40:19","date_gmt":"2017-10-09T14:40:19","guid":{"rendered":"http:\/\/garrens.com\/blog\/?p=169"},"modified":"2018-03-02T20:49:07","modified_gmt":"2018-03-03T04:49:07","slug":"spark-file-format-showdown-csv-vs-json-vs-parquet","status":"publish","type":"post","link":"https:\/\/garrens.com\/blog\/2017\/10\/09\/spark-file-format-showdown-csv-vs-json-vs-parquet\/","title":{"rendered":"Spark File Format Showdown &#8211; CSV vs JSON vs Parquet"},"content":{"rendered":"<p>Apache Spark supports many different data sources, such as the ubiquitous Comma Separated Value (CSV) format and web API friendly JavaScript Object Notation (JSON) format. A common format used primarily for big data analytical purposes is Apache Parquet. Parquet is a fast columnar data format that you can read more about in two of my other posts: <a href=\"http:\/\/garrens.com\/blog\/2017\/06\/26\/real-time-big-data-analytics-parquet-and-spark-bonus\/\">Real Time Big Data analytics: Parquet (and Spark) + bonus<\/a> and <a href=\"http:\/\/garrens.com\/blog\/2017\/04\/08\/getting-started-and-tips-for-using-apache-parquet-with-apache-spark-2-x\/\">Tips for using Apache Parquet with Spark 2.x<\/a><\/p>\n<p>In this post we&#8217;re going to cover the attributes of using these 3 formats (CSV, JSON and Parquet) with Apache Spark.<\/p>\n<p><a href=\"http:\/\/garrens.com\/blog\/wp-content\/uploads\/2017\/10\/Spark-Format-Showdown-3.png\"><img loading=\"lazy\" class=\"alignnone size-full wp-image-173\" src=\"http:\/\/garrens.com\/blog\/wp-content\/uploads\/2017\/10\/Spark-Format-Showdown-3.png\" alt=\"\" width=\"1575\" height=\"627\" srcset=\"https:\/\/garrens.com\/blog\/wp-content\/uploads\/2017\/10\/Spark-Format-Showdown-3.png 1575w, https:\/\/garrens.com\/blog\/wp-content\/uploads\/2017\/10\/Spark-Format-Showdown-3-300x119.png 300w, https:\/\/garrens.com\/blog\/wp-content\/uploads\/2017\/10\/Spark-Format-Showdown-3-768x306.png 768w, 
https:\/\/garrens.com\/blog\/wp-content\/uploads\/2017\/10\/Spark-Format-Showdown-3-1024x408.png 1024w\" sizes=\"(max-width: 1575px) 100vw, 1575px\" \/><\/a><\/p>\n<p><strong>Splittable (definition):<\/strong> Spark likes to <em>split<\/em> 1 single input file into multiple <em>chunks\u00a0<\/em>(partitions to be precise) so that it [Spark] can work on\u00a0many partitions at one time (re: concurrently).<\/p>\n<p>* CSV is splittable when it is a raw, uncompressed file or using a splittable compression format such as BZIP2 or LZO (note: LZO needs to be indexed to be splittable!)<\/p>\n<p>** JSON has the same conditions about splittability when compressed as CSV with one extra difference. When &#8220;wholeFile&#8221; option is set to true (re: <a href=\"https:\/\/issues.apache.org\/jira\/browse\/SPARK-18352\">SPARK-18352<\/a>), JSON is NOT splittable.<\/p>\n<p>CSV should generally be the fastest to <em>write<\/em>, JSON the easiest for a human to <em>understand<\/em> and Parquet the fastest to <em>read<\/em>.<\/p>\n<p>CSV is the defacto standard of a lot of data and for fair reasons; it&#8217;s [relatively] easy to comprehend for both users and computers and made more accessible via Microsoft Excel.<\/p>\n<p>JSON is the standard for communicating on the web. APIs and websites are constantly communicating using JSON because of its usability properties such as well-defined schemas.<\/p>\n<p>Parquet is optimized for the <em>Write Once Read Many (WORM)<\/em> paradigm. It&#8217;s slow to write, but incredibly fast to read, especially when you&#8217;re only accessing a subset of the total columns. 
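That column-subset benefit can be seen in how little you have to ask Spark for. A minimal sketch, assuming a running Spark session (`spark`) and a hypothetical `data.parquet` file like the one shown later in this post:

```scala
// Because Parquet stores data column-by-column, selecting one column
// means Spark only reads that column's bytes from disk (column pruning).
val df = spark.read.parquet("data.parquet")
df.select("name").distinct.show()

// With CSV or JSON, every byte of every row would have to be read and
// parsed even though only one column is needed.
```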
For use cases that operate on entire rows of data, a format like CSV, JSON, or even AVRO should be used.

## Code examples and explanations

### CSV

#### Generic column names | all string types | lazily evaluated

```
scala> val df = spark.read.option("sep", "\t").csv("data.csv")
scala> df.printSchema
root
 |-- _c0: string (nullable = true)
 |-- _c1: string (nullable = true)
 |-- _c2: string (nullable = true)
 |-- _c3: string (nullable = true)
```

#### Header-defined column names | all string types | lazily evaluated

```
scala> val df = spark.read.option("sep", "\t").option("header","true").csv("data.csv")
scala> df.printSchema
root
 |-- guid: string (nullable = true)
 |-- date: string (nullable = true)
 |-- alphanum: string (nullable = true)
 |-- name: string (nullable = true)
```

#### Header-defined column names | inferred types | EAGERLY evaluated (!!!)

```
scala> val df = spark.read.option("sep", "\t").option("header","true").option("inferSchema","true").csv("data.csv")
scala> df.printSchema
root
 |-- guid: string (nullable = true)
 |-- date: timestamp (nullable = true)
 |-- alphanum: string (nullable = true)
 |-- name: string (nullable = true)
```

The eager evaluation of this version is critical to understand. To determine with certainty the proper data type to assign to each column, Spark has to *READ AND PARSE THE ENTIRE DATASET*. This can be very expensive, especially when the number of files/rows/columns is large. It also does *no* useful processing while it's inferring the schema, so if you thought it would be running your actual transformation code during the inference pass, sorry, it won't.
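The usual way to sidestep that inference pass is to declare the schema up front, which keeps the read lazy. A sketch, assuming the same hypothetical four-column `data.csv`:

```scala
import org.apache.spark.sql.types._

// With an explicit schema, Spark never has to scan the data to
// guess types, so the read stays lazy and the file is only read
// when an action actually runs.
val schema = StructType(Seq(
  StructField("guid", StringType),
  StructField("date", TimestampType),
  StructField("alphanum", StringType),
  StructField("name", StringType)
))

val df = spark.read
  .option("sep", "\t")
  .option("header", "true")
  .schema(schema)
  .csv("data.csv")
```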
Without an explicit schema, Spark therefore has to read your file(s) TWICE instead of ONCE.

### JSON

#### Named columns | inferred types | EAGERLY evaluated

```
scala> val df = spark.read.json("data.json")
scala> df.printSchema
root
 |-- alphanum: string (nullable = true)
 |-- epoch_date: long (nullable = true)
 |-- guid: string (nullable = true)
 |-- name: string (nullable = true)
```

Like the schema-inferring CSV read above, JSON files are eagerly evaluated: Spark must parse the data to discover both the column names and their types.

### Parquet

#### Named columns | defined types | lazily evaluated

```
scala> val df = spark.read.parquet("data.parquet")
scala> df.printSchema
root
 |-- alphanum: string (nullable = true)
 |-- date: long (nullable = true)
 |-- guid: string (nullable = true)
 |-- name: string (nullable = true)
```

Unlike CSV and JSON, Parquet files are *binary* files that carry metadata about their contents (stored in the Parquet file footer). Without reading or parsing the data itself, Spark can rely on that metadata to determine column names and data types.

TL;DR: Use Apache Parquet instead of CSV or JSON whenever possible; it is faster to read, carries its own schema, and never needs an inference pass.
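A common pattern that follows from this advice is to pay the CSV parsing cost exactly once and let every downstream job read Parquet. A sketch, assuming the same hypothetical `data.csv`:

```scala
// One-time conversion: parse the CSV once (eagerly, due to inferSchema)...
val csvDf = spark.read
  .option("sep", "\t")
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("data.csv")

// ...and write a Parquet copy for all subsequent reads.
csvDf.write.mode("overwrite").parquet("data.parquet")

// Downstream jobs now get lazy, typed, column-prunable reads.
val fast = spark.read.parquet("data.parquet")
```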