{"id":169,"date":"2017-10-09T06:40:19","date_gmt":"2017-10-09T14:40:19","guid":{"rendered":"http:\/\/garrens.com\/blog\/?p=169"},"modified":"2018-03-02T20:49:07","modified_gmt":"2018-03-03T04:49:07","slug":"spark-file-format-showdown-csv-vs-json-vs-parquet","status":"publish","type":"post","link":"https:\/\/garrens.com\/blog\/2017\/10\/09\/spark-file-format-showdown-csv-vs-json-vs-parquet\/","title":{"rendered":"Spark File Format Showdown – CSV vs JSON vs Parquet"},"content":{"rendered":"

Apache Spark supports many different data sources, such as the ubiquitous Comma Separated Value (CSV) format and the web-API-friendly JavaScript Object Notation (JSON) format. A common format used primarily for big data analytics is Apache Parquet. Parquet is a fast columnar data format that you can read more about in two of my other posts: Real Time Big Data analytics: Parquet (and Spark) + bonus and Tips for using Apache Parquet with Spark 2.x.

In this post we’re going to compare how these 3 formats (CSV, JSON and Parquet) behave when used with Apache Spark.


Splittable (definition): Spark likes to split a single input file into multiple chunks (partitions, to be precise) so that it can work on many partitions at the same time (i.e. concurrently).

* CSV is splittable when it is a raw, uncompressed file or when it uses a splittable compression format such as BZIP2 or LZO (note: LZO needs to be indexed to be splittable!)

** JSON has the same splittability conditions as CSV when compressed, with one extra difference: when the “wholeFile” option is set to true (see SPARK-18352; the option was released under the name “multiLine” in Spark 2.2), JSON is NOT splittable.
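
As a rough illustration of splitting, the sketch below writes a small uncompressed, tab-separated file and asks Spark how many partitions it planned for it. The file contents, the temp path and the lowered spark.sql.files.maxPartitionBytes value are assumptions made only so that a tiny file gets split; with the default of 128 MB, a file this small would normally land in a single partition.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

// Reuses the existing session in spark-shell, or starts a local one otherwise.
val spark = SparkSession.builder().appName("split-demo").master("local[*]").getOrCreate()

// Assumption for the demo: shrink the per-partition byte target so a small file is split.
spark.conf.set("spark.sql.files.maxPartitionBytes", "4096")

// Hypothetical sample data: 20,000 short rows of uncompressed, tab-separated text.
val csvPath = Files.createTempFile("data", ".csv")
val rows = (1 to 20000).map(i => s"guid-$i\tname-$i").mkString("\n")
Files.write(csvPath, rows.getBytes("UTF-8"))

val df = spark.read.option("sep", "\t").csv(csvPath.toString)

// An uncompressed CSV is splittable, so this should report more than one partition.
println(df.rdd.getNumPartitions)
```

Writing the same rows to a gzip-compressed file should bring the count back down to one partition per file, because gzip is not a splittable codec.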

CSV should generally be the fastest to write, JSON the easiest for a human to understand and Parquet the fastest to read.

CSV is the de facto standard for a lot of data, and for fair reasons: it’s [relatively] easy to comprehend for both users and computers, and Microsoft Excel makes it even more accessible.

JSON is the standard for communicating on the web. APIs and websites constantly exchange JSON because it is easy to work with: every record is self-describing, carrying its field names alongside its values.

Parquet is optimized for the Write Once Read Many (WORM) paradigm. It’s slow to write, but incredibly fast to read, especially when you’re only accessing a subset of the total columns. For use cases that operate on entire rows of data, a format like CSV, JSON or even Avro should be used.
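
To make the column-subset point concrete, here is a minimal sketch that writes a tiny Parquet dataset and then selects a single column from it. The sample rows and the temp directory are hypothetical; the column names are borrowed from the examples in this post.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

// Reuses the existing session in spark-shell, or starts a local one otherwise.
val spark = SparkSession.builder().appName("parquet-columns").master("local[*]").getOrCreate()
import spark.implicits._

// Hypothetical sample rows written out once as Parquet.
val parquetDir = Files.createTempDirectory("parquet-demo").resolve("data.parquet").toString
Seq(("g1", 1507560019L, "x9", "Ada"), ("g2", 1507560020L, "y8", "Bob"))
  .toDF("guid", "date", "alphanum", "name")
  .write.parquet(parquetDir)

// Select one of the four columns; a columnar reader only needs that column's data.
val names = spark.read.parquet(parquetDir).select("name")

// The physical plan's ReadSchema entry should list only the name column.
names.explain()
```

With CSV or JSON the same select still has to read and parse every full row, which is where Parquet’s read-side advantage comes from.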

Code examples and explanations

CSV

Generic column names | all string types | lazily evaluated
scala> val df = spark.read.option("sep", "\t").csv("data.csv")
scala> df.printSchema
root
 |-- _c0: string (nullable = true)
 |-- _c1: string (nullable = true)
 |-- _c2: string (nullable = true)
 |-- _c3: string (nullable = true)

Header-defined column names | all string types | lazily evaluated
scala> val df = spark.read.option("sep", "\t").option("header","true").csv("data.csv")
scala> df.printSchema
root
 |-- guid: string (nullable = true)
 |-- date: string (nullable = true)
 |-- alphanum: string (nullable = true)
 |-- name: string (nullable = true)

Header-defined column names | inferred types | EAGERLY evaluated (!!!)
scala> val df = spark.read.option("sep", "\t").option("header","true").option("inferSchema","true").csv("data.csv")
scala> df.printSchema
root
 |-- guid: string (nullable = true)
 |-- date: timestamp (nullable = true)
 |-- alphanum: string (nullable = true)
 |-- name: string (nullable = true)

The eager evaluation of this version is critical to understand. To determine with certainty the proper data type for each column, Spark has to READ AND PARSE THE ENTIRE DATASET. This can be very costly, especially when the number of files/rows/columns is large. Spark also does no processing while it’s inferring the schema, so if you thought it would be running your actual transformation code during inference, sorry, it won’t. Spark therefore has to read your file(s) TWICE instead of ONCE.
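
One way to sidestep the extra pass is to hand Spark the schema yourself instead of asking it to infer one. The sketch below assumes the four columns and types shown in the inferred schema above; the sample row and temp file are hypothetical stand-ins for data.csv.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{StringType, StructField, StructType, TimestampType}

// Reuses the existing session in spark-shell, or starts a local one otherwise.
val spark = SparkSession.builder().appName("csv-explicit-schema").master("local[*]").getOrCreate()

// Hypothetical tab-separated sample standing in for data.csv.
val csvPath = Files.createTempFile("data", ".csv")
Files.write(csvPath, "guid\tdate\talphanum\tname\ng1\t2017-10-09T06:40:19\tx9\tAda\n".getBytes("UTF-8"))

// Column names and types declared up front, so no inference pass is needed.
val schema = StructType(Seq(
  StructField("guid", StringType),
  StructField("date", TimestampType),
  StructField("alphanum", StringType),
  StructField("name", StringType)
))

val df = spark.read
  .option("sep", "\t")
  .option("header", "true")
  .schema(schema)
  .csv(csvPath.toString)

df.printSchema()
```

The trade-off is that you now own the schema: if the files change shape, Spark will not notice on your behalf.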

JSON

Named columns | inferred types | EAGERLY evaluated
scala> val df = spark.read.json("data.json")
scala> df.printSchema
root
 |-- alphanum: string (nullable = true)
 |-- epoch_date: long (nullable = true)
 |-- guid: string (nullable = true)
 |-- name: string (nullable = true)

Like the CSV example with schema inference above, JSON files are eagerly evaluated: Spark reads them up front to infer the schema.
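
The same workaround should apply to JSON: supplying a schema means Spark has nothing to infer. This is a minimal sketch using the field names from the schema printed above; the single sample record and the temp file are hypothetical.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{LongType, StringType, StructField, StructType}

// Reuses the existing session in spark-shell, or starts a local one otherwise.
val spark = SparkSession.builder().appName("json-explicit-schema").master("local[*]").getOrCreate()

// Hypothetical one-record-per-line sample standing in for data.json.
val jsonPath = Files.createTempFile("data", ".json")
val record = """{"alphanum":"x9","epoch_date":1507560019,"guid":"g1","name":"Ada"}"""
Files.write(jsonPath, (record + "\n").getBytes("UTF-8"))

// Field names and types declared up front instead of inferred from the data.
val schema = StructType(Seq(
  StructField("alphanum", StringType),
  StructField("epoch_date", LongType),
  StructField("guid", StringType),
  StructField("name", StringType)
))

val df = spark.read.schema(schema).json(jsonPath.toString)
df.printSchema()
```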

Parquet

Named Columns | Defined types | lazily evaluated
scala> val df = spark.read.parquet("data.parquet")
scala> df.printSchema
root
 |-- alphanum: string (nullable = true)
 |-- date: long (nullable = true)
 |-- guid: string (nullable = true)
 |-- name: string (nullable = true)

Unlike CSV and JSON, Parquet files are binary files that contain metadata about their contents. Without needing to read/parse the content of the file(s), Spark can rely on the metadata stored in each Parquet file’s footer to determine column names and data types.

TL;DR Use Apache Parquet instead of CSV or JSON whenever possible, because it’s faster to read and carries its own schema.","protected":false},"excerpt":{"rendered":"

Apache Spark supports many different data sources, such as the ubiquitous Comma Separated Value (CSV) format and web API friendly JavaScript Object Notation (JSON) format. A common format used primarily for big data analytical purposes is Apache Parquet. Parquet is a fast columnar data format that you can read more about in two of my… Continue reading→<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":[],"categories":[22],"tags":[17,20,21,19,5,2],"_links":{"self":[{"href":"https:\/\/garrens.com\/blog\/wp-json\/wp\/v2\/posts\/169"}],"collection":[{"href":"https:\/\/garrens.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/garrens.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/garrens.com\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/garrens.com\/blog\/wp-json\/wp\/v2\/comments?post=169"}],"version-history":[{"count":2,"href":"https:\/\/garrens.com\/blog\/wp-json\/wp\/v2\/posts\/169\/revisions"}],"predecessor-version":[{"id":175,"href":"https:\/\/garrens.com\/blog\/wp-json\/wp\/v2\/posts\/169\/revisions\/175"}],"wp:attachment":[{"href":"https:\/\/garrens.com\/blog\/wp-json\/wp\/v2\/media?parent=169"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/garrens.com\/blog\/wp-json\/wp\/v2\/categories?post=169"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/garrens.com\/blog\/wp-json\/wp\/v2\/tags?post=169"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}