scala> val df = spark.read.option(\"sep\", \"\\t\").csv(\"data.csv\")\r\nscala> df.printSchema\r\nroot\r\n\u00a0|-- _c0: string (nullable = true)\r\n\u00a0|-- _c1: string (nullable = true)\r\n\u00a0|-- _c2: string (nullable = true)\r\n\u00a0|-- _c3: string (nullable = true)<\/pre>\n
Header-defined column names | all string types | lazily evaluated<\/h4>\n<\/blockquote>\n
scala> val df = spark.read.option(\"sep\", \"\\t\").option(\"header\",\"true\").csv(\"data.csv\")\r\nscala> df.printSchema\r\nroot\r\n\u00a0|-- guid: string (nullable = true)\r\n\u00a0|-- date: string (nullable = true)\r\n\u00a0|-- alphanum: string (nullable = true)\r\n\u00a0|-- name: string (nullable = true)<\/pre>\n\nHeader-defined column names | inferred types | EAGERLY evaluated (!!!)<\/h4>\n<\/blockquote>\nscala> val df = spark.read.option(\"sep\", \"\\t\").option(\"header\",\"true\").option(\"inferSchema\",\"true\").csv(\"data.csv\")\r\nscala> df.printSchema\r\nroot\r\n\u00a0|-- guid: string (nullable = true)\r\n\u00a0|-- date: timestamp (nullable = true)\r\n\u00a0|-- alphanum: string (nullable = true)\r\n\u00a0|-- name: string (nullable = true)<\/pre>\nThe eager evaluation of this version is critical to understand. To determine with certainty the proper data type for each column, Spark has to READ AND PARSE THE ENTIRE DATASET. <\/em>This can be very costly, especially when the number of files\/rows\/columns is large. It also does no processing<\/em> while it’s inferring the schema, so if you thought it would be running your actual transformation code during that pass, sorry, it won’t. Spark therefore has to read your file(s) TWICE instead of ONCE.<\/p>\n
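One way to sidestep the extra inference pass is to hand Spark the schema up front with `DataFrameReader.schema`. The following is a minimal sketch, not a definitive recipe: the sample row, the temp file, and the app name are made up for illustration, and the column types are assumed from the inferred schema shown above. With a user-supplied schema, Spark should not need to scan the data just to decide on types.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{StringType, StructField, StructType, TimestampType}

// In spark-shell, getOrCreate() simply returns the session that already exists.
val spark = SparkSession.builder().master("local[*]").appName("explicit-schema-sketch").getOrCreate()

// Hypothetical sample data: a tab-separated file with a header row.
val path = Files.createTempFile("data", ".csv")
Files.write(path, "guid\tdate\talphanum\tname\nabc-123\t2017-01-01 00:00:00\ta1b2\tAlice\n".getBytes("UTF-8"))

// Column names and types are assumptions that mirror the inferred schema above.
val schema = StructType(Seq(
  StructField("guid", StringType, nullable = true),
  StructField("date", TimestampType, nullable = true),
  StructField("alphanum", StringType, nullable = true),
  StructField("name", StringType, nullable = true)
))

// No inferSchema option: the types come from the schema we supplied.
val df = spark.read.schema(schema).option("sep", "\t").option("header", "true").csv(path.toString)
df.printSchema()
```

The trade-off is that you have to know (and maintain) the schema yourself, and values that do not match the declared type are handled according to the reader's parse mode rather than being inferred as strings.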
JSON<\/h2>\nNamed columns | inferred types | EAGERLY evaluated<\/h6>\nscala> val df = spark.read.json(\"data.json\")\r\nscala> df.printSchema\r\nroot\r\n\u00a0|-- alphanum: string (nullable = true)\r\n\u00a0|-- epoch_date: long (nullable = true)\r\n\u00a0|-- guid: string (nullable = true)\r\n\u00a0|-- name: string (nullable = true)<\/pre>\nLike the CSV read with schema inference above, JSON files are eagerly evaluated: Spark has to read the data to infer column names and types.<\/p>\n
Parquet<\/h2>\n
Named columns | defined types | lazily evaluated<\/h6>\nscala> val df = spark.read.parquet(\"data.parquet\")\r\nscala> df.printSchema\r\nroot\r\n\u00a0|-- alphanum: string (nullable = true)\r\n\u00a0|-- date: long (nullable = true)\r\n\u00a0|-- guid: string (nullable = true)\r\n\u00a0|-- name: string (nullable = true)<\/pre>\nUnlike CSV and JSON, Parquet files are binary<\/em> files that contain metadata about their contents, so without needing to read\/parse the content of the file(s), Spark can rely on the metadata stored in each Parquet file’s footer to determine column names and data types.<\/p>\n
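If the data only exists as CSV, one option is to pay the inference cost once, write the result out as Parquet, and read the Parquet copy from then on. The sketch below assumes a tiny made-up tab-separated file in a temp directory; the paths, sample row, and app name are illustrative only.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

// In spark-shell, getOrCreate() simply returns the session that already exists.
val spark = SparkSession.builder().master("local[*]").appName("csv-to-parquet-sketch").getOrCreate()

// Hypothetical sample data: a tab-separated file with a header row.
val dir = Files.createTempDirectory("parquet-demo")
val csvPath = dir.resolve("data.csv")
Files.write(csvPath, "guid\tdate\talphanum\tname\nabc-123\t2017-01-01 00:00:00\ta1b2\tAlice\n".getBytes("UTF-8"))

// One-time eager read: Spark scans the CSV to infer the column types.
val csvDf = spark.read.option("sep", "\t").option("header", "true").option("inferSchema", "true").csv(csvPath.toString)

// Persist the typed data as Parquet; the schema travels with the files.
val parquetPath = dir.resolve("data.parquet").toString
csvDf.write.mode("overwrite").parquet(parquetPath)

// Later reads get column names and types from Parquet metadata, with no inference pass.
val parquetDf = spark.read.parquet(parquetPath)
parquetDf.printSchema()
```

After the conversion, downstream jobs can point at the Parquet path and skip the CSV entirely.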
\u00a0TL;DR Use Apache Parquet instead of CSV or JSON whenever possible, because it’s faster: its embedded schema lets Spark determine column names and types without scanning the data.<\/div>\n
CSV<\/h3>\n

Parquet<\/h2>\n