Using Spark Efficiently | Understanding Spark Event 7/29/17

This page is dedicated to resources related to the 7/29/17 Understanding Spark event presentation in Bellevue, WA.


Great [FREE!] resources on all things Spark:

Databricks was founded by the original creators of Spark and is currently the largest contributor to Apache Spark. As such, they are a phenomenal resource for information and services relating to Spark.


Personally curated Examples:

Create mock typed object data
import org.apache.spark.sql.functions._

case class CountryGDP(countryCode : String, countryName : String, Year : String, gdp: Double, language : Option[String])
val objects = Seq[CountryGDP](
CountryGDP("USA", "'Murica", "2014", 17393103000000f, None),
CountryGDP("USA", "'Murica", "2015", 18036648000000f, None),
CountryGDP("USA", "'Murica", "2016", 18569100000000f, None),
CountryGDP("CHE", "Switzerland", "2014", 702705544908.583, None),
CountryGDP("CHE", "Switzerland", "2015", 670789928809.882, None),
CountryGDP("CHE", "Switzerland", "2016", 659827235193.83, None)

Strongly typed Datasets
val objectsDS = spark.createDataset(objects)

// typed objects are evaluated at compile time (great for development in IDEs!)
val countriesWithLanguages = => {
val lang = o.countryCode match {
case "USA" => Some("English")
case "CHE" => Some("Schweizerdeutsch")
case _ => Some("Simlish")
o.copy(language = lang)

Creating DataFrame and using UDF to transform
val rowsDF = spark.createDataFrame(objects)

def getLang(countryCode: String): Option[String] = {
countryCode match {
case "USA" => Some("English")
case "CHE" => Some("Schweizerdeutsch")
case _ => Some("Simlish")
val gl = sqlContext.udf.register("getLang", getLang _)

// String-based lookups are evaluated at Runtime
val rowsDFWithLanguage = rowsDF.withColumn("language", gl($"countryCode"))

Event link:

Video recording is here:

Learn more about parquet