Using Spark Efficiently | Understanding Spark Event 7/29/17

This page is dedicated to resources related to the 7/29/17 Understanding Spark event presentation in Bellevue, WA.

Slides

Great [FREE!] resources on all things Spark:
https://jaceklaskowski.gitbooks.io/mastering-apache-spark/
https://spark.apache.org/docs/latest/sql-programming-guide.html

Databricks was founded by the original creators of Spark and is currently the largest contributor to Apache Spark. As such, they are a phenomenal resource for information and services relating to Spark.

Datasets: https://databricks.com/blog/2016/01/04/introducing-apache-spark-datasets.html
Catalyst: https://www.slideshare.net/databricks/a-deep-dive-into-spark-sqls-catalyst-optimizer-with-yin-huai
https://de.slideshare.net/SparkSummit/deep-dive-into-catalyst-apache-spark-20s-optimizer-63071120
Tungsten: https://databricks.com/blog/2015/04/28/project-tungsten-bringing-spark-closer-to-bare-metal.html
Matrix: https://databricks.com/blog/2016/07/14/a-tale-of-three-apache-spark-apis-rdds-dataframes-and-datasets.html

Personally curated Examples:

Create mock typed object data
import org.apache.spark.sql.functions._

case class CountryGDP(countryCode : String, countryName : String, Year : String, gdp: Double, language : Option[String])
val objects = Seq[CountryGDP](
CountryGDP("USA", "'Murica", "2014", 17393103000000f, None),
CountryGDP("USA", "'Murica", "2015", 18036648000000f, None),
CountryGDP("USA", "'Murica", "2016", 18569100000000f, None),
CountryGDP("CHE", "Switzerland", "2014", 702705544908.583, None),
CountryGDP("CHE", "Switzerland", "2015", 670789928809.882, None),
CountryGDP("CHE", "Switzerland", "2016", 659827235193.83, None)
)

Strongly typed Datasets
val objectsDS = spark.createDataset(objects)

// typed objects are evaluated at compile time (great for development in IDEs!)
val countriesWithLanguages = objectsDS.map(o => {
val lang = o.countryCode match {
case "USA" => Some("English")
case "CHE" => Some("Schweizerdeutsch")
case _ => Some("Simlish")
}
o.copy(language = lang)
})

Creating DataFrame and using UDF to transform
val rowsDF = spark.createDataFrame(objects)

def getLang(countryCode: String): Option[String] = {
countryCode match {
case "USA" => Some("English")
case "CHE" => Some("Schweizerdeutsch")
case _ => Some("Simlish")
}
}
val gl = sqlContext.udf.register("getLang", getLang _)

// String-based lookups are evaluated at Runtime
val rowsDFWithLanguage = rowsDF.withColumn("language", gl($"countryCode"))

Event link: https://www.eventbrite.com/e/understanding-spark-tickets-35440866586#

Video recording is here: https://livestream.com/metis/events/7597562

Learn more about parquet