This page collects resources related to the Understanding Spark presentation given on 7/29/17 in Bellevue, WA.
Great [FREE!] resources on all things Spark:
https://jaceklaskowski.gitbooks.io/mastering-apache-spark/
https://spark.apache.org/docs/latest/sql-programming-guide.html
Databricks was founded by the original creators of Spark and is currently the largest contributor to Apache Spark. As such, they are a phenomenal resource for information and services relating to Spark.
Datasets: https://databricks.com/blog/2016/01/04/introducing-apache-spark-datasets.html
Catalyst: https://www.slideshare.net/databricks/a-deep-dive-into-spark-sqls-catalyst-optimizer-with-yin-huai
https://de.slideshare.net/SparkSummit/deep-dive-into-catalyst-apache-spark-20s-optimizer-63071120
Tungsten: https://databricks.com/blog/2015/04/28/project-tungsten-bringing-spark-closer-to-bare-metal.html
Matrix: https://databricks.com/blog/2016/07/14/a-tale-of-three-apache-spark-apis-rdds-dataframes-and-datasets.html
Personally curated Examples:
Create mock typed object data
import org.apache.spark.sql.functions._
import spark.implicits._ // already in scope in spark-shell; needed elsewhere for createDataset and $"..."
case class CountryGDP(countryCode: String, countryName: String, Year: String, gdp: Double, language: Option[String])
val objects = Seq[CountryGDP](
  CountryGDP("USA", "'Murica", "2014", 17393103000000.0, None),
  CountryGDP("USA", "'Murica", "2015", 18036648000000.0, None),
  CountryGDP("USA", "'Murica", "2016", 18569100000000.0, None),
  CountryGDP("CHE", "Switzerland", "2014", 702705544908.583, None),
  CountryGDP("CHE", "Switzerland", "2015", 670789928809.882, None),
  CountryGDP("CHE", "Switzerland", "2016", 659827235193.83, None)
)
Strongly typed Datasets
val objectsDS = spark.createDataset(objects)
// Field access on typed objects is checked at compile time (great for development in IDEs!)
val countriesWithLanguages = objectsDS.map(o => {
  val lang = o.countryCode match {
    case "USA" => Some("English")
    case "CHE" => Some("Schweizerdeutsch")
    case _ => Some("Simlish")
  }
  o.copy(language = lang)
})
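The compile-time point in the comment above can be shown without Spark. The following is a minimal plain-Scala analogy, not Spark code: `CountryRow`, `typedRow`, and `untypedRow` are hypothetical names introduced only for illustration. A case class field is checked by the compiler, while a string-keyed lookup (similar in spirit to naming a DataFrame column by string) is only checked when it runs.

```scala
// Plain-Scala analogy; all names here are hypothetical and not part of the original examples.
object TypedVsUntyped {
  final case class CountryRow(countryCode: String, gdp: Double)

  val typedRow: CountryRow = CountryRow("CHE", 659827235193.83)

  // Checked by the compiler: writing typedRow.countryCod would fail to compile.
  val code: String = typedRow.countryCode

  // A string-keyed row: keys are ordinary strings the compiler knows nothing about.
  val untypedRow: Map[String, Any] =
    Map("countryCode" -> "CHE", "gdp" -> 659827235193.83)

  // The misspelled key compiles fine; the mistake only shows up when the lookup runs.
  val misspelled: Option[Any] = untypedRow.get("countryCod")
}
```

Here the typo in the key goes unnoticed until the lookup returns `None`, whereas the same typo on the case class would be rejected before the program ever runs.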
Creating DataFrame and using UDF to transform
val rowsDF = spark.createDataFrame(objects)
def getLang(countryCode: String): Option[String] = {
  countryCode match {
    case "USA" => Some("English")
    case "CHE" => Some("Schweizerdeutsch")
    case _ => Some("Simlish")
  }
}
val gl = spark.udf.register("getLang", getLang _)
// Column names given as strings are resolved at runtime
val rowsDFWithLanguage = rowsDF.withColumn("language", gl($"countryCode"))
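As a possible alternative to the UDF above (a sketch, not part of the original presentation material), the same mapping can be written with Spark's built-in `when`/`otherwise` column functions. A UDF is opaque to the Catalyst optimizer, while built-in expressions are visible to it, so this form is often preferred when the logic is simple. The local `SparkSession`, the app name, and the small `codesDF` frame are assumptions made so the snippet is self-contained; in spark-shell, `spark` already exists.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, when}

// Hypothetical local session for this sketch; skip this line in spark-shell.
val spark = SparkSession.builder().master("local[1]").appName("lang-sketch").getOrCreate()
import spark.implicits._

// Small stand-in frame with only the column the mapping needs.
val codesDF = Seq("USA", "CHE", "FRA").toDF("countryCode")

// Same mapping as getLang, expressed with built-in column expressions instead of a UDF.
val withLanguage = codesDF.withColumn(
  "language",
  when(col("countryCode") === "USA", "English")
    .when(col("countryCode") === "CHE", "Schweizerdeutsch")
    .otherwise("Simlish")
)
```

Calling `withLanguage.show()` should list each country code next to its mapped language, with any unmatched code falling through to the `otherwise` branch.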
Event link: https://www.eventbrite.com/e/understanding-spark-tickets-35440866586#
Video recording is here: https://livestream.com/metis/events/7597562
Learn more about parquet
Thanks for sharing this information. Pretty detailed info and a great presentation.