<h1>Connecting Apache Spark to External Data Sources (e.g. Redshift, S3, MySQL)</h1>
<p><em>Published 2017-04-09, last updated 2018-03-02.</em></p>
<p><strong>Prerequisites</strong></p>
<p><em>AWS S3</em></p>
<p><a href="https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-aws">Hadoop AWS jar</a></p>
<p><a href="https://aws.amazon.com/sdk-for-java/">AWS Java SDK jar</a></p>
<p>* Note: these AWS jars should not be necessary if you&#8217;re using Amazon EMR.</p>
<p><em>Amazon Redshift</em></p>
<p><a href="http://docs.aws.amazon.com/redshift/latest/mgmt/configure-jdbc-connection.html#download-jdbc-driver">JDBC driver</a></p>
<p><a href="https://spark-packages.org/package/databricks/spark-redshift">spark-redshift package</a> *</p>
<p>* The spark-redshift package from Databricks is critical if you want to WRITE to Redshift, because it performs bulk file operations instead of individual INSERT statements.
If you only need to READ from Redshift, the package is not as essential.</p>
<p><em>MySQL</em></p>
<p><a href="https://dev.mysql.com/downloads/connector/j/">MySQL JDBC Connector jar</a></p>
<p><strong>Setting your password [relatively securely]</strong></p>
<p>This is not highly secure, but it is much better than putting your password directly into code.</p>
<p><em>Use a properties file:</em></p>
<pre><code>echo "spark.jdbc.password=test_pass_prop" &gt; secret_credentials.properties
spark-submit --properties-file secret_credentials.properties</code></pre>
<p><strong>Examples (in Scala unless otherwise noted)</strong></p>
<p><em>S3</em> (using S3A)</p>
<pre><code>spark-shell --jars hadoop-aws-2.7.3.jar,aws-java-sdk-1.7.4.jar

spark.conf.set("fs.s3a.access.key", "&lt;ACCESS_KEY&gt;")
spark.conf.set("fs.s3a.secret.key", "&lt;SECRET_KEY&gt;")
val d = spark.read.parquet("s3a://parquet-lab/files")
d.select("device_id").distinct().count() // =&gt; 1337</code></pre>
<p>* On Amazon EMR, you may be able to skip the jars and the key settings.<br />
** You may also want to try the "s3" or "s3n" protocols if s3a doesn&#8217;t work.</p>
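The properties-file approach above works because spark-submit merges every key in the file into the Spark conf, where `spark.conf.get` can read it. The file-parsing half of that can be sketched in plain Scala with `java.util.Properties`, no Spark required (the file contents are inlined here for illustration):

```scala
import java.io.StringReader
import java.util.Properties

// The same line we echoed into secret_credentials.properties above.
val fileContents = "spark.jdbc.password=test_pass_prop"

val props = new Properties()
props.load(new StringReader(fileContents))

// Spark surfaces this value as spark.conf.get("spark.jdbc.password").
val password = props.getProperty("spark.jdbc.password")
println(password) // prints "test_pass_prop"
```

Anything you pass this way still sits in plaintext on disk, hence "relatively" securely.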
<p><em>MySQL</em></p>
<pre><code>spark-shell --jars mysql-connector-java-5.1.40-bin.jar

val properties = new java.util.Properties()
properties.put("driver", "com.mysql.jdbc.Driver")
properties.put("url", "jdbc:mysql://mysql-host:3306")
properties.put("user", "&lt;username&gt;")
properties.put("password", spark.conf.get("spark.jdbc.password", "&lt;default_pass&gt;"))
// This forms a SQL query like:
//   SELECT model_id, prediction, actual_value FROM ml_models
//   WHERE utc_start_time BETWEEN '2017-03-31' AND '2017-04-02'
// Note: .limit(INT) will NOT work as you might expect - it retrieves all of the
// data first, and only limits what is shown to you afterwards.
val models = spark.read.jdbc(properties.get("url").toString, "ml_models",
  Array("utc_start_time BETWEEN '2017-03-31' AND '2017-04-02'"), properties).
  select("model_id", "prediction", "actual_value")</code></pre>
<p><em>Redshift</em></p>
<p>Recommended approach, using Databricks&#8217; spark-redshift:</p>
<pre>spark-shell --packages com.databricks:spark-redshift_2.11:3.0.0-preview1 --jars RedshiftJDBC42-1.2.1.1001.jar</pre>
<p>Basic JDBC connection only:</p>
<pre>spark-shell --jars RedshiftJDBC42-1.2.1.1001.jar</pre>
<pre><code>val properties = new java.util.Properties()
properties.put("driver", "com.amazon.redshift.jdbc42.Driver")
properties.put("url", "jdbc:redshift://redshift-host:5439/")
properties.put("user", "&lt;username&gt;")
properties.put("password", spark.conf.get("spark.jdbc.password", "&lt;default_pass&gt;"))
val d_rs = spark.read.jdbc(properties.get("url").toString, "data_table", properties)</code></pre>
<p>Using the Databricks Redshift data source package (use this for bulk data WRITING to Redshift):</p>
<p>Reading from and writing to Redshift stages data in S3 [and doesn&#8217;t clean up after itself], so use <a href="https://docs.aws.amazon.com/AmazonS3/latest/dev/object-lifecycle-mgmt.html">object lifecycle management</a>!</p>
<pre>val devices = spark.read.format("com.databricks.spark.redshift").
  option("forward_spark_s3_credentials", "true").
  option("url", "jdbc:redshift://redshift-host:5439/?user=&lt;user&gt;&amp;password=&lt;password&gt;").
  option("query", "SELECT * FROM devices").
  option("tempdir", "s3://temporary-holding-bucket/").load()</pre>
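A plain JDBC read like `d_rs` above runs as a single query on a single executor. For large tables, `spark.read.jdbc` has an overload that splits the read into parallel range queries over a numeric column. This is a sketch against a Spark session; the partition column and bounds are hypothetical, not from the original post:

```scala
// Reuses `properties` and the Redshift URL from the JDBC example above.
// Spark issues 8 range queries over device_id (assumed numeric),
// one per partition, between the given bounds.
val d_parallel = spark.read.jdbc(
  properties.get("url").toString,
  "data_table",
  "device_id", // numeric partition column (hypothetical)
  0L,          // lower bound of device_id
  1000000L,    // upper bound of device_id
  8,           // number of partitions = number of parallel queries
  properties)
```

Rows outside the bounds are still read; the bounds only control how the ranges are carved up.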
<p>Writing the dataframe to Redshift, into the "public.temporary_devices" table:</p>
<pre>devices_transformed.coalesce(64).write.
  format("com.databricks.spark.redshift").
  option("forward_spark_s3_credentials", "true").
  option("url", "jdbc:redshift://redshift-host:5439/?user=&lt;user&gt;&amp;password=&lt;password&gt;").
  option("dbtable", "public.temporary_devices").
  option("tempdir", "s3a://temporary-holding-bucket/").
  option("tempformat", "CSV GZIP"). // EXPERIMENTAL, but CSV is higher performance than AVRO for loading into Redshift
  mode("error").
  save()</pre>
<p>* Note: coalesce(64) is called to reduce the number of files written to the S3 staging directory, because renaming files out of their temporary S3 location can be slow. <a href="https://github.com/rdblue/s3committer">This S3 committer</a> should help alleviate that issue.</p>
<p><strong>Resources</strong></p>
<p><a href="http://deploymentzone.com/2015/12/20/s3a-on-spark-on-aws-ec2/">http://deploymentzone.com/2015/12/20/s3a-on-spark-on-aws-ec2/</a></p>
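Since spark-redshift leaves its staged files behind in the tempdir, the object-lifecycle-management advice above can be automated. Here is one hedged sketch using the AWS CLI; the bucket name matches the examples, but the rule ID, prefix, and 7-day expiry are assumptions to adjust for your setup:

```shell
# Hypothetical lifecycle rule: expire everything in the staging bucket
# after 7 days, so abandoned spark-redshift temp files get cleaned up.
cat > lifecycle.json <<'EOF'
{
  "Rules": [
    {
      "ID": "expire-spark-redshift-tempdir",
      "Filter": { "Prefix": "" },
      "Status": "Enabled",
      "Expiration": { "Days": 7 }
    }
  ]
}
EOF
aws s3api put-bucket-lifecycle-configuration \
  --bucket temporary-holding-bucket \
  --lifecycle-configuration file://lifecycle.json
```

If your tempdir is a prefix within a shared bucket, set that prefix in the rule's Filter so unrelated objects are untouched.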