<h1>Connecting Apache Spark to External Data Sources (e.g. Redshift, S3, MySQL)</h1>
<p><em>Published 2017-04-09, last updated 2018-03-02.</em></p>
<p><strong>Prerequisites</strong></p>
<p><em>AWS S3</em></p>
<p><a href="https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-aws">Hadoop AWS jar</a></p>
<p><a href="https://aws.amazon.com/sdk-for-java/">AWS Java SDK jar</a></p>
<p>* Note: these AWS jars should not be necessary if you&#8217;re using Amazon EMR.</p>
<p><em>Amazon Redshift</em></p>
<p><a href="http://docs.aws.amazon.com/redshift/latest/mgmt/configure-jdbc-connection.html#download-jdbc-driver">JDBC driver</a></p>
<p><a href="https://spark-packages.org/package/databricks/spark-redshift">spark-redshift package</a> *</p>
<p>* The spark-redshift package from Databricks is critical if you want to WRITE to Redshift, because it performs bulk file operations instead of individual INSERT statements.
If you only need to READ from Redshift, the package is not as essential.</p>
<p><em>MySQL</em></p>
<p><a href="https://dev.mysql.com/downloads/connector/j/">MySQL JDBC Connector jar</a></p>
<p><strong>Setting your password [relatively securely]</strong></p>
<p>This is not highly secure, but it is much better than putting your password directly into code.</p>
<p><em>Use a properties file:</em></p>
<pre><code>echo "spark.jdbc.password=test_pass_prop" &gt; secret_credentials.properties
spark-submit --properties-file secret_credentials.properties</code></pre>
<p><strong>Examples (in Scala unless otherwise noted)</strong></p>
<p><em>S3</em> (using S3A)</p>
<pre><code>spark-shell --jars hadoop-aws-2.7.3.jar,aws-java-sdk-1.7.4.jar

spark.conf.set("fs.s3a.access.key", "&lt;ACCESS_KEY&gt;")
spark.conf.set("fs.s3a.secret.key", "&lt;SECRET_KEY&gt;")
val d = spark.read.parquet("s3a://parquet-lab/files")
d.select("device_id").distinct().count() // =&gt; 1337</code></pre>
<p>* On Amazon EMR, you may be able to skip the jars and the key settings.<br />
** You may also want to try the "s3" or "s3n" protocols if s3a doesn&#8217;t work.</p>
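The properties-file approach above works because spark-submit merges every key in the file into the Spark conf, where `spark.conf.get` can read it. The file-parsing half of that can be sketched in plain Scala with `java.util.Properties`, no Spark required (the file contents are inlined here for illustration):

```scala
import java.io.StringReader
import java.util.Properties

// The same line we echoed into secret_credentials.properties above.
val fileContents = "spark.jdbc.password=test_pass_prop"

val props = new Properties()
props.load(new StringReader(fileContents))

// Spark surfaces this value as spark.conf.get("spark.jdbc.password").
val password = props.getProperty("spark.jdbc.password")
println(password) // prints "test_pass_prop"
```

Anything you pass this way still sits in plaintext on disk, hence "relatively" securely.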
<p><em>MySQL</em></p>
<pre><code>spark-shell --jars mysql-connector-java-5.1.40-bin.jar

val properties = new java.util.Properties()
properties.put("driver", "com.mysql.jdbc.Driver")
properties.put("url", "jdbc:mysql://mysql-host:3306")
properties.put("user", "&lt;username&gt;")
properties.put("password", spark.conf.get("spark.jdbc.password", "&lt;default_pass&gt;"))
// This forms a SQL query like:
//   SELECT model_id, prediction, actual_value FROM ml_models
//   WHERE utc_start_time BETWEEN '2017-03-31' AND '2017-04-02'
// Note: .limit(INT) will NOT work as you might expect - it retrieves all of the
// data first, and only limits what is shown to you afterwards.
val models = spark.read.jdbc(properties.get("url").toString, "ml_models",
  Array("utc_start_time BETWEEN '2017-03-31' AND '2017-04-02'"), properties).
  select("model_id", "prediction", "actual_value")</code></pre>
<p><em>Redshift</em></p>
<p>Recommended approach, using Databricks&#8217; spark-redshift:</p>
<pre>spark-shell --packages com.databricks:spark-redshift_2.11:3.0.0-preview1 --jars RedshiftJDBC42-1.2.1.1001.jar</pre>
<p>Basic JDBC connection only:</p>
<pre>spark-shell --jars RedshiftJDBC42-1.2.1.1001.jar</pre>
<pre><code>val properties = new java.util.Properties()
properties.put("driver", "com.amazon.redshift.jdbc42.Driver")
properties.put("url", "jdbc:redshift://redshift-host:5439/")
properties.put("user", "&lt;username&gt;")
properties.put("password", spark.conf.get("spark.jdbc.password", "&lt;default_pass&gt;"))
val d_rs = spark.read.jdbc(properties.get("url").toString, "data_table", properties)</code></pre>
<p>Using the Databricks Redshift data source package (use this for bulk data WRITING to Redshift):</p>
<p>Reading from and writing to Redshift stages data in S3 [and doesn&#8217;t clean up after itself], so use <a href="https://docs.aws.amazon.com/AmazonS3/latest/dev/object-lifecycle-mgmt.html">object lifecycle management</a>!</p>
<pre>val devices = spark.read.format("com.databricks.spark.redshift").
  option("forward_spark_s3_credentials", "true").
  option("url", "jdbc:redshift://redshift-host:5439/?user=&lt;user&gt;&amp;password=&lt;password&gt;").
  option("query", "SELECT * FROM devices").
  option("tempdir", "s3://temporary-holding-bucket/").load()</pre>
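A plain JDBC read like `d_rs` above runs as a single query on a single executor. For large tables, `spark.read.jdbc` has an overload that splits the read into parallel range queries over a numeric column. This is a sketch against a Spark session; the partition column and bounds are hypothetical, not from the original post:

```scala
// Reuses `properties` and the Redshift URL from the JDBC example above.
// Spark issues 8 range queries over device_id (assumed numeric),
// one per partition, between the given bounds.
val d_parallel = spark.read.jdbc(
  properties.get("url").toString,
  "data_table",
  "device_id", // numeric partition column (hypothetical)
  0L,          // lower bound of device_id
  1000000L,    // upper bound of device_id
  8,           // number of partitions = number of parallel queries
  properties)
```

Rows outside the bounds are still read; the bounds only control how the ranges are carved up.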
<p>Writing the dataframe to Redshift, into the "public.temporary_devices" table:</p>
<pre>devices_transformed.coalesce(64).write.
  format("com.databricks.spark.redshift").
  option("forward_spark_s3_credentials", "true").
  option("url", "jdbc:redshift://redshift-host:5439/?user=&lt;user&gt;&amp;password=&lt;password&gt;").
  option("dbtable", "public.temporary_devices").
  option("tempdir", "s3a://temporary-holding-bucket/").
  option("tempformat", "CSV GZIP"). // EXPERIMENTAL, but CSV is higher performance than AVRO for loading into Redshift
  mode("error").
  save()</pre>
<p>* Note: coalesce(64) is called to reduce the number of files written to the S3 staging directory, because renaming files out of their temporary S3 location can be slow. <a href="https://github.com/rdblue/s3committer">This S3 committer</a> should help alleviate that issue.</p>
<p><strong>Resources</strong></p>
<p><a href="http://deploymentzone.com/2015/12/20/s3a-on-spark-on-aws-ec2/">http://deploymentzone.com/2015/12/20/s3a-on-spark-on-aws-ec2/</a></p>
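Since spark-redshift leaves its staged files behind in the tempdir, the object-lifecycle-management advice above can be automated. Here is one hedged sketch using the AWS CLI; the bucket name matches the examples, but the rule ID, prefix, and 7-day expiry are assumptions to adjust for your setup:

```shell
# Hypothetical lifecycle rule: expire everything in the staging bucket
# after 7 days, so abandoned spark-redshift temp files get cleaned up.
cat > lifecycle.json <<'EOF'
{
  "Rules": [
    {
      "ID": "expire-spark-redshift-tempdir",
      "Filter": { "Prefix": "" },
      "Status": "Enabled",
      "Expiration": { "Days": 7 }
    }
  ]
}
EOF
aws s3api put-bucket-lifecycle-configuration \
  --bucket temporary-holding-bucket \
  --lifecycle-configuration file://lifecycle.json
```

If your tempdir is a prefix within a shared bucket, set that prefix in the rule's Filter so unrelated objects are untouched.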