[Docs] Databricks App Submission (#100)

This commit is contained in:
Steve Suh 2019-05-10 09:11:26 -07:00 committed by Terry Kim
Parent 95d86e788e
Commit 468bfb47a9
4 changed files: 81 additions and 44 deletions

.gitattributes (vendored): new file, 8 additions

@@ -0,0 +1,8 @@
###############################################################################
# Set default behavior to automatically normalize line endings.
###############################################################################
* text=auto
# Force bash scripts to always use lf line endings so that if a repo is accessed
# in Unix via a file share from Windows, the scripts will work.
*.sh text eol=lf

View file

@@ -63,7 +63,7 @@ Microsoft.Spark.Worker is a backend component that lives on the individual worker
## Azure HDInsight Spark
[Azure HDInsight Spark](https://docs.microsoft.com/en-us/azure/hdinsight/spark/apache-spark-overview) is the Microsoft implementation of Apache Spark in the cloud that allows users to launch and configure Spark clusters in Azure. You can use HDInsight Spark clusters to process your data stored in Azure (e.g., [Azure Storage](https://azure.microsoft.com/en-us/services/storage/) and [Azure Data Lake Storage](https://docs.microsoft.com/en-us/azure/storage/blobs/data-lake-storage-introduction)).
> **Note:** Azure HDInsight Spark is Linux-based. Therefore, if you are interested in deploying your app to Azure HDInsight Spark, make sure your app is .NET Standard compatible and that you use [.NET Core compiler](https://dotnet.microsoft.com/download) to compile your app.
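For example, a minimal sketch of compiling for Linux with the .NET Core CLI (the project name `MySparkApp.csproj`, target framework, and runtime identifier below are illustrative assumptions, not values prescribed by these docs):
```shell
# Publish a hypothetical app for Linux; the framework and runtime identifier
# shown are assumptions -- use the ones your app actually targets.
dotnet publish MySparkApp.csproj -c Release -f netcoreapp2.1 -r linux-x64
```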
### Deploy Microsoft.Spark.Worker
*Note that this step is required only once*
@@ -115,7 +115,7 @@ EOF
## Amazon EMR Spark
[Amazon EMR](https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-what-is-emr.html) is a managed cluster platform that simplifies running big data frameworks on AWS.
> **Note:** AWS EMR Spark is Linux-based. Therefore, if you are interested in deploying your app to AWS EMR Spark, make sure your app is .NET Standard compatible and that you use [.NET Core compiler](https://dotnet.microsoft.com/download) to compile your app.
### Deploy Microsoft.Spark.Worker
*Note that this step is only required at cluster creation*
@@ -160,29 +160,29 @@ foo@bar:~$ aws emr add-steps \
## Databricks
[Databricks](http://databricks.com) is a platform that provides cloud-based big data processing using Apache Spark.
> **Note:** [Azure](https://azure.microsoft.com/en-us/services/databricks/) and [AWS](https://databricks.com/aws) Databricks are Linux-based. Therefore, if you are interested in deploying your app to Databricks, make sure your app is .NET Standard compatible and that you use [.NET Core compiler](https://dotnet.microsoft.com/download) to compile your app.
Databricks allows you to submit Spark .NET apps to an existing active cluster or have a new cluster created every time you launch a job. This requires **Microsoft.Spark.Worker** to be installed on the cluster **before** you submit a Spark .NET app.
### Deploy Microsoft.Spark.Worker
*Note that this step is required only once*
1. Download **[db-init.sh](../deployment/db-init.sh)** and **[install-worker.sh](../deployment/install-worker.sh)** onto your local machine
2. Modify **db-init.sh** appropriately to point to the Microsoft.Spark.Worker release you want to download and install on your cluster (a sketch of the relevant variables follows this list)
3. Download and install [Databricks CLI](https://docs.databricks.com/user-guide/dev-tools/databricks-cli.html)
4. [Set up authentication](https://docs.databricks.com/user-guide/dev-tools/databricks-cli.html#set-up-authentication) details for the Databricks CLI appropriately
5. Upload the files you downloaded and modified to your Databricks cluster
```shell
cd <path-to-db-init-and-install-worker>
databricks fs cp db-init.sh dbfs:/spark-dotnet/db-init.sh
databricks fs cp install-worker.sh dbfs:/spark-dotnet/install-worker.sh
```
6. Go to your Databricks cluster homepage -> Clusters (on the left-side menu) -> Create Cluster
7. After configuring the cluster appropriately, set the init script (see the image below) and create the cluster.
<img src="../docs/img/deployment-databricks-init-script.PNG" alt="ScriptActionImage" width="600"/>
> **Note:** If everything went well, your cluster creation should have been successful. You can check this by clicking on the cluster -> Event Logs.
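For reference, the variables near the top of **db-init.sh** that step 2 refers to might look like the following; the values shown are placeholders only (the exact format of the release value depends on **install-worker.sh**):
```shell
# Placeholder values -- edit these to match the release you want installed
DBFS_INSTALLATION_ROOT=/dbfs/spark-dotnet
DOTNET_SPARK_RELEASE=<Microsoft.Spark.Worker-release-to-install>
DOTNET_SPARK_WORKER_INSTALLATION_PATH=/usr/local/bin
```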
### Run your app on the cloud!
@@ -191,30 +191,43 @@ Databricks allows you to submit Spark .NET apps to an existing active cluster or
> **Note:** This approach allows job submission to an existing active cluster.
One-time Setup:
1. Go to your Databricks cluster -> Jobs (on the left-side menu) -> Set JAR
2. Upload the appropriate `microsoft-spark-<spark-version>-<spark-dotnet-version>.jar`
3. Set the parameters appropriately (an illustrative example follows this list):
```
Main Class: org.apache.spark.deploy.DotnetRunner
Arguments: /dbfs/apps/<your-app-name>.zip <your-app-main-class>
```
4. Configure the Cluster to point to an existing cluster for which you have already set the init script (see the previous section).
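For instance, if your published app were uploaded as `/dbfs/apps/MySparkApp.zip` and its executable were named `MySparkApp` (both names are purely illustrative), the fields might read:
```
Main Class: org.apache.spark.deploy.DotnetRunner
Arguments: /dbfs/apps/MySparkApp.zip MySparkApp
```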
Publishing your App & Running:
1. You should first [publish your app](#preparing-your-spark-net-app).
> **Note:** Do not use `SparkSession.Stop()` in your application code when submitting jobs to an existing active cluster.
2. Use [Databricks CLI](https://docs.databricks.com/user-guide/dev-tools/databricks-cli.html) to upload your application to your Databricks cluster (see the sketch after this list). For instance,
```shell
cd <path-to-your-app-publish-directory>
databricks fs cp <your-app-name>.zip dbfs:/apps/<your-app-name>.zip
```
3. This step is only required if app assemblies (e.g., DLLs that contain user-defined functions along with their dependencies) need to be placed in the working directory of each Microsoft.Spark.Worker.
- Upload your application assemblies to your Databricks cluster
```shell
cd <path-to-your-app-publish-directory>
databricks fs cp <assembly>.dll dbfs:/apps/dependencies
```
- Uncomment and modify the app dependencies section in **[db-init.sh](../deployment/db-init.sh)** to point to your app dependencies path and upload to your Databricks cluster.
```shell
cd <path-to-db-init-and-install-worker>
databricks fs cp db-init.sh dbfs:/spark-dotnet/db-init.sh
```
- Restart your cluster.
4. Now, go to your `Databricks cluster -> Jobs -> [Job-name] -> Run Now` to run your job!
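Assuming the app has already been published per step 1, a hypothetical sequence for zipping the publish output and uploading it (the name `MySparkApp` is illustrative only):
```shell
# Zip the publish output and copy the archive to DBFS
cd <path-to-your-app-publish-directory>
zip -r ../MySparkApp.zip .
databricks fs cp ../MySparkApp.zip dbfs:/apps/MySparkApp.zip
```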
#### Using [spark-submit](https://spark.apache.org/docs/latest/submitting-applications.html)
> **Note:** This approach allows submission ONLY to a cluster that is created on-demand.
1. [Create a Job](https://docs.databricks.com/user-guide/jobs.html) and select *Configure spark-submit*.
2. Configure `spark-submit` with the following parameters (an illustrative filled-in example follows):
```shell
["--files","/dbfs/<path-to>/<app assembly/file to deploy to worker>","--class","org.apache.spark.deploy.DotnetRunner","/dbfs/<path-to>/microsoft-spark-<spark_majorversion.spark_minorversion.x>-<spark_dotnet_version>.jar","/dbfs/<path-to>/<app name>.zip","<app bin name>","app arg1","app arg2"]
```
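As a purely illustrative example with hypothetical names filled in (the dependency DLL, jar version, app name, and argument below are placeholders, not actual artifacts):
```shell
["--files","/dbfs/apps/dependencies/MyUdfs.dll","--class","org.apache.spark.deploy.DotnetRunner","/dbfs/apps/microsoft-spark-2.4.x-0.3.0.jar","/dbfs/apps/MySparkApp.zip","MySparkApp","input.txt"]
```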

View file

@@ -2,8 +2,8 @@
##############################################################################
# Description:
# This script installs the worker binaries and your app dependencies onto
# your Databricks Spark cluster.
#
# Usage:
# Change the variables below appropriately.
@@ -25,3 +25,18 @@ DOTNET_SPARK_WORKER_INSTALLATION_PATH=/usr/local/bin
set +e
/bin/bash $DBFS_INSTALLATION_ROOT/install-worker.sh github $DOTNET_SPARK_RELEASE $DOTNET_SPARK_WORKER_INSTALLATION_PATH
##############################################################################
# Uncomment below to deploy application dependencies to workers if submitting
# jobs using the "Set Jar" task (https://docs.databricks.com/user-guide/jobs.html#jar-jobs)
# Change the variables below appropriately
##############################################################################
################################# CHANGE THESE ###############################
#APP_DEPENDENCIES=/dbfs/apps/dependencies
#WORKER_PATH=`readlink $DOTNET_SPARK_WORKER_INSTALLATION_PATH/Microsoft.Spark.Worker`
#if [ -f $WORKER_PATH ] && [ -d $APP_DEPENDENCIES ]; then
# sudo cp -fR $APP_DEPENDENCIES/. `dirname $WORKER_PATH`
#fi

View file

@@ -5,10 +5,11 @@
### Debugging .NET application
Open a new command prompt window and run the following:
```shell
spark-submit \
--class org.apache.spark.deploy.DotnetRunner \
--master local \
<path-to-microsoft-spark-jar> \
debug
```
and you will see the following output:
@@ -25,7 +26,7 @@ Now you can run your .NET application with any debugger to debug your application
If you need to debug the Scala side code (`DotnetRunner`, `DotnetBackendHandler`, etc.), you can use the following command, and attach a debugger to the running process using [Intellij](https://www.jetbrains.com/help/idea/attaching-to-local-process.html):
```shell
spark-submit \
--driver-java-options -agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=5005 \
--class org.apache.spark.deploy.DotnetRunner \