Mirror of https://github.com/Azure/aztk.git
Feature/docs (#51)
* update configuration.cfg to secrets.cfg
* fix node count pretty print
* add more unit tests
* configure pytest to only run tests from the 'tests' directory
* add missing space
* remove static call to loading secrets file
* initial docs layout
* rename files and add images to getting started
* fix typos
* Add next steps to clusters docs
* Add custom scripts doc
* Initial text for spark submit
* fix merge conflict
* finish spark submit docs
* cloud storage docs
* fix typo in storage docs
* refactor some docs
* pr feedback
* add link to Getting Started in creating account details in main readme
This commit is contained in:
Parent: 577ff296b5
Commit: ae26453ba5

README.md | 38
@@ -13,13 +13,15 @@ A suite of distributed tools to help engineers scale their work into Azure.

```bash
pip install -e .
```

4. Rename 'secrets.cfg.template' to 'secrets.cfg' and fill in the fields for your Batch account and Storage account. These fields can be found in the Azure portal.
4. Rename 'secrets.cfg.template' to 'secrets.cfg' and fill in the fields for your Batch account and Storage account. These fields can be found in the Azure portal, and in the [Getting Started](./docs/00-getting-started.md) docs.

To complete this step, you will need an Azure account that has a Batch account and Storage account:
- To create an Azure account: https://azure.microsoft.com/free/
- To create a Batch account: https://docs.microsoft.com/en-us/azure/batch/batch-account-create-portal
- To create a Storage account: https://docs.microsoft.com/en-us/azure/storage/storage-create-storage-account

## Getting Started

The entire experience of this package is centered around a few commands.
@@ -46,16 +48,6 @@ azb spark cluster create \

By default, this package runs Spark in docker from an ubuntu16.04 base image on a ubuntu16.04 VM. More info on this image can be found in the **docker-images** folder in this repo.

You can opt out of using this image and use the Azure CentOS DSVM instead - the Azure CentOS DSVM has Spark 2.0.2 pre-installed (*as of 07/24/17*). To do this, use the --no-docker flag, and it will default to using the Azure DSVM.

```
azb spark cluster create \
    --id <my-cluster-id> \
    --size-low-pri <number of low-pri nodes> \
    --vm-size <vm-size> \
    --no-docker
```

You can also add a user directly in this command using the same inputs as the `add-user` command described below.

#### Add a user to your cluster to connect
@@ -84,6 +76,8 @@ azb spark cluster add-user \

NOTE: The cluster id (--id) can only contain alphanumeric characters including hyphens and underscores, and cannot contain more than 64 characters.

More information regarding using a cluster can be found in the [cluster documentation](./documentation/10%20-%20Clusters.md)

### Submit a Spark job

@@ -110,24 +104,7 @@ azb spark app logs \
    --name <my-job-name>
    [--tail] # If you want it to tail the log if the task is still running
```
### Interact with your Spark cluster

To view the Spark UI, open up an ssh tunnel with the "masterui" option and a local port to map to:
```
azb spark cluster ssh \
    --id <my-cluster-id> \
    --masterui <local-port> \
    --username <user-name>
```

Optionally, you can also open up a Jupyter notebook with the "jupyter" option to work in:
```
azb spark cluster ssh \
    --id <my-cluster-id> \
    --masterui <local-port> \
    --jupyter <local-port>
```

More information regarding submitting jobs can be found in the [spark submit documentation](./documentation/20%20-%20Spark%20Submit.md)
### Connect your cluster to Azure Blob Storage (WASB connection)

@@ -161,3 +138,6 @@ Finally, you can delete any specified cluster:
```
azb spark cluster delete --id <my-cluster-id>
```

## Next Steps

You can find more documentation [here](./documentation)
@@ -0,0 +1,72 @@
# Azure Batch with Spark Documentation

Spark on Azure Batch is a project that enables submission of single or batch Spark jobs to the cloud. Azure Batch will elastically scale and manage the compute resources required to run the jobs, and spin them down when no longer needed.

## Getting Started
The minimum requirements to get started with Spark on Azure Batch are:
- An Azure account
- An Azure Batch account
- An Azure Storage account
### Setting up your accounts
1. Log into Azure

    If you do not already have an Azure account, go to [https://azure.microsoft.com/](https://azure.microsoft.com/) to get started for free today.

    Once you have one, simply log in and go to the [Azure Portal](https://portal.azure.com) to start creating the resources you'll need to get going.

2. Create a Storage account

    - Click the '+' button at the top left of the screen and search for 'Storage'. Select 'Storage account - blob, file, table, queue' and click 'Create'

    ![](./misc/Storage_1.png)

    - Fill in the form and create the Storage account.

    ![](./misc/Storage_2.png)

3. Create a Batch account

    - Click the '+' button at the top left of the screen and search for 'Compute'. Select 'Batch' and click 'Create'

    ![](./misc/Batch_1.png)

    - Fill in the form and create the Batch account.

    ![](./misc/Batch_2.png)
4. Save your account credentials into the secrets.cfg file

    - Copy the secrets.cfg.template file to secrets.cfg

    Windows
    ```sh
    copy secrets.cfg.template secrets.cfg
    ```

    Linux and Mac OS
    ```sh
    cp secrets.cfg.template secrets.cfg
    ```

    - Go to the accounts in the Azure portal and copy paste the account names, keys and other information needed into the secrets file.
#### Storage account

For the Storage account, copy the name and one of the two keys:

![](./misc/Storage_secrets.png)

#### Batch account

For the Batch account, copy the name, the URL and one of the two keys:

![](./misc/Batch_secrets.png)
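For reference, a filled-in secrets.cfg ends up looking roughly like the sketch below. This is illustrative only: the section and field names here are assumptions, so copy the exact names from secrets.cfg.template rather than from this sketch.

```sh
# secrets.cfg -- illustrative sketch; always start from secrets.cfg.template
[batch]
batchaccountname = mybatchaccount
batchaccountkey = <key copied from the Batch account>
batchserviceurl = https://mybatchaccount.westus.batch.azure.com

[storage]
storageaccountname = mystorageaccount
storageaccountkey = <one of the two Storage account keys>
```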

## Next Steps
- [Create a cluster](./10-clusters.md)
- [Run a Spark job](./20-spark-submit.md)
@@ -0,0 +1,123 @@
# Clusters

A cluster in Azure Batch is primarily designed to run Spark jobs. This document describes how to create a cluster to use for Spark job submissions. Alternatively, for getting started and debugging, you can also use the cluster in _interactive mode_, which allows you to log into the master node, use Jupyter and view the Spark UI.

## Creating a Cluster
Creating a Spark cluster only takes a few simple steps, after which you will be able to log into the master node of the cluster and interact with Spark in a Jupyter notebook, view the Spark UI on your local machine, or submit a job.
### Commands
Create a cluster:

```sh
azb spark cluster create --id <your_cluster_id> --vm-size <vm_size_name> --size <number_of_nodes>
```

For example, to create a cluster of 4 Standard_A2 nodes called 'spark' you can run:
```sh
azb spark cluster create --id spark --vm-size standard_a2 --size 4
```

You can find more information on VM sizes [here](https://docs.microsoft.com/en-us/azure/virtual-machines/linux/sizes)

NOTE: The cluster id (--id) can only contain alphanumeric characters including hyphens and underscores, and cannot contain more than 64 characters.

#### Low priority nodes
You can create your cluster with [low-priority](https://docs.microsoft.com/en-us/azure/batch/batch-low-pri-vms) VMs at an 80% discount by using **--size-low-pri** instead of **--size**. Note that these are great for experimental use, but can be taken away at any time. We recommend against this option for long running jobs or critical workloads.
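For example, a sketch of the same 4-node 'spark' cluster as above, but on low-priority VMs (assuming --size-low-pri takes a node count exactly like --size does):

```sh
azb spark cluster create --id spark --vm-size standard_a2 --size-low-pri 4
```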
### Listing clusters
You can list all clusters currently running in your account by running

```sh
azb spark cluster list
```

### Viewing a cluster
To view details about a particular cluster run:

```sh
azb spark cluster get --id <your_cluster_id>
```

Note that the cluster is not fully usable until a master node has been selected and its state is 'idle'.

For example, here cluster 'spark' has 2 nodes, and node tvm-257509324_2-20170820t200959z is the master and ready to run a job.
```sh
Cluster         spark
------------------------------------------
State:          active
Node Size:      standard_a2
Nodes:          2
| Dedicated:    2
| Low priority: 0

Nodes                               | State           | IP:Port              | Master
------------------------------------|-----------------|----------------------|--------
tvm-257509324_1-20170820t200959z    | idle            | 40.83.254.90:50001   |
tvm-257509324_2-20170820t200959z    | idle            | 40.83.254.90:50000   | *
```
### Deleting a cluster
To delete a cluster run:

```sh
azb spark cluster delete --id <your_cluster_id>
```

__You are charged for the cluster as long as the nodes are provisioned in your account.__ Make sure to delete any clusters you are not using to avoid unwanted costs.
### Interactive Mode
All interaction with the cluster is done via SSH and SSH tunneling. The first step to enable this is to add a user to the master node.

Using an __SSH key__
```sh
azb spark cluster add-user --id spark --username admin --ssh-key <your_key_OR_path_to_key>
```

Alternatively, updating secrets.cfg with the SSH key, or a path to it, will allow the tool to pick it up automatically.

```sh
azb spark cluster add-user --id spark --username admin
```

Using a __password__
```sh
azb spark cluster add-user --id spark --username admin --password my_password
```

_Using a password is discouraged in favor of an SSH key._
### SSH and Port Forwarding
After a user has been created, SSH into the master node with:

```sh
azb spark cluster ssh --id spark --username admin
```

For interactive use with the cluster, port forwarding of certain ports is required to enable viewing the Spark Master UI, Spark Jobs UI and the Jupyter notebook in the cluster:

```sh
azb spark cluster ssh --id spark --username admin --masterui 8080 --webui 4040 --jupyter 8888
```

_Additional ports can be forwarded using standard ssh port forwarding syntax._
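As a rough illustration of that standard syntax, the plain ssh command below forwards one extra local port to a service on the master node. Here <master_ip> and <ssh_port> are hypothetical placeholders, not values this tool guarantees:

```sh
# Forward local port 9999 to port 18080 on the cluster's master node.
# <master_ip> and <ssh_port> are placeholders; take the real values from
# the IP:Port column of 'azb spark cluster get --id spark'.
ssh -L 9999:localhost:18080 admin@<master_ip> -p <ssh_port>
```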
### Interact with your Spark cluster

#### Jupyter
Once the appropriate ports have been forwarded, simply navigate to the local ports for viewing. In this case, if you used port 8888 for Jupyter then navigate to [http://localhost:8888](http://localhost:8888)

__The notebooks are only persisted on the cluster itself.__ Once the cluster is deleted, all notebooks will be deleted with it. We recommend saving your notebooks elsewhere if you do not want them deleted.
## Next Steps
- [Run a Spark job](./20-spark-submit.md)
- [Configure the Spark cluster using custom scripts](./11-custom-scripts.md)
@@ -0,0 +1,12 @@
# Custom scripts

Custom scripts allow for additional cluster setup steps when the cluster is being provisioned. This is useful when you want to modify the default cluster configuration for things such as modifying spark.conf, adding jars or downloading any files you need in the cluster. A minimal example script is sketched after the list below.

## Scripting considerations

- The default OS is Ubuntu 16.04.
- The script runs on every node in the cluster _after_ Spark has been installed and is running.
- The environment variable $SPARK_HOME points to the root Spark directory.
- The environment variable $IS_MASTER identifies if this is the node running the master role. The node running the master role _also_ runs a worker role on it.
- The Spark cluster is set up using Standalone Mode.
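To make these considerations concrete, here is a sketch of what such a script could look like. It is illustrative only: the config file it appends to, the jar URL and the exact truthy value of $IS_MASTER are assumptions to verify against your cluster.

```sh
#!/bin/bash
# Hypothetical custom script: runs on every node after Spark is installed.

# Append an extra Spark setting (path derived from $SPARK_HOME).
echo "spark.eventLog.enabled true" >> "$SPARK_HOME/conf/spark-defaults.conf"

# Download an extra jar into Spark's jars directory (placeholder URL).
wget -q -P "$SPARK_HOME/jars" https://example.com/my-dependency.jar

# Master-only setup; assumes $IS_MASTER is set to "true" on the master node.
if [ "$IS_MASTER" = "true" ]; then
    echo "running master-only setup" >> /tmp/custom-script.log
fi
```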
@@ -0,0 +1,32 @@
# Submitting a Job

Submitting a job to your Spark cluster in Batch mimics the experience of a typical standalone cluster. A Spark job will be submitted to the system and run to completion.

## Spark-Submit
The spark-submit experience is mostly the same as any regular Spark cluster with a few minor differences. You can take a look at `azb spark app --help` for more detailed information and options.

Run a Spark job:
```sh
azb spark app submit --id <name_of_spark_cluster> --name <name_of_spark_job> <executable> <executable_params>
```

For example, to run a local pi.py file on a Spark cluster:
```sh
azb spark app submit --id spark --name pipy example-jobs/python/pi.py 100
```

NOTE: The job name (--name) must be at least 3 characters long, can only contain alphanumeric characters including hyphens but excluding underscores, and cannot contain uppercase letters.
## Monitoring job
If you have set up an [SSH tunnel](./10-clusters.md#ssh-and-port-forwarding) with port forwarding, you can navigate to http://localhost:8080 and http://localhost:4040 to view the progress of the job using the Spark UI.
## Getting output logs
The default setting when running a job is --wait. This will simply submit a job to the cluster and wait for the job to run. If you want to just submit the job and not wait, use the --no-wait flag and tail the logs manually:

```sh
azb spark app submit --id spark --name pipy --no-wait example-jobs/pi.py 1000
```

```sh
azb spark app logs --id spark --name pipy --tail
```
@@ -0,0 +1,18 @@
# Cloud storage

Cloud storage for Spark enables you to have a persisted storage system backed by a cloud provider. Spark supports this by placing the appropriate storage jars and updating the core-site.xml file accordingly.

## Azure Storage Blobs (WASB)
The default cloud storage system for Spark on Batch is Azure Storage Blobs (a.k.a. WASB). The WASB jars are automatically placed in the Spark cluster and the permissions are pulled from your secrets file.

Reading and writing to and from Azure blobs is easily achieved by using the wasb syntax. For example, reading a csv file using pyspark would be:

```python
# read csv data into a dataframe ('spark' is the active SparkSession)
dataframe = spark.read.csv('wasb://MY_CONTAINER@MY_STORAGE_ACCOUNT.blob.core.windows.net/MY_INPUT_DATA.csv')

# print the first 5 rows
dataframe.show(5)

# write the csv back to storage
dataframe.write.csv('wasb://MY_CONTAINER@MY_STORAGE_ACCOUNT.blob.core.windows.net/MY_DATA.csv')
```
Binary file not shown (new image, 150 KiB)
Binary file not shown (new image, 58 KiB)
Binary file not shown (new image, 72 KiB)
Binary file not shown (new image, 118 KiB)
Binary file not shown (new image, 74 KiB)
Binary file not shown (new image, 111 KiB)
@@ -0,0 +1,14 @@
## [Getting Started](./00-getting-started.md)
Set up your Azure account and resources.

## [Clusters](./10-clusters.md)
Create, manage and interact with a cluster. Also learn about connecting to the master node, using Jupyter and viewing the Spark UI.

## [Custom Scripts](./11-custom-scripts.md)
Add custom configuration scripts to your cluster.

## [Spark Submit](./20-spark-submit.md)
Submit a Spark job to the cluster.

## [Cloud Storage](./30-cloud-storage.md)
Use cloud storage to load and save data to and from persisted data stores.