# Distributed Tools for Data Engineering (DTDE)
A suite of distributed tools to help engineers scale their work into Azure.
# Spark on DTDE
## Setup
1. Clone the repo
2. Use pip to install required packages:
```bash
pip install -r requirements.txt
```
3. Use setuptools:
```bash
pip install -e .
```
4. Rename `configuration.cfg.template` to `configuration.cfg` and fill in the fields for your Batch account and Storage account. These values can be found in the Azure portal. A sketch of the filled-in file follows the links below.
To complete this step, you will need an Azure account that has a Batch account and Storage account:
- To create an Azure account: https://azure.microsoft.com/free/
- To create a Batch account: https://docs.microsoft.com/en-us/azure/batch/batch-account-create-portal
- To create a Storage account: https://docs.microsoft.com/en-us/azure/storage/storage-create-storage-account
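For reference, a filled-in `configuration.cfg` might look something like the sketch below. The section and field names here are illustrative assumptions, not guaranteed to match; the template file in this repo is the source of truth:
```ini
[Batch]
# Values from your Batch account in the Azure portal (all values below are placeholders)
batchaccountname = mybatchaccount
batchaccountkey = <batch-account-key>
batchserviceurl = https://mybatchaccount.westus.batch.azure.com

[Storage]
# Values from your Storage account in the Azure portal
storageaccountname = mystorageaccount
storageaccountkey = <storage-account-key>
storageaccountsuffix = core.windows.net
```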
## Getting Started
The entire experience of this package is centered around a few commands.
### Create and set up your cluster
First, create your cluster:
```bash
azb spark cluster create \
    --id <my-cluster-id> \
    --size <number of nodes> \
    --vm-size <vm-size> \
    --custom-script <path to custom bash script to run on each node> (optional) \
    --wait/--no-wait (optional)
```
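For example, to create a hypothetical three-node cluster and wait for it to be ready (the cluster id and VM size below are illustrative placeholders, not required values):
```bash
azb spark cluster create \
    --id my-spark-cluster \
    --size 3 \
    --vm-size standard_d2_v2 \
    --wait
```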
You can also create your cluster with [low-priority](https://docs.microsoft.com/en-us/azure/batch/batch-low-pri-vms) VMs at an 80% discount by using **--size-low-pri** instead of **--size**:
```bash
azb spark cluster create \
    --id <my-cluster-id> \
    --size-low-pri <number of low-pri nodes> \
    --vm-size <vm-size>
```
By default, this package runs Spark in Docker from an Ubuntu 16.04 base image on an Ubuntu 16.04 VM. More info on this image can be found in the **docker-images** folder in this repo.
You can opt out of using this image and use the Azure CentOS DSVM instead, which has Spark 2.0.2 pre-installed (*as of 07/24/17*). To do this, use the **--no-docker** flag, and the cluster will default to using the Azure DSVM.
```bash
azb spark cluster create \
    --id <my-cluster-id> \
    --size-low-pri <number of low-pri nodes> \
    --vm-size <vm-size> \
    --no-docker
```
You can also add a user directly with this command using the same inputs as the `add-user` command described below.
#### Add a user to your cluster to connect
When your cluster is ready, create a user for your cluster (if you didn't already do so when creating it):
```bash
# **Recommended usage**
# Add a user with an SSH public key. It will use the value specified in
# configuration.cfg (either a path to the key file or the actual key).
azb spark cluster add-user \
    --id <my-cluster-id> \
    --username <username>

# You can also explicitly specify the SSH public key (path or actual key)
azb spark cluster add-user \
    --id <my-cluster-id> \
    --username <username> \
    --ssh-key ~/.ssh/id_rsa.pub

# **Not recommended**
# You can also just specify a password
azb spark cluster add-user \
    --id <my-cluster-id> \
    --username <username> \
    --password <password>
```
NOTE: The cluster id (--id) may only contain alphanumeric characters, hyphens, and underscores, and cannot be longer than 64 characters.
### Submit a Spark job
Now you can submit jobs to run against the cluster:
```bash
azb spark app submit \
    --id <my-cluster-id> \
    --name <my-job-name> \
    [options] \
    <app jar | python file> \
    [app arguments]
```
NOTE: The job name (--name) must be at least 3 characters long, may only contain alphanumeric characters and hyphens (underscores are not allowed), and cannot contain uppercase letters.
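For example, to run a hypothetical PySpark script `pi.py` with a single application argument (the file and names below are illustrative placeholders):
```bash
azb spark app submit \
    --id my-spark-cluster \
    --name my-pi-job \
    ./pi.py 100
```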
The output of spark-submit will be streamed to the console. Use the `--no-wait` option to return immediately.
### Read the output of your Spark job
If you decided not to tail the log when submitting the job, or want to read it again, you can use this command:
```bash
azb spark app logs \
    --id <my-cluster-id> \
    --name <my-job-name> \
    [--tail] # Tail the log while the task is still running
```
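For example, to read back (and keep tailing) the log of the hypothetical job submitted above:
```bash
azb spark app logs \
    --id my-spark-cluster \
    --name my-pi-job \
    --tail
```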
### Interact with your Spark cluster
To view the Spark UI, open an SSH tunnel with the `--masterui` option and a local port to map to:
```bash
azb spark cluster ssh \
    --id <my-cluster-id> \
    --masterui <local-port> \
    --username <username>
```
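For example, to map the Spark master UI to local port 8080 (the port, cluster id, and username below are illustrative placeholders) and then browse to http://localhost:8080:
```bash
azb spark cluster ssh \
    --id my-spark-cluster \
    --masterui 8080 \
    --username spark-user
```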
Optionally, you can also open up a Jupyter notebook with the `--jupyter` option to work in:
```bash
azb spark cluster ssh \
    --id <my-cluster-id> \
    --masterui <local-port> \
    --jupyter <local-port>
```
### Manage your Spark cluster
You can also see your clusters from the CLI:
```bash
azb spark cluster list
```
And get the state of any specified cluster:
```bash
azb spark cluster get --id <my-cluster-id>
```
Finally, you can delete any specified cluster:
```bash
azb spark cluster delete --id <my-cluster-id>
```