# Distributed Tools for Data Engineering (DTDE)
A suite of distributed tools to help engineers scale their work into Azure.
# Spark on DTDE
## Setup
1. Clone the repo
2. Use pip to install required packages:
```
pip3 install -r requirements.txt
```
3. Use setuptools:
```
python3 setup.py install
```
4. Rename 'configuration.cfg.template' to 'configuration.cfg' and fill in the fields for your Batch account and Storage account (a sketch of what a completed file might look like follows this list). These values can be found in the Azure portal.
To complete this step, you will need an Azure account that has a Batch account and Storage account:
- To create an Azure account: https://azure.microsoft.com/free/
- To create a Batch account: https://docs.microsoft.com/en-us/azure/batch/batch-account-create-portal
- To create a Storage account: https://docs.microsoft.com/en-us/azure/storage/storage-create-storage-account
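For illustration, the completed configuration.cfg is a plain INI-style file holding your Batch and Storage credentials. The section and field names below are a hypothetical sketch only; keep the names that ship in 'configuration.cfg.template' and paste in the values from the Azure portal:
```
# Illustrative sketch only -- use the field names from configuration.cfg.template.
# All values are placeholders; copy the real account names, keys, and URLs from the portal.
[Batch]
batchaccountname = mybatchaccount
batchaccountkey = <batch-account-key-from-portal>
batchserviceurl = https://mybatchaccount.westus.batch.azure.com

[Storage]
storageaccountname = mystorageaccount
storageaccountkey = <storage-account-key-from-portal>
```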
## Getting Started
Everything in this package is driven by a handful of commands in the bin folder.
### Create and setup your cluster
First, create your cluster:
```
./bin/spark-cluster-create \
--id <my-cluster-id> \
--size <number of nodes> \
--vm-size <vm-size> \
--custom-script <path to custom bash script to run on each node> (optional) \
--wait/--no-wait (optional)
```
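For example, a concrete invocation might look like the following. The cluster id, node count, and VM size are illustrative values; **--wait** is included here so the command returns only once the cluster is ready:
```
./bin/spark-cluster-create \
--id spark-demo \
--size 3 \
--vm-size standard_d2_v2 \
--wait
```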
You can also create your cluster with [low-priority](https://docs.microsoft.com/en-us/azure/batch/batch-low-pri-vms) VMs at an 80% discount by using **--size-low-pri** instead of **--size**:
```
./bin/spark-cluster-create \
--id <my-cluster-id> \
--size-low-pri <number of low-pri nodes> \
--vm-size <vm-size>
```
When your cluster is ready, create a user for your cluster:
```
./bin/spark-cluster-create-user \
--id <my-cluster-id> \
--username <username> \
--password <password>
```
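For example, using the illustrative cluster id from above (the username is a placeholder; choose your own credentials):
```
./bin/spark-cluster-create-user \
--id spark-demo \
--username sparkuser \
--password <a-password-you-choose>
```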
### Submit a Spark job
Now you can submit jobs to run against the cluster:
```
./bin/spark-submit \
--id <my-cluster-id> \
--name <my-job-name> \
[list of options] \
--application <path-to-spark-job>
```
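For example, to run a local PySpark script against the cluster (the job name and script path are illustrative; [list of options] stands for any additional options your job needs and is omitted here):
```
./bin/spark-submit \
--id spark-demo \
--name pi-job \
--application ./pi.py
```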
### Interact with your Spark cluster
To view the Spark UI, open an SSH tunnel with the "masterui" option and a local port to map it to:
```
./bin/spark-cluster-ssh \
--id <my-cluster-id> \
--masterui <local-port>
```
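For example, mapping the master UI to local port 8080 (the port number is your choice):
```
./bin/spark-cluster-ssh \
--id spark-demo \
--masterui 8080
```
With the tunnel open, the Spark master UI should then be reachable at http://localhost:8080 in your browser.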
Optionally, you can also open a Jupyter notebook to work in by using the "jupyter" option:
```
./bin/spark-cluster-ssh \
--id <my-cluster-id> \
--jupyter <local-port>
```
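For example, mapping Jupyter to local port 8888 (again, the port is your choice):
```
./bin/spark-cluster-ssh \
--id spark-demo \
--jupyter 8888
```
The notebook should then be reachable at http://localhost:8888.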
### Manage your Spark cluster
You can also see your clusters from the CLI:
```
./bin/spark-cluster-list
```
And get the state of any specified cluster:
```
./bin/spark-cluster-get --id <my-cluster-id>
```
Finally, you can delete any specified cluster:
```
./bin/spark-cluster-delete --id <my-cluster-id>
```