# Distributed Tools for Data Engineering (DTDE)

A suite of distributed tools to help engineers scale their work into Azure.

## Spark on DTDE

### Setup
- Clone the repo
- Use pip to install the required packages:

    ```bash
    pip3 install -r requirements.txt
    ```

- Use setuptools:

    ```bash
    python3 setup.py install
    ```

- Rename 'configuration.cfg.template' to 'configuration.cfg' and fill in the fields for your Batch account and Storage account. These fields can be found in the Azure portal.

    To complete this step, you will need an Azure account that has a Batch account and a Storage account:
    - To create an Azure account: https://azure.microsoft.com/free/
    - To create a Batch account: https://docs.microsoft.com/en-us/azure/batch/batch-account-create-portal
    - To create a Storage account: https://docs.microsoft.com/en-us/azure/storage/storage-create-storage-account
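Once `configuration.cfg` is filled in, a quick sanity check of the install and credentials is to run one of the CLI commands. This is a minimal sketch, assuming it is run from the repository root and that the Batch and Storage credentials in `configuration.cfg` are valid:

```bash
# Lists the clusters in the configured Batch account.
# On a fresh account this should print an empty list, which confirms
# the package is installed and the credentials are wired up correctly.
./bin/spark-cluster-list
```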
### Getting Started

The entire experience of this package is centered around a few commands in the `bin` folder.

#### Create and set up your cluster

First, create your cluster:
```bash
./bin/spark-cluster-create \
    --id <my-cluster-id> \
    --size <number of nodes> \
    --vm-size <vm-size> \
    --custom-script <path to custom bash script to run on each node> (optional) \
    --wait/--no-wait (optional)
```
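For illustration only, a concrete invocation might look like the sketch below; the cluster id, node count, and VM size are placeholder values to replace with your own, and the optional custom script is omitted:

```bash
# Create a 4-node cluster of Standard_D2_v2 VMs and wait until it is ready.
# "spark-demo" is just an illustrative cluster id.
./bin/spark-cluster-create \
    --id spark-demo \
    --size 4 \
    --vm-size standard_d2_v2 \
    --wait
```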
You can also create your cluster with low-priority VMs at an 80% discount by using `--size-low-pri` instead of `--size`:

```bash
./bin/spark-cluster-create \
    --id <my-cluster-id> \
    --size-low-pri <number of low-pri nodes> \
    --vm-size <vm-size>
```
When your cluster is ready, create a user for it (if you didn't already do so when creating the cluster):

```bash
./bin/spark-cluster-create-user \
    --id <my-cluster-id> \
    --username <username> \
    --password <password>
```
NOTE: The cluster id (`--id`) can only contain alphanumeric characters, hyphens, and underscores, and cannot be longer than 64 characters.
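As a hedged, concrete example of the user-creation step, reusing the illustrative `spark-demo` cluster id from above and an arbitrary username:

```bash
# Add a user to the illustrative spark-demo cluster.
# Pick your own username and a strong password.
./bin/spark-cluster-create-user \
    --id spark-demo \
    --username sparkuser \
    --password <a strong password>
```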
#### Submit a Spark job

Now you can submit jobs to run against the cluster:

```bash
./bin/spark-submit \
    --id <my-cluster-id> \
    --name <my-job-name> \
    [options] \
    <app jar | python file> \
    [app arguments]
```
NOTE: The job name (`--name`) must be at least 3 characters long, can only contain alphanumeric characters and hyphens (underscores are not allowed), and cannot contain uppercase letters.
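As an illustrative sketch only (the script, its argument, and the job name below are placeholders, not files shipped with this repository), submitting a local PySpark script could look like:

```bash
# Submit a hypothetical local PySpark script to the illustrative spark-demo cluster.
# "wordcount" is a placeholder job name that satisfies the naming rules above.
./bin/spark-submit \
    --id spark-demo \
    --name wordcount \
    ./wordcount.py \
    input.txt
```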
#### Interact with your Spark cluster

To view the Spark UI, open up an SSH tunnel with the `--masterui` option and a local port to map to:

```bash
./bin/spark-cluster-ssh \
    --id <my-cluster-id> \
    --masterui <local-port> \
    --username <user-name>
```
Optionally, you can also open up a Jupyter notebook with the `--jupyter` option to work in:

```bash
./bin/spark-cluster-ssh \
    --id <my-cluster-id> \
    --masterui <local-port> \
    --jupyter <local-port>
```
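For example, to reach both the Spark master UI and a Jupyter notebook through a single tunnel (the local ports below are arbitrary choices, and the cluster id and username are the placeholders used in the earlier examples):

```bash
# Map the Spark master UI to localhost:8080 and Jupyter to localhost:8888.
./bin/spark-cluster-ssh \
    --id spark-demo \
    --masterui 8080 \
    --jupyter 8888 \
    --username sparkuser
```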
#### Manage your Spark cluster

You can also see your clusters from the CLI:

```bash
./bin/spark-cluster-list
```

And get the state of any specified cluster:

```bash
./bin/spark-cluster-get --id <my-cluster-id>
```

Finally, you can delete any specified cluster:

```bash
./bin/spark-cluster-delete --id <my-cluster-id>
```
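Putting the management commands together, a typical end-of-session teardown might look like this sketch, again using the illustrative `spark-demo` id from the earlier examples:

```bash
# See what is running, inspect the cluster's state, then remove it.
./bin/spark-cluster-list
./bin/spark-cluster-get --id spark-demo
./bin/spark-cluster-delete --id spark-demo
```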