# Redbull

Run Spark Standalone on Azure Batch

## Setup
- Clone the repo
- Use pip to install the required packages:

  ```bash
  pip3 install -r requirements.txt
  ```

- Use setuptools:

  ```bash
  python3 setup.py install
  ```

- Rename `configuration.cfg.template` to `configuration.cfg` and fill in the fields for your Batch account and Storage account (a one-line shell rename is sketched after this list). These fields can be found in the Azure portal.

  To complete this step, you will need an Azure account that has a Batch account and a Storage account:
  - To create an Azure account: https://azure.microsoft.com/free/
  - To create a Batch account: https://docs.microsoft.com/en-us/azure/batch/batch-account-create-portal
  - To create a Storage account: https://docs.microsoft.com/en-us/azure/storage/storage-create-storage-account
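The rename in the last step is a plain file rename. For example, on Linux or macOS (on Windows, rename the file in Explorer or with `ren`):

```bash
# Rename the template, then fill in your Batch and Storage account fields
mv configuration.cfg.template configuration.cfg
```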
## Getting Started
The entire experience of this package is centered around a few commands in the `bin` folder.
First, create your cluster:
```bash
./bin/spark-cluster-create \
    --cluster-id <my-cluster-id> \
    --cluster-size <number of nodes> \
    --cluster-vm-size <vm-size> \
    --wait/--no-wait (optional)
```
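For example, the following would request a hypothetical 4-node cluster named `my-spark-cluster` using the `Standard_A2` VM size and wait until it is ready (all of these values are illustrative, not defaults):

```bash
./bin/spark-cluster-create \
    --cluster-id my-spark-cluster \
    --cluster-size 4 \
    --cluster-vm-size Standard_A2 \
    --wait
```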
When your cluster is ready, create a user for your cluster:
```bash
./bin/spark-cluster-create-user \
    --cluster-id <my-cluster-id> \
    --username <username> \
    --password <password>
```
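For instance, reusing the hypothetical cluster ID from above (the username is a placeholder; choose your own credentials):

```bash
./bin/spark-cluster-create-user \
    --cluster-id my-spark-cluster \
    --username sparkuser \
    --password <a-password-of-your-choice>
```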
Now you can submit jobs to run against the cluster:
```bash
./bin/spark-app-submit \
    --cluster-id <my-cluster-id> \
    --app-id <my-application> \
    --file <my-spark-job>
```
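As an illustration, assuming a Spark script such as `pi.py` in the `example-jobs` folder (the script and application names here are hypothetical):

```bash
./bin/spark-app-submit \
    --cluster-id my-spark-cluster \
    --app-id my-pi-app \
    --file ./example-jobs/pi.py
```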
To view the Spark web UI, open an SSH tunnel with the "webui" option and a local port to map it to:

```bash
./bin/spark-cluster-ssh \
    --cluster-id <my-cluster-id> \
    --webui <local-port>
```
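For example, mapping the web UI to local port 8080 (an arbitrary choice) and then browsing to http://localhost:8080:

```bash
./bin/spark-cluster-ssh \
    --cluster-id my-spark-cluster \
    --webui 8080
```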
Optionally, you can also open a Jupyter notebook to work in with the "jupyter" option:

```bash
./bin/spark-cluster-ssh \
    --cluster-id <my-cluster-id> \
    --webui <local-port> \
    --jupyter <local-port>
```
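A combined sketch, with both ports chosen arbitrarily, after which the notebook would be reachable at http://localhost:8888:

```bash
./bin/spark-cluster-ssh \
    --cluster-id my-spark-cluster \
    --webui 8080 \
    --jupyter 8888
```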
You can also see your clusters from the CLI:
```bash
./bin/spark-cluster-list
```
Finally, you can get the state of any specified cluster:
```bash
./bin/spark-cluster-get \
    --cluster-id <my-cluster-id>
```