AZTK powered by Azure Batch: On-demand, Dockerized, Spark Jobs on Azure

Azure Thunderbolt

Azure Thunderbolt is a Python CLI application for provisioning Dockerized Spark clusters in Azure. This package is built on top of Azure Batch but does not require any Azure Batch knowledge to use.

Azure Thunderbolt is designed to run batch Spark jobs that require additional on-demand compute. This package is not ideal for long-running clusters used by applications such as Spark Streaming.

Notable Features

  • Spark cluster provision time of 3-5 minutes on average
  • Azure Thunderbolt clusters run in Docker containers
  • Users can bring their own Docker image
  • Ability to use low-priority VMs for an 80% discount
  • Built-in support for connecting to Azure Blob Storage
  • Built-in Jupyter notebook for an interactive experience
  • Ability to run spark-submit directly from your local machine's CLI

Setup

  1. Clone the repo
  2. Use pip to install required packages (requires Python 3.5+ and pip 9.0.1+):
    pip install -r requirements.txt
  3. Install the package with setuptools:
    pip install -e .
  4. Rename 'secrets.cfg.template' to 'secrets.cfg' and fill in the fields for your Batch account and Storage account.

    Thunderbolt is built on top of two core Azure services, Azure Batch and Azure Storage. Create those resources via the portal (see Getting Started).
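For reference, a filled-in secrets.cfg might look like the sketch below. The exact section and field names come from secrets.cfg.template in the repo and may differ between versions; every value here is a placeholder:

```ini
[batch]
batchaccountname = mybatchaccount
batchaccountkey = <batch-account-key>
batchserviceurl = https://mybatchaccount.eastus.batch.azure.com

[storage]
storageaccountname = mystorageaccount
storageaccountkey = <storage-account-key>
```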

Quickstart Guide

The entire experience of this package is centered around a few commands.

Create and setup your cluster

First, create your cluster:

azb spark cluster create \
    --id <my-cluster-id> \
    --size <number of nodes> \
    --vm-size <vm-size> \
    --custom-script <path to custom bash script to run on each node> (optional) \
    --wait/--no-wait (optional)

You can also create your cluster with low-priority VMs at an 80% discount by using --size-low-pri instead of --size:

azb spark cluster create \
    --id <my-cluster-id> \
    --size-low-pri <number of low-pri nodes> \
    --vm-size <vm-size>

By default, this package runs Spark in Docker from an Ubuntu 16.04 base image on an Ubuntu 16.04 VM. More information on this image can be found in the docker-image folder in this repo.

You can also add a user directly in this command using the same inputs as the add-user command described below.

Add a user to your cluster to connect

When your cluster is ready, create a user for your cluster (if you didn't already do so when creating your cluster):

# **Recommended usage**
# Add a user with an SSH public key. This uses the value specified in
# secrets.cfg (either a path to the key file or the key itself).
azb spark cluster add-user \
    --id <my-cluster-id> \
    --username <username>

# You can also explicitly specify the SSH public key (a path or the actual key)
azb spark cluster add-user \
    --id <my-cluster-id> \
    --username <username> \
    --ssh-key ~/.ssh/id_rsa.pub

# **Not recommended**
# You can also specify just a password
azb spark cluster add-user \
    --id <my-cluster-id> \
    --username <username> \
    --password <password>
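If you don't already have an SSH key pair, you can generate one locally and point secrets.cfg or --ssh-key at the resulting .pub file. The file name below is just an example, not a path the package expects:

```shell
# Generate an RSA key pair for cluster logins (no passphrase here for brevity;
# add one with -N in real use). The .pub file is what add-user consumes.
mkdir -p "$HOME/.ssh"
ssh-keygen -q -t rsa -b 2048 -N "" -f "$HOME/.ssh/aztk_id_rsa"
cat "$HOME/.ssh/aztk_id_rsa.pub"
```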

NOTE: The cluster id (--id) can only contain alphanumeric characters, hyphens, and underscores, and cannot be longer than 64 characters.

More information about using a cluster can be found in the cluster documentation.

Submit a Spark job

Now you can submit jobs to run against the cluster:

azb spark cluster submit \
    --id <my-cluster-id> \
    --name <my-job-name> \
    [options] \
    <app jar | python file> \
    [app arguments]

NOTE: The job name (--name) must be at least 3 characters long and can only contain lowercase alphanumeric characters and hyphens (no underscores, no uppercase letters).
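The naming rule above can be checked client-side before submitting. This is an illustrative helper mirroring the documented constraint, not part of the package:

```python
import re

# At least 3 characters; lowercase alphanumerics and hyphens only
# (no underscores, no uppercase letters), per the note above.
JOB_NAME_RE = re.compile(r"^[a-z0-9-]{3,}$")

def is_valid_job_name(name: str) -> bool:
    """Return True if `name` satisfies the documented job-name rules."""
    return JOB_NAME_RE.fullmatch(name) is not None

print(is_valid_job_name("my-job"))   # → True
print(is_valid_job_name("My_Job"))   # → False
```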

The output of spark-submit will be streamed to the console. Use the --no-wait option to return immediately.

Read the output of your Spark job

If you chose not to tail the log when submitting the job, or want to read it again, you can use this command.

azb spark cluster logs \
    --id <my-cluster-id> \
    --name <my-job-name> \
    [--tail] # Tail the log if the task is still running

More information about submitting jobs can be found in the spark-submit documentation.

Connect your cluster to Azure Blob Storage (WASB connection)

Built into this package is native support for connecting your Spark cluster to Azure Blob Storage. To use it, make sure the storage fields in your secrets.cfg file are filled out correctly.

Even if you are just testing and have no need to connect to Azure Blob Storage, you still need to fill out the storage fields in your secrets.cfg file correctly, as this package requires them.

Once you have filled out secrets.cfg with your storage credentials, you will be able to access that storage account from your Spark jobs.

Please note: if you want to access a different Azure Blob Storage account, you will need to recreate your cluster with a secrets.cfg file updated to hold the appropriate storage credentials.

Here's an example of how you may access your data in Blob Storage:

df = spark.read.csv("wasbs://<STORAGE_CONTAINER_NAME>@<STORAGE_ACCOUNT_NAME>.blob.core.windows.net/<BLOB_NAME>")
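The wasbs URL follows a fixed pattern, so it can be built programmatically. A small sketch (the helper name is hypothetical, not part of this package or of Spark):

```python
def wasbs_url(container: str, account: str, blob: str) -> str:
    """Build a wasbs:// URL for Spark to read a blob from Azure Blob Storage."""
    return f"wasbs://{container}@{account}.blob.core.windows.net/{blob}"

# e.g. pass the result to spark.read.csv(...)
print(wasbs_url("data", "mystorageacct", "iris.csv"))
# → wasbs://data@mystorageacct.blob.core.windows.net/iris.csv
```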

Manage your Spark cluster

You can also see your clusters from the CLI:

azb spark cluster list

And get the state of any specified cluster:

azb spark cluster get --id <my-cluster-id>

Finally, you can delete any specified cluster:

azb spark cluster delete --id <my-cluster-id>

Examples

Please see the samples folder for a curated list of samples from Spark-2.2.0.

Next Steps

You can find more documentation here.