AZTK powered by Azure Batch: On-demand, Dockerized, Spark Jobs on Azure

azure-batch azure-storage docker spark spark-jobs


Jacob Freck dd5a5c938a Feature: ssh directly into container (#142 ) * add-user accept ssh-key path and prompt for password * change ssh command to connect directly to container * add --host flag for ssh * update docs * cleaner checking of config options		2017-10-04 15:51:22 -07:00
.vscode	Revert "Revert "Feature: Setup the cluster with the start task." (#26 )" (#28 )	2017-07-14 12:00:47 -04:00
aztk	Feature: ssh directly into container (#142 )	2017-10-04 15:51:22 -07:00
config	Feature/yaml space note (#134 )	2017-10-03 22:55:57 -07:00
custom-scripts	Feature: Support for multiple custom scripts and master only scrips (#93 )	2017-10-02 13:20:11 -07:00
docker-image	Feature/update docker repo (#130 )	2017-09-30 22:29:53 -07:00
docs	Feature: ssh directly into container (#142 )	2017-10-04 15:51:22 -07:00
examples	Feature/samples (#50 )	2017-08-22 18:36:25 -04:00
node_scripts	Enable docker authentication for private images (#92 )	2017-10-03 13:30:12 -07:00
tests	Bug: simplify secrets loading (#133 )	2017-10-03 16:22:48 -07:00
.editorconfig	Initial structure for test framework and CI (#30 )	2017-07-18 14:07:11 -07:00
.gitignore	Bug: rename dtde to aztk, rename thunderbolt to aztk, rename azb to aztk (#125 )	2017-09-29 17:13:22 -07:00
.travis.yml	Bug: rename dtde to aztk, rename thunderbolt to aztk, rename azb to aztk (#125 )	2017-09-29 17:13:22 -07:00
CHANGELOG.md	Bug: rename dtde to aztk, rename thunderbolt to aztk, rename azb to aztk (#125 )	2017-09-29 17:13:22 -07:00
README.md	Update README.md (#136 )	2017-10-03 17:30:09 -07:00
linux_memory.py	Merged PR 5: Merge feature/batch2.0.0 to master	2017-05-12 20:25:47 +00:00
pylintrc	Feature: Support for multiple custom scripts and master only scrips (#93 )	2017-10-02 13:20:11 -07:00
pytest.ini	Feature/secrets.cfg (#43 )	2017-08-21 12:36:45 -07:00
requirements.txt	Feature: config files (#65 )	2017-09-19 14:08:26 -07:00
setup.cfg	Revert "Revert "Feature: Setup the cluster with the start task." (#26 )" (#28 )	2017-07-14 12:00:47 -04:00
setup.py	Bug: rename dtde to aztk, rename thunderbolt to aztk, rename azb to aztk (#125 )	2017-09-29 17:13:22 -07:00

README.md

Azure Distributed Data Engineering Toolkit

Azure Distributed Data Engineering Toolkit is a Python CLI application for provisioning on-demand Spark-on-Docker clusters in Azure. This toolkit is built on top of Azure Batch but does not require any Azure Batch knowledge to use.

Currently, this toolkit is designed to run batch Spark jobs that require additional on-demand compute. Eventually we plan to support other distributed data engineering frameworks in a similar vein. Please let us know which frameworks you'd like for us to support in the future.

Notable Features

Spark cluster provision time of 5 minutes on average
Spark clusters run in Docker containers
Users can bring their own Docker image
Ability to use low-priority VMs for an 80% discount
Built-in support for Azure Blob Storage connection
Built-in Jupyter notebook for interactive experience
Ability to run spark-submit directly from your local machine's CLI

Setup

Clone the repo

    git clone -b stable https://www.github.com/azure/aztk
    
    # You can also clone directly from master to get the latest bits
    git clone https://www.github.com/azure/aztk

Use pip to install the required packages (requires Python 3.5+ and pip 9.0.1+):

    pip install -r requirements.txt

Use setuptools:

    pip install -e .

Initialize the project in a directory (this automatically creates a .aztk folder with config files in your working directory):

    aztk spark init

Fill in the fields for your Batch account and Storage account in your .aztk/secrets.yaml file. We also recommend entering your SSH key info in this file.

This package is built on top of two core Azure services, Azure Batch and Azure Storage. Create those resources via the portal (see Getting Started).

Quickstart Guide

The core experience of this package is centered around a few commands.

    # create your cluster
    aztk spark cluster create

    # monitor and manage your clusters
    aztk spark cluster get
    aztk spark cluster list
    aztk spark cluster delete

    # log in and submit jobs to your cluster
    aztk spark cluster ssh
    aztk spark cluster submit

Create and set up your cluster

First, create your cluster:

    aztk spark cluster create \
        --id <my_cluster_id> \
        --size <number_of_nodes> \
        --vm-size <vm_size>

You can find more information on VM sizes here. Please note that you must use the official SKU name when setting your VM size; these usually take the form "standard_d2_v2".

You can also create your cluster with low-priority VMs at an 80% discount by using --size-low-pri instead of --size. Because mixing low-priority and dedicated VMs is not currently supported, you must set --size 0:

    aztk spark cluster create \
        --id <my_cluster_id> \
        --size 0 \
        --size-low-pri <number_of_low-pri_nodes> \
        --vm-size <vm_size>
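If you script cluster creation, the two invocations above can be assembled programmatically. The following is a minimal sketch, not part of aztk itself: the helper `build_create_command` is hypothetical, and it only uses the flags shown in this README (--id, --size, --size-low-pri, --vm-size). It rejects mixed dedicated and low-priority sizes, mirroring the restriction described above.

```python
import shlex


def build_create_command(cluster_id, vm_size, size=0, size_low_pri=0):
    """Build the argument list for `aztk spark cluster create`.

    Hypothetical helper: exactly one of `size` (dedicated nodes) or
    `size_low_pri` (low-priority nodes) must be greater than zero.
    """
    if size < 0 or size_low_pri < 0:
        raise ValueError("node counts must be non-negative")
    if size > 0 and size_low_pri > 0:
        raise ValueError("mixed dedicated and low-priority VMs are not supported")
    if size == 0 and size_low_pri == 0:
        raise ValueError("request at least one node")

    args = ["aztk", "spark", "cluster", "create",
            "--id", cluster_id,
            "--size", str(size)]
    if size_low_pri > 0:
        # Low-priority clusters keep --size 0, as in the example above.
        args += ["--size-low-pri", str(size_low_pri)]
    args += ["--vm-size", vm_size]
    return args


cmd = build_create_command("my-cluster", "standard_d2_v2", size_low_pri=4)
# Print a shell-safe rendering; pass `cmd` to subprocess.run to execute it.
print(" ".join(shlex.quote(a) for a in cmd))
```

Returning a list rather than a single string avoids shell-quoting problems if the command is later run with `subprocess.run(cmd)`.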

By default, this package runs Spark 2.2.0 with Python 3.5 on an Ubuntu 16.04 Docker image. More info on this image can be found in the docker-image folder in this repo.

NOTE: The cluster id (--id) can contain only alphanumeric characters, hyphens, and underscores, and cannot be longer than 64 characters.
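To catch an invalid id before calling the CLI, you could check it locally. This is a minimal sketch of the rule stated in the note, not aztk's own validation: the function `is_valid_cluster_id` is hypothetical, and it assumes "alphanumeric" means ASCII letters and digits and that an empty id is invalid.

```python
import re

# Assumption: ASCII letters and digits, hyphens, underscores; 1 to 64 characters.
_CLUSTER_ID_RE = re.compile(r"[A-Za-z0-9_-]{1,64}")


def is_valid_cluster_id(cluster_id):
    """Return True if `cluster_id` satisfies the rule in the note above."""
    return _CLUSTER_ID_RE.fullmatch(cluster_id) is not None


for candidate in ("my_cluster-01", "bad id!", "x" * 65):
    print(repr(candidate[:20]), is_valid_cluster_id(candidate))
```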

More information about using a cluster can be found in the cluster documentation.

Check on your cluster status

To check your cluster status, use the get command:

    aztk spark cluster get --id <my_cluster_id>

Submit a Spark job

When your cluster is up, you can submit jobs to run against the cluster:

    aztk spark cluster submit \
        --id <my_cluster_id> \
        --name <my_job_name> \
        [options] \
        <app jar | python file> \
        [app arguments]

NOTE: The job name (--name) must be at least 3 characters long and can contain only lowercase alphanumeric characters and hyphens (no underscores or uppercase letters). Each job you submit must have a unique name.
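The job-name rule can likewise be checked before submitting. The sketch below is illustrative only: `is_valid_job_name` is a hypothetical helper, it assumes ASCII lowercase letters and digits, and the uniqueness check only compares against names you pass in yourself (it does not query the cluster). No maximum length is enforced because the note does not state one.

```python
import re

# Assumption: ASCII lowercase letters, digits and hyphens; at least 3 characters.
_JOB_NAME_RE = re.compile(r"[a-z0-9-]{3,}")


def is_valid_job_name(name, existing_names=()):
    """Return True if `name` follows the rule above and is not already used.

    `existing_names` is a caller-supplied collection of job names that
    have already been submitted.
    """
    if _JOB_NAME_RE.fullmatch(name) is None:
        return False
    return name not in set(existing_names)


submitted = {"pi-estimate-1"}
for candidate in ("pi-estimate-2", "pi-estimate-1", "Pi_Job", "ab"):
    print(candidate, is_valid_job_name(candidate, submitted))
```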

The output of spark-submit is streamed to the console. Use the --no-wait option to return immediately instead. More information about monitoring your job can be found in the spark submit documentation.

To start testing this package, try running a Spark job from the ./examples folder. The examples are a curated list of samples from Spark 2.2.0.

Log in and Interact with your Spark Cluster

Most users will want to work interactively with their Spark clusters. With the aztk spark cluster ssh command, you can SSH into the cluster's master node. This command also helps you port-forward your Spark Web UI and Spark Jobs UI to your local machine:

    aztk spark cluster ssh --id <my_cluster_id>

By default, the Spark Web UI is forwarded to localhost:8080, the Spark Jobs UI to localhost:4040, and Jupyter to localhost:8888.

You can configure these settings in the .aztk/ssh.yaml file.
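Once the ssh command is running, you may want to confirm that the forwarded ports are reachable. The following minimal sketch tries a TCP connection to each default port listed above. The names `DEFAULT_FORWARDED_PORTS`, `is_port_open`, and `check_forwarded_ports` are hypothetical; if you changed the ports in .aztk/ssh.yaml, adjust the dictionary accordingly.

```python
import socket

# Default local ports as described above; edit if you changed .aztk/ssh.yaml.
DEFAULT_FORWARDED_PORTS = {
    "Spark Web UI": 8080,
    "Spark Jobs UI": 4040,
    "Jupyter": 8888,
}


def is_port_open(port, host="127.0.0.1", timeout=0.5):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_forwarded_ports(ports=DEFAULT_FORWARDED_PORTS):
    """Map each service name to whether its local port accepts connections."""
    return {name: is_port_open(port) for name, port in ports.items()}


for name, reachable in check_forwarded_ports().items():
    print("{}: {}".format(name, "reachable" if reachable else "not reachable"))
```

A successful connection only shows that something is listening on the port; it does not verify that the listener is the forwarded Spark or Jupyter service.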

Manage your Spark cluster

You can also see your clusters from the CLI:

    aztk spark cluster list

And get the state of any specified cluster:

    aztk spark cluster get --id <my_cluster_id>

Finally, you can delete any specified cluster:

    aztk spark cluster delete --id <my_cluster_id>

FAQs

Next Steps

You can find more documentation here.