Azure Distributed Data Engineering Toolkit (AZTK)

Azure Distributed Data Engineering Toolkit (AZTK) is a Python CLI application for provisioning on-demand Spark on Docker clusters in Azure. It's a cheap and easy way to get up and running with a Spark cluster, and a great tool for Spark users who want to experiment and start testing at scale.

This toolkit is built on top of Azure Batch but does not require any Azure Batch knowledge to use.

Notable Features

Setup

  1. Clone the repo
    git clone -b stable https://www.github.com/azure/aztk

    # You can also clone directly from master to get the latest bits
    git clone https://www.github.com/azure/aztk
  2. Use pip to install the required packages (requires Python 3.5+ and pip 9.0.1+)
    pip install -r requirements.txt
  3. Use setuptools to install the package:
    pip install -e .
  4. Initialize the project in a directory [This will automatically create a .aztk folder with config files in your working directory]:
    aztk spark init
  5. Fill in the fields for your Batch account and Storage account in your .aztk/secrets.yaml file. (We'd also recommend that you enter SSH key info in this file.)

    This package is built on top of two core Azure services, Azure Batch and Azure Storage. Create those resources via the portal (see Getting Started).
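    As a rough sketch of step 5, the shared-key credentials typically look like the snippet below. The field names here are illustrative only; the template that aztk spark init generates in .aztk/secrets.yaml is authoritative, so match the keys it contains:

    # .aztk/secrets.yaml (illustrative sketch; keep the keys from your generated template)
    batch:
        batchaccountname: <your-batch-account-name>
        batchaccountkey: <your-batch-account-key>
        batchserviceurl: <your-batch-account-url>
    storage:
        storageaccountname: <your-storage-account-name>
        storageaccountkey: <your-storage-account-key>
        storageaccountsuffix: core.windows.net
    # Recommended: the SSH public key used when adding users to a cluster
    default:
        ssh_pub_key: ~/.ssh/id_rsa.pub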

Quickstart Guide

The core experience of this package is centered around a few commands.

# create your cluster
aztk spark cluster create
aztk spark cluster add-user
# monitor and manage your clusters
aztk spark cluster get
aztk spark cluster list
aztk spark cluster delete
# login and submit jobs to your cluster
aztk spark cluster ssh
aztk spark cluster submit

1. Create and set up your cluster

First, create your cluster:

aztk spark cluster create --id my_cluster --size 5 --vm-size standard_d2_v2
  • See our available VM sizes here.
  • The --vm-size argument must be an official SKU name, which usually comes in the form: "standard_d2_v2"
  • You can create low-priority VMs at an 80% discount by using --size-low-pri instead of --size (see the example after this list)
  • By default, AZTK runs Spark 2.2.0 on an Ubuntu 16.04 Docker image. More info here
  • By default, AZTK will create a user (with the username spark) for your cluster
  • The cluster id (--id) can only contain alphanumeric characters, hyphens, and underscores, and cannot be longer than 64 characters.
  • By default, you cannot create clusters of more than 20 cores in total. Visit this page to request a core quota increase.
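For example, to create the same cluster from low-priority VMs instead of dedicated ones (using only the flags described above):

# create a 5-node cluster backed by low-priority VMs
aztk spark cluster create --id my_cluster --size-low-pri 5 --vm-size standard_d2_v2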

More information regarding using a cluster can be found in the cluster documentation

2. Check on your cluster status

To check your cluster status, use the get command:

aztk spark cluster get --id my_cluster

3. Submit a Spark job

When your cluster is ready, you can submit jobs from your local machine to run against the cluster. The output of spark-submit will be streamed to your local console. Run this command from the cloned AZTK repo:

# submit a Java application
aztk spark cluster submit \
    --id my_cluster \
    --name my_java_job \
    --class org.apache.spark.examples.SparkPi \
    --executor-memory 20G \
    path/to/examples.jar 1000

# submit a Python application
aztk spark cluster submit \
    --id my_cluster \
    --name my_python_job \
    --executor-memory 20G \
    path/to/pi.py 1000
  • The aztk spark cluster submit command takes the same parameters as the standard spark-submit command, except instead of specifying --master, AZTK requires that you specify your cluster --id and a unique job --name
  • The job name, --name, argument must be at least 3 characters long
    • It can only contain alphanumeric characters including hyphens but excluding underscores
    • It cannot contain uppercase letters
  • Each job you submit must have a unique name
  • Use the --no-wait option for your command to return immediately (see the sketch after this list)
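For example, a minimal sketch that submits without keeping the console attached and then pulls the output back later (this assumes the aztk spark cluster app-logs subcommand shipped with this release):

# submit and return immediately
aztk spark cluster submit --no-wait \
    --id my_cluster \
    --name my_python_job \
    path/to/pi.py 1000

# fetch the application's output once it has run
aztk spark cluster app-logs --id my_cluster --name my_python_job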

Learn more about the spark submit command here

4. Log in and Interact with your Spark Cluster

Most users will want to work interactively with their Spark clusters. With the aztk spark cluster ssh command, you can SSH into the cluster's master node. This command also helps you port-forward your Spark Web UI and Spark Jobs UI to your local machine:

aztk spark cluster ssh --id my_cluster --user spark

By default, we port forward the Spark Web UI to localhost:8080, Spark Jobs UI to localhost:4040, and the Spark History Server to localhost:18080.

You can configure these settings in the .aztk/ssh.yaml file.
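The forwarded ports and login user are set in that file; a rough sketch follows, but treat the key names as illustrative and check the template generated by aztk spark init for the exact fields:

# .aztk/ssh.yaml (illustrative sketch; your generated template is authoritative)
username: spark              # cluster user to log in as
web_ui_port: 8080            # Spark Web UI forwarded to localhost:8080
job_ui_port: 4040            # Spark Jobs UI forwarded to localhost:4040
job_history_ui_port: 18080   # Spark History Server forwarded to localhost:18080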

NOTE: When working interactively, you may want to use tools like Jupyter or RStudio Server, depending on whether you are a Python or R user. To do so, you need to set up your cluster with the appropriate Docker image and custom scripts.
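As a sketch, the image is selected when the cluster is created; the --docker-repo flag below is assumed from the docker-image documentation, and <your_docker_repo> is a placeholder rather than a specific supported tag:

# create a cluster that runs an alternative Docker image (e.g. one with Jupyter or RStudio Server)
aztk spark cluster create --id my_cluster --size 5 --vm-size standard_d2_v2 \
    --docker-repo <your_docker_repo>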

5. Manage and Monitor your Spark Cluster

You can also see your clusters from the CLI:

aztk spark cluster list

And get the state of any specified cluster:

aztk spark cluster get --id <my_cluster_id>

Finally, you can delete any specified cluster:

aztk spark cluster delete --id <my_cluster_id>

FAQs

Next Steps

You can find more documentation here