# Distributed Tools for Data Engineering (DTDE)
A suite of distributed tools to help engineers scale their work into Azure.
# Spark on DTDE
## Setup
1. Clone the repo
2. Use pip to install required packages:
```bash
pip install -r requirements.txt
```
3. Use setuptools:
```bash
pip install -e .
```
4. Rename `configuration.cfg.template` to `configuration.cfg` and fill in the fields for your Batch account and Storage account. These values can be found in the Azure portal. A sketch of the filled-in file follows the links below.
To complete this step, you will need an Azure account that has a Batch account and Storage account:
- To create an Azure account: https://azure.microsoft.com/free/
- To create a Batch account: https://docs.microsoft.com/en-us/azure/batch/batch-account-create-portal
- To create a Storage account: https://docs.microsoft.com/en-us/azure/storage/storage-create-storage-account
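For reference, a filled-in `configuration.cfg` might look something like the sketch below. The section and field names here are illustrative assumptions, not guaranteed to match; the template file in this repo is the source of truth:
```ini
[Batch]
# Values from your Batch account in the Azure portal (all values below are placeholders)
batchaccountname = mybatchaccount
batchaccountkey = <batch-account-key>
batchserviceurl = https://mybatchaccount.westus.batch.azure.com

[Storage]
# Values from your Storage account in the Azure portal
storageaccountname = mystorageaccount
storageaccountkey = <storage-account-key>
storageaccountsuffix = core.windows.net
```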
## Getting Started
The entire experience of this package is centered around a few commands.
### Create and set up your cluster
First, create your cluster:
```bash
azb spark cluster create \
    --id <my-cluster-id> \
    --size <number of nodes> \
    --vm-size <vm-size> \
    --custom-script <path to custom bash script to run on each node> (optional) \
    --wait/--no-wait (optional)
```
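For example, to create a hypothetical three-node cluster and wait for it to be ready (the cluster id and VM size below are illustrative placeholders, not required values):
```bash
azb spark cluster create \
    --id my-spark-cluster \
    --size 3 \
    --vm-size standard_d2_v2 \
    --wait
```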
You can also create your cluster with [low-priority](https://docs.microsoft.com/en-us/azure/batch/batch-low-pri-vms) VMs at an 80% discount by using **--size-low-pri** instead of **--size**:
```bash
azb spark cluster create \
    --id <my-cluster-id> \
    --size-low-pri <number of low-pri nodes> \
    --vm-size <vm-size>
```
By default, this package runs Spark in Docker from an Ubuntu 16.04 base image on an Ubuntu 16.04 VM. More info on this image can be found in the **docker-images** folder in this repo.
You can opt out of using this image and use the Azure CentOS DSVM instead, which has Spark 2.0.2 pre-installed (*as of 07/24/17*). To do this, use the **--no-docker** flag, and the cluster will default to using the Azure DSVM.
```bash
azb spark cluster create \
    --id <my-cluster-id> \
    --size-low-pri <number of low-pri nodes> \
    --vm-size <vm-size> \
    --no-docker
```
You can also add a user directly with this command using the same inputs as the `add-user` command described below.
#### Add a user to your cluster to connect
When your cluster is ready, create a user for your cluster (if you didn't already do so when creating it):
```bash
# **Recommended usage**
# Add a user with an SSH public key. It will use the value specified in
# configuration.cfg (either a path to the key file or the actual key).
azb spark cluster add-user \
    --id <my-cluster-id> \
    --username <username>

# You can also explicitly specify the SSH public key (path or actual key)
azb spark cluster add-user \
    --id <my-cluster-id> \
    --username <username> \
    --ssh-key ~/.ssh/id_rsa.pub

# **Not recommended**
# You can also just specify a password
azb spark cluster add-user \
    --id <my-cluster-id> \
    --username <username> \
    --password <password>
```
NOTE: The cluster id (--id) may only contain alphanumeric characters, hyphens, and underscores, and cannot be longer than 64 characters.
### Submit a Spark job
Now you can submit jobs to run against the cluster:
```bash
azb spark app submit \
    --id <my-cluster-id> \
    --name <my-job-name> \
    [options] \
    <app jar | python file> \
    [app arguments]
```
NOTE: The job name (--name) must be at least 3 characters long, may only contain alphanumeric characters and hyphens (underscores are not allowed), and cannot contain uppercase letters.
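For example, to run a hypothetical PySpark script `pi.py` with a single application argument (the file and names below are illustrative placeholders):
```bash
azb spark app submit \
    --id my-spark-cluster \
    --name my-pi-job \
    ./pi.py 100
```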
The output of spark-submit will be streamed to the console. Use the `--no-wait` option to return immediately.
### Read the output of your Spark job
If you decided not to tail the log when submitting the job, or want to read it again, you can use this command:
```bash
azb spark app logs \
    --id <my-cluster-id> \
    --name <my-job-name> \
    [--tail] # Tail the log while the task is still running
```
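For example, to read back (and keep tailing) the log of the hypothetical job submitted above:
```bash
azb spark app logs \
    --id my-spark-cluster \
    --name my-pi-job \
    --tail
```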
### Interact with your Spark cluster
To view the Spark UI, open an SSH tunnel with the `--masterui` option and a local port to map to:
```bash
azb spark cluster ssh \
    --id <my-cluster-id> \
    --masterui <local-port> \
    --username <username>
```
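For example, to map the Spark master UI to local port 8080 (the port, cluster id, and username below are illustrative placeholders) and then browse to http://localhost:8080:
```bash
azb spark cluster ssh \
    --id my-spark-cluster \
    --masterui 8080 \
    --username spark-user
```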
Optionally, you can also open up a Jupyter notebook with the `--jupyter` option to work in:
```bash
azb spark cluster ssh \
    --id <my-cluster-id> \
    --masterui <local-port> \
    --jupyter <local-port>
```
### Manage your Spark cluster
You can also see your clusters from the CLI:
```bash
azb spark cluster list
```
And get the state of any specified cluster:
```bash
azb spark cluster get --id <my-cluster-id>
```
Finally, you can delete any specified cluster:
```bash
azb spark cluster delete --id <my-cluster-id>
```