How to use PySpark on Azure with AZTK

This guide shows PySpark users how to create an Apache Spark cluster. In this tutorial, we install and set up Anaconda3-5.0.0 (Python 3.6.2), Jupyter Notebook, and (optionally) mmlspark for you.

This guide does not require any knowledge about Azure or cloud infrastructure.

The goals of this tutorial are to:

Provision a Spark cluster to use with PySpark
Interact with that Spark cluster through a Jupyter Notebook
Achieve both of the above quickly, cheaply, and easily

For this tutorial, we assume that you have the following requirements:

Set up AZTK

In your working directory, run aztk spark init to initialize AZTK. This command creates a .aztk/ folder in your working directory. Fill out the .aztk/secrets.yaml file with your Azure Batch and Azure Storage account secrets. We also recommend setting your SSH key here.

For more details, see this section

For this tutorial, please copy the folder aztk/custom-scripts into your working directory. You can find it in the AZTK repo that you cloned when installing AZTK. (If you are working directly in the cloned repo, you can skip this step because the custom-scripts/ folder is already there.)

Provision your cluster

Set up your Spark cluster with the aztk-python Docker image and a custom script

To provision your Spark cluster for PySpark, you will need to edit .aztk/cluster.yaml. Set the parameters docker_repo and custom_scripts as follows:


# .aztk/cluster.yaml

...
docker_repo: aztk/python:spark2.2.0-python3.6.2-base
custom_scripts:
  - script: custom-scripts/jupyter.sh
    runOn: master
...

This tells AZTK to build your cluster from the aztk-python image and to configure Jupyter Notebook so that it runs as soon as the cluster is set up. For this to work, custom-scripts/jupyter.sh must be a valid path relative to your working directory.

Feel free to modify the other parameters as needed.
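Because the cluster configuration depends on custom-scripts/jupyter.sh resolving from your working directory, it can help to check the path before creating the cluster. The following is a minimal sketch using only the Python standard library; missing_scripts is a hypothetical helper name and is not part of AZTK.

```python
from pathlib import Path


def missing_scripts(script_paths, working_dir="."):
    """Return the script paths that are not files relative to working_dir."""
    base = Path(working_dir)
    return [p for p in script_paths if not (base / p).is_file()]


if __name__ == "__main__":
    # The path below is the one used in .aztk/cluster.yaml in this guide.
    missing = missing_scripts(["custom-scripts/jupyter.sh"])
    if missing:
        print("Missing custom scripts:", ", ".join(missing))
    else:
        print("All custom scripts found.")
```

Run it from the same directory in which you ran aztk spark init, since the paths are resolved relative to that directory.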

(Optional) Set up mmlspark

To install mmlspark, add the following line at the end of .aztk/spark-defaults.conf:

...
spark.jars.packages    Azure:mmlspark:0.12
...
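If you prefer to script this step, the sketch below appends the line only when it is not already present, so running it twice does not duplicate the setting. It uses only the Python standard library; ensure_line is a hypothetical helper name, not an AZTK command. It does not merge with an existing spark.jars.packages entry that lists other packages, so check the file by hand in that case.

```python
from pathlib import Path

# The setting shown above, copied verbatim from this guide.
MMLSPARK_LINE = "spark.jars.packages    Azure:mmlspark:0.12"


def ensure_line(conf_path, line=MMLSPARK_LINE):
    """Append line to conf_path unless it is already there.

    Lines are compared after splitting on whitespace, so differences in
    spacing between key and value are ignored. Returns True if appended.
    """
    path = Path(conf_path)
    text = path.read_text() if path.exists() else ""
    wanted = line.split()
    if any(existing.split() == wanted for existing in text.splitlines()):
        return False
    # Make sure the new setting starts on its own line.
    separator = "" if text == "" or text.endswith("\n") else "\n"
    path.write_text(text + separator + line + "\n")
    return True


# Example, run from your working directory:
# ensure_line(".aztk/spark-defaults.conf")
```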

Create cluster

You can now run the aztk spark cluster create command:

aztk spark cluster create --id <my_spark_cluster> --size 10

This command will automatically use the contents that we modified in .aztk/cluster.yaml (and .aztk/spark-defaults.conf).

Interact with PySpark

After you've run your aztk spark cluster create command, you will need to wait a few minutes for your cluster to be ready.

Once it is ready, you can use the following command to ssh into your cluster's master node. This command will also port forward the necessary ports to start using Jupyter and the standard Spark UIs:

aztk spark cluster ssh --id <my_spark_cluster>

Once you've ssh'ed in, you'll be in the master node of your cluster. You can simply type pyspark to get going.

While the ssh session is open, you can also visit localhost:8080 to see the Spark UI and monitor the state of your cluster.
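If a page does not load, it can be useful to confirm that the forwarded ports are actually reachable from your local machine. The sketch below is one way to do that with the Python standard library, assuming the ports named in this guide (8080 for the Spark UI and 8888 for Jupyter); is_port_open is a hypothetical helper name.

```python
import socket


def is_port_open(port, host="localhost", timeout=1.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unresolved hosts.
        return False


if __name__ == "__main__":
    # Ports as described in this guide; adjust if your setup differs.
    for name, port in [("Spark UI", 8080), ("Jupyter", 8888)]:
        state = "reachable" if is_port_open(port) else "not reachable"
        print(f"{name} (localhost:{port}): {state}")
```

Run this on your local machine while the aztk spark cluster ssh session is still open, because the port forwarding ends when the session closes.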

Interact with Jupyter Notebook

Optionally, you can use Jupyter Notebook to interact with your cluster. On your local machine, open up your favorite browser and go to localhost:8888 to use Jupyter. When creating a new file, select PySpark to automatically load your SparkSession into the variable spark.
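Once a PySpark notebook has loaded spark, a small job is a quick way to confirm that the session can run work on the cluster. The following is a minimal sketch that assumes the spark variable described above; sum_of_squares is a hypothetical helper name and is not part of AZTK or PySpark.

```python
def sum_of_squares(spark, n=100):
    """Sum x * x for x in range(n) using the cluster behind `spark`."""
    # parallelize() distributes the numbers across the cluster as an RDD.
    numbers = spark.sparkContext.parallelize(range(n))
    # map() runs on the executors; sum() returns the total to the driver.
    return numbers.map(lambda x: x * x).sum()


# In a PySpark notebook, where `spark` is already defined:
# sum_of_squares(spark)  # returns 328350 for the default n=100
```

While the job runs, it should also appear in the Spark UI, which gives a second confirmation that the notebook is talking to the cluster.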

(Optional) Start using mmlspark!

In your Jupyter Notebook, you can start using mmlspark by simply importing the library:

import mmlspark
...
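If the import fails, the package was most likely not picked up from .aztk/spark-defaults.conf when the cluster was created. The sketch below checks whether a module can be found without importing it, using only the Python standard library; is_installed is a hypothetical helper name.

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module can be located on the Python path."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    if is_installed("mmlspark"):
        print("mmlspark is available.")
    else:
        print("mmlspark was not found; check .aztk/spark-defaults.conf.")
```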
