* added python container, jupyter install script, vanilla container

* Update README.md

* Create README.md

* Create README.md

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* Update jupyter.sh

* Update README.md

* Update README.md

* python

* readme update

* docker updates

* Update README.md

* Update 12-docker-image.md

* Update constants.py

* add image files for wiki

* update imageS

* .

* dockerfile typo

* dockerfiles

* Removed r

* readme

* update constants.py

* update readme

* readme updates

* readme updates
This commit is contained in:
JS 2017-11-09 00:43:32 -08:00 committed by GitHub
Parent d44f0ebf08
Commit 65a56dd657
No key found matching this signature
GPG key ID: 4AEE18F83AFDEB23
20 changed files: 428 additions and 238 deletions

View file

@ -1,5 +1,5 @@
# Azure Distributed Data Engineering Toolkit
Azure Distributed Data Engineering Toolkit is a python CLI application for provisioning on-demand Spark on Docker clusters in Azure. It's a cheap and easy way to get up and running with a Spark cluster, and a great tool for Spark users who want to experiment and start testing at scale.
# Azure Distributed Data Engineering Toolkit (AZTK)
Azure Distributed Data Engineering Toolkit (AZTK) is a python CLI application for provisioning on-demand Spark on Docker clusters in Azure. It's a cheap and easy way to get up and running with a Spark cluster, and a great tool for Spark users who want to experiment and start testing at scale.
This toolkit is built on top of Azure Batch but does not require any Azure Batch knowledge to use.
@ -10,8 +10,10 @@ Currently, this toolkit is designed to run batch Spark jobs that require additio
- Spark clusters run in Docker containers
- Users can bring their own Docker image
- Ability to use low-priority VMs for an 80% discount
- Built in support for Azure Blob Storage connection
- Built in Jupyter notebook for interactive experience
- Built in support for Azure Blob Storage and Azure Data Lake connection
- Optional Jupyter Notebook for pythonic interactive experience
- [coming soon] Optional RStudio Server for an interactive experience in R
- Tailored Docker image for PySpark and [coming soon] SparklyR
- Ability to run _spark submit_ directly from your local machine's CLI
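For example, the last point above means a job can go straight from a local shell to the cluster. A minimal sketch (the `submit` subcommand and its flags are assumptions here, not taken from this README):
```sh
# Hypothetical: submit a local PySpark script to a running cluster from your machine
aztk spark cluster submit \
    --id <my_cluster_id> \
    --name my-pi-job \
    ./pi.py 100
```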
## Setup
@ -78,7 +80,7 @@ aztk spark cluster create \
--vm-size <vm_size>
```
By default, this package runs Spark 2.2.0 with Python 3.5 on an Ubuntu16.04 Docker image. More info on this image can be found in the [docker-images](/docker-image) folder in this repo.
By default, this package runs Spark 2.2.0 on an Ubuntu 16.04 Docker image. More info on this image can be found in the [docker-images](/docker-image) folder in this repo.
NOTE: The cluster id (`--id`) can only contain alphanumeric characters including hyphens and underscores, and cannot contain more than 64 characters.
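For instance, a create call that satisfies this naming rule might look like the following sketch (`--id` and `--vm-size` appear above; the `--size` flag for node count is an assumption):
```sh
# Sketch: cluster id limited to alphanumerics, hyphens and underscores
aztk spark cluster create \
    --id my_spark-cluster-01 \
    --size 3 \
    --vm-size standard_d2_v2
```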
@ -112,7 +114,7 @@ Most users will want to work interactively with their Spark clusters. With the `
```bash
aztk spark cluster ssh --id <my_cluster_id>
```
By default, we port forward the Spark Web UI to *localhost:8080*, Spark Jobs UI to *localhost:4040*, and Jupyter to *localhost:8888*.
By default, we port forward the Spark Web UI to *localhost:8080*, Spark Jobs UI to *localhost:4040*, and the Spark History Server to *localhost:18080*.
You can configure these settings in the *.aztk/ssh.yaml* file.
@ -135,7 +137,7 @@ aztk spark cluster delete --id <my_cluster_id>
## FAQs
- [How do I connect to Azure Storage (WASB)?](./docs/30-cloud-storage.md)
- [I want to use a different version of Spark / Python](./docs/12-docker-image.md)
- [I want to use a different version of Spark](./docs/12-docker-image.md)
- [How do I SSH into my Spark cluster's master node?](./docs/10-clusters.md#ssh-and-port-forwarding)
- [How do I interact with my Spark cluster using a password instead of an SSH-key?](./docs/10-clusters.md#interactive-mode)
- [How do I change my cluster default settings?](./docs/13-configuration.md)

View file

@ -3,7 +3,7 @@ import os
"""
DOCKER
"""
DEFAULT_DOCKER_REPO = "jiata/aztk:0.1.0-spark2.2.0-python3.5.4"
DEFAULT_DOCKER_REPO = "jiata/aztk-base:0.1.0-spark2.2.0"
DOCKER_SPARK_CONTAINER_NAME = "spark"
# DOCKER

View file

@ -15,7 +15,7 @@ size: 2
username: spark
# docker_repo: <name of docker image repo (for more information, see https://github.com/Azure/aztk/blob/master/docs/12-docker-image.md)>
docker_repo: jiata/aztk:0.1.0-spark2.2.0-python3.5.4
docker_repo: jiata/aztk-base:0.1.0-spark2.2.0
# # optional custom scripts to run on the Spark master, Spark worker or all nodes in the cluster
# custom_scripts:

55
custom-scripts/jupyter.sh Normal file
View file

@ -0,0 +1,55 @@
#!/bin/bash
# This custom script only works on Docker images where Jupyter is pre-installed
#
# This custom script has been tested to work on the following docker images:
# - jiata/aztk-python:0.1.0-spark2.2.0-python3.6.2
# - jiata/aztk-python:0.1.0-spark2.1.0-python3.6.2
# - jiata/aztk-python:0.1.0-spark1.6.3-python3.6.2
if [ "$IS_MASTER" = "1" ]; then
PYSPARK_DRIVER_PYTHON="/.pyenv/versions/${USER_PYTHON_VERSION}/bin/jupyter"
JUPYTER_KERNELS="/.pyenv/versions/${USER_PYTHON_VERSION}/share/jupyter/kernels"
# disable password/token on jupyter notebook
jupyter notebook --generate-config --allow-root
JUPYTER_CONFIG='/.jupyter/jupyter_notebook_config.py'
echo >> $JUPYTER_CONFIG
echo -e 'c.NotebookApp.token=""' >> $JUPYTER_CONFIG
echo -e 'c.NotebookApp.password=""' >> $JUPYTER_CONFIG
# get master ip
MASTER_IP=$(hostname -i)
# remove existing kernels
rm -rf $JUPYTER_KERNELS/*
# set up jupyter to use pyspark
mkdir $JUPYTER_KERNELS/pyspark
touch $JUPYTER_KERNELS/pyspark/kernel.json
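# NOTE: the heredoc below is left unquoted on purpose, so $MASTER_IP and $SPARK_HOME
# are expanded now (while this script runs on the node) and written literally into kernel.json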
cat << EOF > $JUPYTER_KERNELS/pyspark/kernel.json
{
"display_name": "PySpark",
"language": "python",
"argv": [
"python",
"-m",
"ipykernel",
"-f",
"{connection_file}"
],
"env": {
"SPARK_HOME": "$SPARK_HOME",
"PYSPARK_PYTHON": "python",
"PYSPARK_SUBMIT_ARGS": "--master spark://$MASTER_IP:7077 pyspark-shell"
}
}
EOF
# start jupyter notebook from /jupyter
cd /jupyter
(PYSPARK_DRIVER_PYTHON=$PYSPARK_DRIVER_PYTHON PYSPARK_DRIVER_PYTHON_OPTS="notebook --no-browser --port=8888 --allow-root" pyspark &)
fi
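One way to wire this script into a cluster (a sketch only; the `custom_scripts` entry mirrors the commented example in *.aztk/cluster.yaml* above, and the `runOn` field name and value are assumptions):
```sh
# Hypothetical: register jupyter.sh so it runs on the Spark master at provision time
cat >> .aztk/cluster.yaml << 'EOF'
custom_scripts:
  - script: ./custom-scripts/jupyter.sh
    runOn: master
EOF
```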

View file

@ -1,60 +0,0 @@
# Ubuntu 16.04 (Xenial)
FROM ubuntu:16.04
# set version of python required for thunderbolt application
ENV AZTK_PYTHON_VERSION=3.5.4
# modify these ARGs on build time to specify your desired versions of Spark/Hadoop and Python
ARG SPARK_VERSION_KEY=spark-2.1.0-bin-hadoop2.7
ARG PYTHON_VERSION=3.5.4
# set up apt-get
RUN apt-get update -y && \
apt-get install -y --no-install-recommends make build-essential libssl-dev zlib1g-dev \
libbz2-dev libreadline-dev libsqlite3-dev wget curl llvm git libncurses5-dev \
libncursesw5-dev xz-utils tk-dev
# installing [software-properties-common] so that we can use [apt-add-repository] to add the repository [ppa:webupd8team/java] form which we install Java8
RUN apt-get install -y --no-install-recommends software-properties-common && \
apt-add-repository ppa:webupd8team/java -y && \
apt-get update -y
# installing java
RUN apt-get install -y --no-install-recommends default-jdk
# install and setup pyenv
RUN git clone git://github.com/yyuu/pyenv.git .pyenv
RUN git clone https://github.com/yyuu/pyenv-virtualenv.git ~/.pyenv/plugins/pyenv-virtualenv
ENV HOME /
ENV PYENV_ROOT $HOME/.pyenv
ENV PATH $PYENV_ROOT/shims:$PYENV_ROOT/bin:$PATH
RUN eval "$(pyenv init -)"
RUN echo 'eval "$(pyenv init -)"' >> ~/.bashrc
# install thunderbolt required python version & the user specified version of python
RUN pyenv install -f $AZTK_PYTHON_VERSION && \
pyenv install -s $PYTHON_VERSION
RUN pyenv global $PYTHON_VERSION
# install jupyter
RUN pip install jupyter && \
pip install --upgrade jupyter
# install p4j
RUN curl https://pypi.python.org/packages/1f/b0/882c144fe70cc3f1e55d62b8611069ff07c2d611d99f228503606dd1aee4/py4j-0.10.0.tar.gz | tar xvz -C /home && \
cd /home/py4j-0.10.0 && \
python setup.py install
# install spark
RUN curl https://d3kbcqa49mib13.cloudfront.net/$SPARK_VERSION_KEY.tgz | tar xvz -C /home
# set up symlink for SPARK HOME
RUN ln -s /home/$SPARK_VERSION_KEY /home/spark-current
# set env vars
ENV JAVA_HOME /usr/lib/jvm/java-1.8.0-openjdk-amd64
ENV SPARK_HOME /home/spark-current
ENV USER_PYTHON_VERSION $PYTHON_VERSION
ENV PATH $SPARK_HOME/bin:$PATH
CMD ["/bin/bash"]

View file

@ -1,17 +1,40 @@
# Docker Image Gallery
Azure Distributed Data Engineering Toolkit uses Docker containers to run Spark.
Please refer to the docs for details [how to select a docker-repo at cluster creation time](../docs/12-docker-image.md).
Please refer to the docs for details on [how to select a docker-repo at cluster creation time](../docs/12-docker-image.md).
### Supported Base Images
We support several base images:
- [Docker Hub] jiata/aztk:<image-version>-spark2.2.0-python3.5.4
- [Docker Hub] jiata/aztk:<image-version>-spark2.1.0-python3.5.4
- [Docker Hub] jiata/aztk:<image-version>-spark1.6.3-python3.5.4
- [Docker Hub] jiata/aztk:<image-version>-spark2.1.0-python2.7.13
- [Docker Hub] jiata/aztk:<image-version>-spark1.6.3-python2.7.13
## Supported Images
By default, this toolkit will use the base Spark image, __aztk-base__. This image contains the bare minimum to get Spark up and running in standalone mode.
NOTE: Replace **<image-version>** with the version of the image you wish to use. For example: **jiata/aztk:0.1.0-spark2.2.0-python3.5.4**
On top of that, we also provide additional flavors of Spark images: one geared towards the Python user (PySpark) and another, coming soon, geared towards the R user (SparklyR or SparkR).
Docker Image | Image Type | User Language(s) | What's Included?
:-- | :-- | :-- | :--
[aztk-base](https://hub.docker.com/r/jiata/aztk-base/) | Base | Java, Scala | `Spark`
[aztk-python](https://hub.docker.com/r/jiata/aztk-python/) | Pyspark | Python | `Anaconda`</br>`Jupyter Notebooks` </br> `PySpark`
(coming soon) [aztk-r](https://hub.docker.com/r/jiata/aztk-r/) | SparklyR | R | `CRAN`</br>`RStudio Server`</br>`SparklyR and SparkR`
__aztk-python__ and __aztk-r__ images are built on top of the __aztk-base__ image.
Today, all the AZTK images are hosted on Docker Hub under [jiata](https://hub.docker.com/r/jiata).
### Matrix of Supported Container Images:
Docker Repo (hosted on Docker Hub) | Spark Version | Python Version | R Version
:-- | :-- | :-- | :--
jiata/aztk-base:0.1.0-spark2.2.0 __(default)__ | v2.2.0 | -- | --
jiata/aztk-base:0.1.0-spark2.1.0 | v2.1.0 | -- | --
jiata/aztk-base:0.1.0-spark1.6.3 | v1.6.3 | -- | --
jiata/aztk-python:0.1.0-spark2.2.0-anaconda3-5.0.0 | v2.2.0 | v3.6.2 | --
jiata/aztk-python:0.1.0-spark2.1.0-anaconda3-5.0.0 | v2.1.0 | v3.6.2 | --
jiata/aztk-python:0.1.0-spark1.6.3-anaconda3-5.0.0 | v1.6.3 | v3.6.2 | --
[coming soon] jiata/aztk-r:0.1.0-spark2.2.0-r3.4.1 | v2.2.0 | -- | v3.4.1
[coming soon] jiata/aztk-r:0.1.0-spark2.1.0-r3.4.1 | v2.1.0 | -- | v3.4.1
[coming soon] jiata/aztk-r:0.1.0-spark1.6.3-r3.4.1 | v1.6.3 | -- | v3.4.1
If you have requests to add to the list of supported images, please file a GitHub issue.
NOTE: Spark clusters that use the __aztk-python__ and __aztk-r__ images take longer to provision because these Docker images are significantly larger than the __aztk-base__ image.
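To pick one of the images in the matrix above at cluster creation time, pass it to the `--docker-repo` flag described in [12-docker-image.md](../docs/12-docker-image.md), for example:
```sh
# Provision a cluster on the PySpark image instead of the default aztk-base image
aztk spark cluster create ... --docker-repo jiata/aztk-python:0.1.0-spark2.2.0-anaconda3-5.0.0
```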
### Gallery of 3rd Party Images
Since this toolkit uses Docker containers to run Spark, users can bring their own images. Here's a list of 3rd party images:
@ -19,94 +42,68 @@ Since this toolkit uses Docker containers to run Spark, users can bring their ow
(See below for a how-to guide on building your own images for the Azure Distributed Data Engineering Toolkit)
# How to use my own Docker Image
# How do I use my own Docker Image?
Building your own Docker Image to use with this toolkit has many advantages for users who want more customization over their environment. For some, this may look like installing specific, and even private, libraries that their Spark jobs require. For others, it may just be setting up a version of Spark, Python or R that fits their particular needs.
This section is for users who want to build their own docker images.
## Base Docker Images to build with
By default, the Azure Distributed Data Engineering Toolkit uses **Spark2.2.0-Python3.5.4** as its base image. However, you can build from any of the following supported base images:
- Spark2.2.0 and Hadoop2.7 and Python3.5.4
- Spark2.1.0 and Hadoop2.7 and Python3.5.4
- Spark2.1.0 and Hadoop2.7 and Python2.7.13
- Spark1.6.3 and Hadoop2.6 and Python3.5.4
- Spark1.6.3 and Hadoop2.6 and Python2.7.13
All the base images above are built on a vanilla ubuntu16.04-LTS image and comes pre-baked with Jupyter Notebook and a connection to Azure Blob Storage (WASB).
Currently, the images are hosted on [Docker Hub (jiata/aztk)](https://hub.docker.com/r/jiata/aztk).
If you have requests to add to the list of supported base images, please file a new Github issue.
## Building Your Own Docker Image
Azure Distributed Data Engineering Toolkit supports custom Docker images. To guarantee that your Spark deployment works, you can either build your own Docker image on top or beneath one of our supported base images _OR_ you can modify this supported Dockerfile and build your own image.
The Azure Distributed Data Engineering Toolkit supports custom Docker images. To guarantee that your Spark deployment works, we recommend that you build on top of one of our __aztk-base__ images. You can also build on top of our __aztk-python__ or __aztk-r__ images, but note that they are themselves built on top of the __aztk-base__ image.
To build your own image, you can either build _on top_ of or _beneath_ one of our supported images, _OR_ you can simply modify one of the supported Dockerfiles to build your own.
### Building on top
You can choose to build on top of one of our base images by using the **FROM** keyword in your Dockerfile:
```python
You can build on top of our images by referencing the __aztk-base__ image in the **FROM** keyword of your Dockerfile:
```sh
# Your custom Dockerfile
FROM jiata/aztk:<aztk-image-version>-spark2.2.0-python3.5.4
FROM jiata/aztk-base:0.1.0-spark2.2.0
...
```
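A slightly fuller, hypothetical example of layering on the base image without disturbing its environment variables:
```sh
# Hypothetical custom Dockerfile built on top of aztk-base
FROM jiata/aztk-base:0.1.0-spark2.2.0

# add your own libraries (pip is assumed to resolve through the base image's pyenv shims)
RUN pip install numpy pandas

# leave JAVA_HOME, SPARK_HOME, AZTK_PYTHON_VERSION and PATH as set by the base image
CMD ["/bin/bash"]
```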
### Building beneath
You can alternatively build beneath one of our base images by pulling down one of base images' Dockerfile and setting the **FROM** keyword to pull from your Docker image's location:
```python
# The Dockerfile from one of supported base image
To build beneath one of our images, modify one of our Dockerfiles so that the **FROM** keyword pulls from your Docker image's location (as opposed to the default, which is a base Ubuntu image):
```sh
# One of the Dockerfiles that AZTK supports
# Change the FROM statement to point to your hosted image repo
FROM my_username/my_repo:latest
...
```
NOTE: Currently, we do not support private Docker repos.
Please note that for this method to work, your Docker image must have been built on Ubuntu.
## About the Dockerfile
The Dockerfile is used to build the Docker images used by this toolkit.
## Required Environment Variables
When layering your own Docker image, make sure your image does not interfere with the environment variables set in the __aztk-base__ Dockerfile, otherwise it may not work on AZTK.
You can modify this Dockerfile to build your own image. If you plan to do so, please continue reading the below sections.
Please make sure that the following environment variables are set:
- AZTK_PYTHON_VERSION
- JAVA_HOME
- SPARK_HOME
### Specifying Spark and Python Version
This Dockerfile takes in a few variables at build time that allow you to specify your desired Spark and Python versions: **PYTHON_VERSION** and **SPARK_VERSION_KEY**.
```sh
# For example, if I want to use Python 2.7.13 with Spark 1.6.3 I would build the image as follows:
docker build \
--build-arg PYTHON_VERSION=2.7.13 \
--build-arg SPARK_VERSION_KEY=spark-1.6.3-bin-hadoop2.6 \
-t <my_image_tag> .
```
**SPARK_VERSION_KEY** is used to locate which version of Spark to download. These are the values that have been tested:
- spark-1.6.3-bin-hadoop2.6
- spark-2.1.0-bin-hadoop2.7
- spark-2.2.0-bin-hadoop2.7
For a full list of supported keys, please see this [page](https://d3kbcqa49mib13.cloudfront.net)
NOTE: Do not include the '.tgz' suffix as part of the Spark version key.
**PYTHON_VERSION** is used to set the version of Python for your cluster. These are the values that have been tested:
- 3.5.4
- 2.7.13
NOTE: Most version of Python will work. However, when selecting your Python version, please make sure that the it is compatible with your selected version of Spark. Today, it is also a requirement that your selected verion of Python can run Jupyter Notebook.
### Required Environment Variables
When layering your own Docker image, make sure your image does not intefere with the environment variables set in this Dockerfile, otherwise it may not work.
If you want to use your own version of Spark, please make sure that the following environment variables are set.
You also need to make sure that __PATH__ is correctly configured with $SPARK_HOME
- PATH=$SPARK_HOME/bin:$PATH
By default, these are set as follows:
``` sh
# An example of required environment variables
ENV AZTK_PYTHON_VERSION 3.5.4
ENV JAVA_HOME /usr/lib/jvm/java-1.8.0-openjdk-amd64
ENV SPARK_HOME /home/spark-current
ENV PYSPARK_PYTHON python
ENV USER_PYTHON_VERSION $PYTHON_VERSION
ENV PATH $SPARK_HOME/bin:$PATH
```
If you are using your own version of Spark, make sure that it is symlinked by "/home/spark-current". **$SPARK_HOME** must also point to "/home/spark-current".
## Hosting your Docker Image
By default, this toolkit assumes that your Docker images are publicly hosted on Docker Hub. However, we also support hosting your images privately.
See [here](https://github.com/Azure/aztk/blob/master/docs/12-docker-image.md#using-a-custom-docker-image-that-is-privately-hosted) to learn more about using privately hosted Docker Images.
## Learn More
The Dockerfiles in this directory are used to build the Docker images used by this toolkit. Please reference the individual directories for more information on each Dockerfile:
- [Base](./base)
- [Python](./python)
- [coming soon] R

View file

@ -0,0 +1,26 @@
# Base AZTK Docker image
This Dockerfile is used to build the __aztk-base__ image used by this toolkit. This Dockerfile produces the Docker image that is selected by AZTK by default.
You can modify this Dockerfile to build your own image.
## How to build this image
This Dockerfile takes in a single variable at build time that allows you to specify your desired Spark version: **SPARK_VERSION_KEY**.
By default, this image also installs Python v3.5.4, which is required by this toolkit.
```sh
# For example, if I want to use Spark 1.6.3 I would build the image as follows:
docker build \
--build-arg SPARK_VERSION_KEY=spark-1.6.3-bin-hadoop2.6 \
-t <my_image_tag> .
```
**SPARK_VERSION_KEY** is used to locate which version of Spark to download. These are the values that have been tested and known to work:
- spark-1.6.3-bin-hadoop2.6
- spark-2.1.0-bin-hadoop2.7
- spark-2.2.0-bin-hadoop2.7
For a full list of supported keys, please see this [page](https://d3kbcqa49mib13.cloudfront.net)
NOTE: Do not include the '.tgz' suffix as part of the Spark version key.
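After building, a quick smoke test of the resulting image could look like this (a sketch; it assumes `spark-submit` resolves because the image puts `$SPARK_HOME/bin` on `PATH`):
```sh
# Print the Spark version baked into the image, then exit
docker run -it --rm <my_image_tag> spark-submit --version
```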

View file

@ -0,0 +1,61 @@
# Ubuntu 16.04 (Xenial)
FROM ubuntu:16.04
# set version of python required by aztk
ENV AZTK_PYTHON_VERSION=3.5.4
# modify these ARGs on build time to specify your desired versions of Spark/Hadoop
ARG SPARK_VERSION_KEY=spark-1.6.3-bin-hadoop2.6
# set up env vars for pyenv
ENV HOME /
ENV PYENV_ROOT $HOME/.pyenv
ENV PATH $PYENV_ROOT/shims:$PYENV_ROOT/bin:$PATH
RUN apt-get clean \
&& apt-get update -y \
# install dependency packages
&& apt-get install -y --no-install-recommends \
make \
build-essential \
zlib1g-dev \
libssl-dev \
libbz2-dev \
libreadline-dev \
libsqlite3-dev \
wget \
curl \
llvm \
git \
libncurses5-dev \
libncursesw5-dev \
xz-utils \
tk-dev \
&& apt-get update -y \
# install [software-properties-common]
# so we can use [apt-add-repository] to add the repository [ppa:webupd8team/java]
# from which we install Java8
&& apt-get install -y --no-install-recommends software-properties-common \
&& apt-add-repository ppa:webupd8team/java -y \
&& apt-get update -y \
# install java
&& apt-get install -y --no-install-recommends default-jdk \
# download pyenv
&& git clone git://github.com/yyuu/pyenv.git .pyenv \
&& git clone https://github.com/yyuu/pyenv-virtualenv.git ~/.pyenv/plugins/pyenv-virtualenv \
# install & setup pyenv
&& eval "$(pyenv init -)" \
&& echo 'eval "$(pyenv init -)"' >> ~/.bashrc \
# install aztk required python version
&& pyenv install -f $AZTK_PYTHON_VERSION \
&& pyenv global $AZTK_PYTHON_VERSION \
# install spark & setup symlink to SPARK_HOME
&& curl https://d3kbcqa49mib13.cloudfront.net/$SPARK_VERSION_KEY.tgz | tar xvz -C /home \
&& ln -s /home/$SPARK_VERSION_KEY /home/spark-current
# set env vars
ENV JAVA_HOME /usr/lib/jvm/java-1.8.0-openjdk-amd64
ENV SPARK_HOME /home/spark-current
ENV PATH $SPARK_HOME/bin:$PATH
CMD ["/bin/bash"]

View file

@ -0,0 +1,61 @@
# Ubuntu 16.04 (Xenial)
FROM ubuntu:16.04
# set version of python required by aztk
ENV AZTK_PYTHON_VERSION=3.5.4
# modify these ARGs on build time to specify your desired versions of Spark/Hadoop
ARG SPARK_VERSION_KEY=spark-2.1.0-bin-hadoop2.7
# set up env vars for pyenv
ENV HOME /
ENV PYENV_ROOT $HOME/.pyenv
ENV PATH $PYENV_ROOT/shims:$PYENV_ROOT/bin:$PATH
RUN apt-get clean \
&& apt-get update -y \
# install dependency packages
&& apt-get install -y --no-install-recommends \
make \
build-essential \
zlib1g-dev \
libssl-dev \
libbz2-dev \
libreadline-dev \
libsqlite3-dev \
wget \
curl \
llvm \
git \
libncurses5-dev \
libncursesw5-dev \
xz-utils \
tk-dev \
&& apt-get update -y \
# install [software-properties-common]
# so we can use [apt-add-repository] to add the repository [ppa:webupd8team/java]
# from which we install Java8
&& apt-get install -y --no-install-recommends software-properties-common \
&& apt-add-repository ppa:webupd8team/java -y \
&& apt-get update -y \
# install java
&& apt-get install -y --no-install-recommends default-jdk \
# download pyenv
&& git clone git://github.com/yyuu/pyenv.git .pyenv \
&& git clone https://github.com/yyuu/pyenv-virtualenv.git ~/.pyenv/plugins/pyenv-virtualenv \
# install & setup pyenv
&& eval "$(pyenv init -)" \
&& echo 'eval "$(pyenv init -)"' >> ~/.bashrc \
# install aztk required python version
&& pyenv install -f $AZTK_PYTHON_VERSION \
&& pyenv global $AZTK_PYTHON_VERSION \
# install spark & setup symlink to SPARK_HOME
&& curl https://d3kbcqa49mib13.cloudfront.net/$SPARK_VERSION_KEY.tgz | tar xvz -C /home \
&& ln -s /home/$SPARK_VERSION_KEY /home/spark-current
# set env vars
ENV JAVA_HOME /usr/lib/jvm/java-1.8.0-openjdk-amd64
ENV SPARK_HOME /home/spark-current
ENV PATH $SPARK_HOME/bin:$PATH
CMD ["/bin/bash"]

View file

@ -0,0 +1,61 @@
# Ubuntu 16.04 (Xenial)
FROM ubuntu:16.04
# set version of python required by aztk
ENV AZTK_PYTHON_VERSION=3.5.4
# modify these ARGs on build time to specify your desired versions of Spark/Hadoop
ARG SPARK_VERSION_KEY=spark-2.2.0-bin-hadoop2.7
# set up env vars for pyenv
ENV HOME /
ENV PYENV_ROOT $HOME/.pyenv
ENV PATH $PYENV_ROOT/shims:$PYENV_ROOT/bin:$PATH
RUN apt-get clean \
&& apt-get update -y \
# install dependency packages
&& apt-get install -y --no-install-recommends \
make \
build-essential \
zlib1g-dev \
libssl-dev \
libbz2-dev \
libreadline-dev \
libsqlite3-dev \
wget \
curl \
llvm \
git \
libncurses5-dev \
libncursesw5-dev \
xz-utils \
tk-dev \
&& apt-get update -y \
# install [software-properties-common]
# so we can use [apt-add-repository] to add the repository [ppa:webupd8team/java]
# from which we install Java8
&& apt-get install -y --no-install-recommends software-properties-common \
&& apt-add-repository ppa:webupd8team/java -y \
&& apt-get update -y \
# install java
&& apt-get install -y --no-install-recommends default-jdk \
# download pyenv
&& git clone git://github.com/yyuu/pyenv.git .pyenv \
&& git clone https://github.com/yyuu/pyenv-virtualenv.git ~/.pyenv/plugins/pyenv-virtualenv \
# install & setup pyenv
&& eval "$(pyenv init -)" \
&& echo 'eval "$(pyenv init -)"' >> ~/.bashrc \
# install aztk required python version
&& pyenv install -f $AZTK_PYTHON_VERSION \
&& pyenv global $AZTK_PYTHON_VERSION \
# install spark & setup symlink to SPARK_HOME
&& curl https://d3kbcqa49mib13.cloudfront.net/$SPARK_VERSION_KEY.tgz | tar xvz -C /home \
&& ln -s /home/$SPARK_VERSION_KEY /home/spark-current
# set env vars
ENV JAVA_HOME /usr/lib/jvm/java-1.8.0-openjdk-amd64
ENV SPARK_HOME /home/spark-current
ENV PATH $SPARK_HOME/bin:$PATH
CMD ["/bin/bash"]

View file

@ -0,0 +1,23 @@
# Python
This Dockerfile is used to build the __aztk-python__ Docker image used by this toolkit. This image uses Anaconda, providing access to a wide range of popular python packages.
You can modify these Dockerfiles to build your own image. However, in most cases, building on top of the __aztk-base__ image is recommended.
NOTE: If you plan to use Jupyter Notebooks with your Spark cluster, we recommend using this image as Jupyter Notebook comes pre-installed with Anaconda.
## How to build this image
This Dockerfile takes in a variable at build time that allows you to specify your desired Anaconda version: **ANACONDA_VERSION**.
By default, we set **ANACONDA_VERSION=anaconda3-5.0.0**.
For example, if I wanted to use Anaconda3 v5.0.0 with Spark v2.1.0, I would select the appropriate Dockerfile and build the image as follows:
```sh
# spark2.1.0/Dockerfile
docker build \
--build-arg ANACONDA_VERSION=anaconda3-5.0.0 \
-t <my_image_tag> .
```
**ANACONDA_VERSION** is used to set the version of Anaconda for your cluster.
NOTE: Most versions of Python will work. However, when selecting your Python version, please make sure that it is compatible with your selected version of Spark.
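Once a custom image is built and pushed to a registry the nodes can reach, it can be selected just like the stock images (a sketch; `--docker-repo` is described in the Docker image docs):
```sh
# Hypothetical: publish the custom python image and point a new cluster at it
docker push <my_image_tag>
aztk spark cluster create ... --docker-repo <my_image_tag>
```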

View file

@ -0,0 +1,14 @@
# Ubuntu 16.04 (Xenial)
FROM jiata/aztk-base:0.1.0-spark1.6.3
# modify this ARG at build time to specify your desired version of Anaconda
ARG ANACONDA_VERSION=anaconda3-5.0.0
# install the user-specified version of anaconda
RUN pyenv install -f $ANACONDA_VERSION \
&& pyenv global $ANACONDA_VERSION
# set env vars
ENV USER_PYTHON_VERSION $ANACONDA_VERSION
CMD ["/bin/bash"]

View file

@ -0,0 +1,14 @@
# Ubuntu 16.04 (Xenial)
FROM jiata/aztk-base:0.1.0-spark2.1.0
# modify this ARG at build time to specify your desired version of Anaconda
ARG ANACONDA_VERSION=anaconda3-5.0.0
# install the user-specified version of anaconda
RUN pyenv install -f $ANACONDA_VERSION \
&& pyenv global $ANACONDA_VERSION
# set env vars
ENV USER_PYTHON_VERSION $ANACONDA_VERSION
CMD ["/bin/bash"]

View file

@ -0,0 +1,14 @@
# Ubuntu 16.04 (Xenial)
FROM jiata/aztk-base:0.1.0-spark2.2.0
# modify this ARG at build time to specify your desired version of Anaconda
ARG ANACONDA_VERSION=anaconda3-5.0.0
# install the user-specified version of anaconda
RUN pyenv install -f $ANACONDA_VERSION \
&& pyenv global $ANACONDA_VERSION
# set env vars
ENV USER_PYTHON_VERSION $ANACONDA_VERSION
CMD ["/bin/bash"]

View file

@ -115,7 +115,7 @@ cd $SPARK_HOME
```
### Interact with your Spark cluster
By default, the `aztk spark cluster ssh` command port forwards the Spark Web UI to *localhost:8080*, Spark Jobs UI to *localhost:4040*, and Jupyter to your *locahost:8888*. This can be [configured in *.aztb/ssh.yaml*](../docs/13-configuration.md##sshyaml).
By default, the `aztk spark cluster ssh` command port forwards the Spark Web UI to *localhost:8080*, the Spark Jobs UI to *localhost:4040*, and the Spark History Server to *localhost:18080*. This can be [configured in *.aztk/ssh.yaml*](../docs/13-configuration.md#sshyaml).
### Jupyter
Once the appropriate ports have been forwarded, simply navigate to the local ports for viewing. In this case, if you used port 8888 (the default) for Jupyter, then navigate to [http://localhost:8888](http://localhost:8888).

View file

@ -1,19 +1,25 @@
# Docker
Azure Distributed Data Engineering Toolkit runs Spark on Docker.
Supported Azure Distributed Data Engineering Toolkit images are hosted publicly on [Docker Hub](https://hub.docker.com/r/jiata/aztk/tags).
Supported Azure Distributed Data Engineering Toolkit images are hosted publicly on [Docker Hub](https://hub.docker.com/r/jiata/aztk-base/tags).
## Versioning with Docker
The default base image that this package uses is a Docker image with **Spark v2.2.0** and **Python v2.7.13**.
The default image that this package uses is the __aztk-base__ Docker image, which comes with **Spark v2.2.0**.
However, the Azure Distributed Data Engineering Toolkit supports several base images that you can toggle between:
- Spark v2.2.0 and Python v3.5.4 (default)
- Spark v2.1.0 and Python v3.5.4
- Spark v2.1.0 and Python v2.7.13
- Spark v1.6.3 and Python v3.5.4
- Spark v1.6.3 and Python v2.7.13
You can use several versions of the __aztk-base__ image:
- Spark 2.2.0 - jiata/aztk-base:0.1.0-spark2.2.0 (default)
- Spark 2.1.0 - jiata/aztk-base:0.1.0-spark2.1.0
- Spark 1.6.3 - jiata/aztk-base:0.1.0-spark1.6.3
*Today, these supported base images are hosted on Docker Hub under the repo ["jiata/aztk:<tag>"](https://hub.docker.com/r/jiata/aztk/tags).*
We also provide two other image types tailored to Python and R users: __aztk-python__ and __aztk-r__. You can choose between the following:
- Anaconda3-5.0.0 (Python 3.6.2) / Spark 2.2.0 - jiata/aztk-python:0.1.0-spark2.2.0-python3.6.2
- Anaconda3-5.0.0 (Python 3.6.2) / Spark 2.1.0 - jiata/aztk-python:0.1.0-spark2.1.0-python3.6.2
- Anaconda3-5.0.0 (Python 3.6.2) / Spark 1.6.3 - jiata/aztk-python:0.1.0-spark1.6.3-python3.6.2
- [coming soon] R 3.4.0 / Spark v2.2.0 - jiata/aztk-r:0.1.0-spark2.2.0-r3.4.1
- [coming soon] R 3.4.0 / Spark v2.1.0 - jiata/aztk-r:0.1.0-spark2.1.0-r3.4.1
- [coming soon] R 3.4.0 / Spark v1.6.3 - jiata/aztk-r:0.1.0-spark1.6.3-r3.4.1
*Today, these supported images are hosted on Docker Hub under [jiata](https://hub.docker.com/r/jiata) as the aztk-base, aztk-python, and aztk-r repos, tagged as listed above.*
To select an image other than the default, you can set your Docker image at cluster creation time with the optional **--docker-repo** parameter:
@ -21,13 +27,13 @@ To select an image other than the default, you can set your Docker image at clus
aztk spark cluster create ... --docker-repo <name_of_docker_image_repo>
```
For example, if I am using the image version 0.1.0, and wanted to use Spark v1.6.3 with Python v2.7.13, I could run the following cluster create command:
For example, if I am using the image version 0.1.0, and wanted to use Spark v1.6.3, I could run the following cluster create command:
```sh
aztk spark cluster create ... --docker-repo jiata/aztk:0.1.0-spark1.6.3-python3.5.4
aztk spark cluster create ... --docker-repo jiata/aztk-base:0.1.0-spark1.6.3
```
## Using a custom Docker Image
What if I wanted to use my own Docker image? _What if I want to use Spark v2.0.1 with Python v3.6.2?_
What if I wanted to use my own Docker image?
You can build your own Docker image on top of or beneath one of our supported base images _OR_ you can modify the [supported Dockerfile](../docker-image) and build your own image that way.

Binary data
docs/misc/PySpark Jupypter (wiki).png Normal file

Binary file not shown.

After

Width:  |  Height:  |  Size: 127 KiB

Binary data
docs/misc/PySpark Shell (wiki).png Normal file

Binary file not shown.

After

Width:  |  Height:  |  Size: 202 KiB

View file

@ -1,12 +1,8 @@
#!/bin/bash
# This file is the entry point of the docker container.
# It will setup WASB and start Spark.
# This script uses the storage account configured in .thunderbolt/secrets.yaml
# This script uses the specificied user python version ($USER_PYTHON_VERSION)
set -e
aztk_python_version=3.5.4
# --------------------
# Setup custom scripts
@ -15,6 +11,7 @@ custom_script_dir=$DOCKER_WORKING_DIR/custom-scripts
# -----------------------
# Preload jupyter samples
# TODO: remove when we support uploading random (non-executable) files as part of custom-scripts
# -----------------------
mkdir /jupyter
mkdir /jupyter/samples
@ -29,10 +26,10 @@ done
# ----------------------------
# use python v3.5.4 to run aztk software
echo "Starting setup using Docker"
$(pyenv root)/versions/$aztk_python_version/bin/pip install -r $(dirname $0)/requirements.txt
$(pyenv root)/versions/$AZTK_PYTHON_VERSION/bin/pip install -r $(dirname $0)/requirements.txt
echo "Running main.py script"
$(pyenv root)/versions/$aztk_python_version/bin/python $(dirname $0)/main.py install
$(pyenv root)/versions/$AZTK_PYTHON_VERSION/bin/python $(dirname $0)/main.py install
# sleep to keep container running
while true; do sleep 1; done

View file

@ -15,11 +15,7 @@ from install import pick_master
batch_client = config.batch_client
spark_home = "/home/spark-current"
pyspark_driver_python = "/.pyenv/versions/{}/bin/jupyter" \
.format(os.environ["USER_PYTHON_VERSION"])
spark_conf_folder = os.path.join(spark_home, "conf")
default_python_version = os.environ["USER_PYTHON_VERSION"]
def get_pool() -> batchmodels.CloudPool:
return batch_client.pool.get(config.pool_id)
@ -56,81 +52,6 @@ def setup_connection():
master_file.close()
def generate_jupyter_config():
master_node = get_node(config.node_id)
master_node_ip = master_node.ip_address
return dict(
display_name="PySpark",
language="python",
argv=[
"python",
"-m",
"ipykernel",
"-f",
"{connection_file}",
],
env=dict(
SPARK_HOME=spark_home,
PYSPARK_PYTHON="python",
PYSPARK_SUBMIT_ARGS="--master spark://{0}:7077 pyspark-shell" \
.format(master_node_ip),
)
)
def setup_jupyter():
print("Setting up jupyter.")
jupyter_config_file = os.path.join(os.path.expanduser(
"~"), ".jupyter/jupyter_notebook_config.py")
if os.path.isfile(jupyter_config_file):
print("Jupyter config is already set. Skipping setup. \
(Start task is probably reruning after reboot)")
return
generate_jupyter_config_cmd = ["jupyter", "notebook", "--generate-config"]
generate_jupyter_config_cmd.append("--allow-root")
call(generate_jupyter_config_cmd)
jupyter_kernels_path = '/.pyenv/versions/{}/share/jupyter/kernels'. \
format(default_python_version)
with open(jupyter_config_file, "a") as config_file:
config_file.write('\n')
config_file.write('c.NotebookApp.token=""\n')
config_file.write('c.NotebookApp.password=""\n')
shutil.rmtree(jupyter_kernels_path)
os.makedirs(jupyter_kernels_path + '/pyspark', exist_ok=True)
with open(jupyter_kernels_path + '/pyspark/kernel.json', 'w') as outfile:
data = generate_jupyter_config()
json.dump(data, outfile, indent=2)
def start_jupyter():
jupyter_port = config.spark_jupyter_port
pyspark_driver_python_opts = "notebook --no-browser --port='{0}'" \
.format(jupyter_port)
pyspark_driver_python_opts += " --allow-root"
my_env = os.environ.copy()
my_env["PYSPARK_DRIVER_PYTHON"] = pyspark_driver_python
my_env["PYSPARK_DRIVER_PYTHON_OPTS"] = pyspark_driver_python_opts
pyspark_wd = os.path.join(os.getcwd(), "jupyter")
if not os.path.exists(pyspark_wd):
os.mkdir(pyspark_wd)
print("Starting pyspark")
process = Popen([
os.path.join(spark_home, "bin/pyspark")
], env=my_env, cwd=pyspark_wd)
print("Started pyspark with pid {0}".format(process.pid))
def wait_for_master():
print("Waiting for master to be ready.")
master_node_id = pick_master.get_master_node_id(
@ -157,8 +78,6 @@ def start_spark_master():
print("Starting master with '{0}'".format(" ".join(cmd)))
call(cmd)
setup_jupyter()
start_jupyter()
start_history_server()