* added python container, jupyter install script, vanilla container

* Update README.md

* Create README.md

* Create README.md

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* Update jupyter.sh

* Update README.md

* Update README.md

* python

* readme update

* docker updates

* Update README.md

* Update 12-docker-image.md

* Update constants.py

* add image files for wiki

* update imageS

* .

* dockerfile typo

* dockerfiles

* Removed r

* readme

* update constants.py

* update readme

* readme updates

* readme updates
This commit is contained in:
JS 2017-11-09 00:43:32 -08:00 committed by GitHub
Parent d44f0ebf08
Commit 65a56dd657
No key found matching this signature
GPG key ID: 4AEE18F83AFDEB23
20 changed files: 428 additions and 238 deletions

View file

@ -1,5 +1,5 @@
# Azure Distributed Data Engineering Toolkit
Azure Distributed Data Engineering Toolkit is a python CLI application for provisioning on-demand Spark on Docker clusters in Azure. It's a cheap and easy way to get up and running with a Spark cluster, and a great tool for Spark users who want to experiment and start testing at scale.
# Azure Distributed Data Engineering Toolkit (AZTK)
Azure Distributed Data Engineering Toolkit (AZTK) is a python CLI application for provisioning on-demand Spark on Docker clusters in Azure. It's a cheap and easy way to get up and running with a Spark cluster, and a great tool for Spark users who want to experiment and start testing at scale.
This toolkit is built on top of Azure Batch but does not require any Azure Batch knowledge to use.
@ -10,8 +10,10 @@ Currently, this toolkit is designed to run batch Spark jobs that require additio
- Spark clusters run in Docker containers
- Users can bring their own Docker image
- Ability to use low-priority VMs for an 80% discount
- Built in support for Azure Blob Storage connection
- Built in Jupyter notebook for interactive experience
- Built in support for Azure Blob Storage and Azure Data Lake connection
- Optional Jupyter Notebook for pythonic interactive experience
- [coming soon] Optional RStudio Server for an interactive experience in R
- Tailored Docker image for PySpark and [coming soon] SparklyR
- Ability to run _spark submit_ directly from your local machine's CLI
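For example, the last point above means a job can go straight from a local shell to the cluster. A minimal sketch (the `submit` subcommand and its flags are assumptions here, not taken from this README):
```sh
# Hypothetical: submit a local PySpark script to a running cluster from your machine
aztk spark cluster submit \
    --id <my_cluster_id> \
    --name my-pi-job \
    ./pi.py 100
```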
## Setup
@ -78,7 +80,7 @@ aztk spark cluster create \
--vm-size <vm_size>
```
By default, this package runs Spark 2.2.0 with Python 3.5 on an Ubuntu16.04 Docker image. More info on this image can be found in the [docker-images](/docker-image) folder in this repo.
By default, this package runs Spark 2.2.0 on an Ubuntu 16.04 Docker image. More info on this image can be found in the [docker-images](/docker-image) folder in this repo.
NOTE: The cluster id (`--id`) can only contain alphanumeric characters including hyphens and underscores, and cannot contain more than 64 characters.
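For instance, a create call that satisfies this naming rule might look like the following sketch (`--id` and `--vm-size` appear above; the `--size` flag for node count is an assumption):
```sh
# Sketch: cluster id limited to alphanumerics, hyphens and underscores
aztk spark cluster create \
    --id my_spark-cluster-01 \
    --size 3 \
    --vm-size standard_d2_v2
```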
@ -112,7 +114,7 @@ Most users will want to work interactively with their Spark clusters. With the `
```bash
aztk spark cluster ssh --id <my_cluster_id>
```
By default, we port forward the Spark Web UI to *localhost:8080*, Spark Jobs UI to *localhost:4040*, and Jupyter to *localhost:8888*.
By default, we port forward the Spark Web UI to *localhost:8080*, Spark Jobs UI to *localhost:4040*, and the Spark History Server to *localhost:18080*.
You can configure these settings in the *.aztk/ssh.yaml* file.
@ -135,7 +137,7 @@ aztk spark cluster delete --id <my_cluster_id>
## FAQs
- [How do I connect to Azure Storage (WASB)?](./docs/30-cloud-storage.md)
- [I want to use a different version of Spark / Python](./docs/12-docker-image.md)
- [I want to use a different version of Spark](./docs/12-docker-image.md)
- [How do I SSH into my Spark cluster's master node?](./docs/10-clusters.md#ssh-and-port-forwarding)
- [How do I interact with my Spark cluster using a password instead of an SSH-key?](./docs/10-clusters.md#interactive-mode)
- [How do I change my cluster default settings?](./docs/13-configuration.md)

View file

@ -3,7 +3,7 @@ import os
"""
DOCKER
"""
DEFAULT_DOCKER_REPO = "jiata/aztk:0.1.0-spark2.2.0-python3.5.4"
DEFAULT_DOCKER_REPO = "jiata/aztk-base:0.1.0-spark2.2.0"
DOCKER_SPARK_CONTAINER_NAME = "spark"
# DOCKER

View file

@ -15,7 +15,7 @@ size: 2
username: spark
# docker_repo: <name of docker image repo (for more information, see https://github.com/Azure/aztk/blob/master/docs/12-docker-image.md)>
docker_repo: jiata/aztk:0.1.0-spark2.2.0-python3.5.4
docker_repo: jiata/aztk-base:0.1.0-spark2.2.0
# # optional custom scripts to run on the Spark master, Spark worker or all nodes in the cluster
# custom_scripts:

55
custom-scripts/jupyter.sh Normal file
View file

@ -0,0 +1,55 @@
#!/bin/bash
# This custom script only works on Docker images where Jupyter is pre-installed
#
# This custom script has been tested to work on the following docker images:
# - jiata/aztk-python:0.1.0-spark2.2.0-python3.6.2
# - jiata/aztk-python:0.1.0-spark2.1.0-python3.6.2
# - jiata/aztk-python:0.1.0-spark1.6.3-python3.6.2
if [ "$IS_MASTER" = "1" ]; then
PYSPARK_DRIVER_PYTHON="/.pyenv/versions/${USER_PYTHON_VERSION}/bin/jupyter"
JUPYTER_KERNELS="/.pyenv/versions/${USER_PYTHON_VERSION}/share/jupyter/kernels"
# disable password/token on jupyter notebook
jupyter notebook --generate-config --allow-root
JUPYTER_CONFIG='/.jupyter/jupyter_notebook_config.py'
echo >> $JUPYTER_CONFIG
echo -e 'c.NotebookApp.token=""' >> $JUPYTER_CONFIG
echo -e 'c.NotebookApp.password=""' >> $JUPYTER_CONFIG
# get master ip
MASTER_IP=$(hostname -i)
# remove existing kernels
rm -rf $JUPYTER_KERNELS/*
# set up jupyter to use pyspark
mkdir $JUPYTER_KERNELS/pyspark
touch $JUPYTER_KERNELS/pyspark/kernel.json
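# NOTE: the heredoc below is left unquoted on purpose, so $MASTER_IP and $SPARK_HOME
# are expanded now (while this script runs on the node) and written literally into kernel.json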
cat << EOF > $JUPYTER_KERNELS/pyspark/kernel.json
{
"display_name": "PySpark",
"language": "python",
"argv": [
"python",
"-m",
"ipykernel",
"-f",
"{connection_file}"
],
"env": {
"SPARK_HOME": "$SPARK_HOME",
"PYSPARK_PYTHON": "python",
"PYSPARK_SUBMIT_ARGS": "--master spark://$MASTER_IP:7077 pyspark-shell"
}
}
EOF
# start jupyter notebook from /jupyter
cd /jupyter
(PYSPARK_DRIVER_PYTHON=$PYSPARK_DRIVER_PYTHON PYSPARK_DRIVER_PYTHON_OPTS="notebook --no-browser --port=8888 --allow-root" pyspark &)
fi
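One way to wire this script into a cluster (a sketch only; the `custom_scripts` entry mirrors the commented example in *.aztk/cluster.yaml* above, and the `runOn` field name and value are assumptions):
```sh
# Hypothetical: register jupyter.sh so it runs on the Spark master at provision time
cat >> .aztk/cluster.yaml << 'EOF'
custom_scripts:
  - script: ./custom-scripts/jupyter.sh
    runOn: master
EOF
```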

View file

@ -1,60 +0,0 @@
# Ubuntu 16.04 (Xenial)
FROM ubuntu:16.04
# set version of python required for thunderbolt application
ENV AZTK_PYTHON_VERSION=3.5.4
# modify these ARGs on build time to specify your desired versions of Spark/Hadoop and Python
ARG SPARK_VERSION_KEY=spark-2.1.0-bin-hadoop2.7
ARG PYTHON_VERSION=3.5.4
# set up apt-get
RUN apt-get update -y && \
apt-get install -y --no-install-recommends make build-essential libssl-dev zlib1g-dev \
libbz2-dev libreadline-dev libsqlite3-dev wget curl llvm git libncurses5-dev \
libncursesw5-dev xz-utils tk-dev
# installing [software-properties-common] so that we can use [apt-add-repository] to add the repository [ppa:webupd8team/java] form which we install Java8
RUN apt-get install -y --no-install-recommends software-properties-common && \
apt-add-repository ppa:webupd8team/java -y && \
apt-get update -y
# installing java
RUN apt-get install -y --no-install-recommends default-jdk
# install and setup pyenv
RUN git clone git://github.com/yyuu/pyenv.git .pyenv
RUN git clone https://github.com/yyuu/pyenv-virtualenv.git ~/.pyenv/plugins/pyenv-virtualenv
ENV HOME /
ENV PYENV_ROOT $HOME/.pyenv
ENV PATH $PYENV_ROOT/shims:$PYENV_ROOT/bin:$PATH
RUN eval "$(pyenv init -)"
RUN echo 'eval "$(pyenv init -)"' >> ~/.bashrc
# install thunderbolt required python version & the user specified version of python
RUN pyenv install -f $AZTK_PYTHON_VERSION && \
pyenv install -s $PYTHON_VERSION
RUN pyenv global $PYTHON_VERSION
# install jupyter
RUN pip install jupyter && \
pip install --upgrade jupyter
# install p4j
RUN curl https://pypi.python.org/packages/1f/b0/882c144fe70cc3f1e55d62b8611069ff07c2d611d99f228503606dd1aee4/py4j-0.10.0.tar.gz | tar xvz -C /home && \
cd /home/py4j-0.10.0 && \
python setup.py install
# install spark
RUN curl https://d3kbcqa49mib13.cloudfront.net/$SPARK_VERSION_KEY.tgz | tar xvz -C /home
# set up symlink for SPARK HOME
RUN ln -s /home/$SPARK_VERSION_KEY /home/spark-current
# set env vars
ENV JAVA_HOME /usr/lib/jvm/java-1.8.0-openjdk-amd64
ENV SPARK_HOME /home/spark-current
ENV USER_PYTHON_VERSION $PYTHON_VERSION
ENV PATH $SPARK_HOME/bin:$PATH
CMD ["/bin/bash"]

View file

@ -1,17 +1,40 @@
# Docker Image Gallery
Azure Distributed Data Engineering Toolkit uses Docker containers to run Spark.
Please refer to the docs for details [how to select a docker-repo at cluster creation time](../docs/12-docker-image.md).
Please refer to the docs for details on [how to select a docker-repo at cluster creation time](../docs/12-docker-image.md).
### Supported Base Images
We support several base images:
- [Docker Hub] jiata/aztk:<image-version>-spark2.2.0-python3.5.4
- [Docker Hub] jiata/aztk:<image-version>-spark2.1.0-python3.5.4
- [Docker Hub] jiata/aztk:<image-version>-spark1.6.3-python3.5.4
- [Docker Hub] jiata/aztk:<image-version>-spark2.1.0-python2.7.13
- [Docker Hub] jiata/aztk:<image-version>-spark1.6.3-python2.7.13
## Supported Images
By default, this toolkit will use the base Spark image, __aztk-base__. This image contains the bare minimum to get Spark up and running in standalone mode.
NOTE: Replace **<image-version>** with the version of the image you wish to use. For example: **jiata/aztk:0.1.0-spark2.2.0-python3.5.4**
On top of that, we also provide additional flavors of Spark images: one geared towards the Python user (PySpark) and another, coming soon, geared towards the R user (SparklyR or SparkR).
Docker Image | Image Type | User Language(s) | What's Included?
:-- | :-- | :-- | :--
[aztk-base](https://hub.docker.com/r/jiata/aztk-base/) | Base | Java, Scala | `Spark`
[aztk-python](https://hub.docker.com/r/jiata/aztk-python/) | Pyspark | Python | `Anaconda`</br>`Jupyter Notebooks` </br> `PySpark`
(coming soon) [aztk-r](https://hub.docker.com/r/jiata/aztk-r/) | SparklyR | R | `CRAN`</br>`RStudio Server`</br>`SparklyR and SparkR`
__aztk-python__ and __aztk-r__ images are built on top of the __aztk-base__ image.
Today, all the AZTK images are hosted on Docker Hub under [jiata](https://hub.docker.com/r/jiata).
### Matrix of Supported Container Images:
Docker Repo (hosted on Docker Hub) | Spark Version | Python Version | R Version
:-- | :-- | :-- | :--
jiata/aztk-base:0.1.0-spark2.2.0 __(default)__ | v2.2.0 | -- | --
jiata/aztk-base:0.1.0-spark2.1.0 | v2.1.0 | -- | --
jiata/aztk-base:0.1.0-spark1.6.3 | v1.6.3 | -- | --
jiata/aztk-python:0.1.0-spark2.2.0-anaconda3-5.0.0 | v2.2.0 | v3.6.2 | --
jiata/aztk-python:0.1.0-spark2.1.0-anaconda3-5.0.0 | v2.1.0 | v3.6.2 | --
jiata/aztk-python:0.1.0-spark1.6.3-anaconda3-5.0.0 | v1.6.3 | v3.6.2 | --
[coming soon] jiata/aztk-r:0.1.0-spark2.2.0-r3.4.1 | v2.2.0 | -- | v3.4.1
[coming soon] jiata/aztk-r:0.1.0-spark2.1.0-r3.4.1 | v2.1.0 | -- | v3.4.1
[coming soon] jiata/aztk-r:0.1.0-spark1.6.3-r3.4.1 | v1.6.3 | -- | v3.4.1
If you have requests to add to the list of supported images, please file a GitHub issue.
NOTE: Spark clusters that use the __aztk-python__ and __aztk-r__ images take longer to provision because these Docker images are significantly larger than the __aztk-base__ image.
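To pick one of the images in the matrix above at cluster creation time, pass it to the `--docker-repo` flag described in [12-docker-image.md](../docs/12-docker-image.md), for example:
```sh
# Provision a cluster on the PySpark image instead of the default aztk-base image
aztk spark cluster create ... --docker-repo jiata/aztk-python:0.1.0-spark2.2.0-anaconda3-5.0.0
```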
### Gallery of 3rd Party Images
Since this toolkit uses Docker containers to run Spark, users can bring their own images. Here's a list of 3rd party images:
@ -19,94 +42,68 @@ Since this toolkit uses Docker containers to run Spark, users can bring their ow
(See below for a how-to guide on building your own images for the Azure Distributed Data Engineering Toolkit)
# How to use my own Docker Image
# How do I use my own Docker Image?
Building your own Docker Image to use with this toolkit has many advantages for users who want more customization over their environment. For some, this may look like installing specific, and even private, libraries that their Spark jobs require. For others, it may just be setting up a version of Spark, Python or R that fits their particular needs.
This section is for users who want to build their own docker images.
## Base Docker Images to build with
By default, the Azure Distributed Data Engineering Toolkit uses **Spark2.2.0-Python3.5.4** as its base image. However, you can build from any of the following supported base images:
- Spark2.2.0 and Hadoop2.7 and Python3.5.4
- Spark2.1.0 and Hadoop2.7 and Python3.5.4
- Spark2.1.0 and Hadoop2.7 and Python2.7.13
- Spark1.6.3 and Hadoop2.6 and Python3.5.4
- Spark1.6.3 and Hadoop2.6 and Python2.7.13
All the base images above are built on a vanilla ubuntu16.04-LTS image and comes pre-baked with Jupyter Notebook and a connection to Azure Blob Storage (WASB).
Currently, the images are hosted on [Docker Hub (jiata/aztk)](https://hub.docker.com/r/jiata/aztk).
If you have requests to add to the list of supported base images, please file a new Github issue.
## Building Your Own Docker Image
Azure Distributed Data Engineering Toolkit supports custom Docker images. To guarantee that your Spark deployment works, you can either build your own Docker image on top or beneath one of our supported base images _OR_ you can modify this supported Dockerfile and build your own image.
The Azure Distributed Data Engineering Toolkit supports custom Docker images. To guarantee that your Spark deployment works, we recommend that you build on top of one of our __aztk-base__ images. You can also build on top of our __aztk-python__ or __aztk-r__ images, but note that they are themselves built on top of the __aztk-base__ image.
To build your own image, you can either build _on top_ of or _beneath_ one of our supported images, _OR_ you can simply modify one of the supported Dockerfiles to build your own.
### Building on top
You can choose to build on top of one of our base images by using the **FROM** keyword in your Dockerfile:
```python
You can build on top of our images by referencing the __aztk-base__ image in the **FROM** keyword of your Dockerfile:
```sh
# Your custom Dockerfile
FROM jiata/aztk:<aztk-image-version>-spark2.2.0-python3.5.4
FROM jiata/aztk-base:0.1.0-spark2.2.0
...
```
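A slightly fuller, hypothetical example of layering on the base image without disturbing its environment variables:
```sh
# Hypothetical custom Dockerfile built on top of aztk-base
FROM jiata/aztk-base:0.1.0-spark2.2.0

# add your own libraries (pip is assumed to resolve through the base image's pyenv shims)
RUN pip install numpy pandas

# leave JAVA_HOME, SPARK_HOME, AZTK_PYTHON_VERSION and PATH as set by the base image
CMD ["/bin/bash"]
```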
### Building beneath
You can alternatively build beneath one of our base images by pulling down one of base images' Dockerfile and setting the **FROM** keyword to pull from your Docker image's location:
```python
# The Dockerfile from one of supported base image
To build beneath one of our images, modify one of our Dockerfiles so that the **FROM** keyword pulls from your Docker image's location (as opposed to the default, which is a base Ubuntu image):
```sh
# One of the Dockerfiles that AZTK supports
# Change the FROM statement to point to your hosted image repo
FROM my_username/my_repo:latest
...
```
NOTE: Currently, we do not support private Docker repos.
Please note that for this method to work, your Docker image must have been built on Ubuntu.
## About the Dockerfile
The Dockerfile is used to build the Docker images used by this toolkit.
## Required Environment Variables
When layering your own Docker image, make sure your image does not interfere with the environment variables set in the __aztk-base__ Dockerfile, otherwise it may not work on AZTK.
You can modify this Dockerfile to build your own image. If you plan to do so, please continue reading the below sections.
Please make sure that the following environment variables are set:
- AZTK_PYTHON_VERSION
- JAVA_HOME
- SPARK_HOME
### Specifying Spark and Python Version
This Dockerfile takes in a few variables at build time that allow you to specify your desired Spark and Python versions: **PYTHON_VERSION** and **SPARK_VERSION_KEY**.
```sh
# For example, if I want to use Python 2.7.13 with Spark 1.6.3 I would build the image as follows:
docker build \
--build-arg PYTHON_VERSION=2.7.13 \
--build-arg SPARK_VERSION_KEY=spark-1.6.3-bin-hadoop2.6 \
-t <my_image_tag> .
```
**SPARK_VERSION_KEY** is used to locate which version of Spark to download. These are the values that have been tested:
- spark-1.6.3-bin-hadoop2.6
- spark-2.1.0-bin-hadoop2.7
- spark-2.2.0-bin-hadoop2.7
For a full list of supported keys, please see this [page](https://d3kbcqa49mib13.cloudfront.net)
NOTE: Do not include the '.tgz' suffix as part of the Spark version key.
**PYTHON_VERSION** is used to set the version of Python for your cluster. These are the values that have been tested:
- 3.5.4
- 2.7.13
NOTE: Most version of Python will work. However, when selecting your Python version, please make sure that the it is compatible with your selected version of Spark. Today, it is also a requirement that your selected verion of Python can run Jupyter Notebook.
### Required Environment Variables
When layering your own Docker image, make sure your image does not intefere with the environment variables set in this Dockerfile, otherwise it may not work.
If you want to use your own version of Spark, please make sure that the following environment variables are set.
You also need to make sure that __PATH__ is correctly configured with $SPARK_HOME
- PATH=$SPARK_HOME/bin:$PATH
By default, these are set as follows:
``` sh
# An example of required environment variables
ENV AZTK_PYTHON_VERSION 3.5.4
ENV JAVA_HOME /usr/lib/jvm/java-1.8.0-openjdk-amd64
ENV SPARK_HOME /home/spark-current
ENV PYSPARK_PYTHON python
ENV USER_PYTHON_VERSION $PYTHON_VERSION
ENV PATH $SPARK_HOME/bin:$PATH
```
If you are using your own version of Spark, make sure that it is symlinked by "/home/spark-current". **$SPARK_HOME** must also point to "/home/spark-current".
## Hosting your Docker Image
By default, this toolkit assumes that your Docker images are publicly hosted on Docker Hub. However, we also support hosting your images privately.
See [here](https://github.com/Azure/aztk/blob/master/docs/12-docker-image.md#using-a-custom-docker-image-that-is-privately-hosted) to learn more about using privately hosted Docker Images.
## Learn More
The Dockerfiles in this directory are used to build the Docker images used by this toolkit. Please reference the individual directories for more information on each Dockerfile:
- [Base](./base)
- [Python](./python)
- [coming soon] R

View file

@ -0,0 +1,26 @@
# Base AZTK Docker image
This Dockerfile is used to build the __aztk-base__ image used by this toolkit. This Dockerfile produces the Docker image that is selected by AZTK by default.
You can modify this Dockerfile to build your own image.
## How to build this image
This Dockerfile takes in a single variable at build time that allows you to specify your desired Spark version: **SPARK_VERSION_KEY**.
By default, this image also installs Python v3.5.4, which is required by this toolkit.
```sh
# For example, if I want to use Spark 1.6.3 I would build the image as follows:
docker build \
--build-arg SPARK_VERSION_KEY=spark-1.6.3-bin-hadoop2.6 \
-t <my_image_tag> .
```
**SPARK_VERSION_KEY** is used to locate which version of Spark to download. These are the values that have been tested and known to work:
- spark-1.6.3-bin-hadoop2.6
- spark-2.1.0-bin-hadoop2.7
- spark-2.2.0-bin-hadoop2.7
For a full list of supported keys, please see this [page](https://d3kbcqa49mib13.cloudfront.net)
NOTE: Do not include the '.tgz' suffix as part of the Spark version key.
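After building, a quick smoke test of the resulting image could look like this (a sketch; it assumes `spark-submit` resolves because the image puts `$SPARK_HOME/bin` on `PATH`):
```sh
# Print the Spark version baked into the image, then exit
docker run -it --rm <my_image_tag> spark-submit --version
```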

View file

@ -0,0 +1,61 @@
# Ubuntu 16.04 (Xenial)
FROM ubuntu:16.04
# set version of python required by aztk
ENV AZTK_PYTHON_VERSION=3.5.4
# modify these ARGs on build time to specify your desired versions of Spark/Hadoop
ARG SPARK_VERSION_KEY=spark-1.6.3-bin-hadoop2.6
# set up env vars for pyenv
ENV HOME /
ENV PYENV_ROOT $HOME/.pyenv
ENV PATH $PYENV_ROOT/shims:$PYENV_ROOT/bin:$PATH
RUN apt-get clean \
&& apt-get update -y \
# install dependency packages
&& apt-get install -y --no-install-recommends \
make \
build-essential \
zlib1g-dev \
libssl-dev \
libbz2-dev \
libreadline-dev \
libsqlite3-dev \
wget \
curl \
llvm \
git \
libncurses5-dev \
libncursesw5-dev \
xz-utils \
tk-dev \
&& apt-get update -y \
# install [software-properties-common]
# so we can use [apt-add-repository] to add the repository [ppa:webupd8team/java]
# from which we install Java8
&& apt-get install -y --no-install-recommends software-properties-common \
&& apt-add-repository ppa:webupd8team/java -y \
&& apt-get update -y \
# install java
&& apt-get install -y --no-install-recommends default-jdk \
# download pyenv
&& git clone git://github.com/yyuu/pyenv.git .pyenv \
&& git clone https://github.com/yyuu/pyenv-virtualenv.git ~/.pyenv/plugins/pyenv-virtualenv \
# install & setup pyenv
&& eval "$(pyenv init -)" \
&& echo 'eval "$(pyenv init -)"' >> ~/.bashrc \
# install aztk required python version
&& pyenv install -f $AZTK_PYTHON_VERSION \
&& pyenv global $AZTK_PYTHON_VERSION \
# install spark & setup symlink to SPARK_HOME
&& curl https://d3kbcqa49mib13.cloudfront.net/$SPARK_VERSION_KEY.tgz | tar xvz -C /home \
&& ln -s /home/$SPARK_VERSION_KEY /home/spark-current
# set env vars
ENV JAVA_HOME /usr/lib/jvm/java-1.8.0-openjdk-amd64
ENV SPARK_HOME /home/spark-current
ENV PATH $SPARK_HOME/bin:$PATH
CMD ["/bin/bash"]

View file

@ -0,0 +1,61 @@
# Ubuntu 16.04 (Xenial)
FROM ubuntu:16.04
# set version of python required by aztk
ENV AZTK_PYTHON_VERSION=3.5.4
# modify these ARGs on build time to specify your desired versions of Spark/Hadoop
ARG SPARK_VERSION_KEY=spark-2.1.0-bin-hadoop2.7
# set up env vars for pyenv
ENV HOME /
ENV PYENV_ROOT $HOME/.pyenv
ENV PATH $PYENV_ROOT/shims:$PYENV_ROOT/bin:$PATH
RUN apt-get clean \
&& apt-get update -y \
# install dependency packages
&& apt-get install -y --no-install-recommends \
make \
build-essential \
zlib1g-dev \
libssl-dev \
libbz2-dev \
libreadline-dev \
libsqlite3-dev \
wget \
curl \
llvm \
git \
libncurses5-dev \
libncursesw5-dev \
xz-utils \
tk-dev \
&& apt-get update -y \
# install [software-properties-common]
# so we can use [apt-add-repository] to add the repository [ppa:webupd8team/java]
# from which we install Java8
&& apt-get install -y --no-install-recommends software-properties-common \
&& apt-add-repository ppa:webupd8team/java -y \
&& apt-get update -y \
# install java
&& apt-get install -y --no-install-recommends default-jdk \
# download pyenv
&& git clone git://github.com/yyuu/pyenv.git .pyenv \
&& git clone https://github.com/yyuu/pyenv-virtualenv.git ~/.pyenv/plugins/pyenv-virtualenv \
# install & setup pyenv
&& eval "$(pyenv init -)" \
&& echo 'eval "$(pyenv init -)"' >> ~/.bashrc \
# install aztk required python version
&& pyenv install -f $AZTK_PYTHON_VERSION \
&& pyenv global $AZTK_PYTHON_VERSION \
# install spark & setup symlink to SPARK_HOME
&& curl https://d3kbcqa49mib13.cloudfront.net/$SPARK_VERSION_KEY.tgz | tar xvz -C /home \
&& ln -s /home/$SPARK_VERSION_KEY /home/spark-current
# set env vars
ENV JAVA_HOME /usr/lib/jvm/java-1.8.0-openjdk-amd64
ENV SPARK_HOME /home/spark-current
ENV PATH $SPARK_HOME/bin:$PATH
CMD ["/bin/bash"]

View file

@ -0,0 +1,61 @@
# Ubuntu 16.04 (Xenial)
FROM ubuntu:16.04
# set version of python required by aztk
ENV AZTK_PYTHON_VERSION=3.5.4
# modify these ARGs on build time to specify your desired versions of Spark/Hadoop
ARG SPARK_VERSION_KEY=spark-2.2.0-bin-hadoop2.7
# set up env vars for pyenv
ENV HOME /
ENV PYENV_ROOT $HOME/.pyenv
ENV PATH $PYENV_ROOT/shims:$PYENV_ROOT/bin:$PATH
RUN apt-get clean \
&& apt-get update -y \
# install dependency packages
&& apt-get install -y --no-install-recommends \
make \
build-essential \
zlib1g-dev \
libssl-dev \
libbz2-dev \
libreadline-dev \
libsqlite3-dev \
wget \
curl \
llvm \
git \
libncurses5-dev \
libncursesw5-dev \
xz-utils \
tk-dev \
&& apt-get update -y \
# install [software-properties-common]
# so we can use [apt-add-repository] to add the repository [ppa:webupd8team/java]
# from which we install Java8
&& apt-get install -y --no-install-recommends software-properties-common \
&& apt-add-repository ppa:webupd8team/java -y \
&& apt-get update -y \
# install java
&& apt-get install -y --no-install-recommends default-jdk \
# download pyenv
&& git clone git://github.com/yyuu/pyenv.git .pyenv \
&& git clone https://github.com/yyuu/pyenv-virtualenv.git ~/.pyenv/plugins/pyenv-virtualenv \
# install & setup pyenv
&& eval "$(pyenv init -)" \
&& echo 'eval "$(pyenv init -)"' >> ~/.bashrc \
# install aztk required python version
&& pyenv install -f $AZTK_PYTHON_VERSION \
&& pyenv global $AZTK_PYTHON_VERSION \
# install spark & setup symlink to SPARK_HOME
&& curl https://d3kbcqa49mib13.cloudfront.net/$SPARK_VERSION_KEY.tgz | tar xvz -C /home \
&& ln -s /home/$SPARK_VERSION_KEY /home/spark-current
# set env vars
ENV JAVA_HOME /usr/lib/jvm/java-1.8.0-openjdk-amd64
ENV SPARK_HOME /home/spark-current
ENV PATH $SPARK_HOME/bin:$PATH
CMD ["/bin/bash"]

View file

@ -0,0 +1,23 @@
# Python
This Dockerfile is used to build the __aztk-python__ Docker image used by this toolkit. This image uses Anaconda, providing access to a wide range of popular python packages.
You can modify these Dockerfiles to build your own image. However, in most cases, building on top of the __aztk-base__ image is recommended.
NOTE: If you plan to use Jupyter Notebooks with your Spark cluster, we recommend using this image as Jupyter Notebook comes pre-installed with Anaconda.
## How to build this image
This Dockerfile takes in a variable at build time that allows you to specify your desired Anaconda version: **ANACONDA_VERSION**.
By default, we set **ANACONDA_VERSION=anaconda3-5.0.0**.
For example, if I wanted to use Anaconda3 v5.0.0 with Spark v2.1.0, I would select the appropriate Dockerfile and build the image as follows:
```sh
# spark2.1.0/Dockerfile
docker build \
--build-arg ANACONDA_VERSION=anaconda3-5.0.0 \
-t <my_image_tag> .
```
**ANACONDA_VERSION** is used to set the version of Anaconda for your cluster.
NOTE: Most versions of Python will work. However, when selecting your Python version, please make sure that it is compatible with your selected version of Spark.
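Once a custom image is built and pushed to a registry the nodes can reach, it can be selected just like the stock images (a sketch; `--docker-repo` is described in the Docker image docs):
```sh
# Hypothetical: publish the custom python image and point a new cluster at it
docker push <my_image_tag>
aztk spark cluster create ... --docker-repo <my_image_tag>
```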

View file

@ -0,0 +1,14 @@
# Ubuntu 16.04 (Xenial)
FROM jiata/aztk-base:0.1.0-spark1.6.3
# modify this ARG at build time to specify your desired version of Anaconda
ARG ANACONDA_VERSION=anaconda3-5.0.0
# install the user-specified version of anaconda
RUN pyenv install -f $ANACONDA_VERSION \
&& pyenv global $ANACONDA_VERSION
# set env vars
ENV USER_PYTHON_VERSION $ANACONDA_VERSION
CMD ["/bin/bash"]

View file

@ -0,0 +1,14 @@
# Ubuntu 16.04 (Xenial)
FROM jiata/aztk-base:0.1.0-spark2.1.0
# modify this ARG at build time to specify your desired version of Anaconda
ARG ANACONDA_VERSION=anaconda3-5.0.0
# install the user-specified version of anaconda
RUN pyenv install -f $ANACONDA_VERSION \
&& pyenv global $ANACONDA_VERSION
# set env vars
ENV USER_PYTHON_VERSION $ANACONDA_VERSION
CMD ["/bin/bash"]

View file

@ -0,0 +1,14 @@
# Ubuntu 16.04 (Xenial)
FROM jiata/aztk-base:0.1.0-spark2.2.0
# modify this ARG at build time to specify your desired version of Anaconda
ARG ANACONDA_VERSION=anaconda3-5.0.0
# install the user-specified version of anaconda
RUN pyenv install -f $ANACONDA_VERSION \
&& pyenv global $ANACONDA_VERSION
# set env vars
ENV USER_PYTHON_VERSION $ANACONDA_VERSION
CMD ["/bin/bash"]

View file

@ -115,7 +115,7 @@ cd $SPARK_HOME
```
### Interact with your Spark cluster
By default, the `aztk spark cluster ssh` command port forwards the Spark Web UI to *localhost:8080*, Spark Jobs UI to *localhost:4040*, and Jupyter to your *locahost:8888*. This can be [configured in *.aztb/ssh.yaml*](../docs/13-configuration.md##sshyaml).
By default, the `aztk spark cluster ssh` command port forwards the Spark Web UI to *localhost:8080*, the Spark Jobs UI to *localhost:4040*, and the Spark History Server to *localhost:18080*. This can be [configured in *.aztk/ssh.yaml*](../docs/13-configuration.md#sshyaml).
### Jupyter
Once the appropriate ports have been forwarded, simply navigate to the local ports for viewing. In this case, if you used port 8888 (the default) for Jupyter, then navigate to [http://localhost:8888](http://localhost:8888).

View file

@ -1,19 +1,25 @@
# Docker
Azure Distributed Data Engineering Toolkit runs Spark on Docker.
Supported Azure Distributed Data Engineering Toolkit images are hosted publicly on [Docker Hub](https://hub.docker.com/r/jiata/aztk/tags).
Supported Azure Distributed Data Engineering Toolkit images are hosted publicly on [Docker Hub](https://hub.docker.com/r/jiata/aztk-base/tags).
## Versioning with Docker
The default base image that this package uses is a Docker image with **Spark v2.2.0** and **Python v2.7.13**.
The default image that this package uses is the __aztk-base__ Docker image, which comes with **Spark v2.2.0**.
However, the Azure Distributed Data Engineering Toolkit supports several base images that you can toggle between:
- Spark v2.2.0 and Python v3.5.4 (default)
- Spark v2.1.0 and Python v3.5.4
- Spark v2.1.0 and Python v2.7.13
- Spark v1.6.3 and Python v3.5.4
- Spark v1.6.3 and Python v2.7.13
You can use several versions of the __aztk-base__ image:
- Spark 2.2.0 - jiata/aztk-base:0.1.0-spark2.2.0 (default)
- Spark 2.1.0 - jiata/aztk-base:0.1.0-spark2.1.0
- Spark 1.6.3 - jiata/aztk-base:0.1.0-spark1.6.3
*Today, these supported base images are hosted on Docker Hub under the repo ["jiata/aztk:<tag>"](https://hub.docker.com/r/jiata/aztk/tags).*
We also provide two other image types tailored to Python and R users: __aztk-python__ and __aztk-r__. You can choose between the following:
- Anaconda3-5.0.0 (Python 3.6.2) / Spark 2.2.0 - jiata/aztk-python:0.1.0-spark2.2.0-python3.6.2
- Anaconda3-5.0.0 (Python 3.6.2) / Spark 2.1.0 - jiata/aztk-python:0.1.0-spark2.1.0-python3.6.2
- Anaconda3-5.0.0 (Python 3.6.2) / Spark 1.6.3 - jiata/aztk-python:0.1.0-spark1.6.3-python3.6.2
- [coming soon] R 3.4.0 / Spark v2.2.0 - jiata/aztk-r:0.1.0-spark2.2.0-r3.4.1
- [coming soon] R 3.4.0 / Spark v2.1.0 - jiata/aztk-r:0.1.0-spark2.1.0-r3.4.1
- [coming soon] R 3.4.0 / Spark v1.6.3 - jiata/aztk-r:0.1.0-spark1.6.3-r3.4.1
*Today, these supported images are hosted on Docker Hub under [jiata](https://hub.docker.com/r/jiata) as the aztk-base, aztk-python, and aztk-r repos, tagged as listed above.*
To select an image other than the default, you can set your Docker image at cluster creation time with the optional **--docker-repo** parameter:
@ -21,13 +27,13 @@ To select an image other than the default, you can set your Docker image at clus
aztk spark cluster create ... --docker-repo <name_of_docker_image_repo>
```
For example, if I am using the image version 0.1.0, and wanted to use Spark v1.6.3 with Python v2.7.13, I could run the following cluster create command:
For example, if I am using the image version 0.1.0, and wanted to use Spark v1.6.3, I could run the following cluster create command:
```sh
aztk spark cluster create ... --docker-repo jiata/aztk:0.1.0-spark1.6.3-python3.5.4
aztk spark cluster create ... --docker-repo jiata/aztk-base:0.1.0-spark1.6.3
```
## Using a custom Docker Image
What if I wanted to use my own Docker image? _What if I want to use Spark v2.0.1 with Python v3.6.2?_
What if I wanted to use my own Docker image?
You can build your own Docker image on top of or beneath one of our supported base images _OR_ you can modify the [supported Dockerfile](../docker-image) and build your own image that way.

Binary data
docs/misc/PySpark Jupypter (wiki).png Normal file

Binary file not shown.

After

Width:  |  Height:  |  Size: 127 KiB

Binary data
docs/misc/PySpark Shell (wiki).png Normal file

Binary file not shown.

After

Width:  |  Height:  |  Size: 202 KiB

View file

@ -1,12 +1,8 @@
#!/bin/bash
# This file is the entry point of the docker container.
# It will setup WASB and start Spark.
# This script uses the storage account configured in .thunderbolt/secrets.yaml
# This script uses the specificied user python version ($USER_PYTHON_VERSION)
set -e
aztk_python_version=3.5.4
# --------------------
# Setup custom scripts
@ -15,6 +11,7 @@ custom_script_dir=$DOCKER_WORKING_DIR/custom-scripts
# -----------------------
# Preload jupyter samples
# TODO: remove when we support uploading random (non-executable) files as part of custom-scripts
# -----------------------
mkdir /jupyter
mkdir /jupyter/samples
@ -29,10 +26,10 @@ done
# ----------------------------
# use python v3.5.4 to run aztk software
echo "Starting setup using Docker"
$(pyenv root)/versions/$aztk_python_version/bin/pip install -r $(dirname $0)/requirements.txt
$(pyenv root)/versions/$AZTK_PYTHON_VERSION/bin/pip install -r $(dirname $0)/requirements.txt
echo "Running main.py script"
$(pyenv root)/versions/$aztk_python_version/bin/python $(dirname $0)/main.py install
$(pyenv root)/versions/$AZTK_PYTHON_VERSION/bin/python $(dirname $0)/main.py install
# sleep to keep container running
while true; do sleep 1; done

View file

@ -15,11 +15,7 @@ from install import pick_master
batch_client = config.batch_client
spark_home = "/home/spark-current"
pyspark_driver_python = "/.pyenv/versions/{}/bin/jupyter" \
.format(os.environ["USER_PYTHON_VERSION"])
spark_conf_folder = os.path.join(spark_home, "conf")
default_python_version = os.environ["USER_PYTHON_VERSION"]
def get_pool() -> batchmodels.CloudPool:
return batch_client.pool.get(config.pool_id)
@ -56,81 +52,6 @@ def setup_connection():
master_file.close()
def generate_jupyter_config():
master_node = get_node(config.node_id)
master_node_ip = master_node.ip_address
return dict(
display_name="PySpark",
language="python",
argv=[
"python",
"-m",
"ipykernel",
"-f",
"{connection_file}",
],
env=dict(
SPARK_HOME=spark_home,
PYSPARK_PYTHON="python",
PYSPARK_SUBMIT_ARGS="--master spark://{0}:7077 pyspark-shell" \
.format(master_node_ip),
)
)
def setup_jupyter():
print("Setting up jupyter.")
jupyter_config_file = os.path.join(os.path.expanduser(
"~"), ".jupyter/jupyter_notebook_config.py")
if os.path.isfile(jupyter_config_file):
print("Jupyter config is already set. Skipping setup. \
(Start task is probably reruning after reboot)")
return
generate_jupyter_config_cmd = ["jupyter", "notebook", "--generate-config"]
generate_jupyter_config_cmd.append("--allow-root")
call(generate_jupyter_config_cmd)
jupyter_kernels_path = '/.pyenv/versions/{}/share/jupyter/kernels'. \
format(default_python_version)
with open(jupyter_config_file, "a") as config_file:
config_file.write('\n')
config_file.write('c.NotebookApp.token=""\n')
config_file.write('c.NotebookApp.password=""\n')
shutil.rmtree(jupyter_kernels_path)
os.makedirs(jupyter_kernels_path + '/pyspark', exist_ok=True)
with open(jupyter_kernels_path + '/pyspark/kernel.json', 'w') as outfile:
data = generate_jupyter_config()
json.dump(data, outfile, indent=2)
def start_jupyter():
jupyter_port = config.spark_jupyter_port
pyspark_driver_python_opts = "notebook --no-browser --port='{0}'" \
.format(jupyter_port)
pyspark_driver_python_opts += " --allow-root"
my_env = os.environ.copy()
my_env["PYSPARK_DRIVER_PYTHON"] = pyspark_driver_python
my_env["PYSPARK_DRIVER_PYTHON_OPTS"] = pyspark_driver_python_opts
pyspark_wd = os.path.join(os.getcwd(), "jupyter")
if not os.path.exists(pyspark_wd):
os.mkdir(pyspark_wd)
print("Starting pyspark")
process = Popen([
os.path.join(spark_home, "bin/pyspark")
], env=my_env, cwd=pyspark_wd)
print("Started pyspark with pid {0}".format(process.pid))
def wait_for_master():
print("Waiting for master to be ready.")
master_node_id = pick_master.get_master_node_id(
@ -157,8 +78,6 @@ def start_spark_master():
print("Starting master with '{0}'".format(" ".join(cmd)))
call(cmd)
setup_jupyter()
start_jupyter()
start_history_server()