# Docker
Azure Distributed Data Engineering Toolkit runs Spark on Docker.
Supported Azure Distributed Data Engineering Toolkit images are hosted publicly on [Docker Hub](https://hub.docker.com/r/aztk/spark/).
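For example, assuming you have Docker installed locally, you can pull one of these supported images to inspect it before using it in a cluster:
```sh
# Pull the default AZTK Spark image from Docker Hub (requires a local Docker installation)
docker pull aztk/spark:v0.1.0-spark2.3.0-base

# Verify the image is available locally
docker images aztk/spark
```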
By default, the `aztk/spark:v0.1.0-spark2.3.0-base` image will be used.
To select an image other than the default, you can set your Docker image at cluster creation time with the optional **--docker-repo** parameter:
```sh
aztk spark cluster create ... --docker-repo <name_of_docker_image_repo>
```
For example, if I wanted to use Spark v2.2.0, I could run the following cluster create command:
```sh
aztk spark cluster create ... --docker-repo aztk/spark:v0.1.0-spark2.2.0-base
```
## Using a custom Docker Image
You can build your own Docker image on top of or beneath one of our supported base images _OR_ you can modify the [supported Dockerfiles](https://github.com/Azure/aztk/tree/v0.7.0/docker-image) and build your own image that way.
Once you have your Docker image built and hosted publicly, you can then use the **--docker-repo** parameter in your **aztk spark cluster create** command to point to it.
## Using a custom Docker Image that is Privately Hosted
To use a private Docker image, you will need to provide a Docker username and password that have access to the repository you want to use.
In `.aztk/secrets.yaml`, set up your Docker configuration:
```yaml
docker:
    username: <myusername>
    password: <mypassword>
```
If your private repository is not on Docker Hub (Azure Container Registry, for example), you can provide the endpoint here too:
```yaml
docker:
    username: <myusername>
    password: <mypassword>
    endpoint: <https://my-custom-docker-endpoint.com>
```
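With those credentials in place, point `--docker-repo` at the image in your private registry when creating the cluster. The registry and image names below are hypothetical placeholders:
```sh
# Hypothetical example: pulling the cluster image from a private Azure Container Registry
aztk spark cluster create ... --docker-repo myregistry.azurecr.io/my-aztk-spark:latest
```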
### Building Your Own Docker Image
Building your own Docker image gives you more control over your cluster's environment. For some, this may mean installing specific, and even private, libraries that their Spark jobs require. For others, it may just be setting up a version of Spark, Python, or R that fits their particular needs.
The Azure Distributed Data Engineering Toolkit supports custom Docker images. To guarantee that your Spark deployment works, we recommend that you build on top of one of our supported images.
To build your own image, you can either build _on top of_ or _beneath_ one of our supported images, _OR_ you can just modify one of the supported Dockerfiles to build your own.
### Building on top
You can build on top of our images by referencing the __aztk/spark__ image in the **FROM** keyword of your Dockerfile:
```sh
# Your custom Dockerfile
FROM aztk/spark:v0.1.0-spark2.3.0-base
...
```
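For instance, a hypothetical Dockerfile that layers an extra OS package on top of the supported base image (the package shown is only an illustration) might look like:
```sh
# Hypothetical custom image built on top of a supported AZTK base image
FROM aztk/spark:v0.1.0-spark2.3.0-base

# Install an additional OS-level dependency required by your Spark jobs
RUN apt-get update \
    && apt-get install -y --no-install-recommends libsnappy-dev \
    && rm -rf /var/lib/apt/lists/*
```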
### Building beneath
To build beneath one of our images, modify one of our Dockerfiles so that the **FROM** keyword pulls from your Docker image's location (as opposed to the default which is a base Ubuntu image):
```sh
# One of the Dockerfiles that AZTK supports
# Change the FROM statement to point to your hosted image repo
FROM my_username/my_repo:latest
...
```
Please note that for this method to work, your Docker image must have been built on Ubuntu.
## Custom Docker Image Requirements
If you are building your own custom image and __not__ building on top of a supported image, the following requirements are necessary.
Please make sure that the following environment variables are set:
- AZTK_DOCKER_IMAGE_VERSION
- JAVA_HOME
- SPARK_HOME
You also need to make sure that __PATH__ includes `$SPARK_HOME/bin`:
- PATH=$SPARK_HOME/bin:$PATH
By default, these are set as follows:
```sh
ENV JAVA_HOME /usr/lib/jvm/java-1.8.0-openjdk-amd64
ENV SPARK_HOME /home/spark-current
ENV PATH $SPARK_HOME/bin:$PATH
```
If you are using your own version of Spark, make sure that it is symlinked to `/home/spark-current`. **$SPARK_HOME** must also point to `/home/spark-current`.
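As a rough sketch (the Spark install location below is a hypothetical example), the relevant part of such a Dockerfile might look like:
```sh
# Sketch: wiring a custom Spark build into the layout AZTK expects
# /opt/my-spark is a hypothetical location for your own Spark distribution
RUN ln -s /opt/my-spark /home/spark-current

ENV JAVA_HOME /usr/lib/jvm/java-1.8.0-openjdk-amd64
ENV SPARK_HOME /home/spark-current
ENV PATH $SPARK_HOME/bin:$PATH
# AZTK_DOCKER_IMAGE_VERSION must also be set (see the requirements above)
```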
## Hosting your Docker Image
By default, aztk assumes that your Docker images are publicly hosted on Docker Hub. However, we also support hosting your images privately.
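For publicly hosted images (the default), a minimal sketch of the build-and-publish workflow, reusing the placeholder repository name from above, could be:
```sh
# Build the custom image from your Dockerfile and tag it with your Docker Hub repository
docker build -t my_username/my_repo:latest .

# Push the image so that cluster nodes can pull it
docker push my_username/my_repo:latest

# Reference it at cluster creation time
aztk spark cluster create ... --docker-repo my_username/my_repo:latest
```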
See [here](https://github.com/Azure/aztk/blob/v0.7.0/docs/12-docker-image.md#using-a-custom-docker-image-that-is-privately-hosted) to learn more about using privately hosted Docker Images.