* updates for kf 0.4.1

* update aks regions and nvidia device plugin

* updates

* update jupyterhub
Sertaç Özercan 2019-02-28 09:56:46 -08:00 committed by Rita Zhang
Parent 8e42cbed2f
Commit 7b23979133
5 changed files with 243 additions and 178 deletions

View file

# Kubernetes
### Prerequisites
- [Docker Basics](../1-docker/README.md)
### Summary
In this module you will learn:
- The basic concepts of Kubernetes
- How to create a Kubernetes cluster on Azure
> _Important_ : Kubernetes is very often abbreviated to **K8s**. This is the name we are going to use in this lab.
## The basic concepts of Kubernetes
[Kubernetes](https://kubernetes.io/) is an open-source technology that makes it easier to automate the deployment, scaling, and management of containerized applications in a clustered environment. GPU support in Kubernetes also lets clusters facilitate frequent experimentation, high-performance model serving, auto-scaling of deep learning workloads, and much more.
### Overview
Kubernetes contains a number of abstractions that represent the state of your system: deployed containerized applications and workloads, their associated network and disk resources, and other information about what your cluster is doing. A Kubernetes object is a "record of intent" – once you create the object, the Kubernetes system will constantly work to ensure that the object exists. By creating an object, you're telling the Kubernetes system your cluster's desired state.
The basic Kubernetes objects include:
- Pod - the smallest and simplest unit in the Kubernetes object model that you create or deploy. A Pod encapsulates an application container (or multiple containers), storage resources, a unique network IP, and options that govern how the container(s) should run.
- Service - an abstraction which defines a logical set of Pods and a policy by which to access them.
- Volume - an abstraction which allows data to be preserved across container restarts and allows data to be shared between different containers.
- Namespace - a way to divide a physical cluster's resources into multiple virtual clusters shared between multiple users.
- Deployment - Manages pods and ensures a certain number of them are running. This is typically used to deploy pods that should always be up, such as a web server.
- Job - A job creates one or more pods and ensures that a specified number of them successfully terminate. In other words, we use Job to run a task that finishes at some point, such as training a model.
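
A handy way to explore the schema of any of these objects from the command line is `kubectl explain`, for example:

```console
kubectl explain pod
kubectl explain job.spec
```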
### Creating a Kubernetes Object
```yaml
[...]
metadata:
  name: nginx-deployment # Name of the deployment
spec: # Actual specifications of this deployment
  replicas: 3 # Number of replicas (instances) for this deployment. 1 replica = 1 pod
  template:
    metadata:
      labels:
        app: nginx
    spec: # Specification for the Pod
      containers: # These are the containers running inside our Pod, in our case a single one
        - name: nginx # Name of this container
          image: nginx:1.7.9 # Image to run
          ports: # Ports to expose
            - containerPort: 80
```
To create all the objects described in a Deployment using a `.yaml` file like the one above in your own Kubernetes cluster, you can use Kubernetes' CLI (`kubectl`).
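
For example, assuming the Deployment above is saved in a file named `nginx-deployment.yaml` (the file name is our choice), you could create it and then check on it with:

```console
kubectl apply -f nginx-deployment.yaml
kubectl get deployments
```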
We will be creating a deployment in the exercise toward the end of this module, but first we need a cluster.
## Provisioning a Kubernetes cluster on Azure
We are going to use AKS to create a GPU-enabled Kubernetes cluster.
You could also use [aks-engine](https://github.com/Azure/aks-engine) if you prefer; this guide will assume you are using AKS.
### A Note on GPUs with Kubernetes
You can view AKS region availability in the [Azure AKS docs](https://docs.microsoft.com/en-us/azure/aks/container-service-quotas#region-availability).
You can find NVIDIA GPU (N-series) availability in the [region availability documentation](https://azure.microsoft.com/en-us/global-infrastructure/services/?products=virtual-machines&regions=all).
If you want more options, you may want to use aks-engine for more flexibility.
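
One way to check which N-series sizes a region offers is to query the Azure CLI (the region below is just an example):

```console
az vm list-skus --location eastus --query "[?contains(name, 'Standard_N')].name" --output table
```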
### With the CLI
#### Creating a resource group
```console
az group create --name <RESOURCE_GROUP_NAME> --location <LOCATION>
```
With:
| Parameter | Description |
| ------------------- | -------------------------------------------------------------- |
| RESOURCE_GROUP_NAME | Name of the resource group where the cluster will be deployed. |
| LOCATION | Name of the region where the cluster should be deployed. |
#### Creating the cluster
```console
az aks create --node-vm-size <AGENT_SIZE> --resource-group <RG> --name <NAME>
--node-count <AGENT_COUNT> --kubernetes-version 1.12.5 --location <LOCATION> --generate-ssh-keys
```
> Note: The available Kubernetes versions may change depending on where you are deploying your cluster. You can get more information by running the `az aks get-versions` command.
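
For example, to list the versions available in a given region (the region is just an illustration):

```console
az aks get-versions --location eastus --output table
```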
With:
| Parameter | Description |
| ----------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| AGENT_SIZE | The size of K8s's agent VM. Choose `Standard_NC6` for GPUs or `Standard_D2_v2` if you just want CPUs. Full list of [options here](https://github.com/Azure/azure-sdk-for-python/blob/master/azure-mgmt-containerservice/azure/mgmt/containerservice/models/container_service_client_enums.py#L21). |
| RG | Name of the resource group that was created in the previous step. |
| NAME | Name of the AKS resource (can be whatever you want). |
| AGENT_COUNT | The number of agents (virtual machines) that you want in your cluster. 3 or 4 is recommended to play with hyper-parameter tuning and distributed TensorFlow |
| LOCATION | Same location that was specified for the resource group creation. |
The command should take a few minutes to complete. Once it is done, the output should be a JSON object indicating among other things the `provisioningState`:
```
{
[...]
}
```

Once the cluster is provisioned, retrieve its credentials so `kubectl` can connect to it:

```console
az aks get-credentials --name <NAME> --resource-group <RG>
```
Where `NAME` and `RG` should be the same values as for the cluster creation.
#### Installing NVIDIA Device Plugin (AKS only)
For AKS, install the NVIDIA Device Plugin using:
For Kubernetes 1.10:
```bash
kubectl apply -f https://raw.githubusercontent.com/nvidia/k8s-device-plugin/v1.10/nvidia-device-plugin.yml
```
For Kubernetes 1.11 and above:
```bash
kubectl apply -f https://raw.githubusercontent.com/nvidia/k8s-device-plugin/v1.11/nvidia-device-plugin.yml
```
For AKS Engine, the NVIDIA Device Plugin will be installed automatically with N-Series GPU clusters.
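
To check that the plugin is running, look for its pods in the `kube-system` namespace (the exact pod names will differ on your cluster):

```console
kubectl get pods -n kube-system | grep nvidia-device-plugin
```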
##### Validation
Once the cluster is created and you have downloaded the `kubeconfig` file, run the following command:
```console
kubectl get nodes
```
Should yield an output similar to this one:
```
NAME                       STATUS    ROLES     AGE       VERSION
aks-nodepool1-42640332-0   Ready     agent     1h        v1.12.5
[...]
aks-nodepool1-42640332-2   Ready     agent     1h        v1.12.5
```
If you provisioned GPU VMs, describing one of the nodes should indicate the presence of GPU(s) on the node:
```console
> kubectl describe node <NODE_NAME>
Capacity:
nvidia.com/gpu: 1
[...]
```

> Note: In some scenarios, you might not see GPU resources under Capacity. To resolve this, you must install a daemonset as described in the troubleshooting section here: https://docs.microsoft.com/en-us/azure/aks/gpu-cluster
## Exercise
> Note: If you didn't complete the exercise in module 1, you can use `wbuchwalter/tf-mnist` image for this exercise.
In module 1, we created an image for our MNIST classifier, ran a small training locally and pushed this image to Docker Hub.
Since we now have a running Kubernetes cluster, let's run our training on it!
First, we need to create a YAML template to define what we want to deploy.
We want our deployment to have a few characteristics:
- It should be a `Job` since we expect the training to finish successfully after some time.
- It should run the image you created in module 1 (or `wbuchwalter/tf-mnist` if you skipped this module).
- The `Job` should be named `2-mnist-training`.
- We want our training to run for `500` steps.
- We want our training to use 1 GPU.
Here is what this would look like in YAML format:
```yaml
[...]
spec:
  template:
    metadata:
      name: module2-ex1 # Name of the pod
    spec:
      containers: # List of containers that should run inside the pod, in our case there is only one.
        - name: tensorflow
          image: ${DOCKER_USERNAME}/tf-mnist:gpu # The image to run, you can replace it with your own.
          args: ["--max_steps", "500"] # Optional arguments to pass to our command. By default the command is defined by ENTRYPOINT in the Dockerfile.
          resources:
            limits:
              nvidia.com/gpu: 1 # We ask Kubernetes to assign 1 GPU to this container
          volumeMounts:
            - name: nvidia
              mountPath: /usr/local/nvidia
      volumes:
        - name: nvidia
          hostPath:
            path: /usr/local/nvidia
      restartPolicy: OnFailure # Restart the pod if it fails
```
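
To start the training, save this template in a file (assumed here to be named `mnist-training.yaml`; the name is up to you) and create the `Job` with `kubectl`:

```console
kubectl create -f mnist-training.yaml
```

Looking at the `Job` with `kubectl get job` should show something like: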
```
NAME          DESIRED   SUCCESSFUL   AGE
module2-ex1   1         0            1m
```
Looking at the Pods:
```console
kubectl get pods
```
You should see your training running:
```bash
NAME READY STATUS RESTARTS AGE
module2-ex1-c5b8q   1/1       Running   0          1m
```

Finally you can look at the logs of your pod with:
```console
kubectl logs <pod-name>
```
> Be careful to use the Pod name (from `kubectl get pods`), not the Job name.
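
If you are unsure which `Pod` belongs to the `Job`, you can filter by the `job-name` label that Kubernetes adds to pods created by a `Job`:

```console
kubectl get pods --selector=job-name=module2-ex1
```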
And you should see the training happening:

```
[...]
Accuracy at step 50: 0.888
```
After a few minutes, looking again at the Job should show that it has completed successfully:
```console
kubectl get job
```
Returns:

```
NAME          DESIRED   SUCCESSFUL   AGE
module2-ex1   1         1            3m
```
Currently our training doesn't do anything interesting. We are not even saving the model and summaries anywhere, but don't worry, we are going to dive into this starting in Module 4.
[Module 3: Helm](../3-helm/README.md)

View file

## Prerequisites
- [1 - Docker](../1-docker/README.md)
- [2 - Kubernetes](../2-kubernetes/README.md)
## Summary
From [Kubeflow](https://github.com/kubeflow/kubeflow)'s own documentation:
> The Kubeflow project is dedicated to making deployments of machine learning (ML) workflows on Kubernetes simple, portable and scalable. Our goal is not to recreate other services, but to provide a straightforward way to deploy best-of-breed open-source systems for ML to diverse infrastructures. Anywhere you are running Kubernetes, you should be able to run Kubeflow.
Kubeflow is composed of multiple components:
- [JupyterHub](https://jupyterhub.readthedocs.io/en/latest/), which allows users to request an instance of a Jupyter Notebook server dedicated to them.
- One or multiple training controllers. These are components that simplify and manage the deployment of training jobs. For the purpose of this lab we are only going to deploy a training controller for TensorFlow jobs. However, the Kubeflow community has started working on controllers for PyTorch and Caffe2 as well.
- A serving component that will help you serve predictions with your models.
For more general info on Kubeflow, head to the repo's [README](https://github.com/kubeflow/kubeflow/blob/master/README.md).
### Deploying Kubeflow
Kubeflow uses [`ksonnet`](https://github.com/ksonnet/ksonnet) templates as a way to package and deploy the different components.
> ksonnet simplifies defining an application configuration, updating the configuration over time, and specializing it for different clusters and environments.
First, install ksonnet version [0.13.1](https://ksonnet.io/#get-started), or you can [download a prebuilt binary](https://github.com/ksonnet/ksonnet/releases/tag/v0.13.1) for your OS.
Then run the following commands to download Kubeflow:
```bash
KUBEFLOW_SRC=kubeflow

mkdir ${KUBEFLOW_SRC}
cd ${KUBEFLOW_SRC}

# Which version of Kubeflow to use
# For a list of releases refer to:
# https://github.com/kubeflow/kubeflow/releases
export KUBEFLOW_TAG=v0.4.1

curl https://raw.githubusercontent.com/kubeflow/kubeflow/${KUBEFLOW_TAG}/scripts/download.sh | bash
```

With:

`KUBEFLOW_SRC`: the directory where you want to download the source to.

`KUBEFLOW_TAG`: a tag corresponding to the version to check out, such as `master` for the latest code.

Then run the following commands to deploy Kubeflow in your Kubernetes cluster:

```bash
# Initialize a kubeflow app
KFAPP=mykubeflowapp
${KUBEFLOW_SRC}/scripts/kfctl.sh init ${KFAPP} --platform none

# Generate kubeflow app
cd ${KFAPP}
${KUBEFLOW_SRC}/scripts/kfctl.sh generate k8s

# Deploy Kubeflow app
${KUBEFLOW_SRC}/scripts/kfctl.sh apply k8s
```
### Validation
Running `kubectl get pods --all-namespaces` (the Kubeflow components live in the `kubeflow` namespace) should return something like this:
```
NAMESPACE   NAME                                                      READY   STATUS    RESTARTS   AGE
kubeflow ambassador-b4d9cdb8-2qgww 1/1 Running 0 111m
kubeflow ambassador-b4d9cdb8-hpwdc 1/1 Running 0 111m
kubeflow ambassador-b4d9cdb8-khg8l 1/1 Running 0 111m
kubeflow argo-ui-6d6658d8f7-t6whw 1/1 Running 0 110m
kubeflow centraldashboard-6f686c5b7c-462cq 1/1 Running 0 111m
kubeflow jupyter-0 1/1 Running 0 111m
kubeflow katib-ui-6c59754c48-mgf62 1/1 Running 0 110m
kubeflow metacontroller-0 1/1 Running 0 111m
kubeflow minio-d79b65988-6qkxp 1/1 Running 0 110m
kubeflow ml-pipeline-66df9d86f6-rp245 1/1 Running 0 110m
kubeflow ml-pipeline-persistenceagent-7b86dbf4b5-rgndj 1/1 Running 0 110m
kubeflow ml-pipeline-scheduledworkflow-84f6477479-9tvhk 1/1 Running 0 110m
kubeflow ml-pipeline-ui-f76bb5f97-2s5qb 1/1 Running 0 110m
kubeflow mysql-ffc889689-xkpxb 1/1 Running 0 110m
kubeflow pytorch-operator-ff46f9b7d-qkbvh 1/1 Running 0 111m
kubeflow spartakus-volunteer-5b6c956c8f-2gnvb 1/1 Running 0 111m
kubeflow studyjob-controller-b7cdbd4cd-nf9z5 1/1 Running 0 110m
kubeflow tf-job-dashboard-7746db84cf-njdzk 1/1 Running 0 111m
kubeflow tf-job-operator-v1beta1-5949f668f7-j5zrn 1/1 Running 0 111m
kubeflow vizier-core-7c56465f6-t6d5p 1/1 Running 0 110m
kubeflow vizier-core-rest-67f588b4cb-lqvgr 1/1 Running 0 110m
kubeflow vizier-db-86dc7d89c5-8vtfs 1/1 Running 0 110m
kubeflow vizier-suggestion-bayesianoptimization-7cb546fb84-tsrn4 1/1 Running 0 110m
kubeflow vizier-suggestion-grid-6587f9d6b-92c9h 1/1 Running 0 110m
kubeflow vizier-suggestion-hyperband-8bb44f8c8-gs72m 1/1 Running 0 110m
kubeflow vizier-suggestion-random-7ff5db687b-bjdh5 1/1 Running 0 110m
kubeflow workflow-controller-cf79dfbff-lv7jk 1/1 Running 0 110m
```
The most important components for the purpose of this lab are `jupyter-0`, which is the JupyterHub spawner running on your cluster, and `tf-job-operator-v1beta1-5949f668f7-j5zrn`, which is a controller that monitors your cluster for new TensorFlow training job (`TFJob`) specifications and manages the training. We will look at these two components later.
### Remove Kubeflow
If you want to remove the Kubeflow deployment, you can run the following to remove the namespace and installed components:
```bash
cd ${KUBEFLOW_SRC}/${KFAPP}
${KUBEFLOW_SRC}/scripts/kfctl.sh delete k8s
```
## Next Step
[5 - JupyterHub](../5-jupyterhub/README.md)

View file

# Jupyter Notebooks on Kubernetes
## Prerequisites
- [1 - Docker Basics](../1-docker)
- [2 - Kubernetes Basics and cluster created](../2-kubernetes)
- [4 - Kubeflow](../4-kubeflow)
## Summary
In this module, you will learn how to:
- Run Jupyter Notebooks locally using Docker
- Run JupyterHub on Kubernetes using Kubeflow
## How Jupyter Notebooks work
The [Jupyter Notebook](http://jupyter.org/) is an open source web application that allows users to create and share documents that contain live code, equations, visualizations, and narrative text for rapid prototyping. It is often used for data cleaning and transformation, numerical simulation, statistical modeling, data visualization, machine learning, and more. To better support exploratory iteration and to accelerate computation of TensorFlow jobs, let's look at how we can include data science tools like Jupyter Notebook with Docker and Kubernetes.
### Exercise 1: Run Jupyter Notebooks locally using Docker

```console
docker run -it -p 8888:8888 tensorflow/tensorflow
```
#### Validation
To verify, browse to the URL in the output log.
For example: `http://localhost:8888/?token=a3ea3cd914c5b68149e2b4a6d0220eca186fec41563c0413`
### Exercise 2: Run JupyterHub on Kubernetes using Kubeflow
In this exercise, we will run JupyterHub to spawn multiple instances of Jupyter Notebooks on a Kubernetes cluster using Kubeflow.
As a prerequisite, you should already have a Kubernetes cluster running (you can follow [module 2 - Kubernetes](../2-kubernetes) to create your own cluster) and you should already have Kubeflow running in your Kubernetes cluster (see [module 4 - Kubeflow and tfjob Basics](../4-kubeflow-tfjob)).
In module 4, you deployed Kubeflow, which already includes JupyterHub and a corresponding load balancer service of type `ClusterIP`. To check its status, run the following kubectl command.
```
NAMESPACE=kubeflow
kubectl get svc -n=${NAMESPACE}
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
...
jupyter-0 ClusterIP None <none> 8000/TCP 132m
jupyter-lb ClusterIP 10.0.191.68 <none> 80/TCP 132m
```
To connect to the Kubeflow dashboard locally:
```bash
kubectl port-forward svc/ambassador -n ${NAMESPACE} 8080:80
```
Then navigate to JupyterHub: http://localhost:8080/hub
To update the default service created for JupyterHub, run the following command to change the service to type LoadBalancer:
```bash
cd ks_app
ks param set jupyter serviceType LoadBalancer
cd ..
${KUBEFLOW_SRC}/scripts/kfctl.sh apply k8s
```
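
Once the service has been updated, you can retrieve its public IP address (it may take a few minutes to be provisioned) with:

```console
kubectl get svc jupyter-lb -n ${NAMESPACE}
```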
Create a new Jupyter Notebook instance:
- open http://localhost:8080/hub/ in your browser (or use the public IP for the service `jupyter-lb`)
- log in using any username and password
- click the "Start My Server" button to sprawn a new Jupyter notebook
- from the image dropdown, select a tensorflow image for your notebook
- for CPU and memory, enter values based on your resource requirements, for example: 1 CPU and 2Gi
- to get available GPUs in your cluster, run the following command:
```
kubectl get nodes "-o=custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com\/gpu"
```
- for GPU, enter values in json format `{"nvidia.com/gpu":"1"}`
- click the "Spawn" button
You can check the status of the notebook server's pod with:

```console
kubectl -n ${NAMESPACE} describe pods jupyter-${USERNAME}
```
After the pod status changes to `running`, you can verify that the new Jupyter notebook is running at: http://127.0.0.1:8000/user/{USERNAME}/tree or http://{PUBLIC-IP}/user/{USERNAME}/tree
## Next Step
[6 - TfJob](../6-tfjob)

View file

## Prerequisites
- [1 - Docker](../1-docker/README.md)
- [2 - Kubernetes](../2-kubernetes/README.md)
- [4 - Kubeflow](../4-kubeflow/README.md)
## Summary
In this module you will learn how to describe a TensorFlow training using the `TFJob` object.
### Kubernetes Custom Resource Definition
Kubernetes has a concept of [Custom Resources](https://kubernetes.io/docs/concepts/api-extension/custom-resources/) (often abbreviated CRD) that allows us to define custom objects that we will then be able to use.
Before going further, let's take a look at what the `TFJob` object looks like:
**`TFJob` Object**
| Field | Type | Description |
| ---------- | ------------------------------------------------------------------------------------------------------------------ | ---------------------------------------------------------------------------------------------- |
| apiVersion | `string` | Versioned schema of this representation of an object. In our case, it's `kubeflow.org/v1beta1` |
| kind | `string` | Value representing the REST resource this object represents. In our case it's `TFJob` |
| metadata | [`ObjectMeta`](https://github.com/kubernetes/community/blob/master/contributors/devel/api-conventions.md#metadata) | Standard object's metadata. |
| spec | `TFJobSpec` | The actual specification of our TensorFlow job, defined below. |
`spec` is the most important part, so let's look at it too:
**`TFJobSpec` Object**
| Field | Type | Description |
| ------------- | --------------------- | -------------------------------------------------------------- |
| TFReplicaSpec | `TFReplicaSpec` array | Specification for a set of TensorFlow processes, defined below |
Let's go deeper:
**`TFReplicaSpec` Object**
| Field | Type | Description |
| ------------- | ------------------------------------------------------------------------------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| TfReplicaType | `string` | What type of replica are we defining? Can be `MASTER`, `WORKER` or `PS`. When not doing distributed TensorFlow, we just use `MASTER` which happens to be the default value. |
| Replicas      | `int`                                                                                       | Number of replicas of `TfReplicaType`. Again this is useful only for distributed TensorFlow. Default value is `1`.                                                                           |
| Template | [`PodTemplateSpec`](https://kubernetes.io/docs/api-reference/v1.8/#podtemplatespec-v1-core) | Describes the pod that will be created when executing a job. This is the standard Pod description that we have been using everywhere. |
Here is what a simple TensorFlow training looks like using this `TFJob` object:
```yaml
apiVersion: kubeflow.org/v1beta1
kind: TFJob
metadata:
name: example-tfjob
spec:
  [...]
```
Note that we are not specifying `TfReplicaType` or `Replicas` as the default values are already what we want.
## Exercises
### Exercise 1: A Simple `TFJob`
Let's schedule a very simple TensorFlow job using `TFJob` first.
When using GPUs, we need to request one (or more), and the image we are using also needs to be based on TensorFlow's GPU image.
```yaml
apiVersion: kubeflow.org/v1beta1
kind: TFJob
metadata:
  name: module6-ex1-gpu
spec:
  tfReplicaSpecs:
    MASTER:
      template:
        spec:
          containers:
            - image: <DOCKER_USERNAME>/tf-mnist:gpu # From module 1
              name: tensorflow
              resources:
                limits:
                  nvidia.com/gpu: 1 # Request 1 GPU for this container
          restartPolicy: OnFailure
```
Save the template that applies to you in a file, and create the `TFJob`:
```console
kubectl create -f <template-path>
```
Let's see what happened. First, a `TFJob` was created:
```console
kubectl get tfjob
```
Returns:
```
NAME AGE
module6-ex1-gpu 5s
```

As well as a `Pod`, which was actually created by the operator:
```console
kubectl get pod
```
Returns:
```
NAME READY STATUS RESTARTS AGE
module6-ex1-master-xs4b-0-6gpfn 1/1 Running 0 2m
```

Note that the `Pod` might take a few minutes before actually running; the Docker image has to be pulled on the node first.
Once the `Pod`'s status is either `Running` or `Completed`, we can start looking at its logs:
```console
kubectl logs <your-pod-name>
```
This container is pretty verbose, but you should see a TensorFlow training happening:
```
[...]
@ -156,13 +160,13 @@ INFO:tensorflow:Final test accuracy = 88.4% (N=353)
[...]
```
> That's great and all, but how do we grab our trained model and TensorFlow's summaries?

Well, currently we can't. As soon as the training is complete, the container stops and everything inside it, including the model and logs, is lost.
Thankfully, Kubernetes `Volumes` can help us here.
If you remember, we quickly introduced `Volumes` in module [2 - Kubernetes](../2-kubernetes/), and that's what we already used to mount the drivers from the node into the container.
But `Volumes` are not just for mounting things from a node; we can also use them to mount many different storage solutions. You can see the full list [here](https://kubernetes.io/docs/concepts/storage/volumes/).
In our case we are going to use Azure Files, as it is really easy to use with Kubernetes.
### Creating a New File Share and Kubernetes Secret
In the official documentation: [Persistent volumes with Azure files](https://docs.microsoft.com/en-us/azure/aks/azure-files-dynamic-pv), follow the steps listed under `Create storage account`, `Create storage class`, and `Create persistent volume claim`.
Be aware of a few details first:
- It is **very** important that you create your storage account in the **same** region and the same resource group (with MC\_ prefix) as your Kubernetes cluster: because Azure File uses the `SMB` protocol it won't work cross-regions. `AKS_PERS_LOCATION` should be updated accordingly.
- While this document specifically refers to AKS, it will work for any K8s cluster
- Once the PVC is created, you will see a new file share under that storage account. All subsequent modules will be writing to that file share.
- PVCs are namespaced, so be sure to create them in the same namespace that is launching the `TFJob` objects.
- If you are using RBAC, you might need to create the cluster role and binding: [see docs here](https://docs.microsoft.com/en-us/azure/aks/azure-files-dynamic-pv#create-a-cluster-role-and-binding)
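
For reference, here is a minimal sketch of what the storage class and persistent volume claim from those steps might look like (names and sizes are assumptions matching this lab; follow the linked documentation for the authoritative steps):

```yaml
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: azurefile
provisioner: kubernetes.io/azure-file
parameters:
  storageAccount: <STORAGE_ACCOUNT_NAME> # The storage account created earlier
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: azurefile # Referenced by the templates in the following modules
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: azurefile
  resources:
    requests:
      storage: 5Gi
```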
Once you have completed all the steps, run:
```console
kubectl get pvc
```
Which should return:
```
NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE
azurefile Bound pvc-346ab93b-4cbf-11e8-9fed-000d3a17b5e9 5Gi RWO azurefile 5m
```
### Updating our example to use our Azure File Share
Now we need to mount our new file share into our container so the model and the summaries can be persisted.
Mounting an Azure File share into a container turns out to be really easy; we simply need to reference our PVC in the `Volume` definition:
```yaml
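# A minimal sketch (assumed layout; the surrounding TFJob fields are omitted).
# The key part is declaring a volume backed by our PVC and mounting it in the container:
[...]
containers:
  - name: tensorflow
    image: <DOCKER_USERNAME>/tf-mnist:gpu
    volumeMounts:
      - name: azurefile # Must match the volume name declared below
        mountPath: /tmp/tensorflow # Where the share appears inside the container
volumes:
  - name: azurefile
    persistentVolumeClaim:
      claimName: azurefile # The PVC we created in the previous step
```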
This means that when we run a training, all the important data is now stored in the Azure File share.
#### Solution for Exercise 2
<details>
<summary><strong>Solution</strong></summary>
```yaml
apiVersion: kubeflow.org/v1beta1
kind: TFJob
metadata:
name: module6-ex2

View file

# This will result in creating 1 TFJob for every pair of learning rate and hidden layer depth
{{- range $i, $lr := $lrlist }}
{{- range $j, $nblayers := $nblayerslist }}
apiVersion: kubeflow.org/v1beta1
kind: TFJob # Each one of our trainings will be a separate TFJob
metadata:
name: module8-tf-paint-{{ $i }}-{{ $j }} # We give a unique name to each training
labels:
chart: "{{ $chartname }}-{{ $chartversion | replace "+" "_" }}"
spec:
tfReplicaSpecs:
MASTER:
template:
[...]
restartPolicy: OnFailure
containers:
- name: tensorflow
image: {{ $image }}
env:
- name: LC_ALL
value: C.UTF-8
args:
# Here we pass a unique learning rate and hidden layer count to each instance.
# We also put the values between quotes to avoid potential formatting issues
- --learning-rate
- {{ $lr | quote }}
- --hidden-layers
- {{ $nblayers | quote }}
[...]
{{ end }}
volumeMounts:
- mountPath: /tmp/tensorflow
subPath: module8-tf-paint # As usual we want to save everything in a separate subdirectory
name: azurefile
volumes:
- name: azurefile
[...]
volumes:
- name: azurefile
persistentVolumeClaim:
claimName: azurefile
containers:
- name: tensorboard
command:
[...]
volumeMounts:
- mountPath: /tmp/tensorflow
subPath: module8-tf-paint
name: azurefile