* updates for kf 0.4.1

* update aks regions and nvidia device plugin

* updates

* update jupyterhub
Sertaç Özercan 2019-02-28 09:56:46 -08:00 committed by Rita Zhang
Parent 8e42cbed2f
Commit 7b23979133
5 changed files with 243 additions and 178 deletions

View file

# Kubernetes
### Prerequisites
- [Docker Basics](../1-docker/README.md)
### Summary
In this module you will learn:
- The basic concepts of Kubernetes
- How to create a Kubernetes cluster on Azure
> _Important_ : Kubernetes is very often abbreviated to **K8s**. This is the name we are going to use in this lab.
## The basic concepts of Kubernetes
[Kubernetes](https://kubernetes.io/) is an open-source technology that makes it easier to automate the deployment, scaling, and management of containerized applications in a clustered environment. GPU support in Kubernetes also lets clusters facilitate frequent experimentation, high-performance model serving, auto-scaling of deep learning workloads, and much more.
### Overview
Kubernetes contains a number of abstractions that represent the state of your system: deployed containerized applications and workloads, their associated network and disk resources, and other information about what your cluster is doing. A Kubernetes object is a "record of intent" – once you create the object, the Kubernetes system will constantly work to ensure that the object exists. By creating an object, you're telling the Kubernetes system your cluster's desired state.
The basic Kubernetes objects include:
- Pod - the smallest and simplest unit in the Kubernetes object model that you create or deploy. A Pod encapsulates an application container (or multiple containers), storage resources, a unique network IP, and options that govern how the container(s) should run.
- Service - an abstraction which defines a logical set of Pods and a policy by which to access them.
- Volume - an abstraction which allows data to be preserved across container restarts and allows data to be shared between different containers.
- Namespace - a way to divide a physical cluster's resources into multiple virtual clusters shared between multiple users.
- Deployment - Manages pods and ensures a certain number of them are running. This is typically used to deploy pods that should always be up, such as a web server.
- Job - A job creates one or more pods and ensures that a specified number of them successfully terminate. In other words, we use Job to run a task that finishes at some point, such as training a model.
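
A handy way to explore the schema of any of these objects from the command line is `kubectl explain`, for example:

```console
kubectl explain pod
kubectl explain job.spec
```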
### Creating a Kubernetes Object
```yaml
[...]
metadata:
  name: nginx-deployment # Name of the deployment
spec: # Actual specifications of this deployment
  replicas: 3 # Number of replicas (instances) for this deployment. 1 replica = 1 pod
  template:
    metadata:
      labels:
        app: nginx
    spec: # Specification for the Pod
      containers: # These are the containers running inside our Pod, in our case a single one
        - name: nginx # Name of this container
          image: nginx:1.7.9 # Image to run
          ports: # Ports to expose
            - containerPort: 80
```
To create all the objects described in a Deployment using a `.yaml` file like the one above in your own Kubernetes cluster, you can use Kubernetes' CLI (`kubectl`).
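
For example, assuming the Deployment above is saved in a file named `nginx-deployment.yaml` (the file name is our choice), you could create it and then check on it with:

```console
kubectl apply -f nginx-deployment.yaml
kubectl get deployments
```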
We will be creating a deployment in the exercise toward the end of this module, but first we need a cluster.
## Provisioning a Kubernetes cluster on Azure
We are going to use AKS to create a GPU-enabled Kubernetes cluster.
You could also use [aks-engine](https://github.com/Azure/aks-engine) if you prefer; this guide will assume you are using AKS.
### A Note on GPUs with Kubernetes
You can view AKS region availability in the [Azure AKS docs](https://docs.microsoft.com/en-us/azure/aks/container-service-quotas#region-availability).
You can find NVIDIA GPU (N-series) availability in the [region availability documentation](https://azure.microsoft.com/en-us/global-infrastructure/services/?products=virtual-machines&regions=all).
If you want more options, you may want to use aks-engine for more flexibility.
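
One way to check which N-series sizes a region offers is to query the Azure CLI (the region below is just an example):

```console
az vm list-skus --location eastus --query "[?contains(name, 'Standard_N')].name" --output table
```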
### With the CLI
#### Creating a resource group
```console
az group create --name <RESOURCE_GROUP_NAME> --location <LOCATION>
```
With:
| Parameter | Description |
| ------------------- | -------------------------------------------------------------- |
| RESOURCE_GROUP_NAME | Name of the resource group where the cluster will be deployed. |
| LOCATION | Name of the region where the cluster should be deployed. |
#### Creating the cluster
```console
az aks create --node-vm-size <AGENT_SIZE> --resource-group <RG> --name <NAME>
--node-count <AGENT_COUNT> --kubernetes-version 1.12.5 --location <LOCATION> --generate-ssh-keys
```
> Note: The available Kubernetes versions may change depending on where you are deploying your cluster. You can get more information by running the `az aks get-versions` command.
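
For example, to list the versions available in a given region (the region is just an illustration):

```console
az aks get-versions --location eastus --output table
```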
With:
| Parameter | Description |
| ----------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| AGENT_SIZE | The size of K8s's agent VM. Choose `Standard_NC6` for GPUs or `Standard_D2_v2` if you just want CPUs. Full list of [options here](https://github.com/Azure/azure-sdk-for-python/blob/master/azure-mgmt-containerservice/azure/mgmt/containerservice/models/container_service_client_enums.py#L21). |
| RG | Name of the resource group that was created in the previous step. |
| NAME | Name of the AKS resource (can be whatever you want). |
| AGENT_COUNT | The number of agents (virtual machines) that you want in your cluster. 3 or 4 is recommended to play with hyper-parameter tuning and distributed TensorFlow |
| LOCATION | Same location that was specified for the resource group creation. |
The command should take a few minutes to complete. Once it is done, the output should be a JSON object indicating among other things the `provisioningState`:
```
{
[...]
}
```

Once the cluster is provisioned, retrieve its credentials so `kubectl` can connect to it:

```console
az aks get-credentials --name <NAME> --resource-group <RG>
```
Where `NAME` and `RG` should be the same values as for the cluster creation.
#### Installing NVIDIA Device Plugin (AKS only)
For AKS, install the NVIDIA Device Plugin using:
For Kubernetes 1.10:
```bash
kubectl apply -f https://raw.githubusercontent.com/nvidia/k8s-device-plugin/v1.10/nvidia-device-plugin.yml
```
For Kubernetes 1.11 and above:
```bash
kubectl apply -f https://raw.githubusercontent.com/nvidia/k8s-device-plugin/v1.11/nvidia-device-plugin.yml
```
For AKS Engine, the NVIDIA Device Plugin will be installed automatically with N-Series GPU clusters.
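
To check that the plugin is running, look for its pods in the `kube-system` namespace (the exact pod names will differ on your cluster):

```console
kubectl get pods -n kube-system | grep nvidia-device-plugin
```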
##### Validation
Once the cluster is created and you have downloaded the `kubeconfig` file, run the following command:
```console
kubectl get nodes
```
Should yield an output similar to this one:
```
NAME                       STATUS    ROLES     AGE       VERSION
aks-nodepool1-42640332-0   Ready     agent     1h        v1.12.5
[...]
aks-nodepool1-42640332-2   Ready     agent     1h        v1.12.5
```
If you provisioned GPU VMs, describing one of the nodes should indicate the presence of GPU(s) on the node:
```console
> kubectl describe node <NODE_NAME>
Capacity:
nvidia.com/gpu: 1
[...]
```

> Note: In some scenarios, you might not see GPU resources under Capacity. To resolve this, you must install a daemonset as described in the troubleshooting section here: https://docs.microsoft.com/en-us/azure/aks/gpu-cluster
## Exercise
> Note: If you didn't complete the exercise in module 1, you can use `wbuchwalter/tf-mnist` image for this exercise.
In module 1, we created an image for our MNIST classifier, ran a small training locally and pushed this image to Docker Hub.
Since we now have a running Kubernetes cluster, let's run our training on it!
First, we need to create a YAML template to define what we want to deploy.
We want our deployment to have a few characteristics:
- It should be a `Job` since we expect the training to finish successfully after some time.
- It should run the image you created in module 1 (or `wbuchwalter/tf-mnist` if you skipped this module).
- The `Job` should be named `2-mnist-training`.
- We want our training to run for `500` steps.
- We want our training to use 1 GPU.
Here is what this would look like in YAML format:
```yaml
[...]
spec:
  template:
    metadata:
      name: module2-ex1 # Name of the pod
    spec:
      containers: # List of containers that should run inside the pod, in our case there is only one.
        - name: tensorflow
          image: ${DOCKER_USERNAME}/tf-mnist:gpu # The image to run, you can replace it with your own.
          args: ["--max_steps", "500"] # Optional arguments to pass to our command. By default the command is defined by ENTRYPOINT in the Dockerfile.
          resources:
            limits:
              nvidia.com/gpu: 1 # We ask Kubernetes to assign 1 GPU to this container
          volumeMounts:
            - name: nvidia
              mountPath: /usr/local/nvidia
      volumes:
        - name: nvidia
          hostPath:
            path: /usr/local/nvidia
      restartPolicy: OnFailure # Restart the pod if it fails
```
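
To start the training, save this template in a file (assumed here to be named `mnist-training.yaml`; the name is up to you) and create the `Job` with `kubectl`:

```console
kubectl create -f mnist-training.yaml
```

Looking at the `Job` with `kubectl get job` should show something like: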
```
NAME          DESIRED   SUCCESSFUL   AGE
module2-ex1   1         0            1m
```
Looking at the Pods:
```console
kubectl get pods
```
You should see your training running:
```bash
NAME READY STATUS RESTARTS AGE
module2-ex1-c5b8q   1/1       Running   0          1m
```

Finally you can look at the logs of your pod with:
```console
kubectl logs <pod-name>
```
> Be careful to use the Pod name (from `kubectl get pods`), not the Job name.
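
If you are unsure which `Pod` belongs to the `Job`, you can filter by the `job-name` label that Kubernetes adds to pods created by a `Job`:

```console
kubectl get pods --selector=job-name=module2-ex1
```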
And you should see the training happening:

```
[...]
Accuracy at step 50: 0.888
```
After a few minutes, looking again at the Job should show that it has completed successfully:
```console
kubectl get job
```
Returns:

```
NAME          DESIRED   SUCCESSFUL   AGE
module2-ex1   1         1            3m
```
Currently our training doesn't do anything interesting. We are not even saving the model and summaries anywhere, but don't worry, we are going to dive into this starting in Module 4.
[Module 3: Helm](../3-helm/README.md)

View file

## Prerequisites
- [1 - Docker](../1-docker/README.md)
- [2 - Kubernetes](../2-kubernetes/README.md)
## Summary
From [Kubeflow](https://github.com/kubeflow/kubeflow)'s own documentation:
> The Kubeflow project is dedicated to making deployments of machine learning (ML) workflows on Kubernetes simple, portable and scalable. Our goal is not to recreate other services, but to provide a straightforward way to deploy best-of-breed open-source systems for ML to diverse infrastructures. Anywhere you are running Kubernetes, you should be able to run Kubeflow.
Kubeflow is composed of multiple components:
- [JupyterHub](https://jupyterhub.readthedocs.io/en/latest/), which allows users to request an instance of a Jupyter Notebook server dedicated to them.
- One or multiple training controllers. These are components that simplify and manage the deployment of training jobs. For the purpose of this lab we are only going to deploy a training controller for TensorFlow jobs. However, the Kubeflow community has started working on controllers for PyTorch and Caffe2 as well.
- A serving component that will help you serve predictions with your models.
For more general info on Kubeflow, head to the repo's [README](https://github.com/kubeflow/kubeflow/blob/master/README.md).
### Deploying Kubeflow
Kubeflow uses [`ksonnet`](https://github.com/ksonnet/ksonnet) templates as a way to package and deploy the different components.
> ksonnet simplifies defining an application configuration, updating the configuration over time, and specializing it for different clusters and environments.
First, install ksonnet version [0.13.1](https://ksonnet.io/#get-started), or you can [download a prebuilt binary](https://github.com/ksonnet/ksonnet/releases/tag/v0.13.1) for your OS.
Then run the following commands to download Kubeflow:
```bash
KUBEFLOW_SRC=kubeflow

mkdir ${KUBEFLOW_SRC}
cd ${KUBEFLOW_SRC}

# Which version of Kubeflow to use
# For a list of releases refer to:
# https://github.com/kubeflow/kubeflow/releases
export KUBEFLOW_TAG=v0.4.1

curl https://raw.githubusercontent.com/kubeflow/kubeflow/${KUBEFLOW_TAG}/scripts/download.sh | bash
```

With:

`KUBEFLOW_SRC`: the directory where you want to download the source to.

`KUBEFLOW_TAG`: a tag corresponding to the version to check out, such as `master` for the latest code.

Then run the following commands to deploy Kubeflow in your Kubernetes cluster:

```bash
# Initialize a kubeflow app
KFAPP=mykubeflowapp
${KUBEFLOW_SRC}/scripts/kfctl.sh init ${KFAPP} --platform none

# Generate kubeflow app
cd ${KFAPP}
${KUBEFLOW_SRC}/scripts/kfctl.sh generate k8s

# Deploy Kubeflow app
${KUBEFLOW_SRC}/scripts/kfctl.sh apply k8s
```
### Validation
Running `kubectl get pods --all-namespaces` (the Kubeflow components live in the `kubeflow` namespace) should return something like this:
```
NAMESPACE   NAME                                                      READY   STATUS    RESTARTS   AGE
kubeflow ambassador-b4d9cdb8-2qgww 1/1 Running 0 111m
kubeflow ambassador-b4d9cdb8-hpwdc 1/1 Running 0 111m
kubeflow ambassador-b4d9cdb8-khg8l 1/1 Running 0 111m
kubeflow argo-ui-6d6658d8f7-t6whw 1/1 Running 0 110m
kubeflow centraldashboard-6f686c5b7c-462cq 1/1 Running 0 111m
kubeflow jupyter-0 1/1 Running 0 111m
kubeflow katib-ui-6c59754c48-mgf62 1/1 Running 0 110m
kubeflow metacontroller-0 1/1 Running 0 111m
kubeflow minio-d79b65988-6qkxp 1/1 Running 0 110m
kubeflow ml-pipeline-66df9d86f6-rp245 1/1 Running 0 110m
kubeflow ml-pipeline-persistenceagent-7b86dbf4b5-rgndj 1/1 Running 0 110m
kubeflow ml-pipeline-scheduledworkflow-84f6477479-9tvhk 1/1 Running 0 110m
kubeflow ml-pipeline-ui-f76bb5f97-2s5qb 1/1 Running 0 110m
kubeflow mysql-ffc889689-xkpxb 1/1 Running 0 110m
kubeflow pytorch-operator-ff46f9b7d-qkbvh 1/1 Running 0 111m
kubeflow spartakus-volunteer-5b6c956c8f-2gnvb 1/1 Running 0 111m
kubeflow studyjob-controller-b7cdbd4cd-nf9z5 1/1 Running 0 110m
kubeflow tf-job-dashboard-7746db84cf-njdzk 1/1 Running 0 111m
kubeflow tf-job-operator-v1beta1-5949f668f7-j5zrn 1/1 Running 0 111m
kubeflow vizier-core-7c56465f6-t6d5p 1/1 Running 0 110m
kubeflow vizier-core-rest-67f588b4cb-lqvgr 1/1 Running 0 110m
kubeflow vizier-db-86dc7d89c5-8vtfs 1/1 Running 0 110m
kubeflow vizier-suggestion-bayesianoptimization-7cb546fb84-tsrn4 1/1 Running 0 110m
kubeflow vizier-suggestion-grid-6587f9d6b-92c9h 1/1 Running 0 110m
kubeflow vizier-suggestion-hyperband-8bb44f8c8-gs72m 1/1 Running 0 110m
kubeflow vizier-suggestion-random-7ff5db687b-bjdh5 1/1 Running 0 110m
kubeflow workflow-controller-cf79dfbff-lv7jk 1/1 Running 0 110m
```
The most important components for the purpose of this lab are `jupyter-0`, which is the JupyterHub spawner running on your cluster, and `tf-job-operator-v1beta1-5949f668f7-j5zrn`, which is a controller that monitors your cluster for new TensorFlow training job (`TFJob`) specifications and manages the training. We will look at these two components later.
### Remove Kubeflow
If you want to remove the Kubeflow deployment, you can run the following to remove the namespace and installed components:
```bash
cd ${KUBEFLOW_SRC}/${KFAPP}
${KUBEFLOW_SRC}/scripts/kfctl.sh delete k8s
```
## Next Step
[5 - JupyterHub](../5-jupyterhub/README.md)

View file

# Jupyter Notebooks on Kubernetes
## Prerequisites
- [1 - Docker Basics](../1-docker)
- [2 - Kubernetes Basics and cluster created](../2-kubernetes)
- [4 - Kubeflow](../4-kubeflow)
## Summary
In this module, you will learn how to:
- Run Jupyter Notebooks locally using Docker
- Run JupyterHub on Kubernetes using Kubeflow
## How Jupyter Notebooks work
The [Jupyter Notebook](http://jupyter.org/) is an open source web application that allows users to create and share documents that contain live code, equations, visualizations, and narrative text for rapid prototyping. It is often used for data cleaning and transformation, numerical simulation, statistical modeling, data visualization, machine learning, and more. To better support exploratory iteration and to accelerate computation of TensorFlow jobs, let's look at how we can include data science tools like Jupyter Notebook with Docker and Kubernetes.
### Exercise 1: Run Jupyter Notebooks locally using Docker

```console
docker run -it -p 8888:8888 tensorflow/tensorflow
```
#### Validation
To verify, browse to the URL in the output log.
For example: `http://localhost:8888/?token=a3ea3cd914c5b68149e2b4a6d0220eca186fec41563c0413`
### Exercise 2: Run JupyterHub on Kubernetes using Kubeflow
In this exercise, we will run JupyterHub to spawn multiple instances of Jupyter Notebooks on a Kubernetes cluster using Kubeflow.
As a prerequisite, you should already have a Kubernetes cluster running (you can follow [module 2 - Kubernetes](../2-kubernetes) to create your own cluster) and you should already have Kubeflow running in your Kubernetes cluster (see [module 4 - Kubeflow and tfjob Basics](../4-kubeflow-tfjob)).
In module 4, you deployed Kubeflow, which already includes JupyterHub and a corresponding load balancer service of type `ClusterIP`. To check its status, run the following kubectl command.
```
NAMESPACE=kubeflow
kubectl get svc -n=${NAMESPACE}
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
...
jupyter-0 ClusterIP None <none> 8000/TCP 132m
jupyter-lb ClusterIP 10.0.191.68 <none> 80/TCP 132m
```
To connect to the Kubeflow dashboard locally:
```bash
kubectl port-forward svc/ambassador -n ${NAMESPACE} 8080:80
```
Then navigate to JupyterHub: http://localhost:8080/hub
To update the default service created for JupyterHub, run the following command to change the service to type LoadBalancer:
```bash
cd ks_app
ks param set jupyter serviceType LoadBalancer
cd ..
${KUBEFLOW_SRC}/scripts/kfctl.sh apply k8s
```
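
Once the service has been updated, you can retrieve its public IP address (it may take a few minutes to be provisioned) with:

```console
kubectl get svc jupyter-lb -n ${NAMESPACE}
```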
Create a new Jupyter Notebook instance:
- open http://localhost:8080/hub/ in your browser (or use the public IP for the service `jupyter-lb`)
- log in using any username and password
- click the "Start My Server" button to sprawn a new Jupyter notebook
- from the image dropdown, select a tensorflow image for your notebook
- for CPU and memory, enter values based on your resource requirements, for example: 1 CPU and 2Gi
- to get available GPUs in your cluster, run the following command:
```
kubectl get nodes "-o=custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com\/gpu"
```
- for GPU, enter values in json format `{"nvidia.com/gpu":"1"}`
- click the "Spawn" button
You can check the status of the notebook server's pod with:

```console
kubectl -n ${NAMESPACE} describe pods jupyter-${USERNAME}
```
After the pod status changes to `running`, you can verify that the new Jupyter notebook is running at: http://127.0.0.1:8000/user/{USERNAME}/tree or http://{PUBLIC-IP}/user/{USERNAME}/tree
## Next Step
[6 - TfJob](../6-tfjob)

View file

## Prerequisites
- [1 - Docker](../1-docker/README.md)
- [2 - Kubernetes](../2-kubernetes/README.md)
- [4 - Kubeflow](../4-kubeflow/README.md)
## Summary
In this module you will learn how to describe a TensorFlow training using the `TFJob` object.
### Kubernetes Custom Resource Definition
Kubernetes has a concept of [Custom Resources](https://kubernetes.io/docs/concepts/api-extension/custom-resources/) (often abbreviated CRD) that allows us to define custom objects that we will then be able to use.
Before going further, let's take a look at what the `TFJob` object looks like:
**`TFJob` Object**
| Field | Type | Description |
| ---------- | ------------------------------------------------------------------------------------------------------------------ | ---------------------------------------------------------------------------------------------- |
| apiVersion | `string` | Versioned schema of this representation of an object. In our case, it's `kubeflow.org/v1beta1` |
| kind | `string` | Value representing the REST resource this object represents. In our case it's `TFJob` |
| metadata | [`ObjectMeta`](https://github.com/kubernetes/community/blob/master/contributors/devel/api-conventions.md#metadata) | Standard object's metadata. |
| spec | `TFJobSpec` | The actual specification of our TensorFlow job, defined below. |
`spec` is the most important part, so let's look at it too:
**`TFJobSpec` Object**
| Field | Type | Description |
| ------------- | --------------------- | -------------------------------------------------------------- |
| TFReplicaSpec | `TFReplicaSpec` array | Specification for a set of TensorFlow processes, defined below |
Let's go deeper:
**`TFReplicaSpec` Object**
| Field | Type | Description |
| ------------- | ------------------------------------------------------------------------------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| TfReplicaType | `string` | What type of replica are we defining? Can be `MASTER`, `WORKER` or `PS`. When not doing distributed TensorFlow, we just use `MASTER` which happens to be the default value. |
| Replicas      | `int`                                                                                       | Number of replicas of `TfReplicaType`. Again this is useful only for distributed TensorFlow. Default value is `1`.                                                                           |
| Template | [`PodTemplateSpec`](https://kubernetes.io/docs/api-reference/v1.8/#podtemplatespec-v1-core) | Describes the pod that will be created when executing a job. This is the standard Pod description that we have been using everywhere. |
Here is what a simple TensorFlow training looks like using this `TFJob` object:
```yaml
apiVersion: kubeflow.org/v1beta1
kind: TFJob
metadata:
name: example-tfjob
spec:
  [...]
```
Note that we are not specifying `TfReplicaType` or `Replicas` as the default values are already what we want.
## Exercises
### Exercise 1: A Simple `TFJob`
Let's schedule a very simple TensorFlow job using `TFJob` first.
When using GPUs, we need to request one (or more), and the image we are using also needs to be based on TensorFlow's GPU image.
```yaml
apiVersion: kubeflow.org/v1beta1
kind: TFJob
metadata:
  name: module6-ex1-gpu
spec:
  tfReplicaSpecs:
    MASTER:
      template:
        spec:
          containers:
            - image: <DOCKER_USERNAME>/tf-mnist:gpu # From module 1
              name: tensorflow
              resources:
                limits:
                  nvidia.com/gpu: 1 # Request 1 GPU for this container
          restartPolicy: OnFailure
```
Save the template that applies to you in a file, and create the `TFJob`:
```console
kubectl create -f <template-path>
```
Let's see what happened. First, a `TFJob` was created:
```console
kubectl get tfjob
```
Returns:
```
NAME AGE
module6-ex1-gpu 5s
```

As well as a `Pod`, which was actually created by the operator:
```console
kubectl get pod
```
Returns:
```
NAME READY STATUS RESTARTS AGE
module6-ex1-master-xs4b-0-6gpfn 1/1 Running 0 2m
```

Note that the `Pod` might take a few minutes before actually running; the Docker image has to be pulled on the node first.
Once the `Pod`'s status is either `Running` or `Completed`, we can start looking at its logs:
```console
kubectl logs <your-pod-name>
```
This container is pretty verbose, but you should see a TensorFlow training happening:
```
[...]
@ -156,13 +160,13 @@ INFO:tensorflow:Final test accuracy = 88.4% (N=353)
[...]
```
> That's great and all, but how do we grab our trained model and TensorFlow's summaries?

Well, currently we can't. As soon as the training is complete, the container stops and everything inside it, including the model and logs, is lost.
Thankfully, Kubernetes `Volumes` can help us here.
If you remember, we quickly introduced `Volumes` in module [2 - Kubernetes](../2-kubernetes/), and that's what we already used to mount the drivers from the node into the container.
But `Volumes` are not just for mounting things from a node; we can also use them to mount many different storage solutions. You can see the full list [here](https://kubernetes.io/docs/concepts/storage/volumes/).
In our case we are going to use Azure Files, as it is really easy to use with Kubernetes.
### Creating a New File Share and Kubernetes Secret
In the official documentation: [Persistent volumes with Azure files](https://docs.microsoft.com/en-us/azure/aks/azure-files-dynamic-pv), follow the steps listed under `Create storage account`, `Create storage class`, and `Create persistent volume claim`.
Be aware of a few details first:
- It is **very** important that you create your storage account in the **same** region and the same resource group (with MC\_ prefix) as your Kubernetes cluster: because Azure File uses the `SMB` protocol it won't work cross-regions. `AKS_PERS_LOCATION` should be updated accordingly.
- While this document specifically refers to AKS, it will work for any K8s cluster
- Once the PVC is created, you will see a new file share under that storage account. All subsequent modules will be writing to that file share.
- PVCs are namespaced, so be sure to create them in the same namespace that is launching the `TFJob` objects.
- If you are using RBAC, you might need to create the cluster role and binding: [see docs here](https://docs.microsoft.com/en-us/azure/aks/azure-files-dynamic-pv#create-a-cluster-role-and-binding)
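
For reference, here is a minimal sketch of what the storage class and persistent volume claim from those steps might look like (names and sizes are assumptions matching this lab; follow the linked documentation for the authoritative steps):

```yaml
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: azurefile
provisioner: kubernetes.io/azure-file
parameters:
  storageAccount: <STORAGE_ACCOUNT_NAME> # The storage account created earlier
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: azurefile # Referenced by the templates in the following modules
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: azurefile
  resources:
    requests:
      storage: 5Gi
```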
Once you have completed all the steps, run:
```console
kubectl get pvc
```
Which should return:
```
NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE
azurefile Bound pvc-346ab93b-4cbf-11e8-9fed-000d3a17b5e9 5Gi RWO azurefile 5m
```
### Updating our example to use our Azure File Share
Now we need to mount our new file share into our container so the model and the summaries can be persisted.
Mounting an Azure File share into a container turns out to be really easy; we simply need to reference our PVC in the `Volume` definition:
```yaml
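# A minimal sketch (assumed layout; the surrounding TFJob fields are omitted).
# The key part is declaring a volume backed by our PVC and mounting it in the container:
[...]
containers:
  - name: tensorflow
    image: <DOCKER_USERNAME>/tf-mnist:gpu
    volumeMounts:
      - name: azurefile # Must match the volume name declared below
        mountPath: /tmp/tensorflow # Where the share appears inside the container
volumes:
  - name: azurefile
    persistentVolumeClaim:
      claimName: azurefile # The PVC we created in the previous step
```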
This means that when we run a training, all the important data is now stored in the Azure File share.
#### Solution for Exercise 2
<details>
<summary><strong>Solution</strong></summary>
```yaml
apiVersion: kubeflow.org/v1beta1
kind: TFJob
metadata:
name: module6-ex2

View file

# This will result in creating 1 TFJob for every pair of learning rate and hidden layer depth
{{- range $i, $lr := $lrlist }}
{{- range $j, $nblayers := $nblayerslist }}
apiVersion: kubeflow.org/v1beta1
kind: TFJob # Each one of our trainings will be a separate TFJob
metadata:
name: module8-tf-paint-{{ $i }}-{{ $j }} # We give a unique name to each training
labels:
chart: "{{ $chartname }}-{{ $chartversion | replace "+" "_" }}"
spec:
tfReplicaSpecs:
MASTER:
template:
[...]
restartPolicy: OnFailure
containers:
- name: tensorflow
image: {{ $image }}
env:
- name: LC_ALL
value: C.UTF-8
args:
# Here we pass a unique learning rate and hidden layer count to each instance.
# We also put the values between quotes to avoid potential formatting issues
- --learning-rate
- {{ $lr | quote }}
- --hidden-layers
- {{ $nblayers | quote }}
[...]
{{ end }}
volumeMounts:
- mountPath: /tmp/tensorflow
subPath: module8-tf-paint # As usual we want to save everything in a separate subdirectory
name: azurefile
volumes:
- name: azurefile
[...]
volumes:
- name: azurefile
persistentVolumeClaim:
claimName: azurefile
containers:
- name: tensorboard
command:
[...]
volumeMounts:
- mountPath: /tmp/tensorflow
subPath: module8-tf-paint
name: azurefile