Add lab for installing KF + refactor TfJob (#9)
* add lab for installing KF
* add kubeflow section
* refactor tfjob module
* update gpus
* remove tensorboard
* fix GPU inconsistencies
* update all indexes
* reviewers comments
@ -1,5 +1,9 @@
# Introduction

This lab will walk you through setting up [Kubeflow](https://github.com/kubeflow/kubeflow) on a Kubernetes cluster on Azure Container Service (AKS).

We will then take a look at how to use the different components that make up Kubeflow.

## Motivations

Machine learning model development and operationalization currently have very few industry-wide best practices to help us reduce the time to market and optimize the different steps.

@ -235,7 +235,7 @@ Most importantly we want to be able to reuse this image on the Kubernetes cluster
So let's push our image to Docker Hub:

```console
docker push ${DOCKER_USERNAME}/tf-mnist
docker push ${DOCKER_USERNAME}/tf-mnist:gpu
```

If these commands don't look familiar to you, make sure you went through parts 1 and 2 of Docker's tutorial, and more precisely: [Tutorial - Share your image](https://docs.docker.com/get-started/part2/#share-your-image)
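If a push is rejected, it is usually an authentication or tagging issue. These standard Docker commands let you confirm that you are logged in and that both tags exist locally before retrying:

```console
> docker login
> docker images ${DOCKER_USERNAME}/tf-mnist
```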
Before Width: | Height: | Size: 219 KiB After Width: | Height: | Size: 219 KiB
@ -77,8 +77,6 @@ You could also use [acs-engine](https://github.com/Azure/acs-engine) if you prefer

As of this writing, GPUs are available for AKS in the `eastus` and `westeurope` regions. If you want more options, you may want to use acs-engine for more flexibility.

Only module 3 has an exercise that is specific to GPU VMs; all other modules can be followed on either CPU or GPU clusters, so if you are on a budget, feel free to create a CPU cluster instead.

### With the CLI

#### Creating a resource group

@ -172,6 +170,7 @@ We want our deployment to have a few characteristics:
* It should run the image you created in module 1 (or `wbuchwalter/tf-mnist` if you skipped this module).
* The `Job` should be named `2-mnist-training`.
* We want our training to run for `500` steps.
* We want our training to use 1 GPU.

Here is what this would look like in YAML format:

@ -187,8 +186,11 @@ spec:
    spec:
      containers: # List of containers that should run inside the pod; in our case there is only one.
      - name: tensorflow
        image: ${DOCKER_USERNAME}/tf-mnist:gpu # The image to run; you can replace it with your own.
        args: ["--max_steps", "500"] # Optional arguments to pass to our command. By default the command is defined by ENTRYPOINT in the Dockerfile.
        resources:
          limits:
            alpha.kubernetes.io/nvidia-gpu: 1 # We ask Kubernetes to assign 1 GPU to this container
      restartPolicy: OnFailure # Restart the pod if it fails
```
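
Note that `kubectl` does not expand `${DOCKER_USERNAME}` itself; the placeholder has to be resolved before the manifest is submitted. A minimal sketch using `sed` (the `janedoe` username and the `job.yaml` file name are placeholders for illustration):

```shell
# Placeholder username; replace with your Docker Hub username.
DOCKER_USERNAME=janedoe

# Write a one-line stand-in for the manifest above, then resolve the placeholder.
cat > job.yaml <<'EOF'
    image: ${DOCKER_USERNAME}/tf-mnist:gpu
EOF
sed "s|\${DOCKER_USERNAME}|${DOCKER_USERNAME}|" job.yaml > job-resolved.yaml
cat job-resolved.yaml
```

The resolved file can then be submitted with `kubectl create -f job-resolved.yaml`.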
@ -195,4 +195,4 @@ You should see the following web page from your deployment:

## Next Step

[4 - Kubeflow](../4-kubeflow/README.md)

@ -1,540 +0,0 @@
# `tensorflow/k8s` and `TFJob`

## Prerequisites

* [3 - Helm](../3-helm/README.md)
* [4 - GPUs](../4-gpus/README.md)

## Summary

In this module you will learn how [`tensorflow/k8s`](https://github.com/tensorflow/k8s) can greatly simplify our lives when running TensorFlow on Kubernetes.

## `tensorflow/k8s`

As we saw earlier, giving a container access to a GPU is not exactly a breeze on Kubernetes: we need to manually mount the drivers from the node into the container.
If you have already tried to run a distributed TensorFlow training, you know that it's not easy either. Getting the `ClusterSpec` right can be painful if you have more than a couple of VMs, and it's also quite brittle (we will look more into distributed TensorFlow in module [6 - Distributed TensorFlow](../6-distributed-tensorflow/README.md)).

`tensorflow/k8s` is a new project in TensorFlow's organization on GitHub that makes all of this much easier.

### Installing `tensorflow/k8s`

Installing `tensorflow/k8s` with Helm is very easy; just run the following commands:

```console
> CHART=https://storage.googleapis.com/tf-on-k8s-dogfood-releases/latest/tf-job-operator-chart-latest.tgz
> helm install ${CHART} -n tf-job --wait --replace --set cloud=azure
```
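
Once deployed, the status of the release can be checked again at any time with standard Helm commands (these are not specific to this chart):

```console
> helm status tf-job
> helm list
```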

If it worked, you should see something like:

```
NAME: tf-job
LAST DEPLOYED: Mon Nov 20 14:24:16 2017
NAMESPACE: default
STATUS: DEPLOYED

RESOURCES:
==> v1/ConfigMap
NAME                    DATA  AGE
tf-job-operator-config  1     7s

==> v1beta1/Deployment
NAME             DESIRED  CURRENT  UP-TO-DATE  AVAILABLE  AGE
tf-job-operator  1        1        1           1          7s

==> v1/Pod(related)
NAME                              READY  STATUS   RESTARTS  AGE
tf-job-operator-3005087210-c3js3  1/1    Running  1         4s
```

This means that 3 resources were created: a `ConfigMap`, a `Deployment`, and a `Pod`.
We will see in just a moment what each of them does.

### Kubernetes Custom Resource Definition

Kubernetes has a concept of [Custom Resources](https://kubernetes.io/docs/concepts/api-extension/custom-resources/) (often abbreviated CRD) that allows us to create custom objects that we will then be able to use.
In the case of `tensorflow/k8s`, after installation, a new `TFJob` object will be available in our cluster. This object allows us to describe a TensorFlow training.

#### `TFJob` Specifications

Before going further, let's take a look at what the `TFJob` looks like:

> Note: Some of the fields are not described here for brevity.

**`TFJob` Object**

| Field | Type | Description |
|-------|------|-------------|
| apiVersion | `string` | Versioned schema of this representation of an object. In our case, it's `kubeflow.org/v1alpha1` |
| kind | `string` | Value representing the REST resource this object represents. In our case it's `TFJob` |
| metadata | [`ObjectMeta`](https://github.com/kubernetes/community/blob/master/contributors/devel/api-conventions.md#metadata) | Standard object's metadata. |
| spec | `TFJobSpec` | The actual specification of our TensorFlow job, defined below. |

`spec` is the most important part, so let's look at it too:

**`TFJobSpec` Object**

| Field | Type | Description |
|-------|------|-------------|
| ReplicaSpecs | `TFReplicaSpec` array | Specification for a set of TensorFlow processes, defined below |

Let's go deeper:

**`TFReplicaSpec` Object**

| Field | Type | Description |
|-------|------|-------------|
| TfReplicaType | `string` | What type of replica are we defining? Can be `MASTER`, `WORKER` or `PS`. When not doing distributed TensorFlow, we just use `MASTER`, which happens to be the default value. |
| Replicas | `int` | Number of replicas of `TfReplicaType`. Again, this is useful only for distributed TensorFlow. Default value is `1`. |
| Template | [`PodTemplateSpec`](https://kubernetes.io/docs/api-reference/v1.8/#podtemplatespec-v1-core) | Describes the pod that will be created when executing a job. This is the standard Pod description that we have been using everywhere. |

As a refresher, here is what a simple TensorFlow training (with GPU) would look like using "vanilla" Kubernetes:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: example-job
spec:
  template:
    metadata:
      name: example-job
    spec:
      restartPolicy: OnFailure
      volumes:
      - name: bin
        hostPath:
          path: /usr/lib/nvidia-384/bin
      - name: lib
        hostPath:
          path: /usr/lib/nvidia-384
      containers:
      - name: tensorflow
        image: wbuchwalter/<SAMPLE IMAGE>
        resources:
          requests:
            alpha.kubernetes.io/nvidia-gpu: 1
        volumeMounts:
        - name: bin
          mountPath: /usr/local/nvidia/bin
        - name: lib
          mountPath: /usr/local/nvidia/lib64
```

Here is what the same thing looks like using the new `TFJob` resource:

```yaml
apiVersion: kubeflow.org/v1alpha1
kind: TFJob
metadata:
  name: example-tfjob
spec:
  replicaSpecs:
  - template:
      spec:
        containers:
        - image: wbuchwalter/<SAMPLE IMAGE>
          name: tensorflow
          resources:
            requests:
              alpha.kubernetes.io/nvidia-gpu: 1
        restartPolicy: OnFailure
```

No need to mount drivers anymore! Note that we are not specifying `TfReplicaType` or `Replicas`, as the default values are already what we want.
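
For illustration, here is a sketch of the same job with the two defaulted fields written out explicitly. It should behave identically to the manifest above, since `MASTER` and `1` are the defaults; this assumes the YAML field names follow the same lowerCamelCase convention as the other fields in the spec:

```yaml
apiVersion: kubeflow.org/v1alpha1
kind: TFJob
metadata:
  name: example-tfjob
spec:
  replicaSpecs:
  - tfReplicaType: MASTER  # default value, shown here only for clarity
    replicas: 1            # default value, shown here only for clarity
    template:
      spec:
        containers:
        - image: wbuchwalter/<SAMPLE IMAGE>
          name: tensorflow
          resources:
            requests:
              alpha.kubernetes.io/nvidia-gpu: 1
        restartPolicy: OnFailure
```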

#### How does this work?

As we saw earlier, when we installed the Helm chart for `tensorflow/k8s`, 3 resources were created in our cluster:
* A `ConfigMap` named `tf-job-operator-config`
* A `Deployment`
* And a `Pod` named `tf-job-operator`

The `tf-job-operator` pod (simply called the operator, or `TFJob` operator) is going to monitor your cluster, and every time you create a new resource of type `TFJob`, the operator will know what to do with it.
Specifically, when you create a new `TFJob`, the operator will create a new Kubernetes `Job` for it, and automatically mount the drivers if needed (i.e. when you request a GPU).

You may wonder how the operator knows which directories need to be mounted in the container for the NVIDIA drivers: that's where the `ConfigMap` comes into play.

In K8s, a [`ConfigMap`](https://kubernetes.io/docs/tasks/configure-pod-container/configmap/) is a simple object that contains key-value pairs. This `ConfigMap` can then be linked with a container to inject some configuration.

When we installed the Helm chart, we specified which cloud provider we are running on with `--set cloud=azure`.
This creates a `ConfigMap` that contains configuration options specific to Azure, including the list of directories to mount.

We can take a look at what is inside our `tf-job-operator-config` by doing:

```console
kubectl describe configmaps tf-job-operator-config
```

The output is:

```
Name:         tf-job-operator-config
Namespace:    default
Labels:       <none>
Annotations:  <none>

Data
====
controller_config_file.yaml:
----
grpcServerFilePath: /opt/mlkube/grpc_tensorflow_server/grpc_tensorflow_server.py
accelerators:
  alpha.kubernetes.io/nvidia-gpu:
    volumes:
    - name: lib
      mountPath: /usr/local/nvidia/lib64
      hostPath: /usr/lib/nvidia-384
    - name: bin
      mountPath: /usr/local/nvidia/bin
      hostPath: /usr/lib/nvidia-384/bin
    - name: libcuda
      mountPath: /usr/lib/x86_64-linux-gnu/libcuda.so.1
      hostPath: /usr/lib/x86_64-linux-gnu/libcuda.so.1
```

If you want to know more:
* [tensorflow/k8s](https://github.com/tensorflow/k8s) GitHub repository
* [Introducing Operators](https://coreos.com/blog/introducing-operators.html), a blog post by CoreOS explaining the Operator pattern

## Exercises

### Exercise 1: A Simple `TFJob`

Let's schedule a very simple TensorFlow job using `TFJob` first.

> Note: If you completed the exercises in Modules 1 and 2, you can change the image to use the one you pushed instead.

Depending on whether or not your cluster has GPUs, choose the appropriate template:

<details>
<summary><strong>CPU Only</strong></summary>

```yaml
apiVersion: kubeflow.org/v1alpha1
kind: TFJob
metadata:
  name: module5-ex1
spec:
  replicaSpecs:
  - template:
      spec:
        containers:
        - image: wbuchwalter/tf-mnist:cpu
          name: tensorflow
        restartPolicy: OnFailure
```

</details>

<details>
<summary><strong>With GPU</strong></summary>

When using GPUs, we need to request one (or more), and the image we are using also needs to be based on TensorFlow's GPU image.

```yaml
apiVersion: kubeflow.org/v1alpha1
kind: TFJob
metadata:
  name: module5-ex1-gpu
spec:
  replicaSpecs:
  - template:
      spec:
        containers:
        - image: wbuchwalter/tf-mnist:gpu
          name: tensorflow
          resources:
            requests:
              alpha.kubernetes.io/nvidia-gpu: 1
        restartPolicy: OnFailure
```

</details>

Save the template that applies to you in a file, and create the `TFJob`:

```console
kubectl create -f <template-path>
```

Let's look at what has been created in our cluster.

First, a `TFJob` was created:

```console
kubectl get tfjob
```

Returns:

```
NAME          KIND
module5-ex1   TFJob.v1alpha1.tensorflow.org
```

As well as a `Job`, which was actually created by the operator:

```console
kubectl get job
```

Returns:

```
NAME                        DESIRED   SUCCESSFUL   AGE
module5-ex1-master-xs4b-0   1         0            2m
```

And a `Pod`:

```console
kubectl get pod
```

Returns:

```
NAME                              READY     STATUS    RESTARTS   AGE
module5-ex1-master-xs4b-0-6gpfn   1/1       Running   0          2m
```

Note that the `Pod` might take a few minutes before actually running, as the Docker image needs to be pulled onto the node first.

Once the `Pod`'s status is either `Running` or `Completed`, we can start looking at its logs:

```console
kubectl logs <your-pod-name>
```

This container is pretty verbose, but you should see a TensorFlow training happening:

```
[...]
INFO:tensorflow:2017-11-20 20:57:22.314198: Step 480: Cross entropy = 0.142486
INFO:tensorflow:2017-11-20 20:57:22.370080: Step 480: Validation accuracy = 85.0% (N=100)
INFO:tensorflow:2017-11-20 20:57:22.896383: Step 490: Train accuracy = 98.0%
INFO:tensorflow:2017-11-20 20:57:22.896600: Step 490: Cross entropy = 0.075210
INFO:tensorflow:2017-11-20 20:57:22.945611: Step 490: Validation accuracy = 91.0% (N=100)
INFO:tensorflow:2017-11-20 20:57:23.407756: Step 499: Train accuracy = 94.0%
INFO:tensorflow:2017-11-20 20:57:23.407980: Step 499: Cross entropy = 0.170348
INFO:tensorflow:2017-11-20 20:57:23.457325: Step 499: Validation accuracy = 89.0% (N=100)
INFO:tensorflow:Final test accuracy = 88.4% (N=353)
[...]
```

Once your job is completed, clean it up:

```console
kubectl delete tfjob module5-ex1
```

> That's great and all, but how do we grab our trained model and TensorFlow's summaries?

Well, currently we can't. As soon as the training is complete, the container stops, and everything inside it, including the model and logs, is lost.

Thankfully, Kubernetes `Volumes` can help us here.
If you remember, we quickly introduced `Volumes` in module [2 - Kubernetes](../2-kubernetes/), and that's what we already used to mount the drivers from the node into the container.
But `Volumes` are not just for mounting things from a node; we can also use them to mount a lot of different storage solutions (you can see the full list [here](https://kubernetes.io/docs/concepts/storage/volumes/)).

In our case we are going to use Azure Files, as it is really easy to use with Kubernetes.

## Exercise 2: Azure Files to the Rescue

### Creating a New File Share and Kubernetes Secret

In the official documentation, [Using Azure Files with Kubernetes](https://docs.microsoft.com/en-in/azure/aks/azure-files), follow the steps listed under `Create an Azure file share` and `Create Kubernetes Secret`, but be aware of a few details first:
* It is **very** important that you create your storage account (and hence your resource group) in the **same** region as your Kubernetes cluster: because Azure Files uses the `SMB` protocol, it won't work across regions. `AKS_PERS_LOCATION` should be updated accordingly.
* While this document specifically refers to AKS, it will work for any K8s cluster.
* Name your file share `tensorflow`. While the share could be named anything, it will make it easier to follow the examples later on. `AKS_PERS_SHARE_NAME` should be updated accordingly.
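
For reference, the steps in that document condense to roughly the following Azure CLI commands. This is a sketch, not a substitute for the official doc; all names here are placeholders, and the storage account name must be globally unique:

```console
> AKS_PERS_RESOURCE_GROUP=myResourceGroup
> AKS_PERS_LOCATION=eastus                      # must match your cluster's region
> AKS_PERS_STORAGE_ACCOUNT_NAME=mystorageacct$RANDOM
> AKS_PERS_SHARE_NAME=tensorflow

> az storage account create -n $AKS_PERS_STORAGE_ACCOUNT_NAME -g $AKS_PERS_RESOURCE_GROUP -l $AKS_PERS_LOCATION --sku Standard_LRS
> AZURE_STORAGE_CONNECTION_STRING=$(az storage account show-connection-string -n $AKS_PERS_STORAGE_ACCOUNT_NAME -g $AKS_PERS_RESOURCE_GROUP -o tsv)
> az storage share create -n $AKS_PERS_SHARE_NAME
> STORAGE_KEY=$(az storage account keys list --resource-group $AKS_PERS_RESOURCE_GROUP --account-name $AKS_PERS_STORAGE_ACCOUNT_NAME --query "[0].value" -o tsv)
> kubectl create secret generic azure-secret --from-literal=azurestorageaccountname=$AKS_PERS_STORAGE_ACCOUNT_NAME --from-literal=azurestorageaccountkey=$STORAGE_KEY
```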

Once you have completed all the steps, run:

```console
kubectl get secrets
```

Which should return:

```
NAME           TYPE      DATA      AGE
azure-secret   Opaque    2         4m
```

### Updating our example to use our Azure File Share

Now we need to mount our new file share into our container so the model and the summaries can be persisted.
It turns out that mounting an Azure File share into a container is really easy; we simply need to reference our secret in the `Volume` definition:

```yaml
[...]
containers:
- image: <IMAGE>
  name: tensorflow
  volumeMounts:
  - name: azurefile
    mountPath: <MOUNT_PATH>
volumes:
- name: azurefile
  azureFile:
    secretName: azure-secret
    shareName: tensorflow
    readOnly: false
```

Update your template from exercise 1 to mount the Azure File share into your container, and create your new job.
Note that by default our container saves everything into `/app/tf_files`, so that's the value you will want to use for `MOUNT_PATH`.

Once the container starts running, if you go to the Azure Portal, into your storage account, and browse your `tensorflow` file share, you should see something like this:

![file-share](./file-share.png)

This means that when we run a training, all the important data is now stored in Azure Files and is still available as long as we don't delete the file share.

#### Solution for Exercise 2

*For brevity, the solution shown here is for CPU-only training. If you are using GPU, don't forget to update the image tag and add a GPU request.*

<details>
<summary><strong>Solution</strong></summary>

```yaml
apiVersion: kubeflow.org/v1alpha1
kind: TFJob
metadata:
  name: module5-ex2
spec:
  replicaSpecs:
  - template:
      spec:
        containers:
        - image: wbuchwalter/tf-mnist:cpu
          name: tensorflow
          volumeMounts:
          # By default our classifier saves the summaries in /tmp/tensorflow,
          # so that's where we want to mount our Azure File Share.
          - name: azurefile
            # The subPath allows us to mount a subdirectory within the Azure file share instead of root.
            # This is useful so that we can save the logs for each run in a different subdirectory
            # instead of overwriting what was done before.
            subPath: module5-ex2
            mountPath: /tmp/tensorflow
        volumes:
        - name: azurefile
          azureFile:
            # We reference the secret we created just earlier
            # so that the account name and key are passed securely and not directly in a template.
            secretName: azure-secret
            shareName: tensorflow
            readOnly: false
        restartPolicy: OnFailure
```

</details>

**Don't forget to delete the `TFJob` once it is completed!**

> Great, but what if I want to check out the training in TensorBoard? Do I need to download everything to my machine?

Actually no, you don't. `TFJob` provides a very handy mechanism to monitor your trainings with TensorBoard easily!
We will try that in our third exercise.

### Exercise 3: Adding TensorBoard

So far, we have a TensorFlow training running, and its model and summaries are persisted to an Azure File share.
But having TensorBoard monitoring the training would be pretty useful as well.
It turns out `TFJob` can also help us with that.

When we looked at the `TFJob` specification at the beginning of this module, we omitted some fields in the `TFJobSpec` description.
Here is a still incomplete but more accurate representation, with one additional field:

**`TFJobSpec` Object**

| Field | Type | Description |
|-------|------|-------------|
| ReplicaSpecs | `TFReplicaSpec` array | Specification for a set of TensorFlow processes. |
| **TensorBoard** | `TensorBoardSpec` | Configuration to start a TensorBoard deployment associated with our training job. Defined below. |

That's right, `TFJobSpec` contains an object of type `TensorBoardSpec` which allows us to describe a TensorBoard instance!
Let's look at it:

**`TensorBoardSpec` Object**

| Field | Type | Description |
|-------|------|-------------|
| LogDir | `string` | Location of TensorFlow summaries in the TensorBoard container. |
| ServiceType | [`ServiceType`](https://kubernetes.io/docs/concepts/services-networking/service/#publishing-services---service-types) | What kind of service should expose TensorBoard. Usually `ClusterIP` (only reachable from within the cluster) or `LoadBalancer` (exposes the service externally using a cloud provider's load balancer). |
| Volumes | [`Volume`](https://kubernetes.io/docs/api-reference/v1.8/#volume-v1-core) array | List of volumes that can be mounted. |
| VolumeMounts | [`VolumeMount`](https://kubernetes.io/docs/api-reference/v1.8/#volumemount-v1-core) array | Pod volumes to mount into the container's filesystem. |

Let's add TensorBoard to our job then.
Here is how this will work: we will keep the same TensorFlow training job as in exercise 2. This `TFJob` will write the model and summaries to the Azure File share.
We will also set up the configuration for TensorBoard so that it reads the summaries from the same Azure File share:
* `Volumes` and `VolumeMounts` in `TensorBoardSpec` should be updated adequately.
* For `ServiceType`, you should use `LoadBalancer`; this will create a public IP, so it will be easier to access.
* `LogDir` will depend on how you configure `VolumeMounts`, but on your file share the summaries will be under the `training_summaries` subdirectory.

#### Solution for Exercise 3

*For brevity, the solution shown here is for CPU-only training. If you are using GPU, don't forget to update the image tag and add a GPU request.*

<details>
<summary><strong>Solution</strong></summary>

```yaml
apiVersion: kubeflow.org/v1alpha1
kind: TFJob
metadata:
  name: module5-ex3
spec:
  replicaSpecs:
  - template:
      spec:
        volumes:
        - name: azurefile
          azureFile:
            secretName: azure-secret
            shareName: tensorflow
            readOnly: false
        containers:
        - image: wbuchwalter/tf-mnist:cpu
          name: tensorflow
          volumeMounts:
          - mountPath: /tmp/tensorflow
            subPath: module5-ex3 # Again we isolate the logs in a new directory on Azure Files
            name: azurefile
        restartPolicy: OnFailure
  tensorboard:
    logDir: /tmp/tensorflow/logs
    serviceType: LoadBalancer # We request a public IP for our TensorBoard instance
    volumes:
    - name: azurefile
      azureFile:
        secretName: azure-secret
        shareName: tensorflow
    volumeMounts:
    - mountPath: /tmp/tensorflow/ # This could be any other path. All that matters is that LogDir reflects it.
      subPath: module5-ex3 # This should match the directory our Master is actually writing in
      name: azurefile
```

</details>

#### Validation

If you updated the `TFJob` template correctly, when doing:

```console
kubectl get services
```

You should see something like:

```
NAME                           CLUSTER-IP     EXTERNAL-IP     PORT(S)        AGE
kubernetes                     10.0.0.1       <none>          443/TCP        14d
module5-ex3-master-7yqt-0      10.0.126.11    <none>          2222/TCP       5m
module5-ex3-tensorboard-7yqt   10.0.199.170   104.42.193.76   80:31770/TCP   5m
```

Note that provisioning a public IP on Azure can take a few minutes. During this time, the `EXTERNAL-IP` for TensorBoard's service will show as `<pending>`.

Once the public IP is provisioned, browse to it, and you should land on a working TensorBoard instance with live monitoring of the running training job.
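
Rather than re-running `kubectl get services` by hand, the standard `-w` (watch) flag can be used to wait until the external IP is assigned:

```console
> kubectl get services -w
```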

![TensorBoard](./tensorboard.png)

## Next Step

[6 - Distributed TensorFlow](../6-distributed-tensorflow)

@ -0,0 +1,95 @@
# Kubeflow - Overview and Installation

## Prerequisites

* [1 - Docker](../1-docker/README.md)
* [2 - Kubernetes](../2-kubernetes/README.md)

## Summary

In this module we are going to get an overview of the different components that make up [Kubeflow](https://github.com/kubeflow/kubeflow), and how to install them into our newly deployed Kubernetes cluster.

### Kubeflow Overview

From [Kubeflow](https://github.com/kubeflow/kubeflow)'s own documentation:

> The Kubeflow project is dedicated to making deployments of machine learning (ML) workflows on Kubernetes simple, portable and scalable. Our goal is not to recreate other services, but to provide a straightforward way to deploy best-of-breed open-source systems for ML to diverse infrastructures. Anywhere you are running Kubernetes, you should be able to run Kubeflow.

Kubeflow is composed of multiple components:
* [JupyterHub](https://jupyterhub.readthedocs.io/en/latest/), which allows users to request an instance of a Jupyter Notebook server dedicated to them.
* One or multiple training controllers. These are components that simplify and manage the deployment of training jobs. For the purpose of this lab we are only going to deploy a training controller for TensorFlow jobs. However, the Kubeflow community has started working on controllers for PyTorch and Caffe2 as well.
* A serving component that will help you serve predictions with your models.

For more general info on Kubeflow, head to the repo's [README](https://github.com/kubeflow/kubeflow/blob/master/README.md).

### Deploying Kubeflow

Kubeflow uses [`ksonnet`](https://github.com/ksonnet/ksonnet) templates as a way to package and deploy the different components.

> ksonnet simplifies defining an application configuration, updating the configuration over time, and specializing it for different clusters and environments.

First, install ksonnet version [0.9.2](https://ksonnet.io/#get-started).
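
You can confirm the installation succeeded, and check which version is on your `PATH`, with ksonnet's built-in version command:

```console
> ks version
```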

Then run the following commands to deploy Kubeflow in your Kubernetes cluster:

```bash
# Create a namespace for the Kubeflow deployment
NAMESPACE=kubeflow
kubectl create namespace ${NAMESPACE}

# Which version of Kubeflow to use.
# For a list of releases refer to:
# https://github.com/kubeflow/kubeflow/releases
VERSION=v0.1.2

# Initialize a ksonnet app. Set the namespace for its default environment.
APP_NAME=my-kubeflow
ks init ${APP_NAME}
cd ${APP_NAME}
ks env set default --namespace ${NAMESPACE}

# Add a reference to Kubeflow's ksonnet manifests
ks registry add kubeflow github.com/kubeflow/kubeflow/tree/${VERSION}/kubeflow

# Install Kubeflow components
ks pkg install kubeflow/core@${VERSION}
ks pkg install kubeflow/tf-serving@${VERSION}
ks pkg install kubeflow/tf-job@${VERSION}

# Create templates for core components
ks generate kubeflow-core kubeflow-core

# Customize Kubeflow's installation for AKS
ks param set kubeflow-core cloud aks

# Enable collection of anonymous usage metrics.
# Skip this step if you don't want to enable collection.
ks param set kubeflow-core reportUsage true
ks param set kubeflow-core usageId $(uuidgen)

# Deploy Kubeflow
ks apply default -c kubeflow-core
```

### Validation

`kubectl get pods -n kubeflow`

should return something like this:

```
NAME                                READY     STATUS    RESTARTS   AGE
ambassador-7789cddc5d-czf7p         2/2       Running   0          1d
ambassador-7789cddc5d-f79zp         2/2       Running   0          1d
ambassador-7789cddc5d-h57ms         2/2       Running   0          1d
centraldashboard-d5bf74c6b-nn925    1/1       Running   0          1d
tf-hub-0                            1/1       Running   0          1d
tf-job-dashboard-8699ccb5ff-9phmv   1/1       Running   0          1d
tf-job-operator-646bdbcb7-bc479     1/1       Running   0          1d
```

The most important components for the purpose of this lab are `tf-hub-0`, which is the JupyterHub spawner running on your cluster, and `tf-job-operator-646bdbcb7-bc479`, which is a controller that monitors your cluster for new TensorFlow training job (`TfJob`) specifications and manages the training. We will look at these two components later.

## Next Step

[5 - JupyterHub](../5-jupyterhub)

@ -3,7 +3,7 @@
## Prerequisites
* [1 - Docker Basics](../1-docker)
* [2 - Kubernetes Basics and cluster created](../2-kubernetes)
* [4 - Kubeflow](../4-kubeflow)

## Summary

@ -95,3 +95,8 @@ kubectl -n ${NAMESPACE} describe pods jupyter-${USERNAME}
```

After the pod status changes to `running`, you will see a new Jupyter notebook running at http://127.0.0.1:8000/user/{USERNAME}/tree or http://{PUBLIC-IP}/user/{USERNAME}/tree.

## Next Step

[6 - TfJob](../6-tfjob)

@ -0,0 +1,272 @@
|
|||
# `TFJob`
|
||||
|
||||
## Prerequisites
|
||||
|
||||
* [1 - Docker](../1-docker/README.md)
|
||||
* [2 - Kubernetes](../2-kubernetes/README.md)
|
||||
* [4 - Kubeflow](../4-kubeflow/README.md)
|
||||
|
||||
## Summary
|
||||
|
||||
In this module you will learn how to describe a TensorFlow training using `TfJob` object.
|
||||
|
||||
|
||||
### Kubernetes Custom Resource Definition

Kubernetes has a concept of [Custom Resources](https://kubernetes.io/docs/concepts/api-extension/custom-resources/), defined through Custom Resource Definitions (often abbreviated CRD), that allows us to create custom object types and then use them like any built-in object.
In the case of Kubeflow, after installation, a new `TFJob` object will be available in our cluster. This object allows us to describe a TensorFlow training.

#### `TFJob` Specifications

Before going further, let's take a look at what the `TFJob` object looks like:

> Note: Some of the fields are not described here for brevity.

**`TFJob` Object**

| Field | Type | Description |
|-------|------|-------------|
| apiVersion | `string` | Versioned schema of this representation of an object. In our case, it's `kubeflow.org/v1alpha1` |
| kind | `string` | Value representing the REST resource this object represents. In our case it's `TFJob` |
| metadata | [`ObjectMeta`](https://github.com/kubernetes/community/blob/master/contributors/devel/api-conventions.md#metadata) | Standard object metadata. |
| spec | `TFJobSpec` | The actual specification of our TensorFlow job, defined below. |

`spec` is the most important part, so let's look at it too:

**`TFJobSpec` Object**

| Field | Type | Description |
|-------|------|-------------|
| ReplicaSpecs | `TFReplicaSpec` array | Specification for a set of TensorFlow processes, defined below |

Let's go deeper:

**`TFReplicaSpec` Object**

| Field | Type | Description |
|-------|------|-------------|
| TfReplicaType | `string` | What type of replica are we defining? Can be `MASTER`, `WORKER` or `PS`. When not doing distributed TensorFlow, we just use `MASTER`, which happens to be the default value. |
| Replicas | `int` | Number of replicas of `TfReplicaType`. Again, this is only useful for distributed TensorFlow. Default value is `1`. |
| Template | [`PodTemplateSpec`](https://kubernetes.io/docs/api-reference/v1.8/#podtemplatespec-v1-core) | Describes the pod that will be created when executing a job. This is the standard Pod description that we have been using everywhere. |

Here is what a simple TensorFlow training job looks like using this `TFJob` object:

```yaml
apiVersion: kubeflow.org/v1alpha1
kind: TFJob
metadata:
  name: example-tfjob
spec:
  replicaSpecs:
    - template:
        spec:
          containers:
            - image: wbuchwalter/<SAMPLE IMAGE>
              name: tensorflow
              resources:
                limits:
                  alpha.kubernetes.io/nvidia-gpu: 1
          restartPolicy: OnFailure
```

Note that we are not specifying `TfReplicaType` or `Replicas`, as the default values are already what we want.
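
The manifest above is just data, so it can also be built programmatically. Here is a minimal Python sketch (our own helper, not part of the lab or of the Kubeflow API) that assembles the same `TFJob` manifest as a dictionary, for example to submit it later with the Kubernetes Python client's `CustomObjectsApi` instead of `kubectl`:

```python
# Hypothetical helper: builds a TFJob manifest equivalent to the YAML above.
# The image name and GPU count are placeholders you would adapt.

def make_tfjob(name: str, image: str, gpus: int = 1) -> dict:
    """Build a v1alpha1 TFJob manifest as a plain dictionary."""
    return {
        "apiVersion": "kubeflow.org/v1alpha1",
        "kind": "TFJob",
        "metadata": {"name": name},
        "spec": {
            "replicaSpecs": [
                {
                    # TfReplicaType / Replicas omitted: defaults are MASTER / 1.
                    "template": {
                        "spec": {
                            "containers": [
                                {
                                    "image": image,
                                    "name": "tensorflow",
                                    "resources": {
                                        "limits": {"alpha.kubernetes.io/nvidia-gpu": gpus}
                                    },
                                }
                            ],
                            "restartPolicy": "OnFailure",
                        }
                    }
                }
            ]
        },
    }

manifest = make_tfjob("example-tfjob", "wbuchwalter/tf-mnist:gpu")
```

Assuming the TFJob CRD is installed in the cluster, submitting this dictionary would then be a call to `CustomObjectsApi.create_namespaced_custom_object(...)` with group `kubeflow.org` and version `v1alpha1`.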

## Exercises

### Exercise 1: A Simple `TFJob`

Let's first schedule a very simple TensorFlow job using `TFJob`.

> Note: If you completed the exercises in modules 1 and 2, you can change the image to use the one you pushed instead.

When using GPUs, we need to request one (or more) in the container's resource limits, and the image we are using needs to be based on TensorFlow's GPU image.

```yaml
apiVersion: kubeflow.org/v1alpha1
kind: TFJob
metadata:
  name: module5-ex1-gpu
spec:
  replicaSpecs:
    - template:
        spec:
          containers:
            - image: wbuchwalter/tf-mnist:gpu
              name: tensorflow
              resources:
                limits:
                  alpha.kubernetes.io/nvidia-gpu: 1
          restartPolicy: OnFailure
```

Save the template to a file, and create the `TFJob`:

```console
kubectl create -f <template-path>
```

Let's look at what has been created in our cluster.

First, a `TFJob` was created:

```console
kubectl get tfjob
```

Returns:

```
NAME              AGE
module5-ex1-gpu   5s
```

As well as a `Job`, which was actually created by the operator:

```console
kubectl get job
```

Returns:

```
NAME                        DESIRED   SUCCESSFUL   AGE
module5-ex1-master-xs4b-0   1         0            2m
```

And a `Pod`:

```console
kubectl get pod
```

Returns:

```
NAME                              READY     STATUS    RESTARTS   AGE
module5-ex1-master-xs4b-0-6gpfn   1/1       Running   0          2m
```

Note that the `Pod` might take a few minutes to start running: the Docker image first needs to be pulled onto the node.
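
If you would rather script the wait than re-run `kubectl get pod` by hand, here is a small sketch (our own helpers, not part of the lab) that reads the pod phase from `kubectl`'s JSON output:

```python
import json
import subprocess

def pod_phase(pod_json: str) -> str:
    """Extract the phase (Pending, Running, Succeeded, ...) from pod JSON."""
    return json.loads(pod_json)["status"]["phase"]

def get_pod_phase(name: str) -> str:
    # `kubectl get pod <name> -o json` prints the full pod object as JSON.
    out = subprocess.check_output(["kubectl", "get", "pod", name, "-o", "json"])
    return pod_phase(out.decode())
```

You could then poll `get_pod_phase(...)` in a loop until it returns `Running` or `Succeeded`.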

Once the `Pod`'s status is either `Running` or `Completed`, we can start looking at its logs:

```console
kubectl logs <your-pod-name>
```

This container is pretty verbose, but you should see a TensorFlow training happening:

```
[...]
INFO:tensorflow:2017-11-20 20:57:22.314198: Step 480: Cross entropy = 0.142486
INFO:tensorflow:2017-11-20 20:57:22.370080: Step 480: Validation accuracy = 85.0% (N=100)
INFO:tensorflow:2017-11-20 20:57:22.896383: Step 490: Train accuracy = 98.0%
INFO:tensorflow:2017-11-20 20:57:22.896600: Step 490: Cross entropy = 0.075210
INFO:tensorflow:2017-11-20 20:57:22.945611: Step 490: Validation accuracy = 91.0% (N=100)
INFO:tensorflow:2017-11-20 20:57:23.407756: Step 499: Train accuracy = 94.0%
INFO:tensorflow:2017-11-20 20:57:23.407980: Step 499: Cross entropy = 0.170348
INFO:tensorflow:2017-11-20 20:57:23.457325: Step 499: Validation accuracy = 89.0% (N=100)
INFO:tensorflow:Final test accuracy = 88.4% (N=353)
[...]
```

> That's great and all, but how do we grab our trained model and TensorFlow's summaries?

Well, currently we can't. As soon as the training completes, the container stops and everything inside it, including the model and the logs, is lost.

Thankfully, Kubernetes `Volumes` can help us here.
If you remember, we quickly introduced `Volumes` in module [2 - Kubernetes](../2-kubernetes/), and that's what we already used to mount the drivers from the node into the container.
But `Volumes` are not just for mounting things from a node: we can also use them to mount many different storage solutions; you can see the full list [here](https://kubernetes.io/docs/concepts/storage/volumes/).

In our case we are going to use Azure Files, as it is really easy to use with Kubernetes.

### Exercise 2: Azure Files to the Rescue

#### Creating a New File Share and a Kubernetes Secret

In the official documentation, [Using Azure Files with Kubernetes](https://docs.microsoft.com/en-us/azure/aks/azure-files-volume), follow the steps listed under `Create an Azure file share` and `Create Kubernetes Secret`, but be aware of a few details first:

* It is **very** important that you create your storage account (and hence your resource group) in the **same** region as your Kubernetes cluster: because Azure Files uses the `SMB` protocol, it won't work across regions. `AKS_PERS_LOCATION` should be updated accordingly.
* While this document specifically refers to AKS, it will work for any Kubernetes cluster.
* Name your file share `tensorflow`. While the share could be named anything, it will make it easier to follow the examples later on. `AKS_PERS_SHARE_NAME` should be updated accordingly.

Once you have completed all the steps, run:

```console
kubectl get secrets
```

Which should return:

```
NAME           TYPE      DATA      AGE
azure-secret   Opaque    2         4m
```
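
If you are curious what the `kubectl create secret` step from the Azure docs actually creates, here is a sketch (the account name and key are placeholders) of the resulting `Secret` manifest. The two `data` entries are simply base64-encoded, which is why `DATA` shows `2` above:

```python
import base64

def make_azure_secret(account_name: str, account_key: str) -> dict:
    """Build the azure-secret manifest that kubectl generates for us."""
    b64 = lambda s: base64.b64encode(s.encode()).decode()
    return {
        "apiVersion": "v1",
        "kind": "Secret",
        "metadata": {"name": "azure-secret"},
        "type": "Opaque",
        # These two key names are the ones the azureFile volume plugin reads.
        "data": {
            "azurestorageaccountname": b64(account_name),
            "azurestorageaccountkey": b64(account_key),
        },
    }
```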

#### Updating our Example to Use our Azure File Share

Now we need to mount our new file share into our container so the model and the summaries can be persisted.
It turns out mounting an Azure file share into a container is really easy: we simply need to reference our secret in the `Volume` definition:

```yaml
[...]
containers:
  - image: <IMAGE>
    name: tensorflow
    resources:
      limits:
        alpha.kubernetes.io/nvidia-gpu: 1
    volumeMounts:
      - name: azurefile
        mountPath: <MOUNT_PATH>
volumes:
  - name: azurefile
    azureFile:
      secretName: azure-secret
      shareName: tensorflow
      readOnly: false
```

Update your template from exercise 1 to mount the Azure file share into your container, and create your new job.
Note that by default our container saves everything into `/app/tf_files`, so that's the value you will want to use for `MOUNT_PATH`.
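
If you prefer to script the change instead of editing the YAML by hand, here is a standalone Python sketch (our own helper, the names are ours) that takes the pod `spec` from exercise 1 as a dictionary and attaches the Azure file share:

```python
def mount_azure_file(pod_spec: dict, mount_path: str = "/app/tf_files") -> dict:
    """Add the azurefile volume and its mount to an exercise-1 pod spec."""
    # Mount the share into the (single) training container.
    pod_spec["containers"][0].setdefault("volumeMounts", []).append(
        {"name": "azurefile", "mountPath": mount_path}
    )
    # Declare the volume itself, referencing the azure-secret created earlier.
    pod_spec.setdefault("volumes", []).append(
        {
            "name": "azurefile",
            "azureFile": {
                "secretName": "azure-secret",
                "shareName": "tensorflow",
                "readOnly": False,
            },
        }
    )
    return pod_spec
```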

Once the container starts running, if you go to the Azure Portal, open your storage account, and browse your `tensorflow` file share, you should see something like this:

![file-share](./file-share.png)

This means that when we run a training, all the important data is now stored in Azure Files and remains available as long as we don't delete the file share.

#### Solution for Exercise 2

<details>
<summary><strong>Solution</strong></summary>

```yaml
apiVersion: kubeflow.org/v1alpha1
kind: TFJob
metadata:
  name: module5-ex2
spec:
  replicaSpecs:
    - template:
        spec:
          containers:
            - image: wbuchwalter/tf-mnist:gpu
              name: tensorflow
              resources:
                limits:
                  alpha.kubernetes.io/nvidia-gpu: 1
              volumeMounts:
                # By default our classifier saves the summaries in /tmp/tensorflow,
                # so that's where we want to mount our Azure file share.
                - name: azurefile
                  # The subPath allows us to mount a subdirectory within the Azure file share
                  # instead of its root. This is useful so that we can save the logs of each
                  # run in a different subdirectory instead of overwriting what was done before.
                  subPath: module5-ex2
                  mountPath: /tmp/tensorflow
          volumes:
            - name: azurefile
              azureFile:
                # We reference the secret we created earlier so that the account
                # name and key are passed securely and not directly in the template.
                secretName: azure-secret
                shareName: tensorflow
                readOnly: false
          restartPolicy: OnFailure
```

</details>

## Next Step

[7 - Distributed TensorFlow](../7-distributed-tensorflow)
@ -3,7 +3,7 @@

## Prerequisites

* [2 - Kubernetes Basics and cluster created](../2-kubernetes)
* [6 - TfJob](../6-tfjob)

## Summary
@ -371,4 +371,4 @@ There are two things to notice here:

## Next Step

[8 - Hyperparameters Sweep with Helm](../8-hyperparam-sweep)
@ -3,7 +3,7 @@

## Prerequisites

* [3 - Helm](../3-helm)
* [6 - TfJob Basics](../6-tfjob)

### "Vanilla" Hyperparameter Sweep
@ -170,4 +170,4 @@ tensorboard-85dfc74f8d-4gf24 1/1 Running 0 4h

## Next Step

[9 - Serving](../9-serving)
README.md
@ -27,12 +27,13 @@ git clone https://github.com/Azure/kubeflow-labs

|1| **[Docker](1-docker)** | Docker and containers 101.|
|2| **[Kubernetes](2-kubernetes)** | Kubernetes important concepts overview.|
|3| **[Helm](3-helm)** | Introduction to Helm |
|4| **[Kubeflow](4-kubeflow)** | Introduction to Kubeflow and how to deploy it in your cluster.|
|5| **[JupyterHub](5-jupyterhub)** | Learn how to run JupyterHub to create and manage Jupyter notebooks using Kubeflow |
|6| **[TFJob](6-tfjob)** | Introduction to `TFJob` and how to use it to deploy a simple TensorFlow training.|
|7| **[Distributed Tensorflow](7-distributed-tensorflow)** | Learn how to deploy and monitor distributed TensorFlow trainings with `TFJob`|
|8| **[Hyperparameters Sweep with Helm](8-hyperparam-sweep)** | Using Helm to deploy a large number of trainings testing different hypotheses, and TensorBoard to monitor and compare the results |
|9| **[Serving](9-serving)** | Using TensorFlow Serving to serve predictions |
|10| **[Going Further](10-going-further)** | Links and resources to go further: Autoscaling, Distributed Storage etc. |

# Contributing