Additional edits for kf 0.2.2 (#49)
Parent: 7202ea194f
Commit: 9c2d3f6c0f
@@ -117,8 +117,8 @@ As you can see, we are not building a new image from scratch, instead we are usi

 You can see the full list here: https://hub.docker.com/r/tensorflow/tensorflow/tags/.

 What is important to note is that different tags need to be used depending on whether you want to use GPU or not.

-For example, if you wanted to run your model with TensorFlow 1.4.0 and CPU only, you would use `tensorflow/tensorflow:1.4.0`.
-If instead you wanted to use GPU, you would start from `tensorflow/tensorflow:1.4.0-gpu`.
+For example, if you wanted to run your model with TensorFlow 1.10.0 and CPU only, you would use `tensorflow/tensorflow:1.10.0`.
+If instead you wanted to use GPU, you would start from `tensorflow/tensorflow:1.10.0-gpu`.

 The two other instructions are pretty straightforward: first we copy our script into the container, and then we set this script as the entry point for our container, so that any argument passed to our container gets passed through to our script.
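For context, a minimal sketch of the Dockerfile this hunk describes. The `FROM` and `COPY` lines match the build output shown in the next hunk; the exact `ENTRYPOINT` form is an assumption:

```dockerfile
# Start from the official TensorFlow image (use the -gpu tag for GPU support)
FROM tensorflow/tensorflow:1.10.0

# Copy our training script into the container
COPY main.py /app/main.py

# Make the script the entry point, so any arguments passed to the container
# are forwarded to the script (exact form assumed)
ENTRYPOINT ["python", "/app/main.py"]
```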
@@ -141,7 +141,7 @@ The output from this command should look like this:

 ```
 Sending build context to Docker daemon 11.26kB
-Step 1/3 : FROM tensorflow/tensorflow:1.4.0
+Step 1/3 : FROM tensorflow/tensorflow:1.10.0
  ---> a61a91cc0d1b
 Step 2/3 : COPY main.py /app/main.py
  ---> b264d6e9a5ef
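The build output above would come from a command along these lines (image name and tag are illustrative):

```
docker build -t <DOCKER_USERNAME>/tf-mnist .
```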
@@ -151,10 +151,12 @@ If you provisioned GPU VM, describing one of the node should indicate the presen

 [...]
 Capacity:
- alpha.kubernetes.io/nvidia-gpu:  1
+ nvidia.com/gpu:  1
 [...]
 ```

+> Note: In some scenarios, you might not see GPU resources under Capacity. To resolve this, you must install a daemonset as described in the troubleshooting section here: https://docs.microsoft.com/en-us/azure/aks/gpu-cluster
+
 ## Exercise

 ### Running our Model on Kubernetes
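A quick way to confirm a node now advertises the renamed resource, mirroring the custom-columns query used elsewhere in these docs but updated for `nvidia.com/gpu`:

```
kubectl get nodes "-o=custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com\/gpu"
```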
@@ -190,7 +192,7 @@ spec:

         args: ["--max_steps", "500"] # Optional arguments to pass to our command. By default the command is defined by ENTRYPOINT in the Dockerfile
         resources:
           limits:
-            alpha.kubernetes.io/nvidia-gpu: 1 # We ask Kubernetes to assign 1 GPU to this container
+            nvidia.com/gpu: 1 # We ask Kubernetes to assign 1 GPU to this container
         volumeMounts:
         - name: nvidia
           mountPath: /usr/local/nvidia
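Pieced together, the surrounding template would look roughly like this after the change. This is a sketch: the Job name, image tag, and nvidia hostPath are assumptions:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: module4-gpu # hypothetical name
spec:
  template:
    spec:
      containers:
      - name: tensorflow
        image: <DOCKER_USERNAME>/tf-mnist:gpu # placeholder image
        args: ["--max_steps", "500"]
        resources:
          limits:
            nvidia.com/gpu: 1 # request one GPU for this container
        volumeMounts:
        - name: nvidia
          mountPath: /usr/local/nvidia
      volumes:
      - name: nvidia
        hostPath:
          path: /usr/local/nvidia # assumes the node exposes the drivers here
      restartPolicy: OnFailure
```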
@@ -54,7 +54,6 @@ ks registry add kubeflow github.com/kubeflow/kubeflow/tree/${VERSION}/kubeflow

 # Install Kubeflow components
 ks pkg install kubeflow/core@${VERSION}
 ks pkg install kubeflow/tf-serving@${VERSION}
-# ks pkg install kubeflow/tf-job@${VERSION} # TODO: delete this one?

 # Create templates for core components
 ks generate kubeflow-core kubeflow-core
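After installing the packages and generating the core component, the usual next step is to deploy into a ksonnet environment; a sketch, assuming the environment name placeholder used later in these docs:

```
ks env add ${YOUR_KF_ENV}
ks apply ${YOUR_KF_ENV}
```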
@@ -72,7 +72,7 @@ ks apply ${YOUR_KF_ENV}

 ```

 Create a new Jupyter Notebook instance:
-- open http://127.0.0.1:8000 in your browser
+- open http://127.0.0.1:8000 in your browser (or use the public IP for the service `tf-hub-lb`)
 - log in using any username and password
 - click the "Start My Server" button to spawn a new Jupyter notebook
 - from the image dropdown, select a tensorflow image for your notebook
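A quick way to look up that public IP; the service name comes from the line above, the namespace placeholder is an assumption:

```
kubectl get svc tf-hub-lb -n <NAMESPACE>
```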
@@ -81,7 +81,7 @@ Create a new Jupyter Notebook instance:

 ```
 kubectl get nodes "-o=custom-columns=NAME:.metadata.name,GPU:.status.allocatable.alpha\.kubernetes\.io\/nvidia-gpu"
 ```
-- for GPU, enter values in JSON format `{"alpha.kubernetes.io/nvidia-gpu":"1"}`
+- for GPU, enter values in JSON format `{"nvidia.com/gpu":"1"}`
 - click the "Spawn" button

 ![jupyterhub](./jupyterhub.png)
BIN  5-jupyterhub/jupyterhub.png  (binary file not shown; size before: 90 KiB, after: 30 KiB)
@@ -65,7 +65,7 @@ spec:

             name: tensorflow
             resources:
               limits:
-                alpha.kubernetes.io/nvidia-gpu: 1
+                nvidia.com/gpu: 1
           restartPolicy: OnFailure
 ```
@@ -95,7 +95,7 @@ spec:

             name: tensorflow
             resources:
               limits:
-                alpha.kubernetes.io/nvidia-gpu: 1
+                nvidia.com/gpu: 1
           restartPolicy: OnFailure
 ```
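Either of these snippets is submitted and monitored like any other Kubernetes resource; a sketch, with an illustrative file name:

```
kubectl create -f tfjob.yaml   # submit the TFJob
kubectl get tfjob              # check its status
kubectl get pods               # find the pods it spawned, then: kubectl logs <pod-name>
```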
@@ -197,18 +197,18 @@ Turns out mounting an Azure File share into a container is really easy, we simpl

             name: tensorflow
             resources:
               limits:
-                alpha.kubernetes.io/nvidia-gpu: 1
+                nvidia.com/gpu: 1
             volumeMounts:
             - name: azurefile
-              mountPath: <MOUNT_PATH>
+              subPath: module6-ex2-gpu
+              mountPath: /tmp/tensorflow
           volumes:
           - name: azurefile
             persistentVolumeClaim:
               claimName: azurefile
 ```

-Update your template from exercise 1 to mount the Azure File share into your container,and create your new job.
-Note that by default our container saves everything into `/tmp/tensorflow` so that's the value you will want to use for `MOUNT_PATH`.
+Update your template from exercise 1 to mount the Azure File share into your container, and create your new job.

 Once the container starts running, if you go to the Azure Portal, into your storage account, and browse your `tensorflow` file share, you should see something like this:
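The `azurefile` claim referenced above is created beforehand; a minimal sketch assuming an `azurefile` StorageClass is already configured (name and size are illustrative):

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: azurefile
spec:
  accessModes:
  - ReadWriteMany          # Azure Files supports shared read/write access
  storageClassName: azurefile
  resources:
    requests:
      storage: 5Gi         # illustrative size
```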
@@ -228,13 +228,16 @@ metadata:

   name: module6-ex2
 spec:
   tfReplicaSpecs:
-    Master:
+    MASTER:
       replicas: 1
       template:
         spec:
           containers:
-          - image: <DOCKER_USERNAME>/tf-mnist:1.0
+          - image: <DOCKER_USERNAME>/tf-mnist:gpu
             name: tensorflow
+            resources:
+              limits:
+                nvidia.com/gpu: 1
             volumeMounts:
               # By default our classifier saves the summaries in /tmp/tensorflow,
               # so that's where we want to mount our Azure File Share.
@@ -242,13 +245,13 @@ spec:

               # The subPath allows us to mount a subdirectory within the azure file share instead of root
               # this is useful so that we can save the logs for each run in a different subdirectory
               # instead of overwriting what was done before.
-              subPath: module6-ex2
+              subPath: module6-ex2-gpu
               mountPath: /tmp/tensorflow
-          restartPolicy: OnFailure
           volumes:
           - name: azurefile
             persistentVolumeClaim:
               claimName: azurefile
+          restartPolicy: OnFailure
 ```

 </details>
@@ -279,14 +279,14 @@ A working code sample is available in [`solution-src/main.py`](./solution-src/ma

 <summary><strong>TFJob's Template</strong></summary>

 ```yaml
-apiVersion: kubeflow.org/v1alpha1
+apiVersion: kubeflow.org/v1alpha2
 kind: TFJob
 metadata:
   name: module7-ex1-gpu
 spec:
-  replicaSpecs:
-    - replicas: 1 # 1 Master
-      tfReplicaType: MASTER
+  tfReplicaSpecs:
+    MASTER:
+      replicas: 1
       template:
         spec:
           volumes:
@@ -294,45 +294,46 @@ spec:

             persistentVolumeClaim:
               claimName: azurefile
           containers:
-          - image: ritazh/tf-mnist:distributedgpu # You can replace this by your own image
+          - image: <DOCKER_USERNAME>/tf-mnist:distributedgpu # You can replace this with your own image
             name: tensorflow
             imagePullPolicy: Always
             resources:
               limits:
-                alpha.kubernetes.io/nvidia-gpu: 1
+                nvidia.com/gpu: 1
             volumeMounts:
             - mountPath: /tmp/tensorflow
               subPath: module7-ex1-gpu
               name: azurefile
           restartPolicy: OnFailure
-    - replicas: 1 # 1 or 2 Workers depends on how many gpus you have
-      tfReplicaType: WORKER
+    WORKER:
+      replicas: 2
       template:
         spec:
           containers:
-          - image: ritazh/tf-mnist:distributedgpu # You can replace this by your own image
+          - image: <DOCKER_USERNAME>/tf-mnist:distributedgpu # You can replace this with your own image
             name: tensorflow
             imagePullPolicy: Always
             resources:
               limits:
-                alpha.kubernetes.io/nvidia-gpu: 1
+                nvidia.com/gpu: 1
-            volumeMounts:
           restartPolicy: OnFailure
-    - replicas: 1 # 1 Parameter server
-      tfReplicaType: PS
+    PS:
+      replicas: 1
       template:
         spec:
           containers:
-          - image: ritazh/tf-mnist:distributed # You can replace this by your own image
+          - image: <DOCKER_USERNAME>/tf-mnist:distributed # You can replace this with your own image
             name: tensorflow
             imagePullPolicy: Always
             ports:
             - containerPort: 6006
           restartPolicy: OnFailure

 ```

 There are a few things to notice here:
-* Since only the master will be saving the model and the summaries, we only need to mount the Azure File share on the master's `replicaSpec`, not on the `workers` or `ps`.
-* We are not specifying anything for the `PS` `replicaSpec` except the number of replicas. This is because `IsDefaultPS` is set to `true` by default. This means that the parameter server(s) will be started with a pre-built docker image that is already configured to read the `TF_CONFIG` and act as a TensorFlow server, so we don't need to do anything here.
+* Since only the master will be saving the model and the summaries, we only need to mount the Azure File share on the master's `tfReplicaSpecs`, not on the `WORKER`s or `PS`.
+* We are not specifying anything for the `PS` `tfReplicaSpecs` except the number of replicas. This is because `IsDefaultPS` is set to `true` by default. This means that the parameter server(s) will be started with a pre-built docker image that is already configured to read the `TF_CONFIG` and act as a TensorFlow server, so we don't need to do anything here.
+* When you have limited GPU resources, you can have the Master and Worker replicas request GPU resources while the PS replica requests only CPU resources.

 </details>
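For reference, the `TF_CONFIG` injected into each replica looks roughly like this JSON; the host names are illustrative and the exact shape depends on the operator version:

```
{
  "cluster": {
    "master": ["module7-ex1-gpu-master-0:2222"],
    "worker": ["module7-ex1-gpu-worker-0:2222", "module7-ex1-gpu-worker-1:2222"],
    "ps": ["module7-ex1-gpu-ps-0:2222"]
  },
  "task": {"type": "worker", "index": 0}
}
```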
@@ -11,15 +11,16 @@

 # This will result in creating 1 TFJob for every pair of learning rate and hidden layer depth
 {{- range $i, $lr := $lrlist }}
 {{- range $j, $nblayers := $nblayerslist }}
-apiVersion: kubeflow.org/v1alpha1
+apiVersion: kubeflow.org/v1alpha2
 kind: TFJob # Each one of our trainings will be a separate TFJob
 metadata:
   name: module8-tf-paint-{{ $i }}-{{ $j }} # We give a unique name to each training
   labels:
     chart: "{{ $chartname }}-{{ $chartversion | replace "+" "_" }}"
 spec:
-  replicaSpecs:
-    - template:
+  tfReplicaSpecs:
+    MASTER:
+      template:
       spec:
         restartPolicy: OnFailure
         containers:
@@ -40,7 +41,7 @@ spec:

         {{ if $useGPU }} # We only want to request GPUs if we asked for it in values.yaml with useGPU
           resources:
             limits:
-              alpha.kubernetes.io/nvidia-gpu: 1
+              nvidia.com/gpu: 1
         {{ end }}
           volumeMounts:
           - mountPath: /tmp/tensorflow
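For completeness, a sketch of the values.yaml this template presumably ranges over; `$lrlist`, `$nblayerslist`, and `$useGPU` appear in the template, but the exact keys and values here are assumptions:

```yaml
useGPU: true
lrlist:
  - 0.0001
  - 0.001
nblayerslist:
  - 4
  - 5
```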