Add lab for installing KF + refactor TfJob (#9)
* add lab for installing KF
* add kubeflow section
* refactor tfjob module
* update gpus
* remove tensorboard
* fix GPU inconsistencies
* update all indexes
* reviewers comments
@ -1,5 +1,9 @@
# Introduction

This lab will walk you through setting up [Kubeflow](https://github.com/kubeflow/kubeflow) on a Kubernetes cluster on Azure Container Service (AKS).

We will then take a look at how to use the different components that make up Kubeflow.

## Motivations

Machine learning model development and operationalization currently have very few industry-wide best practices to help us reduce the time to market and optimize the different steps.

@ -235,7 +235,7 @@ Most importantly we want to be able to reuse this image on the Kubernetes cluster
So let's push our image to Docker Hub:

```console
docker push ${DOCKER_USERNAME}/tf-mnist
docker push ${DOCKER_USERNAME}/tf-mnist:gpu
```

If these commands don't look familiar to you, make sure you went through parts 1 and 2 of Docker's tutorial, and more precisely: [Tutorial - Share your image](https://docs.docker.com/get-started/part2/#share-your-image)
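If a push is rejected, it is usually an authentication or tagging issue. These standard Docker commands let you confirm that you are logged in and that both tags exist locally before retrying:

```console
> docker login
> docker images ${DOCKER_USERNAME}/tf-mnist
```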
Before Width: | Height: | Size: 219 KiB After Width: | Height: | Size: 219 KiB
@ -77,8 +77,6 @@ You could also use [acs-engine](https://github.com/Azure/acs-engine) if you prefer

As of this writing, GPUs are available for AKS in the `eastus` and `westeurope` regions. If you want more options, you may want to use acs-engine for more flexibility.

Only module 3 has an exercise that is specific to GPU VMs; all other modules can be followed on either CPU or GPU clusters, so if you are on a budget, feel free to create a CPU cluster instead.

### With the CLI

#### Creating a resource group

@ -172,6 +170,7 @@ We want our deployment to have a few characteristics:
* It should run the image you created in module 1 (or `wbuchwalter/tf-mnist` if you skipped this module).
* The `Job` should be named `2-mnist-training`.
* We want our training to run for `500` steps.
* We want our training to use 1 GPU.

Here is what this would look like in YAML format:

@ -187,8 +186,11 @@ spec:
    spec:
      containers: # List of containers that should run inside the pod; in our case there is only one.
      - name: tensorflow
        image: ${DOCKER_USERNAME}/tf-mnist:gpu # The image to run; you can replace it with your own.
        args: ["--max_steps", "500"] # Optional arguments to pass to our command. By default the command is defined by ENTRYPOINT in the Dockerfile.
        resources:
          limits:
            alpha.kubernetes.io/nvidia-gpu: 1 # We ask Kubernetes to assign 1 GPU to this container
      restartPolicy: OnFailure # Restart the pod if it fails
```
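
Note that `kubectl` does not expand `${DOCKER_USERNAME}` itself; the placeholder has to be resolved before the manifest is submitted. A minimal sketch using `sed` (the `janedoe` username and the `job.yaml` file name are placeholders for illustration):

```shell
# Placeholder username; replace with your Docker Hub username.
DOCKER_USERNAME=janedoe

# Write a one-line stand-in for the manifest above, then resolve the placeholder.
cat > job.yaml <<'EOF'
    image: ${DOCKER_USERNAME}/tf-mnist:gpu
EOF
sed "s|\${DOCKER_USERNAME}|${DOCKER_USERNAME}|" job.yaml > job-resolved.yaml
cat job-resolved.yaml
```

The resolved file can then be submitted with `kubectl create -f job-resolved.yaml`.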
@ -195,4 +195,4 @@ You should see the following web page from your deployment:

## Next Step

[4 - Kubeflow](../4-kubeflow/README.md)

@ -1,540 +0,0 @@
# `tensorflow/k8s` and `TFJob`

## Prerequisites

* [3 - Helm](../3-helm/README.md)
* [4 - GPUs](../4-gpus/README.md)

## Summary

In this module you will learn how [`tensorflow/k8s`](https://github.com/tensorflow/k8s) can greatly simplify our lives when running TensorFlow on Kubernetes.

## `tensorflow/k8s`

As we saw earlier, giving a container access to a GPU is not exactly a breeze on Kubernetes: we need to manually mount the drivers from the node into the container.
If you have already tried to run a distributed TensorFlow training, you know that it's not easy either. Getting the `ClusterSpec` right can be painful if you have more than a couple of VMs, and it's also quite brittle (we will look more into distributed TensorFlow in module [6 - Distributed TensorFlow](../6-distributed-tensorflow/README.md)).

`tensorflow/k8s` is a new project in TensorFlow's organization on GitHub that makes all of this much easier.

### Installing `tensorflow/k8s`

Installing `tensorflow/k8s` with Helm is very easy; just run the following commands:

```console
> CHART=https://storage.googleapis.com/tf-on-k8s-dogfood-releases/latest/tf-job-operator-chart-latest.tgz
> helm install ${CHART} -n tf-job --wait --replace --set cloud=azure
```
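
Once deployed, the status of the release can be checked again at any time with standard Helm commands (these are not specific to this chart):

```console
> helm status tf-job
> helm list
```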

If it worked, you should see something like:

```
NAME: tf-job
LAST DEPLOYED: Mon Nov 20 14:24:16 2017
NAMESPACE: default
STATUS: DEPLOYED

RESOURCES:
==> v1/ConfigMap
NAME                    DATA  AGE
tf-job-operator-config  1     7s

==> v1beta1/Deployment
NAME             DESIRED  CURRENT  UP-TO-DATE  AVAILABLE  AGE
tf-job-operator  1        1        1           1          7s

==> v1/Pod(related)
NAME                              READY  STATUS   RESTARTS  AGE
tf-job-operator-3005087210-c3js3  1/1    Running  1         4s
```

This means that 3 resources were created: a `ConfigMap`, a `Deployment`, and a `Pod`.
We will see in just a moment what each of them does.

### Kubernetes Custom Resource Definition

Kubernetes has a concept of [Custom Resources](https://kubernetes.io/docs/concepts/api-extension/custom-resources/) (often abbreviated CRD) that allows us to create custom objects that we will then be able to use.
In the case of `tensorflow/k8s`, after installation, a new `TFJob` object will be available in our cluster. This object allows us to describe a TensorFlow training.

#### `TFJob` Specifications

Before going further, let's take a look at what the `TFJob` looks like:

> Note: Some of the fields are not described here for brevity.

**`TFJob` Object**

| Field | Type | Description |
|-------|------|-------------|
| apiVersion | `string` | Versioned schema of this representation of an object. In our case, it's `kubeflow.org/v1alpha1` |
| kind | `string` | Value representing the REST resource this object represents. In our case it's `TFJob` |
| metadata | [`ObjectMeta`](https://github.com/kubernetes/community/blob/master/contributors/devel/api-conventions.md#metadata) | Standard object's metadata. |
| spec | `TFJobSpec` | The actual specification of our TensorFlow job, defined below. |

`spec` is the most important part, so let's look at it too:

**`TFJobSpec` Object**

| Field | Type | Description |
|-------|------|-------------|
| ReplicaSpecs | `TFReplicaSpec` array | Specification for a set of TensorFlow processes, defined below |

Let's go deeper:

**`TFReplicaSpec` Object**

| Field | Type | Description |
|-------|------|-------------|
| TfReplicaType | `string` | What type of replica are we defining? Can be `MASTER`, `WORKER` or `PS`. When not doing distributed TensorFlow, we just use `MASTER`, which happens to be the default value. |
| Replicas | `int` | Number of replicas of `TfReplicaType`. Again, this is useful only for distributed TensorFlow. Default value is `1`. |
| Template | [`PodTemplateSpec`](https://kubernetes.io/docs/api-reference/v1.8/#podtemplatespec-v1-core) | Describes the pod that will be created when executing a job. This is the standard Pod description that we have been using everywhere. |

As a refresher, here is what a simple TensorFlow training (with GPU) would look like using "vanilla" Kubernetes:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: example-job
spec:
  template:
    metadata:
      name: example-job
    spec:
      restartPolicy: OnFailure
      volumes:
      - name: bin
        hostPath:
          path: /usr/lib/nvidia-384/bin
      - name: lib
        hostPath:
          path: /usr/lib/nvidia-384
      containers:
      - name: tensorflow
        image: wbuchwalter/<SAMPLE IMAGE>
        resources:
          requests:
            alpha.kubernetes.io/nvidia-gpu: 1
        volumeMounts:
        - name: bin
          mountPath: /usr/local/nvidia/bin
        - name: lib
          mountPath: /usr/local/nvidia/lib64
```

Here is what the same thing looks like using the new `TFJob` resource:

```yaml
apiVersion: kubeflow.org/v1alpha1
kind: TFJob
metadata:
  name: example-tfjob
spec:
  replicaSpecs:
  - template:
      spec:
        containers:
        - image: wbuchwalter/<SAMPLE IMAGE>
          name: tensorflow
          resources:
            requests:
              alpha.kubernetes.io/nvidia-gpu: 1
        restartPolicy: OnFailure
```

No need to mount drivers anymore! Note that we are not specifying `TfReplicaType` or `Replicas`, as the default values are already what we want.
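
For illustration, here is a sketch of the same job with the two defaulted fields written out explicitly. It should behave identically to the manifest above, since `MASTER` and `1` are the defaults; this assumes the YAML field names follow the same lowerCamelCase convention as the other fields in the spec:

```yaml
apiVersion: kubeflow.org/v1alpha1
kind: TFJob
metadata:
  name: example-tfjob
spec:
  replicaSpecs:
  - tfReplicaType: MASTER  # default value, shown here only for clarity
    replicas: 1            # default value, shown here only for clarity
    template:
      spec:
        containers:
        - image: wbuchwalter/<SAMPLE IMAGE>
          name: tensorflow
          resources:
            requests:
              alpha.kubernetes.io/nvidia-gpu: 1
        restartPolicy: OnFailure
```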

#### How does this work?

As we saw earlier, when we installed the Helm chart for `tensorflow/k8s`, 3 resources were created in our cluster:
* A `ConfigMap` named `tf-job-operator-config`
* A `Deployment`
* And a `Pod` named `tf-job-operator`

The `tf-job-operator` pod (simply called the operator, or `TFJob` operator) is going to monitor your cluster, and every time you create a new resource of type `TFJob`, the operator will know what to do with it.
Specifically, when you create a new `TFJob`, the operator will create a new Kubernetes `Job` for it, and automatically mount the drivers if needed (i.e. when you request a GPU).

You may wonder how the operator knows which directories need to be mounted in the container for the NVIDIA drivers: that's where the `ConfigMap` comes into play.

In K8s, a [`ConfigMap`](https://kubernetes.io/docs/tasks/configure-pod-container/configmap/) is a simple object that contains key-value pairs. This `ConfigMap` can then be linked with a container to inject some configuration.

When we installed the Helm chart, we specified which cloud provider we are running on with `--set cloud=azure`.
This creates a `ConfigMap` that contains configuration options specific to Azure, including the list of directories to mount.

We can take a look at what is inside our `tf-job-operator-config` by doing:

```console
kubectl describe configmaps tf-job-operator-config
```

The output is:

```
Name:         tf-job-operator-config
Namespace:    default
Labels:       <none>
Annotations:  <none>

Data
====
controller_config_file.yaml:
----
grpcServerFilePath: /opt/mlkube/grpc_tensorflow_server/grpc_tensorflow_server.py
accelerators:
  alpha.kubernetes.io/nvidia-gpu:
    volumes:
    - name: lib
      mountPath: /usr/local/nvidia/lib64
      hostPath: /usr/lib/nvidia-384
    - name: bin
      mountPath: /usr/local/nvidia/bin
      hostPath: /usr/lib/nvidia-384/bin
    - name: libcuda
      mountPath: /usr/lib/x86_64-linux-gnu/libcuda.so.1
      hostPath: /usr/lib/x86_64-linux-gnu/libcuda.so.1
```

If you want to know more:
* [tensorflow/k8s](https://github.com/tensorflow/k8s) GitHub repository
* [Introducing Operators](https://coreos.com/blog/introducing-operators.html), a blog post by CoreOS explaining the Operator pattern

## Exercises

### Exercise 1: A Simple `TFJob`

Let's schedule a very simple TensorFlow job using `TFJob` first.

> Note: If you completed the exercises in Modules 1 and 2, you can change the image to use the one you pushed instead.

Depending on whether or not your cluster has GPUs, choose the appropriate template:

<details>
<summary><strong>CPU Only</strong></summary>

```yaml
apiVersion: kubeflow.org/v1alpha1
kind: TFJob
metadata:
  name: module5-ex1
spec:
  replicaSpecs:
  - template:
      spec:
        containers:
        - image: wbuchwalter/tf-mnist:cpu
          name: tensorflow
        restartPolicy: OnFailure
```

</details>

<details>
<summary><strong>With GPU</strong></summary>

When using GPUs, we need to request one (or more), and the image we are using also needs to be based on TensorFlow's GPU image.

```yaml
apiVersion: kubeflow.org/v1alpha1
kind: TFJob
metadata:
  name: module5-ex1-gpu
spec:
  replicaSpecs:
  - template:
      spec:
        containers:
        - image: wbuchwalter/tf-mnist:gpu
          name: tensorflow
          resources:
            requests:
              alpha.kubernetes.io/nvidia-gpu: 1
        restartPolicy: OnFailure
```

</details>

Save the template that applies to you in a file, and create the `TFJob`:

```console
kubectl create -f <template-path>
```

Let's look at what has been created in our cluster.

First, a `TFJob` was created:

```console
kubectl get tfjob
```

Returns:

```
NAME          KIND
module5-ex1   TFJob.v1alpha1.tensorflow.org
```

As well as a `Job`, which was actually created by the operator:

```console
kubectl get job
```

Returns:

```
NAME                        DESIRED   SUCCESSFUL   AGE
module5-ex1-master-xs4b-0   1         0            2m
```

And a `Pod`:

```console
kubectl get pod
```

Returns:

```
NAME                              READY     STATUS    RESTARTS   AGE
module5-ex1-master-xs4b-0-6gpfn   1/1       Running   0          2m
```

Note that the `Pod` might take a few minutes before actually running, as the Docker image needs to be pulled onto the node first.

Once the `Pod`'s status is either `Running` or `Completed`, we can start looking at its logs:

```console
kubectl logs <your-pod-name>
```

This container is pretty verbose, but you should see a TensorFlow training happening:

```
[...]
INFO:tensorflow:2017-11-20 20:57:22.314198: Step 480: Cross entropy = 0.142486
INFO:tensorflow:2017-11-20 20:57:22.370080: Step 480: Validation accuracy = 85.0% (N=100)
INFO:tensorflow:2017-11-20 20:57:22.896383: Step 490: Train accuracy = 98.0%
INFO:tensorflow:2017-11-20 20:57:22.896600: Step 490: Cross entropy = 0.075210
INFO:tensorflow:2017-11-20 20:57:22.945611: Step 490: Validation accuracy = 91.0% (N=100)
INFO:tensorflow:2017-11-20 20:57:23.407756: Step 499: Train accuracy = 94.0%
INFO:tensorflow:2017-11-20 20:57:23.407980: Step 499: Cross entropy = 0.170348
INFO:tensorflow:2017-11-20 20:57:23.457325: Step 499: Validation accuracy = 89.0% (N=100)
INFO:tensorflow:Final test accuracy = 88.4% (N=353)
[...]
```

Once your job is completed, clean it up:

```console
kubectl delete tfjob module5-ex1
```

> That's great and all, but how do we grab our trained model and TensorFlow's summaries?

Well, currently we can't. As soon as the training is complete, the container stops, and everything inside it, including the model and logs, is lost.

Thankfully, Kubernetes `Volumes` can help us here.
If you remember, we quickly introduced `Volumes` in module [2 - Kubernetes](../2-kubernetes/), and that's what we already used to mount the drivers from the node into the container.
But `Volumes` are not just for mounting things from a node; we can also use them to mount a lot of different storage solutions (you can see the full list [here](https://kubernetes.io/docs/concepts/storage/volumes/)).

In our case we are going to use Azure Files, as it is really easy to use with Kubernetes.

## Exercise 2: Azure Files to the Rescue

### Creating a New File Share and Kubernetes Secret

In the official documentation, [Using Azure Files with Kubernetes](https://docs.microsoft.com/en-in/azure/aks/azure-files), follow the steps listed under `Create an Azure file share` and `Create Kubernetes Secret`, but be aware of a few details first:
* It is **very** important that you create your storage account (and hence your resource group) in the **same** region as your Kubernetes cluster: because Azure Files uses the `SMB` protocol, it won't work across regions. `AKS_PERS_LOCATION` should be updated accordingly.
* While this document specifically refers to AKS, it will work for any K8s cluster.
* Name your file share `tensorflow`. While the share could be named anything, it will make it easier to follow the examples later on. `AKS_PERS_SHARE_NAME` should be updated accordingly.
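
For reference, the steps in that document condense to roughly the following Azure CLI commands. This is a sketch, not a substitute for the official doc; all names here are placeholders, and the storage account name must be globally unique:

```console
> AKS_PERS_RESOURCE_GROUP=myResourceGroup
> AKS_PERS_LOCATION=eastus                      # must match your cluster's region
> AKS_PERS_STORAGE_ACCOUNT_NAME=mystorageacct$RANDOM
> AKS_PERS_SHARE_NAME=tensorflow

> az storage account create -n $AKS_PERS_STORAGE_ACCOUNT_NAME -g $AKS_PERS_RESOURCE_GROUP -l $AKS_PERS_LOCATION --sku Standard_LRS
> AZURE_STORAGE_CONNECTION_STRING=$(az storage account show-connection-string -n $AKS_PERS_STORAGE_ACCOUNT_NAME -g $AKS_PERS_RESOURCE_GROUP -o tsv)
> az storage share create -n $AKS_PERS_SHARE_NAME
> STORAGE_KEY=$(az storage account keys list --resource-group $AKS_PERS_RESOURCE_GROUP --account-name $AKS_PERS_STORAGE_ACCOUNT_NAME --query "[0].value" -o tsv)
> kubectl create secret generic azure-secret --from-literal=azurestorageaccountname=$AKS_PERS_STORAGE_ACCOUNT_NAME --from-literal=azurestorageaccountkey=$STORAGE_KEY
```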

Once you have completed all the steps, run:

```console
kubectl get secrets
```

Which should return:

```
NAME           TYPE      DATA      AGE
azure-secret   Opaque    2         4m
```

### Updating our example to use our Azure File Share

Now we need to mount our new file share into our container so the model and the summaries can be persisted.
It turns out that mounting an Azure File share into a container is really easy; we simply need to reference our secret in the `Volume` definition:

```yaml
[...]
containers:
- image: <IMAGE>
  name: tensorflow
  volumeMounts:
  - name: azurefile
    mountPath: <MOUNT_PATH>
volumes:
- name: azurefile
  azureFile:
    secretName: azure-secret
    shareName: tensorflow
    readOnly: false
```

Update your template from exercise 1 to mount the Azure File share into your container, and create your new job.
Note that by default our container saves everything into `/app/tf_files`, so that's the value you will want to use for `MOUNT_PATH`.

Once the container starts running, if you go to the Azure Portal, into your storage account, and browse your `tensorflow` file share, you should see something like this:

![file-share](./file-share.png)

This means that when we run a training, all the important data is now stored in Azure Files and is still available as long as we don't delete the file share.

#### Solution for Exercise 2

*For brevity, the solution shown here is for CPU-only training. If you are using GPU, don't forget to update the image tag and add a GPU request.*

<details>
<summary><strong>Solution</strong></summary>

```yaml
apiVersion: kubeflow.org/v1alpha1
kind: TFJob
metadata:
  name: module5-ex2
spec:
  replicaSpecs:
  - template:
      spec:
        containers:
        - image: wbuchwalter/tf-mnist:cpu
          name: tensorflow
          volumeMounts:
          # By default our classifier saves the summaries in /tmp/tensorflow,
          # so that's where we want to mount our Azure File Share.
          - name: azurefile
            # The subPath allows us to mount a subdirectory within the Azure file share instead of root.
            # This is useful so that we can save the logs for each run in a different subdirectory
            # instead of overwriting what was done before.
            subPath: module5-ex2
            mountPath: /tmp/tensorflow
        volumes:
        - name: azurefile
          azureFile:
            # We reference the secret we created just earlier
            # so that the account name and key are passed securely and not directly in a template.
            secretName: azure-secret
            shareName: tensorflow
            readOnly: false
        restartPolicy: OnFailure
```

</details>

**Don't forget to delete the `TFJob` once it is completed!**

> Great, but what if I want to check out the training in TensorBoard? Do I need to download everything to my machine?

Actually no, you don't. `TFJob` provides a very handy mechanism to monitor your trainings with TensorBoard easily!
We will try that in our third exercise.

### Exercise 3: Adding TensorBoard

So far, we have a TensorFlow training running, and its model and summaries are persisted to an Azure File share.
But having TensorBoard monitoring the training would be pretty useful as well.
It turns out `TFJob` can also help us with that.

When we looked at the `TFJob` specification at the beginning of this module, we omitted some fields in the `TFJobSpec` description.
Here is a still incomplete but more accurate representation, with one additional field:

**`TFJobSpec` Object**

| Field | Type | Description |
|-------|------|-------------|
| ReplicaSpecs | `TFReplicaSpec` array | Specification for a set of TensorFlow processes. |
| **TensorBoard** | `TensorBoardSpec` | Configuration to start a TensorBoard deployment associated with our training job. Defined below. |

That's right, `TFJobSpec` contains an object of type `TensorBoardSpec` which allows us to describe a TensorBoard instance!
Let's look at it:

**`TensorBoardSpec` Object**

| Field | Type | Description |
|-------|------|-------------|
| LogDir | `string` | Location of TensorFlow summaries in the TensorBoard container. |
| ServiceType | [`ServiceType`](https://kubernetes.io/docs/concepts/services-networking/service/#publishing-services---service-types) | What kind of service should expose TensorBoard. Usually `ClusterIP` (only reachable from within the cluster) or `LoadBalancer` (exposes the service externally using a cloud provider's load balancer). |
| Volumes | [`Volume`](https://kubernetes.io/docs/api-reference/v1.8/#volume-v1-core) array | List of volumes that can be mounted. |
| VolumeMounts | [`VolumeMount`](https://kubernetes.io/docs/api-reference/v1.8/#volumemount-v1-core) array | Pod volumes to mount into the container's filesystem. |

Let's add TensorBoard to our job then.
Here is how this will work: we will keep the same TensorFlow training job as in exercise 2. This `TFJob` will write the model and summaries to the Azure File share.
We will also set up the configuration for TensorBoard so that it reads the summaries from the same Azure File share:
* `Volumes` and `VolumeMounts` in `TensorBoardSpec` should be updated adequately.
* For `ServiceType`, you should use `LoadBalancer`; this will create a public IP, so it will be easier to access.
* `LogDir` will depend on how you configure `VolumeMounts`, but on your file share the summaries will be under the `training_summaries` subdirectory.

#### Solution for Exercise 3

*For brevity, the solution shown here is for CPU-only training. If you are using GPU, don't forget to update the image tag and add a GPU request.*

<details>
<summary><strong>Solution</strong></summary>

```yaml
apiVersion: kubeflow.org/v1alpha1
kind: TFJob
metadata:
  name: module5-ex3
spec:
  replicaSpecs:
  - template:
      spec:
        volumes:
        - name: azurefile
          azureFile:
            secretName: azure-secret
            shareName: tensorflow
            readOnly: false
        containers:
        - image: wbuchwalter/tf-mnist:cpu
          name: tensorflow
          volumeMounts:
          - mountPath: /tmp/tensorflow
            subPath: module5-ex3 # Again we isolate the logs in a new directory on Azure Files
            name: azurefile
        restartPolicy: OnFailure
  tensorboard:
    logDir: /tmp/tensorflow/logs
    serviceType: LoadBalancer # We request a public IP for our TensorBoard instance
    volumes:
    - name: azurefile
      azureFile:
        secretName: azure-secret
        shareName: tensorflow
    volumeMounts:
    - mountPath: /tmp/tensorflow/ # This could be any other path. All that matters is that LogDir reflects it.
      subPath: module5-ex3 # This should match the directory our Master is actually writing in
      name: azurefile
```

</details>

#### Validation

If you updated the `TFJob` template correctly, when doing:

```console
kubectl get services
```

You should see something like:

```
NAME                           CLUSTER-IP     EXTERNAL-IP     PORT(S)        AGE
kubernetes                     10.0.0.1       <none>          443/TCP        14d
module5-ex3-master-7yqt-0      10.0.126.11    <none>          2222/TCP       5m
module5-ex3-tensorboard-7yqt   10.0.199.170   104.42.193.76   80:31770/TCP   5m
```

Note that provisioning a public IP on Azure can take a few minutes. During this time, the `EXTERNAL-IP` for TensorBoard's service will show as `<pending>`.

Once the public IP is provisioned, browse to it, and you should land on a working TensorBoard instance with live monitoring of the running training job.
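
Rather than re-running `kubectl get services` by hand, the standard `-w` (watch) flag can be used to wait until the external IP is assigned:

```console
> kubectl get services -w
```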

![TensorBoard](./tensorboard.png)

## Next Step

[6 - Distributed TensorFlow](../6-distributed-tensorflow)

@ -0,0 +1,95 @@
# Kubeflow - Overview and Installation

## Prerequisites

* [1 - Docker](../1-docker/README.md)
* [2 - Kubernetes](../2-kubernetes/README.md)

## Summary

In this module we are going to get an overview of the different components that make up [Kubeflow](https://github.com/kubeflow/kubeflow), and how to install them into our newly deployed Kubernetes cluster.

### Kubeflow Overview

From [Kubeflow](https://github.com/kubeflow/kubeflow)'s own documentation:

> The Kubeflow project is dedicated to making deployments of machine learning (ML) workflows on Kubernetes simple, portable and scalable. Our goal is not to recreate other services, but to provide a straightforward way to deploy best-of-breed open-source systems for ML to diverse infrastructures. Anywhere you are running Kubernetes, you should be able to run Kubeflow.

Kubeflow is composed of multiple components:
* [JupyterHub](https://jupyterhub.readthedocs.io/en/latest/), which allows users to request an instance of a Jupyter Notebook server dedicated to them.
* One or multiple training controllers. These are components that simplify and manage the deployment of training jobs. For the purpose of this lab we are only going to deploy a training controller for TensorFlow jobs. However, the Kubeflow community has started working on controllers for PyTorch and Caffe2 as well.
* A serving component that will help you serve predictions with your models.

For more general info on Kubeflow, head to the repo's [README](https://github.com/kubeflow/kubeflow/blob/master/README.md).

### Deploying Kubeflow

Kubeflow uses [`ksonnet`](https://github.com/ksonnet/ksonnet) templates as a way to package and deploy the different components.

> ksonnet simplifies defining an application configuration, updating the configuration over time, and specializing it for different clusters and environments.

First, install ksonnet version [0.9.2](https://ksonnet.io/#get-started).
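
You can confirm the installation succeeded, and check which version is on your `PATH`, with ksonnet's built-in version command:

```console
> ks version
```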

Then run the following commands to deploy Kubeflow in your Kubernetes cluster:

```bash
# Create a namespace for the Kubeflow deployment
NAMESPACE=kubeflow
kubectl create namespace ${NAMESPACE}

# Which version of Kubeflow to use.
# For a list of releases refer to:
# https://github.com/kubeflow/kubeflow/releases
VERSION=v0.1.2

# Initialize a ksonnet app. Set the namespace for its default environment.
APP_NAME=my-kubeflow
ks init ${APP_NAME}
cd ${APP_NAME}
ks env set default --namespace ${NAMESPACE}

# Add a reference to Kubeflow's ksonnet manifests
ks registry add kubeflow github.com/kubeflow/kubeflow/tree/${VERSION}/kubeflow

# Install Kubeflow components
ks pkg install kubeflow/core@${VERSION}
ks pkg install kubeflow/tf-serving@${VERSION}
ks pkg install kubeflow/tf-job@${VERSION}

# Create templates for core components
ks generate kubeflow-core kubeflow-core

# Customize Kubeflow's installation for AKS
ks param set kubeflow-core cloud aks

# Enable collection of anonymous usage metrics.
# Skip this step if you don't want to enable collection.
ks param set kubeflow-core reportUsage true
ks param set kubeflow-core usageId $(uuidgen)

# Deploy Kubeflow
ks apply default -c kubeflow-core
```

### Validation

`kubectl get pods -n kubeflow`

should return something like this:

```
NAME                                READY     STATUS    RESTARTS   AGE
ambassador-7789cddc5d-czf7p         2/2       Running   0          1d
ambassador-7789cddc5d-f79zp         2/2       Running   0          1d
ambassador-7789cddc5d-h57ms         2/2       Running   0          1d
centraldashboard-d5bf74c6b-nn925    1/1       Running   0          1d
tf-hub-0                            1/1       Running   0          1d
tf-job-dashboard-8699ccb5ff-9phmv   1/1       Running   0          1d
tf-job-operator-646bdbcb7-bc479     1/1       Running   0          1d
```

The most important components for the purpose of this lab are `tf-hub-0`, which is the JupyterHub spawner running on your cluster, and `tf-job-operator-646bdbcb7-bc479`, which is a controller that monitors your cluster for new TensorFlow training job (`TfJob`) specifications and manages the training. We will look at these two components later.

## Next Step

[5 - JupyterHub](../5-jupyterhub)

@ -3,7 +3,7 @@
## Prerequisites
* [1 - Docker Basics](../1-docker)
* [2 - Kubernetes Basics and cluster created](../2-kubernetes)
* [4 - Kubeflow](../4-kubeflow)

## Summary

@ -95,3 +95,8 @@ kubectl -n ${NAMESPACE} describe pods jupyter-${USERNAME}
```

After the pod status changes to `running`, you will see a new Jupyter notebook running at http://127.0.0.1:8000/user/{USERNAME}/tree or http://{PUBLIC-IP}/user/{USERNAME}/tree.

## Next Step

[6 - TfJob](../6-tfjob)

@ -0,0 +1,272 @@
|
|||
# `TFJob`
|
||||
|
||||
## Prerequisites
|
||||
|
||||
* [1 - Docker](../1-docker/README.md)
|
||||
* [2 - Kubernetes](../2-kubernetes/README.md)
|
||||
* [4 - Kubeflow](../4-kubeflow/README.md)
|
||||
|
||||
## Summary
|
||||
|
||||
In this module you will learn how to describe a TensorFlow training using `TfJob` object.
|
||||
|
||||
|
||||
### Kubernetes Custom Resource Definition

Kubernetes has a concept of [Custom Resources](https://kubernetes.io/docs/concepts/api-extension/custom-resources/), defined through Custom Resource Definitions (often abbreviated CRD), that allows us to create custom object types and then use them like any built-in object.
In the case of Kubeflow, after installation, a new `TFJob` object will be available in our cluster. This object allows us to describe a TensorFlow training.

#### `TFJob` Specifications

Before going further, let's take a look at what the `TFJob` object looks like:

> Note: Some of the fields are not described here for brevity.

**`TFJob` Object**

| Field | Type | Description |
|-------|------|-------------|
| apiVersion | `string` | Versioned schema of this representation of an object. In our case, it's `kubeflow.org/v1alpha1` |
| kind | `string` | Value representing the REST resource this object represents. In our case it's `TFJob` |
| metadata | [`ObjectMeta`](https://github.com/kubernetes/community/blob/master/contributors/devel/api-conventions.md#metadata) | Standard object metadata. |
| spec | `TFJobSpec` | The actual specification of our TensorFlow job, defined below. |

`spec` is the most important part, so let's look at it too:

**`TFJobSpec` Object**

| Field | Type | Description |
|-------|------|-------------|
| ReplicaSpecs | `TFReplicaSpec` array | Specification for a set of TensorFlow processes, defined below |

Let's go deeper:

**`TFReplicaSpec` Object**

| Field | Type | Description |
|-------|------|-------------|
| TfReplicaType | `string` | What type of replica are we defining? Can be `MASTER`, `WORKER` or `PS`. When not doing distributed TensorFlow, we just use `MASTER`, which happens to be the default value. |
| Replicas | `int` | Number of replicas of `TfReplicaType`. Again, this is only useful for distributed TensorFlow. Default value is `1`. |
| Template | [`PodTemplateSpec`](https://kubernetes.io/docs/api-reference/v1.8/#podtemplatespec-v1-core) | Describes the pod that will be created when executing a job. This is the standard Pod description that we have been using everywhere. |

Here is what a simple TensorFlow training job looks like using this `TFJob` object:

```yaml
apiVersion: kubeflow.org/v1alpha1
kind: TFJob
metadata:
  name: example-tfjob
spec:
  replicaSpecs:
    - template:
        spec:
          containers:
            - image: wbuchwalter/<SAMPLE IMAGE>
              name: tensorflow
              resources:
                limits:
                  alpha.kubernetes.io/nvidia-gpu: 1
          restartPolicy: OnFailure
```

Note that we are not specifying `TfReplicaType` or `Replicas`, as the default values are already what we want.
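
The manifest above is just data, so it can also be built programmatically. Here is a minimal Python sketch (our own helper, not part of the lab or of the Kubeflow API) that assembles the same `TFJob` manifest as a dictionary, for example to submit it later with the Kubernetes Python client's `CustomObjectsApi` instead of `kubectl`:

```python
# Hypothetical helper: builds a TFJob manifest equivalent to the YAML above.
# The image name and GPU count are placeholders you would adapt.

def make_tfjob(name: str, image: str, gpus: int = 1) -> dict:
    """Build a v1alpha1 TFJob manifest as a plain dictionary."""
    return {
        "apiVersion": "kubeflow.org/v1alpha1",
        "kind": "TFJob",
        "metadata": {"name": name},
        "spec": {
            "replicaSpecs": [
                {
                    # TfReplicaType / Replicas omitted: defaults are MASTER / 1.
                    "template": {
                        "spec": {
                            "containers": [
                                {
                                    "image": image,
                                    "name": "tensorflow",
                                    "resources": {
                                        "limits": {"alpha.kubernetes.io/nvidia-gpu": gpus}
                                    },
                                }
                            ],
                            "restartPolicy": "OnFailure",
                        }
                    }
                }
            ]
        },
    }

manifest = make_tfjob("example-tfjob", "wbuchwalter/tf-mnist:gpu")
```

Assuming the TFJob CRD is installed in the cluster, submitting this dictionary would then be a call to `CustomObjectsApi.create_namespaced_custom_object(...)` with group `kubeflow.org` and version `v1alpha1`.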

## Exercises

### Exercise 1: A Simple `TFJob`

Let's first schedule a very simple TensorFlow job using `TFJob`.

> Note: If you completed the exercises in modules 1 and 2, you can change the image to use the one you pushed instead.

When using GPUs, we need to request one (or more) in the container's resource limits, and the image we are using needs to be based on TensorFlow's GPU image.

```yaml
apiVersion: kubeflow.org/v1alpha1
kind: TFJob
metadata:
  name: module5-ex1-gpu
spec:
  replicaSpecs:
    - template:
        spec:
          containers:
            - image: wbuchwalter/tf-mnist:gpu
              name: tensorflow
              resources:
                limits:
                  alpha.kubernetes.io/nvidia-gpu: 1
          restartPolicy: OnFailure
```

Save the template to a file, and create the `TFJob`:

```console
kubectl create -f <template-path>
```

Let's look at what has been created in our cluster.

First, a `TFJob` was created:

```console
kubectl get tfjob
```

Returns:

```
NAME              AGE
module5-ex1-gpu   5s
```

As well as a `Job`, which was actually created by the operator:

```console
kubectl get job
```

Returns:

```
NAME                        DESIRED   SUCCESSFUL   AGE
module5-ex1-master-xs4b-0   1         0            2m
```

And a `Pod`:

```console
kubectl get pod
```

Returns:

```
NAME                              READY     STATUS    RESTARTS   AGE
module5-ex1-master-xs4b-0-6gpfn   1/1       Running   0          2m
```

Note that the `Pod` might take a few minutes to start running: the Docker image first needs to be pulled onto the node.
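
If you would rather script the wait than re-run `kubectl get pod` by hand, here is a small sketch (our own helpers, not part of the lab) that reads the pod phase from `kubectl`'s JSON output:

```python
import json
import subprocess

def pod_phase(pod_json: str) -> str:
    """Extract the phase (Pending, Running, Succeeded, ...) from pod JSON."""
    return json.loads(pod_json)["status"]["phase"]

def get_pod_phase(name: str) -> str:
    # `kubectl get pod <name> -o json` prints the full pod object as JSON.
    out = subprocess.check_output(["kubectl", "get", "pod", name, "-o", "json"])
    return pod_phase(out.decode())
```

You could then poll `get_pod_phase(...)` in a loop until it returns `Running` or `Succeeded`.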

Once the `Pod`'s status is either `Running` or `Completed`, we can start looking at its logs:

```console
kubectl logs <your-pod-name>
```

This container is pretty verbose, but you should see a TensorFlow training happening:

```
[...]
INFO:tensorflow:2017-11-20 20:57:22.314198: Step 480: Cross entropy = 0.142486
INFO:tensorflow:2017-11-20 20:57:22.370080: Step 480: Validation accuracy = 85.0% (N=100)
INFO:tensorflow:2017-11-20 20:57:22.896383: Step 490: Train accuracy = 98.0%
INFO:tensorflow:2017-11-20 20:57:22.896600: Step 490: Cross entropy = 0.075210
INFO:tensorflow:2017-11-20 20:57:22.945611: Step 490: Validation accuracy = 91.0% (N=100)
INFO:tensorflow:2017-11-20 20:57:23.407756: Step 499: Train accuracy = 94.0%
INFO:tensorflow:2017-11-20 20:57:23.407980: Step 499: Cross entropy = 0.170348
INFO:tensorflow:2017-11-20 20:57:23.457325: Step 499: Validation accuracy = 89.0% (N=100)
INFO:tensorflow:Final test accuracy = 88.4% (N=353)
[...]
```

> That's great and all, but how do we grab our trained model and TensorFlow's summaries?

Well, currently we can't. As soon as the training completes, the container stops and everything inside it, including the model and the logs, is lost.

Thankfully, Kubernetes `Volumes` can help us here.
If you remember, we quickly introduced `Volumes` in module [2 - Kubernetes](../2-kubernetes/), and that's what we already used to mount the drivers from the node into the container.
But `Volumes` are not just for mounting things from a node: we can also use them to mount many different storage solutions; you can see the full list [here](https://kubernetes.io/docs/concepts/storage/volumes/).

In our case we are going to use Azure Files, as it is really easy to use with Kubernetes.

### Exercise 2: Azure Files to the Rescue

#### Creating a New File Share and a Kubernetes Secret

In the official documentation, [Using Azure Files with Kubernetes](https://docs.microsoft.com/en-us/azure/aks/azure-files-volume), follow the steps listed under `Create an Azure file share` and `Create Kubernetes Secret`, but be aware of a few details first:

* It is **very** important that you create your storage account (and hence your resource group) in the **same** region as your Kubernetes cluster: because Azure Files uses the `SMB` protocol, it won't work across regions. `AKS_PERS_LOCATION` should be updated accordingly.
* While this document specifically refers to AKS, it will work for any Kubernetes cluster.
* Name your file share `tensorflow`. While the share could be named anything, it will make it easier to follow the examples later on. `AKS_PERS_SHARE_NAME` should be updated accordingly.

Once you have completed all the steps, run:

```console
kubectl get secrets
```

Which should return:

```
NAME           TYPE      DATA      AGE
azure-secret   Opaque    2         4m
```
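
If you are curious what the `kubectl create secret` step from the Azure docs actually creates, here is a sketch (the account name and key are placeholders) of the resulting `Secret` manifest. The two `data` entries are simply base64-encoded, which is why `DATA` shows `2` above:

```python
import base64

def make_azure_secret(account_name: str, account_key: str) -> dict:
    """Build the azure-secret manifest that kubectl generates for us."""
    b64 = lambda s: base64.b64encode(s.encode()).decode()
    return {
        "apiVersion": "v1",
        "kind": "Secret",
        "metadata": {"name": "azure-secret"},
        "type": "Opaque",
        # These two key names are the ones the azureFile volume plugin reads.
        "data": {
            "azurestorageaccountname": b64(account_name),
            "azurestorageaccountkey": b64(account_key),
        },
    }
```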

#### Updating our Example to Use our Azure File Share

Now we need to mount our new file share into our container so the model and the summaries can be persisted.
It turns out mounting an Azure file share into a container is really easy: we simply need to reference our secret in the `Volume` definition:

```yaml
[...]
containers:
  - image: <IMAGE>
    name: tensorflow
    resources:
      limits:
        alpha.kubernetes.io/nvidia-gpu: 1
    volumeMounts:
      - name: azurefile
        mountPath: <MOUNT_PATH>
volumes:
  - name: azurefile
    azureFile:
      secretName: azure-secret
      shareName: tensorflow
      readOnly: false
```

Update your template from exercise 1 to mount the Azure file share into your container, and create your new job.
Note that by default our container saves everything into `/app/tf_files`, so that's the value you will want to use for `MOUNT_PATH`.
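
If you prefer to script the change instead of editing the YAML by hand, here is a standalone Python sketch (our own helper, the names are ours) that takes the pod `spec` from exercise 1 as a dictionary and attaches the Azure file share:

```python
def mount_azure_file(pod_spec: dict, mount_path: str = "/app/tf_files") -> dict:
    """Add the azurefile volume and its mount to an exercise-1 pod spec."""
    # Mount the share into the (single) training container.
    pod_spec["containers"][0].setdefault("volumeMounts", []).append(
        {"name": "azurefile", "mountPath": mount_path}
    )
    # Declare the volume itself, referencing the azure-secret created earlier.
    pod_spec.setdefault("volumes", []).append(
        {
            "name": "azurefile",
            "azureFile": {
                "secretName": "azure-secret",
                "shareName": "tensorflow",
                "readOnly": False,
            },
        }
    )
    return pod_spec
```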

Once the container starts running, if you go to the Azure Portal, open your storage account, and browse your `tensorflow` file share, you should see something like this:

![file-share](./file-share.png)

This means that when we run a training, all the important data is now stored in Azure Files and remains available as long as we don't delete the file share.

#### Solution for Exercise 2

<details>
<summary><strong>Solution</strong></summary>

```yaml
apiVersion: kubeflow.org/v1alpha1
kind: TFJob
metadata:
  name: module5-ex2
spec:
  replicaSpecs:
    - template:
        spec:
          containers:
            - image: wbuchwalter/tf-mnist:gpu
              name: tensorflow
              resources:
                limits:
                  alpha.kubernetes.io/nvidia-gpu: 1
              volumeMounts:
                # By default our classifier saves the summaries in /tmp/tensorflow,
                # so that's where we want to mount our Azure file share.
                - name: azurefile
                  # The subPath allows us to mount a subdirectory within the Azure file share
                  # instead of its root. This is useful so that we can save the logs of each
                  # run in a different subdirectory instead of overwriting what was done before.
                  subPath: module5-ex2
                  mountPath: /tmp/tensorflow
          volumes:
            - name: azurefile
              azureFile:
                # We reference the secret we created earlier so that the account
                # name and key are passed securely and not directly in the template.
                secretName: azure-secret
                shareName: tensorflow
                readOnly: false
          restartPolicy: OnFailure
```

</details>

## Next Step

[7 - Distributed TensorFlow](../7-distributed-tensorflow)
@ -3,7 +3,7 @@

## Prerequisites

* [2 - Kubernetes Basics and cluster created](../2-kubernetes)
* [6 - TfJob](../6-tfjob)

## Summary
@ -371,4 +371,4 @@ There are two things to notice here:

## Next Step

[8 - Hyperparameters Sweep with Helm](../8-hyperparam-sweep)
@ -3,7 +3,7 @@

## Prerequisites

* [3 - Helm](../3-helm)
* [6 - TfJob Basics](../6-tfjob)

### "Vanilla" Hyperparameter Sweep
@ -170,4 +170,4 @@ tensorboard-85dfc74f8d-4gf24 1/1 Running 0 4h

## Next Step

[9 - Serving](../9-serving)
README.md
@ -27,12 +27,13 @@ git clone https://github.com/Azure/kubeflow-labs

|1| **[Docker](1-docker)** | Docker and containers 101.|
|2| **[Kubernetes](2-kubernetes)** | Kubernetes important concepts overview.|
|3| **[Helm](3-helm)** | Introduction to Helm |
|4| **[Kubeflow](4-kubeflow)** | Introduction to Kubeflow and how to deploy it in your cluster.|
|5| **[JupyterHub](5-jupyterhub)** | Learn how to run JupyterHub to create and manage Jupyter notebooks using Kubeflow |
|6| **[TFJob](6-tfjob)** | Introduction to `TFJob` and how to use it to deploy a simple TensorFlow training.|
|7| **[Distributed Tensorflow](7-distributed-tensorflow)** | Learn how to deploy and monitor distributed TensorFlow trainings with `TFJob`|
|8| **[Hyperparameters Sweep with Helm](8-hyperparam-sweep)** | Using Helm to deploy a large number of trainings testing different hypotheses, and TensorBoard to monitor and compare the results |
|9| **[Serving](9-serving)** | Using TensorFlow Serving to serve predictions |
|10| **[Going Further](10-going-further)** | Links and resources to go further: Autoscaling, Distributed Storage etc. |

# Contributing