Merge branch 'master' into merge-helm

This commit is contained in:
William Buchwalter 2018-05-01 09:16:21 +02:00
Parent 20af61da4c abe3a907b4
Commit 1619579896
18 changed files with 607 additions and 769 deletions

16
.gitignore vendored
View file

@@ -1 +1,15 @@
.DS_Store
# Binaries for programs and plugins
*.exe
*.dll
*.so
*.dylib
# Test binary, build with `go test -c`
*.test
# Output of the go coverage tool, specifically when used with LiteIDE
*.out
# Project-local glide cache, RE: https://github.com/Masterminds/glide/issues/736
.glide/
.DS_Store

View file

@@ -11,7 +11,7 @@ Here are a subset of pain points that exists in a typical ML workflow.
#### A Typical (Simplified) ML Workflow and its Pain Points
![Typical Workflow](workflow.png)
This workshop is going to focus on improving the training process by leveraging containers and Kubernetes.
This workshop is going to focus on improving the training and serving process by leveraging containers and Kubernetes.
Today many data scientists are training their models either on their physical workstation (be it a laptop or a desktop with multiple GPUs) or using a VM (sometimes, but rarely, a couple of them) in the cloud.

View file

@@ -69,25 +69,15 @@ We will be creating a deployment in the exercise toward the end of this module,
## Provisioning a Kubernetes cluster on Azure
There are multiple ways to provision a Kubernetes (K8s) cluster on Azure:
* ACS
* AKS
* acs-engine
We are going to use AKS to create a GPU-enabled Kubernetes cluster.
You could also use [acs-engine](https://github.com/Azure/acs-engine) if you prefer, but this guide assumes you are using AKS.
AKS is currently still in preview and acs-engine is a bit more complex to set up, so we advise you to create your cluster using ACS.
We are going to create a Linux-based K8s cluster.
You can either create the cluster using the portal, or using Azure-CLI (`az`).
### A Note on GPUs with Kubernetes
As of this writing, GPUs are still in preview with ACS.
You can deploy an ACS cluster with GPU VMs (such as `Standard_NC6`) in `westus2` or `uksouth` but you should be aware of some pitfalls:
* Deploying a GPU cluster takes longer than a CPU cluster (about 10-15 minutes more) because the NVIDIA drivers need to be installed as well.
* Since this is a preview, you might hit capacity issues if the location you chose does not have enough GPUs available to accommodate you.
As of this writing, GPUs are available for AKS in the `eastus` and `westeurope` regions. If you want more options, you can use acs-engine for more flexibility.
**Unless you are already pretty familiar with docker and Kubernetes, we recommend that you create a cluster with CPU VMs to save some time.**
Only module 3 has an exercise that is specific to GPU VMs; all other modules can be followed on either CPU or GPU clusters.
Only module 3 has an exercise that is specific to GPU VMs; all other modules can be followed on either CPU or GPU clusters, so if you are on a budget, feel free to create a CPU cluster instead.
### With the CLI
@@ -105,22 +95,23 @@ With:
#### Creating the cluster
```console
az acs create --agent-vm-size <AGENT_SIZE> --resource-group <RG> --name <NAME> \
  --orchestrator-type Kubernetes --agent-count <AGENT_COUNT> \
  --location <LOCATION> --generate-ssh-keys
az aks create --agent-vm-size <AGENT_SIZE> --resource-group <RG> --name <NAME> \
  --agent-count <AGENT_COUNT> --kubernetes-version 1.9.6 --location <LOCATION> --generate-ssh-keys
```
> Note: The Kubernetes version could change depending on where you are deploying your cluster. You can get more information by running the `az aks get-versions` command.
With:
| Parameter | Description |
| --- | --- |
| AGENT_SIZE | The size of K8s's agent VM. `Standard_D2_v2` is enough for this workshop. |
| AGENT_SIZE | The size of K8s's agent VM. Choose `Standard_NC6` for GPUs or `Standard_D2_v2` if you just want CPUs. |
| RG | Name of the resource group that was created in the previous step. |
| NAME | Name of the ACS resource (can be whatever you want). |
| AGENT_COUNT | The number of agents (virtual machines) that you want in your cluster. 2 or 3 is recommended to play with hyper-parameter tuning and distributed TensorFlow |
| LOCATION | Same location that was specified for the resource group creation. |
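If you script cluster creation, the parameters in the table above can be assembled programmatically. The following is a minimal sketch, not part of the workshop: it only builds the argument list for the `az aks create` command shown in this section and does not run it. The function names and the example values (`my-rg`, `my-cluster`) are hypothetical, and the flag names are copied from this guide, so they may differ in other versions of the Azure CLI.

```python
import shlex


def build_aks_create_command(agent_size, resource_group, name,
                             agent_count, location,
                             kubernetes_version="1.9.6"):
    """Return the `az aks create` invocation from this guide as an argv list."""
    if agent_count < 1:
        raise ValueError("agent_count must be at least 1")
    return [
        "az", "aks", "create",
        "--agent-vm-size", agent_size,
        "--resource-group", resource_group,
        "--name", name,
        "--agent-count", str(agent_count),
        "--kubernetes-version", kubernetes_version,
        "--location", location,
        "--generate-ssh-keys",
    ]


def to_shell(argv):
    """Quote each argument so the command can be pasted into a shell."""
    return " ".join(shlex.quote(arg) for arg in argv)


# Hypothetical values: a 3-node GPU cluster in one of the regions named above.
print(to_shell(build_aks_create_command(
    "Standard_NC6", "my-rg", "my-cluster", 3, "eastus")))
```

Keeping the command as a list makes it easy to pass to `subprocess.run` later without worrying about shell quoting.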
The command should take a few minutes to complete (longer if you chose GPU VMs). Once it is done, the output should be a JSON object indicating among other things the `provisioningState`:
The command should take a few minutes to complete. Once it is done, the output should be a JSON object indicating among other things the `provisioningState`:
```
{
[...]
@@ -135,7 +126,7 @@ The `kubeconfig` file is a configuration file that will allow Kubernetes's CLI (
To download the `kubeconfig` file from the cluster we just created, run:
```console
az acs kubernetes get-credentials --name <NAME> --resource-group <RG>
az aks get-credentials --name <NAME> --resource-group <RG>
```
Where `NAME` and `RG` should be the same values as for the cluster creation.
@@ -150,11 +141,10 @@ kubectl get nodes
Should yield an output similar to this one:
```
NAME STATUS AGE VERSION
k8s-agent-ef2b999d-0 Ready 9d v1.7.7
k8s-agent-ef2b999d-1 Ready 9d v1.7.7
k8s-agent-ef2b999d-2 Ready 9d v1.7.7
k8s-master-ef2b999d-0 Ready 9d v1.7.7
NAME STATUS ROLES AGE VERSION
aks-nodepool1-42640332-0 Ready agent 1h v1.9.6
aks-nodepool1-42640332-1 Ready agent 1h v1.9.6
aks-nodepool1-42640332-2 Ready agent 1h v1.9.6
```
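If you want to check the node list from a script rather than by eye, the table printed by `kubectl get nodes` can be parsed with a few lines of Python. This is only a sketch under the assumption that the output keeps the whitespace-separated column layout shown above; the helper names `parse_nodes` and `not_ready` are made up for this example.

```python
def parse_nodes(table_text):
    """Parse `kubectl get nodes` text output into one dict per node."""
    lines = [line for line in table_text.strip().splitlines() if line.strip()]
    header = lines[0].split()
    return [dict(zip(header, line.split())) for line in lines[1:]]


def not_ready(nodes):
    """Return the names of nodes whose STATUS is anything other than Ready."""
    return [node["NAME"] for node in nodes if node.get("STATUS") != "Ready"]


SAMPLE = """\
NAME                       STATUS    ROLES     AGE       VERSION
aks-nodepool1-42640332-0   Ready     agent     1h        v1.9.6
aks-nodepool1-42640332-1   Ready     agent     1h        v1.9.6
aks-nodepool1-42640332-2   Ready     agent     1h        v1.9.6
"""

nodes = parse_nodes(SAMPLE)
print(len(nodes), "nodes, not ready:", not_ready(nodes))
```

In a real script you would feed it the captured output of `kubectl get nodes` instead of the `SAMPLE` string.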
If you provisioned GPU VMs, describing one of the nodes should indicate the presence of GPU(s) on the node:
@@ -219,7 +209,7 @@ kubectl get job
Should show your new job:
```bash
NAME DESIRED SUCCESSFUL AGE
NAME DESIRED SUCCESSFUL AGE
module2-ex1 1 0 1m
```
@@ -267,7 +257,7 @@ kubectl get job
```
```bash
NAME DESIRED SUCCESSFUL AGE
NAME DESIRED SUCCESSFUL AGE
module2-ex1 1 1 3m
```
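The two outputs above differ only in the `SUCCESSFUL` column, which is what tells you the job has finished. As a rough sketch, assuming the `NAME DESIRED SUCCESSFUL AGE` column layout shown here (other kubectl versions may print different columns), a script could check for completion like this. The function name `job_succeeded` is hypothetical.

```python
def job_succeeded(table_text, job_name):
    """Return True if the job reports at least as many successes as desired."""
    rows = [line.split() for line in table_text.strip().splitlines() if line.strip()]
    header, body = rows[0], rows[1:]
    for row in body:
        record = dict(zip(header, row))
        if record["NAME"] == job_name:
            return int(record["SUCCESSFUL"]) >= int(record["DESIRED"])
    raise KeyError(job_name)


RUNNING = """\
NAME          DESIRED   SUCCESSFUL   AGE
module2-ex1   1         0            1m
"""

DONE = """\
NAME          DESIRED   SUCCESSFUL   AGE
module2-ex1   1         1            3m
"""

print(job_succeeded(RUNNING, "module2-ex1"), job_succeeded(DONE, "module2-ex1"))
```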

View file

@@ -1,299 +0,0 @@
# GPUs And Kubernetes
## Prerequisites
* [1 - Docker Basics](../1-docker)
* [2 - Kubernetes Basics and cluster created](../2-kubernetes)
## Summary
In this module you will learn how to:
* Create a Pod that is using GPU.
* Requesting a GPU
* Mounting the NVIDIA drivers into the container
## Important Note
If you created a cluster with CPU VMs only you won't be able to complete the exercises in this module, but it still contains valuable information that you should read through nonetheless.
## How GPU works with Kubernetes
GPU support in K8s is still in its early stages, and as such requires a bit of effort on your part to use.
While you don't need to do anything to access a CPU from inside your container (except specifying CPU request and limit optionally), getting access to the agent's GPU is a little bit more tricky:
* First, the drivers need to be installed on the agent, otherwise the agent will not report the presence of a GPU, and you won't be able to use it (this is already done for you in ACS/AKS/acs-engine).
* Then you need to explicitly ask for one or more GPUs to be mounted into your container, otherwise you will simply not be able to access the GPU, even if the container is running on a GPU agent.
* Finally, and most importantly, you need to mount the drivers from the agent VM into your container.
In Module 5, we will see how this process can be greatly simplified when using TensorFlow with `TFJob`, but for now, let's do it ourselves.
### Creating a container that can benefit from GPU
As a prerequisite for everything else, it is important to make sure that the container we are going to use actually knows what to do with a GPU.
For example, TensorFlow needs to be installed with GPU support. CUDA and cuDNN also need to be present.
Thankfully, most deep learning frameworks provide base images that are ready to use with GPU support, so we can use them as our base image.
For example, TensorFlow has a lot of different images ready to use [https://hub.docker.com/r/tensorflow/tensorflow/tags/](https://hub.docker.com/r/tensorflow/tensorflow/tags/) such as:
* `tensorflow/tensorflow:1.4.0-gpu-py3` for GPU
* `tensorflow/tensorflow:1.4.0-py3` for CPU only
CNTK also has pre-built images with or without GPU [https://hub.docker.com/r/microsoft/cntk/tags/](https://hub.docker.com/r/microsoft/cntk/tags/):
* `microsoft/cntk:2.2-gpu-python3.5-cuda8.0-cudnn6.0` for GPU
* `microsoft/cntk:2.2-python3.5` for CPU only
It is also important to note that most deep learning framework images are built on top of the official [nvidia/cuda](https://hub.docker.com/r/nvidia/cuda/) image, which already comes with CUDA and cuDNN preinstalled, so you don't need to worry about installing them.
### Requesting GPU(s)
K8s has a concept of resource `requests` and `limits` allowing you to specify how much CPU, RAM and GPU should be reserved for a specific container.
By default, if no `limits` is specified for CPU or RAM on a container, K8s will schedule it on any node and run the container with unbounded CPU and memory limits.
> *To know more on K8s `requests` and `limits`, see [Managing Compute Resources for Containers](https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/).*
However, things are different for GPUs. If no `limit` is defined for GPU, K8s will run the pod on any node (with or without GPU), and will not expose the GPU even if the node has one. So you need to explicitly set the `limit` to the exact number of GPUs that should be assigned to your container.
Also, note that while you can request a fraction of a CPU, you cannot request a fraction of a GPU. One GPU can thus only be assigned to one container at a time.
The name for the GPU resource in K8s is `alpha.kubernetes.io/nvidia-gpu` for versions `1.8` and below and `nvidia.com/gpu` for versions > `1.9`. Note that currently only NVIDIA GPUs are supported.
To set the `limit` for GPU, you should provide a value to `spec.containers[].resources.limits.alpha.kubernetes.io/nvidia-gpu`. In YAML this would look like:
```yaml
[...]
containers:
- name: tensorflow
  image: tensorflow/tensorflow:latest-gpu
  resources:
    limits:
      alpha.kubernetes.io/nvidia-gpu: 1
[...]
```
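If you generate your templates from code, the rules above (whole GPUs only, and a resource name that depends on the cluster version) can be captured in a small helper. This is a minimal sketch, not the workshop's tooling: the function names are hypothetical, and the caller is expected to pick whichever of the two resource names mentioned above matches their cluster. Since `kubectl` also accepts JSON manifests, the fragment is printed as JSON.

```python
import json

# Resource names taken from the text above; choose the one your cluster exposes.
LEGACY_GPU_RESOURCE = "alpha.kubernetes.io/nvidia-gpu"
DEVICE_PLUGIN_GPU_RESOURCE = "nvidia.com/gpu"


def gpu_limits(count, resource_name=LEGACY_GPU_RESOURCE):
    """Build the `resources` fragment for a container that needs `count` GPUs."""
    # GPUs cannot be requested fractionally, so only positive integers are valid.
    if isinstance(count, bool) or not isinstance(count, int) or count < 1:
        raise ValueError("GPU count must be a positive whole number")
    return {"resources": {"limits": {resource_name: count}}}


def container_spec(name, image, gpus, resource_name=LEGACY_GPU_RESOURCE):
    """Return a container entry equivalent to the YAML fragment above."""
    spec = {"name": name, "image": image}
    spec.update(gpu_limits(gpus, resource_name))
    return spec


print(json.dumps(
    container_spec("tensorflow", "tensorflow/tensorflow:latest-gpu", 1),
    indent=2))
```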
### Exposing the node's drivers into the container
Now for the tricky part.
As stated earlier, the NVIDIA drivers need to be exposed (mounted) from the node into the container. This is a bit tricky since the location of the drivers can vary depending on the operating system of the node, as well as on how the drivers were installed.
For ACS/AKS/acs-engine only Ubuntu nodes are supported so far, so it should be a consistent experience as long as your cluster was created with one of them.
##### Drivers locations on the node
| Path | Purpose |
|----|----|
|`/usr/lib/nvidia-384` | NVIDIA libraries |
|`/usr/lib/nvidia-384/bin`| NVIDIA binaries |
|`/usr/lib/x86_64-linux-gnu/libcuda.so.1` | CUDA Driver API library |
> Note that the NVIDIA driver's version is `384` at the time of this writing, but the driver's location will change as the version changes.
For each of the above paths we need to create a corresponding `Volume` and a `VolumeMount` to expose them into our container.
> To understand how to configure `Volumes` and `VolumeMounts` take a look at [Volumes](https://kubernetes.io/docs/user-guide/walkthrough/#volumes) on the Kubernetes documentation.
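Because the driver directory embeds the driver version, it can help to derive the `Volume` and `VolumeMount` entries from a single version value instead of hard-coding each path. The sketch below is an illustration only: the helper name `driver_mounts` is hypothetical, the host paths come from the table above, and the container-side paths (`/usr/local/nvidia/bin` and `/usr/local/nvidia/lib64`) are the ones used in this module's solutions.

```python
import json


def driver_mounts(driver_version="384", include_libcuda=True):
    """Return (volumes, volumeMounts) lists for the NVIDIA driver paths."""
    lib_dir = "/usr/lib/nvidia-" + driver_version
    libcuda = "/usr/lib/x86_64-linux-gnu/libcuda.so.1"
    # (volume name, path on the node, path inside the container)
    paths = [
        ("bin", lib_dir + "/bin", "/usr/local/nvidia/bin"),
        ("lib", lib_dir, "/usr/local/nvidia/lib64"),
    ]
    if include_libcuda:
        paths.append(("libcuda", libcuda, libcuda))
    volumes = [{"name": name, "hostPath": {"path": host}}
               for name, host, _target in paths]
    mounts = [{"name": name, "mountPath": target}
              for name, _host, target in paths]
    return volumes, mounts


volumes, mounts = driver_mounts()
print(json.dumps({"volumes": volumes, "volumeMounts": mounts}, indent=2))
```

When the driver is upgraded on the nodes, only the `driver_version` argument needs to change.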
## Exercises
### 1. NVIDIA-SMI
In this first exercise we are simply going to schedule a `Job` that will run `nvidia-smi`, printing details about our GPU from inside the container and exit.
You don't need to build a custom image, instead, simply use the official `nvidia/cuda` docker image.
Your K8s YAML template should have the following characteristics:
* It should be a `Job`
* It should be named `module4-ex1`
* It should request 1 GPU
* It should mount the drivers from the node into the container
* It should run the `nvidia-smi` executable
#### Useful Links
* [Microsoft Azure Container Service Engine - Using GPUs with Kubernetes](https://github.com/Azure/acs-engine/blob/master/docs/kubernetes/gpu.md)
#### Validation
Once you have created your Job with `kubectl create -f <template-path>`:
```console
kubectl get pods -a
```
The `-a` argument tells K8s to also report pods that are already completed. Since the container exits as soon as `nvidia-smi` finishes executing, it might already be completed by the time you execute the command.
```bash
NAME READY STATUS RESTARTS AGE
module4-ex1-p40vx 0/1 Completed 0 20s
```
Let's look at the logs of our pod:
```console
kubectl logs <pod-name>
```
```bash
Wed Nov 29 23:43:03 2017
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.98 Driver Version: 384.98 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla K80 Off | 0000E322:00:00.0 Off | 0 |
| N/A 39C P0 70W / 149W | 0MiB / 11439MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
```
We can see that `nvidia-smi` has successfully detected a Tesla K80 with drivers version `384.98`.
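To run the same check automatically, for example in a smoke test after provisioning, the pod logs can be scanned for the driver version and GPU name. The sketch below assumes the `nvidia-smi` table layout shown above; the function name `parse_nvidia_smi` and the shortened `SAMPLE_LOG` are made up for this example.

```python
import re


def parse_nvidia_smi(log_text):
    """Extract the driver version and GPU names from `nvidia-smi` text output."""
    version = re.search(r"Driver Version:\s*([\d.]+)", log_text)
    # A GPU row starts with its index, then the name, then the persistence flag.
    rows = re.findall(r"^\|\s+(\d+)\s+(.+?)\s+(?:On|Off)\s+\|",
                      log_text, flags=re.MULTILINE)
    return {
        "driver_version": version.group(1) if version else None,
        "gpus": [name for _index, name in rows],
    }


SAMPLE_LOG = """\
| NVIDIA-SMI 384.98                 Driver Version: 384.98                    |
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
|   0  Tesla K80           Off  | 0000E322:00:00.0 Off |                    0 |
| N/A   39C    P0    70W / 149W |      0MiB / 11439MiB |      0%      Default |
"""

print(parse_nvidia_smi(SAMPLE_LOG))
```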
#### Solution
<details>
<summary><strong>Solution (expand to see)</strong></summary>
<p>
```yaml
apiVersion: batch/v1
kind: Job # We want a Job
metadata:
  name: 4-nvidia-smi
spec:
  template:
    metadata:
      name: module4-ex1
    spec:
      restartPolicy: Never
      volumes: # Where the NVIDIA driver libraries and binaries are located on the host (note that libcuda is not needed to run nvidia-smi)
      - name: bin
        hostPath:
          path: /usr/lib/nvidia-384/bin
      - name: lib
        hostPath:
          path: /usr/lib/nvidia-384
      containers:
      - name: nvidia-smi
        image: nvidia/cuda # Which image to run
        command:
        - nvidia-smi
        resources:
          limits:
            alpha.kubernetes.io/nvidia-gpu: 1 # Requesting 1 GPU
        volumeMounts: # Where the NVIDIA driver libraries and binaries should be mounted inside our container
        - name: bin
          mountPath: /usr/local/nvidia/bin
        - name: lib
          mountPath: /usr/local/nvidia/lib64
```
</p>
</details>
### 2. Running TensorFlow with GPU
In modules 1 and 2, we first created a Docker image for our MNIST classifier and then ran a training on Kubernetes.
However, this training only used CPU. Let's make things much faster by accelerating our training with GPU.
You'll find the code and the `Dockerfile` under [`./src`](./src).
For this exercise, your tasks are to:
* Modify our `Dockerfile` to use a base image compatible with GPU, such as `tensorflow/tensorflow:1.4.0-gpu`
* Build and push this new image under a new tag, such as `${DOCKER_USERNAME}/tf-mnist:gpu`
* Modify the [template we built in module 2](2-kubernetes/training.yaml) to add a GPU `limit` and mount the drivers libraries.
* Deploy this new template.
### Validation
Once you have deployed your template, take a look at the logs of your pod:
```console
kubectl logs <pod-name>
```
You should see that your GPU is correctly detected and used by TensorFlow (`[...] Found device 0 with properties: name: Tesla K80 [...]`):
```bash
2017-11-30 00:59:54.053227: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
2017-11-30 01:00:03.274198: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Found device 0 with properties:
name: Tesla K80 major: 3 minor: 7 memoryClockRate(GHz): 0.8235
pciBusID: b2de:00:00.0
totalMemory: 11.17GiB freeMemory: 11.10GiB
2017-11-30 01:00:03.274238: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: Tesla K80, pci bus id: b2de:00:00.0, compute capability: 3.7)
2017-11-30 01:00:08.000884: I tensorflow/stream_executor/dso_loader.cc:139] successfully opened CUDA library libcupti.so.8.0 locally
Successfully downloaded train-images-idx3-ubyte.gz 9912422 bytes.
Extracting /tmp/tensorflow/input_data/train-images-idx3-ubyte.gz
Successfully downloaded train-labels-idx1-ubyte.gz 28881 bytes.
Extracting /tmp/tensorflow/input_data/train-labels-idx1-ubyte.gz
Successfully downloaded t10k-images-idx3-ubyte.gz 1648877 bytes.
Extracting /tmp/tensorflow/input_data/t10k-images-idx3-ubyte.gz
Successfully downloaded t10k-labels-idx1-ubyte.gz 4542 bytes.
Extracting /tmp/tensorflow/input_data/t10k-labels-idx1-ubyte.gz
Accuracy at step 0: 0.1245
Accuracy at step 10: 0.6664
Accuracy at step 20: 0.8227
Accuracy at step 30: 0.8657
Accuracy at step 40: 0.8815
Accuracy at step 50: 0.892
Accuracy at step 60: 0.9068
[...]
```
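To compare runs, or to confirm from a script that the GPU was picked up, the two interesting pieces of this log (the detected device and the accuracy curve) can be pulled out with regular expressions. This is a sketch that assumes the log line formats shown above; the names `summarize_training_log` and `SAMPLE_LOG` are hypothetical.

```python
import re

ACCURACY_RE = re.compile(r"^Accuracy at step (\d+): ([0-9.]+)$", re.MULTILINE)
DEVICE_RE = re.compile(r"Found device \d+ with properties:\s*name: (.+?) major:")


def summarize_training_log(log_text):
    """Return the detected GPU name (or None) and the (step, accuracy) pairs."""
    device = DEVICE_RE.search(log_text)
    accuracies = [(int(step), float(value))
                  for step, value in ACCURACY_RE.findall(log_text)]
    return {"gpu": device.group(1) if device else None,
            "accuracies": accuracies}


SAMPLE_LOG = """\
2017-11-30 01:00:03.274198: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Found device 0 with properties:
name: Tesla K80 major: 3 minor: 7 memoryClockRate(GHz): 0.8235
Accuracy at step 0: 0.1245
Accuracy at step 10: 0.6664
Accuracy at step 20: 0.8227
"""

summary = summarize_training_log(SAMPLE_LOG)
print(summary["gpu"], summary["accuracies"][-1])
```

A `gpu` value of `None` suggests the container fell back to CPU, which usually points at a missing GPU `limit` or missing driver mounts.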
### Solution
<details>
<summary><strong>Solution (expand to see)</strong></summary>
<p>
First we need to modify the `Dockerfile`.
We just need to change the tag of the TensorFlow base image to one that supports GPU:
```dockerfile
FROM tensorflow/tensorflow:1.4.0-gpu
COPY main.py /app/main.py
ENTRYPOINT ["python", "/app/main.py"]
```
Then we can create our Job template:
```yaml
apiVersion: batch/v1
kind: Job # Our training should be a Job since it is supposed to terminate at some point
metadata:
  name: module4-ex2 # Name of our job
spec:
  template: # Template of the Pod that is going to be run by the Job
    metadata:
      name: mnist-pod # Name of the pod
    spec:
      containers: # List of containers that should run inside the pod, in our case there is only one.
      - name: tensorflow
        image: wbuchwalter/tf-mnist:gpu # The image to run, you can replace by your own.
        args: ["--max_steps", "500"] # Optional arguments to pass to our command. By default the command is defined by ENTRYPOINT in the Dockerfile
        resources:
          limits:
            alpha.kubernetes.io/nvidia-gpu: 1
        volumeMounts: # Where the drivers should be mounted in the container
        - name: lib
          mountPath: /usr/local/nvidia/lib64
        - name: libcuda
          mountPath: /usr/lib/x86_64-linux-gnu/libcuda.so.1
      restartPolicy: OnFailure
      volumes: # Where the drivers are located on the node
      - name: lib
        hostPath:
          path: /usr/lib/nvidia-384
      - name: libcuda
        hostPath:
          path: /usr/lib/x86_64-linux-gnu/libcuda.so.1
```
And deploy it with:
```console
kubectl create -f <template-path>
```
</p>
</details>
## Next Step
[5 - TFJob](../5-tfjob/README.md)

View file

@@ -1,5 +0,0 @@
# Change this image to one that supports GPU
FROM tensorflow/tensorflow:1.4.0
COPY main.py /app/main.py
ENTRYPOINT ["python", "/app/main.py"]

View file

@@ -1,212 +0,0 @@
# Copyright 2015 The TensorFlow Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the 'License');
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an 'AS IS' BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
"""A simple MNIST classifier which displays summaries in TensorBoard.
This is an unimpressive MNIST model, but it is a good example of using
tf.name_scope to make a graph legible in the TensorBoard graph explorer, and of
naming summary tags so that they are grouped meaningfully in TensorBoard.
It demonstrates the functionality of every TensorBoard dashboard.
"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import argparse
import os
import sys
import tensorflow as tf
from tensorflow.examples.tutorials.mnist import input_data
FLAGS = None


def train():
  # Import data
  mnist = input_data.read_data_sets(FLAGS.data_dir,
                                    one_hot=True,
                                    fake_data=FLAGS.fake_data)
  # Create a multilayer model.
  # Input placeholders
  with tf.name_scope('input'):
    x = tf.placeholder(tf.float32, [None, 784], name='x-input')
    y_ = tf.placeholder(tf.float32, [None, 10], name='y-input')
  with tf.name_scope('input_reshape'):
    image_shaped_input = tf.reshape(x, [-1, 28, 28, 1])
    tf.summary.image('input', image_shaped_input, 10)

  # We can't initialize these variables to 0 - the network will get stuck.
  def weight_variable(shape):
    """Create a weight variable with appropriate initialization."""
    initial = tf.truncated_normal(shape, stddev=0.1)
    return tf.Variable(initial)

  def bias_variable(shape):
    """Create a bias variable with appropriate initialization."""
    initial = tf.constant(0.1, shape=shape)
    return tf.Variable(initial)

  def variable_summaries(var):
    """Attach a lot of summaries to a Tensor (for TensorBoard visualization)."""
    with tf.name_scope('summaries'):
      mean = tf.reduce_mean(var)
      tf.summary.scalar('mean', mean)
      with tf.name_scope('stddev'):
        stddev = tf.sqrt(tf.reduce_mean(tf.square(var - mean)))
      tf.summary.scalar('stddev', stddev)
      tf.summary.scalar('max', tf.reduce_max(var))
      tf.summary.scalar('min', tf.reduce_min(var))
      tf.summary.histogram('histogram', var)

  def nn_layer(input_tensor, input_dim, output_dim, layer_name, act=tf.nn.relu):
    """Reusable code for making a simple neural net layer.
    It does a matrix multiply, bias add, and then uses ReLU to nonlinearize.
    It also sets up name scoping so that the resultant graph is easy to read,
    and adds a number of summary ops.
    """
    # Adding a name scope ensures logical grouping of the layers in the graph.
    with tf.name_scope(layer_name):
      # This Variable will hold the state of the weights for the layer
      with tf.name_scope('weights'):
        weights = weight_variable([input_dim, output_dim])
        variable_summaries(weights)
      with tf.name_scope('biases'):
        biases = bias_variable([output_dim])
        variable_summaries(biases)
      with tf.name_scope('Wx_plus_b'):
        preactivate = tf.matmul(input_tensor, weights) + biases
        tf.summary.histogram('pre_activations', preactivate)
      activations = act(preactivate, name='activation')
      tf.summary.histogram('activations', activations)
      return activations

  hidden1 = nn_layer(x, 784, 500, 'layer1')
  with tf.name_scope('dropout'):
    keep_prob = tf.placeholder(tf.float32)
    tf.summary.scalar('dropout_keep_probability', keep_prob)
    dropped = tf.nn.dropout(hidden1, keep_prob)
  # Do not apply softmax activation yet, see below.
  y = nn_layer(dropped, 500, 10, 'layer2', act=tf.identity)
  with tf.name_scope('cross_entropy'):
    # The raw formulation of cross-entropy,
    #
    # tf.reduce_mean(-tf.reduce_sum(y_ * tf.log(tf.softmax(y)),
    #                               reduction_indices=[1]))
    #
    # can be numerically unstable.
    #
    # So here we use tf.nn.softmax_cross_entropy_with_logits on the
    # raw outputs of the nn_layer above, and then average across
    # the batch.
    diff = tf.nn.softmax_cross_entropy_with_logits(labels=y_, logits=y)
    with tf.name_scope('total'):
      cross_entropy = tf.reduce_mean(diff)
  tf.summary.scalar('cross_entropy', cross_entropy)
  with tf.name_scope('train'):
    train_step = tf.train.AdamOptimizer(FLAGS.learning_rate).minimize(
        cross_entropy)
  with tf.name_scope('accuracy'):
    with tf.name_scope('correct_prediction'):
      correct_prediction = tf.equal(tf.argmax(y, 1), tf.argmax(y_, 1))
    with tf.name_scope('accuracy'):
      accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))
  tf.summary.scalar('accuracy', accuracy)
  # Merge all the summaries and write them out to
  # /tmp/tensorflow/mnist/logs/mnist_with_summaries (by default)
  merged = tf.summary.merge_all()

  def feed_dict(train):
    """Make a TensorFlow feed_dict: maps data onto Tensor placeholders."""
    if train or FLAGS.fake_data:
      xs, ys = mnist.train.next_batch(100, fake_data=FLAGS.fake_data)
      k = FLAGS.dropout
    else:
      xs, ys = mnist.test.images, mnist.test.labels
      k = 1.0
    return {x: xs, y_: ys, keep_prob: k}

  sess = tf.InteractiveSession()
  train_writer = tf.summary.FileWriter(FLAGS.log_dir + '/train', sess.graph)
  test_writer = tf.summary.FileWriter(FLAGS.log_dir + '/test')
  tf.global_variables_initializer().run()
  # Train the model, and also write summaries.
  # Every 10th step, measure test-set accuracy, and write test summaries
  # All other steps, run train_step on training data, & add training summaries
  for i in range(FLAGS.max_steps):
    if i % 10 == 0:  # Record summaries and test-set accuracy
      summary, acc = sess.run([merged, accuracy], feed_dict=feed_dict(False))
      test_writer.add_summary(summary, i)
      print('Accuracy at step %s: %s' % (i, acc))
    else:  # Record train set summaries, and train
      if i % 100 == 99:  # Record execution stats
        run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
        run_metadata = tf.RunMetadata()
        summary, _ = sess.run([merged, train_step],
                              feed_dict=feed_dict(True),
                              options=run_options,
                              run_metadata=run_metadata)
        train_writer.add_run_metadata(run_metadata, 'step%03d' % i)
        train_writer.add_summary(summary, i)
        print('Adding run metadata for', i)
      else:  # Record a summary
        summary, _ = sess.run([merged, train_step], feed_dict=feed_dict(True))
        train_writer.add_summary(summary, i)
  train_writer.close()
  test_writer.close()


def main(_):
  train()


if __name__ == '__main__':
  parser = argparse.ArgumentParser()
  parser.add_argument('--fake_data', nargs='?', const=True, type=bool,
                      default=False,
                      help='If true, uses fake data for unit testing.')
  parser.add_argument('--max_steps', type=int, default=1000,
                      help='Number of steps to run trainer.')
  parser.add_argument('--learning_rate', type=float, default=0.001,
                      help='Initial learning rate')
  parser.add_argument('--dropout', type=float, default=0.9,
                      help='Keep probability for training dropout.')
  parser.add_argument(
      '--data_dir',
      type=str,
      default=os.path.join(os.getenv('TEST_TMPDIR', '/tmp'),
                           'tensorflow/input_data'),
      help='Directory for storing input data')
  parser.add_argument(
      '--log_dir',
      type=str,
      default=os.path.join(os.getenv('TEST_TMPDIR', '/tmp'),
                           'tensorflow/logs'),
      help='Summaries log directory')
  FLAGS, unparsed = parser.parse_known_args()
  tf.app.run(main=main, argv=[sys.argv[0]] + unparsed)

View file

View file

Image file (before: 26 KiB, after: 26 KiB; width and height not shown)

View file

Image file (before: 112 KiB, after: 112 KiB; width and height not shown)

97
5-jupyterhub/README.md Normal file
View file

@@ -0,0 +1,97 @@
# Jupyter Notebooks on Kubernetes
## Prerequisites
* [1 - Docker Basics](../1-docker)
* [2 - Kubernetes Basics and cluster created](../2-kubernetes)
* [4 - Kubeflow and tfjob Basics](../4-kubeflow-tfjob)
## Summary
In this module, you will learn how to:
* Run Jupyter Notebooks locally using Docker
* Run JupyterHub on Kubernetes using Kubeflow
## How Jupyter Notebooks work
The [Jupyter Notebook](http://jupyter.org/) is an open source web application that allows users to create and share documents that contain live code, equations, visualizations, and narrative text for rapid prototyping. It is often used for data cleaning and transformation, numerical simulation, statistical modeling, data visualization, machine learning, and more. To better support exploratory iteration and to accelerate computation of TensorFlow jobs, let's look at how we can include data science tools like Jupyter Notebook with Docker and Kubernetes.
## How JupyterHub works
[JupyterHub](https://jupyterhub.readthedocs.io/en/latest/) is a multi-user hub that spawns, manages, and proxies multiple instances of the single-user Jupyter notebook server. JupyterHub can be used to serve notebooks to a class of students, a corporate data science group, or a scientific research group. Let's look at how we can create JupyterHub to spawn multiple instances of Jupyter Notebook on Kubernetes using Kubeflow.
## Exercises
### Exercise 1: Run Jupyter Notebooks locally using Docker
In this first exercise, we will run Jupyter Notebooks locally using Docker. We will use the official tensorflow docker image as it comes with Jupyter notebook.
```console
docker run -it -p 8888:8888 tensorflow/tensorflow
```
#### Validation
To verify, browse to the url in the output log.
For example: `http://localhost:8888/?token=a3ea3cd914c5b68149e2b4a6d0220eca186fec41563c0413`
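If you start the container from a script and need the login token on its own (for instance to call the notebook server from another tool), it can be read from the query string of that URL. A small sketch using only the standard library follows; the function name `notebook_token` is made up for this example.

```python
from urllib.parse import parse_qs, urlparse


def notebook_token(url):
    """Return the `token` query parameter of a Jupyter URL, or None if absent."""
    values = parse_qs(urlparse(url).query).get("token")
    return values[0] if values else None


url = "http://localhost:8888/?token=a3ea3cd914c5b68149e2b4a6d0220eca186fec41563c0413"
print(notebook_token(url))
```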
### Exercise 2: Run JupyterHub on Kubernetes using Kubeflow
In this exercise, we will run JupyterHub to spawn multiple instances of Jupyter Notebooks on a Kubernetes cluster using Kubeflow.
As a prerequisite, you should already have a Kubernetes cluster running (you can follow [module 2 - Kubernetes](../2-kubernetes) to create your own cluster) and Kubeflow running in that cluster (you can follow [module 4 - Kubeflow and tfjob Basics](../4-kubeflow-tfjob)).
In module 4, you installed the kubeflow-core component, which already includes JupyterHub and a corresponding load balancer service of type `ClusterIP`. To check its status, run the following kubectl command.
```
kubectl get svc -n=${NAMESPACE}
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
...
tf-hub-0 ClusterIP None <none> 8000/TCP 1m
tf-hub-lb ClusterIP 10.0.40.191 <none> 80/TCP 1m
```
To connect to your JupyterHub locally:
```
PODNAME=`kubectl get pods --namespace=${NAMESPACE} --selector="app=tf-hub" --output=template --template="{{with index .items 0}}{{.metadata.name}}{{end}}"`
kubectl port-forward --namespace=${NAMESPACE} $PODNAME 8000:8000
```
[Optional] To connect to your JupyterHub over a public IP:
To update the default service created for JupyterHub, run the following command to change the service to type LoadBalancer:
```
ks param set kubeflow-core jupyterHubServiceType LoadBalancer
ks apply ${YOUR_KF_ENV}
```
Create a new Jupyter Notebook instance:
- open http://127.0.0.1:8000 in your browser
- log in using any username and password
- click the "Start My Server" button to spawn a new Jupyter notebook
- from the image dropdown, select a tensorflow image for your notebook
- for CPU and memory, enter values based on your resource requirements, for example: 1 CPU and 2Gi
- to get available GPUs in your cluster, run the following command:
```
kubectl get nodes "-o=custom-columns=NAME:.metadata.name,GPU:.status.allocatable.alpha\.kubernetes\.io\/nvidia-gpu"
```
- for GPU, enter values in json format `{"alpha.kubernetes.io/nvidia-gpu":"1"}`
- click the "Spawn" button
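The GPU field in the steps above expects a compact JSON object whose value is a string. If you prepare that value from a script, a helper along these lines avoids quoting mistakes. It is only a sketch: the function name `extra_resource_limits` is hypothetical, and the default resource name is the one used in this module.

```python
import json


def extra_resource_limits(gpu_count, resource_name="alpha.kubernetes.io/nvidia-gpu"):
    """Return the JSON text to paste into the spawner's GPU field."""
    if isinstance(gpu_count, bool) or not isinstance(gpu_count, int) or gpu_count < 1:
        raise ValueError("gpu_count must be a positive whole number")
    # The form shown above uses a string value and no spaces between tokens.
    return json.dumps({resource_name: str(gpu_count)}, separators=(",", ":"))


print(extra_resource_limits(1))
```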
![jupyterhub](./jupyterhub.png)
The images are quite large. This process can take a long time.
#### Validation
You can check the status of the pod by running:
```
kubectl -n ${NAMESPACE} describe pods jupyter-${USERNAME}
```
After the pod status changes to `Running`, you should see a new Jupyter notebook running at: http://127.0.0.1:8000/user/{USERNAME}/tree or http://{PUBLIC-IP}/user/{USERNAME}/tree

Binary
5-jupyterhub/jupyterhub.png Normal file

Binary file not shown. Size after: 90 KiB

30
8-serving/README.md Normal file
View file

@@ -0,0 +1,30 @@
# TensorFlow Serving
## Introduction
TensorFlow Serving is a flexible, high-performance serving system for machine learning models, designed for production environments. TensorFlow Serving makes it easy to deploy new algorithms and experiments, while keeping the same server architecture and APIs. TensorFlow Serving provides out-of-the-box integration with TensorFlow models, but can be easily extended to serve other types of models and data.
## Getting started
## Installation
```commandline
ks init my-model-server
cd my-model-server
ks registry add kubeflow github.com/kubeflow/kubeflow/tree/master/kubeflow
ks pkg install kubeflow/tf-serving
ks env add cloud
ks env set cloud --namespace ${NAMESPACE}
MODEL_COMPONENT=serveInception
MODEL_NAME=inception
# Replace this with the URL of your bucket if you are using your own model.
MODEL_PATH=gs://kubeflow-models/inception
# Only needed if you want to use a custom model server image.
MODEL_SERVER_IMAGE=gcr.io/$(gcloud config get-value project)/model-server:1.0
ks generate tf-serving ${MODEL_COMPONENT} --name=${MODEL_NAME}
ks param set --env=cloud ${MODEL_COMPONENT} modelPath ${MODEL_PATH}
# Optional: use your custom image.
ks param set --env=cloud ${MODEL_COMPONENT} modelServerImage ${MODEL_SERVER_IMAGE}
# Optional: also deploy the HTTP endpoint.
ks param set --env=cloud ${MODEL_COMPONENT} deployHttpProxy true
```
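The commands above only generate and parameterize the component; nothing is created in the cluster until the component is applied. A minimal sketch of that last step, assuming the `cloud` environment and the `MODEL_COMPONENT` variable defined above (the `deploy_model_server` wrapper is a hypothetical helper, not part of ksonnet):

```shell
# Apply the generated tf-serving component to the "cloud" environment.
# Fails early if MODEL_COMPONENT was not set in the current shell.
deploy_model_server() {
  : "${MODEL_COMPONENT:?set MODEL_COMPONENT before deploying}"
  ks apply cloud -c "${MODEL_COMPONENT}"
}
```

Run `deploy_model_server` from inside the `my-model-server` directory so that `ks` can find the application.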


@ -1,197 +0,0 @@
# Jupyter Notebooks on Kubernetes
## Prerequisites
* [1 - Docker Basics](../1-docker)
* [2 - Kubernetes Basics and cluster created](../2-kubernetes)
## Summary
In this module, you will learn how to:
* Run Jupyter Notebooks locally using Docker
* Run Jupyter Notebooks on Kubernetes
## How Jupyter Notebooks work
The [Jupyter Notebook](http://jupyter.org/) is an open source web application that allows users to create and share documents that contain live code, equations, visualizations, and narrative text for rapid prototyping. It is often used for data cleaning and transformation, numerical simulation, statistical modeling, data visualization, machine learning, and more. To better support exploratory iteration and to accelerate computation of TensorFlow jobs, let's look at how we can run data science tools like Jupyter Notebook with Docker and Kubernetes.
## Exercises
### Exercise 1: Run Jupyter Notebooks locally using Docker
In this first exercise, we will run Jupyter Notebooks locally using Docker. We will use the official TensorFlow Docker image, as it comes with Jupyter Notebook.
```console
docker run -it -p 8888:8888 tensorflow/tensorflow
```
#### Validation
To verify, browse to the URL printed in the output log.
For example: `http://localhost:8888/?token=a3ea3cd914c5b68149e2b4a6d0220eca186fec41563c0413`
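If the container runs in the background, the tokenized URL can be pulled out of its log rather than copied by hand. A small sketch, assuming the log contains a line with `http://localhost:8888/?token=...` as in the example above; `extract_jupyter_url` is an illustrative helper name, not something the image provides:

```shell
# Read log text on stdin and print the first tokenized Jupyter URL found.
extract_jupyter_url() {
  grep -o 'http://localhost:8888/?token=[0-9a-f]*' | head -n 1
}
```

It can be fed from the container log, for example `docker logs <container-id> 2>&1 | extract_jupyter_url`.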
### Exercise 2: Run Jupyter Notebooks on Kubernetes
In this exercise, we will run Jupyter Notebooks on a Kubernetes cluster.
As a prerequisite, you should already have a Kubernetes cluster running. If you do not, follow [module 2 - Kubernetes](../2-kubernetes) to create one.
Similar to running Jupyter Notebooks locally using Docker, we can again use the official TensorFlow Docker image, as it comes with Jupyter Notebook. On a cluster, however, we can run many instances of Jupyter Notebook to handle additional load.
To run Jupyter Notebook using Kubernetes, you need to:
* Create a Pod using the TensorFlow image
* Expose port 8888 to run Jupyter Notebook
* [With GPU] Mount the NVIDIA libraries from the host VM to a custom directory in the container
* Create a Service to expose Jupyter Notebook
#### Solution for Exercise 2
Create a YAML file like the one below.
<details>
<summary><strong>Solution for CPU only (expand to see)</strong></summary>
<p>
```yaml
apiVersion: v1
kind: Service
metadata:
  labels:
    app: jupyter-server
  name: jupyter-server
spec:
  ports:
  - port: 8888
    targetPort: 8888
  selector:
    app: jupyter-server
  type: LoadBalancer
---
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: jupyter-server
spec:
  replicas: 1
  template:
    metadata:
      labels:
        app: jupyter-server
    spec:
      containers:
      - args:
        image: tensorflow/tensorflow
        name: jupyter-server
        ports:
        - containerPort: 8888
```
</p>
</details>
<details>
<summary><strong>Solution with GPU (expand to see)</strong></summary>
<p>
```yaml
apiVersion: v1
kind: Service
metadata:
  labels:
    app: jupyter-server
  name: jupyter-server
spec:
  ports:
  - port: 8888
    targetPort: 8888
  selector:
    app: jupyter-server
  type: LoadBalancer
---
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: jupyter-server
spec:
  replicas: 1
  template:
    metadata:
      labels:
        app: jupyter-server
    spec:
      containers:
      - name: jupyter-server
        image: tensorflow/tensorflow:latest-gpu
        ports:
        - containerPort: 8888
        imagePullPolicy: IfNotPresent
        env:
        - name: LD_LIBRARY_PATH
          value: /usr/lib/nvidia:/usr/lib/x86_64-linux-gnu
        resources:
          requests:
            alpha.kubernetes.io/nvidia-gpu: 1
        volumeMounts:
        - mountPath: /usr/local/nvidia/bin
          name: bin
        - mountPath: /usr/lib/nvidia
          name: lib
        - mountPath: /usr/lib/x86_64-linux-gnu/libcuda.so.1
          name: libcuda
      volumes:
      - name: bin
        hostPath:
          path: /usr/lib/nvidia-384/bin
      - name: lib
        hostPath:
          path: /usr/lib/nvidia-384
      - name: libcuda
        hostPath:
          path: /usr/lib/x86_64-linux-gnu/libcuda.so.1
```
</p>
</details>
Save the yaml file, then deploy it to your Kubernetes cluster by running:
```console
kubectl create -f <template-path>
```
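Creating the resources returns immediately, before the pod is ready. One way to block until the Deployment is up is sketched below; it assumes the manifest uses the `jupyter-server` Deployment name from the solutions above, and both the `deploy_jupyter` helper and the `jupyter.yaml` file name in the usage note are hypothetical.

```shell
# Create the resources from a manifest file, then block until the
# jupyter-server Deployment reports that its rollout has finished.
deploy_jupyter() {
  manifest="$1"
  kubectl create -f "${manifest}" &&
    kubectl rollout status deployment/jupyter-server
}
```

For example: `deploy_jupyter jupyter.yaml`.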
#### Validation
After the deployment is created, a pod running TensorFlow will be created, along with a new service for the Jupyter notebook. The new service will acquire an external IP to serve Jupyter Notebook on port 8888. This may take a few minutes to complete.
To verify, view the pod's output log to get the URL and the token for the hosted Jupyter notebook:
```console
kubectl logs jupyter-server-xxxxx
# sample output
http://localhost:8888/?token=2e7c875bd4e72137911d33e209c91d01f7a7b44868cf664d
```
Next, to get the public IP of the new service created for Jupyter Notebook, run:
```console
kubectl get svc jupyter-server -o jsonpath='{.status.loadBalancer.ingress[0].ip}'
xx.xx.xx.xx
```
From a browser, navigate to the Jupyter notebook with the following URL, replacing `<PUBLICIP>` with the output from the previous step:
```
http://<PUBLICIP>:8888/?token=2e7c875bd4e72137911d33e209c91d01f7a7b44868cf664d
```
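The two lookups above can be combined so the final URL does not have to be assembled by hand. This is a sketch under the assumption that the pod log contains the `http://localhost:8888/?token=...` line shown earlier; `public_notebook_url` is a hypothetical helper, and the pod name must still be filled in.

```shell
# Rewrite the localhost URL found in the pod log (read from stdin)
# so that it points at the service's public IP (first argument).
#
# Usage:
#   kubectl logs jupyter-server-xxxxx | public_notebook_url "$(kubectl get svc \
#     jupyter-server -o jsonpath='{.status.loadBalancer.ingress[0].ip}')"
public_notebook_url() {
  public_ip="$1"
  grep -o 'http://localhost:8888/?token=[0-9a-f]*' \
    | head -n 1 \
    | sed "s#//localhost:#//${public_ip}:#"
}
```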

404
LICENSE

@ -1,21 +1,391 @@
MIT License
Creative Commons Corporation ("Creative Commons") is not a law firm and
does not provide legal services or legal advice. Distribution of
Creative Commons public licenses does not create a lawyer-client or
other relationship. Creative Commons makes its licenses and related
information available on an "as-is" basis. Creative Commons gives no
warranties regarding its licenses, any material licensed under their
terms and conditions, or any related information. Creative Commons
disclaims all liability for damages resulting from their use to the
fullest extent possible.
Copyright (c) 2017 Microsoft
Using Creative Commons Public Licenses
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
Creative Commons public licenses provide a standard set of terms and
conditions that creators and other rights holders may use to share
original works of authorship and other material subject to copyright
and certain other rights specified in the public license below. The
following considerations are for informational purposes only, are not
exhaustive, and do not form part of our licenses.
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
Considerations for licensors: Our public licenses are
intended for use by those authorized to give the public
permission to use material in ways otherwise restricted by
copyright and certain other rights. Our licenses are
irrevocable. Licensors should read and understand the terms
and conditions of the license they choose before applying it.
Licensors should also secure all rights necessary before
applying our licenses so that the public can reuse the
material as expected. Licensors should clearly mark any
material not subject to the license. This includes other CC-
licensed material, or material used under an exception or
limitation to copyright. More considerations for licensors:
wiki.creativecommons.org/Considerations_for_licensors
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
Considerations for the public: By using one of our public
licenses, a licensor grants the public permission to use the
licensed material under specified terms and conditions. If
the licensor's permission is not necessary for any reason--for
example, because of any applicable exception or limitation to
copyright--then that use is not regulated by the license. Our
licenses grant only permissions under copyright and certain
other rights that a licensor has authority to grant. Use of
the licensed material may still be restricted for other
reasons, including because others have copyright or other
rights in the material. A licensor may make special requests,
such as asking that all changes be marked or described.
Although not required by our licenses, you are encouraged to
respect those requests where reasonable. More_considerations
for the public:
wiki.creativecommons.org/Considerations_for_licensees
=======================================================================
Creative Commons Attribution 4.0 International Public License
By exercising the Licensed Rights (defined below), You accept and agree
to be bound by the terms and conditions of this Creative Commons
Attribution 4.0 International Public License ("Public License"). To the
extent this Public License may be interpreted as a contract, You are
granted the Licensed Rights in consideration of Your acceptance of
these terms and conditions, and the Licensor grants You such rights in
consideration of benefits the Licensor receives from making the
Licensed Material available under these terms and conditions.
Section 1 -- Definitions.
a. Adapted Material means material subject to Copyright and Similar
Rights that is derived from or based upon the Licensed Material
and in which the Licensed Material is translated, altered,
arranged, transformed, or otherwise modified in a manner requiring
permission under the Copyright and Similar Rights held by the
Licensor. For purposes of this Public License, where the Licensed
Material is a musical work, performance, or sound recording,
Adapted Material is always produced where the Licensed Material is
synched in timed relation with a moving image.
b. Adapter's License means the license You apply to Your Copyright
and Similar Rights in Your contributions to Adapted Material in
accordance with the terms and conditions of this Public License.
c. Copyright and Similar Rights means copyright and/or similar rights
closely related to copyright including, without limitation,
performance, broadcast, sound recording, and Sui Generis Database
Rights, without regard to how the rights are labeled or
categorized. For purposes of this Public License, the rights
specified in Section 2(b)(1)-(2) are not Copyright and Similar
Rights.
d. Effective Technological Measures means those measures that, in the
absence of proper authority, may not be circumvented under laws
fulfilling obligations under Article 11 of the WIPO Copyright
Treaty adopted on December 20, 1996, and/or similar international
agreements.
e. Exceptions and Limitations means fair use, fair dealing, and/or
any other exception or limitation to Copyright and Similar Rights
that applies to Your use of the Licensed Material.
f. Licensed Material means the artistic or literary work, database,
or other material to which the Licensor applied this Public
License.
g. Licensed Rights means the rights granted to You subject to the
terms and conditions of this Public License, which are limited to
all Copyright and Similar Rights that apply to Your use of the
Licensed Material and that the Licensor has authority to license.
h. Licensor means the individual(s) or entity(ies) granting rights
under this Public License.
i. Share means to provide material to the public by any means or
process that requires permission under the Licensed Rights, such
as reproduction, public display, public performance, distribution,
dissemination, communication, or importation, and to make material
available to the public including in ways that members of the
public may access the material from a place and at a time
individually chosen by them.
j. Sui Generis Database Rights means rights other than copyright
resulting from Directive 96/9/EC of the European Parliament and of
the Council of 11 March 1996 on the legal protection of databases,
as amended and/or succeeded, as well as other essentially
equivalent rights anywhere in the world.
k. You means the individual or entity exercising the Licensed Rights
under this Public License. Your has a corresponding meaning.
Section 2 -- Scope.
a. License grant.
1. Subject to the terms and conditions of this Public License,
the Licensor hereby grants You a worldwide, royalty-free,
non-sublicensable, non-exclusive, irrevocable license to
exercise the Licensed Rights in the Licensed Material to:
a. reproduce and Share the Licensed Material, in whole or
in part; and
b. produce, reproduce, and Share Adapted Material.
2. Exceptions and Limitations. For the avoidance of doubt, where
Exceptions and Limitations apply to Your use, this Public
License does not apply, and You do not need to comply with
its terms and conditions.
3. Term. The term of this Public License is specified in Section
6(a).
4. Media and formats; technical modifications allowed. The
Licensor authorizes You to exercise the Licensed Rights in
all media and formats whether now known or hereafter created,
and to make technical modifications necessary to do so. The
Licensor waives and/or agrees not to assert any right or
authority to forbid You from making technical modifications
necessary to exercise the Licensed Rights, including
technical modifications necessary to circumvent Effective
Technological Measures. For purposes of this Public License,
simply making modifications authorized by this Section 2(a)
(4) never produces Adapted Material.
5. Downstream recipients.
a. Offer from the Licensor -- Licensed Material. Every
recipient of the Licensed Material automatically
receives an offer from the Licensor to exercise the
Licensed Rights under the terms and conditions of this
Public License.
b. No downstream restrictions. You may not offer or impose
any additional or different terms or conditions on, or
apply any Effective Technological Measures to, the
Licensed Material if doing so restricts exercise of the
Licensed Rights by any recipient of the Licensed
Material.
6. No endorsement. Nothing in this Public License constitutes or
may be construed as permission to assert or imply that You
are, or that Your use of the Licensed Material is, connected
with, or sponsored, endorsed, or granted official status by,
the Licensor or others designated to receive attribution as
provided in Section 3(a)(1)(A)(i).
b. Other rights.
1. Moral rights, such as the right of integrity, are not
licensed under this Public License, nor are publicity,
privacy, and/or other similar personality rights; however, to
the extent possible, the Licensor waives and/or agrees not to
assert any such rights held by the Licensor to the limited
extent necessary to allow You to exercise the Licensed
Rights, but not otherwise.
2. Patent and trademark rights are not licensed under this
Public License.
3. To the extent possible, the Licensor waives any right to
collect royalties from You for the exercise of the Licensed
Rights, whether directly or through a collecting society
under any voluntary or waivable statutory or compulsory
licensing scheme. In all other cases the Licensor expressly
reserves any right to collect such royalties.
Section 3 -- License Conditions.
Your exercise of the Licensed Rights is expressly made subject to the
following conditions.
a. Attribution.
1. If You Share the Licensed Material (including in modified
form), You must:
a. retain the following if it is supplied by the Licensor
with the Licensed Material:
i. identification of the creator(s) of the Licensed
Material and any others designated to receive
attribution, in any reasonable manner requested by
the Licensor (including by pseudonym if
designated);
ii. a copyright notice;
iii. a notice that refers to this Public License;
iv. a notice that refers to the disclaimer of
warranties;
v. a URI or hyperlink to the Licensed Material to the
extent reasonably practicable;
b. indicate if You modified the Licensed Material and
retain an indication of any previous modifications; and
c. indicate the Licensed Material is licensed under this
Public License, and include the text of, or the URI or
hyperlink to, this Public License.
2. You may satisfy the conditions in Section 3(a)(1) in any
reasonable manner based on the medium, means, and context in
which You Share the Licensed Material. For example, it may be
reasonable to satisfy the conditions by providing a URI or
hyperlink to a resource that includes the required
information.
3. If requested by the Licensor, You must remove any of the
information required by Section 3(a)(1)(A) to the extent
reasonably practicable.
4. If You Share Adapted Material You produce, the Adapter's
License You apply must not prevent recipients of the Adapted
Material from complying with this Public License.
Section 4 -- Sui Generis Database Rights.
Where the Licensed Rights include Sui Generis Database Rights that
apply to Your use of the Licensed Material:
a. for the avoidance of doubt, Section 2(a)(1) grants You the right
to extract, reuse, reproduce, and Share all or a substantial
portion of the contents of the database;
b. if You include all or a substantial portion of the database
contents in a database in which You have Sui Generis Database
Rights, then the database in which You have Sui Generis Database
Rights (but not its individual contents) is Adapted Material; and
c. You must comply with the conditions in Section 3(a) if You Share
all or a substantial portion of the contents of the database.
For the avoidance of doubt, this Section 4 supplements and does not
replace Your obligations under this Public License where the Licensed
Rights include other Copyright and Similar Rights.
Section 5 -- Disclaimer of Warranties and Limitation of Liability.
a. UNLESS OTHERWISE SEPARATELY UNDERTAKEN BY THE LICENSOR, TO THE
EXTENT POSSIBLE, THE LICENSOR OFFERS THE LICENSED MATERIAL AS-IS
AND AS-AVAILABLE, AND MAKES NO REPRESENTATIONS OR WARRANTIES OF
ANY KIND CONCERNING THE LICENSED MATERIAL, WHETHER EXPRESS,
IMPLIED, STATUTORY, OR OTHER. THIS INCLUDES, WITHOUT LIMITATION,
WARRANTIES OF TITLE, MERCHANTABILITY, FITNESS FOR A PARTICULAR
PURPOSE, NON-INFRINGEMENT, ABSENCE OF LATENT OR OTHER DEFECTS,
ACCURACY, OR THE PRESENCE OR ABSENCE OF ERRORS, WHETHER OR NOT
KNOWN OR DISCOVERABLE. WHERE DISCLAIMERS OF WARRANTIES ARE NOT
ALLOWED IN FULL OR IN PART, THIS DISCLAIMER MAY NOT APPLY TO YOU.
b. TO THE EXTENT POSSIBLE, IN NO EVENT WILL THE LICENSOR BE LIABLE
TO YOU ON ANY LEGAL THEORY (INCLUDING, WITHOUT LIMITATION,
NEGLIGENCE) OR OTHERWISE FOR ANY DIRECT, SPECIAL, INDIRECT,
INCIDENTAL, CONSEQUENTIAL, PUNITIVE, EXEMPLARY, OR OTHER LOSSES,
COSTS, EXPENSES, OR DAMAGES ARISING OUT OF THIS PUBLIC LICENSE OR
USE OF THE LICENSED MATERIAL, EVEN IF THE LICENSOR HAS BEEN
ADVISED OF THE POSSIBILITY OF SUCH LOSSES, COSTS, EXPENSES, OR
DAMAGES. WHERE A LIMITATION OF LIABILITY IS NOT ALLOWED IN FULL OR
IN PART, THIS LIMITATION MAY NOT APPLY TO YOU.
c. The disclaimer of warranties and limitation of liability provided
above shall be interpreted in a manner that, to the extent
possible, most closely approximates an absolute disclaimer and
waiver of all liability.
Section 6 -- Term and Termination.
a. This Public License applies for the term of the Copyright and
Similar Rights licensed here. However, if You fail to comply with
this Public License, then Your rights under this Public License
terminate automatically.
b. Where Your right to use the Licensed Material has terminated under
Section 6(a), it reinstates:
1. automatically as of the date the violation is cured, provided
it is cured within 30 days of Your discovery of the
violation; or
2. upon express reinstatement by the Licensor.
For the avoidance of doubt, this Section 6(b) does not affect any
right the Licensor may have to seek remedies for Your violations
of this Public License.
c. For the avoidance of doubt, the Licensor may also offer the
Licensed Material under separate terms or conditions or stop
distributing the Licensed Material at any time; however, doing so
will not terminate this Public License.
d. Sections 1, 5, 6, 7, and 8 survive termination of this Public
License.
Section 7 -- Other Terms and Conditions.
a. The Licensor shall not be bound by any additional or different
terms or conditions communicated by You unless expressly agreed.
b. Any arrangements, understandings, or agreements regarding the
Licensed Material not stated herein are separate from and
independent of the terms and conditions of this Public License.
Section 8 -- Interpretation.
a. For the avoidance of doubt, this Public License does not, and
shall not be interpreted to, reduce, limit, restrict, or impose
conditions on any use of the Licensed Material that could lawfully
be made without permission under this Public License.
b. To the extent possible, if any provision of this Public License is
deemed unenforceable, it shall be automatically reformed to the
minimum extent necessary to make it enforceable. If the provision
cannot be reformed, it shall be severed from this Public License
without affecting the enforceability of the remaining terms and
conditions.
c. No term or condition of this Public License will be waived and no
failure to comply consented to unless expressly agreed to by the
Licensor.
d. Nothing in this Public License constitutes or may be interpreted
as a limitation upon, or waiver of, any privileges and immunities
that apply to the Licensor or You, including from the legal
processes of any jurisdiction or authority.
=======================================================================
Creative Commons is not a party to its public
licenses. Notwithstanding, Creative Commons may elect to apply one of
its public licenses to material it publishes and in those instances
will be considered the “Licensor.” The text of the Creative Commons
public licenses is dedicated to the public domain under the CC0 Public
Domain Dedication. Except for the limited purpose of indicating that
material is shared under a Creative Commons public license or as
otherwise permitted by the Creative Commons policies published at
creativecommons.org/policies, Creative Commons does not authorize the
use of the trademark "Creative Commons" or any other trademark or logo
of Creative Commons without its prior written consent including,
without limitation, in connection with any unauthorized modifications
to any of its public licenses or any other arrangements,
understandings, or agreements concerning use of licensed material. For
the avoidance of doubt, this paragraph does not form part of the
public licenses.
Creative Commons may be contacted at creativecommons.org.

17
LICENSE-CODE Normal file

@ -0,0 +1,17 @@
The MIT License (MIT)
Copyright (c) Microsoft Corporation
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and
associated documentation files (the "Software"), to deal in the Software without restriction,
including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense,
and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so,
subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial
portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT
NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY,
WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.


@ -1,4 +1,4 @@
# Train TensorFlow Models at Scale with Kubernetes on Azure
# Labs for Training and Serving TensorFlow Models with Kubernetes and Kubeflow on Azure Container Service (AKS)
<!-- ## [Learning Objectives](./learningObjectives.md)
## [Presentation Content](./presentationContent.md)
@ -6,16 +6,17 @@
## Prerequisites
1. Have a valid Microsoft Azure subscription allowing the creation of an ACS cluster
1. Have a valid Microsoft Azure subscription allowing the creation of an AKS cluster
1. Docker client installed: [Installing Docker](https://www.docker.com/community-edition)
1. Azure-cli (2.0) installed: [Installing the Azure CLI 2.0 | Microsoft Docs](https://docs.microsoft.com/en-us/cli/azure/install-azure-cli?view=azure-cli-latest)
1. Git cli installed: [Installing Git CLI](https://git-scm.com/book/en/v2/Getting-Started-Installing-Git)
1. Kubectl installed: [Installing Kubectl](https://kubernetes.io/docs/tasks/tools/install-kubectl/)
1. Helm installed: [Installing Helm CLI](https://docs.helm.sh/using_helm/#from-the-binary-releases) (**Note**: On Windows you can extract the `tar` file using a tool like 7Zip.
1. Helm installed: [Installing Helm CLI](https://docs.helm.sh/using_helm/#from-the-binary-releases) (**Note**: On Windows you can extract the `tar` file using a tool like 7Zip.)
1. ksonnet installed: [Installing ksonnet CLI](https://ksonnet.io/#get-started)
Clone this repository somewhere so you can easily access the different source files:
```console
git clone https://github.com/wbuchwalter/tensorflow-k8s-azure
git clone https://github.com/Azure/kubeflow-labs
```
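Before starting the labs, it can help to confirm that the command-line tools from the prerequisites are on your `PATH`. The sketch below assumes the usual binary names (`docker`, `az`, `git`, `kubectl`, `helm`, and `ks` for ksonnet); `missing_tools` is an illustrative helper, not something shipped with this repository.

```shell
# Print the name of every required CLI that is not on the PATH.
# With no arguments, checks the tools listed in the prerequisites.
missing_tools() {
  if [ "$#" -eq 0 ]; then
    set -- docker az git kubectl helm ks
  fi
  for tool in "$@"; do
    command -v "${tool}" >/dev/null 2>&1 || echo "${tool}"
  done
}
```

Running `missing_tools` with no arguments prints nothing when everything is installed; otherwise it lists the tools that still need to be set up.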
## Content Summary
@ -26,9 +27,41 @@ git clone https://github.com/wbuchwalter/tensorflow-k8s-azure
|1| **[Docker](1-docker)** | Docker and containers 101.|
|2| **[Kubernetes](2-kubernetes)** | Kubernetes important concepts overview.|
|3| **[Helm](3-helm)** | Introduction to Helm |
|4| **[GPUs](4-gpus)** | How to use GPUs with Kubernetes.|
|5| **[TFJob](5-tfjob)** | How to use `tensorflow/k8s` and `TFJob` to deploy a simple TensorFlow training.|
|6| **[Distributed Tensorflow](6-distributed-tensorflow)** | Going distributed with `TFJob`|
|7| **[Hyperparameters Sweep with Helm](7-hyperparam-sweep)** | Using Helm to deploy a large number of training testing different hypothesis, monitoring and comparing them. |
|8| **[Going Further](8-going-further)** | Links and resources to go further: Autoscaling, Distributed Storage. |
|9| **[Jupyter Notebooks](9-jupyter)** | Easily deploy a Jupyter Notebook instance on Kubernetes. |
|4| **[Kubeflow + TFJob](4-kubeflow-tfjob)** | Introduction to Kubeflow. How to use `tensorflow/k8s` and `TFJob` to deploy a simple TensorFlow training.|
|5| **[JupyterHub](5-jupyterhub)** | Learn how to run JupyterHub to create and manage Jupyter notebooks using Kubeflow |
|6| **[Distributed Tensorflow](6-distributed-tensorflow)** | Learn how to deploy and monitor distributed TensorFlow trainings with `TFJob`|
|7| **[Hyperparameters Sweep with Helm](7-hyperparam-sweep)** | Using Helm to deploy a large number of trainings testing different hypotheses, and TensorBoard to monitor and compare the results |
|8| **[Serving](8-serving)** | Using TensorFlow Serving to serve predictions |
|9| **[Going Further](9-going-further)** | Links and resources to go further: Autoscaling, Distributed Storage etc. |
# Contributing
This project welcomes contributions and suggestions. Most contributions require you to agree to a
Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us
the rights to use your contribution. For details, visit https://cla.microsoft.com.
When you submit a pull request, a CLA-bot will automatically determine whether you need to provide
a CLA and decorate the PR appropriately (e.g., label, comment). Simply follow the instructions
provided by the bot. You will only need to do this once across all repos using our CLA.
This project has adopted the [Microsoft Open Source Code of Conduct](https://opensource.microsoft.com/codeofconduct/).
For more information see the [Code of Conduct FAQ](https://opensource.microsoft.com/codeofconduct/faq/) or
contact [opencode@microsoft.com](mailto:opencode@microsoft.com) with any additional questions or comments.
# Legal Notices
Microsoft and any contributors grant you a license to the Microsoft documentation and other content
in this repository under the [Creative Commons Attribution 4.0 International Public License](https://creativecommons.org/licenses/by/4.0/legalcode),
see the [LICENSE](LICENSE) file, and grant you a license to any code in the repository under the [MIT License](https://opensource.org/licenses/MIT), see the
[LICENSE-CODE](LICENSE-CODE) file.
Microsoft, Windows, Microsoft Azure and/or other Microsoft products and services referenced in the documentation
may be either trademarks or registered trademarks of Microsoft in the United States and/or other countries.
The licenses for this project do not grant you rights to use any Microsoft names, logos, or trademarks.
Microsoft's general trademark guidelines can be found at http://go.microsoft.com/fwlink/?LinkID=254653.
Privacy information can be found at https://privacy.microsoft.com/en-us/
Microsoft and any contributors reserve all other rights, whether under their respective copyrights, patents,
or trademarks, whether by implication, estoppel or otherwise.