Additional edits for kf 0.2.2 (#49)

Brian Redmond, 2018-09-20 14:33:46 -04:00; committed by Rita Zhang
Parent: 7202ea194f
Commit: 9c2d3f6c0f
8 changed files: 44 additions and 38 deletions

View file

@@ -117,8 +117,8 @@ As you can see, we are not building a new image from scratch, instead we are usi
 You can see the full list here: https://hub.docker.com/r/tensorflow/tensorflow/tags/.
 What is important to note is that different tags need to be used depending on whether you want to use GPU or not.
-For example, if you wanted to run your model with TensorFlow 1.4.0 and CPU only, you would use `tensorflow/tensorflow:1.4.0`.
-If instead you wanted to use GPU, you would start from `tensorflow/tensorflow:1.4.0-gpu`.
+For example, if you wanted to run your model with TensorFlow 1.10.0 and CPU only, you would use `tensorflow/tensorflow:1.10.0`.
+If instead you wanted to use GPU, you would start from `tensorflow/tensorflow:1.10.0-gpu`.
 The two other instructions are pretty straightforward: first we copy our script into the container, and then we set this script as the entry point for our container, so that any argument passed to our container will actually get passed to our script.
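Putting these instructions together, the Dockerfile described here would look roughly like this (a sketch consistent with the build output shown below; the `python` entry-point invocation is an assumption):

```dockerfile
# Start from the official TensorFlow image; use the "-gpu" tag variant for GPU support
FROM tensorflow/tensorflow:1.10.0

# Copy our training script into the container
COPY main.py /app/main.py

# Set the script as the entry point, so any arguments passed to the
# container are forwarded to the script
ENTRYPOINT ["python", "/app/main.py"]
```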
@@ -141,7 +141,7 @@ The output from this command should look like this:
 ```
 Sending build context to Docker daemon 11.26kB
-Step 1/3 : FROM tensorflow/tensorflow:1.4.0
+Step 1/3 : FROM tensorflow/tensorflow:1.10.0
  ---> a61a91cc0d1b
 Step 2/3 : COPY main.py /app/main.py
  ---> b264d6e9a5ef
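The command producing this output is a standard Docker build invocation (the image name and tag are illustrative; the workshop uses `<DOCKER_USERNAME>/tf-mnist`-style names elsewhere):

```bash
# Build the image from the Dockerfile in the current directory and tag it
docker build -t <DOCKER_USERNAME>/tf-mnist:1.0 .
```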

View file

@@ -151,10 +151,12 @@ If you provisioned GPU VM, describing one of the node should indicate the presen
 [...]
 Capacity:
- alpha.kubernetes.io/nvidia-gpu: 1
+ nvidia.com/gpu: 1
 [...]
 ```
 
+> Note: In some scenarios, you might not see GPU resources under Capacity. To resolve this, you must install a DaemonSet as described in the troubleshooting section here: https://docs.microsoft.com/en-us/azure/aks/gpu-cluster
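After installing that DaemonSet, you can confirm the resource is advertised (a quick sketch; substitute a node name from `kubectl get nodes`):

```bash
# The node's capacity should now list the nvidia.com/gpu resource
kubectl describe node <node-name> | grep "nvidia.com/gpu"
```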
 ## Exercise
 ### Running our Model on Kubernetes
@@ -190,7 +192,7 @@ spec:
         args: ["--max_steps", "500"] # Optional arguments to pass to our command. By default the command is defined by ENTRYPOINT in the Dockerfile
         resources:
           limits:
-            alpha.kubernetes.io/nvidia-gpu: 1 # We ask Kubernetes to assign 1 GPU to this container
+            nvidia.com/gpu: 1 # We ask Kubernetes to assign 1 GPU to this container
         volumeMounts:
         - name: nvidia
           mountPath: /usr/local/nvidia

View file

@@ -54,7 +54,6 @@ ks registry add kubeflow github.com/kubeflow/kubeflow/tree/${VERSION}/kubeflow
 # Install Kubeflow components
 ks pkg install kubeflow/core@${VERSION}
 ks pkg install kubeflow/tf-serving@${VERSION}
-# ks pkg install kubeflow/tf-job@${VERSION} # TODO: delete this one?
 
 # Create templates for core components
 ks generate kubeflow-core kubeflow-core

View file

@@ -72,7 +72,7 @@ ks apply ${YOUR_KF_ENV}
 ```
 
 Create a new Jupyter Notebook instance:
-- open http://127.0.0.1:8000 in your browser
+- open http://127.0.0.1:8000 in your browser (or use the public IP for the service `tf-hub-lb`)
 - log in using any username and password
 - click the "Start My Server" button to spawn a new Jupyter notebook
 - from the image dropdown, select a tensorflow image for your notebook
@@ -81,7 +81,7 @@ Create a new Jupyter Notebook instance:
 ```
 kubectl get nodes "-o=custom-columns=NAME:.metadata.name,GPU:.status.allocatable.alpha\.kubernetes\.io\/nvidia-gpu"
 ```
-- for GPU, enter values in JSON format `{"alpha.kubernetes.io/nvidia-gpu":"1"}`
+- for GPU, enter values in JSON format `{"nvidia.com/gpu":"1"}`
 - click the "Spawn" button
 
 ![jupyterhub](./jupyterhub.png)
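Since the resource name has moved to `nvidia.com/gpu`, the allocatable-GPU column in the `kubectl get nodes` command above can be adapted the same way (a sketch, assuming the NVIDIA device plugin is installed on the cluster):

```bash
# List each node's allocatable GPUs under the nvidia.com/gpu resource name
kubectl get nodes "-o=custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com\/gpu"
```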

Binary data: 5-jupyterhub/jupyterhub.png (binary file not shown)
Before: 90 KiB | After: 30 KiB

View file

@@ -65,7 +65,7 @@ spec:
         name: tensorflow
         resources:
           limits:
-            alpha.kubernetes.io/nvidia-gpu: 1
+            nvidia.com/gpu: 1
       restartPolicy: OnFailure
 ```
@@ -95,7 +95,7 @@ spec:
         name: tensorflow
         resources:
           limits:
-            alpha.kubernetes.io/nvidia-gpu: 1
+            nvidia.com/gpu: 1
       restartPolicy: OnFailure
 ```
@@ -197,18 +197,18 @@ Turns out mounting an Azure File share into a container is really easy, we simpl
         name: tensorflow
         resources:
           limits:
-            alpha.kubernetes.io/nvidia-gpu: 1
+            nvidia.com/gpu: 1
         volumeMounts:
         - name: azurefile
-          mountPath: <MOUNT_PATH>
+          subPath: module6-ex2-gpu
+          mountPath: /tmp/tensorflow
       volumes:
       - name: azurefile
         persistentVolumeClaim:
           claimName: azurefile
 ```
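This snippet assumes a PersistentVolumeClaim named `azurefile` already exists in the cluster. A minimal sketch of such a claim, assuming a storage class backed by Azure Files is available under the name `azurefile` (both names are illustrative):

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: azurefile
spec:
  accessModes:
    - ReadWriteMany            # Azure Files can be mounted by many pods at once
  storageClassName: azurefile  # assumes an Azure Files storage class with this name
  resources:
    requests:
      storage: 5Gi
```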
-Update your template from exercise 1 to mount the Azure File share into your container,and create your new job.
-Note that by default our container saves everything into `/tmp/tensorflow` so that's the value you will want to use for `MOUNT_PATH`.
+Update your template from exercise 1 to mount the Azure File share into your container, and create your new job.
 
 Once the container starts running, if you go to the Azure Portal, into your storage account, and browse your `tensorflow` file share, you should see something like this:
@@ -228,13 +228,16 @@ metadata:
   name: module6-ex2
 spec:
   tfReplicaSpecs:
-    Master:
+    MASTER:
       replicas: 1
       template:
         spec:
           containers:
-          - image: <DOCKER_USERNAME>/tf-mnist:1.0
+          - image: <DOCKER_USERNAME>/tf-mnist:gpu
             name: tensorflow
+            resources:
+              limits:
+                nvidia.com/gpu: 1
             volumeMounts:
             # By default our classifier saves the summaries in /tmp/tensorflow,
             # so that's where we want to mount our Azure File Share.
@@ -242,13 +245,13 @@ spec:
             # The subPath allows us to mount a subdirectory within the azure file share instead of root
             # this is useful so that we can save the logs for each run in a different subdirectory
             # instead of overwriting what was done before.
-              subPath: module6-ex2
+              subPath: module6-ex2-gpu
               mountPath: /tmp/tensorflow
-          restartPolicy: OnFailure
           volumes:
           - name: azurefile
             persistentVolumeClaim:
               claimName: azurefile
+          restartPolicy: OnFailure
 ```
</details>
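Once the job is created, a quick way to keep an eye on it (a sketch; `kubectl get tfjob` only works after Kubeflow has installed the TFJob CRD):

```bash
# Check the TFJob object and the pods it spawned (pod names include the job name)
kubectl get tfjob module6-ex2
kubectl get pods | grep module6-ex2
kubectl logs <pod-name>   # substitute a pod name from the previous command
```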

View file

@@ -279,14 +279,14 @@ A working code sample is available in [`solution-src/main.py`](./solution-src/ma
 <summary><strong>TFJob's Template</strong></summary>
 
 ```yaml
-apiVersion: kubeflow.org/v1alpha1
+apiVersion: kubeflow.org/v1alpha2
 kind: TFJob
 metadata:
   name: module7-ex1-gpu
 spec:
-  replicaSpecs:
-    - replicas: 1 # 1 Master
-      tfReplicaType: MASTER
+  tfReplicaSpecs:
+    MASTER:
+      replicas: 1
       template:
         spec:
           volumes:
@@ -294,45 +294,46 @@ spec:
             persistentVolumeClaim:
               claimName: azurefile
           containers:
-          - image: ritazh/tf-mnist:distributedgpu # You can replace this with your own image
+          - image: <DOCKER_USERNAME>/tf-mnist:distributedgpu # You can replace this with your own image
             name: tensorflow
             imagePullPolicy: Always
             resources:
               limits:
-                alpha.kubernetes.io/nvidia-gpu: 1
+                nvidia.com/gpu: 1
             volumeMounts:
             - mountPath: /tmp/tensorflow
               subPath: module7-ex1-gpu
               name: azurefile
           restartPolicy: OnFailure
-    - replicas: 1 # 1 or 2 Workers, depending on how many GPUs you have
-      tfReplicaType: WORKER
+    WORKER:
+      replicas: 2
       template:
         spec:
           containers:
-          - image: ritazh/tf-mnist:distributedgpu # You can replace this with your own image
+          - image: <DOCKER_USERNAME>/tf-mnist:distributedgpu # You can replace this with your own image
             name: tensorflow
             imagePullPolicy: Always
             resources:
               limits:
-                alpha.kubernetes.io/nvidia-gpu: 1
+                nvidia.com/gpu: 1
             volumeMounts:
           restartPolicy: OnFailure
-    - replicas: 1 # 1 Parameter server
-      tfReplicaType: PS
+    PS:
+      replicas: 1
       template:
         spec:
           containers:
-          - image: ritazh/tf-mnist:distributed # You can replace this with your own image
+          - image: <DOCKER_USERNAME>/tf-mnist:distributed # You can replace this with your own image
             name: tensorflow
             imagePullPolicy: Always
             ports:
            - containerPort: 6006
           restartPolicy: OnFailure
 ```
 
 There are a few things to notice here:
-* Since only the master will be saving the model and the summaries, we only need to mount the Azure File share on the master's `replicaSpec`, not on the `workers` or `ps`.
-* We are not specifying anything for the `PS` `replicaSpec` except the number of replicas. This is because `IsDefaultPS` is set to `true` by default. This means that the parameter server(s) will be started with a pre-built docker image that is already configured to read the `TF_CONFIG` and act as a TensorFlow server, so we don't need to do anything here.
+* Since only the master will be saving the model and the summaries, we only need to mount the Azure File share on the master's `tfReplicaSpecs`, not on the `WORKER`s or `PS`.
+* We are not specifying anything for the `PS` `tfReplicaSpecs` except the number of replicas. This is because `IsDefaultPS` is set to `true` by default, meaning the parameter server(s) will be started with a pre-built Docker image that is already configured to read `TF_CONFIG` and act as a TensorFlow server, so we don't need to do anything here.
+* When GPU resources are limited, you can have the master and worker replicas request GPUs while the PS replica requests only CPU.
</details>
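The bullets above refer to `TF_CONFIG`, the environment variable the operator injects into every replica so that each process knows the cluster layout and its own role. You can peek at it in a running replica (the pod name and host names below are illustrative):

```bash
# Print TF_CONFIG from inside one of the replicas
kubectl exec <worker-pod-name> -- printenv TF_CONFIG
# Expected general shape (illustrative):
# {"cluster": {"master": ["...:2222"], "worker": ["...:2222", "...:2222"], "ps": ["...:2222"]},
#  "task": {"type": "worker", "index": 0}}
```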

View file

@@ -11,15 +11,16 @@
 # This will result in creating 1 TFJob for every pair of learning rate and hidden layer depth
 {{- range $i, $lr := $lrlist }}
 {{- range $j, $nblayers := $nblayerslist }}
-apiVersion: kubeflow.org/v1alpha1
+apiVersion: kubeflow.org/v1alpha2
 kind: TFJob # Each one of our trainings will be a separate TFJob
 metadata:
   name: module8-tf-paint-{{ $i }}-{{ $j }} # We give a unique name to each training
   labels:
     chart: "{{ $chartname }}-{{ $chartversion | replace "+" "_" }}"
 spec:
-  replicaSpecs:
-    - template:
+  tfReplicaSpecs:
+    MASTER:
+      template:
         spec:
           restartPolicy: OnFailure
           containers:
@@ -40,7 +41,7 @@ spec:
 {{ if $useGPU }} # We only want to request GPUs if we asked for it in values.yaml with useGPU
             resources:
               limits:
-                alpha.kubernetes.io/nvidia-gpu: 1
+                nvidia.com/gpu: 1
 {{ end }}
             volumeMounts:
             - mountPath: /tmp/tensorflow
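The `$lrlist` and `$nblayerslist` variables ranged over at the top of this template are defined earlier in the chart, outside this diff. Purely as an illustrative sketch, they could be read from `values.yaml` like this (the key names here are hypothetical, not the chart's actual ones):

```yaml
# Hypothetical sketch: populate the hyperparameter lists from values.yaml
{{- $lrlist := .Values.learningRateList }}
{{- $nblayerslist := .Values.hiddenLayersList }}
# with values.yaml containing, e.g.:
#   learningRateList: ["0.001", "0.01", "0.1"]
#   hiddenLayersList: [1, 2, 3]
#   useGPU: true
```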