Additional edits for kf 0.2.2 (#49)
Parent: 7202ea194f
Commit: 9c2d3f6c0f
@@ -117,8 +117,8 @@ As you can see, we are not building a new image from scratch, instead we are usi

 You can see the full list here: https://hub.docker.com/r/tensorflow/tensorflow/tags/.

 What is important to note is that different tags need to be used depending on whether you want to use GPU or not.

-For example, if you wanted to run your model with TensorFlow 1.4.0 and CPU only, you would use `tensorflow/tensorflow:1.4.0`.
-If instead you wanted to use GPU, you would start from `tensorflow/tensorflow:1.4.0-gpu`.
+For example, if you wanted to run your model with TensorFlow 1.10.0 and CPU only, you would use `tensorflow/tensorflow:1.10.0`.
+If instead you wanted to use GPU, you would start from `tensorflow/tensorflow:1.10.0-gpu`.

 The two other instructions are pretty straightforward: first we copy our script into the container, and then we set this script as the entry point for our container, so that any argument passed to our container gets passed through to our script.
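For context, a minimal sketch of the Dockerfile this hunk describes. The `FROM` and `COPY` lines match the build output shown in the next hunk; the exact `ENTRYPOINT` form is an assumption:

```dockerfile
# Start from the official TensorFlow image (use the -gpu tag for GPU support)
FROM tensorflow/tensorflow:1.10.0

# Copy our training script into the container
COPY main.py /app/main.py

# Make the script the entry point, so any arguments passed to the container
# are forwarded to the script (exact form assumed)
ENTRYPOINT ["python", "/app/main.py"]
```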
@@ -141,7 +141,7 @@ The output from this command should look like this:

 ```
 Sending build context to Docker daemon 11.26kB
-Step 1/3 : FROM tensorflow/tensorflow:1.4.0
+Step 1/3 : FROM tensorflow/tensorflow:1.10.0
  ---> a61a91cc0d1b
 Step 2/3 : COPY main.py /app/main.py
  ---> b264d6e9a5ef
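The build output above would come from a command along these lines (image name and tag are illustrative):

```
docker build -t <DOCKER_USERNAME>/tf-mnist .
```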
@@ -151,10 +151,12 @@ If you provisioned GPU VM, describing one of the node should indicate the presen

 [...]
 Capacity:
- alpha.kubernetes.io/nvidia-gpu:  1
+ nvidia.com/gpu:  1
 [...]
 ```

+> Note: In some scenarios, you might not see GPU resources under Capacity. To resolve this, you must install a daemonset as described in the troubleshooting section here: https://docs.microsoft.com/en-us/azure/aks/gpu-cluster
+
 ## Exercise

 ### Running our Model on Kubernetes
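A quick way to confirm a node now advertises the renamed resource, mirroring the custom-columns query used elsewhere in these docs but updated for `nvidia.com/gpu`:

```
kubectl get nodes "-o=custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com\/gpu"
```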
@@ -190,7 +192,7 @@ spec:

         args: ["--max_steps", "500"] # Optional arguments to pass to our command. By default the command is defined by ENTRYPOINT in the Dockerfile
         resources:
           limits:
-            alpha.kubernetes.io/nvidia-gpu: 1 # We ask Kubernetes to assign 1 GPU to this container
+            nvidia.com/gpu: 1 # We ask Kubernetes to assign 1 GPU to this container
         volumeMounts:
         - name: nvidia
           mountPath: /usr/local/nvidia
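Pieced together, the surrounding template would look roughly like this after the change. This is a sketch: the Job name, image tag, and nvidia hostPath are assumptions:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: module4-gpu # hypothetical name
spec:
  template:
    spec:
      containers:
      - name: tensorflow
        image: <DOCKER_USERNAME>/tf-mnist:gpu # placeholder image
        args: ["--max_steps", "500"]
        resources:
          limits:
            nvidia.com/gpu: 1 # request one GPU for this container
        volumeMounts:
        - name: nvidia
          mountPath: /usr/local/nvidia
      volumes:
      - name: nvidia
        hostPath:
          path: /usr/local/nvidia # assumes the node exposes the drivers here
      restartPolicy: OnFailure
```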
@@ -54,7 +54,6 @@ ks registry add kubeflow github.com/kubeflow/kubeflow/tree/${VERSION}/kubeflow

 # Install Kubeflow components
 ks pkg install kubeflow/core@${VERSION}
 ks pkg install kubeflow/tf-serving@${VERSION}
-# ks pkg install kubeflow/tf-job@${VERSION} # TODO: delete this one?

 # Create templates for core components
 ks generate kubeflow-core kubeflow-core
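After installing the packages and generating the core component, the usual next step is to deploy into a ksonnet environment; a sketch, assuming the environment name placeholder used later in these docs:

```
ks env add ${YOUR_KF_ENV}
ks apply ${YOUR_KF_ENV}
```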
@@ -72,7 +72,7 @@ ks apply ${YOUR_KF_ENV}

 ```

 Create a new Jupyter Notebook instance:
-- open http://127.0.0.1:8000 in your browser
+- open http://127.0.0.1:8000 in your browser (or use the public IP for the service `tf-hub-lb`)
 - log in using any username and password
 - click the "Start My Server" button to spawn a new Jupyter notebook
 - from the image dropdown, select a tensorflow image for your notebook
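A quick way to look up that public IP; the service name comes from the line above, the namespace placeholder is an assumption:

```
kubectl get svc tf-hub-lb -n <NAMESPACE>
```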
@@ -81,7 +81,7 @@ Create a new Jupyter Notebook instance:

 ```
 kubectl get nodes "-o=custom-columns=NAME:.metadata.name,GPU:.status.allocatable.alpha\.kubernetes\.io\/nvidia-gpu"
 ```
-- for GPU, enter values in JSON format `{"alpha.kubernetes.io/nvidia-gpu":"1"}`
+- for GPU, enter values in JSON format `{"nvidia.com/gpu":"1"}`
 - click the "Spawn" button

 ![jupyterhub](./jupyterhub.png)
BIN  5-jupyterhub/jupyterhub.png  (binary file not shown; size before: 90 KiB, after: 30 KiB)
@@ -65,7 +65,7 @@ spec:

             name: tensorflow
             resources:
               limits:
-                alpha.kubernetes.io/nvidia-gpu: 1
+                nvidia.com/gpu: 1
           restartPolicy: OnFailure
 ```
@@ -95,7 +95,7 @@ spec:

             name: tensorflow
             resources:
               limits:
-                alpha.kubernetes.io/nvidia-gpu: 1
+                nvidia.com/gpu: 1
           restartPolicy: OnFailure
 ```
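Either of these snippets is submitted and monitored like any other Kubernetes resource; a sketch, with an illustrative file name:

```
kubectl create -f tfjob.yaml   # submit the TFJob
kubectl get tfjob              # check its status
kubectl get pods               # find the pods it spawned, then: kubectl logs <pod-name>
```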
@@ -197,18 +197,18 @@ Turns out mounting an Azure File share into a container is really easy, we simpl

             name: tensorflow
             resources:
               limits:
-                alpha.kubernetes.io/nvidia-gpu: 1
+                nvidia.com/gpu: 1
             volumeMounts:
             - name: azurefile
-              mountPath: <MOUNT_PATH>
+              subPath: module6-ex2-gpu
+              mountPath: /tmp/tensorflow
           volumes:
           - name: azurefile
             persistentVolumeClaim:
               claimName: azurefile
 ```

-Update your template from exercise 1 to mount the Azure File share into your container,and create your new job.
-Note that by default our container saves everything into `/tmp/tensorflow` so that's the value you will want to use for `MOUNT_PATH`.
+Update your template from exercise 1 to mount the Azure File share into your container, and create your new job.

 Once the container starts running, if you go to the Azure Portal, into your storage account, and browse your `tensorflow` file share, you should see something like this:
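The `azurefile` claim referenced above is created beforehand; a minimal sketch assuming an `azurefile` StorageClass is already configured (name and size are illustrative):

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: azurefile
spec:
  accessModes:
  - ReadWriteMany          # Azure Files supports shared read/write access
  storageClassName: azurefile
  resources:
    requests:
      storage: 5Gi         # illustrative size
```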
@@ -228,13 +228,16 @@ metadata:

   name: module6-ex2
 spec:
   tfReplicaSpecs:
-    Master:
+    MASTER:
       replicas: 1
       template:
         spec:
           containers:
-          - image: <DOCKER_USERNAME>/tf-mnist:1.0
+          - image: <DOCKER_USERNAME>/tf-mnist:gpu
             name: tensorflow
+            resources:
+              limits:
+                nvidia.com/gpu: 1
             volumeMounts:
               # By default our classifier saves the summaries in /tmp/tensorflow,
               # so that's where we want to mount our Azure File Share.
@@ -242,13 +245,13 @@ spec:

               # The subPath allows us to mount a subdirectory within the azure file share instead of root
               # this is useful so that we can save the logs for each run in a different subdirectory
               # instead of overwriting what was done before.
-              subPath: module6-ex2
+              subPath: module6-ex2-gpu
               mountPath: /tmp/tensorflow
-          restartPolicy: OnFailure
           volumes:
           - name: azurefile
             persistentVolumeClaim:
               claimName: azurefile
+          restartPolicy: OnFailure
 ```

 </details>
@@ -279,14 +279,14 @@ A working code sample is available in [`solution-src/main.py`](./solution-src/ma

 <summary><strong>TFJob's Template</strong></summary>

 ```yaml
-apiVersion: kubeflow.org/v1alpha1
+apiVersion: kubeflow.org/v1alpha2
 kind: TFJob
 metadata:
   name: module7-ex1-gpu
 spec:
-  replicaSpecs:
-    - replicas: 1 # 1 Master
-      tfReplicaType: MASTER
+  tfReplicaSpecs:
+    MASTER:
+      replicas: 1
       template:
         spec:
           volumes:
@@ -294,45 +294,46 @@ spec:

             persistentVolumeClaim:
               claimName: azurefile
           containers:
-          - image: ritazh/tf-mnist:distributedgpu # You can replace this by your own image
+          - image: <DOCKER_USERNAME>/tf-mnist:distributedgpu # You can replace this with your own image
             name: tensorflow
             imagePullPolicy: Always
             resources:
               limits:
-                alpha.kubernetes.io/nvidia-gpu: 1
+                nvidia.com/gpu: 1
             volumeMounts:
             - mountPath: /tmp/tensorflow
               subPath: module7-ex1-gpu
               name: azurefile
           restartPolicy: OnFailure
-    - replicas: 1 # 1 or 2 Workers depends on how many gpus you have
-      tfReplicaType: WORKER
+    WORKER:
+      replicas: 2
       template:
         spec:
           containers:
-          - image: ritazh/tf-mnist:distributedgpu # You can replace this by your own image
+          - image: <DOCKER_USERNAME>/tf-mnist:distributedgpu # You can replace this with your own image
             name: tensorflow
             imagePullPolicy: Always
             resources:
               limits:
-                alpha.kubernetes.io/nvidia-gpu: 1
+                nvidia.com/gpu: 1
-            volumeMounts:
           restartPolicy: OnFailure
-    - replicas: 1 # 1 Parameter server
-      tfReplicaType: PS
+    PS:
+      replicas: 1
       template:
         spec:
           containers:
-          - image: ritazh/tf-mnist:distributed # You can replace this by your own image
+          - image: <DOCKER_USERNAME>/tf-mnist:distributed # You can replace this with your own image
             name: tensorflow
             imagePullPolicy: Always
             ports:
             - containerPort: 6006
           restartPolicy: OnFailure

 ```

 There are a few things to notice here:
-* Since only the master will be saving the model and the summaries, we only need to mount the Azure File share on the master's `replicaSpec`, not on the `workers` or `ps`.
-* We are not specifying anything for the `PS` `replicaSpec` except the number of replicas. This is because `IsDefaultPS` is set to `true` by default. This means that the parameter server(s) will be started with a pre-built docker image that is already configured to read the `TF_CONFIG` and act as a TensorFlow server, so we don't need to do anything here.
+* Since only the master will be saving the model and the summaries, we only need to mount the Azure File share on the master's `tfReplicaSpecs`, not on the `WORKER`s or `PS`.
+* We are not specifying anything for the `PS` `tfReplicaSpecs` except the number of replicas. This is because `IsDefaultPS` is set to `true` by default. This means that the parameter server(s) will be started with a pre-built docker image that is already configured to read the `TF_CONFIG` and act as a TensorFlow server, so we don't need to do anything here.
+* When you have limited GPU resources, you can have the Master and Worker replicas request GPU resources while the PS replica requests only CPU resources.

 </details>
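For reference, the `TF_CONFIG` injected into each replica looks roughly like this JSON; the host names are illustrative and the exact shape depends on the operator version:

```
{
  "cluster": {
    "master": ["module7-ex1-gpu-master-0:2222"],
    "worker": ["module7-ex1-gpu-worker-0:2222", "module7-ex1-gpu-worker-1:2222"],
    "ps": ["module7-ex1-gpu-ps-0:2222"]
  },
  "task": {"type": "worker", "index": 0}
}
```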
@@ -11,15 +11,16 @@

 # This will result in creating 1 TFJob for every pair of learning rate and hidden layer depth
 {{- range $i, $lr := $lrlist }}
 {{- range $j, $nblayers := $nblayerslist }}
-apiVersion: kubeflow.org/v1alpha1
+apiVersion: kubeflow.org/v1alpha2
 kind: TFJob # Each one of our trainings will be a separate TFJob
 metadata:
   name: module8-tf-paint-{{ $i }}-{{ $j }} # We give a unique name to each training
   labels:
     chart: "{{ $chartname }}-{{ $chartversion | replace "+" "_" }}"
 spec:
-  replicaSpecs:
-    - template:
+  tfReplicaSpecs:
+    MASTER:
+      template:
       spec:
         restartPolicy: OnFailure
         containers:
@@ -40,7 +41,7 @@ spec:

         {{ if $useGPU }} # We only want to request GPUs if we asked for it in values.yaml with useGPU
           resources:
             limits:
-              alpha.kubernetes.io/nvidia-gpu: 1
+              nvidia.com/gpu: 1
         {{ end }}
           volumeMounts:
           - mountPath: /tmp/tensorflow
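For completeness, a sketch of the values.yaml this template presumably ranges over; `$lrlist`, `$nblayerslist`, and `$useGPU` appear in the template, but the exact keys and values here are assumptions:

```yaml
useGPU: true
lrlist:
  - 0.0001
  - 0.001
nblayerslist:
  - 4
  - 5
```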