OpenPAI Job Examples
Table of Contents
- Quick start: how to write and submit a CIFAR-10 job
- List of off-the-shelf examples
- List of customized job templates
- What if the example fails
- Contributing
Quick start: how to write and submit a CIFAR-10 job
(1) Prepare a job json file
In this section, we use a CIFAR-10 training job as an example to explain how to write and submit a job in OpenPAI.
CIFAR-10 is an established computer-vision dataset used for image classification.
- Full example of a TensorFlow CIFAR-10 image classification training job on OpenPAI:
{
// Name for the job; must be unique
"jobName": "tensorflow-cifar10",
// URL pointing to the Docker image for all tasks in the job
"image": "openpai/pai.example.tensorflow",
// Data directory existing on HDFS
"dataDir": "/tmp/data",
// Output directory on HDFS
"outputDir": "/tmp/output",
// List of task roles; at least one task role is required
"taskRoles": [
{
// Name for the task role
"name": "cifar_train",
// Number of tasks for the task role, no less than 1
"taskNumber": 1,
// CPU number for one task in the task role, no less than 1
"cpuNumber": 8,
// Memory for one task in the task role, no less than 100
"memoryMB": 32768,
// GPU number for one task in the task role, no less than 0
"gpuNumber": 1,
// Executable command for tasks in the task role; cannot be empty
"command": "git clone https://github.com/tensorflow/models && cd models/research/slim && python download_and_convert_data.py --dataset_name=cifar10 --dataset_dir=$PAI_DATA_DIR && python train_image_classifier.py --batch_size=64 --model_name=inception_v3 --dataset_name=cifar10 --dataset_split_name=train --dataset_dir=$PAI_DATA_DIR --train_dir=$PAI_OUTPUT_DIR"
}
]
}
- Save the content to a file named cifar10.json.
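The fields above can also be assembled and checked in a script before saving. The following is a minimal sketch, not part of OpenPAI: the helper names `build_cifar10_job` and `check_task_role` are hypothetical, and the limits it checks are only the ones stated in the comments of the example above. It writes plain JSON without the `//` comments, since standard JSON has no comment syntax.

```python
import json

# Minimums taken from the comments in the example job file above.
MINIMUMS = {"taskNumber": 1, "cpuNumber": 1, "memoryMB": 100, "gpuNumber": 0}


def check_task_role(role):
    """Return a list of problems found in one task role (empty if none)."""
    problems = []
    for field, minimum in MINIMUMS.items():
        value = role.get(field)
        if not isinstance(value, int) or value < minimum:
            problems.append(f"{field} must be an integer >= {minimum}")
    if not role.get("command"):
        problems.append("command cannot be empty")
    return problems


def build_cifar10_job():
    """Build the same job as the example above, as a plain dict."""
    command = " && ".join([
        "git clone https://github.com/tensorflow/models",
        "cd models/research/slim",
        "python download_and_convert_data.py --dataset_name=cifar10"
        " --dataset_dir=$PAI_DATA_DIR",
        "python train_image_classifier.py --batch_size=64"
        " --model_name=inception_v3 --dataset_name=cifar10"
        " --dataset_split_name=train --dataset_dir=$PAI_DATA_DIR"
        " --train_dir=$PAI_OUTPUT_DIR",
    ])
    return {
        "jobName": "tensorflow-cifar10",
        "image": "openpai/pai.example.tensorflow",
        "dataDir": "/tmp/data",
        "outputDir": "/tmp/output",
        "taskRoles": [
            {
                "name": "cifar_train",
                "taskNumber": 1,
                "cpuNumber": 8,
                "memoryMB": 32768,
                "gpuNumber": 1,
                "command": command,
            }
        ],
    }


if __name__ == "__main__":
    job = build_cifar10_job()
    for role in job["taskRoles"]:
        for problem in check_task_role(role):
            raise SystemExit(f"{role['name']}: {problem}")
    # Write comment-free JSON, ready to paste or upload.
    with open("cifar10.json", "w") as f:
        json.dump(job, f, indent=2)
```

Running the script writes cifar10.json to the current directory, or stops with a message naming the first field that violates one of the stated minimums.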
(2) Submit job json file from OpenPAI webportal
To submit the job from the OpenPAI web portal, refer to the tutorial "submit a job in web portal".
List of off-the-shelf examples
These examples can be run by submitting the json file directly, without any modification.
- tensorflow.cifar10.json: Single GPU training on CIFAR-10 using TensorFlow.
- pytorch.mnist.json: Single GPU training on MNIST using PyTorch.
- pytorch.regression.json: Regression using PyTorch.
- mxnet.autoencoder.json: Autoencoder using MXNet.
- mxnet.image-classification.json: Image classification on MNIST using MXNet.
- serving.tensorflow.json: TensorFlow model serving.
List of customized job templates
Users can customize these jobs and run them on OpenPAI.
- CNTK:
What if the example fails
An example in this folder may fail for the following reasons:
- The json format is incorrect. You may get an error when you copy the json file into the web portal, possibly because the web portal has been updated. Check the job format expected by the latest version of the web portal.
- The Docker image has been removed. You will see this error on your job tracking page. Create an issue to report it, or build the image from the Dockerfile in the example's folder, push it to another Docker registry, and change the image field in the json file accordingly. Refer to the README or DOCKER file in that example's folder.
- If the example you submit contains a prepare.sh shell script, it may fail because the source of the data or code has changed or become unstable. You may see the error on your job tracking page. Check the script and try to fix it.
- The version of the code, tools, or libraries has changed. You may get this error if you rebuild the Docker image. Some examples do not pin the versions of their dependencies, so check which versions were installed.
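The first two causes can often be checked or patched locally before resubmitting. The sketch below is illustrative only: `load_job` and `with_image` are hypothetical helper names, and the registry path in the usage lines is a placeholder. It assumes that comments in a job file appear only on lines of their own starting with `//`, as in the quick-start example, so a `//` inside a URL in `command` is left alone.

```python
import json


def load_job(text):
    """Parse job JSON text, ignoring whole-line `//` comments.

    Raises json.JSONDecodeError (a ValueError) with line and column
    information if the remaining text is not valid JSON.
    """
    # Comment lines are blanked rather than dropped so that the line
    # numbers in any parse error still match the original file.
    kept = [
        "" if line.lstrip().startswith("//") else line
        for line in text.splitlines()
    ]
    return json.loads("\n".join(kept))


def with_image(job, new_image):
    """Return a copy of the job whose image field points at new_image."""
    patched = dict(job)
    patched["image"] = new_image
    return patched


if __name__ == "__main__":
    text = """{
      // Name for the job; must be unique
      "jobName": "tensorflow-cifar10",
      "image": "openpai/pai.example.tensorflow"
    }"""
    job = load_job(text)
    # Placeholder registry path; replace it with wherever the image was pushed.
    patched = with_image(job, "registry.example.com/me/pai.example.tensorflow")
    print(json.dumps(patched, indent=2))
```

If `load_job` raises, the reported line and column point at the place in the job file where the format needs fixing.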
Contributing
If you want to contribute a job example that can be run on PAI, please open a new pull request.
- Prepare a folder under the pai/examples folder, for example pai/examples/caffe2/.
- Prepare the example files. Under the example directory (the Caffe2 directory in this case), prepare these files for the contribution PR:
- README.md: Example's introductions
- Dockerfile: Example's dependencies
- Pai job json file: Example's OpenPAI job json template
- [Optional] Code file: Example's code file
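Before opening the pull request, the checklist above can be verified with a short script. This is a rough sketch rather than an official OpenPAI tool: the function name `missing_example_files` is hypothetical, and it assumes the job template is any file in the folder ending in `.json`, which the list above does not specify.

```python
from pathlib import Path

# Required file names taken from the list above.
REQUIRED_FILES = ("README.md", "Dockerfile")


def missing_example_files(example_dir):
    """Return the names of required items missing from an example folder."""
    folder = Path(example_dir)
    missing = [name for name in REQUIRED_FILES if not (folder / name).is_file()]
    # Assumption: any *.json file in the folder counts as the job template.
    if not any(folder.glob("*.json")):
        missing.append("job json file")
    return missing


if __name__ == "__main__":
    # Example path taken from the first step above.
    result = missing_example_files("pai/examples/caffe2")
    print(result or "all required files present")
```

The optional code file is not checked, since the list above does not require it.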