pai/examples
Hao Yuan 274fba0da5
add setup_hdfs.sh (#1850)
2018-12-07 16:12:08 +08:00
Dockerfiles docs: tutorial of autobuild docker (#1589) 2018-10-30 16:07:53 +08:00
XGBoost refactor job-tutorial folder to examples (#1360) 2018-09-14 09:05:34 +08:00
auto-test [auto-test][example][tested] Add username support for auto-test to keep away path conflict (#1679) 2018-11-09 14:08:11 +08:00
caffe refactor job-tutorial folder to examples (#1360) 2018-09-14 09:05:34 +08:00
caffe2 refactor job-tutorial folder to examples (#1360) 2018-09-14 09:05:34 +08:00
chainer refactor job-tutorial folder to examples (#1360) 2018-09-14 09:05:34 +08:00
cluster-configuration Enable Cluster object model (#1824) 2018-12-05 17:31:49 +08:00
cntk [auto-test][example][tested] Add username support for auto-test to keep away path conflict (#1679) 2018-11-09 14:08:11 +08:00
horovod [auto-test][example][tested] Add username support for auto-test to keep away path conflict (#1679) 2018-11-09 14:08:11 +08:00
images remove duplicate content 2018-08-17 15:38:01 +08:00
job-editor add setup_hdfs.sh (#1850) 2018-12-07 16:12:08 +08:00
jupyter refactor job-tutorial folder to examples (#1360) 2018-09-14 09:05:34 +08:00
kafka refactor examples 2018-08-17 15:38:01 +08:00
keras Fixed typos (#1607) 2018-11-05 13:54:36 +08:00
mpi [auto-test][example][tested] Add username support for auto-test to keep away path conflict (#1679) 2018-11-09 14:08:11 +08:00
mxnet Fixed typos (#1607) 2018-11-05 13:54:36 +08:00
ocr-serving OCR opensource example (#1549) 2018-10-25 19:07:28 +08:00
pytorch Fixed typos (#1607) 2018-11-05 13:54:36 +08:00
scikit-learn Fixed typos (#1607) 2018-11-05 13:54:36 +08:00
serving refactor job-tutorial folder to examples (#1360) 2018-09-14 09:05:34 +08:00
spark add instructions for customizing pyspark job (#1449) 2018-09-30 14:40:09 +08:00
tensorflow [auto-test][example][tested] Add username support for auto-test to keep away path conflict (#1679) 2018-11-09 14:08:11 +08:00
README.md [example]revise document (#1656) 2018-11-08 14:35:20 +08:00

README.md

OpenPAI Job Examples

Quick start: how to write and submit a CIFAR-10 job

(1) Prepare a job json file

This section uses a CIFAR-10 training job as an example to explain how to write and submit a job in OpenPAI.

CIFAR-10 is an established computer-vision dataset used for image classification.

  • Full example of a TensorFlow CIFAR-10 image classification training job on OpenPAI:
{
  // Name of the job; must be unique
  "jobName": "tensorflow-cifar10",
  // URL pointing to the Docker image for all tasks in the job
  "image": "openpai/pai.example.tensorflow",
  // Data directory that already exists on HDFS
  "dataDir": "/tmp/data",
  // Output directory on HDFS
  "outputDir": "/tmp/output",
  // List of task roles; at least one task role is required
  "taskRoles": [
    {
      // Name for the task role
      "name": "cifar_train",
      // Number of tasks for the task role, no less than 1
      "taskNumber": 1,
      // CPU number for one task in the task role, no less than 1
      "cpuNumber": 8,
      // Memory for one task in the task role, no less than 100
      "memoryMB": 32768,
      // GPU number for one task in the task role, no less than 0
      "gpuNumber": 1,
      // Executable command for tasks in the task role; cannot be empty
      "command": "git clone https://github.com/tensorflow/models && cd models/research/slim && python download_and_convert_data.py --dataset_name=cifar10 --dataset_dir=$PAI_DATA_DIR && python train_image_classifier.py --batch_size=64 --model_name=inception_v3 --dataset_name=cifar10 --dataset_split_name=train --dataset_dir=$PAI_DATA_DIR --train_dir=$PAI_OUTPUT_DIR"
    }
  ]
}
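The constraints stated in the comments above can be checked before submission. The following is a minimal Python sketch, not part of OpenPAI: the helper names `strip_line_comments` and `validate_job` are hypothetical, and the checks cover only the limits written in the example's comments (at least one task role, the numeric minimums, a non-empty command). Because the annotated example contains `//` comment lines, which plain JSON does not allow, the sketch removes whole-line comments before parsing. It deliberately leaves `//` inside strings alone, so the `https://` URL in the command is not affected.

```python
import json

# Minimum values taken from the comments in the example job file above.
MINIMUMS = {"taskNumber": 1, "cpuNumber": 1, "memoryMB": 100, "gpuNumber": 0}


def strip_line_comments(text):
    """Drop lines that consist only of a // comment (hypothetical helper)."""
    kept = [line for line in text.splitlines()
            if not line.lstrip().startswith("//")]
    return "\n".join(kept)


def validate_job(job):
    """Return a list of problems found in a parsed job dict (hypothetical helper)."""
    problems = []
    roles = job.get("taskRoles") or []
    if len(roles) < 1:
        problems.append("at least one task role is required")
    for role in roles:
        name = role.get("name", "<unnamed>")
        for field, minimum in MINIMUMS.items():
            value = role.get(field)
            # bool is a subclass of int in Python, so exclude it explicitly.
            is_int = isinstance(value, int) and not isinstance(value, bool)
            if not is_int or value < minimum:
                problems.append(
                    f"{name}: {field} must be an integer no less than {minimum}")
        if not role.get("command"):
            problems.append(f"{name}: command cannot be empty")
    return problems


# A shortened copy of the example above; the command is a placeholder.
EXAMPLE = """
{
  // Name of the job; must be unique
  "jobName": "tensorflow-cifar10",
  "image": "openpai/pai.example.tensorflow",
  "taskRoles": [
    {
      "name": "cifar_train",
      "taskNumber": 1,
      "cpuNumber": 8,
      "memoryMB": 32768,
      "gpuNumber": 1,
      "command": "echo placeholder"
    }
  ]
}
"""

job = json.loads(strip_line_comments(EXAMPLE))
print(validate_job(job))  # prints [] because every stated constraint holds
```

A non-empty list means that the file violates at least one of the stated limits. The web portal may apply further checks that this sketch does not know about.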

(2) Submit the job json file from the OpenPAI web portal

To submit the job from the OpenPAI web portal, refer to the tutorial "submit a job in web portal".

List of off-the-shelf examples

Examples that can be run by submitting the json file directly, without any modification.

List of customized job templates

Users can customize these jobs and run them on OpenPAI.

What if an example fails

An example in this folder may fail for the following reasons:

  1. The format of the json file is incorrect. You may get an error when you copy the json file into the web portal. This may be caused by a version update of the web portal; refer to its latest version.
  2. The Docker image has been removed. You will find this error on your job tracking page. You can create an issue to report it, or build the image from the Dockerfile in the example's folder, push it to another Docker registry, and modify the image field of the json file accordingly. Refer to the README or DOCKER in that example's folder.
  3. If the example you submit contains a prepare.sh shell script, it may fail because the source of the data or code has changed or become unstable. You may get an error on your job tracking page. Check the script and try to fix it.
  4. The version of the code, tools, or libraries is different. You may get this error if you rebuild the Docker image. Some examples do not pin the versions of their dependencies, so check the versions.
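For the second reason above, the json file's image field has to point at the rebuilt image once it has been pushed to another registry. The sketch below shows one way to do that in Python. The helper name `retarget_image` and the registry `registry.example.com/myteam` are hypothetical placeholders, not part of OpenPAI. The sketch also assumes that the image reference has no tag or digest and that the image keeps its original name under the new registry.

```python
import copy


def retarget_image(job, registry):
    """Return a copy of the job whose image field points at another registry.

    Hypothetical helper: keeps the last path component of the original
    image reference and prefixes it with the new registry path.
    """
    updated = copy.deepcopy(job)
    image_name = job["image"].split("/")[-1]
    updated["image"] = f"{registry.rstrip('/')}/{image_name}"
    return updated


job = {"jobName": "tensorflow-cifar10",
       "image": "openpai/pai.example.tensorflow"}

# registry.example.com/myteam is a placeholder, not a real registry.
new_job = retarget_image(job, "registry.example.com/myteam")
print(new_job["image"])  # registry.example.com/myteam/pai.example.tensorflow
```

The original dict is left unchanged, so the unmodified job stays available for comparison.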

Contributing

If you want to contribute a job example that can be run on PAI, please open a new pull request.

  • Prepare a folder under the pai/examples folder, for example pai/examples/caffe2/

  • Prepare example files:

    Under the Caffe2 example directory, prepare the following files for the example's contribution PR:

[Image: PAI_caffe2_dir]

  1. README.md: introduction to the example
  2. Dockerfile: the example's dependencies
  3. PAI job json file: the example's OpenPAI job json template
  4. [Optional] Code file: the example's code
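Before opening a pull request, a contributor could check that the example folder contains the required files listed above. The Python sketch below is only an illustration: the helper name `missing_example_files` is hypothetical, and it assumes that the job template is any file ending in `.json`, since no file name is prescribed here.

```python
import tempfile
from pathlib import Path

# Files required by the list above; the optional code file is not checked.
REQUIRED_FILES = ("README.md", "Dockerfile")


def missing_example_files(example_dir):
    """Return the required items that are absent from an example folder."""
    folder = Path(example_dir)
    missing = [name for name in REQUIRED_FILES
               if not (folder / name).is_file()]
    # Assumption: the job template is any *.json file in the folder.
    if not any(folder.glob("*.json")):
        missing.append("job json file (*.json)")
    return missing


# Demonstration on a temporary folder that only contains a README.md.
with tempfile.TemporaryDirectory() as tmp:
    example = Path(tmp) / "caffe2"
    example.mkdir()
    (example / "README.md").write_text("# Caffe2 example\n")
    result = missing_example_files(example)
    print(result)  # ['Dockerfile', 'job json file (*.json)']
```

An empty list means that all three required items are present.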