Resource scheduling and cluster management for AI

Open Platform for AI (OpenPAI)

Join the chat at https://gitter.im/Microsoft/pai

简体中文 (Simplified Chinese)

OpenPAI is an open source platform that provides complete AI model training and resource management capabilities. It is easy to extend and supports on-premises, cloud, and hybrid environments at various scales.

OpenPAI v0.14.0 has been released!

Table of Contents

  1. When to consider OpenPAI
  2. Why choose OpenPAI
  3. Get started
  4. Deploy OpenPAI
  5. Train models
  6. Administration
  7. Reference
  8. Get involved
  9. How to contribute

When to consider OpenPAI

  1. When your organization needs to share powerful AI computing resources (GPU/FPGA farm, etc.) among teams.
  2. When your organization needs to share and reuse common AI assets like Model, Data, Environment, etc.
  3. When your organization needs an easy IT ops platform for AI.
  4. When you want to run a complete training pipeline in one place.

Why choose OpenPAI

The platform incorporates a mature design that has a proven track record in Microsoft's large-scale production environment.

Support on-premises and easy to deploy

OpenPAI is a full stack solution. It supports not only on-premises, hybrid, and public cloud deployment but also single-box deployment for trial users.

It provides pre-built Docker images for popular AI frameworks, makes it easy to include heterogeneous hardware, and supports distributed training, such as distributed TensorFlow.

Most complete solution and easy to extend

OpenPAI is a complete solution for deep learning: it supports virtual clusters, is compatible with the Hadoop / Kubernetes ecosystem, and runs a complete training pipeline on one cluster. OpenPAI is architected in a modular way: different modules can be plugged in as appropriate.

Aiming at openness and at advancing state-of-the-art technology, Microsoft Research (MSR) and Microsoft Search Technology Center (STC) have also released a few other open source projects. We encourage researchers and students to leverage these projects to accelerate AI development and research.

  • NNI : An open source AutoML toolkit for neural architecture search and hyper-parameter tuning.
  • FrameworkController : A general purpose Kubernetes controller to orchestrate all kinds of applications.
  • MMdnn : A comprehensive, cross-framework solution to convert, visualize and diagnose deep neural network models. The "MM" in MMdnn stands for model management and "dnn" is an acronym for deep neural network.
  • NeuronBlocks : An NLP deep learning modeling toolkit that helps engineers build DNN models like playing with Lego. The main goal of this toolkit is to minimize the development cost of building NLP deep neural network models, including both the training and inference stages.
  • SPTAG : Space Partition Tree And Graph (SPTAG) is an open source library for large-scale vector approximate nearest neighbor search scenarios.

Get started

OpenPAI manages computing resources and is optimized for deep learning. Through Docker technology, the computing hardware is decoupled from the software, so it's easy to run distributed jobs, switch between deep learning frameworks, or run other kinds of jobs in consistent environments.

Because OpenPAI is a platform, deploying a cluster is the first step before using it. OpenPAI can also be deployed on a single server to manage that server's resources.

If the cluster is ready, see Train models to learn how to use it.

Deploy OpenPAI

Follow this part to check prerequisites, then deploy and validate an OpenPAI cluster. More servers can be added as needed after the initial deployment.

It's highly recommended to try OpenPAI on servers that are not otherwise in use and run no other services. Refer to here for hardware specification.

Prerequisites and preparation

  • Ubuntu 16.04 (18.04 should work, but not fully tested.)
  • Assign each server a static IP address, and make sure the servers can communicate with each other.
  • Each server can access the internet, in particular the Docker Hub registry service or a mirror of it. The deployment process pulls the Docker images of OpenPAI.
  • SSH service is enabled on every server, all servers share the same username/password, and that user has sudo privileges.
  • NTP service is enabled.
  • Either do not install Docker beforehand, or make sure the installed Docker version is higher than 1.26.
  • OpenPAI reserves memory and CPU for running its services, so make sure there are enough resources left to run machine learning jobs. Check hardware requirements for details.
  • Use dedicated servers for OpenPAI. OpenPAI manages all CPU, memory and GPU resources of the servers. Any other workload may cause unknown problems due to insufficient resources.
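
The checklist above can be turned into a small pre-flight check that runs before deployment. The following is a minimal sketch, not part of OpenPAI: `HostFacts`, `parse_version` and `check_host` are hypothetical names, the facts are assumed to have been collected by some other means (for example over SSH), and the Docker rule is applied literally to the version string as the list states it.

```python
import re
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class HostFacts:
    """Facts about one server (hypothetical structure, not an OpenPAI type)."""
    name: str
    ubuntu_version: str            # e.g. "16.04"
    has_static_ip: bool
    ssh_enabled: bool
    ntp_enabled: bool
    docker_version: Optional[str]  # None means Docker is not installed


def parse_version(text: str) -> tuple:
    """Turn the leading dotted number of '18.09.7-ce' into (18, 9, 7)."""
    match = re.match(r"\d+(?:\.\d+)*", text)
    if match is None:
        raise ValueError(f"not a version string: {text!r}")
    return tuple(int(part) for part in match.group().split("."))


def check_host(facts: HostFacts) -> List[str]:
    """Return human-readable findings; an empty list means nothing was flagged."""
    findings = []
    if facts.ubuntu_version == "18.04":
        findings.append(f"{facts.name}: Ubuntu 18.04 should work but is not fully tested")
    elif facts.ubuntu_version != "16.04":
        findings.append(f"{facts.name}: Ubuntu {facts.ubuntu_version} is not listed as supported")
    if not facts.has_static_ip:
        findings.append(f"{facts.name}: no static IP address assigned")
    if not facts.ssh_enabled:
        findings.append(f"{facts.name}: SSH service is not enabled")
    if not facts.ntp_enabled:
        findings.append(f"{facts.name}: NTP service is not enabled")
    # Docker is fine when absent; otherwise it must be higher than 1.26.
    if facts.docker_version is not None:
        if not parse_version(facts.docker_version) > (1, 26):
            findings.append(f"{facts.name}: Docker {facts.docker_version} is not higher than 1.26")
    return findings
```

Running `check_host` over every server and refusing to continue while any list is non-empty keeps configuration problems from surfacing halfway through a deployment. The sketch does not cover internet access, shared credentials or resource sizing, which need checks of their own.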

Deploy

The Deploy with default configuration part covers the minimum steps to deploy an OpenPAI cluster, and it's suitable for most small and medium-size clusters within 50 servers. Based on the default configuration, a customized deployment can optimize the cluster for different hardware environments and usage scenarios.

Deploy with default configuration

For a small or medium-size cluster, with fewer than 50 servers, it's recommended to deploy with the default configuration. If there is only one powerful server, refer to deploy OpenPAI as a single box.

For a large cluster, this section is still needed to generate the default configuration, which is then customized.

Customize deployment

Because hardware environments and usage scenarios vary, the default configuration of OpenPAI may need to be optimized. Follow the Customize deployment part to learn more details.

Validate deployment

After deployment, it's recommended to validate that the key components of OpenPAI are healthy. After validation succeeds, submit a hello-world job and check that it works end-to-end.

Train users before "train models"

The common practice on OpenPAI is to submit job requests and wait until the jobs get computing resources and are executed. This is a different experience from assigning dedicated servers to each person. People may feel that computing resources are not under their control, and the learning curve may be steeper than for running jobs on dedicated servers. But shared resources on OpenPAI can improve resource utilization and save time on maintaining environments.

For administrators of OpenPAI, a successful deployment is the first step; the second step is to help users of OpenPAI understand its benefits and know how to use it. Users can learn from Train models. But the Train models section below covers various scenarios, and users may not need all of them. So administrators can create simplified documents that match their users' actual scenarios.

FAQ

If any question comes up during deployment, check here first.

If the FAQ doesn't resolve it, refer to here to ask a question or submit an issue.

Train models

Like all computing platforms, OpenPAI is a productivity tool that aims to maximize resource utilization. So it's recommended to submit training jobs and let OpenPAI allocate resources and run them. If there are too many jobs, some may be queued until enough resources are available. This is a different experience from running code on dedicated servers, so it takes a bit more knowledge about how to submit and manage jobs on OpenPAI.
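
To make the queuing behavior concrete, here is a toy model of a shared GPU pool. It is only an illustration and not OpenPAI's actual scheduler: the `SharedCluster` class, its method names, and the strict first-in, first-out policy are all assumptions made for this sketch.

```python
from collections import deque


class SharedCluster:
    """Toy model of a shared GPU pool; not OpenPAI's actual scheduler."""

    def __init__(self, total_gpus):
        self.free_gpus = total_gpus
        self.running = {}        # job name -> number of GPUs it holds
        self.waiting = deque()   # (job name, GPUs requested), oldest first

    def submit(self, name, gpus):
        """Queue a job request; it starts at once if resources allow."""
        self.waiting.append((name, gpus))
        self._start_waiting_jobs()

    def finish(self, name):
        """Release the GPUs of a finished job and start queued jobs."""
        self.free_gpus += self.running.pop(name)
        self._start_waiting_jobs()

    def _start_waiting_jobs(self):
        # Strict first-in, first-out: stop at the first job that does not fit,
        # even if a smaller job behind it would.
        while self.waiting and self.waiting[0][1] <= self.free_gpus:
            name, gpus = self.waiting.popleft()
            self.running[name] = gpus
            self.free_gpus -= gpus
```

With a 4-GPU pool, submitting job `a` for 2 GPUs starts it immediately, while a following job `b` that requests 4 GPUs waits until `a` finishes. A real scheduler would typically add priorities, per-team quotas or backfilling on top of this; the sketch only shows why a submitted job may not start right away.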

Note that besides queuing jobs, OpenPAI also supports allocating dedicated resources. Users can use SSH or Jupyter Notebook as they would on a physical server; refer to here for details. Although this uses resources less efficiently, it still saves the cost of setting up and managing environments on physical servers.

Submit training jobs

Follow the job submission tutorial to learn more about how to train models on OpenPAI. It's a good starting point for learning how to use OpenPAI.

Client tool

OpenPAI VS Code Client is a friendly, GUI-based client tool for OpenPAI, and it's highly recommended. It's an extension for Visual Studio Code. It can submit jobs, simulate jobs locally, manage multiple OpenPAI environments, and so on.

Troubleshooting job failure

The web UI and job logs are helpful for analyzing job failures, and OpenPAI supports SSH for debugging.

Refer to here for more information about troubleshooting job failures.

Administration

Reference

Users

Get involved

How to contribute

Contributor License Agreement

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.microsoft.com.

When you submit a pull request, a CLA-bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., label, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact opencode@microsoft.com with any additional questions or comments.

Call for contribution

We are working on a set of major feature improvements and refactoring. Anyone who is familiar with the features is encouraged to join the design review and discussion in the corresponding issue ticket.

Who should consider contributing to OpenPAI

  • Folks who want to add support for other ML and DL frameworks
  • Folks who want to make OpenPAI a richer AI platform (e.g. support for more ML pipelines, hyperparameter tuning)
  • Folks who want to write tutorials/blog posts showing how to use OpenPAI to solve AI problems

Contributors

One key purpose of PAI is to support the highly diversified requirements from academia and industry. PAI is completely open: it is under the MIT license. This makes PAI particularly attractive for evaluating various research ideas, which include but are not limited to the components.

PAI operates in an open model. It was initially designed and developed by the Microsoft Research (MSR) and Microsoft Search Technology Center (STC) platform team. We are glad to have Peking University, Xi'an Jiaotong University, Zhejiang University, and University of Science and Technology of China join us to develop the platform jointly. Contributions from academia and industry are all highly welcome.