OpenPAI Release Note

Jan 2021 (version 1.5.0)

  • Improve Web Portal Experience

    • Fix Home page overlap issue #5213 #5180
    • Add filter, search box and export csv button in task detail list #5175
    • Create a new page for yaml editor #5172
  • Marketplace related update

  • Support different types of computing hardware #5138

  • Deployment process refinement

    • master.csv + worker.csv -> layout.yaml
    • Move config.yaml and layout.yaml under the quick-start folder, and remove all the argument parsing logic
    • Add support for cpu-only worker installation
    • Add support for heterogeneous workers
    • Unify version requirements: pai version, pai image tag
    • Set default value in config files
    • Generate hiveD config with layout.yaml #5179
    • Check layout before installing k8s #5184 #5181
    • Config folder structure arrangement
    • Refine installation logs
    • Add skip service list argument #5193
  • Log manager

    • Change get logs api return code #5125

Dec 2020 (version 1.4.1)

  • Marketplace
    • Fix initializing blob data issue (#5189)
  • Log Collection
    • Fix getting wrong log for retried task & frontend crash issue (#5190)

Dec 2020 (version 1.4.0)

  • multi-cluster (https://github.com/microsoft/pai/issues/4929)
    • Support job transfer (#5082, #5088)
  • Autoscaler
    • Update docs for Cluster Autoscaler on AKS Engine (#5057)
  • Log Collection (https://github.com/microsoft/pai/issues/4992)
    • Rest API
    • Webportal
  • Https configuration document (#5076, #5078)
  • Marketplace (https://github.com/microsoft/openpaimarketplace/issues/73)
    • Data
      • Move NFS to Azure Blob as backend
      • Upload Job output to Azure Blob
      • Download data from Azure Blob to local
      • Use Azure storage SDK for privacy
      • Refactor data usage logic after changing storage to blob
      • Update project development doc and manual
    • Service Deployment
      • Start Local Rest Server
      • Deploy Rest Server in PAI
      • Start database and save items into it
      • Register in PAI pylon (#5066)
      • Add azure storage to service configuration (#5104)
  • Web Portal
    • Fix stop job button issue #5079
  • Admin Experience
    • Prometheus alert rules update (#5021)
    • Refine deployment process (#5077, #5085)
  • Others
    • Fix updateUserGroupList API issue (#5121)
    • Fix hived config issue caused by k8s coreDNS deployment (#5071)

Nov 2020 (version 1.3.0)

  • Marketplace
  • HiveD Scheduler
    • Support cluster autoscale with HiveD scheduler on AKS (#4868)
    • Support dynamic sku types for different vc on webportal (#4900)
  • Advanced job debug mode
  • GPU monitoring and utilization
    • Support job tagging (#4924)
    • Stop low GPU utilization job with alert-manager (#4940)
    • Cordon node with GPU ECC Errors (#4942)
  • Documentation
    • Fix document according to DRI tickets (#4828)
    • Add distributed examples (#4821)
  • Webportal
    • Add help info for items on webportal (#4950)

Oct 2020 (version 1.2.1)

minor release with hot fix.

Sep 2020 (version 1.2.0)

For more details about this release, please refer to detailed release note.

July 2020 (version 1.1.1)

minor release with hot fix.

July 2020 (version 1.1.0)

  • Storage:
    • Support readonly storage. (#4523)
  • Security
    • If ssl is enabled, all requests will use https. (#4550)
  • Authentication
    • Support nested AD group in AAD Mode. (#4639)
  • Marketplace
  • Others
    • Add stress test for PAI API. (#4665)
    • Resolve job always retry for port conflict. (#4384)
    • Webportal/VScode use JS SDK + SDK improvement. (#4660)
    • Align webportal submit default value with backend. (#4682)
    • Document enhance. (#4700)
    • Fix incorrect logdir issue in tensorboard v2.
    • Fix broken help link for job submission on webportal.
    • Fix ssh barrier bug.

For more details about this release, please refer to detailed release note.

May 2020 (version 1.0.1)

minor release with hot fix.

May 2020 (version 1.0.0)

With the v1.0.0 release, OpenPAI officially switches to a pure Kubernetes-based architecture. In addition, we have made the component design more modular by reorganizing the code structure into 1 main repo together with 7 standalone key component repos.

Please refer to the system architecture documentation for the design thinking behind this change, and review the following list to get a better understanding of the 7 new component repos:

  • hivedscheduler is a new OpenPAI component providing various advantages over standard k8s scheduler, such as resource isolation for multiple tenants, GPU topology guarantee for virtual clusters, and better topology-aware gang scheduling with no resource starvation.

  • frameworkcontroller is built to orchestrate all kinds of applications on Kubernetes by a single controller.

  • openpai-protocol is the specification of OpenPAI job protocol. It facilitates platform interoperability and job portability. A job described by the protocol can run on different clusters managed by OpenPAI. The protocol also enables great flexibility. Any AI workload can be described by it conveniently.

  • openpai-runtime provides the runtime support necessary for the OpenPAI protocol. OpenPAI runtime can classify typical runtime error patterns and prevent unnecessary error retries, which saves cluster resources.

  • openpaisdk is a JavaScript SDK designed to help OpenPAI developers offer a more user-friendly experience.

  • openpaimarketplace is a service which stores examples and job templates. Users can use it from the webportal plugin to share their jobs, or to run and learn from jobs shared by others.

    Features:

    1. Provide a way for team collaboration among pai users.
    2. Provide an easy starting point and learning material for new users. Users can refer to shared templates in marketplace and learn how to use the pai platform correctly.
    3. Provide an admin review process to ensure the quality of templates in marketplace.
  • openpaivscode is a VSCode extension that lets users connect to OpenPAI clusters, submit AI jobs, simulate jobs locally and manage files easily in VSCode.

The version of each standalone repo used in OpenPAI v1.0.0 is hivedscheduler v0.3.2, frameworkcontroller v0.6.0, openpai-protocol v2.0.0-alpha, openpai-runtime v0.1.0, openpaisdk v0.1.0, openpaimarketplace v1.2.0 and openpaivscode v0.3.0.

Other major new features and improvements that come with this release are:

For more details about this release, please refer to detailed release note.

July 2019 (version 0.14.0)

Welcome to the July 2019 release of OpenPAI. There are a number of updates in this version that we hope you will like. Some of the key highlights include:

For more details about this release, please refer to detailed release note.

June 2019 (version 0.13.0)

  • OpenPAI protocol:
    • Introduce OpenPAI protocol and job submission v2 (#2260)
    • Add new job submission v2 plugin (#2461)
  • Web portal:
    • Add login page for guests (#2544)
    • Add user home page (#2614)
    • Add new user management page (#2726, #2796)
    • User Management UX refactoring with new layout and themes (#2726, #2796)

For more details about this release, please refer to detailed release note.

April 2019 (version 0.12.0)

  • Web portal:
    • Display error message in job detail page #2456
    • Import users from CSV file directly and show the final results #2495
    • Add TotalGpuCount and TotalTaskCount into job list #2499
  • Deployment
    • Add cluster version info #2528
    • Check if the nodes are ubuntu 16.04 #2520
    • Check duplicate hostname #2403

For more details about this release, please refer to detailed release note.

April 2019 (version 0.11.0)

  • Support team-wise NFS storage, including:
    • An NFS configuration plug-in and a commandline tool. #2346
    • A simple NFS-job submit plug-in. #2358
  • Refer to Simplified Job Submission for OpenPAI + NFS deployment for more details.
  • New alerts for unhealthy GPUs, currently including following alerts #2209:
  • Admins can see all running jobs on a node. #2197
  • Filter support in Job List View. #302
  • Keep the environment for failed jobs that are caused by user error. #2272

For more details about this release, please refer to detailed release note.

Mar 2019 (version 0.10.1)

  • Admin can configure MaxCapacity through REST API for a given Virtual Cluster so that the virtual cluster can use idle resources as bonus. #2147
  • Support Azure RDMA. #2091; how-to doc
  • New Disk Cleaner for abnormal disk usage: the disk cleaner checks disk usage every 60 seconds (configurable). If disk usage is above 94% (configurable), it kills the container that uses the most disk space with a specific signal (10); the container exits with code 1 and the related job fails. Admins and users can find the reason in the job logs. #2119
  • Web portal: add "My jobs" filter button. #2111
  • "Submit Simple Job" web portal plugin. #2131
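
The Disk Cleaner rule described above (check usage, compare against a threshold, pick the container using the most disk) can be illustrated with a minimal sketch. This is not the OpenPAI implementation: the names `ContainerUsage` and `pick_container_to_kill` are hypothetical, and only the 60-second interval, the 94% threshold and signal 10 come from the release note.

```python
from dataclasses import dataclass
from typing import Optional, Sequence

# Values taken from the release note; both are described there as configurable.
CHECK_INTERVAL_SECONDS = 60
USAGE_THRESHOLD_PERCENT = 94.0
KILL_SIGNAL = 10  # signal number quoted in the release note


@dataclass(frozen=True)
class ContainerUsage:
    """Hypothetical record: a container name and the disk space it uses."""
    name: str
    disk_bytes: int


def pick_container_to_kill(
    disk_usage_percent: float,
    containers: Sequence[ContainerUsage],
    threshold_percent: float = USAGE_THRESHOLD_PERCENT,
) -> Optional[ContainerUsage]:
    """Return the container using the most disk if usage is above the threshold.

    Returns None when usage is at or below the threshold, or when there are
    no containers to choose from.
    """
    if disk_usage_percent <= threshold_percent or not containers:
        return None
    return max(containers, key=lambda c: c.disk_bytes)


if __name__ == "__main__":
    sample = [
        ContainerUsage("job-a", 10 * 2**30),
        ContainerUsage("job-b", 50 * 2**30),
        ContainerUsage("job-c", 30 * 2**30),
    ]
    victim = pick_container_to_kill(95.0, sample)
    if victim is not None:
        print(f"would send signal {KILL_SIGNAL} to {victim.name}")
```

A real cleaner would call such a function once per check interval and then deliver the signal through the container runtime; that part is omitted here because the release note does not describe it.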

For more details about this release, please refer to detailed release note.

Feb 2019 (version 0.9.1)

minor release with hot fix.

Feb 2019 (version 0.9.0)

  • Add pai service dashboard to grafana. Cluster admins can view pai services resource consumption on the paiServiceMetrics page. #1694
  • Support adding custom web pages to the web portal of PAI deployments with WebPortal Plugin. Refer to Plugins Doc for how to use the new feature, and refer to PR 1700 for an example of how PAI Marketplace uses it.
  • Support updating virtual clusters dynamically from webportal. Please refer to virtual cluster management for how to use this new feature. #1831 #1974
  • Support customized job environment variables. #1544
  • Add VS Code client for PAI. Please refer to OpenPAI VS Code Client for more details.

For more details about this release, please refer to detailed release note.

Dec 2018 (version 0.8.3)

minor release with hot fix.

Nov 2018 (version 0.8.2)

minor release with hot fix.

Nov 2018 (version 0.8.1)

minor release with hot fix.

Oct 2018 (version 0.8.0)

  • All user submitted jobs can be cloned and resubmitted in Job detail page #1448.
  • The newly designed Marketplace and Submit Job V2 are under public review. Please refer to the instructions for Marketplace and Submit job v2 for more information. Any feedback and suggestions are appreciated.
  • Alerting service supports muting alerts. The instructions can be found via alert-manager.
  • New Feedback Button: users are allowed to submit GitHub Issues with appended OpenPAI version directly from WebUI #1289.

For more details about this release, please refer to detailed release note.

Sep 2018 (version 0.7.2)

minor release with hot fix.

Aug 2018 (version 0.7.1)

  • Administrators can receive email notifications on cluster problems after setting up the newly supported "Alert Manager". Please read more about how to set up Alert Manager and the notification Rules.

For more details about this release, please refer to detailed release note.

July 2018 (version 0.6.1)

  • The 'paictl' tool: Introducing paictl, the deployment/management tool with the functionalities of image building, service start/stop, k8s bootup/clean, and configuration generation.
  • Single-box deployment: Support single-box deployment for evaluation purposes.
  • New UI for user management: The console for administrators to manage PAI users now has a new UI.
  • Documentation: Significant changes to documents -- more comprehensive, more structured, and easier to follow.

For more details about this release, please refer to detailed release note.