pai/frameworklauncher
Yuqi Wang 4af6efa996
[Launcher]: Support ACL Backward Compatibility (#1184)
2018-08-28 17:11:11 +08:00
bin [Launcher]: Add execution mode for start.sh (#1141) 2018-08-24 13:42:41 +08:00
conf [Launcher]: Refine GangAllocation (#798) 2018-07-03 21:40:11 +08:00
doc [Launcher]: Add RetryPolicy Doc (#812) 2018-07-05 16:44:33 +08:00
src [Launcher]: Support ACL Backward Compatibility (#1184) 2018-08-28 17:11:11 +08:00
README.md [Launcher]: Refine GangAllocation (#798) 2018-07-03 21:40:11 +08:00
build-internal.bat [Launcher]: Support LAUNCHER_OPTS (#1139) 2018-08-24 13:17:18 +08:00
build.bat Initial aii 2017-11-21 14:45:44 +08:00
build.sh [Launcher]: Support LAUNCHER_OPTS (#1139) 2018-08-24 13:17:18 +08:00
pom.xml [Launcher]: Refine Models (#685) 2018-06-08 14:46:37 +08:00

README.md

Microsoft FrameworkLauncher

FrameworkLauncher (or Launcher for short) is built to enable running Large-Scale Long-Running Services inside YARN Containers without making changes to the Services themselves. It also supports Batch Jobs, such as TensorFlow, CNTK, etc.

Features

  • High Availability

    • All Launcher and Hadoop components are Recoverable and Work Preserving, so User Services are by design No Down Time, i.e. they stay uninterrupted when our components shut down, crash, upgrade, or even suffer any kind of outage for a long time.
    • Launcher has a well-defined Failure Model and can tolerate many unexpected errors, such as dependent component shutdown, machine error, network error, configuration error, environment error, corrupted internal data, etc.
    • User Services are ensured to Retry on Transient Failures, can be Migrated to another Node per User's Request, etc.
  • High Usability

    • No User code changes are needed to run an existing executable inside a Container. The User only needs to set up the FrameworkDescription in JSON format.
    • Idempotent RestAPI is supported.
    • Work Preserving FrameworkDescription Update, such as changing TaskNumber or adding a TaskRole on the fly.
    • Migrate running Task per User's Request
    • Override default ApplicationProgress per User's Request
  • Services and Batch Jobs Requirements

    • GPU Scheduling: Dynamic Topology-Aware GPU Allocation
    • Port Scheduling: Static or Dynamic Port Allocation
    • Gang Scheduling (Gang Allocation): Start Services in an all-or-nothing fashion
    • Anti-affinity Scheduling (Anti-affinity Allocation): Start Services on different Nodes
    • Versioned Service Deployment
    • ServiceDiscovery
    • ApplicationCompletionPolicy
    • Framework Tree Management: DeleteOnParentDeleted, StopOnParentStopped
    • DataPartition
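
To make the Gang Allocation and Anti-affinity Allocation items above more concrete, here is a minimal Python sketch of the two ideas. It is not Launcher's implementation: the function name gang_allocate, the node names, and the "free slots per node" model are all hypothetical and chosen only for illustration. The sketch shows that a gang is either placed completely or not at all, and that anti-affinity limits each Node to one Task of the gang.

```python
from typing import Dict, List, Optional


def gang_allocate(task_count: int, free_slots: Dict[str, int],
                  anti_affinity: bool = False) -> Optional[List[str]]:
    """Return one node name per task, or None if the whole gang does not fit.

    free_slots is a hypothetical model: node name -> number of tasks it can host.
    Nothing is handed out unless every task can be placed (all-or-nothing).
    """
    if task_count == 0:
        return []
    placement: List[str] = []
    for node in sorted(free_slots):
        # With anti-affinity, a node hosts at most one task of the gang.
        capacity = min(free_slots[node], 1) if anti_affinity else free_slots[node]
        take = max(0, min(capacity, task_count - len(placement)))
        placement.extend([node] * take)
        if len(placement) == task_count:
            return placement
    # A partial placement is discarded instead of being started.
    return None


nodes = {"node-a": 2, "node-b": 1}
print(gang_allocate(3, nodes))                      # ['node-a', 'node-a', 'node-b']
print(gang_allocate(4, nodes))                      # None
print(gang_allocate(3, nodes, anti_affinity=True))  # None
print(gang_allocate(2, nodes, anti_affinity=True))  # ['node-a', 'node-b']
```

The third call returns None because only two Nodes exist, so three Tasks cannot all land on different Nodes; under all-or-nothing semantics no Task is started in that case.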

Build and Start

Dependencies

Compile-time dependencies:

Run-time dependencies:

  • Hadoop 2.7.2 with YARN-7481 is required to support GPU Scheduling and Port Scheduling. If you do not need them, any Hadoop 2.7+ is fine.
  • Apache ZooKeeper

Build Launcher Distribution

The Launcher Distribution is built into the folder .\dist.

Windows cmd line:

.\build.bat

GNU/Linux cmd line:

./build.sh

Start Launcher Service

The Launcher Distribution must be built before starting the Launcher Service.

Windows cmd line:

.\dist\start.bat

GNU/Linux cmd line:

./dist/start.sh

User Manual

See the User Manual to learn how to use the Launcher Service to launch a Framework.