pai/frameworklauncher
Yuqi Wang 37b7823660 [Launcher]: Setup test with coveralls report 2018-04-02 16:50:01 +08:00

Microsoft FrameworkLauncher

FrameworkLauncher (or Launcher for short) is built to enable running Large-Scale Long-Running Services inside YARN Containers without making changes to the Services themselves. It also supports Batch Jobs, such as TensorFlow, CNTK, etc.

Features

  • High Availability

    • All Launcher and Hadoop components are Recoverable and Work Preserving. User Services therefore have No Down Time by design, i.e. they stay uninterrupted when our components shut down, crash, or upgrade, or even during any kind of long-lasting outage.
    • Launcher tolerates many unexpected errors and has a well-defined Failure Model covering dependent component shutdown, machine errors, network errors, configuration errors, environment errors, corrupted internal data, etc.
    • User Services can be ensured to Retry on Transient Failures, Migrate to another Machine per User's Request, etc.
  • High Usability

    • No User code changes are needed to run an existing executable inside a Container. Users only need to set up the FrameworkDescription in JSON format.
    • RestAPI is supported.
    • Work Preserving FrameworkDescription Update, such as changing TaskNumber or adding a TaskRole on the fly.
    • Migrate running Task per User's Request
    • Override default ApplicationProgress per User's Request
  • Services Requirements

    • Versioned Service Deployment
    • ServiceDiscovery
    • AntiaffinityAllocation: Services running on different Machines
  • Batch Jobs Requirements

    • GPU as a Resource
    • Port as a Resource
    • GangAllocation: Start Services together
    • KillAllOnAnyCompleted and KillAllOnAnyServiceCompleted
    • Framework Tree Management: DeleteOnParentDeleted, StopOnParentStopped
    • DataPartition

Build and Start

Dependencies

Compile-time dependencies:

Run-time dependencies:

  • Hadoop 2.7.2 with YARN-7481 is required to support GPU as a Resource and Port as a Resource. If you do not need these features, any Hadoop 2.7+ is fine.
  • Apache Zookeeper
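
The Hadoop version requirement above can be checked before building. The following is a minimal sketch, not part of Launcher: the function names `hadoop_version_ok` and `installed_hadoop_version` are hypothetical, "2.7+" is assumed to mean version 2.7 or later, and the `hadoop` command is assumed to be on the PATH and to print `Hadoop X.Y.Z` on the first line of `hadoop version`. It cannot tell whether the YARN-7481 patch has been applied.

```shell
#!/bin/sh
# Hypothetical helper: succeeds when the given version string
# (e.g. "2.7.2") is at least 2.7. Assumes "2.7+" means 2.7 or later.
hadoop_version_ok() {
  major=$(printf '%s' "$1" | cut -d. -f1)
  minor=$(printf '%s' "$1" | cut -d. -f2)
  [ "$major" -gt 2 ] || { [ "$major" -eq 2 ] && [ "$minor" -ge 7 ]; }
}

# Hypothetical helper: prints the installed version, taken from the
# second word of the first line of `hadoop version`.
installed_hadoop_version() {
  hadoop version 2>/dev/null | awk 'NR==1 {print $2}'
}
```

For example, `hadoop_version_ok "$(installed_hadoop_version)" || echo "Hadoop 2.7+ not found"` reports a missing or outdated installation.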

Build Launcher Distribution

The Launcher Distribution is built into the folder .\dist.

Windows cmd line:

.\build.bat

GNU/Linux cmd line:

./build.sh

Start Launcher Service

The Launcher Distribution must be built before starting the Launcher Service.

Windows cmd line:

.\dist\start.bat

GNU/Linux cmd line:

./dist/start.sh
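
On GNU/Linux, the build-then-start order can be wrapped in a small script. This is only a sketch under stated assumptions: the function names `ensure_dist` and `start_launcher` and the directory argument are hypothetical, and the presence of `dist/start.sh` is assumed to be enough to tell that the distribution has already been built. Only `./build.sh` and `./dist/start.sh` come from this README.

```shell
#!/bin/sh
# Hypothetical helper: build the distribution only when it is missing.
# $1 is the frameworklauncher directory (defaults to the current one).
ensure_dist() {
  root="${1:-.}"
  if [ ! -f "$root/dist/start.sh" ]; then
    # Run the build from the launcher directory; fail if the build fails.
    (cd "$root" && ./build.sh) || return 1
  fi
}

# Hypothetical helper: make sure the distribution exists, then start it.
start_launcher() {
  root="${1:-.}"
  ensure_dist "$root" && (cd "$root" && ./dist/start.sh)
}
```

Calling `start_launcher .` from the frameworklauncher directory builds on the first run and skips the build on later runs.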

User Manual

See the User Manual to learn how to use the Launcher Service to launch a Framework.