An experimental parallel training platform


SuperScaler

SuperScaler is an open-source distributed platform for deep learning training. It aims to provide transparent distributed training support across different platforms, with high adaptability to newly emerging parallelism algorithms and optimizations. By leveraging existing deep learning frameworks such as TensorFlow and NNFusion for local execution, while supporting efficient distributed training through highly optimized communication stacks, SuperScaler explores new opportunities in parallel deep learning training.

Status

(alpha preview)

  • Data-parallelism enabled for multi-GPU parallel training
  • Support flexible communication, e.g., building AllReduce with primitives Send and Receive
  • TensorFlow 1.x and NNFusion supported
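The "flexible communication" bullet above means that collectives are composed from point-to-point primitives. As a purely illustrative sketch (hypothetical code, not the SuperScaler API), here is how a ring AllReduce can be expressed as a schedule of Send/Receive steps, simulated over in-memory workers:

```python
# Hypothetical illustration, NOT the SuperScaler API: composing an AllReduce
# from point-to-point Send/Receive steps, as the "flexible communication"
# bullet describes. A ring of n simulated workers performs a reduce-scatter
# followed by an all-gather; each worker's tensor is split into n chunks.

def ring_allreduce(values):
    """values[w][c] is worker w's contribution to chunk c (n workers, n chunks).
    Returns the state where every worker holds the elementwise sums."""
    n = len(values)
    data = [list(v) for v in values]

    # Reduce-scatter: after n-1 steps, worker w holds the full sum of
    # chunk (w + 1) % n.
    for step in range(n - 1):
        # Snapshot this step's Sends so in-place Receives don't interfere.
        sends = [(w, (w - step) % n, data[w][(w - step) % n]) for w in range(n)]
        for w, chunk, val in sends:
            data[(w + 1) % n][chunk] += val  # Receive + accumulate

    # All-gather: circulate the reduced chunks until everyone has every sum.
    for step in range(n - 1):
        sends = [(w, (w + 1 - step) % n, data[w][(w + 1 - step) % n]) for w in range(n)]
        for w, chunk, val in sends:
            data[(w + 1) % n][chunk] = val   # Receive + overwrite
    return data
```

Each worker sends 2(n-1) chunks in total, one per step, which is what makes the ring schedule bandwidth-efficient and a natural fit for Send/Receive composition.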

Install

Install on a Bare-metal Machine

  • Install dependencies

    # Install tools needed
    sudo apt-get update && sudo apt-get install -y build-essential wget
    
    # We require cmake >= 3.17, so we need to install it manually
    wget -qO- "https://cmake.org/files/v3.18/cmake-3.18.2-Linux-x86_64.tar.gz" | sudo tar --strip-components=1 -xz -C /usr/local
    
    # Make sure you use Python 3.6 or 3.7, because
    # TensorFlow 1.15 does not support Python 3.8 or higher.
    python3 --version
    
    # Make sure you use TensorFlow 1.15 rather than TensorFlow 2
    pip3 install tensorflow==1.15
    python3 -c 'import tensorflow as tf; print(tf.__version__)'
    # (then '1.15.x' will be printed)
    
  • Install from source code

    Simply use pip to build and install:

    git clone https://github.com/microsoft/superscaler.git
    cd superscaler
    pip3 install .
    
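Since the wrong interpreter version only fails later, at `pip3 install tensorflow==1.15`, it can help to check compatibility up front. A minimal sketch (hypothetical helper, not part of SuperScaler):

```python
import sys

def tf115_compatible(version_info):
    """TensorFlow 1.15 only ships wheels for CPython 3.5-3.7."""
    return (3, 5) <= tuple(version_info[:2]) <= (3, 7)

if __name__ == "__main__":
    major, minor = sys.version_info[:2]
    if tf115_compatible(sys.version_info):
        print(f"Python {major}.{minor}: OK for TensorFlow 1.15")
    else:
        print(f"Python {major}.{minor}: not supported by TensorFlow 1.15")
```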

Run with Docker

Running SuperScaler in a Docker environment is the easiest way to get started.

  • Build SuperScaler Docker:

    sudo docker build -t superscaler -f Dockerfile.CUDA .
    
  • Then run the container in interactive mode:

    sudo docker run -it --runtime=nvidia superscaler bash
    
    # (you are now inside the container's bash shell)
    

Run your first model with SuperScaler

Here we use a TensorFlow model as an example.

  • First, create a file 'resource_pool.yaml' and fill in the resource information. You can get a sample resource_pool.yaml here.

  • Then, build a TensorFlow model and get its train_op and loss_op. You can get a sample TensorFlow model here.

  • Finally, set up and run SuperScaler with this TensorFlow model as follows:

    import superscaler.tensorflow as superscaler
    from superscaler.scaler_graph import DataParallelism
    import argparse
    
    # Define your TensorFlow model here; replace this stub with your own.
    def tensorflow_model():
        ...
        ...
    
        # return the train op and loss op, for superscaler to run this model
        return train_op, loss_op
    
    sc = superscaler()
    
    # To configure SuperScaler
    
    train_op, loss_op = tensorflow_model()
    strategy = DataParallelism(range(2))
    deployment_setting = {"1": "localhost"}
    communication_DSL = "ring"
    resource_pool = "resource_pool.yaml"
    
    sc.init(train_op, loss_op, deployment_setting, strategy,
            communication_DSL, resource_pool)
    
    # To run your program
    
    parser = argparse.ArgumentParser()
    args, _ = parser.parse_known_args()
    
    args.steps = 10
    args.interval = 5
    args.print_info = True
    args.print_fetches_targets = True
    
    sc.run(args)
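
In the example above the run options are set directly on the argparse namespace. Assuming `sc.run` only reads the four attributes shown (steps, interval, print_info, print_fetches_targets), they could equivalently be exposed as command-line flags — a hypothetical variant, not something SuperScaler requires:

```python
import argparse

# Hypothetical flag-based variant of the configuration above; the attribute
# names match those set in the example, which is all sc.run is assumed to read.
parser = argparse.ArgumentParser()
parser.add_argument("--steps", type=int, default=10,
                    help="number of training steps to run")
parser.add_argument("--interval", type=int, default=5,
                    help="steps between progress reports")
parser.add_argument("--print_info", action="store_true",
                    help="print run information")
parser.add_argument("--print_fetches_targets", action="store_true",
                    help="print fetched targets each interval")
args, _ = parser.parse_known_args()

# sc.run(args)  # pass the parsed namespace as before
```

This lets `python3 train.py --steps 100 --print_info` override the defaults without editing the script.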
    
Microsoft Open Source Code of Conduct

This project has adopted the Microsoft Open Source Code of Conduct.

    Resources: