An experimental parallel training platform


Jianfeng Wang bbe13fb69d Merged PR 1188: SuperScaler: Revise Code - Add copyright headers to some scripts ## Overview some missed work of cleaning the repo - Add header to scripts - remove gpu_simulator's flake8 exclude		2020-11-27 04:52:11 +00:00
cmake	Merged PR 1062: SuperScaler: Add Feature - Add backend dependency script	2020-10-10 04:42:48 +00:00
docs	Merged PR 1186: AISimulator: Remove File - Remove docx version of Design doc, keep only md version	2020-11-25 11:46:55 +00:00
example	Merged PR 1188: SuperScaler: Revise Code - Add copyright headers to some scripts	2020-11-27 04:52:11 +00:00
external	Merged PR 1182: SuperScaler: Revise Code - Add all files a copyright header	2020-11-25 11:33:16 +00:00
src	Merged PR 1188: SuperScaler: Revise Code - Add copyright headers to some scripts	2020-11-27 04:52:11 +00:00
tests	Merged PR 1187: SuperScaler: Revise Code - Little revise to meet pep2 code style check	2020-11-25 12:12:11 +00:00
tools	Merged PR 1188: SuperScaler: Revise Code - Add copyright headers to some scripts	2020-11-27 04:52:11 +00:00
.gitignore	Change the file structure	2020-03-01 21:37:59 +08:00
CMakeLists.txt	Merged PR 1182: SuperScaler: Revise Code - Add all files a copyright header	2020-11-25 11:33:16 +00:00
Dockerfile.CUDA	Merged PR 1188: SuperScaler: Revise Code - Add copyright headers to some scripts	2020-11-27 04:52:11 +00:00
Dockerfile.test.CPU	Merged PR 1188: SuperScaler: Revise Code - Add copyright headers to some scripts	2020-11-27 04:52:11 +00:00
Dockerfile.test.CUDA	Merged PR 1188: SuperScaler: Revise Code - Add copyright headers to some scripts	2020-11-27 04:52:11 +00:00
README.md	Merged PR 1144: SuperScaler: Revise Readme - Add quickstart case model	2020-11-20 04:54:35 +00:00
azure-pipelines-cuda.yml	Merged PR 1171: SuperScaler: Set Pipeline: Set up CUDA pipeline with gpu support	2020-11-20 01:34:33 +00:00
azure-pipelines.yml	Merged PR 1098: SuperScaler: Add Feature - Add GPU and CPU Docker for all unitests.	2020-10-27 05:52:34 +00:00
requirements.txt	Merged PR 1098: SuperScaler: Add Feature - Add GPU and CPU Docker for all unitests.	2020-10-27 05:52:34 +00:00
setup.cfg	Merged PR 1188: SuperScaler: Revise Code - Add copyright headers to some scripts	2020-11-27 04:52:11 +00:00
setup.py	Merged PR 1188: SuperScaler: Revise Code - Add copyright headers to some scripts	2020-11-27 04:52:11 +00:00

README.md

SuperScaler

SuperScaler is an open-source distributed platform for deep learning training. It aims to provide transparent distributed training support across different platforms and to adapt readily to emerging parallelism algorithms and optimizations. SuperScaler leverages existing deep learning frameworks such as TensorFlow and NNFusion for local execution, and supports efficient distributed training with highly optimized communication stacks. With this combination, SuperScaler explores new opportunities in parallel deep learning training.

Status

(alpha preview)

Data-parallelism enabled for multi-GPU parallel training
Flexible communication supported, e.g., building AllReduce from the Send and Receive primitives
TensorFlow 1.x and NNFusion supported
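
To illustrate what building AllReduce from Send and Receive can look like, here is a minimal in-process sketch of a ring AllReduce. It is not SuperScaler's implementation: the `send` and `receive` helpers, the `mailbox` list, and the function names are all hypothetical, and the "workers" are plain Python lists rather than GPUs.

```python
def chunk_bounds(length, parts):
    """Split range(length) into `parts` contiguous slices of near-equal size."""
    base, extra = divmod(length, parts)
    bounds, start = [], 0
    for i in range(parts):
        end = start + base + (1 if i < extra else 0)
        bounds.append((start, end))
        start = end
    return bounds


def ring_allreduce(vectors):
    """Sum equal-length vectors across workers using only send/receive on a ring.

    Hypothetical simulation: worker `rank` only ever sends to (rank + 1) % n.
    Returns one result buffer per worker; all buffers hold the element-wise sum.
    """
    n = len(vectors)
    data = [list(v) for v in vectors]      # each worker's local buffer
    bounds = chunk_bounds(len(data[0]), n)
    mailbox = [None] * n                   # mailbox[r]: message in flight to worker r

    def send(dst, chunk_id, payload):
        mailbox[dst] = (chunk_id, payload)

    def receive(rank):
        message, mailbox[rank] = mailbox[rank], None
        return message

    # Phase 1 (reduce-scatter): after n - 1 steps, worker r holds the
    # fully reduced chunk (r + 1) % n.
    for step in range(n - 1):
        for rank in range(n):
            chunk = (rank - step) % n
            lo, hi = bounds[chunk]
            send((rank + 1) % n, chunk, data[rank][lo:hi])
        for rank in range(n):
            chunk, payload = receive(rank)
            lo, _ = bounds[chunk]
            for offset, value in enumerate(payload):
                data[rank][lo + offset] += value

    # Phase 2 (all-gather): circulate the reduced chunks around the ring.
    for step in range(n - 1):
        for rank in range(n):
            chunk = (rank + 1 - step) % n
            lo, hi = bounds[chunk]
            send((rank + 1) % n, chunk, data[rank][lo:hi])
        for rank in range(n):
            chunk, payload = receive(rank)
            lo, hi = bounds[chunk]
            data[rank][lo:hi] = payload

    return data
```

In data-parallel training, each worker would pass its local gradient vector to such a routine and divide the result by the number of workers to obtain the averaged gradient.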

Install

Install on a Bare-metal Machine

Install dependencies

# Install tools needed
sudo apt-get update && sudo apt-get install build-essential wget

# We require cmake >= 3.17, so we need to install it manually
wget -qO- "https://cmake.org/files/v3.18/cmake-3.18.2-Linux-x86_64.tar.gz" | sudo tar --strip-components=1 -xz -C /usr/local

# Make sure you use Python 3.6 or 3.7, because
# TensorFlow 1.15 does not support Python 3.8 or higher.
python3 --version

# Make sure you use TensorFlow 1.15 rather than TensorFlow 2
pip3 install tensorflow==1.15
python3 -c 'import tensorflow as tf; print(tf.__version__)'
# (then '1.15.x' will be printed)
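
The version checks above can also be scripted. The following is a small sketch, not part of SuperScaler; the function names are hypothetical, and the accepted versions simply mirror the notes above (Python 3.6 or 3.7, TensorFlow 1.15.x).

```python
import sys


def python_supported(version_info=sys.version_info):
    """Return True for Python 3.6 or 3.7, the versions named in the notes above."""
    return (version_info[0], version_info[1]) in {(3, 6), (3, 7)}


def tensorflow_supported(version_string):
    """Return True if the version string is a 1.15.x release."""
    return version_string.split(".")[:2] == ["1", "15"]


def check_environment():
    """Collect human-readable problems; an empty list means the checks passed."""
    problems = []
    if not python_supported():
        problems.append(
            "Python %d.%d found; 3.6 or 3.7 expected" % tuple(sys.version_info[:2])
        )
    try:
        import tensorflow as tf
    except ImportError:
        problems.append("tensorflow is not installed")
    else:
        if not tensorflow_supported(tf.__version__):
            problems.append("tensorflow %s found; 1.15.x expected" % tf.__version__)
    return problems


if __name__ == "__main__":
    for problem in check_environment():
        print(problem)
```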

Install from source code

Simply use pip to build and install:

git clone https://github.com/microsoft/superscaler.git
cd superscaler
pip3 install .

Run with Docker

Running SuperScaler in a Docker environment is the easiest method.

Build the SuperScaler Docker image:

sudo docker build -t superscaler -f Dockerfile.CUDA .

Then run the container in interactive mode:

sudo docker run -it --runtime=nvidia superscaler bash

# (you are now in the container's bash shell)

Run your first model with SuperScaler

Here we use a TensorFlow model as an example.

First, create a file named 'resource_pool.yaml' and fill in the resource information. You can get a sample resource_pool.yaml here.
Then build a TensorFlow model and obtain its train_op and loss_op. You can get a sample TensorFlow model here.

Finally, set up and run SuperScaler with this TensorFlow model as follows:

import superscaler.tensorflow as superscaler
from superscaler.scaler_graph import DataParallelism
import argparse

# Define your TensorFlow model here. You can replace this stub with your own.
def tensorflow_model():
    ...
    ...

    # return the train op and loss op, for superscaler to run this model
    return train_op, loss_op

sc = superscaler()

# To configure SuperScaler

train_op, loss_op = tensorflow_model()
strategy = DataParallelism(range(2))
deployment_setting = {"1": "localhost"}
communication_DSL = "ring"
resource_pool = "resource_pool.yaml"

sc.init(train_op, loss_op, deployment_setting, strategy,
        communication_DSL, resource_pool)

# To run your program

parser = argparse.ArgumentParser()
args, _ = parser.parse_known_args()

args.steps = 10
args.interval = 5
args.print_info = True
args.print_fetches_targets = True

sc.run(args)
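
The example above assigns the run settings as attributes on the parsed namespace. As an alternative sketch, the same four attributes could be declared as command-line flags. The flag names and the `build_run_args` helper are hypothetical; the attribute names and default values are taken from the example above, and their exact meaning is whatever `sc.run` defines.

```python
import argparse


def build_run_args(argv=None):
    """Build a namespace with the attributes set in the example above.

    The flag names are illustrative only. parse_known_args ignores
    unrecognized flags, matching the original example.
    """
    parser = argparse.ArgumentParser(description="Run settings for a SuperScaler job")
    parser.add_argument("--steps", type=int, default=10,
                        help="value for args.steps")
    parser.add_argument("--interval", type=int, default=5,
                        help="value for args.interval")
    parser.add_argument("--no-print-info", dest="print_info",
                        action="store_false",
                        help="set args.print_info to False")
    parser.add_argument("--no-print-fetches-targets", dest="print_fetches_targets",
                        action="store_false",
                        help="set args.print_fetches_targets to False")
    args, _unknown = parser.parse_known_args(argv)
    return args


if __name__ == "__main__":
    # With no flags given, this reproduces the values used above.
    print(build_run_args())
```

The resulting namespace could then be passed to `sc.run(args)` in place of the manually populated one.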

Microsoft Open Source Code of Conduct

This project has adopted the Microsoft Open Source Code of Conduct.

Resources:
- Microsoft Open Source Code of Conduct
- Microsoft Code of Conduct FAQ
- Contact opencode@microsoft.com with questions or concerns
