antares

Antares: an automatic engine for multi-platform kernel generation and optimization. Supporting CPU, CUDA, ROCm, DirectX12, GraphCore, SYCL for CPU/GPU, OpenCL for AMD/NVIDIA, Android CPU/GPU backends.



README.md

What is Antares:

Antares is an automatic engine that generates optimized multi-platform kernels for DNN developers (targeting backends such as CUDA/ROCm/CPU/DirectX12/Graphcore/OneAPI/..). It is also a framework that lets hardware developers add new backends/hardware quickly and easily. Antares provides an IR that follows the principle "One Language Syntax for All Platforms", and general-purpose device access APIs that hide the differences in both DNN description and device mapping.

Features
- Backend Extension
- Effective Auto Tuning
- Einsum-based Antares IR
- Framework JIT Extension (Op Maker Plugin for Pytorch/Tensorflow/Tensorflow2)
How to Use Antares
- Scenario-1: Quick Start for Developers Who Use Antares to Tune an Operator/Sub-graph in a Foreground Terminal
- Scenario-2: Quick Start for Developers Who Use Antares to Extend an Operator/Sub-graph in Pytorch/Tensorflow
Antares Pre-dependencies for Different Backends
- Linux-based: cuda, rocm, mcpu, scpu, gc, sycl_intel, sycl_cuda, ocl_amdgpu, ocl_nvidia, ocl_android, ..
- Windows-based: cuda_win64, rocm_win64, hlsl_win64, ..
About Microsoft Open Source

About Antares Features:

a. Backend Extension

The current version of Antares supports code generation for the following backends (in orange blocks) and devices (in black blocks):

b. Effective Auto Tuning

Compared with TVM Ansor, auto tuning by Antares takes much less tuning time while reaching equivalent or better performance for intra-op/inter-op execution.

c. Einsum-based Antares IR

Antares IR is the frontend of both kernel generation and automatic optimization.
The syntax of Antares IR is compact, yet it can describe most MLP/CNN/RNN/LSTM/Transformer-based models such as MNIST/ResNet/BERT/GPT/..

For example, the following computation logic describes one layer of a standard BERT transformer:

  merged_layer_local[R, B, S1, N1, H1] +=! input_tensor[B, S1, N, H] * qkv_weight[R, N, H, N1, H1];
  merged_layer_trans[R, B, N1, S1, H1] = merged_layer_local[R, B, S1, N1, H1] + qkv_bias[R, N1, H1];
  attention_scores[B, N1, S1, S2] +=! merged_layer_trans[0, B, N1, S1, H1] * merged_layer_trans[1, B, N1, S2, H1] / const({H}).cast(`float32`);
    softmax_1_temp0[B, N1] >=! attention_scores[B, N1, S1, S2];
    softmax_1_temp1[B, N1] +=! (attention_scores[B, N1, S1, S2] - softmax_1_temp0[B, N1]).call(`exp`);
  attention_probs[B, N1, S1, S2] = (attention_scores[B, N1, S1, S2] - softmax_1_temp0[B, N1]).call(`exp`) / softmax_1_temp1[B, N1];
  ... ...
  layer_norm_2_src[B, S1, N2, H2] = layer_output[B, S1, N2, H2] + attention_output_norm[B, S1, N2, H2];
    layer_norm_2_temp0[B, S1] += layer_norm_2_src[B, S1, N2, H2];
    layer_norm_2_temp1[B, S1] += layer_norm_2_src[B, S1, N2, H2] * layer_norm_2_src[B, S1, N2, H2];
  layer_output_norm[B, S1, N2, H2] = (layer_norm_2_src[B, S1, N2, H2] * {N * H} - layer_norm_2_temp0[B, S1]) * (layer_norm_2_temp0[B, S1] * {N * H} - layer_norm_2_temp1[B, S1] * layer_norm_2_temp1[B, S1]).call(`max`, [1e-8]).call(`rsqrt`);

For more IR usage and examples, see the documentation: Antares IR & Examples
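As a rough reading aid for einsum-style statements, the sketch below spells out the matrix-multiply statement `output0[N, M] +=! input0[N, K] * input1[K, M]` as explicit loops in plain Python. It assumes that `+=!` means "start the output from zero, then sum over every index that appears only on the right-hand side" (here `K`). The helper name `matmul_reference` is hypothetical and not part of Antares.

```python
def matmul_reference(input0, input1):
    """Plain-Python reading of: output0[N, M] +=! input0[N, K] * input1[K, M]"""
    n_size, k_size = len(input0), len(input0[0])
    m_size = len(input1[0])
    if len(input1) != k_size:
        raise ValueError("inner dimensions (K) of input0 and input1 must match")

    # Assumed meaning of "!": the output is initialized (to zero) before reducing.
    output0 = [[0.0] * m_size for _ in range(n_size)]

    # N and M appear on the left-hand side, so they index the output;
    # K appears only on the right-hand side, so it is the reduction axis.
    for n in range(n_size):
        for m in range(m_size):
            for k in range(k_size):
                output0[n][m] += input0[n][k] * input1[k][m]
    return output0


print(matmul_reference([[1.0, 2.0], [3.0, 4.0]], [[5.0, 6.0], [7.0, 8.0]]))
# prints [[19.0, 22.0], [43.0, 50.0]]
```

A reference like this is only meant for checking small cases by hand; the generated kernels compute the same result with a tuned schedule.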

d. Pytorch/Tensorflow/Tensorflow2 Op Maker (JIT Plugin)

Antares provides a JIT plugin for Pytorch/Tensorflow/Tensorflow2 that makes it easy to add new operators to these frameworks, e.g.:

# Tensorflow/Tensorflow2 Example:
op = antares.make_op(ir='dot_0[N, M] +=! data[N, K] * weight[K, M]', feed_dict={'data': x, 'weight': y}).emit()
result_1 = sess.run(op)
print('The custom result_1 is:\n%s' % result_1)
result_2 = sess.run(tf.add(op, op))
print('The custom result_2 is:\n%s' % result_2)  

# Pytorch Example:
custom_op = CustomOp(ir='dot_0[N, M] +=! data[N, K] * weight[K, M]', feed_dict={'data': x, 'weight': y}).to(device, dtype).emit()
result = custom_op()
print('The custom result is:', result)

For complete programs, please follow examples here: Antares Examples for Pytorch and Antares Examples for TF/TF2

How to Use Antares?

Scenario-1: Quick Start for Developers Who Use Antares to Tune an Operator/Sub-graph in a Foreground Terminal:

Step-1: Prepare Environment

sudo apt install docker.io
git clone https://github.com/microsoft/antares --branch v0.2.x
cd antares/

# Write the backend type (the value of `BACKEND`) to `backend.default` to select which environment to build:
echo 'c-cuda' > backend.default

# Build the environment for this backend (if this step fails, check the "Pre-dependencies" section to see which backend-related dependencies are missing):
make

All valid backends are listed in the directory antares/backends.
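The backend selection step can also be scripted. The sketch below assumes that each valid backend appears as a sub-directory of `backends` (for example `backends/c-cuda`) and that `backend.default` sits in the antares root. Both helper names are hypothetical and not part of Antares.

```python
from pathlib import Path


def list_backends(antares_root):
    """Return the sorted names of the sub-directories of <antares_root>/backends."""
    backends_dir = Path(antares_root) / "backends"
    return sorted(p.name for p in backends_dir.iterdir() if p.is_dir())


def select_backend(antares_root, backend):
    """Write `backend` to backend.default after checking that it is listed."""
    available = list_backends(antares_root)
    if backend not in available:
        raise ValueError(f"unknown backend {backend!r}; available: {available}")
    # Same effect as: echo 'c-cuda' > backend.default
    (Path(antares_root) / "backend.default").write_text(backend + "\n")


# Usage, from inside the antares root directory:
#   print(list_backends("."))
#   select_backend(".", "c-cuda")
```

Checking the name against the directory listing first catches a mistyped backend before `make` starts building.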

Step-2: Tune a Specific Workload in Foreground

# Example-1: Run the following command in bash to tune MatMul (4096, 4096) x (4096, 4096) using 2000 trials:
COMMIT=force STEP=2000 COMPUTE_V1='- S = 4096; einstein_v2(input_dict={"input0": {"dtype": "float32", "shape": [S, S]}, "input1": {"dtype": "float32", "shape": [S, S]}}, exprss="output0[N, M] +=! input0[N, K] * input1[K, M]")' make

# Example-2: Run the following command in bash to tune MNIST-inference using 5000 trials:
COMMIT=force STEP=5000 COMPUTE_V1='- einstein_v2(input_dict={"data": {"dtype": "float32", "shape": [64, 784]}, "weight_0": {"dtype": "float32", "shape": [784, 512]}, "weight_1": {"dtype": "float32", "shape": [512, 512]}, "weight_2": {"dtype": "float32", "shape": [512, 10]}, "bias_0": {"dtype": "float32", "shape": [512]}, "bias_1": {"dtype": "float32", "shape": [512]}, "bias_2": {"dtype": "float32", "shape": [10]}}, extra_outputs=[], exprss="data_0[N, M] +=!  data[N, K] * weight_0[K, M];   data_1[N, K] =   (data_0[N, K] + bias_0[K]).call(`max`, [0.0]);   data_2[N, M] +=!  data_1[N, K] * weight_1[K, M];   data_3[N, K] =   (data_2[N, K] + bias_1[K]).call(`max`, [0.0]);   data_4[N, M] +=!  data_3[N, K] * weight_2[K, M];   data_5[N, K] =   (data_4[N, K] + bias_2[K]);")' make

Besides the detailed logs reported during tuning, the best kernel record is saved to the directory antares/codehub. If you don't want to create or overwrite a kernel record in codehub, remove the environment variable COMMIT=force from the tuning command.
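Because the `COMPUTE_V1` strings are long and easy to mistype, it can help to assemble them programmatically. The sketch below builds a string in the same shape as Example-1 and passes the tuning variables to `make` through the environment. The helper names `build_compute_v1` and `tune` are hypothetical. It also assumes that JSON encoding yields valid literals for the values used here, which holds for strings, integers and lists but not for booleans or `None`.

```python
import json
import os
import subprocess


def build_compute_v1(input_dict, exprss):
    """Format a COMPUTE_V1 value in the same shape as the examples above."""
    return "- einstein_v2(input_dict=%s, exprss=%s)" % (
        json.dumps(input_dict),
        json.dumps(exprss),
    )


def tune(antares_root, compute_v1, step, commit=False):
    """Run `make` in the antares root with the tuning variables set."""
    env = dict(os.environ, COMPUTE_V1=compute_v1, STEP=str(step))
    if commit:
        # Create/overwrite the best kernel record in codehub.
        env["COMMIT"] = "force"
    return subprocess.run(["make"], cwd=antares_root, env=env, check=True)


compute = build_compute_v1(
    {
        "input0": {"dtype": "float32", "shape": [4096, 4096]},
        "input1": {"dtype": "float32", "shape": [4096, 4096]},
    },
    "output0[N, M] +=! input0[N, K] * input1[K, M]",
)
print(compute)

# Usage, from inside the antares root directory:
#   tune(".", compute, step=2000, commit=True)
```

Leaving `commit=False` corresponds to dropping `COMMIT=force` from the command line, so existing records in codehub are not overwritten.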

Scenario-2: Quick Start for Developers Who Use Antares to Extend an Operator/Sub-graph in Pytorch/Tensorflow (currently only for the CUDA & ROCm backends):

Step-1: Prepare Environment

First complete Step-1 of Scenario-1 to prepare the environment. This avoids many environment issues in the following steps.

Step-2: Set up Background Codegen Service
```
make rest-server
```
By default, the service listens on TCP port 8880. Its purpose is to keep heavy backend-related dependencies out of Pytorch/Tensorflow, so that the JIT plugin stays lightweight.
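Before running the framework examples, it may be worth confirming that the codegen service is reachable. The sketch below only tests whether a TCP connection to the port succeeds. It does not speak the service's protocol, and the host `127.0.0.1` is an assumption for a service started on the same machine. The function name is hypothetical.

```python
import socket


def is_service_listening(host="127.0.0.1", port=8880, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


if __name__ == "__main__":
    state = "reachable" if is_service_listening() else "not reachable"
    print("codegen service on port 8880 is", state)
```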

Step-3: Set up a TF/TF2/Pytorch version that matches your CUDA/ROCm driver version. (If you have already installed TF/TF2/Pytorch, skip this step.)

Here we provide several prebuilt package sources that match different environment requirements:

  For Tensorflow 1.x & 2.x: Recommended Packages (tested in Ubuntu 20.04):
  #   Tensorflow-1 for NVIDIA CUDA 10.0:
  python3 -m pip install --upgrade pip && python3 -m pip install tensorflow-gpu==1.15.4
  #   Tensorflow-1 for NVIDIA CUDA 11.0:
  python3 -m pip install --upgrade pip && python3 -m pip install https://github.com/ghostplant/tensorflow-wheel-collections/releases/download/cuda-11/tensorflow_gpu-1.15.4_cuda11+nv-cp38-cp38-linux_x86_64.whl
  #   Tensorflow-1 for AMD ROCm 4.0:
  python3 -m pip install tensorflow-rocm==1.15.9

  #   Tensorflow-2 for NVIDIA CUDA 11.0:
  python3 -m pip install --upgrade pip && python3 -m pip install tensorflow-gpu==2.4.0
  #   Tensorflow-2 for AMD ROCm 4.0:
  python3 -m pip install tensorflow-rocm==2.4.0

  For Pytorch 1.x: Recommended Packages (tested in Ubuntu 20.04):
  #   Pytorch for NVIDIA CUDA 10.0:
  python3 -m pip install torch==1.5.0 torchvision==0.6.0 -f https://download.pytorch.org/whl/torch_stable.html
  #   Pytorch for NVIDIA CUDA 11.0:
  python3 -m pip install torch===1.7.1+cu110 torchvision===0.8.2+cu110 torchaudio===0.7.2 -f https://download.pytorch.org/whl/torch_stable.html
  #   Pytorch for AMD ROCm 4.0:
  python3 -m pip install torch torchvision -f https://download.pytorch.org/whl/rocm4.0.1/torch_stable.html
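To see which of these packages are already present before installing anything, a small check with the standard library is enough. The sketch below reads installed distribution metadata without importing the frameworks. The default package list simply mirrors the names used in the commands above, and the function name is hypothetical.

```python
from importlib import metadata


def installed_versions(packages=("tensorflow-gpu", "tensorflow-rocm", "torch", "torchvision")):
    """Map each distribution name to its installed version, or None if it is absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


for name, version in installed_versions().items():
    print(f"{name}: {version or 'not installed'}")
```

This reports only what pip has installed; whether a build matches the CUDA/ROCm driver still has to be checked against the list above.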

Step-4: Install JIT Plugin Client and Run Examples

# Set up JIT Plugin for Pytorch:
sudo python3 ./frameworks/pytorch/setup.py

# Set up JIT Plugin for Tensorflow/Tensorflow2:
sudo python3 ./frameworks/tensorflow/setup.py

# Test Examples for Pytorch (run in a subshell so the working directory stays at the antares root):
(cd ./frameworks/pytorch/examples && ./1_hello_world.py)

# Test Examples for Tensorflow:
(cd ./frameworks/tensorflow/examples && ./1_hello_world.py)

More examples here: Antares Examples for Pytorch and Antares Examples for TF/TF2

Antares Pre-dependencies for Different Backends:

Before running the make command in the antares root directory, ensure that the corresponding backend driver is installed correctly.

Pre-dependencies for backend c-cuda, c-sycl_cuda:
- Ubuntu >= 18.04
- NVIDIA CUDA toolkit (>= 10.0) installed on the host OS
- docker

Pre-dependencies for backend c-ocl_nvidia:
- Ubuntu >= 18.04
- NVIDIA CUDA toolkit (>= 10.0) installed on the host OS
- Run the bash command "make install_host" in the antares root directory beforehand

Pre-dependencies for backend c-ocl_android:
- Ubuntu >= 18.04
- Package "adb" installed; connect to a rooted Android device and ensure the command "adb shell su -c 'ls /sdcard'" works
- Run the bash command "make install_host" in the antares root directory beforehand

Pre-dependencies for backend c-rocm, c-ocl_amdgpu:
- Ubuntu >= 18.04
- AMD ROCm (>= 4.0) packages "rock-dkms" and "rock-dkms-firmware" installed on the host OS from the repo http://repo.radeon.com/rocm/apt/debian
- docker

Pre-dependencies for backend c-gc:
- Ubuntu >= 18.04
- Poplar SDK installed on the host OS, with the "popc" command available in the system PATH
- Run the bash command "make install_host" in the antares root directory beforehand

Pre-dependencies for backend c-scpu, c-mcpu, c-sycl_intel:
- Ubuntu >= 18.04
- docker

Pre-dependencies for backend c-hlsl_win64, c-hlsl_xbox:
- Windows 10 64-bit (>= 2004); run "dxdiag.exe" to ensure Direct3D 12.0 Acceleration is enabled
- Windows Subsystem for Linux 1.0 (see: How to Install WSL 1.0)
- Clone the antares repo with git inside the WSL environment; the path of the antares directory must be **visible to Windows** (e.g. "/../c/Users/me/Desktop/antares" would be OK, but "/home/me/antares" won't)
- Run the bash command "make install_host" in the antares root directory beforehand

Pre-dependencies for backend c-rocm_win64:
- Windows 10 64-bit (>= 2004)
- Windows Subsystem for Linux 1.0 (see: How to Install WSL 1.0)
- Official AMD GPU driver (release version >= 2020.11) installed; ensure C:\Windows\System32\amdhip64.dll exists after installation
- Clone the antares repo with git inside the WSL environment; the path of the antares directory must be **visible to Windows** (e.g. "/../c/Users/me/Desktop/antares" would be OK, but "/home/me/antares" won't)
- Run the bash command "make install_host" in the antares root directory beforehand

Pre-dependencies for backend c-cuda_win64:
- Windows 10 64-bit (>= 2004)
- Windows Subsystem for Linux 1.0 (see: How to Install WSL 1.0)
- Official NVIDIA CUDA driver (>= 10.0) installed; ensure C:\Windows\System32\nvcuda.dll exists after installation
- Clone the antares repo with git inside the WSL environment; the path of the antares directory must be **visible to Windows** (e.g. "/../c/Users/me/Desktop/antares" would be OK, but "/home/me/antares" won't)
- Run the bash command "make install_host" in the antares root directory beforehand
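Part of this checklist can be verified automatically. The sketch below looks only for the command-line tools named above (`docker`, `adb`, `popc`) on the system PATH. It does not check OS versions, drivers, SDK versions or the Windows backends. The table `REQUIRED_TOOLS` and the function `missing_tools` are hypothetical helpers, not part of Antares.

```python
import shutil

# Command-line tools named in the requirements above, per Linux backend.
# Drivers, toolkits and OS versions are not covered by this check.
REQUIRED_TOOLS = {
    "c-cuda": ["docker"],
    "c-sycl_cuda": ["docker"],
    "c-rocm": ["docker"],
    "c-ocl_amdgpu": ["docker"],
    "c-scpu": ["docker"],
    "c-mcpu": ["docker"],
    "c-sycl_intel": ["docker"],
    "c-ocl_android": ["adb"],
    "c-gc": ["popc"],
}


def missing_tools(backend, which=shutil.which):
    """Return the required tools for `backend` that cannot be found on PATH."""
    return [tool for tool in REQUIRED_TOOLS.get(backend, []) if which(tool) is None]


if __name__ == "__main__":
    for backend in sorted(REQUIRED_TOOLS):
        missing = missing_tools(backend)
        print(backend, "OK" if not missing else "missing: " + ", ".join(missing))
```

The `which` parameter exists only so the lookup can be replaced in tests; in normal use the default `shutil.which` is what searches PATH.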

Current Support Table:

| | HIP-C(c-rocm/c-rocm_win64) | CUDA(c-cuda/c-cuda_win64) | CPU(c-mcpu/c-scpu) | DirectX12(c-hlsl_win64) | Graphcore(c-gc) | Intel OneAPI(c-sycl_intel) | Codeplay DPCPP (c-sycl_cuda) |
|---|---|---|---|---|---|---|---|
| Deploy Environment | Linux/WSL1 | Linux | Linux | WSL1 | Linux | Linux | |
| Target Device | AMDGPU | NVGPU | Generic CPU | Generic Graphic Card | IPU Device | Intel CPU/HD Graphic/FPGA | NVGPU |
| Global schedules | Y | Y | Y | Y | Y | Y | Y |
| Local schedules | Y | Y | Y | Y | | Y | Y |
| Head fusion | Y | Y | Y | Y | Y | Y | Y |
| Tail fusion | Y | Y | | Y | | | Y |
| Evaluator | Y | Y | Y | Y | Y | Y | Y |
| Tensorflow Plugin | Y | Y | | | | | |
| Pytorch Plugin | Y | Y | | | | | |
| Multi Kernel Eval | Y | Y | Y | Y | | Y | Y |

About Microsoft Open Source

For more information about Microsoft Open Source Policy, please see Microsoft Open Source Code of Conduct
