# What is Antares:
- Antares is an automatic engine for multi-platform kernel generation and optimization (targeting CUDA/ROCm/CPU/DirectX12/Graphcore).
- Antares simplifies most of TVM's low-level features, making them easier for DNN developers to use on Microsoft-related platforms.
- Antares follows the "One Language Syntax for All Platforms" principle to reduce description complexity across different platforms.
# Antares Functionality:
- Antares can convert computing operators from your DNN models into low-level source code for the target device (e.g. kernels, shaders, ..).
- Antares can also automatically tune and optimize these DNN operators on the end-to-end device using efficient search mechanisms and algorithms.
# Antares can especially help you in these cases:
- You want to modify fine-grained DNN workloads, but Tensorflow/Pytorch's built-in implementations are limited.
- You notice some operators are inefficient, and you want to replace them with better ones easily.
- Together with MSRA's NNfusion project, you can port your full DNN models into Windows executables and get acceleration with DirectX12 on Intel/AMD/NVIDIA graphics cards.
- You want to split fine-grained operator workloads onto the local tile nodes of Graphcore, which benefits on-chip memory usage and reduces BSP communication overhead.
- You want to evaluate compiler or potential runtime efficiency on Antares-supported accelerators, e.g. NVIDIA A100.
- Antares provides a large space for researchers to work on kernel optimization, e.g. custom tuners, custom schedule policies, custom platforms, etc.
# Install Antares:
```sh
sudo apt install docker.io

git clone https://github.com/microsoft/antares
cd antares/
sudo BACKEND=c-cuda make

# If you need Antares to extend/boost Tensorflow operators, please also run:
sudo python3 ./frameworks/antares/tensorflow/setup.py
# (Recommended Tensorflow CUDA installation source: pip3 install --upgrade pip && pip3 install tensorflow-gpu==1.15.3)

# If you need Antares to extend/boost Pytorch operators, please also run:
sudo python3 ./frameworks/antares/pytorch/setup.py
# (Recommended Pytorch CUDA installation source: pip3 install torch==1.5.0 torchvision==0.6.0 -f https://download.pytorch.org/whl/torch_stable.html)
```
# Startup with First Example (CUDA example):
```sh
cd ${ANTARES_ROOT}/
sudo BACKEND=c-cuda COMPUTE_V1='- einstein_v2("output0[N, M] +=! input0[N, K] * input1[K, M]", { "input0": {"dtype": "float32", "shape": [1024, 512]}, "input1": {"dtype": "float32", "shape": [512, 512]}})' make
# Other valid platforms for the BACKEND variable include: c-rocm, c-hlsl, c-gc, c-mcpu, c-ocl, ..
```
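The same expression can be retargeted just by swapping the `BACKEND` value; here is a minimal sketch using the CPU backend (it assumes the toolchain for that backend is available in your environment):

```sh
# Same matmul expression as above, retargeted to the CPU backend
sudo BACKEND=c-mcpu COMPUTE_V1='- einstein_v2("output0[N, M] +=! input0[N, K] * input1[K, M]", { "input0": {"dtype": "float32", "shape": [1024, 512]}, "input1": {"dtype": "float32", "shape": [512, 512]}})' make
```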
# Example with Tensorflow-GPU/Pytorch-GPU:
This example shows an easy way to quickly add custom operators in Tensorflow/Pytorch; note that the operator itself is not an optimized version (not tuned).
First, launch the Antares REST server (a CUDA example):
```sh
BACKEND=c-cuda make rest-server
```
For the Tensorflow CUDA frontend, just execute the following Python script:
```python
import tensorflow as tf
from tensorflow.contrib import antares

# Two random input tensors of identical shape
x = tf.random.uniform([1024, 512])
y = tf.random.uniform([1024, 512])

# Build a custom elementwise operator from an Antares IR expression
op = antares.make_op('output0[N, M] = input0[N, M] * input1[N, M] + 1234', [x, y])

with tf.Session() as sess:
  print(sess.run(op))
```
For the Pytorch frontend, just execute the following Python script:
```python
import os, torch
from torch.contrib.antares.custom_op import CustomOp

device = torch.device("cuda")
dtype = torch.float32

custom_op = CustomOp().to(device, dtype)

kwargs = {'dtype': dtype, 'device': device, 'requires_grad': False}

x = torch.randn(1024, 512, **kwargs)
y = torch.randn(1024, 512, **kwargs)

# Evaluate the Antares IR expression on the two inputs
outputs = custom_op('output0[N, M] = input0[N, M] * input1[N, M] + 1234', [x, y])
print(outputs)
```
If you want the operator you just extended to run more efficiently, consider taking a look at the "How to Tune Expressions" section below.
# Documentation for Other Advanced Examples:
For more syntax usage and examples, please follow the documentation here: Antares IR & Examples

Antares supports multi-line statements as long as they are fuse-able, as in this ConvReluBias example:
```
conv_out[N, F, HO, WO] +=! input_data[N, C, HO + KH, WO + KW] * kernel[KH, KW, C, F] where HO in 256, WO in 256;
conv_bias[N, F, HO, WO] = conv_out[N, F, HO, WO] + bias[0, F, 0, 0];
output0[N, F, HO, WO] = conv_bias[N, F, HO, WO].when(conv_bias[N, F, HO, WO] > 0.0, 0.0);
```
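A fused multi-statement expression like this is passed through `COMPUTE_V1` exactly like a single statement. Below is a minimal sketch; the shapes in `input_dict` (N=1, C=16, F=32, a 3x3 kernel) are illustrative assumptions, not values from this README:

```sh
# Shapes are hypothetical, chosen only to keep the example self-consistent
BACKEND=c-cuda COMPUTE_V1='- einstein_v2("conv_out[N, F, HO, WO] +=! input_data[N, C, HO + KH, WO + KW] * kernel[KH, KW, C, F] where HO in 256, WO in 256; conv_bias[N, F, HO, WO] = conv_out[N, F, HO, WO] + bias[0, F, 0, 0]; output0[N, F, HO, WO] = conv_bias[N, F, HO, WO].when(conv_bias[N, F, HO, WO] > 0.0, 0.0)", input_dict={"input_data": {"dtype": "float32", "shape": [1, 16, 258, 258]}, "kernel": {"dtype": "float32", "shape": [3, 3, 16, 32]}, "bias": {"dtype": "float32", "shape": [1, 32, 1, 1]}})' make
```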
# Antares Additional Features (compared to TVM):

| | Antares | TVM |
|---|---|---|
| Platform: DirectX12 | Y | - |
| Platform: ROCm HIP C | Y | - |
| Platform: GraphCore | Y | - |
| Decoupling for Multi-Platforms | Y | - |
| Workflow: Auto Plan Spaces | Y | - |
| Workflow: Auto Infershape | Y | - |
| Language | Simple Antares IR | Hybrid Script/Topi/.. |
| Framework: Custom Op for Tensorflow | Y | - |
| Framework: Custom Op for Pytorch | Y | - |
# Current Feature Table:

| | HIP-C(c-rocm) | CUDA(c-cuda) | CPU(c-mcpu) | DirectX12(c-hlsl) | Graphcore(c-gc) | (..coming soon..) |
|---|---|---|---|---|---|---|
| Global schedules | Y | Y | Y | Y | Y | |
| Local schedules | Y | Y | Y | Y | | |
| Head fusion | Y | Y | Y | Y | Y | |
| Tail fusion | Y | Y | Y | Y | Y | |
| Evaluator | Y | Y | Y | Y | | |
| Tensorflow Plugin | Y | Y | | | | |
| Pytorch Plugin | Y | Y | | | | |
| NNfusion Plugin | Y | Y | Y | Y | Y | |
| Blend Intrinsic | Y | Y | Y | Y | | |
# How to Tune Expressions:
If you want Antares to automatically optimize the operator described in your environment variable `COMPUTE_V1`, you just need to add one more variable to your first-run example: `STEP=1000`, which means Antares will take 1000 trials to search for a potentially better kernel version. For example:
```sh
STEP=1000 BACKEND=c-cuda COMPUTE_V1='- einstein_v2("output0[N, F, HO, WO] +=! input0[N, C, HO * 4 + KH, WO * 4 + KW] * input1[F, C, KH, KW] where HO in 55, WO in 55", input_dict={"input0": {"dtype": "float32", "shape": [64, 3, 227, 227]}, "input1": {"dtype": "float32", "shape": [96, 3, 11, 11]}});' make
```
This will take some time to finish. As long as your environment is correctly configured, you will eventually get a JSON-format configuration representing the best kernel version Antares found. You can then do two things:

- Re-evaluate the case Antares found using the `CONFIG` variable:

```sh
CONFIG='{"axis_0": [-1, 16, 64, 1], "reorder": [0]}' COMPUTE_V1='- einstein_v2("output0[N] = input0[N] + input1[N]", input_dict={"input0": {"dtype": "float32", "shape": [1024 * 512]}, "input1": {"dtype": "float32", "shape": [1024 * 512]}})' BACKEND=c-cuda make
```
- If you want to save the result so that frontends like Tensorflow/NNfusion can utilize the optimized kernel, append `COMMIT=1` to your case, like:

```sh
COMMIT=1 CONFIG='{"axis_0": [-1, 16, 64, 1], "reorder": [0]}' COMPUTE_V1='- einstein_v2("output0[N] = input0[N] + input1[N]", input_dict={"input0": {"dtype": "float32", "shape": [1024 * 512]}, "input1": {"dtype": "float32", "shape": [1024 * 512]}})' BACKEND=c-cuda make
```
If you want to auto-commit the result together with the tuning procedure, you can just merge `STEP` and `COMMIT` together:

```sh
COMMIT=1 STEP=1000 BACKEND=c-cuda COMPUTE_V1='- einstein_v2("output0[N, F, HO, WO] +=! input0[N, C, HO * 4 + KH, WO * 4 + KW] * input1[F, C, KH, KW] where HO in 55, WO in 55", input_dict={"input0": {"dtype": "float32", "shape": [64, 3, 227, 227]}, "input1": {"dtype": "float32", "shape": [96, 3, 11, 11]}});' make
```
After you commit the results, the Antares REST server will detect this record and respond with this code version to other frameworks the next time they request the expression case you saved.
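For reference, a committed kernel could then be fetched over plain HTTP. The sketch below assumes (this is an assumption, not something documented above) that the server reads the expression from a `COMPUTE_V1` request header on the default port:

```sh
# Assumption: the REST server accepts the expression via a COMPUTE_V1 request header
curl http://localhost:8880 -H 'COMPUTE_V1: - einstein_v2("output0[N] = input0[N] + input1[N]", input_dict={"input0": {"dtype": "float32", "shape": [1024 * 512]}, "input1": {"dtype": "float32", "shape": [1024 * 512]}})'
```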
# How to run the Antares REST Server for different platforms:
You can add the environment variable `HTTP_PORT=<portnum>` to change the listening port; by default, the server listens on localhost:8880:

```sh
BACKEND=c-cuda make rest-server
BACKEND=c-hlsl make rest-server
...
```
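For example, to serve the CUDA backend on a non-default port:

```sh
# Listen on port 8881 instead of the default localhost:8880
HTTP_PORT=8881 BACKEND=c-cuda make rest-server
```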
# How to use custom tuners as searching algorithms:
Custom tuners can be chosen by adding the variable `TUNER=..`, whose value can be any filename under the folder `tuner/` (see the sketch after this block):

```sh
STEP=100 BACKEND=c-cuda make

# Adding `RECORD=<filename>` records the incremental tuning history:
RECORD=history.log STEP=100 BACKEND=c-cuda make
```
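For example, to tune with a specific tuner implementation (the name below is a placeholder; substitute any filename that actually exists under `tuner/` in your checkout):

```sh
# <tuner_name> is a placeholder for a filename under tuner/
TUNER=<tuner_name> STEP=100 BACKEND=c-cuda make
```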
# About Microsoft Open Source
For more information about the Microsoft Open Source Policy, please see the Microsoft Open Source Code of Conduct.