Commit graph

371 commits

Author SHA1 Message Date
Yifan Xiong 664c59a14d
Docs - Update version in README (#529)
Update version in README.
2023-04-28 03:36:11 +00:00
Ziyue Yang 4cb431cab4
Benchmarks - Revise step time collection in distributed inference benchmark (#524)
**Description**
This commit revises the distributed inference benchmark to give a unified
step time result by taking the maximum step time across GPUs.
2023-04-24 10:17:49 +08:00
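The unification described in #524 can be pictured with a minimal sketch. The function name and input layout (one list of step times per GPU) are hypothetical and not taken from the SuperBench source; the sketch only shows taking the per-step maximum across GPUs.

```python
from typing import List, Sequence


def unify_step_times(per_gpu_step_times: Sequence[Sequence[float]]) -> List[float]:
    """For each step, keep the slowest (maximum) time observed on any GPU."""
    if not per_gpu_step_times:
        return []
    num_steps = len(per_gpu_step_times[0])
    if any(len(times) != num_steps for times in per_gpu_step_times):
        raise ValueError('every GPU must report the same number of steps')
    # zip(*...) regroups the data from "per GPU" to "per step".
    return [max(step) for step in zip(*per_gpu_step_times)]
```

The slowest GPU bounds the step because a distributed step is only complete once every rank has finished.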
Yifan Xiong 51761b3af1
Release - SuperBench v0.8.0 (#517)
**Description**

Cherry-pick bug fixes from v0.8.0 to main.

**Major Revisions**

* Monitor - Fix the cgroup version checking logic (#502)
* Benchmark - Fix matrix size overflow issue in cuBLASLt GEMM (#503)
* Fix wrong torch usage in communication wrapper for Distributed
Inference Benchmark (#505)
* Analyzer: Fix bug in python3.8 due to pandas api change (#504)
* Bug - Fix bug to get metric from cmd when error happens (#506)
* Monitor - Collect realtime GPU power when benchmarking (#507)
* Add num_workers argument in model benchmark (#511)
* Remove unreachable condition when write host list (#512)
* Update cuda11.8 image to cuda12.1 based on nvcr23.03 (#513)
* Doc - Fix wrong unit of cpu-memory-bw-latency in doc (#515)
* Docs - Upgrade version and release note (#508)

Co-authored-by: guoshzhao <guzhao@microsoft.com>
Co-authored-by: Ziyue Yang <ziyyang@microsoft.com>
Co-authored-by: Yuting Jiang <yutingjiang@microsoft.com>
2023-04-14 12:57:55 +00:00
Yifan Xiong 97c9a41f14
Benchmark - Update TE FP8 model conversion (#499)
__Description__

Update TE FP8 model conversion.

__Major Revisions__
* Add 16-byte alignment comment.
* Fix TE layer parameters type.
2023-03-28 15:01:41 +00:00
Yifan Xiong c88c970943
Benchmarks - Support TE FP8 in BERT/GPT2 models (#496)
Support Transformer Engine FP8 in existing PyTorch BERT/GPT2 models by
converting linear/layernorm to TE layers.
2023-03-25 19:28:27 +08:00
Ziyue Yang 8daef211dd
Benchmarks - Add distributed inference benchmark (#493)
**Description**
This PR adds a micro-benchmark of distributed model inference workloads.

**Major Revision**
- Add a new micro-benchmark dist-inference.
- Add corresponding example and unit tests.
- Update configuration files to include this new micro-benchmark.
- Update micro-benchmark README.

---------

Co-authored-by: Peng Cheng <chengpeng5555@outlook.com>
2023-03-24 17:15:17 +08:00
guoshzhao a9b45a072e
Monitor - Support cgroup V2 when read system metrics. (#491)
**Description**
Ubuntu 22.04 uses cgroup V2, which changes the file structure.
Modify the monitor to work with both cgroup v1 and v2.
2023-03-22 08:33:18 +00:00
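One common way to tell the two cgroup versions apart is to look for the `cgroup.controllers` file, which exists at the root of a cgroup v2 unified hierarchy. The sketch below assumes that check; the function name and the `cgroup_root` parameter are hypothetical, and the actual monitor logic in #491 may differ.

```python
import os


def detect_cgroup_version(cgroup_root='/sys/fs/cgroup'):
    """Return 2 if cgroup_root looks like a cgroup v2 unified hierarchy, else 1."""
    # cgroup v2 exposes cgroup.controllers at the root of the hierarchy.
    controllers_file = os.path.join(cgroup_root, 'cgroup.controllers')
    return 2 if os.path.isfile(controllers_file) else 1
```

Passing the root directory as a parameter keeps the check testable outside a real cgroup mount.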
Yifan Xiong dbeba8056b
Benchmark - Support batch/shape range in cublaslt gemm (#494)
Support batch and shape range with multiplication factors in cublaslt
gemm benchmark.
2023-03-22 13:22:36 +08:00
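A range with a multiplication factor can be read as a geometric sequence from a start value up to a stop value. The sketch below illustrates that reading only; the function name, the inclusive upper bound, and the argument order are assumptions and do not describe the benchmark's actual command-line syntax.

```python
from typing import Iterator


def scaled_range(start: int, stop: int, factor: int) -> Iterator[int]:
    """Yield start, start*factor, start*factor**2, ... while the value is <= stop."""
    if start <= 0 or factor <= 1:
        raise ValueError('start must be positive and factor must be greater than 1')
    value = start
    while value <= stop:
        yield value
        value *= factor
```

For instance, `list(scaled_range(1024, 8192, 2))` enumerates the sizes 1024, 2048, 4096, and 8192.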
rafsalas19 655bd0aa59
Adding HPL benchmark (#482)
**Description**

- Adding HPL benchmark

---------

Co-authored-by: Ubuntu <azureuser@sbtestvm.jzlku1oskncengjiado35wf1hd.ax.internal.cloudapp.net>
Co-authored-by: Peng Cheng <chengpeng5555@outlook.com>
2023-03-21 16:44:08 +00:00
Yifan Xiong 644b5395df
Benchmark - Fix torch.dist init issue with multiple models (#495)
Fix potential barrier timeout in init_process_group due to race
condition of using the same port. Change to different ports when running
multiple models sequentially in one process.
For example, running vgg11/13/16/19 uses ports 29501 to 29504,
respectively.
2023-03-21 12:35:03 +00:00
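The port assignment in #495 can be sketched as one distinct port per model, in run order. The helper name and the base port of 29500 are assumptions chosen so that the sketch reproduces the 29501 to 29504 range from the commit message; they are not taken from the SuperBench source.

```python
BASE_PORT = 29500  # assumption: chosen so the first model gets port 29501


def assign_ports(model_names, base_port=BASE_PORT):
    """Map each model to its own port so sequential runs never reuse one."""
    return {
        name: base_port + index
        for index, name in enumerate(model_names, start=1)
    }
```

Each model would then pass its own port when initializing the process group, so a port still held by the previous model cannot cause a barrier timeout.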
Yuting Jiang 5a88db1601
Benchmarks: Support error tolerance in micro-benchmark for CuDNN function (#490)
**Description**
Support error tolerance in micro-benchmark for CuDNN function


**Major Revision**
- revise micro_base to keep running the remaining commands when
one command fails in the micro-benchmark
- enable error tolerance in cudnn functions
2023-03-20 21:20:21 +08:00
Yifan Xiong b808135c27
Benchmarks - Support tensor core precisions in cublaslt gemm (#492)
Support FP64/TF32/FP16/BF16 in cublaslt (batch) GEMM.
2023-03-20 10:59:40 +08:00
dependabot[bot] 139d4df55f
Bump webpack from 5.39.1 to 5.76.1 in /website (#489)
Bumps [webpack](https://github.com/webpack/webpack) from 5.39.1 to 5.76.1.
- [Release notes](https://github.com/webpack/webpack/releases)
- [Commits](webpack/webpack@v5.39.1...v5.76.1)

---
updated-dependencies:
- dependency-name: webpack
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
2023-03-17 20:18:35 +08:00
Yifan Xiong 35f5390512
Pin setuptools version to v65.7.0 (#483)
Pin setuptools version to
[v65.7.0](https://setuptools.pypa.io/en/latest/history.html#v65-7-0) to
avoid breaking changes since v66.0.0.
2023-03-06 11:43:44 +00:00
Yifan Xiong 2cc4cd03e2
Limit ansible_runner version for Python3.6 (#485)
Limit ansible_runner version to less than 2.3.2 for Python3.6.
2023-03-06 18:54:45 +08:00
Yuting Jiang eba298f5f0
Benchmarks: Revision - Support flexible warmup and non-random data initialization in cublas-benchmark (#479)
**Description**
Revise cublas-benchmark to support flexible warmup and to fill data with a
fixed number in perf tests, improving running efficiency.

**Major Revision**
- remove num_in_steps for warmup to support more flexible warmup settings
for users
- add support for generating input with a fixed number in perf tests
2023-02-28 06:35:18 +08:00
Yuting Jiang 0292366075
Benchmarks: Build Pipeline - Add support for cpu-only perftest in makefile (#480)
**Description**
Add support to install cpu-only perftest in makefile.

Co-authored-by: Yuting Jiang <yuting.jiang@microsoft.com>
Co-authored-by: Peng Cheng <chengpeng5555@outlook.com>
2023-02-24 11:19:46 +08:00
Yifan Xiong bbb86c4a83
CI/CD - Free disk space in GitHub Action VHD (#481)
Free more disk space in GitHub Action VHD.
2023-02-23 17:30:39 +08:00
Yuting Jiang ec7f502c93
CI/CD - Upgrade networkx version to fix installation compatibility issue (#478)
**Description**
Upgrade networkx version to fix installation compatibility issue.
2023-02-17 05:36:21 +00:00
dependabot[bot] f041b6eacc
Bump @sideway/formula from 3.0.0 to 3.0.1 in /website (#477)
Bumps [@sideway/formula](https://github.com/sideway/formula) from 3.0.0 to 3.0.1.
- [Release notes](https://github.com/sideway/formula/releases)
- [Commits](hapijs/formula@v3.0.0...v3.0.1)

---
updated-dependencies:
- dependency-name: "@sideway/formula"
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
2023-02-17 02:33:08 +00:00
dependabot[bot] e1a489496c
Bump http-cache-semantics from 4.1.0 to 4.1.1 in /website (#474)
Bumps [http-cache-semantics](https://github.com/kornelski/http-cache-semantics) from 4.1.0 to 4.1.1.
- [Release notes](https://github.com/kornelski/http-cache-semantics/releases)
- [Commits](kornelski/http-cache-semantics@v4.1.0...v4.1.1)

---
updated-dependencies:
- dependency-name: http-cache-semantics
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
2023-02-16 14:14:47 +00:00
rafsalas19 32896ca477
Adding Stream Benchmark (#473)
**Description**

- Added stream benchmark
- Added stream unit test
- Added stream example
- Modified docker files to build stream

---------

Co-authored-by: Ubuntu <azureuser@sbtestvm.jzlku1oskncengjiado35wf1hd.ax.internal.cloudapp.net>
Co-authored-by: Peng Cheng <chengpeng5555@outlook.com>
Co-authored-by: Yifan Xiong <xiongyf@yandex.com>
2023-02-13 15:34:37 -05:00
Yuting Jiang 62a2913497
Executor - Support SuperBench Executor running on Windows (#475)
**Description**
Support SuperBench Executor running on Windows.

**Major Revision**
- Lazy import ansible related module
2023-02-13 08:20:07 +00:00
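Lazy importing defers loading a module until it is first used, so a platform that never touches it does not need it installed. The sketch below shows the general technique with `importlib`; the `LazyModule` class is hypothetical and is not the mechanism used in #475.

```python
import importlib


class LazyModule:
    """Proxy that imports the wrapped module on first attribute access."""

    def __init__(self, name):
        self._name = name
        self._module = None

    def __getattr__(self, attr):
        # Only reached for attributes not set in __init__.
        if self._module is None:
            self._module = importlib.import_module(self._name)
        return getattr(self._module, attr)
```

Creating `LazyModule('json')` imports nothing; the real import happens only when an attribute such as `dumps` is first accessed.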
pnunna93 f21bfef2f3
Dockerfile: Remove fixed rccl version in rocm5.1.x docker file (#476)
**Description**
The commit (e08b6d3a1c) installs an rccl
version that causes "undefined symbol: ncclGetLastError" when
importing torch. Revert it to avoid the error.
2023-02-07 15:24:26 +08:00
dependabot[bot] 121a5ddc5e
Bump ua-parser-js from 0.7.28 to 0.7.33 in /website (#469)
Bumps [ua-parser-js](https://github.com/faisalman/ua-parser-js) from 0.7.28 to 0.7.33.
- [Release notes](https://github.com/faisalman/ua-parser-js/releases)
- [Changelog](https://github.com/faisalman/ua-parser-js/blob/master/changelog.md)
- [Commits](faisalman/ua-parser-js@0.7.28...0.7.33)

---
updated-dependencies:
- dependency-name: ua-parser-js
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
2023-01-30 10:58:24 +08:00
Yifan Xiong b07fda155e
Release - SuperBench v0.7.0 (#468)
**Description**

Cherry-pick bug fixes from v0.7.0 to main.

**Major Revisions**

* Benchmarks - Fix missing include in FP8 benchmark (#460)
* Fix bug in TE BERT model (#461)
* Doc - Update benchmark doc (#465)
* Bug: Fix bug for incorrect datatype judgement in cublas-function
source code (#464)
* Support `sb deploy` without pulling image (#466)
* Docs - Upgrade version and release note (#467)

Co-authored-by: Russell J. Hewett <russell.j.hewett@gmail.com>
Co-authored-by: Yuting Jiang <yutingjiang@microsoft.com>
2023-01-28 11:07:06 +08:00
Yuting Jiang f380bc5eff
Bug: Fix bug for incorrect datatype judgement in cublas-function source code (#462)
**Description**
Fix bug for incorrect datatype judgement in cublas-function source code.
2023-01-17 10:51:57 +08:00
dependabot[bot] 65bae28c0d
Bump json5 from 1.0.1 to 1.0.2 in /website (#459)
Bumps [json5](https://github.com/json5/json5) from 1.0.1 to 1.0.2.
- [Release notes](https://github.com/json5/json5/releases)
- [Changelog](https://github.com/json5/json5/blob/main/CHANGELOG.md)
- [Commits](json5/json5@v1.0.1...v1.0.2)

---
updated-dependencies:
- dependency-name: json5
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2023-01-09 17:10:40 +08:00
Yang Wang ccccd988df
Benchmarks - Support topo-aware, pair-wise, and K-batch pattern in nccl-bw benchmark (#454)
Support traffic patterns under the different devices in NCCL/RCCL test
* change the metrics format if a pattern is specified
2023-01-04 12:30:32 +00:00
Yang Wang 8e748d5649
Runner - Generate host groups file in mpi mode (#458)
**Major Revision**

- Add a pattern option to generate the mpi_pattern.txt file if a path
is specified.
- In mpi pattern, serial_index and parallel_index are added to each
benchmark as environment variables.

**Minor Revision**
- Fix typo
2023-01-04 19:49:14 +08:00
Yifan Xiong 5197cdf5cb
Benchmarks - Support FP8 in BERT models (#446)
Support FP8 in PyTorch BERT models:

* add fp8 hybrid/e4m3/e5m2 in precision arguments
* build BERT encoders with `te.TransformerLayer` to replace
`transformers.BertModel`
* wrap forward steps with fp8 autocast
2023-01-04 11:12:05 +08:00
Yang Wang 65e433c0c6
Runner: Support `topo-aware` and `k-batch` pattern in 'mpi' mode (#437)
**Description**
Support the following patterns in `mpi` mode:
* `k-batch`
* `topo-aware`
2023-01-03 10:28:35 +00:00
Yifan Xiong fc661f7db3
Support GEMM benchmark on Hopper GPUs (#456)
Support GEMM benchmark on Hopper GPUs.
2023-01-03 09:45:27 +00:00
Yifan Xiong 616e7a5a5a
Benchmarks - Integrate cublaslt micro-benchmark (#455)
Integrate cublaslt-gemm micro-benchmark #451.
2023-01-03 08:54:40 +00:00
Yuting Jiang 75573f59da
Benchmarks: Micro benchmarks - Add correctness check in cublas-function benchmark (#452)
**Description**
Add correctness check in cublas-function benchmark.

**Major Revision**
- add python code of correctness check in cublas-function benchmark and test
2023-01-03 14:59:30 +08:00
Yifan Xiong 0591da5f49
Benchmarks - Add cuBLASLt FP16 and FP8 GEMM micro-benchmark (#451)
Add micro-benchmark for cublaslt fp8 gemm.
2023-01-03 05:28:56 +00:00
Yuting Jiang 678b1251f1
Benchmarks: Micro benchmarks - add source code of correctness check for cublas functions (#450)
**Description**
Add C source code of correctness check for cublas functions.

**Major Revision**
- add correctness check for all supported cublas functions
- add --correctness option into binary

**Minor Revision**
- fix bug and templatize fill_data and prepare_tensor to get a correctly memory-aligned output matrix for different datatypes
2023-01-03 04:20:10 +00:00
Yuting Jiang 9dfefce350
Executor - Add stdout logging util module and enable real-time logging flushing in executor (#445)
**Description**
Add stdout logging util module and enable real-time logging flushing in executor

**Major Revision**
- Add stdout logging util module to redirect stdout into file log
- enable stdout logging in executor to write benchmark output into both stdout and file `sb-bench.log`
- enable real-time log flushing in run_command of microbenchmarks through config `log_flushing`

**Minor Revision**
- add log_n_step args to enable regular step time log in model benchmarks
- update related docs
2022-12-30 09:40:28 +00:00
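Writing benchmark output to both stdout and a log file, with flushing after each write, can be sketched as a small tee object. The `StdoutTee` class is hypothetical and only illustrates the idea from #445; it is not the utility module added by that commit.

```python
class StdoutTee:
    """Forward every write to both the original stream and a log file."""

    def __init__(self, stream, log_file):
        self._stream = stream
        self._log_file = log_file

    def write(self, text):
        self._stream.write(text)
        self._log_file.write(text)
        # Flush immediately so the log can be followed in real time.
        self.flush()
        return len(text)

    def flush(self):
        self._stream.flush()
        self._log_file.flush()
```

Assigning an instance to `sys.stdout` (and restoring the original afterwards) would route `print` output through both destinations.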
Yang Wang f2634d8608
Benchmarks - Support `pair-wise` pattern in IB validation benchmark (#453)
**Description**
* Reuse `gen_pair_wise_config` in micro-benchmark
2022-12-30 13:02:52 +08:00
Yifan Xiong a3c65b2a57
Dockerfile - Add CUDA11.8 Docker image for Nvidia arch90 GPUs (#449)
Add Docker image for arch90 NVIDIA GPUs:

* add CUDA11.8 Dockerfile
* update archs in Makefile and benchmarks accordingly
* update image build pipeline
2022-12-29 12:19:38 +00:00
Yang Wang 7838b6b154
Runner - Support `pair-wise` pattern in `mpi` mode (#447)
* Extract pair-wise pattern from ib_validation
2022-12-29 08:23:36 +00:00
dependabot[bot] 6186146d59
Bump qs and express in /website (#440)
Bumps [qs](https://github.com/ljharb/qs) and [express](https://github.com/expressjs/express). These dependencies needed to be updated together.

Updates `qs` from 6.7.0 to 6.11.0
- [Release notes](https://github.com/ljharb/qs/releases)
- [Changelog](https://github.com/ljharb/qs/blob/main/CHANGELOG.md)
- [Commits](https://github.com/ljharb/qs/compare/v6.7.0...v6.11.0)

Updates `express` from 4.17.1 to 4.18.2
- [Release notes](https://github.com/expressjs/express/releases)
- [Changelog](https://github.com/expressjs/express/blob/master/History.md)
- [Commits](https://github.com/expressjs/express/compare/4.17.1...4.18.2)

---
updated-dependencies:
- dependency-name: qs
  dependency-type: indirect
- dependency-name: express
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2022-12-28 13:47:06 +08:00
dependabot[bot] de6deb0e2d
Bump decode-uri-component from 0.2.0 to 0.2.2 in /website (#439)
Bumps [decode-uri-component](https://github.com/SamVerschueren/decode-uri-component) from 0.2.0 to 0.2.2.
- [Release notes](https://github.com/SamVerschueren/decode-uri-component/releases)
- [Commits](https://github.com/SamVerschueren/decode-uri-component/compare/v0.2.0...v0.2.2)

---
updated-dependencies:
- dependency-name: decode-uri-component
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2022-12-27 13:40:49 +08:00
Yuting Jiang 6583ba2e40
Benchmark: Revision - Add wait time option to resolve mem-bw unstable issue (#438)
**Description**
Add wait time option to resolve mem-bw unstable issue.
2022-12-14 17:21:02 +08:00
Yuting Jiang 1deb2eaa29
Downgrade transformers version to fix tensorrt (#441)
**Description**
Downgrade transformers version to fix tensorrt test failure.
2022-12-14 14:19:32 +08:00
Yang Wang e4eeda0afd
Runner - support 'pattern' in 'mpi' mode to run tasks in parallel (#430)
* add mpi-parallels mode

* update according to comments

* fix and update doc

* update

* merge into 'mpi' mode

* update according to comments

* fix testcases

* fix ansible

* regard pattern as field

* update

* fix flake8 version

* add flake8 range

* remove map-by from host config

* update comments
2022-11-29 12:30:10 +08:00
dependabot[bot] 3c97381fd2
Bump loader-utils from 1.4.0 to 1.4.2 in /website (#431)
Bumps [loader-utils](https://github.com/webpack/loader-utils) from 1.4.0 to 1.4.2.
- [Release notes](https://github.com/webpack/loader-utils/releases)
- [Changelog](https://github.com/webpack/loader-utils/blob/v1.4.2/CHANGELOG.md)
- [Commits](https://github.com/webpack/loader-utils/compare/v1.4.0...v1.4.2)

---
updated-dependencies:
- dependency-name: loader-utils
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2022-11-18 15:13:04 +08:00
Yang Wang 57f7403c47
Update typing-extensions version to fix pipeline issue (#432)
2022-11-17 19:39:52 +08:00
Yifan Xiong 1b86503d1e
CLI - Add non-zero return code for `sb [deploy,run]` (#425)
Add a non-zero return code for the `sb deploy` and `sb run` commands when
there are Ansible failures in the control plane.
The return code is set to the number of failures.

For failures caused by benchmarks, the return code is still set per benchmark
in the results JSON file.
2022-11-01 10:46:19 +08:00
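The rule that the return code equals the number of control-plane failures can be sketched as a simple count. The function name and the boolean-per-task input are hypothetical; how the CLI actually collects Ansible results is not shown in the commit message.

```python
from typing import Iterable


def control_plane_return_code(task_succeeded: Iterable[bool]) -> int:
    """Return 0 when every task succeeded, otherwise the number of failures."""
    return sum(1 for ok in task_succeeded if not ok)
```

A caller could pass the result to `sys.exit`, so that a clean run exits with 0 and any failure yields a non-zero status.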
Yifan Xiong d7bb8303fb
CLI - Update version to include revision hash and date (#427)
Update version to include revision hash and date in
"{last tag}+g{git hash}.d{date}" format. Examples:
* exact tag: 0.6.0
* commit after tag: 0.6.0+gcbb1b34
* commit after tag with local changes: 0.6.0+gcbb1b34.d20221028
2022-10-31 10:44:41 +08:00
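The "{last tag}+g{git hash}.d{date}" format from #427 can be sketched as a small formatter. The function and parameter names are hypothetical; only the output format and the three example strings come from the commit message.

```python
def format_version(last_tag, git_hash=None, date=None):
    """Build a version string such as 0.6.0+gcbb1b34.d20221028."""
    version = last_tag
    if git_hash:
        # Commits after the tag carry the abbreviated hash.
        version += f'+g{git_hash}'
        if date:
            # Local changes additionally carry the date.
            version += f'.d{date}'
    return version
```

An exact tag passes only `last_tag`, a commit after the tag adds `git_hash`, and a checkout with local changes adds `date` as well.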