Commit graph

305 commits

Author SHA1 Message Date
Yifan Xiong ea2c10abc4
Config - Add T4 configurations for inference (#311)
Add T4 configurations for inference.
2022-02-20 13:00:55 +00:00
Yuting Jiang 97ed12f97f
Analyzer: Add Feature - Add multi-rules feature for data diagnosis (#289)
**Description**
Add multi-rules feature for data diagnosis to support combined checks across multiple rules.

**Major Revision**
- revise rule design to support combined checks across multiple rules
- update related code and tests
2022-02-20 16:59:38 +08:00
Yifan Xiong 1f48268bf5
Bug - Fix env file path (#310)
Fix env file path for `docker run`.
2022-02-15 15:23:43 +08:00
dependabot[bot] 53fe0c4798
Bump follow-redirects from 1.14.7 to 1.14.8 in /website (#309)
Bumps [follow-redirects](https://github.com/follow-redirects/follow-redirects) from 1.14.7 to 1.14.8.
- [Release notes](https://github.com/follow-redirects/follow-redirects/releases)
- [Commits](https://github.com/follow-redirects/follow-redirects/compare/v1.14.7...v1.14.8)

---
updated-dependencies:
- dependency-name: follow-redirects
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2022-02-15 13:04:37 +08:00
Yuting Jiang e31b8c9e08
Benchmarks: Revise Code - Add support for pytorch>=1.9.0 of init_process_group (#305)
**Description**
Add support for pytorch>=1.9.0 of init_process_group.

**Major Revision**
- Use PrefixStore(TCPStore) to call init_process_group manually for each model run
2022-02-10 22:44:01 +08:00
Yuting Jiang 4abda6f5d4
Benchmarks: Build Pipeline - Update rccl-tests submodule to fix divide by zero error (#306)
**Description**
Update rccl-tests submodule to fix divide by zero error.
2022-02-09 14:46:29 +00:00
Ziyue Yang 6cdf759543
Benchmarks: Revise Code - Eliminate NUMA binding for device-to-device tests in gpu_copy (#302)
**Description**
This commit removes NUMA binding for device-to-device tests because NUMA doesn't affect their performance, and revises benchmark metrics accordingly.
2022-02-09 20:30:42 +08:00
Ziyue Yang 433785fd0c
Benchmarks: Add Feature - Add GDR-only nccl-tests for Nvidia machines (#299)
This commit adds GDR-only nccl-tests for Nvidia machines. Also bump NCCL to v2.10.3-1 to achieve peak performance in this test.
2022-02-08 17:59:48 +08:00
Ziyue Yang 682b2c120d
Benchmarks: Revise Code - Make data checking in gpu_copy optional (#301)
This commit makes data checking in gpu_copy optional, because it takes too long when the message size is large.
2022-02-08 10:59:27 +08:00
Ziyue Yang 853890559a
Benchmarks: Revise Code - Reduce result variance in gpu_copy benchmark (#298)
**Description**
This commit does the following to optimize result variance in gpu_copy benchmark:
1) Add warmup phase for gpu_copy benchmark to avoid timing instability caused by first-time CUDA kernel launch overhead;
2) Use CUDA events for timing instead of CPU timestamps;
3) Make data checking optional, since enabling it is not recommended in performance tests;
4) Enlarge message size in performance benchmark.
2022-02-07 13:16:13 +08:00
Yuting Jiang 28195be6db
Bug - Fix typo in document (#297)
Fix typo in document.
2022-01-30 13:38:00 +08:00
Yifan Xiong 3419447c11
Benchmarks - Support T4 and A10 in GEMM benchmark (#294)
Support T4 and A10 in GEMM benchmark.
2022-01-29 13:26:00 +00:00
Yifan Xiong 3524975cfc
Config - Support customized env for all modes (#295)
Support customized env for all modes in configuration.
2022-01-29 08:19:48 +00:00
Ziyue Yang f3d05006d4
Benchmarks: Fix Bug - Fix GPU scan logic in gpu_copy (#296)
Fix bug of GPU scan logic in bidirectional tests.
2022-01-29 14:04:03 +08:00
guoshzhao d03d110f55
Benchmarks: Add Feature - Sync the E2E training results among all workers for each step. (#287)
**Description**
Sync the E2E training results among all workers for each step.

**Major Revision**
- Sync (do allreduce max) the E2E training results among all workers.
- Avoid using ':0' in the metric name if only one rank has output.
2022-01-28 20:35:53 +08:00
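
The "allreduce max" sync described in #287 can be pictured with a small single-process simulation. This is only a sketch under stated assumptions: `sync_step_results` and its input layout are hypothetical, and a real run would use a collective operation across workers rather than a local loop.

```python
def sync_step_results(per_worker_step_times):
    """Simulate an allreduce(max) over per-step results.

    per_worker_step_times: one list of step times per worker (hypothetical layout).
    Returns one list where each step holds the maximum across all workers,
    which is what every worker would see after the sync.
    """
    lengths = {len(times) for times in per_worker_step_times}
    if len(lengths) != 1:
        raise ValueError("every worker must report the same number of steps")
    # zip(*...) groups the values of the same step from all workers.
    return [max(step_values) for step_values in zip(*per_worker_step_times)]


worker_0 = [1.0, 2.0, 3.0]
worker_1 = [1.5, 1.8, 3.2]
synced = sync_step_results([worker_0, worker_1])
```

Taking the maximum means the slowest worker determines the reported time of each step.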
guoshzhao d877ca2322
Benchmarks: Add Feature - Add timeout feature for each benchmark. (#288)
**Description**
Add timeout feature for each benchmark.

**Major Revision**
- Add `timeout` config for each benchmark. The current config files set the timeout only for kernel-launch as an example; other benchmarks can be configured later.
- Set the timeout config for `ansible_runner.run()`. The runner will get the return code 254:
   [ansible.py:80][WARNING] Run failed, return code 254.
- Use the `timeout` command to terminate the client process.
2022-01-28 08:16:32 +00:00
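
As a rough illustration of the `timeout` approach in #288, a command can be wrapped before it is launched. The helper `with_timeout` and the example command are hypothetical and are not taken from the SuperBench code; the sketch only builds the argument list and does not execute anything.

```python
import shlex


def with_timeout(command, timeout_s):
    """Prefix a shell command with `timeout <seconds>` (hypothetical helper).

    Returns an argument list suitable for subprocess.run().
    A missing or non-positive timeout leaves the command unchanged.
    """
    argv = shlex.split(command)
    if timeout_s is None or timeout_s <= 0:
        return argv
    return ["timeout", str(int(timeout_s))] + argv


# Hypothetical benchmark command, limited to 60 seconds.
argv = with_timeout("python3 run_benchmark.py --name kernel-launch", 60)
```

When the limit is hit, GNU `timeout` exits with status 124 by default, which lets the caller tell a timeout apart from an ordinary failure.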
Yuting Jiang f283b53638
Config - Disable disk-benchmark in ndmv4.yaml and change batch size to 1 in default.yaml (#292)
**Description**
Disable disk-benchmark in ndmv4.yaml and change batch size to 1 in default.yaml
2022-01-28 06:15:19 +08:00
Yifan Xiong 7d7cd3dc63
Config - Update benchmark naming to support annotations (#284)
__Description__

Update benchmark naming to support annotations.

__Major Revisions__
- Update name for `create_benchmark_context` in executor.
- Backward compatibility for model benchmarks using "_models" suffix.
- Update documents.
2022-01-25 09:54:58 +00:00
Yuting Jiang 35fc06ebd1
Bug: Fix insecure code that binds a socket to all network interfaces (#291)
**Description**
Fix insecure code that binds a socket to all network interfaces.
2022-01-24 10:59:06 +00:00
Yuting Jiang 380ce4001c
Bug: Fix insecure code with integer overflow in cublas function (#290)
**Description**
Fix the insecure pattern where a multiplication result is converted to a larger type only after it has been computed.

**Major Revision**
- Use a cast so that the multiplication is done in long long, avoiding overflow.
2022-01-24 18:15:54 +08:00
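
The overflow addressed in #290 can be imitated in Python with `ctypes`, which wraps values without overflow checking. This is a simulation only: in C, signed overflow is undefined behavior, and the wrapped value below is merely what is commonly observed. The function names and the sizes are made up for illustration.

```python
import ctypes


def product_int32(m, n):
    """Mimic multiplying in 32-bit int and widening only afterwards."""
    return ctypes.c_int32(m * n).value


def product_int64(m, n):
    """Mimic casting to long long before multiplying."""
    return ctypes.c_int64(m * n).value


rows, cols = 50000, 50000  # hypothetical matrix dimensions
narrow = product_int32(rows, cols)
wide = product_int64(rows, cols)
```

The 32-bit product wraps to a negative number, while the 64-bit product keeps the true element count.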
dependabot[bot] 5f6ad0cd63
Bump nanoid from 3.1.23 to 3.2.0 in /website (#286)
Bumps [nanoid](https://github.com/ai/nanoid) from 3.1.23 to 3.2.0.
- [Release notes](https://github.com/ai/nanoid/releases)
- [Changelog](https://github.com/ai/nanoid/blob/main/CHANGELOG.md)
- [Commits](https://github.com/ai/nanoid/compare/3.1.23...3.2.0)

---
updated-dependencies:
- dependency-name: nanoid
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2022-01-23 21:05:11 +08:00
Ziyue Yang 74421ffee0
Benchmarks: Add Feature - Add bidirectional test support in gpu_copy benchmark (#285)
**Description**
This commit adds bidirectional tests in gpu_copy benchmark for both device-host transfer and device-device transfer, and revises related tests.
2022-01-21 13:45:37 +08:00
guoshzhao fd2bc9e048
Benchmarks: Add Feature - Add percentile metrics for ort and pytorch inference benchmarks (#283)
**Description**
Add 50th, 90th, 95th, 99th, 99.9th latency metrics for ORT and pytorch inference benchmarks.
2022-01-19 10:49:56 +08:00
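
For reference, the percentile latencies listed in #283 could be derived from raw samples as sketched below. The nearest-rank method and the metric names are assumptions; the commit does not say which percentile definition or naming the benchmarks use.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: smallest sample with at least pct% of samples <= it."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    # Multiply before dividing to keep the rank exact for common percentiles.
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def latency_percentiles(latencies_ms, pcts=(50, 90, 95, 99, 99.9)):
    """Return a dict of hypothetical metric names to percentile latencies."""
    return {f"latency_{p}": percentile(latencies_ms, p) for p in pcts}


latencies = list(range(1, 1001))  # synthetic samples: 1 ms .. 1000 ms
metrics = latency_percentiles(latencies)
```

Tail percentiles such as the 99.9th need many samples to be meaningful, since they are set by only a handful of the slowest requests.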
Yifan Xiong f7ffc54522
CLI - Add command sb benchmark [list,list-parameters] (#279)
__Description__

Add command `sb benchmark list` and `sb benchmark list-parameters` to support listing all optional parameters for benchmarks.

<details>
<summary>Examples</summary>
<pre>
$ sb benchmark list -n [a-z]+-bw -o table
Result
--------
mem-bw
nccl-bw
rccl-bw
</pre>
<pre>
$ sb benchmark list-parameters -n mem-bw
=== mem-bw ===
optional arguments:
  --bin_dir str         Specify the directory of the benchmark binary.
  --duration int        The elapsed time of benchmark in seconds.
  --mem_type str [str ...]
                        Memory types to benchmark. E.g. htod dtoh dtod.
  --memory str          Memory argument for bandwidthtest. E.g. pinned unpinned.
  --run_count int       The run count of benchmark.
  --shmoo_mode          Enable shmoo mode for bandwidthtest.
default values:
{'bin_dir': None,
 'duration': 0,
 'mem_type': ['htod', 'dtoh'],
 'memory': 'pinned',
 'run_count': 1}
</pre>
</details>

__Major Revisions__
* Add `sb benchmark list` to list benchmarks matching given name.
* Add `sb benchmark list-parameters` to list parameters for benchmarks which match given name.

__Minor Revisions__
* Sort format help text for argparse.
2022-01-18 08:40:03 +00:00
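
The name filter in the `sb benchmark list -n [a-z]+-bw` example from #279 behaves like a regular-expression match over benchmark names. The sketch below assumes a full match against each name, which the commit message does not confirm; `list_benchmarks` and the candidate list are hypothetical.

```python
import re


def list_benchmarks(names, pattern):
    """Return the sorted names that fully match the regular expression."""
    regex = re.compile(pattern)
    return sorted(name for name in names if regex.fullmatch(name))


# Candidate names taken from elsewhere in this log, used for illustration only.
candidates = ["nccl-bw", "kernel-launch", "mem-bw", "ort-inference", "rccl-bw"]
matched = list_benchmarks(candidates, r"[a-z]+-bw")
```

Here `matched` contains the same three names as the table output in the commit message.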
dependabot[bot] 9a909d2bed
Bump follow-redirects from 1.14.1 to 1.14.7 in /website (#282)
Bumps [follow-redirects](https://github.com/follow-redirects/follow-redirects) from 1.14.1 to 1.14.7.
- [Release notes](https://github.com/follow-redirects/follow-redirects/releases)
- [Commits](https://github.com/follow-redirects/follow-redirects/compare/v1.14.1...v1.14.7)

---
updated-dependencies:
- dependency-name: follow-redirects
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2022-01-17 12:56:30 +08:00
dependabot[bot] 2538a7eedd
Bump shelljs from 0.8.4 to 0.8.5 in /website (#281)
Bumps [shelljs](https://github.com/shelljs/shelljs) from 0.8.4 to 0.8.5.
- [Release notes](https://github.com/shelljs/shelljs/releases)
- [Changelog](https://github.com/shelljs/shelljs/blob/master/CHANGELOG.md)
- [Commits](https://github.com/shelljs/shelljs/compare/v0.8.4...v0.8.5)

---
updated-dependencies:
- dependency-name: shelljs
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2022-01-17 10:51:54 +08:00
Yifan Xiong ff563b66af
Release - SuperBench v0.4.0 (#278)
__Description__

Cherry-pick bug fixes from v0.4.0 to main.

__Major Revisions__

* Bug - Fix issues for Ansible and benchmarks (#267)
* Tests - Refine test cases for microbenchmark (#268)
* Bug - Build openmpi with ucx support in rocm dockerfiles (#269)
* Benchmarks: Fix Bug - Fix fio build issue (#272)
* Docs - Unify metric and add doc for cublas and cudnn functions (#271)
* Monitor: Revision - Add 'monitor/' prefix to monitor metrics in result summary (#274)
* Bug - Fix bug of detecting if gpu_index is none (#275)
* Bug - Fix bugs in data diagnosis (#273)
* Bug - Fix issue that the root mpi rank may not be the first in the hostfile (#270)
* Benchmarks: Configuration - Update inference and network benchmarks in configs (#276)
* Docs - Upgrade version and release note (#277)

Co-authored-by: Yuting Jiang <v-yutjiang@microsoft.com>
2021-12-30 16:24:00 +08:00
Yuting Jiang 682ed06aee
Docs - Add usage for data diagnosis (#266)
**Description**
Add usage for data diagnosis.
2021-12-14 03:10:29 +00:00
guoshzhao 2e10fb0dcd
Docs - Update docs for monitor. (#265)
**Description**
Update docs for monitor.
2021-12-13 14:07:28 +00:00
Yifan Xiong cb8a3cfb15
Benchmarks - Add transformers for TensorRT inference (#254)
Add transformers for TensorRT inference.
2021-12-13 13:21:32 +00:00
Ziyue Yang 10012a0a47
Docs - Add benchmark metrics for cpu-memory-bw-latency (#264)
**Description**
Add benchmark metrics for cpu-memory-bw-latency.
2021-12-13 19:08:19 +08:00
Ziyue Yang b6781968f2
Benchmarks: Fix Comment - Correct benchmark name in test_gpu_copy_bw_performance.py #263
**Description**
Benchmarks: Fix Comment - Correct benchmark name in test_gpu_copy_bw_performance.py.
2021-12-13 07:02:39 +00:00
Hossein Pourreza b590409e0f
Benchmarks: Add Benchmark - Add mlc benchmark to superbench (#216)
**Description**
Add mlc memory bandwidth and latency micro benchmark to Superbench.

**Major Revision**
- Add mlc benchmark with test and example files
2021-12-13 13:47:42 +08:00
yangpanMS c403b1ca76
Docs - Add a small note for using release container version (#262)
**Description**
Minor doc change to highlight sb CLI version is independent of the sb container version.
2021-12-13 03:48:11 +00:00
guoshzhao 4d85630abb
Benchmarks: Add Benchmark - Add ONNXRuntime inference benchmark based on ORT python API (#245)
**Description**
Add ONNXRuntime inference benchmark based on ORT python API.

**Major Revision**
- Add `ORTInferenceBenchmark` class to export pytorch model to onnx model and do inference
- Add tests and example for `ort-inference` benchmark
- Update the introduction docs.
2021-12-10 13:53:11 +00:00
Yuting Jiang c2f942cb6f
Analyzer: Add Feature - Add basic analysis features (#248)
**Description**
Add basic analysis features.

**Major Revision**
- Add statistics, correlations of the raw data
- Add numeric outlier detection(inter_quartile_range)
- Add boxplot for selected metric
2021-12-10 11:01:59 +00:00
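
The inter-quartile-range outlier detection mentioned in #248 can be sketched with the standard library. The function name, the 1.5 multiplier, and the quantile method are assumptions, not the analyzer's actual implementation.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    # quantiles(n=4) yields the three quartile cut points Q1, Q2, Q3.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


samples = [10, 11, 12, 13, 14, 15, 100]  # synthetic metric values
outliers = iqr_outliers(samples)
```

Because quartiles are insensitive to extreme values, this rule stays stable even when the outlier itself is very large.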
guoshzhao 6e357fb9d2
Monitor: Integration - Integrate monitor into Superbench (#259)
**Description**
Integrate monitor into Superbench.

**Major Revision**
- Initialize, start and stop monitor in SB executor.
- Parse the monitor data in SB runner and merge into benchmark results.
- Specify ReduceType for monitor metrics, such as MAX, MIN and LAST.
- Add monitor configs into config file.
2021-12-10 09:33:13 +00:00
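
One way to picture the `ReduceType` options named in #259 is an enum that selects how a series of monitor samples collapses into one value. Only the names `ReduceType`, MAX, MIN and LAST come from the commit message; the enum values and the `reduce_samples` helper are hypothetical.

```python
from enum import Enum


class ReduceType(Enum):
    """Reduction applied to a series of monitor samples (values are assumed)."""
    MAX = "max"
    MIN = "min"
    LAST = "last"


def reduce_samples(samples, reduce_type):
    """Collapse a non-empty list of samples into a single value."""
    if not samples:
        raise ValueError("samples must not be empty")
    if reduce_type is ReduceType.MAX:
        return max(samples)
    if reduce_type is ReduceType.MIN:
        return min(samples)
    if reduce_type is ReduceType.LAST:
        return samples[-1]
    raise ValueError(f"unsupported reduce type: {reduce_type}")


gpu_temperature = [61, 74, 68]  # synthetic samples collected over a run
peak = reduce_samples(gpu_temperature, ReduceType.MAX)
```

MAX suits metrics such as temperature, MIN suits metrics such as clock frequency, and LAST suits counters that only grow over the run.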
guoshzhao afea9913ae
Benchmarks: Fix Bug - Set reduce_op type for metric return_code (#261)
**Description**
Set the `reduce_op` type for metric `return_code` as `None`.
2021-12-10 16:02:29 +08:00
Yuting Jiang ed2f3c3c82
CLI - Integrate data diagnosis (#260)
**Description**
Add CLI to integrate the data diagnosis module.
2021-12-10 06:11:00 +00:00
Yuting Jiang 9f56b2198f
Benchmarks: Unify metric names of benchmarks (#252)
**Description**
Unify metric names of benchmarks.
2021-12-09 04:48:42 +00:00
Yuting Jiang c13ed2a297
Analyzer: Initialization - Add baseline-based data diagnosis module (#242)
**Description**
Add data diagnosis module.

**Major Revision**
- Add DataDiagnosis class to support rule-based data diagnosis for the result summary jsonl file of multiple nodes
- Add RuleOp class to define rule operators
2021-12-08 18:22:00 +08:00
Yifan Xiong 213ab14bea
Bug - Fix issues for distributed runs (#258)
Fix issues for distributed runs:
* fix config for memory bandwidth benchmarks
* add throttling for high concurrency docker pull
* update rsync path and exclude directories
* handle exceptions when creating summary
* tune for logging
2021-12-08 06:55:13 +00:00
guoshzhao 44f0270ec4
Benchmarks: Add Feature - Add return_code metric into result (#256)
**Description**
Add return_code metric into result and revise unit tests.
2021-12-07 07:32:37 +00:00
Yuting Jiang 655f238dbb
Docs - Add doc for data diagnosis (#249)
**Description**
Add doc for data diagnosis, including input, output and baseline file schema.
2021-12-06 02:49:38 +00:00
Yifan Xiong bd8f105d2e
Benchmarks - Add config file for NDm A100 v4 (#255)
Add config file for Azure NDm A100 v4 SKU.
2021-12-04 01:17:23 +08:00
guoshzhao 8042fa34cf
Benchmarks: Configuration - Add gpt-small into config files. (#253)
**Description**
Add gpt-small into config files.
2021-12-02 11:12:55 +00:00
guoshzhao 371fd61cea
Benchmarks: Add Feature - Add 'ignore_invalid' option when registering benchmarks. (#247)
**Description**
If `ignore_invalid` is True and 'required' arguments are not set when registering the benchmark, argument checking is skipped and the user is expected to provide those arguments in the config.
2021-12-02 10:26:56 +00:00
Yifan Xiong b4ea97bfa4
Benchmark: Replace `-c` argument with `-N` for `numactl` in Configuration (#250)
**Description**
Replace `-c` argument with `-N` for `numactl` since the old `-c`/`--cpubind` argument is deprecated.
2021-12-02 09:27:03 +00:00
Ziyue Yang b0e759f599
Benchmarks: Build Pipeline - Upgrade FIO benchmark tool (#251)
**Description**
Upgrade FIO benchmark tool from 3.27 to 3.28.
2021-12-01 20:33:09 +08:00
Yuting Jiang 978e88efdd
Docs: Update ib validation microbenchmark metrics (#246)
**Description**
Update ib validation microbenchmark metrics.
2021-11-30 12:58:34 +00:00