Commit graph

437 Commits

Author SHA1 Message Date
Yuting Jiang 54da021b4d
Analyzer - Fix bugs in data diagnosis (#355)
**Description**
Fix bugs in data diagnosis.

**Major Revision**
- add support for getting the baseline of metrics that use custom benchmark naming with ':', such as 'nccl-bw:default/allreduce_8_bw:0'
- save raw data of all metrics, rather than only the metrics defined in diagnosis_rules.yaml, when output_all is True
- fix bug of using the wrong column index when applying formatting (red color and percentile) in the Excel file
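
To illustrate why a ':' inside the benchmark name needs care, here is a minimal sketch of splitting such a metric name. The layout `<benchmark>[:<annotation>]/<metric>[:<rank>]` is an assumption inferred from the single example above, and `split_metric_name` and `MetricName` are hypothetical names, not SuperBench APIs.

```python
from typing import NamedTuple, Optional


class MetricName(NamedTuple):
    benchmark: str         # e.g. 'nccl-bw:default' (name plus optional annotation)
    metric: str            # e.g. 'allreduce_8_bw'
    rank: Optional[int]    # e.g. 0, or None when there is no rank suffix


def split_metric_name(full_name: str) -> MetricName:
    """Split '<benchmark>/<metric>[:<rank>]' into parts (hypothetical helper)."""
    benchmark, _, metric = full_name.partition('/')
    # Only a trailing ':<digits>' on the metric part is treated as a rank,
    # so a ':' inside the benchmark part (the annotation) is left untouched.
    head, sep, tail = metric.rpartition(':')
    if sep and tail.isdigit():
        return MetricName(benchmark, head, int(tail))
    return MetricName(benchmark, metric, None)
```

With this split, `'nccl-bw:default/allreduce_8_bw:0'` keeps `'nccl-bw:default'` intact as the benchmark part, so a baseline lookup keyed on benchmark and metric would not be confused by the annotation separator.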
2022-06-01 17:12:38 +08:00
Yuting Jiang 3f135e4669
Dockerfile - Add support to run sb command inside docker image (#356)
**Description**
Add support to run sb command inside docker image - install missing dependency.
2022-06-01 01:11:28 +08:00
Yuting Jiang e08b6d3a1c
Dockerfile: Update rccl version and fix issue in rocm5.1.1 dockerfile (#354)
**Description**
Update rccl version and fix issue in rocm5.1.1 dockerfile.
2022-05-27 10:46:40 +08:00
Yuting Jiang 81a4146bc1
Dockerfile - Add dockerfile for rocm5.1.1 (#353)
**Description**
Add dockerfile for rocm5.1.1.
2022-05-25 20:28:11 +08:00
Yifan Xiong 6681c72043
Release - SuperBench v0.5.0 (#350)
**Description**

Cherry-pick bug fixes from v0.5.0 to main.

**Major Revisions**

* Bug - Force to fix ort version as '1.10.0' (#343)
* Bug - Support no matching rules and unify the output name in result_summary (#345)
* Analyzer - Support regex in annotations of benchmark naming for metrics in rules (#344)
* Bug - Fix bugs in sync results on root rank for e2e model benchmarks (#342)
* Bug - Fix bug of duration feature for model benchmarks in distributed mode (#347)
* Docs - Upgrade version and release note (#348)

Co-authored-by: Yuting Jiang <v-yutjiang@microsoft.com>
2022-04-29 16:22:55 +08:00
Yuting Jiang 712eafc373
Docs - Update links using relative file paths with extensions (#346)
**Description**
Update links of referencing other docs using relative file paths with extensions.
2022-04-21 07:28:19 +08:00
Jared Bowden cb26691173
Docs - Update link to cli.md (#341)
**Description**
Fixes relative link in documentation: point to `../cli.md`.
2022-04-15 22:11:14 +08:00
guoshzhao 80dcc8aaec
Benchmarks: Add Benchmark - Add FAMBench based on docker benchmark (#338)
**Description**
Integrate FAMBench into superbench based on docker implementation:
https://github.com/facebookresearch/FAMBench

The script to run all benchmarks is:
https://github.com/facebookresearch/FAMBench/blob/main/benchmarks/run_all.sh
2022-04-11 15:31:07 +08:00
Yuting Jiang 8dc19ca4af
CLI - Integrate output all nodes diagnosis results (#339)
**Description**
Integrate output all nodes diagnosis results.
2022-04-11 13:42:04 +08:00
Yuting Jiang 55b0f9d239
Analyzer: Add Feature - Output results of all nodes in data diagnosis (#336)
**Description**
Output results of all nodes in data diagnosis.
2022-04-10 18:57:15 +08:00
Yuting Jiang 56c9a711a8
Docs - Add usage for result summary (#337)
**Description**
Add usage for result summary.
2022-04-08 20:44:25 +00:00
Yuting Jiang f15da60b2b
CLI - Integrate result summary and update output format of data diagnosis (#335)
**Description**
Integrate result summary and update output format of data diagnosis.

**Major Revision**
- integrate result summary
- add md and html format for data diagnosis
2022-04-08 18:48:43 +08:00
guoshzhao 6d895da83c
Benchmarks: Add Feature - Provide option to save raw data into file. (#333)
**Description**
Use the `log_raw_data` config to control whether to log the raw data into a file. The default value is `no`. It can be set to `yes` for particular benchmarks, such as the NCCL/RCCL tests, to save their raw data into a file.
2022-04-01 16:26:09 +08:00
dependabot[bot] d368d90e21
Bump minimist from 1.2.5 to 1.2.6 in /website (#334)
Bumps [minimist](https://github.com/substack/minimist) from 1.2.5 to 1.2.6.
- [Release notes](https://github.com/substack/minimist/releases)
- [Commits](https://github.com/substack/minimist/compare/1.2.5...1.2.6)

---
updated-dependencies:
- dependency-name: minimist
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2022-03-31 12:41:48 +08:00
Yuting Jiang 84fed1ce18
Analyzer: Add feature - Add result summary in excel,md,html format (#320)
**Description**
Add result summary in excel,md,html format.

**Major Revision**
- Add ResultSummary class to support result summary in excel,md,html format.
- Abstract RuleBase class for commonly used functions in DataDiagnosis and ResultSummary.
2022-03-24 15:32:01 +08:00
Yuting Jiang c5aa4f4e38
Bug: Benchmarks - remove fp16 samples type converting time (#332)
**Description**
Remove fp16 samples type converting time for training cnn and lstm inference.
2022-03-22 12:51:52 +08:00
Yifan Xiong a9634ef5a8
Config - Add inference config for NC A100 and NV A10 series (#329)
Add inference config for preview SKUs, including:
* [NC96ads_A100_v4](https://docs.microsoft.com/en-us/azure/virtual-machines/nc-a100-v4-series)
* [NV18ads_A10_v5](https://docs.microsoft.com/en-us/azure/virtual-machines/nva10v5-series)
2022-03-21 14:24:37 +08:00
Yuting Jiang 6e74918044
Bug: Benchmarks - remove fp16 samples type converting time for cnn and lstm models (#330)
**Description**
Remove fp16 samples type converting time for cnn and lstm models.
2022-03-17 14:02:40 +08:00
rafsalas19 ff51a3cee9
Benchmarks: Add Feature - Add GPU-Burn as microbenchmark (#324)
**Description**
Modifications adding GPU-Burn to SuperBench.
- added third party submodule
- modified Makefile to make gpu-burn binary
- added/modified microbenchmarks to add gpu-burn python scripts
- modified default and azure_ndv4 configs to add gpu-burn
2022-03-16 16:20:11 +08:00
Yuting Jiang 84359fd806
Bug: Executor - fix bug in result writing to files for mpi mode (#328)
**Description**
fix the bug in result writing to files for mpi mode.
2022-03-15 16:35:03 +00:00
Yuting Jiang b3c95f1827
Analyzer - Add md and html output format for DataDiagnosis (#325)
**Description**
Add md and html output format for DataDiagnosis.

**Major Revision**
- add md and html support in file_handler
- add interface in DataDiagnosis for md and HTML output

**Minor Revision**
- move excel and json output interface into DataDiagnosis
2022-03-15 18:04:11 +08:00
Yifan Xiong f755c0b659
Bug - Fix env path to absolute path (#327)
Fix env file path to absolute path in `docker exec`, in case there are mixed SSH and local connections or different users are used.
2022-03-09 17:16:43 +08:00
Yuting Jiang 1ec055e1c2
Analyzer: Revise - Abstract RuleBase from DataDiagnosis (#321)
**Description**
Abstract RuleBase from DataDiagnosis.
2022-03-07 17:25:07 +08:00
dependabot[bot] 9759527111
Bump url-parse from 1.5.8 to 1.5.10 in /website (#323)
Bumps [url-parse](https://github.com/unshiftio/url-parse) from 1.5.8 to 1.5.10.
- [Release notes](https://github.com/unshiftio/url-parse/releases)
- [Commits](https://github.com/unshiftio/url-parse/compare/1.5.8...1.5.10)

---
updated-dependencies:
- dependency-name: url-parse
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2022-03-07 03:24:22 +00:00
Jeff Daily a9ef0f99ab
Benchmarks - Keep BatchNorm as fp32 for pytorch cnn models cast to fp16 (#322)
**Description**
The BatchNorm operator is not numerically stable in fp16. PyTorch documentation recommends keeping the BN op in fp32 for fp16 AMP models. Refer to https://pytorch.org/docs/stable/amp.html#ops-that-can-autocast-to-float32. Preserving BN in fp32 for superbench more accurately reflects real workloads.
2022-03-06 13:22:43 +00:00
Yuting Jiang 425b9ff865
Dockerfile - Add dockerfile for rocm5.0.1 (#319)
**Description**
Add dockerfile for rocm5.0.1.
2022-02-28 19:30:43 +08:00
dependabot[bot] 74a3b1231a
Bump prismjs from 1.23.0 to 1.27.0 in /website (#318)
Bumps [prismjs](https://github.com/PrismJS/prism) from 1.23.0 to 1.27.0.
- [Release notes](https://github.com/PrismJS/prism/releases)
- [Changelog](https://github.com/PrismJS/prism/blob/master/CHANGELOG.md)
- [Commits](https://github.com/PrismJS/prism/compare/v1.23.0...v1.27.0)

---
updated-dependencies:
- dependency-name: prismjs
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2022-02-28 14:55:12 +08:00
Yuting Jiang a4950a707e
Dockerfile - Add rocm5.0 dockerfile (#307)
**Description**
Add rocm5.0 dockerfile.
2022-02-26 07:12:45 +08:00
Ziyue Yang 01304706ed
Bug Fix - Fix P2P detection in gpu_copy (#317)
**Description**
Fix invalid reference of P2P detection result in gpu_copy.
2022-02-25 05:48:38 +08:00
Yuting Jiang 4f5027dbda
Benchmarks: Build Pipeline - Make gpcnet only for cuda (#316)
**Description**
Make gpcnet only for cuda.
2022-02-24 18:18:49 +08:00
Yuting Jiang e0c491425d
Bug - Fix empty HIP_ARCHITECTURES issue in cmake>=3.21.0 (#315)
**Description**
Fix the empty HIP_ARCHITECTURES issue with cmake>=3.21.0.
Refer to https://github.com/ROCm-Developer-Tools/HIP/pull/2364
2022-02-22 12:38:58 +00:00
dependabot[bot] 0740780bcc
Bump url-parse from 1.5.1 to 1.5.8 in /website (#313)
Bumps [url-parse](https://github.com/unshiftio/url-parse) from 1.5.1 to 1.5.8.
- [Release notes](https://github.com/unshiftio/url-parse/releases)
- [Commits](https://github.com/unshiftio/url-parse/compare/1.5.1...1.5.8)

---
updated-dependencies:
- dependency-name: url-parse
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2022-02-21 13:03:27 +08:00
Yifan Xiong ea2c10abc4
Config - Add T4 configurations for inference (#311)
Add T4 configurations for inference.
2022-02-20 13:00:55 +00:00
Yuting Jiang 97ed12f97f
Analyzer: Add Feature - Add multi-rules feature for data diagnosis (#289)
**Description**
Add multi-rules feature for data diagnosis to support multiple rules' combined check.

**Major Revision**
- revise rule design to support multiple rules combination check
- update related codes and tests
2022-02-20 16:59:38 +08:00
Yifan Xiong 1f48268bf5
Bug - Fix env file path (#310)
Fix env file path for `docker run`.
2022-02-15 15:23:43 +08:00
dependabot[bot] 53fe0c4798
Bump follow-redirects from 1.14.7 to 1.14.8 in /website (#309)
Bumps [follow-redirects](https://github.com/follow-redirects/follow-redirects) from 1.14.7 to 1.14.8.
- [Release notes](https://github.com/follow-redirects/follow-redirects/releases)
- [Commits](https://github.com/follow-redirects/follow-redirects/compare/v1.14.7...v1.14.8)

---
updated-dependencies:
- dependency-name: follow-redirects
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2022-02-15 13:04:37 +08:00
Yuting Jiang e31b8c9e08
Benchmarks: Revise Code - Add support for pytorch>=1.9.0 of init_process_group (#305)
**Description**
Add support for pytorch>=1.9.0 of init_process_group.

**Major Revision**
- Use PrefixStore(TCPStore) to init_process_group manually for each model run
2022-02-10 22:44:01 +08:00
Yuting Jiang 4abda6f5d4
Benchmarks: Build Pipeline - Update rccl-tests submodule to fix divide by zero error (#306)
**Description**
Update rccl-tests submodule to fix divide by zero error.
2022-02-09 14:46:29 +00:00
Ziyue Yang 6cdf759543
Benchmarks: Revise Code - Eliminate NUMA binding for device-to-device tests in gpu_copy (#302)
**Description**
This commit removes NUMA binding for device-to-device tests because NUMA doesn't affect performance, and revises benchmark metrics accordingly.
2022-02-09 20:30:42 +08:00
Ziyue Yang 433785fd0c
Benchmarks: Add Feature - Add GDR-only nccl-tests for Nvidia machines (#299)
This commit adds GDR-only nccl-tests for Nvidia machines. Also bump NCCL to v2.10.3-1 to achieve peak performance in this test.
2022-02-08 17:59:48 +08:00
Ziyue Yang 682b2c120d
Benchmarks: Revise Code - Make data checking in gpu_copy optional (#301)
This commit makes data checking in gpu_copy optional, because it takes too long when the message size is large.
2022-02-08 10:59:27 +08:00
Ziyue Yang 853890559a
Benchmarks: Revise Code - Reduce result variance in gpu_copy benchmark (#298)
**Description**
This commit does the following to optimize result variance in gpu_copy benchmark:
1) Add warmup phase for gpu_copy benchmark to avoid timing instability caused by first-time CUDA kernel launch overhead;
2) Use CUDA events for timing instead of CPU timestamps;
3) Make data checking optional; enabling it is not recommended in performance tests;
4) Enlarge message size in performance benchmark.
2022-02-07 13:16:13 +08:00
Yuting Jiang 28195be6db
Bug - Fix typo in document (#297)
Fix typo in document.
2022-01-30 13:38:00 +08:00
Yifan Xiong 3419447c11
Benchmarks - Support T4 and A10 in GEMM benchmark (#294)
Support T4 and A10 in GEMM benchmark.
2022-01-29 13:26:00 +00:00
Yifan Xiong 3524975cfc
Config - Support customized env for all modes (#295)
Support customized env for all modes in configuration.
2022-01-29 08:19:48 +00:00
Ziyue Yang f3d05006d4
Benchmarks: Fix Bug - Fix GPU scan logic in gpu_copy (#296)
Fix bug of GPU scan logic in bidirectional tests.
2022-01-29 14:04:03 +08:00
guoshzhao d03d110f55
Benchmarks: Add Feature - Sync the E2E training results among all workers for each step. (#287)
**Description**
Please write a brief description and link the related issue if have.

**Major Revision**
- Sync (do allreduce max) the E2E training results among all workers.
- Avoid using ':0' in the metric name if only one rank has output.
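
The two revisions above can be sketched in plain Python. This is a single-process illustration only: the actual change presumably relies on a distributed collective, and the function names and the `step_time` metric used below are hypothetical.

```python
from typing import Dict, List


def allreduce_max(per_rank_steps: List[List[float]]) -> List[float]:
    """Element-wise max over ranks, giving one value per training step."""
    # zip(*...) groups the values that all ranks recorded for the same step.
    return [max(step_values) for step_values in zip(*per_rank_steps)]


def name_metrics(metric: str, per_rank_values: Dict[int, float]) -> Dict[str, float]:
    """Append ':<rank>' only when more than one rank reports the metric."""
    if len(per_rank_values) == 1:
        (value,) = per_rank_values.values()
        return {metric: value}
    return {f'{metric}:{rank}': value for rank, value in sorted(per_rank_values.items())}
```

For example, `allreduce_max([[1.0, 5.0], [2.0, 3.0]])` yields `[2.0, 5.0]`, so every worker ends up with the slowest time seen for each step, and `name_metrics('step_time', {0: 1.5})` yields `{'step_time': 1.5}` with no rank suffix.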
2022-01-28 20:35:53 +08:00
guoshzhao d877ca2322
Benchmarks: Add Feature - Add timeout feature for each benchmark. (#288)
**Description**
Add timeout feature for each benchmark.

**Major Revision**
- Add `timeout` config for each benchmark. The current config files set a timeout only for kernel-launch, as an example. Timeouts for other benchmarks can be set in the future.
- Set the timeout config for `ansible_runner.run()`. Runner will get the return code 254:
   [ansible.py:80][WARNING] Run failed, return code 254.
- Use the `timeout` command to terminate the client process.
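
As a rough sketch of the last point, a client command can be prefixed with the coreutils `timeout` utility when a limit is configured. The helper name `with_timeout` and its parameters are hypothetical and not taken from the SuperBench code base.

```python
from typing import List, Optional


def with_timeout(command: List[str], timeout_s: Optional[int]) -> List[str]:
    """Prefix a command with `timeout <seconds>` when a limit is configured."""
    if timeout_s is None:
        # No timeout configured for this benchmark: run the command unchanged.
        return command
    return ['timeout', str(timeout_s), *command]
```

Building the command as an argument list rather than a single string avoids shell quoting issues if the result is later passed to something like `subprocess.run`.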
2022-01-28 08:16:32 +00:00
Yuting Jiang f283b53638
Config - Disable disk-benchmark in ndmv4.yaml and change batch size to 1 in default.yaml (#292)
**Description**
Disable disk-benchmark in ndmv4.yaml and change batch size to 1 in default.yaml
2022-01-28 06:15:19 +08:00
Yifan Xiong 7d7cd3dc63
Config - Update benchmark naming to support annotations (#284)
__Description__

Update benchmark naming to support annotations.

__Major Revisions__
- Update name for `create_benchmark_context` in executor.
- Backward compatibility for model benchmarks using "_models" suffix.
- Update documents.
2022-01-25 09:54:58 +00:00