Commit Graph

24 Commits

Author SHA1 Message Date
Yifan Xiong 2c88db907f
Release - SuperBench v0.10.0 (#607)
**Description**

Cherry-pick bug fixes from v0.10.0 to main.

**Major Revisions**

* Benchmarks: Microbenchmark - Support different hipblasLt data types in dist_inference #590
* Benchmarks: Microbenchmark - Support in-place for NCCL/RCCL benchmark #591
* Bug Fix - Fix NUMA Domains Swap Issue in NDv4 Topology File #592
* Benchmarks: Microbenchmark - Add data type option for NCCL and RCCL tests #595
* Benchmarks: Bug Fix - Make metrics of dist-inference-cpp aligned with PyTorch version #596
* CI/CD - Add ndv5 topo file #597
* Benchmarks: Microbenchmark - Improve AMD GPU P2P performance with fine-grained GPU memory #593
* Benchmarks: Build Pipeline - fix nccl and nccl test version to 2.18.3 to resolve hang issue in cuda12.2 docker #599
* Dockerfile - Bug fix for rocm docker build and deploy #598
* Benchmarks: Microbenchmark - Adapt to hipblasLt data type changes #603
* Benchmarks: Micro benchmarks - Update hipblaslt metric unit to tflops #604
* Monitor - Upgrade pyrsmi to amdsmi python library. #601
* Benchmarks: Micro benchmarks - add fp8 and initialization for hipblaslt benchmark #605
* Dockerfile - Add rocm6.0 dockerfile #602
* Bug Fix - Bug fix for latest megatron-lm benchmark #600
* Docs - Upgrade version and release note #606

Co-authored-by: Ziyue Yang <ziyyang@microsoft.com>
Co-authored-by: Yang Wang <yangwang1@microsoft.com>
Co-authored-by: Yuting Jiang <yutingjiang@microsoft.com>
Co-authored-by: guoshzhao <guzhao@microsoft.com>
2024-01-08 05:40:52 +00:00
Ziyue Yang 719a427fe7
Benchmarks: Microbenchmark - Add distributed inference benchmark cpp implementation (#586)
**Description**
Add distributed inference benchmark cpp implementation.
2023-12-11 06:53:51 +08:00
Ziyue Yang 4fa60be7cd
Benchmarks: Micro benchmark - Add one-to-all, all-to-one, all-to-all support to gpu_copy_bw_performance (#588)
**Description**
Add one-to-all, all-to-one, and all-to-all support to
gpu_copy_bw_performance, and fix a performance bug in gpu_copy.
2023-12-08 23:22:38 +08:00
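The three copy patterns named in #588 can be pictured as sets of (source, destination) GPU pairs. The following is a minimal sketch of that idea only; the function name `copy_pairs`, the `root` parameter, and the integer GPU indices are hypothetical and are not taken from gpu_copy_bw_performance.

```python
def copy_pairs(pattern, num_gpus, root=0):
    """Enumerate (src, dst) GPU index pairs for a copy pattern.

    Hypothetical helper for illustration; not the benchmark's own code.
    """
    others = [gpu for gpu in range(num_gpus) if gpu != root]
    if pattern == "one-to-all":
        # A single source GPU copies to every other GPU.
        return [(root, dst) for dst in others]
    if pattern == "all-to-one":
        # Every other GPU copies to a single destination GPU.
        return [(src, root) for src in others]
    if pattern == "all-to-all":
        # Every GPU copies to every other GPU.
        return [(src, dst)
                for src in range(num_gpus)
                for dst in range(num_gpus)
                if src != dst]
    raise ValueError(f"unknown pattern: {pattern}")
```

For example, `copy_pairs("one-to-all", 3)` yields the pairs from GPU 0 to GPUs 1 and 2.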
Yuting Jiang dd5a6329ed
Benchmarks: Add benchmark: Megatron-LM/Megatron-Deepspeed GPT pretrain benchmark (#582)
**Description**
Megatron-LM/Megatron-Deepspeed GPT pretrain benchmark
2023-12-07 09:37:09 +08:00
Yifan Xiong 7184bdd1ed
Benchmarks - Update result parsing in tensorrt inference (#541)
* Update result parsing for newer tensorrt versions
* Update arguments when loading torchvision models
2023-06-30 11:22:46 +08:00
rafsalas19 655bd0aa59
Adding HPL benchmark (#482)
**Description**

- Adding HPL benchmark

---------

Co-authored-by: Ubuntu <azureuser@sbtestvm.jzlku1oskncengjiado35wf1hd.ax.internal.cloudapp.net>
Co-authored-by: Peng Cheng <chengpeng5555@outlook.com>
2023-03-21 16:44:08 +00:00
rafsalas19 32896ca477
Adding Stream Benchmark (#473)
**Description**

- Added stream benchmark
- Added stream unit test
- Added stream example
- Modified docker files to build stream

---------

Co-authored-by: Ubuntu <azureuser@sbtestvm.jzlku1oskncengjiado35wf1hd.ax.internal.cloudapp.net>
Co-authored-by: Peng Cheng <chengpeng5555@outlook.com>
Co-authored-by: Yifan Xiong <xiongyf@yandex.com>
2023-02-13 15:34:37 -05:00
Yang Wang 8e748d5649
Runner - Generate host groups file in mpi mode (#458)
**Major Revision**

- Add a pattern option that generates the mpi_pattern.txt file when
a path is specified.
- In mpi pattern, serial_index and parallel_index are added to each
benchmark as environment variables.

**Minor Revision**
- Fix typo
2023-01-04 19:49:14 +08:00
Yang Wang 65e433c0c6
Runner: Support `topo-aware` and `k-batch` pattern in 'mpi' mode (#437)
**Description**
Support the following patterns in `mpi` mode:
* `k-batch`
* `topo-aware`
2023-01-03 10:28:35 +00:00
Yifan Xiong 63e9b2d1bc
Release - SuperBench v0.6.0 (#409)
**Description**

Cherry-pick bug fixes from v0.6.0 to main.

**Major Revisions**

* Enable latency test in ib traffic validation distributed benchmark (#396)
* Enhance parameter parsing to allow spaces in value (#397)
* Update apt packages in dockerfile (#398)
* Upgrade colorlog for NO_COLOR support (#404)
* Analyzer - Update error handling to support exit code of sb result diagnosis (#403)
* Analyzer - Make baseline file optional in data diagnosis and fix bugs (#399)
* Enhance timeout cleanup to avoid possible hanging (#405)
* Auto generate ibstat file by pssh (#402)
* Analyzer - Format int type and unify empty value to N/A in diagnosis output file (#406)
* Docs - Upgrade version and release note (#407)
* Docs - Fix issues in document (#408)

Co-authored-by: Yang Wang <yangwang1@microsoft.com>
Co-authored-by: Yuting Jiang <yutingjiang@microsoft.com>
2022-09-06 18:06:05 +08:00
Yuting Jiang 10a79c4ea8
Analyzer - Add support for both jsonl and json format in data diagnosis (#388)
**Description**
Add support for both jsonl and json format in data diagnosis.

**Major Revision**
- Add support for both jsonl and json format in data diagnosis


**Minor Revision**
- change related doc
- add jsonl support in cli
2022-08-22 10:57:00 +08:00
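Accepting both formats, as #388 describes, mostly comes down to choosing between one JSON document per file and one JSON object per line. A minimal sketch, assuming the format is picked from the file extension; the function name `read_results` is hypothetical and this is not the analyzer's actual loader.

```python
import json
from pathlib import Path


def read_results(path):
    """Load results from a .json or .jsonl file into a list of dicts.

    Hypothetical helper for illustration; dispatching on the file
    extension is an assumption, not something the commit states.
    """
    text = Path(path).read_text(encoding="utf-8")
    if str(path).endswith(".jsonl"):
        # JSON Lines: one JSON object per non-empty line.
        return [json.loads(line) for line in text.splitlines() if line.strip()]
    data = json.loads(text)
    # A plain JSON file may hold a single object or a list of objects.
    return data if isinstance(data, list) else [data]
```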
Yuting Jiang b5c7c85d17
Analyzer: Rename fields in json of data diagnosis to be more readable (#382)
**Description**
Rename field in data diagnosis to be more readable.

**Major Revision**
- rename fields according to diagnosis/metric format

**Minor Revision**
- change type of diagnosis/issue_num to be int
2022-08-09 10:03:50 +08:00
Yuting Jiang ec16d42564
Analyzer - Add failure check feature in data diagnosis (#378)
**Description**
Add failure check feature in data diagnosis.

**Major Revision**
- Add a failure check rule op: if any metric_regex is not matched by any metric in the result, label it as failedtest
- Split performance issue and failedtest in categories


**Minor Revision**
- replace DataFrame.append() with pd.concat since append() will be removed in a later version of pandas
2022-08-01 12:35:35 +08:00
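The failure check in #378 can be read as: every expected metric pattern must match at least one metric name in a result, otherwise the result is labeled failedtest. A minimal sketch under that reading; the function names, the `"ok"` label, and the use of `re.fullmatch` rather than `re.match` are assumptions for illustration, not the analyzer's implementation.

```python
import re


def find_unmatched_patterns(metric_patterns, result_metrics):
    """Return the patterns that match none of the metric names in a result."""
    unmatched = []
    for pattern in metric_patterns:
        regex = re.compile(pattern)
        # Assumption: a pattern must match a whole metric name.
        if not any(regex.fullmatch(name) for name in result_metrics):
            unmatched.append(pattern)
    return unmatched


def label_result(metric_patterns, result_metrics):
    """Return ('failedtest', missing) if any pattern is unmatched, else ('ok', [])."""
    missing = find_unmatched_patterns(metric_patterns, result_metrics)
    return ("failedtest", missing) if missing else ("ok", [])
```

A result that is missing a whole benchmark's metrics is then reported separately from one whose metrics are present but slow.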
Jie Zhang ef4d65745b
Support topo-aware IB performance validation (#373)
* Support topo-aware IB performance validation

Add a new pattern `topo-aware`, so the user can run the IB performance
test based on the VMs' topology information. This way, the user can
validate IB performance across VM pairs at different distances
as a quick test instead of a pair-wise test.

To run with topo-aware pattern, user needs to specify three required
(and two optional) parameters in YAML config file:
--pattern	topo-aware
--ibstat	path to ibstat output
--ibnetdiscover	path to ibnetdiscover output
--min_dist	minimum distance of VM pairs (optional, default 2)
--max_dist	maximum distance of VM pairs (optional, default 6)

The newly added topo_aware module then parses the topology
information, builds a graph, and generates the VM pairs with
the specified distance (# hops).

The specified IB test then runs across these
generated VM pairs.

Signed-off-by: Jie Zhang <jessezhang1010@gmail.com>

* Add description about topology aware ib traffic tests

Signed-off-by: Jie Zhang <jessezhang1010@gmail.com>

* Add unit test to verify generated topology aware config file

This commit adds a unit test to verify that the generated topology aware
config file is correct. To do so, four new data files are added in
order to invoke the gen_topo_aware_config function to generate a topology
aware config file, which is then compared with the expected config file.

Signed-off-by: Jie Zhang <jessezhang1010@gmail.com>

* Fix lint issue on Azure pipeline

Signed-off-by: Jie Zhang <jessezhang1010@gmail.com>
2022-07-26 16:56:19 -07:00
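The topo-aware description in #373 (build a graph from the topology, then generate VM pairs at a given hop distance) can be sketched with a breadth-first search. This is an illustration only: it starts from an adjacency dict that is already built, skips the ibstat and ibnetdiscover parsing entirely, and enumerates every qualifying pair, which may differ from how the topo_aware module selects pairs. The function names are hypothetical; the `min_dist` and `max_dist` defaults mirror the values quoted in the commit message.

```python
from collections import deque


def hop_distances(graph, source):
    """Breadth-first search: hop count from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in graph.get(node, ()):
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist


def vm_pairs_by_distance(graph, vms, min_dist=2, max_dist=6):
    """Group VM pairs by hop count, keeping counts within [min_dist, max_dist]."""
    pairs = {}
    ordered = sorted(vms)
    for i, src in enumerate(ordered):
        dist = hop_distances(graph, src)
        for dst in ordered[i + 1:]:
            hops = dist.get(dst)  # None if dst is unreachable from src
            if hops is not None and min_dist <= hops <= max_dist:
                pairs.setdefault(hops, []).append((src, dst))
    return pairs
```

With two VMs under one switch and a third under another switch joined through a spine, the first two are 2 hops apart and each is 4 hops from the third.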
Yuting Jiang 54da021b4d
Analyzer - Fix bugs in data diagnosis (#355)
**Description**
Fix bugs in data diagnosis.

**Major Revision**
- add support to get baseline of the metric which uses custom benchmark naming with ':' like 'nccl-bw:default/allreduce_8_bw:0'
- save raw data of all metrics rather than metrics defined in diagnosis_rules.yaml when output_all is True
- fix a bug that used the wrong column index when applying formatting (red color and percentile) in the Excel output
2022-06-01 17:12:38 +08:00
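The metric name quoted in #355, 'nccl-bw:default/allreduce_8_bw:0', shows why ':' is awkward to split on: it appears both in the custom benchmark name and at the end of the metric. A minimal sketch of one way to take such a name apart, assuming the layout is `<benchmark>/<metric>` with an optional trailing `:<number>` that is treated here as a rank index. That layout and the function name are assumptions for illustration, not the analyzer's actual parsing.

```python
import re


def split_metric_name(full_name):
    """Split '<benchmark>/<metric>[:<rank>]' into (benchmark, metric, rank).

    Hypothetical helper; the name layout is an assumption.
    """
    # Split on the first '/', so any ':' in the benchmark name is preserved.
    benchmark, _, metric = full_name.partition("/")
    # Peel off a trailing ':<digits>' suffix from the metric, if present.
    match = re.fullmatch(r"(.+):(\d+)", metric)
    if match:
        return benchmark, match.group(1), int(match.group(2))
    return benchmark, metric, None
```

Splitting on '/' first keeps the ':default' part attached to the benchmark name, so a baseline lookup keyed on benchmark and metric is not confused by the custom naming.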
Yuting Jiang 55b0f9d239
Analyzer: Add Feature - Output results of all nodes in data diagnosis (#336)
**Description**
Output results of all nodes in data diagnosis.
2022-04-10 18:57:15 +08:00
Yuting Jiang 84fed1ce18
Analyzer: Add feature - Add result summary in excel,md,html format (#320)
**Description**
Add result summary in excel,md,html format.

**Major Revision**
- Add ResultSummary class to support result summary in excel,md,html format.
- Abstract RuleBase class for commonly used functions in DataDiagnosis and ResultSummary.
2022-03-24 15:32:01 +08:00
rafsalas19 ff51a3cee9
Benchmarks: Add Feature - Add GPU-Burn as microbenchmark (#324)
**Description**
Modifications adding GPU-Burn to SuperBench.
- added third party submodule
- modified Makefile to make gpu-burn binary
- added/modified microbenchmarks to add gpu-burn python scripts
- modified default and azure_ndv4 configs to add gpu-burn
2022-03-16 16:20:11 +08:00
Yuting Jiang b3c95f1827
Analyzer - Add md and html output format for DataDiagnosis (#325)
**Description**
Add md and html output format for DataDiagnosis.

**Major Revision**
- add md and html support in file_handler
- add interface in DataDiagnosis for md and HTML output

**Minor Revision**
- move excel and json output interface into DataDiagnosis
2022-03-15 18:04:11 +08:00
Ziyue Yang 6cdf759543
Benchmarks: Revise Code - Eliminate NUMA binding for device-to-device tests in gpu_copy (#302)
**Description**
This commit removes NUMA binding for device-to-device tests because NUMA doesn't affect their performance, and revises benchmark metrics accordingly.
2022-02-09 20:30:42 +08:00
Ziyue Yang 74421ffee0
Benchmarks: Add Feature - Add bidirectional test support in gpu_copy benchmark (#285)
**Description**
This commit adds bidirectional tests in gpu_copy benchmark for both device-host transfer and device-device transfer, and revises related tests.
2022-01-21 13:45:37 +08:00
Yifan Xiong ff563b66af
Release - SuperBench v0.4.0 (#278)
__Description__

Cherry-pick bug fixes from v0.4.0 to main.

__Major Revisions__

* Bug - Fix issues for Ansible and benchmarks (#267)
* Tests - Refine test cases for microbenchmark (#268)
* Bug - Build openmpi with ucx support in rocm dockerfiles (#269)
* Benchmarks: Fix Bug - Fix fio build issue (#272)
* Docs - Unify metric and add doc for cublas and cudnn functions (#271)
* Monitor: Revision - Add 'monitor/' prefix to monitor metrics in result summary (#274)
* Bug - Fix bug of detecting if gpu_index is none (#275)
* Bug - Fix bugs in data diagnosis (#273)
* Bug - Fix issue that the root mpi rank may not be the first in the hostfile (#270)
* Benchmarks: Configuration - Update inference and network benchmarks in configs (#276)
* Docs - Upgrade version and release note (#277)

Co-authored-by: Yuting Jiang <v-yutjiang@microsoft.com>
2021-12-30 16:24:00 +08:00
guoshzhao 6e357fb9d2
Monitor: Integration - Integrate monitor into Superbench (#259)
**Description**
Integrate monitor into Superbench.

**Major Revision**
- Initialize, start and stop monitor in SB executor.
- Parse the monitor data in SB runner and merge into benchmark results.
- Specify ReduceType for monitor metrics, such as MAX, MIN and LAST.
- Add monitor configs into config file.
2021-12-10 09:33:13 +00:00
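The reduce types mentioned in #259 describe how a series of monitor samples collapses into one value per metric. A minimal sketch of that idea: the `ReduceType` enum below is a stand-in written for this example with only the three members the commit names (the commit says "such as", so the real class may have more), and `reduce_samples` is a hypothetical helper.

```python
from enum import Enum


class ReduceType(Enum):
    """Stand-in enum for illustration; not SuperBench's own class."""
    MAX = "max"
    MIN = "min"
    LAST = "last"


def reduce_samples(samples, reduce_type):
    """Collapse a non-empty list of monitor samples into a single value."""
    if not samples:
        raise ValueError("samples must not be empty")
    if reduce_type is ReduceType.MAX:
        return max(samples)
    if reduce_type is ReduceType.MIN:
        return min(samples)
    if reduce_type is ReduceType.LAST:
        # Keep the most recent sample.
        return samples[-1]
    raise ValueError(f"unsupported reduce type: {reduce_type}")
```

Which reduction fits depends on the metric: a peak makes sense for something like temperature, while the last sample suits a cumulative counter.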
Yifan Xiong 8a00c8a03b
Benchmarks - Add TensorRT inference benchmark (#236)
__Description__

Add TensorRT inference benchmark for torchvision models.

__Major Revision__
- Measure TensorRT inference performance.
2021-11-12 15:27:16 +08:00