Граф коммитов

65 Коммитов

Автор SHA1 Сообщение Дата
Yuting Jiang e39489c404
Benchmarks - Add configurations for NDv5 and AMD MI300 (#652)
**Description**
Add configurations for NDv5 and AMD MI300.

**Major Revision**
- Add cublaslt , FP8 transformers training, dist-inference cpp in NDv5
config example
- Add hipblaslt , FP8 transformers training, dist-inference cpp in AMD
MI300 config example
2024-09-26 22:40:00 +08:00
Yuting Jiang e65629a89b
Dockerfile - Add ROCm6.2 dockerfile (#647)
**Description**
Add ROCm6.2 dockerfile.
2024-09-22 11:49:08 +08:00
Yuting Jiang 1d2d054b86
Dockerfile - upgrade nccl version and install ucx to fix bug in cuda 12.4 docker file (#646)
**Description**
fix bug in cuda 12.4 docker file


**Major Revision**
- upgrade nccl due to OOM bug in nccl v2.20 graph mode
- install ucx 1.16 for mutli thread support for mpi in ib-traffic
2024-09-20 09:37:25 +00:00
Yuting Jiang f42835aa87
Dockerfile - Update hpcx link in cuda11.1 dockerfile to fix CI (#648)
**Description**
Update hpcx link in cuda11.1 dockerfile to fix CI
2024-09-20 06:20:11 +00:00
Yang Wang 46a5792915
Bug Fix - Update Docker Exec Command for Persistent HPCX Environment (#635)
Add 10-hpcx.sh to /etc/profile.d
Update the Docker exec command to ensure a persistent HPCX environment.
2024-08-13 16:35:01 +00:00
Yuting Jiang 2101e933cc
CI/CD - Fix MSCCL build error in CUDA12.4 docker build pipeline (#633)
**Description**
Fix MSCCL build error in CUDA12.4 docker build pipeline due to OOM
issue.
2024-07-28 23:43:06 +00:00
Yuting Jiang 7435f10a22
Dockerfile - Add CUDA 12.4 dockerfile (#619)
**Description**
Add CUDA 12.4 dockerfile.

**Major Revision**
- upgrade nvidia docker into 23.04


**Minor Revision**
- upgrade hpcx into 2.18
2024-04-22 06:36:19 +00:00
Yuting Jiang dc3846cbd4
Dockerfile - Upgrade mlc to v3.11 (#620)
**Description**
Upgrade mlc to v3.11.
2024-04-18 10:59:36 +08:00
Yang Wang eeaa9b1ac9
Bug Fix - Bug fix for cuda 12.2 dockerfile LD_LIBRARY_PATH issue (#614)
**Description**
Cuda 12.2 image will report undfined symbol error due to incomplete
LD_LIBRARY_PATH:


![image](https://github.com/microsoft/superbenchmark/assets/25875482/1a7c48c7-cb6b-4e3a-abbe-dde23007a96b)

### How to reproduce:
1. Deploy sb with cuda12.2 image
```
sb deploy -f local.ini -i superbench/superbench:v0.10.0-cuda12.2
```
2. Enter to the container
```
sudo docker exec -it sb-workspace bash
```
3. Execute `mpirun`:
```
root@sb-container:~# mpirun
mpirun: symbol lookup error: mpirun: undefined symbol: opal_libevent2022_event_base_loop
```
### Fix to fix
* Append hpcx_load into /etc/bash.bashrc for updaing env LD_LIBRARY_PATH in each time

---------
2024-03-21 15:05:55 +00:00
Yifan Xiong 2c88db907f
Release - SuperBench v0.10.0 (#607)
**Description**

Cherry-pick bug fixes from v0.10.0 to main.

**Major Revisions**

* Benchmarks: Microbenchmark - Support different hipblasLt data types in dist_inference #590
* Benchmarks: Microbenchmark - Support in-place for NCCL/RCCL benchmark #591
* Bug Fix - Fix NUMA Domains Swap Issue in NDv4 Topology File #592
* Benchmarks: Microbenchmark - Add data type option for NCCL and RCCL tests #595
* Benchmarks: Bug Fix - Make metrics of dist-inference-cpp aligned with PyTorch version #596
* CI/CD - Add ndv5 topo file #597
* Benchmarks: Microbenchmark - Improve AMD GPU P2P performance with fine-grained GPU memory #593
* Benchmarks: Build Pipeline - fix nccl and nccl test version to 2.18.3 to resolve hang issue in cuda12.2 docker #599
* Dockerfile - Bug fix for rocm docker build and deploy #598
* Benchmarks: Microbenchmark - Adapt to hipblasLt data type changes #603
* Benchmarks: Micro benchmarks - Update hipblaslt metric unit to tflops #604
* Monitor - Upgrade pyrsmi to amdsmi python library. #601
* Benchmarks: Micro benchmarks - add fp8 and initialization for hipblaslt benchmark #605
* Dockerfile - Add rocm6.0 dockerfile #602
* Bug Fix - Bug fix for latest megatron-lm benchmark #600
* Docs - Upgrade version and release note #606

Co-authored-by: Ziyue Yang <ziyyang@microsoft.com>
Co-authored-by: Yang Wang <yangwang1@microsoft.com>
Co-authored-by: Yuting Jiang <yutingjiang@microsoft.com>
Co-authored-by: guoshzhao <guzhao@microsoft.com>
2024-01-08 05:40:52 +00:00
Yuting Jiang 1f5031bd74
Dockerfile - Upgrade to rocm5.7 dockerfile (#587)
**Description**
upgrade to rocm5.7 dockerfile.

---------

Co-authored-by: yukirora <yuting.jiang@microsoft.com>
2023-12-09 17:41:12 +00:00
Ziyue Yang 6ef3a0110f
Benchmarks: Add MSCCL Support for Nvidia GPU (#584)
**Description**
Add MSCCL support for Nvidia GPU
2023-12-07 19:57:28 +08:00
Yuting Jiang dd5a6329ed
Benchmarks: Add benchmark: Megatron-LM/Megatron-Deepspeed GPT pretrain benchmark (#582)
**Description**
Megatron-LM/Megatron-Deepspeed GPT pretrain benchmark
2023-12-07 09:37:09 +08:00
Yifan Xiong 1ad1c21c38
Dockerfile - Upgrade Docker image to CUDA 12.2 (#577)
Upgrade Docker image to CUDA 12.2 for H100:
* upgrade base image to 23.10
* fix onnxruntime version in python3.10
* fix compilation errors
2023-11-22 13:48:18 +00:00
Yuting Jiang 79089b6517
Benchmarks: Micro benchmark - Add hipBLASLt function benchmark (#576)
**Description**
hipblaslt function benchmark and rebase cublaslt function benchmark.
2023-11-22 19:48:10 +08:00
guoshzhao 9f4880cb8e
Analyzer - Generate baseline given results from multiple nodes. (#575)
**Description**
Generate baseline given results from multiple nodes. 

**Major Revision**
- Add sub command `sb result generate-baseline`
- Add UT and docs

---------

Co-authored-by: 454314380 <454314380@qq.com>
Co-authored-by: Yuting Jiang <yutingjiang@microsoft.com>
2023-11-22 14:42:32 +08:00
Yuting Jiang d246bab430
Dockerfile - update mlc version into 3.10 for cuda and rocm dockerfiles (#562)
**Description**
Update mlc version into 3.10 for cuda and rocm dockerfiles to be
consistent with cuda12 dockerfile

Co-authored-by: yukirora <yuting.jiang@microsoft.com>
2023-10-23 11:21:17 +08:00
Yuting Jiang 27a10811af
Benchmarks: micro benchmark - source code for evaluating NVDEC decoding performance (#560)
**Description**
source code for evaluating NVDEC decoding performance.

---------

Co-authored-by: yukirora <yuting.jiang@microsoft.com>
2023-08-22 10:56:33 +00:00
Yuting Jiang e8ac0b1e28
Benchmarks: micro benchmarks - add python code for DirectXGPUEncodingLatency (#548)
**Description**
add python code for DirectXGPUEncodingLatency.
2023-07-06 15:31:28 +08:00
Yuting Jiang 865472177f
Benchmarks: Build Pipeline - add AMF in third party and build AMF encoding latency test (#543)
**Description**
add AMF in third party and build AMF encoding latency test.
2023-07-03 14:43:21 +00:00
Yuting Jiang 44ef531465
Dockerfile - Add SuperBench Windows Dockerfile (#534)
**Description**
Add dockerfile for win10 and building script for directx_benchmarks.

**Major Revision**
- Add docker file for win10 and required scripts to install the
dependency
- Add building script to build all directx vs benchmarks
- Add call of building script in Makefile

---------

Co-authored-by: yukirora <yuting.jiang@microsoft.com>
Co-authored-by: Yifan Xiong <yifan.xiong@microsoft.com>
2023-06-28 05:35:11 +00:00
Yifan Xiong 51761b3af1
Release - SuperBench v0.8.0 (#517)
**Description**

Cherry-pick bug fixes from v0.8.0 to main.

**Major Revisions**

* Monitor - Fix the cgroup version checking logic (#502)
* Benchmark - Fix matrix size overflow issue in cuBLASLt GEMM (#503)
* Fix wrong torch usage in communication wrapper for Distributed
Inference Benchmark (#505)
* Analyzer: Fix bug in python3.8 due to pandas api change (#504)
* Bug - Fix bug to get metric from cmd when error happens (#506)
* Monitor - Collect realtime GPU power when benchmarking (#507)
* Add num_workers argument in model benchmark (#511)
* Remove unreachable condition when write host list (#512)
* Update cuda11.8 image to cuda12.1 based on nvcr23.03 (#513)
* Doc - Fix wrong unit of cpu-memory-bw-latency in doc (#515)
* Docs - Upgrade version and release note (#508)

Co-authored-by: guoshzhao <guzhao@microsoft.com>
Co-authored-by: Ziyue Yang <ziyyang@microsoft.com>
Co-authored-by: Yuting Jiang <yutingjiang@microsoft.com>
2023-04-14 12:57:55 +00:00
rafsalas19 655bd0aa59
Adding HPL benchmark (#482)
**Description**

- Adding HPL benchmark

---------

Co-authored-by: Ubuntu <azureuser@sbtestvm.jzlku1oskncengjiado35wf1hd.ax.internal.cloudapp.net>
Co-authored-by: Peng Cheng <chengpeng5555@outlook.com>
2023-03-21 16:44:08 +00:00
Yifan Xiong 35f5390512
Pin setuptools version to v65.7.0 (#483)
Pin setuptools version to
[v65.7.0](https://setuptools.pypa.io/en/latest/history.html#v65-7-0) to
avoid breaking changes since v66.0.0.
2023-03-06 11:43:44 +00:00
rafsalas19 32896ca477
Adding Stream Benchmark (#473)
**Description**

- Added stream benchmark
- Added stream unit test
- Added stream example
- Modified docker files to build stream

---------

Co-authored-by: Ubuntu <azureuser@sbtestvm.jzlku1oskncengjiado35wf1hd.ax.internal.cloudapp.net>
Co-authored-by: Peng Cheng <chengpeng5555@outlook.com>
Co-authored-by: Yifan Xiong <xiongyf@yandex.com>
2023-02-13 15:34:37 -05:00
pnunna93 f21bfef2f3
Dockerfile: Remove fixed rccl version in rocm5.1.x docker file (#476)
**Description**
The commit(e08b6d3a1c) installs a rccl
version which is causing "undefined symbol: ncclGetLastError" while
trying to import torch. Revert it to avoid the error.
2023-02-07 15:24:26 +08:00
Yifan Xiong a3c65b2a57
Dockerfile - Add CUDA11.8 Docker image for Nvidia arch90 GPUs (#449)
Add Docker image for arch90 NVIDIA GPUs:

* add CUDA11.8 Dockerfile
* update archs in Makefile and benchmarks accordingly
* update image build pipeline
2022-12-29 12:19:38 +00:00
Yifan Xiong d7bb8303fb
CLI - Update version to include revision hash and date (#427)
Update version to include revision hash and date in "{last tag}+g{git
hash}.d{date}" format, here're the examples:
* exact tag: 0.6.0
* commit after tag: 0.6.0+gcbb1b34
* commit after tag with local changes: 0.6.0+gcbb1b34.d20221028
2022-10-31 10:44:41 +08:00
Yifan Xiong 63e9b2d1bc
Release - SuperBench v0.6.0 (#409)
**Description**

Cherry-pick bug fixes from v0.6.0 to main.

**Major Revisions**

* Enable latency test in ib traffic validation distributed benchmark (#396)
* Enhance parameter parsing to allow spaces in value (#397)
* Update apt packages in dockerfile (#398)
* Upgrade colorlog for NO_COLOR support (#404)
* Analyzer - Update error handling to support exit code of sb result diagnosis (#403)
* Analyzer - Make baseline file optional in data diagnosis and fix bugs (#399)
* Enhance timeout cleanup to avoid possible hanging (#405)
* Auto generate ibstat file by pssh (#402)
* Analyzer - Format int type and unify empty value to N/A in diagnosis output file (#406)
* Docs - Upgrade version and release note (#407)
* Docs - Fix issues in document (#408)

Co-authored-by: Yang Wang <yangwang1@microsoft.com>
Co-authored-by: Yuting Jiang <yutingjiang@microsoft.com>
2022-09-06 18:06:05 +08:00
Yifan Xiong 626ac0a463
Update Python setup for require packages (#387)
__Description__

Update Python setup for require packages.

__Major Revisions__
* downgrade requests version to be compatible with python 3.6, add corresponding pipeline for 3.6
* add extra entry in extras_require for nested packages
* update `pip install` contents accordingly
2022-08-17 11:33:57 +08:00
Yang Wang faeee0a7cc
Auto generate ibstat file for topo aware traffic pattern (#381)
An enhancement for topo-aware IB performance validation #373.
This PR will auto-generate a required ibstate file `ib_traffic_topo_aware_ibstat.txt` which is used as input to build a graph.
2022-08-13 18:20:42 +08:00
Yifan Xiong 16b6385dee
Add dependencies (#374)
Add dependencies

* include ndv4-topo.xml in cuda docker images
* require requests version to avoid RequestsDependencyWarning
2022-07-13 08:42:53 +00:00
Yifan Xiong 9f03d5687a
Update dependencies and Dockerfile (#371)
Update dependencies and Dockerfile:
* upgrade nccl-tests and rccl-tests to current latest version to match
  NCCL/RCCL versions
* unify image tag names on DockerHub
* remove verbose output in Dockerfile and minor fix some flags
2022-07-06 10:31:41 +00:00
Yifan Xiong 325a7338bf
Fix incorrect ulimit config in Dockerfile (#364)
Fix incorrect ulimit nofile config in Dockerfile.

Instead of bash, sh is used by default where `echo` does not accept any parameters and `-e` is written into /etc/security/limits.conf.
2022-06-24 14:14:00 +00:00
Yifan Xiong bfaa1c837b
Support multiple IB/GPU in ib validation (#363)
**Description**

Support multiple IB/GPU devices run simultaneously in ib validation benchmark.

**Major Revisions**
- Revise ib_validation_performance.cc so that multiple processes per node could be used to launch multiple perftest commands simultaneously. For each node pair in the config, number of processes per node will run in parallel.
- Revise ib_validation_performance.py to correct file paths and adjust parameters to specify different NICs/GPUs/NUMA nodes.
- Fix env issues in Dockerfile for end-to-end test.
- Update ib-traffic configuration examples in config files.
- Update unit tests and docs accordingly.

Closes #326.
2022-06-24 08:35:20 +00:00
Yifan Xiong 483bf782e1
Update ROCm Dockerfile (#361)
**Description**

Update ROCm Dockerfile.

**Major Revisions**
- Add dockerfile for ROCm 5.1.3
- Merge 5.1.x and 5.0.x dockerfile
- Remove 4.2 and 4.0 legacy
- Update build pipeline accordingly
2022-06-19 17:26:39 +08:00
Yifan Xiong 60a3c74306
Fix cmake and build issues (#360)
**Description**

Fix cmake and build issues.

**Major Revision**

* Remove unnecessary boost build
* Remove user-agent for mlc
* Remove -j for third party to build each project in sequence
* Fix ansible collections installation path
2022-06-15 13:07:57 +08:00
Yuting Jiang 3f135e4669
Dockerfile - Add support to run sb command inside docker image (#356)
**Description**
Add support to run sb command inside docker image - install missing dependency.
2022-06-01 01:11:28 +08:00
Yuting Jiang e08b6d3a1c
Dockerfile: Update rccl version and fix issue in rocm5.1.1 dockerfile (#354)
**Description**
Update rccl version and fix issue in rocm5.1.1 dockerfile.
2022-05-27 10:46:40 +08:00
Yuting Jiang 81a4146bc1
Dockerfile - Add dockerfile for rocm5.1.1 (#353)
**Description**
Add dockerfile for rocm5.1.1.
2022-05-25 20:28:11 +08:00
Yuting Jiang 425b9ff865
Dockerfile - Add dockerfile for rocm5.0.1 (#319)
**Description**
Add dockerfile for rocm5.0.1.
2022-02-28 19:30:43 +08:00
Yuting Jiang a4950a707e
Dockerfile - Add rocm5.0 dockerfile (#307)
**Description**
Add rocm5.0 dockerfile.
2022-02-26 07:12:45 +08:00
Ziyue Yang 433785fd0c
Benchmarks: Add Feature - Add GDR-only nccl-tests for Nvidia machines (#299)
This commit adds GDR-only nccl-tests for Nvidia machines. Also bump NCCL to v2.10.3-1 to achieve peak performance in this test.
2022-02-08 17:59:48 +08:00
Yifan Xiong ff563b66af
Release - SuperBench v0.4.0 (#278)
__Description__

Cherry-pick  bug fixes from v0.4.0 to main.

__Major Revisions__

* Bug - Fix issues for Ansible and benchmarks (#267)
* Tests - Refine test cases for microbenchmark (#268)
* Bug - Build openmpi with ucx support in rocm dockerfiles (#269)
* Benchmarks: Fix Bug - Fix fio build issue (#272)
* Docs - Unify metric and add doc for cublas and cudnn functions (#271)
* Monitor: Revision - Add 'monitor/' prefix to monitor metrics in result summary (#274)
* Bug - Fix bug of detecting if gpu_index is none (#275)
* Bug - Fix bugs in data diagnosis (#273)
* Bug - Fix issue that the root mpi rank may not be the first in the hostfile (#270)
* Benchmarks: Configuration - Update inference and network benchmarks in configs (#276)
* Docs - Upgrade version and release note (#277)

Co-authored-by: Yuting Jiang <v-yutjiang@microsoft.com>
2021-12-30 16:24:00 +08:00
Hossein Pourreza b590409e0f
Benchmarks: Add Benchmark - Add mlc benchmark to superbench (#216)
**Description**
Add mlc memory bandwidth and latency micro benchmark to Superbench.

**Major Revision**
- Add mlc benchmark with test and example files
2021-12-13 13:47:42 +08:00
guoshzhao 4d85630abb
Benchmarks: Add Benchmark - Add ONNXRuntime inference benchmark based on ORT python API (#245)
**Description**
Add ONNXRuntime inference benchmark based on ORT python API.

**Major Revision**
- Add `ORTInferenceBenchmark` class to export pytorch model to onnx model and do inference
- Add tests and example for `ort-inference` benchmark
- Update the introduction docs.
2021-12-10 13:53:11 +00:00
Ziyue Yang 008e0fe1d8
Benchmarks: Add Feature - Add CPU-initiated copy and dtod support to gpu-sm-copy benchmark (#230)
**Description**
This commit does the following:
1) Adds CPU-initiated copy benchmark;
2) Adds dtod benchmark;
3) Support scanning NUMA nodes and GPUs inside the benchmark program;
4) Change the name of gpu-sm-copy to gpu-copy.
2021-10-30 11:19:09 +08:00
Ziyue Yang 6a068e2534
Dockerfiles: Fix Bug - Fix ROCm GPG file URL (#234)
**Description**
This commit fixes the URL of ROCm GPG file.
2021-10-30 07:26:52 +08:00
Yuting Jiang b592a7c793
Benchmarks: Build Pipeline - Add gpcnet as git submodule and building logic (#228)
**Description**
Add gpcnet as git submodule and building logic.

**Major Revision**
- add gpcnet as a submodule
- add build logic in third_party/Makefile
2021-10-21 11:28:51 +00:00
Yifan Xiong dfbd70b129
Release - SuperBench v0.3.0 (#212)
**Description**

Cherry-pick  bug fixes from v0.3.0 to main.

**Major Revisions**
* Docs - Upgrade version and release note (#209)
* Benchmarks: Build Pipeline - Update rccl-test git submodule to dc1ad48 (#210)
* Benchmarks: Update - Update benchmarks in configuration file (#208)
* CI/CD - Update GitHub Action VM (#211)
* Benchmarks: Fix Bug - Fix wrong parameters for gpu-sm-copy-bw in configuration examples (#203)
* CI/CD - Fix bug in build image for push event (#205)
* Benchmark: Fix Bug - fix error message of communication-computation-overlap (#204)
* Tool: Fix bug - Fix function naming issue in system info  (#200)
* CI/CD - Push images in GitHub Action (#202)
* Bug - Fix torch.distributed command for single node (#201)
* CLI - Integrate system info for node (#199)
* Benchmarks: Code Revision - Revise CMake files for microbenchmarks. (#196)
* CI/CD - Add ROCm image build in GitHub Actions (#194)
* Bug: Fix bug - fix bug of hipBusBandwidth build (#193)
* Benchmarks: Build Pipeline - Restore rocblas build logic (#197)
* Bug: Fix Bug - Add barrier before 'destroy_process_group' in model benchmarks (#198)
* Bug - Revise 'docker run' in sb deploy (#195)
* Bug - Fix Bug : fix bug of error param operations to operation in rccl-bw of hpe config (#190)

Co-authored-by: Yuting Jiang <v-yujiang@microsoft.com>
Co-authored-by: Guoshuai Zhao <guzhao@microsoft.com>
Co-authored-by: Ziyue Yang <ziyyang@microsoft.com>
2021-09-26 09:30:31 +08:00