Граф коммитов

36 Коммитов

Автор SHA1 Сообщение Дата
Yifan Xiong 2c88db907f
Release - SuperBench v0.10.0 (#607)
**Description**

Cherry-pick bug fixes from v0.10.0 to main.

**Major Revisions**

* Benchmarks: Microbenchmark - Support different hipblasLt data types in dist_inference #590
* Benchmarks: Microbenchmark - Support in-place for NCCL/RCCL benchmark #591
* Bug Fix - Fix NUMA Domains Swap Issue in NDv4 Topology File #592
* Benchmarks: Microbenchmark - Add data type option for NCCL and RCCL tests #595
* Benchmarks: Bug Fix - Make metrics of dist-inference-cpp aligned with PyTorch version #596
* CI/CD - Add ndv5 topo file #597
* Benchmarks: Microbenchmark - Improve AMD GPU P2P performance with fine-grained GPU memory #593
* Benchmarks: Build Pipeline - fix nccl and nccl test version to 2.18.3 to resolve hang issue in cuda12.2 docker #599
* Dockerfile - Bug fix for rocm docker build and deploy #598
* Benchmarks: Microbenchmark - Adapt to hipblasLt data type changes #603
* Benchmarks: Micro benchmarks - Update hipblaslt metric unit to tflops #604
* Monitor - Upgrade pyrsmi to amdsmi python library. #601
* Benchmarks: Micro benchmarks - add fp8 and initialization for hipblaslt benchmark #605
* Dockerfile - Add rocm6.0 dockerfile #602
* Bug Fix - Bug fix for latest megatron-lm benchmark #600
* Docs - Upgrade version and release note #606

Co-authored-by: Ziyue Yang <ziyyang@microsoft.com>
Co-authored-by: Yang Wang <yangwang1@microsoft.com>
Co-authored-by: Yuting Jiang <yutingjiang@microsoft.com>
Co-authored-by: guoshzhao <guzhao@microsoft.com>
2024-01-08 05:40:52 +00:00
guoshzhao 028819b388
Monitor - Add support for AMD GPU. (#580)
**Description**
Add AMD support in monitor.

**Major Revision**
- Add library pyrsmi to collect metrics.
- Currently can get device_utilization, device_power, device_used_memory
and device_total_memory.
2023-11-27 18:45:56 +08:00
pnunna93 67f2aa7237
Benchmarks: model benchmarks - change torch.distributed.launch to torchrun (#556)
This PR has following changes
- torch.distributed.launch changed to torchrun. torch.distributed.launch
is deprecated in latest Pytorch and is recommended to move to torchrun -
https://pytorch.org/docs/stable/elastic/run.html
- Changes to AMD GPU detection logic. The AMD GPU detection logic throws
warning when containers have only renderD in /dev/dri, this change would
resolve those warnings

---------

Co-authored-by: Yuting Jiang <yutingjiang@microsoft.com>
2023-08-08 13:03:32 +08:00
Yuting Jiang af4cfd5bbf
Benchmarks: micro benchmarks - add python code for DirecXGPUMemBw (#547)
**Description**
add python code for DirecXGPUMemBw.
2023-07-05 22:07:13 +08:00
Yuting Jiang bbb0e24342
Benchmarks - Add support for DirectX GPU platform (#536)
**Description**
Add support for DirectX GPU platform.

**Major Revision**
- Add DirectX platform for benchmark registry
- Add gpu_vendor identify for AMD and NVIDIA with win driver
2023-06-21 01:58:13 +00:00
Yifan Xiong 51761b3af1
Release - SuperBench v0.8.0 (#517)
**Description**

Cherry-pick bug fixes from v0.8.0 to main.

**Major Revisions**

* Monitor - Fix the cgroup version checking logic (#502)
* Benchmark - Fix matrix size overflow issue in cuBLASLt GEMM (#503)
* Fix wrong torch usage in communication wrapper for Distributed
Inference Benchmark (#505)
* Analyzer: Fix bug in python3.8 due to pandas api change (#504)
* Bug - Fix bug to get metric from cmd when error happens (#506)
* Monitor - Collect realtime GPU power when benchmarking (#507)
* Add num_workers argument in model benchmark (#511)
* Remove unreachable condition when write host list (#512)
* Update cuda11.8 image to cuda12.1 based on nvcr23.03 (#513)
* Doc - Fix wrong unit of cpu-memory-bw-latency in doc (#515)
* Docs - Upgrade version and release note (#508)

Co-authored-by: guoshzhao <guzhao@microsoft.com>
Co-authored-by: Ziyue Yang <ziyyang@microsoft.com>
Co-authored-by: Yuting Jiang <yutingjiang@microsoft.com>
2023-04-14 12:57:55 +00:00
guoshzhao a9b45a072e
Monitor - Support cgroup V2 when read system metrics. (#491)
**Description**
Since ubuntu 22.04 will use cgroup V2 and the file structure changed.
Modify the monitor to adapt to cgroup v1 and v2.
2023-03-22 08:33:18 +00:00
Yuting Jiang 62a2913497
Executor - Support SuperBench Executor running on Windows (#475)
**Description**
Support SuperBench Executor running on Windows.

**Major Revision**
- Lazy import ansible related module
2023-02-13 08:20:07 +00:00
Yang Wang 8e748d5649
Runner - Generate host groups file in mpi mode (#458)
**Major Revision**

- Add an option for pattern to generate mpi_pattern.txt file if
specified the path.
- In mpi pattern, serial_index and parallel_index will add in each
benchmark as environment variables.

**Minor Revision**
- Fix typo
2023-01-04 19:49:14 +08:00
Yang Wang 65e433c0c6
Runner: Support `topo-aware` and `k-batch` pattern in 'mpi' mode (#437)
**Description**
Support the following patterns  in `mpi` mode:
* `k-batch`
* `topo-aware`
2023-01-03 10:28:35 +00:00
Yuting Jiang 9dfefce350
Executor - Add stdout logging util module and enable real-time logging flushing in executor (#445)
**Description**
Add stdout logging util module and enable real-time logging flushing in executor

**Major Revision**
- Add stdout logging util module to redirect stdout into file log
- enable stdout logging in executor to write benchmark output into both stdout and file `sb-bench.log`
- enable real-time log flushing in run_command of microbenchmarks through config `log_flushing`

**Minor Revision**
- add log_n_step args to enable regular step time log in model benchmarks 
- udpate related docs
2022-12-30 09:40:28 +00:00
Yang Wang 7838b6b154
Runner - Support `pair-wise` pattern in `mpi` mode (#447)
* Extract pair-wise pattern from ib_validation
2022-12-29 08:23:36 +00:00
Yang Wang e4eeda0afd
Runner - support 'pattern' in 'mpi' mode to run tasks in parallel (#430)
* add mpi-parallels mode

* update according to comments

* fix and update doc

* update

* merge into 'mpi' mode

* udpate according to comments

* fix testcases

* fix ansible

* regard pattern as field

* udpate

* fix flake8 version

* add flake8 range

* remove map-by from host config

* udpate comments
2022-11-29 12:30:10 +08:00
Yifan Xiong 63e9b2d1bc
Release - SuperBench v0.6.0 (#409)
**Description**

Cherry-pick bug fixes from v0.6.0 to main.

**Major Revisions**

* Enable latency test in ib traffic validation distributed benchmark (#396)
* Enhance parameter parsing to allow spaces in value (#397)
* Update apt packages in dockerfile (#398)
* Upgrade colorlog for NO_COLOR support (#404)
* Analyzer - Update error handling to support exit code of sb result diagnosis (#403)
* Analyzer - Make baseline file optional in data diagnosis and fix bugs (#399)
* Enhance timeout cleanup to avoid possible hanging (#405)
* Auto generate ibstat file by pssh (#402)
* Analyzer - Format int type and unify empty value to N/A in diagnosis output file (#406)
* Docs - Upgrade version and release note (#407)
* Docs - Fix issues in document (#408)

Co-authored-by: Yang Wang <yangwang1@microsoft.com>
Co-authored-by: Yuting Jiang <yutingjiang@microsoft.com>
2022-09-06 18:06:05 +08:00
Yang Wang faeee0a7cc
Auto generate ibstat file for topo aware traffic pattern (#381)
An enhancement for topo-aware IB performance validation #373.
This PR will auto-generate a required ibstate file `ib_traffic_topo_aware_ibstat.txt` which is used as input to build a graph.
2022-08-13 18:20:42 +08:00
Jie Zhang ef4d65745b
Support topo-aware IB performance validation (#373)
* Support topo-aware IB performance validation

Add a new pattern `topo-aware`, so the user can run IB performance
test based on VM's topology information. This way, the user can
validate the IB performance across VM pairs with different distance
as a quick test instead of pair-wise test.

To run with topo-aware pattern, user needs to specify three required
(and two optional) parameters in YAML config file:
--pattern	topo-aware
--ibstat	path to ibstat output
--ibnetdiscover	path to ibnetdiscover output
--min_dist	minimum distance of VM pairs (optional, default 2)
--max_dist	maximum distance of VM pairs (optional, default 6)

The newly added topo_aware module then parses the topology
information, builds a graph, and generates the VM pairs with
the specified distance (# hops).

The specified IB test will then be running across these
generated VM pairs.

Signed-off-by: Jie Zhang <jessezhang1010@gmail.com>

* Add description about topology aware ib traffic tests

Signed-off-by: Jie Zhang <jessezhang1010@gmail.com>

* Add unit test to verify generated topology aware config file

This commit adds unit test to verify the generated topology aware
config file is correct. To do so, four new data files are added in
order to invoke gen_topo_aware_config function to generate topology
aware config file, then compares it with the expected config file.

Signed-off-by: Jie Zhang <jessezhang1010@gmail.com>

* Fix lint issue on Azure pipeline

Signed-off-by: Jie Zhang <jessezhang1010@gmail.com>
2022-07-26 16:56:19 -07:00
Yifan Xiong a94ead34b0
CLI - Support SKU auto detect if running on Azure VM (#365)
Support SKU auto detect and using corresponding benchmark config if running on Azure VM.
2022-07-05 10:52:39 +08:00
Yuting Jiang 35fc06ebd1
Bug: Fix code insecure issue that binds a socket to all network interfaces (#291)
**Description**
Fix code insecure issue that binds a socket to all network interfaces.
2022-01-24 10:59:06 +00:00
guoshzhao cc70f9c18c
Benchmarks: Add Feature - Extend the device manager utility to support more functions. (#239)
**Description**
Rename `nvidia_helper` utility as `device_manager` module and support more functions:
```
device_manager.get_device_count()
device_manager.get_device_utilization(idx)
device_manager.get_device_temperature(idx)
device_manager.get_device_power_limit(idx)
device_manager.get_device_memory(idx)
device_manager.get_device_row_remapped_info(idx)
device_manager.get_device_ecc_error(idx)
```
2021-11-15 14:24:04 +08:00
guoshzhao 8cd264fdeb
Benchmarks: Code Revision - Revise subprocess invoke (#178)
**Description**
Package frequently-used subprocess invoke into function.
2021-08-31 15:34:04 +08:00
Yuting Jiang 71c1617b2e
Utils: Code Revision - Update network common utils (#118)
Update network common utils. Add get_ib_devices in network common utils and move get_free_port from test utils to network common utils
2021-07-13 16:05:01 +08:00
guoshzhao 9c984c7eb0
Bug bash - Merge fix from release/0.2 to main (#124)
* Bug Fix - Fix race condition issue for multi ranks (#117)

Fix race condition issue when multi ranks rotating the same directory.

* Update pipeline for release branch (#122)

* Bug Fix - Fix bug when convert bool config to store_true argument. (#120)

Co-authored-by: Yifan Xiong <yifan.xiong@microsoft.com>
2021-07-09 16:54:42 +08:00
Yifan Xiong 7458f83a9b
Runner & Executor - Support AMD GPU (#119)
Support both NVIDIA and AMD GPU and check GPU vendor during deployment and execution.

* Add GPU environment check in sb deploy.
* Check GPU vendor in executor.
2021-07-09 00:42:49 +08:00
Yifan Xiong fb7d4a7396
Runner - Fetch benchmarks results on all nodes (#116)
Fetch benchmarks results on all nodes, will rsync after each benchmark.
The results directory structure on control node is as follows:

```
outputs/
└── datetime
    ├── nodes
    │   └── node-0
    │       ├── benchmarks
    │       │   ├── benchmark-0
    │       │   │   ├── rank-0
    │       │   │   │   └── results.json
    │       └── sb-exec.log
    ├── sb-run.log
    └── sb.config.yaml
```
2021-07-02 21:45:56 +08:00
Yifan Xiong 7b0b0e9add
CLI - Support custom output directory (#110)
* Support custom output directory.
* Update document.
2021-07-01 21:10:12 +08:00
Yifan Xiong c0c43b8f81
Bug bash - Fix bugs in multi GPU benchmarks (#98)
* Add `sb deploy` command content.
* Fix inline if-expression syntax in playbook.
* Fix quote escape issue in bash command.
* Add custom env in config.
* Update default config for multi GPU benchmarks.
* Update MANIFEST.in to include jinja2 template.
* Require jinja2 minimum version.
* Fix occasional duplicate output in Ansible runner.
* Fix mixed color from Ansible and Python colorlog.
* Update according to comments.
* Change superbench.env from list to dict in config file.
2021-06-23 18:16:43 +08:00
Yifan Xiong ddbc51a135
Bug bash - Fix bugs and refine log in single GPU benchmarks (#97)
Fix bugs and refine log in single GPU benchmarks:

* Fix none framework issue
* Fix empty parameter bug
* Remove missed mobilenet_v3 models
* Change benchmark registration log to debug level
* Add pid in logging
* Add missing benchmarks in default config
* Fix deprecated logging warn
2021-06-16 13:51:22 +08:00
guoshzhao 331c740a15
Benchmarks: Add Feature - Add nvml package to provide python interfaces of nvidia. (#91) 2021-06-01 23:31:07 +08:00
Yifan Xiong 977b1a7355
CLI - Refine CLI handlers (#68)
* use absolute path of input file
* parse registry uri from image
* merge common parts for arguments processing
2021-05-18 11:34:15 +08:00
Yifan Xiong 57ce473a02
Utils - Support lazy import (#67)
__Major Revision__

* Support lazy import.
* Not importing benchmarks when running `help`, `version`, `deploy` commands, etc.
2021-05-11 10:49:22 +08:00
Yifan Xiong 0e2b2b0829
Update logger (#28)
Update logger class.
* add file handler along with stream handler
* add colored formatter
2021-03-29 14:06:55 +08:00
Yifan Xiong 91b44bc5a1
CLI: Code Revision - Use omegaconf to replace hydra for configuration (#27)
Use omegaconf to replace hydra for configuration system:
* remove hydra
* use omegaconf to merge configurations
2021-03-26 21:19:17 +08:00
guoshzhao 31b6f0851a
Benchmarks: Code Revision - Support benchmark re-registration, keep the latest one. (#23)
* support benchmark re-registration.

* address comments

Co-authored-by: Guoshuai Zhao <guzhao@microsoft.com>
2021-03-18 13:47:10 +08:00
Yifan Xiong 5d11579a10
CLI - Add command sb [version,deploy,exec,run] (#10)
- Add CLI commands
  * sb version
  * sb deploy
  * sb exec
  * sb run
- Add interface with executor and runner
- Add cli test cases
2021-03-12 13:16:43 +08:00
guoshzhao abc6c991af
fix typos (#14)
Co-authored-by: Guoshuai Zhao <guzhao@microsoft.com>
2021-03-04 13:15:03 +08:00
guoshzhao 4c87a3e419
Benchmarks: Initialization - Add base class, registry, and result (#1)
* benchmarks init.

Co-authored-by: Guoshuai Zhao <guzhao@microsoft.com>
2021-02-24 12:43:24 +08:00