Граф коммитов

49 Коммитов

Автор SHA1 Сообщение Дата
Yang Wang 96cc4d9397
Bug: Executor - Fix executor for Benchmark Execution Without Explicit Framework Field (#636)
**Description**
Fix executor for Benchmark Execution Without Explicit Framework Field
2024-08-20 16:52:20 -07:00
Yang Wang 46a5792915
Bug Fix - Update Docker Exec Command for Persistent HPCX Environment (#635)
Add 10-hpcx.sh to /etc/profile.d
Update the Docker exec command to ensure a persistent HPCX environment.
2024-08-13 16:35:01 +00:00
Yang Wang 9a3ce39d5a
Update omegaconf version to 2.3.0 (#631)
Update `omegaconf` version to
[2.3.0](https://pypi.org/project/omegaconf/2.3.0/) as omegaconf 2.0.6
has a non-standard dependency specifier PyYAML>=5.1.*. pip 24.1 will
enforce this behaviour change.
Discussion can be found at https://github.com/pypa/pip/issues/12063.
2024-07-23 14:46:28 -07:00
Yifan Xiong 2c88db907f
Release - SuperBench v0.10.0 (#607)
**Description**

Cherry-pick bug fixes from v0.10.0 to main.

**Major Revisions**

* Benchmarks: Microbenchmark - Support different hipblasLt data types in dist_inference #590
* Benchmarks: Microbenchmark - Support in-place for NCCL/RCCL benchmark #591
* Bug Fix - Fix NUMA Domains Swap Issue in NDv4 Topology File #592
* Benchmarks: Microbenchmark - Add data type option for NCCL and RCCL tests #595
* Benchmarks: Bug Fix - Make metrics of dist-inference-cpp aligned with PyTorch version #596
* CI/CD - Add ndv5 topo file #597
* Benchmarks: Microbenchmark - Improve AMD GPU P2P performance with fine-grained GPU memory #593
* Benchmarks: Build Pipeline - fix nccl and nccl test version to 2.18.3 to resolve hang issue in cuda12.2 docker #599
* Dockerfile - Bug fix for rocm docker build and deploy #598
* Benchmarks: Microbenchmark - Adapt to hipblasLt data type changes #603
* Benchmarks: Micro benchmarks - Update hipblaslt metric unit to tflops #604
* Monitor - Upgrade pyrsmi to amdsmi python library. #601
* Benchmarks: Micro benchmarks - add fp8 and initialization for hipblaslt benchmark #605
* Dockerfile - Add rocm6.0 dockerfile #602
* Bug Fix - Bug fix for latest megatron-lm benchmark #600
* Docs - Upgrade version and release note #606

Co-authored-by: Ziyue Yang <ziyyang@microsoft.com>
Co-authored-by: Yang Wang <yangwang1@microsoft.com>
Co-authored-by: Yuting Jiang <yutingjiang@microsoft.com>
Co-authored-by: guoshzhao <guzhao@microsoft.com>
2024-01-08 05:40:52 +00:00
pnunna93 67f2aa7237
Benchmarks: model benchmarks - change torch.distributed.launch to torchrun (#556)
This PR has following changes
- torch.distributed.launch changed to torchrun. torch.distributed.launch
is deprecated in latest Pytorch and is recommended to move to torchrun -
https://pytorch.org/docs/stable/elastic/run.html
- Changes to AMD GPU detection logic. The AMD GPU detection logic throws
warning when containers have only renderD in /dev/dri, this change would
resolve those warnings

---------

Co-authored-by: Yuting Jiang <yutingjiang@microsoft.com>
2023-08-08 13:03:32 +08:00
Yuting Jiang ed027e4c8e
Tools - Add runner for sys info and update docs (#532)
**Description**
Add runner for sys info to automatically collect on multiple nodes and
update related docs.

**Major Revision**
- add runner for sys info which will check docker status and run `sb
node info` on all nodes' docker and fetch results from all nodes

**Minor Revision**
- update cli and system-info doc
- update sb node info to save output info output-dir/sys-info.json
2023-06-29 06:09:44 +00:00
Yifan Xiong a1cd3c9475
Runner - Add signal handler in runner (#530)
Add signal handler in runner to gracefully exit when receiving SIGINT
(<kbd>Ctrl</kbd>+<kbd>C</kbd>) or SIGTERM during benchmark execution.
2023-05-23 17:25:35 +08:00
Yifan Xiong 51761b3af1
Release - SuperBench v0.8.0 (#517)
**Description**

Cherry-pick bug fixes from v0.8.0 to main.

**Major Revisions**

* Monitor - Fix the cgroup version checking logic (#502)
* Benchmark - Fix matrix size overflow issue in cuBLASLt GEMM (#503)
* Fix wrong torch usage in communication wrapper for Distributed
Inference Benchmark (#505)
* Analyzer: Fix bug in python3.8 due to pandas api change (#504)
* Bug - Fix bug to get metric from cmd when error happens (#506)
* Monitor - Collect realtime GPU power when benchmarking (#507)
* Add num_workers argument in model benchmark (#511)
* Remove unreachable condition when write host list (#512)
* Update cuda11.8 image to cuda12.1 based on nvcr23.03 (#513)
* Doc - Fix wrong unit of cpu-memory-bw-latency in doc (#515)
* Docs - Upgrade version and release note (#508)

Co-authored-by: guoshzhao <guzhao@microsoft.com>
Co-authored-by: Ziyue Yang <ziyyang@microsoft.com>
Co-authored-by: Yuting Jiang <yutingjiang@microsoft.com>
2023-04-14 12:57:55 +00:00
Yuting Jiang 62a2913497
Executor - Support SuperBench Executor running on Windows (#475)
**Description**
Support SuperBench Executor running on Windows.

**Major Revision**
- Lazy import ansible related module
2023-02-13 08:20:07 +00:00
Yifan Xiong b07fda155e
Release - SuperBench v0.7.0 (#468)
**Description**

Cherry-pick bug fixes from v0.7.0 to main.

**Major Revisions**

* Benchmarks - Fix missing include in FP8 benchmark (#460)
* Fix bug in TE BERT model (#461)
* Doc - Update benchmark doc (#465)
* Bug: Fix bug for incorrect datatype judgement in cublas-function
source code (#464)
* Support `sb deploy` without pulling image (#466)
* Docs - Upgrade version and release note (#467)

Co-authored-by: Russell J. Hewett <russell.j.hewett@gmail.com>
Co-authored-by: Yuting Jiang <yutingjiang@microsoft.com>
2023-01-28 11:07:06 +08:00
Yang Wang 8e748d5649
Runner - Generate host groups file in mpi mode (#458)
**Major Revision**

- Add an option for pattern to generate mpi_pattern.txt file if
specified the path.
- In mpi pattern, serial_index and parallel_index will add in each
benchmark as environment variables.

**Minor Revision**
- Fix typo
2023-01-04 19:49:14 +08:00
Yang Wang 65e433c0c6
Runner: Support `topo-aware` and `k-batch` pattern in 'mpi' mode (#437)
**Description**
Support the following patterns  in `mpi` mode:
* `k-batch`
* `topo-aware`
2023-01-03 10:28:35 +00:00
Yang Wang 7838b6b154
Runner - Support `pair-wise` pattern in `mpi` mode (#447)
* Extract pair-wise pattern from ib_validation
2022-12-29 08:23:36 +00:00
Yang Wang e4eeda0afd
Runner - support 'pattern' in 'mpi' mode to run tasks in parallel (#430)
* add mpi-parallels mode

* update according to comments

* fix and update doc

* update

* merge into 'mpi' mode

* udpate according to comments

* fix testcases

* fix ansible

* regard pattern as field

* udpate

* fix flake8 version

* add flake8 range

* remove map-by from host config

* udpate comments
2022-11-29 12:30:10 +08:00
Yifan Xiong 1b86503d1e
CLI - Add non-zero return code for `sb [deploy,run]` (#425)
Add non-zero return code for `sb deploy` and `sb run` command when
there're Ansible failures in control plane.
Return code is set to count of failure.

For failures caused by benchmarks, return code is still set per benchmark
in results json file.
2022-11-01 10:46:19 +08:00
Yifan Xiong 63e9b2d1bc
Release - SuperBench v0.6.0 (#409)
**Description**

Cherry-pick bug fixes from v0.6.0 to main.

**Major Revisions**

* Enable latency test in ib traffic validation distributed benchmark (#396)
* Enhance parameter parsing to allow spaces in value (#397)
* Update apt packages in dockerfile (#398)
* Upgrade colorlog for NO_COLOR support (#404)
* Analyzer - Update error handling to support exit code of sb result diagnosis (#403)
* Analyzer - Make baseline file optional in data diagnosis and fix bugs (#399)
* Enhance timeout cleanup to avoid possible hanging (#405)
* Auto generate ibstat file by pssh (#402)
* Analyzer - Format int type and unify empty value to N/A in diagnosis output file (#406)
* Docs - Upgrade version and release note (#407)
* Docs - Fix issues in document (#408)

Co-authored-by: Yang Wang <yangwang1@microsoft.com>
Co-authored-by: Yuting Jiang <yutingjiang@microsoft.com>
2022-09-06 18:06:05 +08:00
Yifan Xiong 9c29c93114
Runner - Fix minimum timeout (#385)
Fix minimum timeout: use 60s if config is shorter.
2022-08-08 11:22:24 +08:00
Yifan Xiong 9b8df883ae
Gracefully exit when timeout (#383)
* Gracefully exit when timeout, add corresponding log and return code.
* Set minimum timeout to 1 minute and enlarge Ansible timeout.
2022-08-04 13:05:34 +08:00
Yifan Xiong b2875179bf
Fix issues in ib validation benchmark (#370)
Fix several issues in ib validation benchmark:
* continue running when timeout in the middle, instead of aborting whole mpi process
* make timeout parameter configurable, set default to 120 seconds
* avoid mixture of stdio and iostream when print to stdout
* set default message size to 8M which will saturate ib in most cases
* fix hostfile path issue so that it can be auto found in different cases
2022-07-09 19:57:11 +08:00
Yifan Xiong e00a8180f6
Support node_num=1 in mpi mode (#372)
Support `node_num: 1` in mpi mode, so that we can run mpi benchmarks in
both 1 node and all nodes in one config by changing `node_num`.
Update docs and add test case accordingly.
2022-07-08 09:24:17 +08:00
Yifan Xiong 8ef7163a18
Deployment - Refine error message when GPU is not detected (#368)
Refine error message when GPU is not detected.

Possible solutions if hardware exists and drivers are already installed:
* nvidia gpus:
  ```sh
  /sbin/modprobe nvidia-uvm
  D=`grep nvidia-uvm /proc/devices | awk '{print $1}'`
  mknod -m 666 /dev/nvidia-uvm c $D 0
  ```

* amd gpus
  ```sh
  modprobe amdgpu
  ```
2022-06-30 01:12:25 +08:00
Yifan Xiong bfaa1c837b
Support multiple IB/GPU in ib validation (#363)
**Description**

Support multiple IB/GPU devices run simultaneously in ib validation benchmark.

**Major Revisions**
- Revise ib_validation_performance.cc so that multiple processes per node could be used to launch multiple perftest commands simultaneously. For each node pair in the config, number of processes per node will run in parallel.
- Revise ib_validation_performance.py to correct file paths and adjust parameters to specify different NICs/GPUs/NUMA nodes.
- Fix env issues in Dockerfile for end-to-end test.
- Update ib-traffic configuration examples in config files.
- Update unit tests and docs accordingly.

Closes #326.
2022-06-24 08:35:20 +00:00
Yifan Xiong 0f7b057a2d
Runner - Fix sudo issue when running without Docker (#362)
Fix sudo issue when running without Docker, user account could be
arbitrary in such case.
2022-06-19 11:56:36 +00:00
Yifan Xiong a4937e95c6
Support `sb run` on host directly without Docker (#358)
**Description**

Support `sb run` on host directly without Docker

**Major Revisions**
- Add `--no-docker` argument for `sb run`.
- Run on host directly if `--no-docker` if specified.
- Update docs and tests correspondingly.
2022-06-14 10:57:01 +08:00
Yifan Xiong f755c0b659
Bug - Fix env path to absolute path (#327)
Fix env file path to absolute path in `docker exec`, in case there're mixed ssh and local connections or different users are used.
2022-03-09 17:16:43 +08:00
Yifan Xiong 1f48268bf5
Bug - Fix env file path (#310)
Fix env file path for `docker run`.
2022-02-15 15:23:43 +08:00
Yifan Xiong 3524975cfc
Config - Support customized env for all modes (#295)
Support customized env for all modes in configuration.
2022-01-29 08:19:48 +00:00
guoshzhao d03d110f55
Benchmarks: Add Feature - Sync the E2E training results among all workers for each step. (#287)
**Description**
Please write a brief description and link the related issue if have.

**Major Revision**
- Sync (do allreduce max) the E2E training results among all workers.
- Avoid using ':0' in metric name if there has only one rank having output.
2022-01-28 20:35:53 +08:00
guoshzhao d877ca2322
Benchmarks: Add Feature - Add timeout feature for each benchmark. (#288)
**Description**
Add timeout feature for each benchmark.

**Major Revision**
- Add `timeout` config for each benchmark. In current config files, only set the timeout for kernel-launch as example. Other benchmarks can be set in the future.
- Set the timeout config for `ansible_runner.run()`. Runner will get the return code 254:
   [ansible.py:80][WARNING] Run failed, return code 254.
- Using `timeout` command to terminate the client process.
2022-01-28 08:16:32 +00:00
Yifan Xiong 7d7cd3dc63
Config - Update benchmark naming to support annotations (#284)
__Description__

Update benchmark naming to support annotations.

__Major Revisions__
- Update name for `create_benchmark_context` in executor.
- Backward compatibility for model benchmarks using "_models" suffix.
- Update documents.
2022-01-25 09:54:58 +00:00
Yifan Xiong ff563b66af
Release - SuperBench v0.4.0 (#278)
__Description__

Cherry-pick  bug fixes from v0.4.0 to main.

__Major Revisions__

* Bug - Fix issues for Ansible and benchmarks (#267)
* Tests - Refine test cases for microbenchmark (#268)
* Bug - Build openmpi with ucx support in rocm dockerfiles (#269)
* Benchmarks: Fix Bug - Fix fio build issue (#272)
* Docs - Unify metric and add doc for cublas and cudnn functions (#271)
* Monitor: Revision - Add 'monitor/' prefix to monitor metrics in result summary (#274)
* Bug - Fix bug of detecting if gpu_index is none (#275)
* Bug - Fix bugs in data diagnosis (#273)
* Bug - Fix issue that the root mpi rank may not be the first in the hostfile (#270)
* Benchmarks: Configuration - Update inference and network benchmarks in configs (#276)
* Docs - Upgrade version and release note (#277)

Co-authored-by: Yuting Jiang <v-yutjiang@microsoft.com>
2021-12-30 16:24:00 +08:00
guoshzhao 6e357fb9d2
Monitor: Integration - Integrate monitor into Superbench (#259)
**Description**
Integrate monitor into Superbench.

**Major Revision**
- Initialize, start and stop monitor in SB executor.
- Parse the monitor data in SB runner and merge into benchmark results.
- Specify ReduceType for monitor metrics, such as MAX, MIN and LAST.
- Add monitor configs into config file.
2021-12-10 09:33:13 +00:00
Yifan Xiong 213ab14bea
Bug - Fix issues for distributed runs (#258)
Fix issues for distributed runs:
* fix config for memory bandwidth benchmarks
* add throttling for high concurrency docker pull
* update rsync path and exclude directories
* handle exceptions when creating summary
* tune for logging
2021-12-08 06:55:13 +00:00
Yifan Xiong dfbd70b129
Release - SuperBench v0.3.0 (#212)
**Description**

Cherry-pick  bug fixes from v0.3.0 to main.

**Major Revisions**
* Docs - Upgrade version and release note (#209)
* Benchmarks: Build Pipeline - Update rccl-test git submodule to dc1ad48 (#210)
* Benchmarks: Update - Update benchmarks in configuration file (#208)
* CI/CD - Update GitHub Action VM (#211)
* Benchmarks: Fix Bug - Fix wrong parameters for gpu-sm-copy-bw in configuration examples (#203)
* CI/CD - Fix bug in build image for push event (#205)
* Benchmark: Fix Bug - fix error message of communication-computation-overlap (#204)
* Tool: Fix bug - Fix function naming issue in system info  (#200)
* CI/CD - Push images in GitHub Action (#202)
* Bug - Fix torch.distributed command for single node (#201)
* CLI - Integrate system info for node (#199)
* Benchmarks: Code Revision - Revise CMake files for microbenchmarks. (#196)
* CI/CD - Add ROCm image build in GitHub Actions (#194)
* Bug: Fix bug - fix bug of hipBusBandwidth build (#193)
* Benchmarks: Build Pipeline - Restore rocblas build logic (#197)
* Bug: Fix Bug - Add barrier before 'destroy_process_group' in model benchmarks (#198)
* Bug - Revise 'docker run' in sb deploy (#195)
* Bug - Fix Bug : fix bug of error param operations to operation in rccl-bw of hpe config (#190)

Co-authored-by: Yuting Jiang <v-yujiang@microsoft.com>
Co-authored-by: Guoshuai Zhao <guzhao@microsoft.com>
Co-authored-by: Ziyue Yang <ziyyang@microsoft.com>
2021-09-26 09:30:31 +08:00
Yifan Xiong e2453e1cae
Runner - Fix inventory issue in ansible_runner (#185)
__Description__

Fix inventory bug in ansible_runner when host list is provided with multiple hosts.

It ought to be handled by ansible_runner lib, workaround by using `--inventory` arg in cmdline.
2021-09-02 13:24:48 +08:00
guoshzhao 7d947757ea
Benchmarks: Docker Benchmarks - Setup Docker-in-Docker environment (#180)
**Description**
Setup docker environment in docker container.

**Major Revision**
- Install docker client for cuda and rocm images.
- Mount /var/run/docker.sock from host
2021-09-01 16:35:00 +08:00
guoshzhao 7595d79434
Runner: Add Feature - Generate summarized output files. (#157)
**Description**
Generate the summarized output files from all nodes. For each metric, do the reduce operation according to the `reduce_op`

**Major Revision**
- Generate the summarized json file per node:
For microbenchmark, the format is `{benchmark_name}/[{run_count}/]{metric_name}[:rank]`
For modelbenchmark, the format is `{benchmark_name}/{sub_benchmark_name}/[{run_count}/]{metric_name}`
`[]` means optional.
```
{
  "kernel-launch/overhead_event:0": 0.00583,
  "kernel-launch/overhead_event:1": 0.00545,
  "kernel-launch/overhead_event:2": 0.00581,
  "kernel-launch/overhead_event:3": 0.00572,
  "kernel-launch/overhead_event:4": 0.00559,
  "kernel-launch/overhead_event:5": 0.00591,
  "kernel-launch/overhead_event:6": 0.00562,
  "kernel-launch/overhead_event:7": 0.00586,
  "resnet_models/pytorch-resnet50/steptime-train-float32": 544.0827468410134,
  "resnet_models/pytorch-resnet50/throughput-train-float32": 353.7607016465773,
  "resnet_models/pytorch-resnet50/steptime-train-float16": 425.40482617914677,
  "resnet_models/pytorch-resnet50/throughput-train-float16": 454.0142363793973,
  "pytorch-sharding-matmul/0/allreduce": 10.561786651611328,
  "pytorch-sharding-matmul/1/allreduce": 10.561786651611328,
  "pytorch-sharding-matmul/0/allgather": 10.088025093078613,
  "pytorch-sharding-matmul/1/allgather": 10.088025093078613
}
```
- Generate the summarized jsonl file for all nodes, each line is the result from one node in json format.
2021-08-20 16:48:40 +08:00
Yifan Xiong 98b6c0e3ca
Runner - Support mpi mode (#146)
Support mpi mode in runner:
* concate mpirun command
* support mca and env config
* prepare hostfile and update Ansible host pattern

Co-authored-by: Peng Cheng <chengpeng5555@outlook.com>
2021-08-19 15:59:17 +08:00
Yifan Xiong 69b2c631fc
Release - SuperBench v0.2.1 (#142)
__Description__
Cherry-pick bug fixes from v0.2.1 to main.

__Major Revisions__
* Fix bug of VGG models failed on A100 GPU with batch_size=128.
* Fix Ansible connection issue when running in localhost.
* Update version in packages and docs.
2021-07-29 17:52:28 +08:00
Yifan Xiong 7458f83a9b
Runner & Executor - Support AMD GPU (#119)
Support both NVIDIA and AMD GPU and check GPU vendor during deployment and execution.

* Add GPU environment check in sb deploy.
* Check GPU vendor in executor.
2021-07-09 00:42:49 +08:00
Yifan Xiong fb7d4a7396
Runner - Fetch benchmarks results on all nodes (#116)
Fetch benchmarks results on all nodes, will rsync after each benchmark.
The results directory structure on control node is as follows:

```
outputs/
└── datetime
    ├── nodes
    │   └── node-0
    │       ├── benchmarks
    │       │   ├── benchmark-0
    │       │   │   ├── rank-0
    │       │   │   │   └── results.json
    │       └── sb-exec.log
    ├── sb-run.log
    └── sb.config.yaml
```
2021-07-02 21:45:56 +08:00
Yifan Xiong 60ba63bb11
CLI - Support host-list for deploy and run commands (#108)
Support `--host-list` for deploy and run commands.

Before this change, an inventory file is needed to use `sb deploy/run`.
Now, `--host-list localhost` or `-l localhost` is sufficient for quick try.
2021-07-01 21:48:36 +08:00
Yifan Xiong 7b0b0e9add
CLI - Support custom output directory (#110)
* Support custom output directory.
* Update document.
2021-07-01 21:10:12 +08:00
Yifan Xiong c0c43b8f81
Bug bash - Fix bugs in multi GPU benchmarks (#98)
* Add `sb deploy` command content.
* Fix inline if-expression syntax in playbook.
* Fix quote escape issue in bash command.
* Add custom env in config.
* Update default config for multi GPU benchmarks.
* Update MANIFEST.in to include jinja2 template.
* Require jinja2 minimum version.
* Fix occasional duplicate output in Ansible runner.
* Fix mixed color from Ansible and Python colorlog.
* Update according to comments.
* Change superbench.env from list to dict in config file.
2021-06-23 18:16:43 +08:00
Yifan Xiong 6b0ca1cb05
Runner - Support local mode in runner (#88)
* Support local mode in runner.
2021-06-02 23:58:44 +08:00
Yifan Xiong 8b4f613a76
Runner - Support torch.distributed mode in runner (#81)
* Support `torch.distributed` mode in runner.
* Support given `proc_num` and `node_num` in `torch.distributed` mode.
2021-05-28 12:29:39 +08:00
Yifan Xiong e7f6d8ba78
CI/CD - Add integration tests for Ansible playbooks (#82)
* Add integration tests for Ansible playbooks
* Add `gpu_vendor` var to bypass gpu mount
2021-05-26 20:04:49 +08:00
Yifan Xiong c05e173b3d
Runner - Implement ansible client and runner (#69)
Implement ansible client and runner:
* add ansible client
* add deploy and check_env playbooks
2021-05-23 23:53:37 +08:00
Yifan Xiong f73d1adec5
Runner: Init - Add superbench runner class (#38)
* init runner class with not implemented
2021-04-12 12:02:31 +08:00