Граф коммитов

436 Коммитов

Автор SHA1 Сообщение Дата
Yifan Xiong 0591da5f49
Benchmarks - Add cuBLASLt FP16 and FP8 GEMM micro-benchmark (#451)
Add micro-benchmark for cublaslt fp8 gemm.
2023-01-03 05:28:56 +00:00
Yuting Jiang 678b1251f1
Benchmarks: Micro benchmarks - add source code of correctness check for cublas functions (#450)
**Description**
Add c source code of correctness check for cublas functions.

**Major Revision**
- add correctness check for all supported cublas functions
- add --correctness option into binary

**Minor Revision**
- fix bug and template fill_data and prepare_tensor to get right memory-alignment output matrix for different datatype
2023-01-03 04:20:10 +00:00
Yuting Jiang 9dfefce350
Executor - Add stdout logging util module and enable real-time logging flushing in executor (#445)
**Description**
Add stdout logging util module and enable real-time logging flushing in executor

**Major Revision**
- Add stdout logging util module to redirect stdout into file log
- enable stdout logging in executor to write benchmark output into both stdout and file `sb-bench.log`
- enable real-time log flushing in run_command of microbenchmarks through config `log_flushing`

**Minor Revision**
- add log_n_step args to enable regular step time log in model benchmarks 
- udpate related docs
2022-12-30 09:40:28 +00:00
Yang Wang f2634d8608
Benchmarks - Support `pair-wise` pattern in IB validation benchmark (#453)
**Description**
* Reuse `gen_pair_wise_config` in micro-benchmark
2022-12-30 13:02:52 +08:00
Yifan Xiong a3c65b2a57
Dockerfile - Add CUDA11.8 Docker image for Nvidia arch90 GPUs (#449)
Add Docker image for arch90 NVIDIA GPUs:

* add CUDA11.8 Dockerfile
* update archs in Makefile and benchmarks accordingly
* update image build pipeline
2022-12-29 12:19:38 +00:00
Yang Wang 7838b6b154
Runner - Support `pair-wise` pattern in `mpi` mode (#447)
* Extract pair-wise pattern from ib_validation
2022-12-29 08:23:36 +00:00
dependabot[bot] 6186146d59
Bump qs and express in /website (#440)
Bumps [qs](https://github.com/ljharb/qs) and [express](https://github.com/expressjs/express). These dependencies needed to be updated together.

Updates `qs` from 6.7.0 to 6.11.0
- [Release notes](https://github.com/ljharb/qs/releases)
- [Changelog](https://github.com/ljharb/qs/blob/main/CHANGELOG.md)
- [Commits](https://github.com/ljharb/qs/compare/v6.7.0...v6.11.0)

Updates `express` from 4.17.1 to 4.18.2
- [Release notes](https://github.com/expressjs/express/releases)
- [Changelog](https://github.com/expressjs/express/blob/master/History.md)
- [Commits](https://github.com/expressjs/express/compare/4.17.1...4.18.2)

---
updated-dependencies:
- dependency-name: qs
  dependency-type: indirect
- dependency-name: express
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2022-12-28 13:47:06 +08:00
dependabot[bot] de6deb0e2d
Bump decode-uri-component from 0.2.0 to 0.2.2 in /website (#439)
Bumps [decode-uri-component](https://github.com/SamVerschueren/decode-uri-component) from 0.2.0 to 0.2.2.
- [Release notes](https://github.com/SamVerschueren/decode-uri-component/releases)
- [Commits](https://github.com/SamVerschueren/decode-uri-component/compare/v0.2.0...v0.2.2)

---
updated-dependencies:
- dependency-name: decode-uri-component
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2022-12-27 13:40:49 +08:00
Yuting Jiang 6583ba2e40
Benchmark: Revision - Add wait time option to resolve mem-bw unstable issue (#438)
**Description**
Add wait time option to resolve mem-bw unstable issue.
2022-12-14 17:21:02 +08:00
Yuting Jiang 1deb2eaa29
downgrage transformers version to fix tersorrt (#441)
**Description**
Downgrage transformers version to fix tersorrt test failure.
2022-12-14 14:19:32 +08:00
Yang Wang e4eeda0afd
Runner - support 'pattern' in 'mpi' mode to run tasks in parallel (#430)
* add mpi-parallels mode

* update according to comments

* fix and update doc

* update

* merge into 'mpi' mode

* udpate according to comments

* fix testcases

* fix ansible

* regard pattern as field

* udpate

* fix flake8 version

* add flake8 range

* remove map-by from host config

* udpate comments
2022-11-29 12:30:10 +08:00
dependabot[bot] 3c97381fd2
Bump loader-utils from 1.4.0 to 1.4.2 in /website (#431)
Bumps [loader-utils](https://github.com/webpack/loader-utils) from 1.4.0 to 1.4.2.
- [Release notes](https://github.com/webpack/loader-utils/releases)
- [Changelog](https://github.com/webpack/loader-utils/blob/v1.4.2/CHANGELOG.md)
- [Commits](https://github.com/webpack/loader-utils/compare/v1.4.0...v1.4.2)

---
updated-dependencies:
- dependency-name: loader-utils
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2022-11-18 15:13:04 +08:00
Yang Wang 57f7403c47
Update typing-extensions version to fix pipeline issue (#432) 2022-11-17 19:39:52 +08:00
Yifan Xiong 1b86503d1e
CLI - Add non-zero return code for `sb [deploy,run]` (#425)
Add non-zero return code for `sb deploy` and `sb run` command when
there're Ansible failures in control plane.
Return code is set to count of failure.

For failures caused by benchmarks, return code is still set per benchmark
in results json file.
2022-11-01 10:46:19 +08:00
Yifan Xiong d7bb8303fb
CLI - Update version to include revision hash and date (#427)
Update version to include revision hash and date in "{last tag}+g{git
hash}.d{date}" format, here're the examples:
* exact tag: 0.6.0
* commit after tag: 0.6.0+gcbb1b34
* commit after tag with local changes: 0.6.0+gcbb1b34.d20221028
2022-10-31 10:44:41 +08:00
Yifan Xiong 7a27732e97
CI/CD - Fix codecov status issue (#426)
Fix codecov status issue due to service upgrade.
2022-10-27 17:41:06 +08:00
dependabot[bot] 12c48627d7
Bump ansi-html and webpack-dev-server in /website (#419)
Removes [ansi-html](https://github.com/Tjatse/ansi-html). It's no longer used after updating ancestor dependency [webpack-dev-server](https://github.com/webpack/webpack-dev-server). These dependencies need to be updated together.


Removes `ansi-html`

Updates `webpack-dev-server` from 3.11.2 to 3.11.3
- [Release notes](https://github.com/webpack/webpack-dev-server/releases)
- [Changelog](https://github.com/webpack/webpack-dev-server/blob/v3.11.3/CHANGELOG.md)
- [Commits](https://github.com/webpack/webpack-dev-server/compare/v3.11.2...v3.11.3)

---
updated-dependencies:
- dependency-name: ansi-html
  dependency-type: indirect
- dependency-name: webpack-dev-server
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2022-10-20 20:39:20 +08:00
Yifan Xiong 13c6919553
CI/CD - Update Codecov config (#418)
Update codecov flags in configuration.
2022-10-18 12:30:59 +08:00
Yuting Jiang 3367c4f6cc
Benchmarks - Add support to allow list of custom config string in cudnn-functions and cublas-functions (#414)
**Description**
Add support to allow list of custom config string in cudnn-functions and cublas-functions.
2022-10-18 09:59:51 +08:00
Yifan Xiong 63e9b2d1bc
Release - SuperBench v0.6.0 (#409)
**Description**

Cherry-pick bug fixes from v0.6.0 to main.

**Major Revisions**

* Enable latency test in ib traffic validation distributed benchmark (#396)
* Enhance parameter parsing to allow spaces in value (#397)
* Update apt packages in dockerfile (#398)
* Upgrade colorlog for NO_COLOR support (#404)
* Analyzer - Update error handling to support exit code of sb result diagnosis (#403)
* Analyzer - Make baseline file optional in data diagnosis and fix bugs (#399)
* Enhance timeout cleanup to avoid possible hanging (#405)
* Auto generate ibstat file by pssh (#402)
* Analyzer - Format int type and unify empty value to N/A in diagnosis output file (#406)
* Docs - Upgrade version and release note (#407)
* Docs - Fix issues in document (#408)

Co-authored-by: Yang Wang <yangwang1@microsoft.com>
Co-authored-by: Yuting Jiang <yutingjiang@microsoft.com>
2022-09-06 18:06:05 +08:00
Yuting Jiang 733860d715
Analyzer - Add support to store values of metrics in data diagnosis (#392)
**Description**
Add support to store values of metrics in data diagnosis.

Take the following rules as example: 
```
    nccl_store_rule:
      categories: NCCL_DIS
      store: True
      metrics:
        - nccl-bw:allreduce-run0/allreduce_1073741824_busbw
        - nccl-bw:allreduce-run1/allreduce_1073741824_busbw
        - nccl-bw:allreduce-run2/allreduce_1073741824_busbw
        - nccl-bw:allreduce-run3/allreduce_1073741824_busbw
        - nccl-bw:allreduce-run4/allreduce_1073741824_busbw
    nccl_rule:
      function: multi_rules
      criteria: 'lambda label:True if min(label["nccl_store_rule"].values())/max(label["nccl_store_rule"].values())<0.95 else False'
      categories: NCCL_DIS
```
**nccl_store_rule** will store the values of the metrics in dict and save them into `label["nccl_store_rule"]` , and then **rccl_rule** can use the values of metrics through `label["nccl_store_rule"].values()` in criteria
2022-08-23 03:25:32 +00:00
Yuting Jiang 10a79c4ea8
Analyzer - Add support for both jsonl and json format in data diagnosis (#388)
**Description**
Add support for both jsonl and json format in data diagnosis.

**Major Revision**
- Add support for both jsonl and json format in data diagnosis


**Minor Revision**
- change related doc
- add jsonl support in cli
2022-08-22 10:57:00 +08:00
Yifan Xiong 626ac0a463
Update Python setup for require packages (#387)
__Description__

Update Python setup for require packages.

__Major Revisions__
* downgrade requests version to be compatible with python 3.6, add corresponding pipeline for 3.6
* add extra entry in extras_require for nested packages
* update `pip install` contents accordingly
2022-08-17 11:33:57 +08:00
Yuting Jiang e335556d7a
Benchmarks: Build Pipeline - Degrade perftest submodule to fix stability issue (#386)
**Description**
Degrade perftest submodule to v4.4-0.37 to fix stability issue.

Issue: rdma-loopback is not stable on public version(v0.5/v0.6-rc1)
Docker Version: v0.6-rc1-cuda11.1
Testbed: 8 A100 40GB GPUs (1 NDv4 node)
Result: 
New perftest version introduce the variance, max-min/mean = 2% for v4.4-0.37, 8% for v4.5-0.2
2022-08-16 11:49:10 +08:00
Yang Wang faeee0a7cc
Auto generate ibstat file for topo aware traffic pattern (#381)
An enhancement for topo-aware IB performance validation #373.
This PR will auto-generate a required ibstate file `ib_traffic_topo_aware_ibstat.txt` which is used as input to build a graph.
2022-08-13 18:20:42 +08:00
Yuting Jiang b5c7c85d17
Analyzer: Rename fields in json of data diagnosis to be more readable (#382)
**Description**
Rename field in data diagnosis to be more readable.

**Major Revision**
- rename fields according to diagnosis/metric format

**Minor Revision**
- change type of diagnosis/issue_num to be int
2022-08-09 10:03:50 +08:00
Yifan Xiong 9c29c93114
Runner - Fix minimum timeout (#385)
Fix minimum timeout: use 60s if config is shorter.
2022-08-08 11:22:24 +08:00
Yifan Xiong 9b8df883ae
Gracefully exit when timeout (#383)
* Gracefully exit when timeout, add corresponding log and return code.
* Set minimum timeout to 1 minute and enlarge Ansible timeout.
2022-08-04 13:05:34 +08:00
Yuting Jiang ec16d42564
Analyzer - Add failure check feature in data diagnosis (#378)
**Description**
Add failure check feature in data diagnosis.

**Major Revision**
- Add failure check rule op to support that if there exists metric_regex not been matched by any metric in result, label as failedtest
- Split performance issue and failedtest in categories


**Minor Revision**
- replace DataFrame.append() with pd.concat since append() will be removed in later version of pandas
2022-08-01 12:35:35 +08:00
Jie Zhang ef4d65745b
Support topo-aware IB performance validation (#373)
* Support topo-aware IB performance validation

Add a new pattern `topo-aware`, so the user can run IB performance
test based on VM's topology information. This way, the user can
validate the IB performance across VM pairs with different distance
as a quick test instead of pair-wise test.

To run with topo-aware pattern, user needs to specify three required
(and two optional) parameters in YAML config file:
--pattern	topo-aware
--ibstat	path to ibstat output
--ibnetdiscover	path to ibnetdiscover output
--min_dist	minimum distance of VM pairs (optional, default 2)
--max_dist	maximum distance of VM pairs (optional, default 6)

The newly added topo_aware module then parses the topology
information, builds a graph, and generates the VM pairs with
the specified distance (# hops).

The specified IB test will then be running across these
generated VM pairs.

Signed-off-by: Jie Zhang <jessezhang1010@gmail.com>

* Add description about topology aware ib traffic tests

Signed-off-by: Jie Zhang <jessezhang1010@gmail.com>

* Add unit test to verify generated topology aware config file

This commit adds unit test to verify the generated topology aware
config file is correct. To do so, four new data files are added in
order to invoke gen_topo_aware_config function to generate topology
aware config file, then compares it with the expected config file.

Signed-off-by: Jie Zhang <jessezhang1010@gmail.com>

* Fix lint issue on Azure pipeline

Signed-off-by: Jie Zhang <jessezhang1010@gmail.com>
2022-07-26 16:56:19 -07:00
Yang Wang 5d448eedbf
Fix unexpected base conversion when the result value is negative (#377)
Fix an unexpected result value (`-0.125`) issue in ib traffic benchmark when encountering `-1` in raw output
* Check if the value is valid before the base conversion
* Add a test case to cover this situation
2022-07-25 15:27:46 +08:00
dependabot[bot] 02941e6e09
Bump terser from 4.8.0 to 4.8.1 in /website (#376)
Bumps [terser](https://github.com/terser/terser) from 4.8.0 to 4.8.1.
- [Release notes](https://github.com/terser/terser/releases)
- [Changelog](https://github.com/terser/terser/blob/master/CHANGELOG.md)
- [Commits](https://github.com/terser/terser/commits)

---
updated-dependencies:
- dependency-name: terser
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2022-07-22 11:52:43 +08:00
Yifan Xiong 352ae0c95f
Fix port conflict in ib loopback (#375)
Fix potential port conflict due to race condition between time-to-check
to time-to-use, by binding the port all through.

Modify the function to resolve flake8 C901 while keeping the logic same.
2022-07-20 11:30:00 +08:00
Yifan Xiong 16b6385dee
Add dependencies (#374)
Add dependencies

* include ndv4-topo.xml in cuda docker images
* require requests version to avoid RequestsDependencyWarning
2022-07-13 08:42:53 +00:00
Yifan Xiong b2875179bf
Fix issues in ib validation benchmark (#370)
Fix several issues in ib validation benchmark:
* continue running when timeout in the middle, instead of aborting whole mpi process
* make timeout parameter configurable, set default to 120 seconds
* avoid mixture of stdio and iostream when print to stdout
* set default message size to 8M which will saturate ib in most cases
* fix hostfile path issue so that it can be auto found in different cases
2022-07-09 19:57:11 +08:00
Yifan Xiong e00a8180f6
Support node_num=1 in mpi mode (#372)
Support `node_num: 1` in mpi mode, so that we can run mpi benchmarks in
both 1 node and all nodes in one config by changing `node_num`.
Update docs and add test case accordingly.
2022-07-08 09:24:17 +08:00
Yifan Xiong 9f03d5687a
Update dependencies and Dockerfile (#371)
Update dependencies and Dockerfile:
* upgrade nccl-tests and rccl-tests to current latest version to match
  NCCL/RCCL versions
* unify image tag names on DockerHub
* remove verbose output in Dockerfile and minor fix some flags
2022-07-06 10:31:41 +00:00
Yifan Xiong a94ead34b0
CLI - Support SKU auto detect if running on Azure VM (#365)
Support SKU auto detect and using corresponding benchmark config if running on Azure VM.
2022-07-05 10:52:39 +08:00
Yifan Xiong 620192a242
Fix issues in ib loopback benchmark (#369)
Fix several issues in ib loopback benchmark:
* use `--report_gbits` and divide by 8 to get GB/s, previous results are
  MiB/s / 1000
* use the ib_write_bw binary built in third_party instead of system path
* update the metrics name so that different hca indices have same metric
2022-06-29 17:53:02 +00:00
Yifan Xiong 8ef7163a18
Deployment - Refine error message when GPU is not detected (#368)
Refine error message when GPU is not detected.

Possible solutions if hardware exists and drivers are already installed:
* nvidia gpus:
  ```sh
  /sbin/modprobe nvidia-uvm
  D=`grep nvidia-uvm /proc/devices | awk '{print $1}'`
  mknod -m 666 /dev/nvidia-uvm c $D 0
  ```

* amd gpus
  ```sh
  modprobe amdgpu
  ```
2022-06-30 01:12:25 +08:00
Yifan Xiong 325a7338bf
Fix incorrect ulimit config in Dockerfile (#364)
Fix incorrect ulimit nofile config in Dockerfile.

Instead of bash, sh is used by default where `echo` does not accept any parameters and `-e` is written into /etc/security/limits.conf.
2022-06-24 14:14:00 +00:00
Yifan Xiong bfaa1c837b
Support multiple IB/GPU in ib validation (#363)
**Description**

Support multiple IB/GPU devices run simultaneously in ib validation benchmark.

**Major Revisions**
- Revise ib_validation_performance.cc so that multiple processes per node could be used to launch multiple perftest commands simultaneously. For each node pair in the config, number of processes per node will run in parallel.
- Revise ib_validation_performance.py to correct file paths and adjust parameters to specify different NICs/GPUs/NUMA nodes.
- Fix env issues in Dockerfile for end-to-end test.
- Update ib-traffic configuration examples in config files.
- Update unit tests and docs accordingly.

Closes #326.
2022-06-24 08:35:20 +00:00
Yifan Xiong 0f7b057a2d
Runner - Fix sudo issue when running without Docker (#362)
Fix sudo issue when running without Docker, user account could be
arbitrary in such case.
2022-06-19 11:56:36 +00:00
Yifan Xiong 483bf782e1
Update ROCm Dockerfile (#361)
**Description**

Update ROCm Dockerfile.

**Major Revisions**
- Add dockerfile for ROCm 5.1.3
- Merge 5.1.x and 5.0.x dockerfile
- Remove 4.2 and 4.0 legacy
- Update build pipeline accordingly
2022-06-19 17:26:39 +08:00
Yifan Xiong 60a3c74306
Fix cmake and build issues (#360)
**Description**

Fix cmake and build issues.

**Major Revision**

* Remove unnecessary boost build
* Remove user-agent for mlc
* Remove -j for third party to build each project in sequence
* Fix ansible collections installation path
2022-06-15 13:07:57 +08:00
Yifan Xiong a4937e95c6
Support `sb run` on host directly without Docker (#358)
**Description**

Support `sb run` on host directly without Docker

**Major Revisions**
- Add `--no-docker` argument for `sb run`.
- Run on host directly if `--no-docker` if specified.
- Update docs and tests correspondingly.
2022-06-14 10:57:01 +08:00
dependabot[bot] 528d69bd13
Bump eventsource from 1.1.0 to 1.1.1 in /website (#357)
Bumps [eventsource](https://github.com/EventSource/eventsource) from 1.1.0 to 1.1.1.
- [Release notes](https://github.com/EventSource/eventsource/releases)
- [Changelog](https://github.com/EventSource/eventsource/blob/master/HISTORY.md)
- [Commits](https://github.com/EventSource/eventsource/compare/v1.1.0...v1.1.1)

---
updated-dependencies:
- dependency-name: eventsource
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2022-06-06 11:39:35 +08:00
dependabot[bot] 77f8048ad8
Bump cross-fetch from 3.1.4 to 3.1.5 in /website (#349)
Bumps [cross-fetch](https://github.com/lquixada/cross-fetch) from 3.1.4 to 3.1.5.
- [Release notes](https://github.com/lquixada/cross-fetch/releases)
- [Commits](https://github.com/lquixada/cross-fetch/compare/v3.1.4...v3.1.5)

---
updated-dependencies:
- dependency-name: cross-fetch
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2022-06-02 19:06:54 +08:00
dependabot[bot] cdd19e6f30
Bump async from 2.6.3 to 2.6.4 in /website (#351)
Bumps [async](https://github.com/caolan/async) from 2.6.3 to 2.6.4.
- [Release notes](https://github.com/caolan/async/releases)
- [Changelog](https://github.com/caolan/async/blob/v2.6.4/CHANGELOG.md)
- [Commits](https://github.com/caolan/async/compare/v2.6.3...v2.6.4)

---
updated-dependencies:
- dependency-name: async
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2022-06-02 13:26:35 +08:00
Yuting Jiang 54da021b4d
Analyzer - Fix bugs in data diagnosis (#355)
**Description**
Fix bugs in data diagnosis.

**Major Revision**
- add support to get baseline of the metric which uses custom benchmark naming with ':' like 'nccl-bw:default/allreduce_8_bw:0'
- save raw data of all metrics rather than metrics defined in diagnosis_rules.yaml when output_all is True
- fix bug of using wrong column index when applying format(red color and percentile) in the excel
2022-06-01 17:12:38 +08:00