Commit graph

305 commits

Author SHA1 Message Date
dependabot[bot] 02941e6e09
Bump terser from 4.8.0 to 4.8.1 in /website (#376)
Bumps [terser](https://github.com/terser/terser) from 4.8.0 to 4.8.1.
- [Release notes](https://github.com/terser/terser/releases)
- [Changelog](https://github.com/terser/terser/blob/master/CHANGELOG.md)
- [Commits](https://github.com/terser/terser/commits)

---
updated-dependencies:
- dependency-name: terser
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2022-07-22 11:52:43 +08:00
Yifan Xiong 352ae0c95f
Fix port conflict in ib loopback (#375)
Fix a potential port conflict caused by a time-of-check to time-of-use
race condition, by keeping the port bound throughout.

Modify the function to resolve flake8 C901 while keeping the logic the same.
2022-07-20 11:30:00 +08:00
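The idea behind the fix in #375 can be sketched as follows. This is a minimal illustration rather than the code from the commit: the helper names `reserve_port` and `can_bind` are hypothetical, and how the reserved port is handed to the benchmark is not described in the commit message. The point is that the socket stays open while the port is needed, so no other process can claim the port between the check and the use.

```python
import socket


def reserve_port(host="127.0.0.1"):
    """Bind to an OS-assigned free port and return (socket, port).

    The caller keeps the returned socket open for as long as the port
    must stay reserved. Hypothetical helper, not the SuperBench function.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.bind((host, 0))  # port 0 asks the OS to pick any free port
    return sock, sock.getsockname()[1]


def can_bind(port, host="127.0.0.1"):
    """Return True if a fresh socket can bind to the given port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        try:
            probe.bind((host, port))
        except OSError:
            return False
        return True
```

A check-then-release approach would close the socket right after reading the port number, which reopens the race window; holding the socket avoids that.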
Yifan Xiong 16b6385dee
Add dependencies (#374)
Add dependencies

* include ndv4-topo.xml in cuda docker images
* require requests version to avoid RequestsDependencyWarning
2022-07-13 08:42:53 +00:00
Yifan Xiong b2875179bf
Fix issues in ib validation benchmark (#370)
Fix several issues in ib validation benchmark:
* continue running when a timeout occurs midway, instead of aborting the whole MPI process
* make the timeout parameter configurable, with a default of 120 seconds
* avoid mixing stdio and iostream when printing to stdout
* set default message size to 8M which will saturate ib in most cases
* fix hostfile path issue so that it can be auto found in different cases
2022-07-09 19:57:11 +08:00
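The timeout behavior described in #370 can be illustrated with a short sketch. The benchmark itself is implemented in C++, so this Python version only shows the policy: each command gets a configurable timeout (120 seconds by default, as in the commit message), and a timed-out command is recorded and skipped instead of stopping the whole run. The function name `run_all` and the result layout are assumptions made for illustration.

```python
import subprocess

DEFAULT_TIMEOUT = 120  # seconds, the default named in the commit message


def run_all(commands, timeout=DEFAULT_TIMEOUT):
    """Run each command in turn and keep going when one times out.

    Returns a list of (command, returncode, stdout) tuples, where
    returncode is None for a command that exceeded the timeout.
    """
    results = []
    for cmd in commands:
        try:
            proc = subprocess.run(
                cmd, capture_output=True, text=True, timeout=timeout
            )
            results.append((cmd, proc.returncode, proc.stdout))
        except subprocess.TimeoutExpired:
            # Record the failure and continue with the next command.
            results.append((cmd, None, ""))
    return results
```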
Yifan Xiong e00a8180f6
Support node_num=1 in mpi mode (#372)
Support `node_num: 1` in mpi mode, so that we can run mpi benchmarks in
both 1 node and all nodes in one config by changing `node_num`.
Update docs and add test case accordingly.
2022-07-08 09:24:17 +08:00
Yifan Xiong 9f03d5687a
Update dependencies and Dockerfile (#371)
Update dependencies and Dockerfile:
* upgrade nccl-tests and rccl-tests to current latest version to match
  NCCL/RCCL versions
* unify image tag names on DockerHub
* remove verbose output in Dockerfile and fix some flags
2022-07-06 10:31:41 +00:00
Yifan Xiong a94ead34b0
CLI - Support SKU auto detect if running on Azure VM (#365)
Support SKU auto-detection and use the corresponding benchmark config when running on an Azure VM.
2022-07-05 10:52:39 +08:00
Yifan Xiong 620192a242
Fix issues in ib loopback benchmark (#369)
Fix several issues in ib loopback benchmark:
* use `--report_gbits` and divide by 8 to get GB/s; previous results were
  MiB/s divided by 1000
* use the ib_write_bw binary built in third_party instead of system path
* update the metric names so that different HCA indices share the same metric name
2022-06-29 17:53:02 +00:00
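The unit change in #369 can be checked with a little arithmetic. The sketch below assumes that Gb/s and GB/s are decimal units (10^9) and that MiB is 2^20 bytes; the function names are hypothetical. Under those assumptions, the old calculation (MiB/s divided by 1000) under-reports the true GB/s figure by about 4.6%.

```python
MIB = 2 ** 20  # bytes per MiB
GB = 10 ** 9   # bytes per GB (decimal, assumed)


def gbits_to_gbytes(gbits_per_sec):
    """New calculation: gigabits per second divided by 8."""
    return gbits_per_sec / 8


def old_metric(mib_per_sec):
    """Old calculation: MiB/s divided by 1000, which is not GB/s."""
    return mib_per_sec / 1000


def mib_to_gbytes(mib_per_sec):
    """Exact conversion from MiB/s to GB/s, for comparison."""
    return mib_per_sec * MIB / GB


# Example: a 200 Gb/s link corresponds to 25 GB/s.
true_gbytes = gbits_to_gbytes(200)
same_rate_in_mib = true_gbytes * GB / MIB
ratio = old_metric(same_rate_in_mib) / true_gbytes  # 10**6 / 2**20
```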
Yifan Xiong 8ef7163a18
Deployment - Refine error message when GPU is not detected (#368)
Refine error message when GPU is not detected.

Possible solutions if hardware exists and drivers are already installed:
* nvidia gpus:
  ```sh
  /sbin/modprobe nvidia-uvm
  D=`grep nvidia-uvm /proc/devices | awk '{print $1}'`
  mknod -m 666 /dev/nvidia-uvm c $D 0
  ```

* amd gpus
  ```sh
  modprobe amdgpu
  ```
2022-06-30 01:12:25 +08:00
Yifan Xiong 325a7338bf
Fix incorrect ulimit config in Dockerfile (#364)
Fix incorrect ulimit nofile config in Dockerfile.

Because `sh` is used by default instead of bash, and its `echo` does not accept any options, the literal `-e` was written into /etc/security/limits.conf.
2022-06-24 14:14:00 +00:00
Yifan Xiong bfaa1c837b
Support multiple IB/GPU in ib validation (#363)
**Description**

Support running multiple IB/GPU devices simultaneously in ib validation benchmark.

**Major Revisions**
- Revise ib_validation_performance.cc so that multiple processes per node can be used to launch multiple perftest commands simultaneously. For each node pair in the config, the specified number of processes per node run in parallel.
- Revise ib_validation_performance.py to correct file paths and adjust parameters to specify different NICs/GPUs/NUMA nodes.
- Fix env issues in Dockerfile for end-to-end test.
- Update ib-traffic configuration examples in config files.
- Update unit tests and docs accordingly.

Closes #326.
2022-06-24 08:35:20 +00:00
Yifan Xiong 0f7b057a2d
Runner - Fix sudo issue when running without Docker (#362)
Fix sudo issue when running without Docker, where the user account can
be arbitrary.
2022-06-19 11:56:36 +00:00
Yifan Xiong 483bf782e1
Update ROCm Dockerfile (#361)
**Description**

Update ROCm Dockerfile.

**Major Revisions**
- Add dockerfile for ROCm 5.1.3
- Merge 5.1.x and 5.0.x dockerfile
- Remove 4.2 and 4.0 legacy
- Update build pipeline accordingly
2022-06-19 17:26:39 +08:00
Yifan Xiong 60a3c74306
Fix cmake and build issues (#360)
**Description**

Fix cmake and build issues.

**Major Revision**

* Remove unnecessary boost build
* Remove user-agent for mlc
* Remove -j for third party to build each project in sequence
* Fix ansible collections installation path
2022-06-15 13:07:57 +08:00
Yifan Xiong a4937e95c6
Support `sb run` on host directly without Docker (#358)
**Description**

Support `sb run` on host directly without Docker

**Major Revisions**
- Add `--no-docker` argument for `sb run`.
- Run on host directly if `--no-docker` is specified.
- Update docs and tests correspondingly.
2022-06-14 10:57:01 +08:00
dependabot[bot] 528d69bd13
Bump eventsource from 1.1.0 to 1.1.1 in /website (#357)
Bumps [eventsource](https://github.com/EventSource/eventsource) from 1.1.0 to 1.1.1.
- [Release notes](https://github.com/EventSource/eventsource/releases)
- [Changelog](https://github.com/EventSource/eventsource/blob/master/HISTORY.md)
- [Commits](https://github.com/EventSource/eventsource/compare/v1.1.0...v1.1.1)

---
updated-dependencies:
- dependency-name: eventsource
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2022-06-06 11:39:35 +08:00
dependabot[bot] 77f8048ad8
Bump cross-fetch from 3.1.4 to 3.1.5 in /website (#349)
Bumps [cross-fetch](https://github.com/lquixada/cross-fetch) from 3.1.4 to 3.1.5.
- [Release notes](https://github.com/lquixada/cross-fetch/releases)
- [Commits](https://github.com/lquixada/cross-fetch/compare/v3.1.4...v3.1.5)

---
updated-dependencies:
- dependency-name: cross-fetch
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2022-06-02 19:06:54 +08:00
dependabot[bot] cdd19e6f30
Bump async from 2.6.3 to 2.6.4 in /website (#351)
Bumps [async](https://github.com/caolan/async) from 2.6.3 to 2.6.4.
- [Release notes](https://github.com/caolan/async/releases)
- [Changelog](https://github.com/caolan/async/blob/v2.6.4/CHANGELOG.md)
- [Commits](https://github.com/caolan/async/compare/v2.6.3...v2.6.4)

---
updated-dependencies:
- dependency-name: async
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2022-06-02 13:26:35 +08:00
Yuting Jiang 54da021b4d
Analyzer - Fix bugs in data diagnosis (#355)
**Description**
Fix bugs in data diagnosis.

**Major Revision**
- add support for getting the baseline of a metric that uses custom benchmark naming with ':', like 'nccl-bw:default/allreduce_8_bw:0'
- save raw data of all metrics, rather than only the metrics defined in diagnosis_rules.yaml, when output_all is True
- fix a bug that used the wrong column index when applying formatting (red color and percentile) in the Excel output
2022-06-01 17:12:38 +08:00
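To make the naming in #355 concrete, here is a small sketch that splits a metric name such as 'nccl-bw:default/allreduce_8_bw:0' into its parts. The layout assumed here (benchmark name, optional annotation after ':', metric name after '/', optional trailing index after the last ':') is inferred from that single example, and the function `split_metric` is hypothetical, not the analyzer's actual code.

```python
def split_metric(full_name):
    """Split '<benchmark>[:<annotation>]/<metric>[:<index>]' into parts.

    Missing optional parts are returned as None. The layout is an
    assumption based on the example in the commit message.
    """
    benchmark, _, metric = full_name.partition("/")
    name, sep, annotation = benchmark.partition(":")
    if not sep:
        annotation = None
    if ":" in metric:
        metric_name, index = metric.rsplit(":", 1)
    else:
        metric_name, index = metric, None
    return {
        "benchmark": name,
        "annotation": annotation,
        "metric": metric_name,
        "index": index,
    }


parts = split_metric("nccl-bw:default/allreduce_8_bw:0")
```

Looking up a baseline by the `benchmark` and `metric` fields alone, ignoring the annotation, is one way such names could be matched; whether the analyzer does this is not stated in the commit message.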
Yuting Jiang 3f135e4669
Dockerfile - Add support to run sb command inside docker image (#356)
**Description**
Add support to run sb command inside docker image - install missing dependency.
2022-06-01 01:11:28 +08:00
Yuting Jiang e08b6d3a1c
Dockerfile: Update rccl version and fix issue in rocm5.1.1 dockerfile (#354)
**Description**
Update rccl version and fix issue in rocm5.1.1 dockerfile.
2022-05-27 10:46:40 +08:00
Yuting Jiang 81a4146bc1
Dockerfile - Add dockerfile for rocm5.1.1 (#353)
**Description**
Add dockerfile for rocm5.1.1.
2022-05-25 20:28:11 +08:00
Yifan Xiong 6681c72043
Release - SuperBench v0.5.0 (#350)
**Description**

Cherry-pick bug fixes from v0.5.0 to main.

**Major Revisions**

* Bug - Force to fix ort version as '1.10.0' (#343)
* Bug - Support no matching rules and unify the output name in result_summary (#345)
* Analyzer - Support regex in annotations of benchmark naming for metrics in rules (#344)
* Bug - Fix bugs in sync results on root rank for e2e model benchmarks (#342)
* Bug - Fix bug of duration feature for model benchmarks in distributed mode (#347)
* Docs - Upgrade version and release note (#348)

Co-authored-by: Yuting Jiang <v-yutjiang@microsoft.com>
2022-04-29 16:22:55 +08:00
Yuting Jiang 712eafc373
Docs - Update links using relative file paths with extensions (#346)
**Description**
Update links of referencing other docs using relative file paths with extensions.
2022-04-21 07:28:19 +08:00
Jared Bowden cb26691173
Docs - Update link to cli.md (#341)
**Description**
Fixes relative link in documentation: point to `../cli.md`.
2022-04-15 22:11:14 +08:00
guoshzhao 80dcc8aaec
Benchmarks: Add Benchmark - Add FAMBench based on docker benchmark (#338)
**Description**
Integrate FAMBench into superbench based on docker implementation:
https://github.com/facebookresearch/FAMBench

The script to run all benchmarks is:
https://github.com/facebookresearch/FAMBench/blob/main/benchmarks/run_all.sh
2022-04-11 15:31:07 +08:00
Yuting Jiang 8dc19ca4af
CLI - Integrate output all nodes diagnosis results (#339)
**Description**
Integrate output of diagnosis results for all nodes.
2022-04-11 13:42:04 +08:00
Yuting Jiang 55b0f9d239
Analyzer: Add Feature - Output results of all nodes in data diagnosis (#336)
**Description**
Output results of all nodes in data diagnosis.
2022-04-10 18:57:15 +08:00
Yuting Jiang 56c9a711a8
Docs - Add usage for result summary (#337)
**Description**
Add usage for result summary.
2022-04-08 20:44:25 +00:00
Yuting Jiang f15da60b2b
CLI - Integrate result summary and update output format of data diagnosis (#335)
**Description**
Integrate result summary and update output format of data diagnosis.

**Major Revision**
- integrate result summary
- add md and html format for data diagnosis
2022-04-08 18:48:43 +08:00
guoshzhao 6d895da83c
Benchmarks: Add Feature - Provide option to save raw data into file. (#333)
**Description**
Use config `log_raw_data` to control whether the raw data is logged to a file. The default value is `no`. It can be set to `yes` for particular benchmarks, such as NCCL/RCCL tests, to save their raw data to a file.
2022-04-01 16:26:09 +08:00
dependabot[bot] d368d90e21
Bump minimist from 1.2.5 to 1.2.6 in /website (#334)
Bumps [minimist](https://github.com/substack/minimist) from 1.2.5 to 1.2.6.
- [Release notes](https://github.com/substack/minimist/releases)
- [Commits](https://github.com/substack/minimist/compare/1.2.5...1.2.6)

---
updated-dependencies:
- dependency-name: minimist
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2022-03-31 12:41:48 +08:00
Yuting Jiang 84fed1ce18
Analyzer: Add feature - Add result summary in excel,md,html format (#320)
**Description**
Add result summary in excel,md,html format.

**Major Revision**
- Add ResultSummary class to support result summary in excel,md,html format.
- Abstract RuleBase class for common-used functions in DataDiagnosis and ResultSummary.
2022-03-24 15:32:01 +08:00
Yuting Jiang c5aa4f4e38
Bug: Benchmarks - remove fp16 samples type converting time (#332)
**Description**
Remove fp16 samples type converting time for training cnn and lstm inference.
2022-03-22 12:51:52 +08:00
Yifan Xiong a9634ef5a8
Config - Add inference config for NC A100 and NV A10 series (#329)
Add inference config for preview SKUs, including:
* [NC96ads_A100_v4](https://docs.microsoft.com/en-us/azure/virtual-machines/nc-a100-v4-series)
* [NV18ads_A10_v5](https://docs.microsoft.com/en-us/azure/virtual-machines/nva10v5-series)
2022-03-21 14:24:37 +08:00
Yuting Jiang 6e74918044
Bug: Benchmarks - remove fp16 samples type converting time for cnn and lstm models (#330)
**Description**
Remove fp16 samples type converting time for cnn and lstm models.
2022-03-17 14:02:40 +08:00
rafsalas19 ff51a3cee9
Benchmarks: Add Feature - Add GPU-Burn as microbenchmark (#324)
**Description**
Modifications adding GPU-Burn to SuperBench.
- added third party submodule
- modified Makefile to make gpu-burn binary
- added/modified microbenchmarks to add gpu-burn python scripts
- modified default and azure_ndv4 configs to add gpu-burn
2022-03-16 16:20:11 +08:00
Yuting Jiang 84359fd806
Bug: Executor - fix bug in result writing to files for mpi mode (#328)
**Description**
fix the bug in result writing to files for mpi mode.
2022-03-15 16:35:03 +00:00
Yuting Jiang b3c95f1827
Analyzer - Add md and html output format for DataDiagnosis (#325)
**Description**
Add md and html output format for DataDiagnosis.

**Major Revision**
- add md and html support in file_handler
- add interface in DataDiagnosis for md and HTML output

**Minor Revision**
- move excel and json output interface into DataDiagnosis
2022-03-15 18:04:11 +08:00
Yifan Xiong f755c0b659
Bug - Fix env path to absolute path (#327)
Fix env file path to absolute path in `docker exec`, in case there're mixed ssh and local connections or different users are used.
2022-03-09 17:16:43 +08:00
Yuting Jiang 1ec055e1c2
Analyzer: Revise - Abstract RuleBase from DataDiagnosis (#321)
**Description**
Abstract RuleBase from DataDiagnosis.
2022-03-07 17:25:07 +08:00
dependabot[bot] 9759527111
Bump url-parse from 1.5.8 to 1.5.10 in /website (#323)
Bumps [url-parse](https://github.com/unshiftio/url-parse) from 1.5.8 to 1.5.10.
- [Release notes](https://github.com/unshiftio/url-parse/releases)
- [Commits](https://github.com/unshiftio/url-parse/compare/1.5.8...1.5.10)

---
updated-dependencies:
- dependency-name: url-parse
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2022-03-07 03:24:22 +00:00
Jeff Daily a9ef0f99ab
Benchmarks - Keep BatchNorm as fp32 for pytorch cnn models cast to fp16 (#322)
**Description**
The BatchNorm operator is not numerically stable in fp16. PyTorch documentation recommends keeping the BN op in fp32 for fp16 AMP models. Refer to https://pytorch.org/docs/stable/amp.html#ops-that-can-autocast-to-float32. Preserving BN in fp32 for SuperBench more accurately reflects real workloads.
2022-03-06 13:22:43 +00:00
Yuting Jiang 425b9ff865
Dockerfile - Add dockerfile for rocm5.0.1 (#319)
**Description**
Add dockerfile for rocm5.0.1.
2022-02-28 19:30:43 +08:00
dependabot[bot] 74a3b1231a
Bump prismjs from 1.23.0 to 1.27.0 in /website (#318)
Bumps [prismjs](https://github.com/PrismJS/prism) from 1.23.0 to 1.27.0.
- [Release notes](https://github.com/PrismJS/prism/releases)
- [Changelog](https://github.com/PrismJS/prism/blob/master/CHANGELOG.md)
- [Commits](https://github.com/PrismJS/prism/compare/v1.23.0...v1.27.0)

---
updated-dependencies:
- dependency-name: prismjs
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2022-02-28 14:55:12 +08:00
Yuting Jiang a4950a707e
Dockerfile - Add rocm5.0 dockerfile (#307)
**Description**
Add rocm5.0 dockerfile.
2022-02-26 07:12:45 +08:00
Ziyue Yang 01304706ed
Bug Fix - Fix P2P detection in gpu_copy (#317)
**Description**
Fix invalid reference of P2P detection result in gpu_copy.
2022-02-25 05:48:38 +08:00
Yuting Jiang 4f5027dbda
Benchmarks: Build Pipeline - Make gpcnet only for cuda (#316)
**Description**
Make gpcnet only for cuda.
2022-02-24 18:18:49 +08:00
Yuting Jiang e0c491425d
Bug - Fix empty HIP_ARCHITECTURES issue in cmake>=3.21.0 (#315)
**Description**
Fix the empty HIP_ARCHITECTURES issue with cmake>=3.21.0.
Refer to https://github.com/ROCm-Developer-Tools/HIP/pull/2364
2022-02-22 12:38:58 +00:00
dependabot[bot] 0740780bcc
Bump url-parse from 1.5.1 to 1.5.8 in /website (#313)
Bumps [url-parse](https://github.com/unshiftio/url-parse) from 1.5.1 to 1.5.8.
- [Release notes](https://github.com/unshiftio/url-parse/releases)
- [Commits](https://github.com/unshiftio/url-parse/compare/1.5.1...1.5.8)

---
updated-dependencies:
- dependency-name: url-parse
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2022-02-21 13:03:27 +08:00