Commit Graph

183 Commits

Author SHA1 Message Date
Yifan Xiong 352ae0c95f
Fix port conflict in ib loopback (#375)
Fix potential port conflict caused by a race condition between time-of-check
and time-of-use, by keeping the port bound throughout.

Modify the function to resolve flake8 C901 while keeping the logic same.
2022-07-20 11:30:00 +08:00
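The time-of-check to time-of-use fix in #375 can be illustrated with a generic sketch. This is not the code from the commit: `reserve_port` is a hypothetical helper and the loopback address is an assumption. The idea is to keep the socket that found the free port open, so that no other process can claim the port between the check and the use.

```python
import socket


def reserve_port(host="127.0.0.1"):
    """Bind to an OS-assigned free port and return the open socket and port.

    Hypothetical helper: the caller keeps the socket open for as long as
    the port must stay reserved, instead of closing it right after the check.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.bind((host, 0))  # port 0 asks the OS to pick a free port
    return sock, sock.getsockname()[1]


first, port_a = reserve_port()
second, port_b = reserve_port()

# While both sockets stay bound, the OS does not hand out the same port twice.
print(port_a, port_b)

first.close()
second.close()
```

Closing the socket right after reading the port number would reopen the window in which another process can grab the same port.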
Yifan Xiong b2875179bf
Fix issues in ib validation benchmark (#370)
Fix several issues in ib validation benchmark:
* continue running when a timeout occurs midway, instead of aborting the whole mpi process
* make timeout parameter configurable, set default to 120 seconds
* avoid mixing stdio and iostream when printing to stdout
* set default message size to 8M which will saturate ib in most cases
* fix hostfile path issue so that it can be auto found in different cases
2022-07-09 19:57:11 +08:00
Yifan Xiong e00a8180f6
Support node_num=1 in mpi mode (#372)
Support `node_num: 1` in mpi mode, so that we can run mpi benchmarks in
both 1 node and all nodes in one config by changing `node_num`.
Update docs and add test case accordingly.
2022-07-08 09:24:17 +08:00
Yifan Xiong a94ead34b0
CLI - Support SKU auto detect if running on Azure VM (#365)
Support SKU auto-detection and use the corresponding benchmark config when running on an Azure VM.
2022-07-05 10:52:39 +08:00
Yifan Xiong 620192a242
Fix issues in ib loopback benchmark (#369)
Fix several issues in ib loopback benchmark:
* use `--report_gbits` and divide by 8 to get GB/s, previous results are
  MiB/s / 1000
* use the ib_write_bw binary built in third_party instead of system path
* update the metrics name so that different hca indices have same metric
2022-06-29 17:53:02 +00:00
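The unit change described in #369 comes down to simple arithmetic. The sketch below assumes decimal units (1 Gb = 10^9 bits, 1 GB = 10^9 bytes); the function names are hypothetical and are not taken from the benchmark source.

```python
def gbits_to_gbytes_per_sec(gbits_per_sec):
    """Convert a bandwidth reported in Gb/s to GB/s (8 bits per byte)."""
    return gbits_per_sec / 8


def old_style_value(mib_per_sec):
    """Previous calculation as described in the commit: MiB/s divided by 1000."""
    return mib_per_sec / 1000


# A 200 Gb/s link corresponds to 25 GB/s.
print(gbits_to_gbytes_per_sec(200))

# The same rate expressed in MiB/s and divided by 1000 is neither GB/s nor GiB/s.
mib_per_sec = 25e9 / 2**20
print(old_style_value(mib_per_sec))
```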
Yifan Xiong bfaa1c837b
Support multiple IB/GPU in ib validation (#363)
**Description**

Support multiple IB/GPU devices run simultaneously in ib validation benchmark.

**Major Revisions**
- Revise ib_validation_performance.cc so that multiple processes per node could be used to launch multiple perftest commands simultaneously. For each node pair in the config, number of processes per node will run in parallel.
- Revise ib_validation_performance.py to correct file paths and adjust parameters to specify different NICs/GPUs/NUMA nodes.
- Fix env issues in Dockerfile for end-to-end test.
- Update ib-traffic configuration examples in config files.
- Update unit tests and docs accordingly.

Closes #326.
2022-06-24 08:35:20 +00:00
Yifan Xiong a4937e95c6
Support `sb run` on host directly without Docker (#358)
**Description**

Support `sb run` on host directly without Docker

**Major Revisions**
- Add `--no-docker` argument for `sb run`.
- Run on host directly if `--no-docker` is specified.
- Update docs and tests correspondingly.
2022-06-14 10:57:01 +08:00
Yuting Jiang 54da021b4d
Analyzer - Fix bugs in data diagnosis (#355)
**Description**
Fix bugs in data diagnosis.

**Major Revision**
- add support for getting the baseline of metrics that use custom benchmark naming with ':', like 'nccl-bw:default/allreduce_8_bw:0'
- save raw data of all metrics, rather than only the metrics defined in diagnosis_rules.yaml, when output_all is True
- fix use of the wrong column index when applying formatting (red color and percentile) in the Excel output
2022-06-01 17:12:38 +08:00
Yifan Xiong 6681c72043
Release - SuperBench v0.5.0 (#350)
**Description**

Cherry-pick  bug fixes from v0.5.0 to main.

**Major Revisions**

* Bug - Force to fix ort version as '1.10.0' (#343)
* Bug - Support no matching rules and unify the output name in result_summary (#345)
* Analyzer - Support regex in annotations of benchmark naming for metrics in rules (#344)
* Bug - Fix bugs in sync results on root rank for e2e model benchmarks (#342)
* Bug - Fix bug of duration feature for model benchmarks in distributed mode (#347)
* Docs - Upgrade version and release note (#348)

Co-authored-by: Yuting Jiang <v-yutjiang@microsoft.com>
2022-04-29 16:22:55 +08:00
guoshzhao 80dcc8aaec
Benchmarks: Add Benchmark - Add FAMBench based on docker benchmark (#338)
**Description**
Integrate FAMBench into superbench based on docker implementation:
https://github.com/facebookresearch/FAMBench

The script to run all benchmarks is:
https://github.com/facebookresearch/FAMBench/blob/main/benchmarks/run_all.sh
2022-04-11 15:31:07 +08:00
Yuting Jiang 8dc19ca4af
CLI - Integrate output all nodes diagnosis results (#339)
**Description**
Integrate output all nodes diagnosis results.
2022-04-11 13:42:04 +08:00
Yuting Jiang 55b0f9d239
Analyzer: Add Feature - Output results of all nodes in data diagnosis (#336)
**Description**
Output results of all nodes in data diagnosis.
2022-04-10 18:57:15 +08:00
Yuting Jiang f15da60b2b
CLI - Integrate result summary and update output format of data diagnosis (#335)
**Description**
Integrate result summary and update output format of data diagnosis.

**Major Revision**
- integrate result summary
- add md and html format for data diagnosis
2022-04-08 18:48:43 +08:00
guoshzhao 6d895da83c
Benchmarks: Add Feature - Provide option to save raw data into file. (#333)
**Description**
Use config `log_raw_data` to control whether to log the raw data into a file. The default value is `no`. It can be set to `yes` for particular benchmarks, such as the NCCL/RCCL tests, to save their raw data into a file.
2022-04-01 16:26:09 +08:00
Yuting Jiang 84fed1ce18
Analyzer: Add feature - Add result summary in excel,md,html format (#320)
**Description**
Add result summary in excel,md,html format.

**Major Revision**
- Add ResultSummary class to support result summary in excel,md,html format.
- Abstract RuleBase class for common-used functions in DataDiagnosis and ResultSummary.
2022-03-24 15:32:01 +08:00
rafsalas19 ff51a3cee9
Benchmarks: Add Feature - Add GPU-Burn as microbenchmark (#324)
**Description**
Modifications adding GPU-Burn to SuperBench.
- added third party submodule
- modified Makefile to make gpu-burn binary
- added/modified microbenchmarks to add gpu-burn python scripts
- modified default and azure_ndv4 configs to add gpu-burn
2022-03-16 16:20:11 +08:00
Yuting Jiang b3c95f1827
Analyzer - Add md and html output format for DataDiagnosis (#325)
**Description**
Add md and html output format for DataDiagnosis.

**Major Revision**
- add md and html support in file_handler
- add interface in DataDiagnosis for md and HTML output

**Minor Revision**
- move excel and json output interface into DataDiagnosis
2022-03-15 18:04:11 +08:00
Yuting Jiang 1ec055e1c2
Analyzer: Revise - Abstract RuleBase from DataDiagnosis (#321)
**Description**
Abstract RuleBase from DataDiagnosis.
2022-03-07 17:25:07 +08:00
Yuting Jiang 97ed12f97f
Analyzer: Add Feature - Add multi-rules feature for data diagnosis (#289)
**Description**
Add multi-rules feature for data diagnosis to support multiple rules' combined check.

**Major Revision**
- revise rule design to support multiple rules combination check
- update related codes and tests
2022-02-20 16:59:38 +08:00
Ziyue Yang 6cdf759543
Benchmarks: Revise Code - Eliminate NUMA binding for device-to-device tests in gpu_copy (#302)
**Description**
This commit removes NUMA binding for device-to-device tests because NUMA doesn't affect their performance, and revises benchmark metrics accordingly.
2022-02-09 20:30:42 +08:00
Ziyue Yang 682b2c120d
Benchmarks: Revise Code - Make data checking in gpu_copy optional (#301)
This commit makes data checking in gpu_copy optional, because it takes too long when the message size is large.
2022-02-08 10:59:27 +08:00
Ziyue Yang 853890559a
Benchmarks: Revise Code - Reduce result variance in gpu_copy benchmark (#298)
**Description**
This commit does the following to optimize result variance in gpu_copy benchmark:
1) Add warmup phase for gpu_copy benchmark to avoid timing instability caused by first-time CUDA kernel launch overhead;
2) Use CUDA events for timing instead of CPU timestamps;
3) Make data checking optional, since it is best left disabled in performance tests;
4) Enlarge message size in performance benchmark.
2022-02-07 13:16:13 +08:00
Yifan Xiong 3419447c11
Benchmarks - Support T4 and A10 in GEMM benchmark (#294)
Support T4 and A10 in GEMM benchmark.
2022-01-29 13:26:00 +00:00
Yifan Xiong 3524975cfc
Config - Support customized env for all modes (#295)
Support customized env for all modes in configuration.
2022-01-29 08:19:48 +00:00
guoshzhao d03d110f55
Benchmarks: Add Feature - Sync the E2E training results among all workers for each step. (#287)
**Description**
Please write a brief description and link the related issue if have.

**Major Revision**
- Sync (do allreduce max) the E2E training results among all workers.
- Avoid using ':0' in the metric name if only one rank has output.
2022-01-28 20:35:53 +08:00
guoshzhao d877ca2322
Benchmarks: Add Feature - Add timeout feature for each benchmark. (#288)
**Description**
Add timeout feature for each benchmark.

**Major Revision**
- Add `timeout` config for each benchmark. In current config files, only set the timeout for kernel-launch as example. Other benchmarks can be set in the future.
- Set the timeout config for `ansible_runner.run()`. Runner will get the return code 254:
   [ansible.py:80][WARNING] Run failed, return code 254.
- Using `timeout` command to terminate the client process.
2022-01-28 08:16:32 +00:00
Yifan Xiong 7d7cd3dc63
Config - Update benchmark naming to support annotations (#284)
__Description__

Update benchmark naming to support annotations.

__Major Revisions__
- Update name for `create_benchmark_context` in executor.
- Backward compatibility for model benchmarks using "_models" suffix.
- Update documents.
2022-01-25 09:54:58 +00:00
Ziyue Yang 74421ffee0
Benchmarks: Add Feature - Add bidirectional test support in gpu_copy benchmark (#285)
**Description**
This commit adds bidirectional tests in gpu_copy benchmark for both device-host transfer and device-device transfer, and revises related tests.
2022-01-21 13:45:37 +08:00
guoshzhao fd2bc9e048
Benchmarks: Add Feature - Add percentile metrics for ort and pytorch inference benchmarks (#283)
**Description**
Add 50th, 90th, 95th, 99th, 99.9th latency metrics for ORT and pytorch inference benchmarks.
2022-01-19 10:49:56 +08:00
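The percentile metrics added in #283 can be illustrated with a nearest-rank percentile over per-step latencies. This is only a sketch: the commit does not say which percentile method the benchmarks use, and the `percentile` helper and sample data are hypothetical.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical latencies in milliseconds: 1.0, 2.0, ..., 1000.0
latencies_ms = [float(i) for i in range(1, 1001)]

for pct in (50, 90, 95, 99, 99.9):
    print(f"latency p{pct}: {percentile(latencies_ms, pct)} ms")
```

Interpolating methods give slightly different values on small samples, so published numbers are only comparable when they use the same method.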
Yifan Xiong f7ffc54522
CLI - Add command sb benchmark [list,list-parameters] (#279)
__Description__

Add command `sb benchmark list` and `sb benchmark list-parameters` to support listing all optional parameters for benchmarks.

<details>
<summary>Examples</summary>
<pre>
$ sb benchmark list -n [a-z]+-bw -o table
Result
--------
mem-bw
nccl-bw
rccl-bw
</pre>
<pre>
$ sb benchmark list-parameters -n mem-bw
=== mem-bw ===
optional arguments:
  --bin_dir str         Specify the directory of the benchmark binary.
  --duration int        The elapsed time of benchmark in seconds.
  --mem_type str [str ...]
                        Memory types to benchmark. E.g. htod dtoh dtod.
  --memory str          Memory argument for bandwidthtest. E.g. pinned unpinned.
  --run_count int       The run count of benchmark.
  --shmoo_mode          Enable shmoo mode for bandwidthtest.
default values:
{'bin_dir': None,
 'duration': 0,
 'mem_type': ['htod', 'dtoh'],
 'memory': 'pinned',
 'run_count': 1}
</pre>
</details>

__Major Revisions__
* Add `sb benchmark list` to list benchmarks matching given name.
* Add `sb benchmark list-parameters` to list parameters for benchmarks which match given name.

__Minor Revisions__
* Sort format help text for argparse.
2022-01-18 08:40:03 +00:00
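The name filter in the `sb benchmark list -n [a-z]+-bw` example of #279 behaves like a regular-expression match over registered benchmark names. The sketch below assumes a full match against each name; whether the CLI uses a full or a prefix match is not stated in the commit, and `filter_benchmarks` and the name list are hypothetical.

```python
import re


def filter_benchmarks(names, pattern):
    """Return the sorted benchmark names that match the pattern in full."""
    regex = re.compile(pattern)
    return sorted(name for name in names if regex.fullmatch(name))


# Hypothetical set of registered names, taken from commits in this log.
registered = ["nccl-bw", "kernel-launch", "mem-bw", "gemm-flops", "rccl-bw"]

print(filter_benchmarks(registered, r"[a-z]+-bw"))
```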
Yifan Xiong ff563b66af
Release - SuperBench v0.4.0 (#278)
__Description__

Cherry-pick  bug fixes from v0.4.0 to main.

__Major Revisions__

* Bug - Fix issues for Ansible and benchmarks (#267)
* Tests - Refine test cases for microbenchmark (#268)
* Bug - Build openmpi with ucx support in rocm dockerfiles (#269)
* Benchmarks: Fix Bug - Fix fio build issue (#272)
* Docs - Unify metric and add doc for cublas and cudnn functions (#271)
* Monitor: Revision - Add 'monitor/' prefix to monitor metrics in result summary (#274)
* Bug - Fix bug of detecting if gpu_index is none (#275)
* Bug - Fix bugs in data diagnosis (#273)
* Bug - Fix issue that the root mpi rank may not be the first in the hostfile (#270)
* Benchmarks: Configuration - Update inference and network benchmarks in configs (#276)
* Docs - Upgrade version and release note (#277)

Co-authored-by: Yuting Jiang <v-yutjiang@microsoft.com>
2021-12-30 16:24:00 +08:00
Yifan Xiong cb8a3cfb15
Benchmarks - Add transformers for TensorRT inference (#254)
Add transformers for TensorRT inference.
2021-12-13 13:21:32 +00:00
Ziyue Yang 10012a0a47
Docs - Add benchmark metrics for cpu-memory-bw-latency (#264)
**Description**
Add benchmark metrics for cpu-memory-bw-latency.
2021-12-13 19:08:19 +08:00
Ziyue Yang b6781968f2
Benchmarks: Fix Comment - Correct benchmark name in test_gpu_copy_bw_performance.py #263
**Description**
Benchmarks: Fix Comment - Correct benchmark name in test_gpu_copy_bw_performance.py.
2021-12-13 07:02:39 +00:00
Hossein Pourreza b590409e0f
Benchmarks: Add Benchmark - Add mlc benchmark to superbench (#216)
**Description**
Add mlc memory bandwidth and latency micro benchmark to Superbench.

**Major Revision**
- Add mlc benchmark with test and example files
2021-12-13 13:47:42 +08:00
guoshzhao 4d85630abb
Benchmarks: Add Benchmark - Add ONNXRuntime inference benchmark based on ORT python API (#245)
**Description**
Add ONNXRuntime inference benchmark based on ORT python API.

**Major Revision**
- Add `ORTInferenceBenchmark` class to export pytorch model to onnx model and do inference
- Add tests and example for `ort-inference` benchmark
- Update the introduction docs.
2021-12-10 13:53:11 +00:00
Yuting Jiang c2f942cb6f
Analyzer: Add Feature - Add basic analysis features (#248)
**Description**
Add basic analysis features.

**Major Revision**
- Add statistics, correlations of the raw data
- Add numeric outlier detection(inter_quartile_range)
- Add boxplot for selected metric
2021-12-10 11:01:59 +00:00
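The interquartile-range outlier detection mentioned in #248 can be sketched with the standard library. This is an assumption-laden illustration, not the analyzer's code: the `iqr_outliers` helper, the conventional 1.5 multiplier, and the sample values are all hypothetical.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# Hypothetical per-node results for one metric; one node is far off.
samples = [10, 11, 12, 11, 10, 12, 11, 50]
print(iqr_outliers(samples))
```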
guoshzhao 6e357fb9d2
Monitor: Integration - Integrate monitor into Superbench (#259)
**Description**
Integrate monitor into Superbench.

**Major Revision**
- Initialize, start and stop monitor in SB executor.
- Parse the monitor data in SB runner and merge into benchmark results.
- Specify ReduceType for monitor metrics, such as MAX, MIN and LAST.
- Add monitor configs into config file.
2021-12-10 09:33:13 +00:00
guoshzhao afea9913ae
Benchmarks: Fix Bug - Set reduce_op type for metric return_code (#261)
**Description**
Set the `reduce_op` type for metric `return_code` as `None`.
2021-12-10 16:02:29 +08:00
Yuting Jiang ed2f3c3c82
CLI - Integrate data diagnosis (#260)
**Description**
Add cli to integrate data diagnosis module.
2021-12-10 06:11:00 +00:00
Yuting Jiang 9f56b2198f
Benchmarks: Unify metric names of benchmarks (#252)
**Description**
Unify metric names of benchmarks.
2021-12-09 04:48:42 +00:00
Yuting Jiang c13ed2a297
Analyzer: Initialization - Add baseline-based data diagnosis module (#242)
**Description**
Add data diagnosis module.

**Major Revision**
- Add DataDiagnosis class to support rule-based data diagnosis for result summary jsonl file of multi nodes
- Add RuleOp class to define rule operators
2021-12-08 18:22:00 +08:00
guoshzhao 44f0270ec4
Benchmarks: Add Feature - Add return_code metric into result (#256)
**Description**
Add return_code metric into result and revise unit tests.
2021-12-07 07:32:37 +00:00
guoshzhao 371fd61cea
Benchmarks: Add Feature - Add 'ignore_invalid' option when registering benchmarks. (#247)
**Description**
If `ignore_invalid` is True and 'required' arguments are not set when registering the benchmark, argument checking is skipped and the user is expected to provide those arguments in the config.
2021-12-02 10:26:56 +00:00
Yifan Xiong b4ea97bfa4
Benchmark: Replace `-c` argument with `-N` for `numactl` in Configuration (#250)
**Description**
Replace `-c` argument with `-N` for `numactl` since the old `-c`/`--cpubind` argument is deprecated.
2021-12-02 09:27:03 +00:00
guoshzhao 4074f12c1c
Monitor: Initialization - Add Monitor and MonitorRecord class (#240)
**Description**
Add the initial version of Monitor.

**Major Revision**
- Add `Monitor` class to launch background process for monitoring.
- Add `MonitorRecord` class to save the data from a single capture.
2021-11-18 15:54:18 +08:00
guoshzhao cc70f9c18c
Benchmarks: Add Feature - Extend the device manager utility to support more functions. (#239)
**Description**
Rename `nvidia_helper` utility as `device_manager` module and support more functions:
```
device_manager.get_device_count()
device_manager.get_device_utilization(idx)
device_manager.get_device_temperature(idx)
device_manager.get_device_power_limit(idx)
device_manager.get_device_memory(idx)
device_manager.get_device_row_remapped_info(idx)
device_manager.get_device_ecc_error(idx)
```
2021-11-15 14:24:04 +08:00
Yifan Xiong 8a00c8a03b
Benchmarks - Add TensorRT inference benchmark (#236)
__Description__

Add TensorRT inference benchmark for torchvision models.

__Major Revision__
- Measure TensorRT inference performance.
2021-11-12 15:27:16 +08:00
Yuting Jiang 54919424c3
Benchmarks: Add Benchmark - Add ib traffic validation distributed benchmark (#215)
**Description**
Add ib traffic validation distributed benchmark.

**Major Revision**
- Add ib traffic validation distributed benchmark, example and test
2021-11-10 01:18:41 +08:00
Ziyue Yang 008e0fe1d8
Benchmarks: Add Feature - Add CPU-initiated copy and dtod support to gpu-sm-copy benchmark (#230)
**Description**
This commit does the following:
1) Adds CPU-initiated copy benchmark;
2) Adds dtod benchmark;
3) Support scanning NUMA nodes and GPUs inside the benchmark program;
4) Change the name of gpu-sm-copy to gpu-copy.
2021-10-30 11:19:09 +08:00
guoshzhao e98a68124e
Benchmarks: Add Benchmark - Add onnx model benchmarks based on docker image. (#227)
Add RocmOnnxModelBenchmark class to run benchmarks packaged in superbench/benchmark:rocm4.3.1-onnxruntime1.9.0
2021-10-27 18:41:40 +08:00
Yuting Jiang 6003f2c2a2
Benchmarks: Add Benchmark - Add gpcnet microbenchmark (#229)
**Description**
Add gpcnet microbenchmark

**Major Revision**
- add 2 microbenchmarks for gpcnet: gpc-network-test and gpc-network-load-test
- add related test and example file
2021-10-22 08:40:01 +00:00
guoshzhao f841c8f466
Benchmarks: Add Feature - Support AMD and CUDA platform for DockerBenchmark. (#226)
Description
Add CudaDockerBenchmark and RocmDockerBenchmark to support amd and cuda platform for DockerBenchmark.
2021-10-22 15:22:15 +08:00
guoshzhao 455ad1f873
revise the term onnx to onnxruntime. (#232)
**Description**
Revise all occurrences of the term `onnx` to `onnxruntime`.
2021-10-21 04:29:27 +00:00
Yuting Jiang 49cc8f9a8c
Benchmarks: Add Benchmark - Add tcp connectivity validation microbenchmark (#217)
**Description**
Add tcp connectivity validation microbenchmark which is to validate TCP connectivity between current node and several nodes in the hostfile.

**Major Revision**
- Add tcp connectivity validation microbenchmark and related test, example
2021-10-12 23:42:12 +00:00
guoshzhao f944245694
Benchmarks: Add Feature - Add option to use fp32 instead of tf32 (#213)
**Description**
Add option `force_fp32` to use fp32 instead of tf32, only takes effect on Ampere or newer GPUs.
2021-09-28 05:53:01 +08:00
Yifan Xiong dfbd70b129
Release - SuperBench v0.3.0 (#212)
**Description**

Cherry-pick  bug fixes from v0.3.0 to main.

**Major Revisions**
* Docs - Upgrade version and release note (#209)
* Benchmarks: Build Pipeline - Update rccl-test git submodule to dc1ad48 (#210)
* Benchmarks: Update - Update benchmarks in configuration file (#208)
* CI/CD - Update GitHub Action VM (#211)
* Benchmarks: Fix Bug - Fix wrong parameters for gpu-sm-copy-bw in configuration examples (#203)
* CI/CD - Fix bug in build image for push event (#205)
* Benchmark: Fix Bug - fix error message of communication-computation-overlap (#204)
* Tool: Fix bug - Fix function naming issue in system info  (#200)
* CI/CD - Push images in GitHub Action (#202)
* Bug - Fix torch.distributed command for single node (#201)
* CLI - Integrate system info for node (#199)
* Benchmarks: Code Revision - Revise CMake files for microbenchmarks. (#196)
* CI/CD - Add ROCm image build in GitHub Actions (#194)
* Bug: Fix bug - fix bug of hipBusBandwidth build (#193)
* Benchmarks: Build Pipeline - Restore rocblas build logic (#197)
* Bug: Fix Bug - Add barrier before 'destroy_process_group' in model benchmarks (#198)
* Bug - Revise 'docker run' in sb deploy (#195)
* Bug - Fix Bug : fix bug of error param operations to operation in rccl-bw of hpe config (#190)

Co-authored-by: Yuting Jiang <v-yujiang@microsoft.com>
Co-authored-by: Guoshuai Zhao <guzhao@microsoft.com>
Co-authored-by: Ziyue Yang <ziyyang@microsoft.com>
2021-09-26 09:30:31 +08:00
Yuting Jiang 6076251816
Benchmarks: Code Revision - Revise arguments of nccl/rccl to support mpi mode and rename metric (#189)
**Description**
Revise arguments of nccl/rccl to support mpi mode (nccl/rccl could not run in mpi mode because multiple operators ran in sequence without a barrier) and rename metric.

**Major Revision**
- revise argument operators to be a single one

**Minor Revision**
- rename metric to remove benchmark name info
- change argument ngpus default value to be 1
2021-09-03 14:23:19 +08:00
Yifan Xiong e2453e1cae
Runner - Fix inventory issue in ansible_runner (#185)
__Description__

Fix inventory bug in ansible_runner when host list is provided with multiple hosts.

It ought to be handled by ansible_runner lib, workaround by using `--inventory` arg in cmdline.
2021-09-02 13:24:48 +08:00
guoshzhao 37d5dfd5ed
Benchmarks: Code Revision - revise the DockerBenchmark base class (#179)
**Description**
Revise the DockerBenchmark base to support image pull, image rm etc.

**Major Revision**
- image pull in _preprocess()
- image clean in _postprocess()
- execute customized commands in _benchmark()
- add unit tests
2021-09-01 22:15:42 +08:00
Ziyue Yang 024a870be1
Benchmarks: Code Revision - Revise metric name generation and default config for disk performance benchmark (#175)
**Description**
This commit revises disk performance benchmark, including:
1) Add missing benchmark name in default config;
2) Avoid using reserved character ':' in metric name.
2021-08-31 19:21:42 +08:00
Ziyue Yang b97197f08e
Benchmarks: Add Benchmark - Add GPU SM copy benchmark (#169)
**Description**
This commit adds gpu_sm_copy benchmark and related tests.
2021-08-30 18:54:26 +08:00
Yuting Jiang f3d53c3d5f
Benchmarks: Add Benchmark - Add gemm flops microbenchmark for amd (#152)
**Description**
Add gemm flops microbenchmark for amd.

**Major Revision**
- Add gemm flops microbenchmark for amd.
- Add related example and test file.
2021-08-30 13:40:46 +08:00
Yuting Jiang b0df66f7a2
Benchmarks: Code Revision - Extract base class for gemm flops microbenchmark (#165)
**Description**
Extract base class for gemm flops microbenchmark.

**Major Revision**
- extract base class for gemm flops microbenchmark and add related test.
- revise gemm_flops_performance for cuda.
2021-08-30 10:01:28 +08:00
guoshzhao 35114bae9d
Benchmarks: Code Revision - Rename kernel_launch_overhead metrics (#171)
**Description**
Rename `kernel_launch_overhead_event` to `event_overhead`, `kernel_launch_overhead_wall` to `wall_overhead`.
2021-08-28 06:36:41 +08:00
Yuting Jiang 666e3a9471
Benchmarks: Add Benchmark - Add memory bus bandwidth performance microbenchmark for amd (#153)
**Description**
Add memory bus bandwidth performance microbenchmark for amd.

**Major Revision**
- Add memory bus bandwidth performance microbenchmark for amd.
- Add related example and test file.
2021-08-27 21:17:39 +08:00
Yuting Jiang e5e84a2ece
Benchmarks: Code Revision - Extract base class for memory bandwidth microbenchmark (#159)
**Description**
extract base class for memory bandwidth microbenchmark.

**Major Revision**
- revise and optimize cuda_memory_bandwidth_performance
- extract base class for memory bandwidth microbenchmark
- add test for base class
2021-08-26 07:48:07 +08:00
Yuting Jiang 0583862d2d
Benchmarks: Code Revision - fix typo in test of nccl microbenchmark. (#163)
**Description**
 fix typo in test_nccl_bw_performance.py.

**Major Revision**
-  fix typo in test_nccl_bw_performance.py.
2021-08-23 13:53:47 +08:00
Ziyue Yang 6774d7b702
Benchmarks: Revise Benchmark - Add readwrite I/O pattern (#161)
**Description**
This commit adds readwrite I/O pattern for FIO benchmark. Read/write ratio is fixed at 4:1.
2021-08-22 22:38:25 +08:00
guoshzhao 7595d79434
Runner: Add Feature - Generate summarized output files. (#157)
**Description**
Generate the summarized output files from all nodes. For each metric, do the reduce operation according to the `reduce_op`

**Major Revision**
- Generate the summarized json file per node:
For microbenchmark, the format is `{benchmark_name}/[{run_count}/]{metric_name}[:rank]`
For modelbenchmark, the format is `{benchmark_name}/{sub_benchmark_name}/[{run_count}/]{metric_name}`
`[]` means optional.
```
{
  "kernel-launch/overhead_event:0": 0.00583,
  "kernel-launch/overhead_event:1": 0.00545,
  "kernel-launch/overhead_event:2": 0.00581,
  "kernel-launch/overhead_event:3": 0.00572,
  "kernel-launch/overhead_event:4": 0.00559,
  "kernel-launch/overhead_event:5": 0.00591,
  "kernel-launch/overhead_event:6": 0.00562,
  "kernel-launch/overhead_event:7": 0.00586,
  "resnet_models/pytorch-resnet50/steptime-train-float32": 544.0827468410134,
  "resnet_models/pytorch-resnet50/throughput-train-float32": 353.7607016465773,
  "resnet_models/pytorch-resnet50/steptime-train-float16": 425.40482617914677,
  "resnet_models/pytorch-resnet50/throughput-train-float16": 454.0142363793973,
  "pytorch-sharding-matmul/0/allreduce": 10.561786651611328,
  "pytorch-sharding-matmul/1/allreduce": 10.561786651611328,
  "pytorch-sharding-matmul/0/allgather": 10.088025093078613,
  "pytorch-sharding-matmul/1/allgather": 10.088025093078613
}
```
- Generate the summarized jsonl file for all nodes, each line is the result from one node in json format.
2021-08-20 16:48:40 +08:00
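The metric name formats described in #157 can be reproduced with two small helpers. The function names and signatures below are hypothetical; only the name layout and the sample names come from the commit message.

```python
def micro_metric_name(benchmark, metric, run_count=None, rank=None):
    """Build `{benchmark_name}/[{run_count}/]{metric_name}[:rank]`."""
    parts = [benchmark]
    if run_count is not None:
        parts.append(str(run_count))
    parts.append(metric)
    name = "/".join(parts)
    if rank is not None:
        name += f":{rank}"
    return name


def model_metric_name(benchmark, sub_benchmark, metric, run_count=None):
    """Build `{benchmark_name}/{sub_benchmark_name}/[{run_count}/]{metric_name}`."""
    parts = [benchmark, sub_benchmark]
    if run_count is not None:
        parts.append(str(run_count))
    parts.append(metric)
    return "/".join(parts)


print(micro_metric_name("kernel-launch", "overhead_event", rank=0))
print(micro_metric_name("pytorch-sharding-matmul", "allreduce", run_count=1))
print(model_metric_name("resnet_models", "pytorch-resnet50", "steptime-train-float32"))
```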
Yifan Xiong 98b6c0e3ca
Runner - Support mpi mode (#146)
Support mpi mode in runner:
* concatenate mpirun command
* support mca and env config
* prepare hostfile and update Ansible host pattern

Co-authored-by: Peng Cheng <chengpeng5555@outlook.com>
2021-08-19 15:59:17 +08:00
guoshzhao 7293e783f1
Benchmarks: Code Revision - change 'reduce' to 'reduce_op' (#156)
**Description**
Change the field name `reduce` to `reduce_op`.
2021-08-16 11:33:39 +08:00
guoshzhao acf365a856
Benchmarks: Add Feature - Set reduce type for current benchmarks' metrics. (#149)
**Description**
Set reduce type for current benchmarks' metrics, including model benchmarks and ShardingMatmul.
2021-08-06 17:23:14 +08:00
guoshzhao bc1a61b91a
Benchmarks: Code Revision - Calculate average value by using statistics module. (#148)
**Description**
Replace `sum(results) / len(results)` with `statistics.mean(results)`
2021-08-06 13:37:18 +08:00
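The replacement made in #148 can be checked on a few values. The numbers below are taken from the sample summary in #157 and are used purely for illustration.

```python
import math
import statistics

results = [0.00583, 0.00545, 0.00581, 0.00572]

manual_mean = sum(results) / len(results)
library_mean = statistics.mean(results)

# The two forms agree up to floating-point rounding.
print(math.isclose(manual_mean, library_mean))

# An empty list fails differently: ZeroDivisionError for the manual form,
# statistics.StatisticsError for statistics.mean.
try:
    statistics.mean([])
except statistics.StatisticsError as err:
    print("empty input rejected:", err)
```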
guoshzhao e41b1f6225
Benchmarks: Add Feature - Add reduce function support for output summary. (#147)
**Description**
Add reduce function support for output summary.

**Major Revision**
- Add reducer class to maintain all reduce functions.
- Save reduce type of each metric into `BenchmarkResult`
- Fix UT.
2021-08-05 16:52:49 +08:00
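A reducer of the kind described in #147 can be pictured as a lookup from a reduce type to a function. The sketch below is hypothetical: the registry name, the string keys, and the handling of `None` are assumptions and do not reflect the actual reducer class or `BenchmarkResult` layout.

```python
import statistics

# Hypothetical registry; the keys and the set of operations are assumptions.
REDUCE_FUNCS = {
    "max": max,
    "min": min,
    "avg": statistics.mean,
    "last": lambda values: values[-1],
}


def reduce_metric(values, reduce_op):
    """Collapse the values collected for one metric into a single value."""
    if reduce_op is None:
        return list(values)  # no reduction requested, keep every value
    return REDUCE_FUNCS[reduce_op](values)


step_times = [3.0, 1.0, 2.0]
print(reduce_metric(step_times, "max"))
print(reduce_metric(step_times, "last"))
print(reduce_metric(step_times, None))
```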
Yuting Jiang e083a598cf
Benchmarks: Add Benchmark - Add NCCL performance benchmark (#113)
**Description**
Add NCCL performance microbenchmark.

**Major Revision**
- Add microbenchmark, example, test, config for NCCL
2021-07-26 10:54:47 +08:00
Yuting Jiang b0c5addcac
Benchmarks: Add Benchmark - Add IB Loopback performance benchmark. (#112)
**Description**
Add RDMA Loopback performance microbenchmark.

**Major Revision**
- Add microbenchmark, example, test, config for RDMA Loopback
2021-07-24 03:40:24 +08:00
Ziyue Yang db297fb4ed
Benchmarks: Add Benchmark - Add disk performance benchmark (#132)
**Description**
Add disk performance microbenchmark.

**Major Revision**
- Add microbenchmark, example, test, config for disk performance.

**Minor Revision**
- Fix bugs in executor unit test related to default enabled tests.
2021-07-23 14:49:05 +08:00
Ziyue Yang 477fbb0ad2
Benchmarks: Fix bug - fix bug in test_executor.py to test default enabled tests only (#133)
**Description**
Fix bug of tests/executor/test_executor.py.

**Major Revision**
- Test default enabled benchmarks only instead of all benchmarks.
2021-07-20 20:11:08 +08:00
Yuting Jiang f9550bd693
Benchmarks: Add Benchmark - Add memory bandwidth benchmark for cuda. (#114)
Add microbenchmark, example, test, and config for cuda memory performance. Add cuda-samples (tagged with the cuda version) as a git submodule and update the related makefile.
2021-07-13 17:30:19 +08:00
Yuting Jiang 71c1617b2e
Utils: Code Revision - Update network common utils (#118)
Update network common utils. Add get_ib_devices in network common utils and move get_free_port from test utils to network common utils
2021-07-13 16:05:01 +08:00
guoshzhao 9c984c7eb0
Bug bash - Merge fix from release/0.2 to main (#124)
* Bug Fix - Fix race condition issue for multi ranks (#117)

Fix race condition issue when multiple ranks rotate the same directory.

* Update pipeline for release branch (#122)

* Bug Fix - Fix bug when converting bool config to store_true argument. (#120)

Co-authored-by: Yifan Xiong <yifan.xiong@microsoft.com>
2021-07-09 16:54:42 +08:00
Yifan Xiong 7458f83a9b
Runner & Executor - Support AMD GPU (#119)
Support both NVIDIA and AMD GPU and check GPU vendor during deployment and execution.

* Add GPU environment check in sb deploy.
* Check GPU vendor in executor.
2021-07-09 00:42:49 +08:00
Yifan Xiong fb7d4a7396
Runner - Fetch benchmarks results on all nodes (#116)
Fetch benchmarks results on all nodes, will rsync after each benchmark.
The results directory structure on control node is as follows:

```
outputs/
└── datetime
    ├── nodes
    │   └── node-0
    │       ├── benchmarks
    │       │   ├── benchmark-0
    │       │   │   ├── rank-0
    │       │   │   │   └── results.json
    │       └── sb-exec.log
    ├── sb-run.log
    └── sb.config.yaml
```
2021-07-02 21:45:56 +08:00
Yifan Xiong 7b0b0e9add
CLI - Support custom output directory (#110)
* Support custom output directory.
* Update document.
2021-07-01 21:10:12 +08:00
guoshzhao 8ffaddfaef
Benchmarks: Fix Bug - Fix gemm kernel bug for nvidia v100. (#105)
* fix bug for nvidia v100
* hard code the supported dict for different arch.
2021-06-29 18:46:44 +08:00
guoshzhao 9c7485276b
Benchmarks: Code Revision - Replace torch.optim.AdamW with transformers.AdamW. (#106)
* replace torch.optim.AdamW with transformers.AdamW.
2021-06-28 15:24:39 +08:00
Yifan Xiong c0c43b8f81
Bug bash - Fix bugs in multi GPU benchmarks (#98)
* Add `sb deploy` command content.
* Fix inline if-expression syntax in playbook.
* Fix quote escape issue in bash command.
* Add custom env in config.
* Update default config for multi GPU benchmarks.
* Update MANIFEST.in to include jinja2 template.
* Require jinja2 minimum version.
* Fix occasional duplicate output in Ansible runner.
* Fix mixed color from Ansible and Python colorlog.
* Update according to comments.
* Change superbench.env from list to dict in config file.
2021-06-23 18:16:43 +08:00
Yifan Xiong ddbc51a135
Bug bash - Fix bugs and refine log in single GPU benchmarks (#97)
Fix bugs and refine log in single GPU benchmarks:

* Fix none framework issue
* Fix empty parameter bug
* Remove missed mobilenet_v3 models
* Change benchmark registration log to debug level
* Add pid in logging
* Add missing benchmarks in default config
* Fix deprecated logging warn
2021-06-16 13:51:22 +08:00
guoshzhao 03b41be145
Benchmarks: Fix Bug - Fix OOM issue when run pytorch models sequentially. (#93)
* Clean up the cache.
2021-06-07 10:19:05 +08:00
guoshzhao 2d9be807a9
Benchmarks: Fix Bug - Fix return code overwrite issue (#94)
* fix return code reset issue
2021-06-04 18:02:12 +08:00
Yifan Xiong 6b0ca1cb05
Runner - Support local mode in runner (#88)
* Support local mode in runner.
2021-06-02 23:58:44 +08:00
guoshzhao 44c5103b5c
Benchmarks: Code Revision - Change default shape of sharding-matmul. (#92)
* Change default shape of sharding-matmul.
2021-06-02 10:50:09 +08:00
guoshzhao 6c6f526937
Benchmarks: Add Benchmark - Add FLOPs performance benchmark for cuda. (#87)
* add cuda flops performance benchmark.
2021-06-02 09:15:58 +08:00
Yuting Jiang 83235433b2
Benchmarks: Add benchmark - add micro benchmark for cudnn test (#89)
* add python related cudnn microbenchmark
2021-06-01 22:24:35 +08:00
Yifan Xiong 5e9f948df2
Executor - Save benchmark results to file (#86)
* Save benchmark results to json file.
2021-05-31 13:05:12 +08:00
Yuting Jiang 18398fbaa2
Benchmarks: Add benchmark - add micro benchmark for cublas test (#80)
* add benchmark for cublas test

* format

* revise error handling and test

* add interface to read json file, revise json file path and include .json in packaging

* add random_seed in arguments

* revise preprocess of cublas benchmark

* fix lint error and note error in source code

* update according to comments

* revise input arguments from json file to custom str and convert json file to built-in dict list

* restore package config

* fix lint issue

* update platform and comments

* rename files to match source code dir and fix comments error

Co-authored-by: root <root@sb-validation-000001.51z1chmys5fuzfqyo4niepozre.bx.internal.cloudapp.net>
2021-05-31 10:31:53 +08:00
Yifan Xiong 8b4f613a76
Runner - Support torch.distributed mode in runner (#81)
* Support `torch.distributed` mode in runner.
* Support given `proc_num` and `node_num` in `torch.distributed` mode.
2021-05-28 12:29:39 +08:00
Yifan Xiong e7f6d8ba78
CI/CD - Add integration tests for Ansible playbooks (#82)
* Add integration tests for Ansible playbooks
* Add `gpu_vendor` var to bypass gpu mount
2021-05-26 20:04:49 +08:00
Yifan Xiong c05e173b3d
Runner - Implement ansible client and runner (#69)
Implement ansible client and runner:
* add ansible client
* add deploy and check_env playbooks
2021-05-23 23:53:37 +08:00