**Description**
Add result summary in Excel, Markdown, and HTML formats.
**Major Revision**
- Add ResultSummary class to support result summary in Excel, Markdown, and HTML formats.
- Extract a RuleBase class for the functions shared by DataDiagnosis and ResultSummary.
**Description**
Add GPU-Burn to SuperBench.
- added third party submodule
- modified Makefile to make gpu-burn binary
- added/modified microbenchmarks to add gpu-burn python scripts
- modified default and azure_ndv4 configs to add gpu-burn
**Description**
Add Markdown and HTML output formats for DataDiagnosis.
**Major Revision**
- add md and html support in file_handler
- add interface in DataDiagnosis for md and HTML output
**Minor Revision**
- move excel and json output interface into DataDiagnosis
**Description**
Add a multi-rules feature to data diagnosis so that several rules can be combined in a single check.
**Major Revision**
- revise rule design to support checks that combine multiple rules
- update related code and tests
**Description**
This commit removes NUMA binding for device-to-device tests because NUMA doesn't affect their performance, and revises benchmark metrics accordingly.
**Description**
This commit does the following to optimize result variance in gpu_copy benchmark:
1) Add warmup phase for gpu_copy benchmark to avoid timing instability caused by first-time CUDA kernel launch overhead;
2) Use CUDA events for timing instead of CPU timestamps;
3) Make data checking optional; enabling it is not recommended in performance tests;
4) Enlarge message size in performance benchmark.
**Description**
**Major Revision**
- Sync (do allreduce max) the E2E training results among all workers.
- Avoid the ':0' suffix in the metric name when only one rank produces output.
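
The naming rule in the second bullet can be illustrated with a minimal sketch. The function name `format_metric_names` and the dict-of-ranks input are hypothetical and chosen only for illustration; they are not taken from the SuperBench code.

```python
def format_metric_names(metric, values_by_rank):
    """Map per-rank values to metric names (hypothetical helper).

    The ':<rank>' suffix is added only when more than one rank reports a value.
    """
    if len(values_by_rank) == 1:
        # A single reporting rank: the suffix carries no information, so drop it.
        (value,) = values_by_rank.values()
        return {metric: value}
    return {f'{metric}:{rank}': value for rank, value in sorted(values_by_rank.items())}
```

With one reporting rank the key is the bare metric name; with several ranks each key keeps its `:<rank>` suffix.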
**Description**
Add timeout feature for each benchmark.
**Major Revision**
- Add a `timeout` config for each benchmark. The current config files set the timeout only for kernel-launch, as an example; other benchmarks can be configured later.
- Set the timeout config for `ansible_runner.run()`. When the timeout is hit, the runner gets return code 254:
[ansible.py:80][WARNING] Run failed, return code 254.
- Use the `timeout` command to terminate the client process.
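
As a rough sketch of the last bullet, a command string can be prefixed with the coreutils `timeout` command when a limit is configured. The helper name `wrap_with_timeout` is hypothetical and does not claim to match the actual runner code.

```python
def wrap_with_timeout(command, timeout=None):
    """Prefix a shell command with `timeout <seconds>` (hypothetical helper).

    The command is returned unchanged when no positive timeout is configured.
    """
    if timeout is None or timeout <= 0:
        return command
    return f'timeout {int(timeout)} {command}'
```

For example, `wrap_with_timeout('sleep 10', 5)` yields `timeout 5 sleep 10`, so the process is terminated once the limit is reached.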
__Description__
Update benchmark naming to support annotations.
__Major Revisions__
- Update name for `create_benchmark_context` in executor.
- Backward compatibility for model benchmarks using "_models" suffix.
- Update documents.
**Description**
This commit adds bidirectional tests in gpu_copy benchmark for both device-host transfer and device-device transfer, and revises related tests.
__Description__
Add command `sb benchmark list` and `sb benchmark list-parameters` to support listing all optional parameters for benchmarks.
<details>
<summary>Examples</summary>
<pre>
$ sb benchmark list -n [a-z]+-bw -o table
Result
--------
mem-bw
nccl-bw
rccl-bw
</pre>
<pre>
$ sb benchmark list-parameters -n mem-bw
=== mem-bw ===
optional arguments:
--bin_dir str Specify the directory of the benchmark binary.
--duration int The elapsed time of benchmark in seconds.
--mem_type str [str ...]
Memory types to benchmark. E.g. htod dtoh dtod.
--memory str Memory argument for bandwidthtest. E.g. pinned unpinned.
--run_count int The run count of benchmark.
--shmoo_mode Enable shmoo mode for bandwidthtest.
default values:
{'bin_dir': None,
'duration': 0,
'mem_type': ['htod', 'dtoh'],
'memory': 'pinned',
'run_count': 1}
</pre>
</details>
__Major Revisions__
* Add `sb benchmark list` to list benchmarks matching given name.
* Add `sb benchmark list-parameters` to list parameters for benchmarks which match given name.
__Minor Revisions__
* Sort format help text for argparse.
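
The name filter shown in the examples can be sketched as a regular-expression match over registered benchmark names. This is an illustration only: the function name `list_benchmarks` is hypothetical, and the use of `re.fullmatch` (rather than a prefix or substring match) is an assumption.

```python
import re


def list_benchmarks(names, pattern=None):
    """Return the sorted benchmark names matching a regex (hypothetical helper).

    Assumes the whole name must match the pattern; all names are returned
    when no pattern is given.
    """
    if pattern is None:
        return sorted(names)
    regex = re.compile(pattern)
    return sorted(name for name in names if regex.fullmatch(name))
```

Under this assumption, the pattern `[a-z]+-bw` selects `mem-bw`, `nccl-bw`, and `rccl-bw` but not `kernel-launch`.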
__Description__
Cherry-pick bug fixes from v0.4.0 to main.
__Major Revisions__
* Bug - Fix issues for Ansible and benchmarks (#267)
* Tests - Refine test cases for microbenchmark (#268)
* Bug - Build openmpi with ucx support in rocm dockerfiles (#269)
* Benchmarks: Fix Bug - Fix fio build issue (#272)
* Docs - Unify metric and add doc for cublas and cudnn functions (#271)
* Monitor: Revision - Add 'monitor/' prefix to monitor metrics in result summary (#274)
* Bug - Fix bug of detecting if gpu_index is none (#275)
* Bug - Fix bugs in data diagnosis (#273)
* Bug - Fix issue that the root mpi rank may not be the first in the hostfile (#270)
* Benchmarks: Configuration - Update inference and network benchmarks in configs (#276)
* Docs - Upgrade version and release note (#277)
Co-authored-by: Yuting Jiang <v-yutjiang@microsoft.com>
**Description**
Add ONNXRuntime inference benchmark based on ORT python API.
**Major Revision**
- Add `ORTInferenceBenchmark` class to export a PyTorch model to an ONNX model and run inference.
- Add tests and example for `ort-inference` benchmark
- Update the introduction docs.
**Description**
Integrate monitor into SuperBench.
**Major Revision**
- Initialize, start and stop monitor in SB executor.
- Parse the monitor data in SB runner and merge into benchmark results.
- Specify ReduceType for monitor metrics, such as MAX, MIN and LAST.
- Add monitor configs into config file.
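
A minimal sketch of how a reduce type could collapse a series of monitor samples into one value. The member names MAX, MIN, and LAST come from the list above; the enum values and the `reduce_samples` function are hypothetical.

```python
from enum import Enum


class ReduceType(Enum):
    """Reduce types named in this change; the string values are assumptions."""
    MAX = 'max'
    MIN = 'min'
    LAST = 'last'


def reduce_samples(samples, reduce_type):
    """Collapse a list of monitor samples into a single value (hypothetical helper)."""
    if not samples:
        raise ValueError('no samples to reduce')
    if reduce_type is ReduceType.MAX:
        return max(samples)
    if reduce_type is ReduceType.MIN:
        return min(samples)
    if reduce_type is ReduceType.LAST:
        # LAST keeps the most recent sample.
        return samples[-1]
    raise ValueError(f'unsupported reduce type: {reduce_type}')
```

A metric such as temperature would plausibly use MAX, while a counter-like metric would use LAST; which metric uses which type is decided in the config, not in this sketch.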
**Description**
Add data diagnosis module.
**Major Revision**
- Add DataDiagnosis class to support rule-based data diagnosis for the result summary jsonl file of multiple nodes
- Add RuleOp class to define rule operators
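
To make the idea of a rule operator concrete, here is a small sketch of one possible rule: the relative deviation from a baseline, checked by a user-supplied criterion. The names `variance` and `diagnose`, the input layout, and the choice to flag missing metrics are all assumptions for illustration, not the actual `RuleOp` interface.

```python
def variance(value, baseline):
    """Relative deviation of a value from its baseline (hypothetical rule operator)."""
    return (value - baseline) / baseline


def diagnose(results_by_node, metric, baseline, criteria):
    """Return the sorted names of nodes that violate the rule (hypothetical helper).

    `criteria` is a callable that receives the relative deviation and returns
    True when the node should be flagged.
    """
    failed = []
    for node, metrics in results_by_node.items():
        if metric not in metrics:
            # Assumption: a missing metric is treated as a violation.
            failed.append(node)
        elif criteria(variance(metrics[metric], baseline)):
            failed.append(node)
    return sorted(failed)
```

Each entry of `results_by_node` would correspond to one line of the summary jsonl file.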
**Description**
If `ignore_invalid` is True and 'required' arguments are not set when the benchmark is registered, skip the argument check; the user is expected to provide those arguments in the config.
**Description**
Add the initial version of Monitor.
**Major Revision**
- Add `Monitor` class to launch background process for monitoring.
- Add `MonitorRecord` class to store the data captured in a single sample.
**Description**
Rename `nvidia_helper` utility as `device_manager` module and support more functions:
```
device_manager.get_device_count()
device_manager.get_device_utilization(idx)
device_manager.get_device_temperature(idx)
device_manager.get_device_power_limit(idx)
device_manager.get_device_memory(idx)
device_manager.get_device_row_remapped_info(idx)
device_manager.get_device_ecc_error(idx)
```
**Description**
This commit does the following:
1) Adds CPU-initiated copy benchmark;
2) Adds dtod benchmark;
3) Support scanning NUMA nodes and GPUs inside the benchmark program;
4) Change the name of gpu-sm-copy to gpu-copy.
**Description**
Add gpcnet microbenchmark
**Major Revision**
- add 2 microbenchmarks for gpcnet: gpc-network-test and gpc-network-load-test
- add related test and example files
**Description**
Add TCP connectivity validation microbenchmark, which validates TCP connectivity between the current node and several nodes in the hostfile.
**Major Revision**
- Add tcp connectivity validation microbenchmark and related test, example
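
The kind of check described here can be sketched with the standard library alone. This is not the benchmark's own code: the function names and the default timeout are hypothetical, and a plain TCP connect is used as a stand-in for whatever probing the microbenchmark performs.

```python
import socket


def can_connect(host, port, timeout=1.0):
    """Return True if a TCP connection to (host, port) succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


def validate_hosts(hosts, port, timeout=1.0):
    """Check every host and return a {host: reachable} mapping (hypothetical helper)."""
    return {host: can_connect(host, port, timeout) for host in hosts}
```

In practice the host list would come from the hostfile, and the result could be reported per host as a success or failure metric.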
**Description**
Revise the nccl/rccl arguments to support MPI mode (MPI could not run with nccl/rccl because multiple operators ran in sequence without a barrier), and rename the metric.
**Major Revision**
- revise argument operators to be a single one
**Minor Revision**
- rename metric to remove benchmark name info
- change argument ngpus default value to be 1
__Description__
Fix inventory bug in ansible_runner when host list is provided with multiple hosts.
The ansible_runner library ought to handle this case; as a workaround, pass the inventory through the `--inventory` command-line argument.
**Description**
Revise the DockerBenchmark base to support image pull, image rm etc.
**Major Revision**
- image pull in _preprocess()
- image clean in _postprocess()
- execute customized commands in _benchmark()
- add unit tests
**Description**
This commit revises disk performance benchmark, including:
1) Add missing benchmark name in default config;
2) Avoid using reserved character ':' in metric name.
**Description**
Add gemm flops microbenchmark for amd.
**Major Revision**
- Add gemm flops microbenchmark for amd.
- Add related example and test file.
**Description**
Extract base class for gemm flops microbenchmark.
**Major Revision**
- extract base class for gemm flops microbenchmark and add related test.
- revise gemm_flops_performance for cuda.
**Description**
Add memory bus bandwidth performance microbenchmark for amd.
**Major Revision**
- Add memory bus bandwidth performance microbenchmark for amd.
- Add related example and test file.
**Description**
Extract base class for memory bandwidth microbenchmark.
**Major Revision**
- revise and optimize cuda_memory_bandwidth_performance
- extract base class for memory bandwidth microbenchmark
- add test for base class
**Description**
Generate the summarized output files from all nodes. For each metric, apply the reduce operation specified by its `reduce_op`.
**Major Revision**
- Generate the summarized json file per node:
For microbenchmark, the format is `{benchmark_name}/[{run_count}/]{metric_name}[:rank]`
For modelbenchmark, the format is `{benchmark_name}/{sub_benchmark_name}/[{run_count}/]{metric_name}`
`[]` means optional.
```
{
"kernel-launch/overhead_event:0": 0.00583,
"kernel-launch/overhead_event:1": 0.00545,
"kernel-launch/overhead_event:2": 0.00581,
"kernel-launch/overhead_event:3": 0.00572,
"kernel-launch/overhead_event:4": 0.00559,
"kernel-launch/overhead_event:5": 0.00591,
"kernel-launch/overhead_event:6": 0.00562,
"kernel-launch/overhead_event:7": 0.00586,
"resnet_models/pytorch-resnet50/steptime-train-float32": 544.0827468410134,
"resnet_models/pytorch-resnet50/throughput-train-float32": 353.7607016465773,
"resnet_models/pytorch-resnet50/steptime-train-float16": 425.40482617914677,
"resnet_models/pytorch-resnet50/throughput-train-float16": 454.0142363793973,
"pytorch-sharding-matmul/0/allreduce": 10.561786651611328,
"pytorch-sharding-matmul/1/allreduce": 10.561786651611328,
"pytorch-sharding-matmul/0/allgather": 10.088025093078613,
"pytorch-sharding-matmul/1/allgather": 10.088025093078613
}
```
- Generate the summarized jsonl file for all nodes, each line is the result from one node in json format.
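
The key formats above can be reproduced with a short sketch. The helper names `metric_key` and `to_jsonl` are hypothetical; only the formats themselves come from this description.

```python
import json


def metric_key(benchmark, metric, sub_benchmark=None, run_count=None, rank=None):
    """Build a summary key following the formats described above (hypothetical helper).

    Optional parts (sub-benchmark, run count, rank) are added only when given.
    """
    parts = [benchmark]
    if sub_benchmark is not None:
        parts.append(sub_benchmark)
    if run_count is not None:
        parts.append(str(run_count))
    parts.append(metric)
    key = '/'.join(parts)
    if rank is not None:
        key = f'{key}:{rank}'
    return key


def to_jsonl(node_summaries):
    """Serialize one summary dict per node into jsonl text, one JSON object per line."""
    return '\n'.join(json.dumps(summary, sort_keys=True) for summary in node_summaries)
```

For instance, `metric_key('kernel-launch', 'overhead_event', rank=0)` gives `kernel-launch/overhead_event:0`, matching the sample output.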
**Description**
Add reduce function support for output summary.
**Major Revision**
- Add reducer class to maintain all reduce functions.
- Save reduce type of each metric into `BenchmarkResult`
- Fix UT.
**Description**
Add disk performance microbenchmark.
**Major Revision**
- Add microbenchmark, example, test, config for disk performance.
**Minor Revision**
- Fix bugs in executor unit test related to default enabled tests.
Add microbenchmark, example, test, and config for CUDA memory performance. Add cuda-samples (tagged with the CUDA version) as a git submodule and update the related Makefile.
Support both NVIDIA and AMD GPU and check GPU vendor during deployment and execution.
* Add GPU environment check in sb deploy.
* Check GPU vendor in executor.
* Add `sb deploy` command content.
* Fix inline if-expression syntax in playbook.
* Fix quote escape issue in bash command.
* Add custom env in config.
* Update default config for multi GPU benchmarks.
* Update MANIFEST.in to include jinja2 template.
* Require jinja2 minimum version.
* Fix occasional duplicate output in Ansible runner.
* Fix mixed color from Ansible and Python colorlog.
* Update according to comments.
* Change superbench.env from list to dict in config file.
* add benchmark for cublas test
* format
* revise error handling and test
* add interface to read json file, revise json file path and include .json in packaging
* add random_seed in arguments
* revise preprocess of cublas benchmark
* fix lint error and note error in source code
* update according to comments
* revise input arguments from json file to custom str and convert json file to built-in dict list
* restore package config
* fix lint issue
* update platform and comments
* rename files to match source code dir and fix comments error
Co-authored-by: root <root@sb-validation-000001.51z1chmys5fuzfqyo4niepozre.bx.internal.cloudapp.net>
* Benchmarks: Add Benchmark - add computation and communication overlap micro benchmark
* Benchmarks: Add benchmark - fix some format issues and typo
* Benchmarks: Add Benchmark - update according to comments and add test
* revise tests
* skip multi gpu test due to no multi gpu
Co-authored-by: v-yujiang <v-yujiang@microsoft.com>