__Description__
Add commands `sb benchmark list` and `sb benchmark list-parameters` to list benchmarks and all of their optional parameters.
<details>
<summary>Examples</summary>
<pre>
$ sb benchmark list -n [a-z]+-bw -o table
Result
--------
mem-bw
nccl-bw
rccl-bw
</pre>
<pre>
$ sb benchmark list-parameters -n mem-bw
=== mem-bw ===
optional arguments:
--bin_dir str Specify the directory of the benchmark binary.
--duration int The elapsed time of benchmark in seconds.
--mem_type str [str ...]
Memory types to benchmark. E.g. htod dtoh dtod.
--memory str Memory argument for bandwidthtest. E.g. pinned unpinned.
--run_count int The run count of benchmark.
--shmoo_mode Enable shmoo mode for bandwidthtest.
default values:
{'bin_dir': None,
'duration': 0,
'mem_type': ['htod', 'dtoh'],
'memory': 'pinned',
'run_count': 1}
</pre>
</details>
__Major Revisions__
* Add `sb benchmark list` to list benchmarks matching the given name.
* Add `sb benchmark list-parameters` to list parameters for benchmarks that match the given name.
__Minor Revisions__
* Sort format help text for argparse.
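
The name matching shown in the `sb benchmark list` example can be sketched roughly as follows. This is a minimal illustration, not the actual implementation: the `BENCHMARKS` list and the `list_benchmarks` helper are hypothetical, `example-other` is a made-up non-matching name, and the use of a full-string regular-expression match is an assumption.

```python
import re

# Hypothetical registry: three names from the example output above,
# plus one made-up name that should not match.
BENCHMARKS = ["nccl-bw", "mem-bw", "rccl-bw", "example-other"]


def list_benchmarks(pattern=None, names=BENCHMARKS):
    """Return sorted benchmark names that fully match the regular expression.

    Assumption: the whole name must match the pattern (re.fullmatch).
    """
    if pattern is None:
        return sorted(names)
    regex = re.compile(pattern)
    return sorted(name for name in names if regex.fullmatch(name))


print(list_benchmarks(r"[a-z]+-bw"))
```

Sorting the result keeps the listing stable regardless of registration order.
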
__Description__
Cherry-pick bug fixes from v0.4.0 to main.
__Major Revisions__
* Bug - Fix issues for Ansible and benchmarks (#267)
* Tests - Refine test cases for microbenchmark (#268)
* Bug - Build openmpi with ucx support in rocm dockerfiles (#269)
* Benchmarks: Fix Bug - Fix fio build issue (#272)
* Docs - Unify metric and add doc for cublas and cudnn functions (#271)
* Monitor: Revision - Add 'monitor/' prefix to monitor metrics in result summary (#274)
* Bug - Fix bug of detecting if gpu_index is none (#275)
* Bug - Fix bugs in data diagnosis (#273)
* Bug - Fix issue that the root mpi rank may not be the first in the hostfile (#270)
* Benchmarks: Configuration - Update inference and network benchmarks in configs (#276)
* Docs - Upgrade version and release note (#277)
Co-authored-by: Yuting Jiang <v-yutjiang@microsoft.com>
**Description**
Add ONNXRuntime inference benchmark based on ORT python API.
**Major Revision**
- Add `ORTInferenceBenchmark` class to export PyTorch models to ONNX models and run inference.
- Add tests and an example for the `ort-inference` benchmark.
- Update the introduction docs.
**Description**
Integrate monitor into SuperBench.
**Major Revision**
- Initialize, start and stop monitor in SB executor.
- Parse the monitor data in SB runner and merge into benchmark results.
- Specify ReduceType for monitor metrics, such as MAX, MIN and LAST.
- Add monitor configs into config file.
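
As a rough illustration of how a reduce type could collapse a series of monitor samples into a single value in the result summary, consider the sketch below. Only the names `ReduceType`, `MAX`, `MIN` and `LAST` come from the description above; the enum values, the `reduce_samples` helper, and the sample data are assumptions made for this example.

```python
from enum import Enum


class ReduceType(Enum):
    """Reduce types named in the description; the string values are assumed."""
    MAX = "max"
    MIN = "min"
    LAST = "last"


def reduce_samples(samples, reduce_type):
    """Collapse a list of monitor samples into one value (hypothetical helper)."""
    if not samples:
        raise ValueError("no samples to reduce")
    if reduce_type is ReduceType.MAX:
        return max(samples)
    if reduce_type is ReduceType.MIN:
        return min(samples)
    if reduce_type is ReduceType.LAST:
        return samples[-1]
    raise ValueError(f"unsupported reduce type: {reduce_type}")


# Made-up temperature samples captured over a benchmark run.
temperatures = [40, 75, 60]
print(reduce_samples(temperatures, ReduceType.MAX))
```

A peak-oriented metric such as temperature would plausibly use `MAX`, while a cumulative counter would plausibly use `LAST`.
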
**Description**
Add data diagnosis module.
**Major Revision**
- Add `DataDiagnosis` class to support rule-based data diagnosis for the result summary JSONL file of multiple nodes.
- Add `RuleOp` class to define rule operators.
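
The idea of rule-based diagnosis can be sketched as below. This is a simplified illustration under stated assumptions, not the `DataDiagnosis` or `RuleOp` API: the `RULE_OPS` table, the rule dictionary layout, the `diagnose` helper, and the metric name are all hypothetical.

```python
import operator

# Hypothetical operator table; SuperBench's RuleOp may define different operators.
RULE_OPS = {"lt": operator.lt, "gt": operator.gt}


def diagnose(results, rules):
    """Return {node: [violated metric names]} for nodes that break any rule."""
    issues = {}
    for node, metrics in results.items():
        violated = [
            rule["metric"]
            for rule in rules
            if rule["metric"] in metrics
            and RULE_OPS[rule["op"]](metrics[rule["metric"]], rule["threshold"])
        ]
        if violated:
            issues[node] = violated
    return issues


# Made-up per-node results, as they might be loaded from a summary JSONL file.
results = {"node0": {"mem-bw/h2d": 25.0}, "node1": {"mem-bw/h2d": 12.0}}
# A rule flags a node when the metric is lower than the threshold.
rules = [{"metric": "mem-bw/h2d", "op": "lt", "threshold": 20.0}]
print(diagnose(results, rules))
```

Keeping operators in a lookup table means new rule types can be added without changing the diagnosis loop.
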
**Description**
If `ignore_invalid` is True and 'required' arguments are not set when registering the benchmark, skip the argument checking; the user is expected to provide those arguments in the config.
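
One way to picture this behavior with `argparse` is sketched below. It is only an approximation of the idea, not the actual change: the `build_parser` helper and the `--model_path` argument are hypothetical, and relaxing `required` is an assumed mechanism for skipping the check.

```python
import argparse


def build_parser(ignore_invalid=False):
    """Build a parser whose required check is relaxed when ignore_invalid is True."""
    parser = argparse.ArgumentParser(prog="benchmark")
    # Hypothetical required argument; with ignore_invalid=True the check is
    # skipped and the value is expected to come from the user config later.
    parser.add_argument("--model_path", type=str, required=not ignore_invalid)
    parser.add_argument("--run_count", type=int, default=1)
    return parser


# Registration time: no arguments yet, so the required check must not fail.
args = build_parser(ignore_invalid=True).parse_args([])
print(args.model_path, args.run_count)
```

With `ignore_invalid=False`, the same empty argument list would make `argparse` exit with an error about the missing `--model_path`.
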
**Description**
Add the initial version of Monitor.
**Major Revision**
- Add `Monitor` class to launch background process for monitoring.
- Add `MonitorRecord` class to store the data from a single capture.
**Description**
Rename the `nvidia_helper` utility to the `device_manager` module and support more functions:
```
device_manager.get_device_count()
device_manager.get_device_utilization(idx)
device_manager.get_device_temperature(idx)
device_manager.get_device_power_limit(idx)
device_manager.get_device_memory(idx)
device_manager.get_device_row_remapped_info(idx)
device_manager.get_device_ecc_error(idx)
```
**Description**
This commit does the following:
1) Adds a CPU-initiated copy benchmark;
2) Adds a dtod benchmark;
3) Supports scanning NUMA nodes and GPUs inside the benchmark program;
4) Renames gpu-sm-copy to gpu-copy.
**Description**
Add gpcnet microbenchmark
**Major Revision**
- add 2 microbenchmarks for gpcnet: gpc-network-test and gpc-network-load-test
- add related tests and example file
**Description**
Add gpcnet as git submodule and building logic.
**Major Revision**
- add gpcnet as a submodule
- add build logic in third_party/Makefile
**Description**
Add IB validation tool source code. The IB validation tool flexibly validates IB traffic of different patterns across multiple nodes.
**Major Revision**
- Add ib validation tool source code
- Add cmake file to build the source code
**Description**
Add TCP connectivity validation microbenchmark, which validates TCP connectivity between the current node and several nodes in the hostfile.
**Major Revision**
- Add TCP connectivity validation microbenchmark with related test and example
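
The core check behind a TCP connectivity validation can be sketched with the Python standard library as below. This is a minimal sketch, not the microbenchmark's code: the `check_tcp` and `validate_hosts` helpers, the timeout value, and the choice of port are assumptions made for illustration.

```python
import socket


def check_tcp(host, port, timeout=1.0):
    """Return True if a TCP connection to (host, port) succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


def validate_hosts(hosts, port, timeout=1.0):
    """Check each host (e.g. lines read from a hostfile) and report reachability."""
    return {host: check_tcp(host, port, timeout) for host in hosts}
```

For example, `validate_hosts(["node0", "node1"], 22)` would return a dictionary mapping each hypothetical host name to whether the connection attempt succeeded.
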
__Major Revisions__
* Refine document structure for user tutorial.
__Minor Revisions__
* Add AMD part in installation.
* Change default config file to latest link.