Add support for arm64 build:
- Update Dockerfile for arm64 build
- Extend CPU STREAM compilation for Neoverse
- Handle onnxruntime-gpu installation
- Filter third-party builds based on architecture
- Disable CUDA decode perf build for non-x86
**Description**
Add AMD support in monitor.
**Major Revision**
- Add the pyrsmi library to collect metrics (see the sketch below).
- Currently it can collect device_utilization, device_power, device_used_memory,
and device_total_memory.
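A minimal sketch of reading these metrics through pyrsmi's `rocml` module; the function names reflect my understanding of the pyrsmi API and the loop is illustrative, not the monitor's actual implementation.
```
# Illustrative only: not the monitor's code; pyrsmi function names are assumptions.
from pyrsmi import rocml

rocml.smi_initialize()
try:
    for dev in range(rocml.smi_get_device_count()):
        metrics = {
            'device_utilization': rocml.smi_get_device_utilization(dev),
            'device_power': rocml.smi_get_device_average_power(dev),
            'device_used_memory': rocml.smi_get_device_memory_used(dev),
            'device_total_memory': rocml.smi_get_device_memory_total(dev),
        }
        print(f'gpu {dev}: {metrics}')
finally:
    rocml.smi_shutdown()
```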
Update version to include revision hash and date in the "{last tag}+g{git
hash}.d{date}" format; here are some examples, with a derivation sketch after the list:
* exact tag: 0.6.0
* commit after tag: 0.6.0+gcbb1b34
* commit after tag with local changes: 0.6.0+gcbb1b34.d20221028
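A minimal sketch of how such a version string can be derived from `git describe`; this is an assumption about the mechanism (the project may use a setuptools_scm-style tool instead), not the actual implementation.
```
# Sketch only: derives "{last tag}+g{hash}.d{date}" from `git describe` output.
import subprocess
from datetime import date


def get_version():
    # e.g. '0.6.0-0-gcbb1b34' (exact tag) or '0.6.0-2-gcbb1b34-dirty'
    describe = subprocess.check_output(
        ['git', 'describe', '--tags', '--long', '--dirty'], text=True
    ).strip()
    dirty = describe.endswith('-dirty')
    if dirty:
        describe = describe[:-len('-dirty')]
    tag, distance, ghash = describe.rsplit('-', 2)
    if int(distance) == 0 and not dirty:
        return tag                                 # 0.6.0
    version = f'{tag}+{ghash}'                     # 0.6.0+gcbb1b34
    if dirty:
        version += f'.d{date.today():%Y%m%d}'      # 0.6.0+gcbb1b34.d20221028
    return version


print(get_version())
```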
**Description**
Cherry-pick bug fixes from v0.6.0 to main.
**Major Revisions**
* Enable latency test in ib traffic validation distributed benchmark (#396)
* Enhance parameter parsing to allow spaces in value (#397)
* Update apt packages in dockerfile (#398)
* Upgrade colorlog for NO_COLOR support (#404)
* Analyzer - Update error handling to support exit code of sb result diagnosis (#403)
* Analyzer - Make baseline file optional in data diagnosis and fix bugs (#399)
* Enhance timeout cleanup to avoid possible hanging (#405)
* Auto generate ibstat file by pssh (#402)
* Analyzer - Format int type and unify empty value to N/A in diagnosis output file (#406)
* Docs - Upgrade version and release note (#407)
* Docs - Fix issues in document (#408)
Co-authored-by: Yang Wang <yangwang1@microsoft.com>
Co-authored-by: Yuting Jiang <yutingjiang@microsoft.com>
__Description__
Update Python setup for required packages.
__Major Revisions__
* Downgrade requests version to be compatible with Python 3.6 and add a corresponding pipeline for 3.6
* Add an extra entry in extras_require for nested packages (see the sketch after this list)
* Update `pip install` contents accordingly
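A minimal sketch of the nested-extras idea; the extra names and version pins below are hypothetical, not the project's actual dependency list, and the aggregate entry is just one way to group nested packages under a single extra.
```
# Illustrative setup.py fragment; extra names and pins are hypothetical.
from setuptools import setup

extras = {
    'torch': ['torch>=1.7'],
    'ort': ['onnxruntime-gpu'],
}
# Nested entry that aggregates the packages of the other extras.
extras['all'] = sorted({pkg for deps in extras.values() for pkg in deps})

setup(
    name='example-package',
    version='0.0.1',
    install_requires=['requests<2.28'],  # hypothetical pin kept compatible with Python 3.6
    extras_require=extras,
)
```
With such a layout, `pip install 'example-package[all]'` pulls in everything the individual extras would install.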
An enhancement for topo-aware IB performance validation (#373).
This PR auto-generates the required ibstat file `ib_traffic_topo_aware_ibstat.txt`, which is used as input to build a graph.
* Support topo-aware IB performance validation
Add a new pattern `topo-aware`, so the user can run IB performance
tests based on the VMs' topology information. This way, the user can
validate IB performance across VM pairs at different distances
as a quick test instead of a full pair-wise test.
To run with the topo-aware pattern, the user needs to specify three required
(and two optional) parameters in the YAML config file:
- `--pattern` topo-aware
- `--ibstat` path to ibstat output
- `--ibnetdiscover` path to ibnetdiscover output
- `--min_dist` minimum distance of VM pairs (optional, default 2)
- `--max_dist` maximum distance of VM pairs (optional, default 6)
The newly added topo_aware module then parses the topology
information, builds a graph, and generates the VM pairs at the
specified distance (number of hops), as sketched below.
The specified IB test then runs across these
generated VM pairs.
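A minimal sketch of the pair-generation idea under the assumption that the topology is an undirected graph of VMs and switches; this illustrates selecting pairs by hop distance and is not the actual topo_aware module.
```
# Sketch only: generate VM pairs whose topology distance is within [min_dist, max_dist].
from collections import deque


def bfs_distances(graph, src):
    """Hop distance from src to every reachable node."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        for nbr in graph[node]:
            if nbr not in dist:
                dist[nbr] = dist[node] + 1
                queue.append(nbr)
    return dist


def gen_pairs(graph, vms, min_dist=2, max_dist=6):
    for i, src in enumerate(vms):
        dist = bfs_distances(graph, src)
        for dst in vms[i + 1:]:
            if min_dist <= dist.get(dst, float('inf')) <= max_dist:
                yield src, dst, dist[dst]


# Tiny example: vm1/vm2 share a leaf switch (2 hops), vm3 sits behind a spine (4 hops).
topology = {
    'vm1': ['leaf1'], 'vm2': ['leaf1'], 'vm3': ['leaf2'],
    'leaf1': ['vm1', 'vm2', 'spine'], 'leaf2': ['vm3', 'spine'],
    'spine': ['leaf1', 'leaf2'],
}
print(list(gen_pairs(topology, ['vm1', 'vm2', 'vm3'])))
```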
Signed-off-by: Jie Zhang <jessezhang1010@gmail.com>
* Add description about topology aware ib traffic tests
Signed-off-by: Jie Zhang <jessezhang1010@gmail.com>
* Add unit test to verify generated topology aware config file
This commit adds a unit test to verify that the generated topology-aware
config file is correct. To do so, four new data files are added in
order to invoke the gen_topo_aware_config function, generate the topology-aware
config file, and compare it with the expected config file.
Signed-off-by: Jie Zhang <jessezhang1010@gmail.com>
* Fix lint issue on Azure pipeline
Signed-off-by: Jie Zhang <jessezhang1010@gmail.com>
**Description**
Support `sb run` on host directly without Docker
**Major Revisions**
- Add `--no-docker` argument for `sb run`.
- Run on the host directly if `--no-docker` is specified.
- Update docs and tests correspondingly.
**Description**
Cherry-pick bug fixes from v0.5.0 to main.
**Major Revisions**
* Bug - Force to fix ort version as '1.10.0' (#343)
* Bug - Support no matching rules and unify the output name in result_summary (#345)
* Analyzer - Support regex in annotations of benchmark naming for metrics in rules (#344)
* Bug - Fix bugs in sync results on root rank for e2e model benchmarks (#342)
* Bug - Fix bug of duration feature for model benchmarks in distributed mode (#347)
* Docs - Upgrade version and release note (#348)
Co-authored-by: Yuting Jiang <v-yutjiang@microsoft.com>
**Description**
Add md and html output formats for DataDiagnosis.
**Major Revision**
- Add md and html support in file_handler (see the sketch below)
- Add interfaces in DataDiagnosis for md and html output
**Minor Revision**
- Move the excel and json output interfaces into DataDiagnosis
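A minimal sketch of the markdown/HTML output, assuming the diagnosis summary is held in a pandas DataFrame; `output_md` and `output_html` are illustrative names, not the actual file_handler interface.
```
# Illustrative only: function names and DataFrame layout are assumptions.
import pandas as pd


def output_md(df: pd.DataFrame, path: str) -> None:
    """Write the diagnosis summary as a markdown table (requires `tabulate`)."""
    with open(path, 'w') as f:
        f.write(df.to_markdown())


def output_html(df: pd.DataFrame, path: str) -> None:
    """Write the diagnosis summary as an HTML table."""
    df.to_html(path, na_rep='N/A')


summary = pd.DataFrame(
    {'Category': ['KernelLaunch'], 'kernel-launch/overhead_event:0': [0.00583]},
    index=['node-0'],
)
output_md(summary, 'diagnosis_summary.md')
output_html(summary, 'diagnosis_summary.html')
```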
__Description__
Cherry-pick bug fixes from v0.4.0 to main.
__Major Revisions__
* Bug - Fix issues for Ansible and benchmarks (#267)
* Tests - Refine test cases for microbenchmark (#268)
* Bug - Build openmpi with ucx support in rocm dockerfiles (#269)
* Benchmarks: Fix Bug - Fix fio build issue (#272)
* Docs - Unify metric and add doc for cublas and cudnn functions (#271)
* Monitor: Revision - Add 'monitor/' prefix to monitor metrics in result summary (#274)
* Bug - Fix bug of detecting if gpu_index is none (#275)
* Bug - Fix bugs in data diagnosis (#273)
* Bug - Fix issue that the root mpi rank may not be the first in the hostfile (#270)
* Benchmarks: Configuration - Update inference and network benchmarks in configs (#276)
* Docs - Upgrade version and release note (#277)
Co-authored-by: Yuting Jiang <v-yutjiang@microsoft.com>
**Description**
Add an ONNXRuntime inference benchmark based on the ORT Python API.
**Major Revision**
- Add `ORTInferenceBenchmark` class to export PyTorch models to ONNX and run inference (the flow is sketched after this list)
- Add tests and an example for the `ort-inference` benchmark
- Update the introduction docs.
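A minimal sketch of the export-then-infer flow the benchmark builds on, using the public torch.onnx and onnxruntime APIs; the model choice and file names are illustrative, and this is not the `ORTInferenceBenchmark` implementation itself.
```
# Illustrative only: export a PyTorch model to ONNX, then run it with ONNX Runtime.
import numpy as np
import torch
import torchvision.models as models
import onnxruntime as ort

model = models.resnet50().eval()
dummy = torch.randn(1, 3, 224, 224)
torch.onnx.export(model, dummy, 'resnet50.onnx', input_names=['input'], output_names=['output'])

# Use CUDAExecutionProvider instead when onnxruntime-gpu is installed.
session = ort.InferenceSession('resnet50.onnx', providers=['CPUExecutionProvider'])
outputs = session.run(None, {'input': np.random.rand(1, 3, 224, 224).astype(np.float32)})
print(outputs[0].shape)  # (1, 1000)
```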
**Description**
Add data diagnosis module.
**Major Revision**
- Add DataDiagnosis class to support rule-based data diagnosis for the result summary jsonl file of multiple nodes
- Add RuleOp class to define rule operators (a toy rule operator is sketched below)
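A toy rule operator, to illustrate the idea only; the actual RuleOp interface and thresholds differ.
```
# Illustrative only: flag a metric that degrades beyond a relative threshold from its baseline.
def variance_rule(value, baseline, threshold=-0.05):
    """Return True (rule violated) when (value - baseline) / baseline drops below threshold."""
    if baseline == 0:
        return False
    return (value - baseline) / baseline < threshold


# Throughput dropped ~12% from its baseline, so the rule fires.
print(variance_rule(400.0, 454.0))  # True
```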
**Description**
Add a tcp connectivity validation microbenchmark, which validates TCP connectivity between the current node and other nodes in the hostfile.
**Major Revision**
- Add the tcp connectivity validation microbenchmark with a related test and example (a minimal connectivity check is sketched below)
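A minimal sketch of the underlying connectivity check; the host names and port are hypothetical and this is not the benchmark's actual code.
```
# Illustrative only: probe TCP reachability of each host in a list.
import socket


def check_tcp(host, port=22, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


for host in ['node-0', 'node-1']:
    print(host, 'reachable' if check_tcp(host) else 'unreachable')
```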
**Description**
Add script to generate system config info.
**Major Revision**
- Add a script under superbench/tools to collect system config info into a dict (sketched below).
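A minimal sketch of collecting such info into a dict; the keys below are illustrative and the real script gathers far more (e.g. GPU, NIC, and memory details).
```
# Illustrative only: gather a few system facts into a dict.
import json
import platform
import subprocess


def get_system_config():
    info = {
        'hostname': platform.node(),
        'os': platform.platform(),
        'kernel': platform.release(),
        'cpu_arch': platform.machine(),
    }
    try:
        info['lscpu'] = subprocess.check_output(['lscpu'], text=True)
    except (OSError, subprocess.CalledProcessError):
        info['lscpu'] = None
    return info


print(json.dumps(get_system_config(), indent=2))
```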
**Description**
Generate the summarized output files from all nodes. For each metric, perform the reduce operation according to the `reduce_op` (a reduce sketch follows the example output below).
**Major Revision**
- Generate the summarized json file per node:
  - For microbenchmarks, the format is `{benchmark_name}/[{run_count}/]{metric_name}[:rank]`
  - For model benchmarks, the format is `{benchmark_name}/{sub_benchmark_name}/[{run_count}/]{metric_name}`
  - `[]` means the part is optional.
```
{
"kernel-launch/overhead_event:0": 0.00583,
"kernel-launch/overhead_event:1": 0.00545,
"kernel-launch/overhead_event:2": 0.00581,
"kernel-launch/overhead_event:3": 0.00572,
"kernel-launch/overhead_event:4": 0.00559,
"kernel-launch/overhead_event:5": 0.00591,
"kernel-launch/overhead_event:6": 0.00562,
"kernel-launch/overhead_event:7": 0.00586,
"resnet_models/pytorch-resnet50/steptime-train-float32": 544.0827468410134,
"resnet_models/pytorch-resnet50/throughput-train-float32": 353.7607016465773,
"resnet_models/pytorch-resnet50/steptime-train-float16": 425.40482617914677,
"resnet_models/pytorch-resnet50/throughput-train-float16": 454.0142363793973,
"pytorch-sharding-matmul/0/allreduce": 10.561786651611328,
"pytorch-sharding-matmul/1/allreduce": 10.561786651611328,
"pytorch-sharding-matmul/0/allgather": 10.088025093078613,
"pytorch-sharding-matmul/1/allgather": 10.088025093078613
}
```
- Generate the summarized jsonl file for all nodes; each line is the result from one node in json format.
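A minimal sketch of the per-metric reduce step; the op names are illustrative and this is not the actual summary code.
```
# Illustrative only: collapse per-rank values of one metric with the configured reduce_op.
from statistics import mean

REDUCE_OPS = {'min': min, 'max': max, 'sum': sum, 'avg': mean}


def reduce_metric(values, reduce_op='max'):
    return REDUCE_OPS[reduce_op](values)


per_rank = [0.00583, 0.00545, 0.00581, 0.00572]
print(reduce_metric(per_rank, 'max'))  # 0.00583
```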
* Add `sb deploy` command content.
* Fix inline if-expression syntax in playbook.
* Fix quote escape issue in bash command.
* Add custom env in config.
* Update default config for multi GPU benchmarks.
* Update MANIFEST.in to include jinja2 template.
* Require jinja2 minimum version.
* Fix occasional duplicate output in Ansible runner.
* Fix mixed color from Ansible and Python colorlog.
* Update according to comments.
* Change superbench.env from list to dict in config file.
* Add random dataset.
* Install pytorch-cpu for test docker.
* Fix typo.
* Add more test cases.
* Address comments.
Co-authored-by: Guoshuai Zhao <guzhao@microsoft.com>