**Description**
Fix bugs in data diagnosis.
**Major Revision**
- Add support for getting the baseline of a metric that uses custom benchmark naming with ':', such as 'nccl-bw:default/allreduce_8_bw:0' (see the sketch after this list).
- Save the raw data of all metrics, rather than only the metrics defined in diagnosis_rules.yaml, when output_all is True.
- Fix a bug of using the wrong column index when applying formatting (red color and percentile) in the Excel output.
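A minimal sketch of the intended lookup, assuming a flat baseline dict keyed by 'benchmark/metric'; the helper name and baseline layout are hypothetical, not the actual implementation:

```python
def get_metric_baseline(metric, baseline):
    """Find the baseline for a metric whose benchmark part may contain ':'.

    Metric names look like '<benchmark>/<metric>:<rank>', and the benchmark
    part may itself use custom naming with ':', e.g.
    'nccl-bw:default/allreduce_8_bw:0'.
    """
    benchmark, _, rest = metric.partition('/')
    # Strip the rank suffix from the metric part only, so that a ':' inside
    # the benchmark name (the custom annotation) is left untouched.
    metric_name, sep, _rank = rest.rpartition(':')
    if not sep:
        metric_name = rest
    return baseline.get('{}/{}'.format(benchmark, metric_name))


baseline = {'nccl-bw:default/allreduce_8_bw': 180.0}
print(get_metric_baseline('nccl-bw:default/allreduce_8_bw:0', baseline))  # 180.0
```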
**Description**
Cherry-pick bug fixes from v0.5.0 to main.
**Major Revisions**
* Bug - Force to fix ort version as '1.10.0' (#343)
* Bug - Support no matching rules and unify the output name in result_summary (#345)
* Analyzer - Support regex in annotations of benchmark naming for metrics in rules (#344)
* Bug - Fix bugs in sync results on root rank for e2e model benchmarks (#342)
* Bug - Fix bug of duration feature for model benchmarks in distributed mode (#347)
* Docs - Upgrade version and release note (#348)
Co-authored-by: Yuting Jiang <v-yutjiang@microsoft.com>
**Description**
Integrate result summary and update the output format of data diagnosis.
**Major Revision**
- Integrate result summary.
- Add MD and HTML formats for data diagnosis.
**Description**
Use the config `log_raw_data` to control whether to log the raw data into a file. The default value is `no`. It can be set to `yes` for particular benchmarks, such as the NCCL/RCCL tests, to save their raw output into a file (see the example below).
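For example, a config entry along these lines could enable it for the NCCL test (the exact nesting under `parameters` is an assumption, not the verified schema):

```yaml
superbench:
  benchmarks:
    nccl-bw:
      enable: true
      parameters:
        log_raw_data: yes   # default is no; save raw NCCL output to file
```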
**Description**
Add result summary in Excel, MD, and HTML formats.
**Major Revision**
- Add a ResultSummary class to support result summaries in Excel, MD, and HTML formats.
- Abstract a RuleBase class for the functions shared by DataDiagnosis and ResultSummary (sketched after this list).
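A rough sketch of that class layout; the method names here are illustrative, not the exact API:

```python
class RuleBase:
    """Functions shared by DataDiagnosis and ResultSummary (illustrative)."""
    def __init__(self, rule_file):
        self._rules = self._parse_rules(rule_file)

    def _parse_rules(self, rule_file):
        """Read and validate rules from the YAML rule file."""
        ...

    def _get_metrics_by_benchmark(self, raw_data_df):
        """Group metric columns by the benchmark they belong to."""
        ...


class DataDiagnosis(RuleBase):
    def run_diagnosis_rules(self, raw_data_df):
        """Apply each rule and collect defective machines."""
        ...


class ResultSummary(RuleBase):
    def generate_summary(self, raw_data_df, output_format='excel'):
        """Aggregate metrics per rule and render excel/md/html output."""
        ...
```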
**Description**
Modifications adding GPU-Burn to SuperBench.
- Added the gpu-burn third-party submodule.
- Modified the Makefile to build the gpu-burn binary.
- Added/modified micro-benchmarks to add the gpu-burn Python scripts.
- Modified the default and azure_ndv4 configs to add gpu-burn.
**Description**
Add MD and HTML output formats for DataDiagnosis.
**Major Revision**
- Add MD and HTML support in file_handler (rendering sketched after this list).
- Add interfaces in DataDiagnosis for MD and HTML output.
**Minor Revision**
- Move the Excel and JSON output interfaces into DataDiagnosis.
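The rendering itself can delegate to pandas, roughly as below; a sketch, assuming the diagnosis results live in a DataFrame (note that `to_markdown` needs the optional `tabulate` package):

```python
import pandas as pd

def output_diagnosis_files(data_df: pd.DataFrame, output_dir: str) -> None:
    """Write the diagnosis DataFrame in markdown and HTML (illustrative)."""
    with open('{}/diagnosis_summary.md'.format(output_dir), 'w') as f:
        f.write(data_df.to_markdown())
    data_df.to_html('{}/diagnosis_summary.html'.format(output_dir))
```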
**Description**
The BatchNorm operator is not numerically stable in fp16. The PyTorch documentation recommends keeping the BN op in fp32 for fp16 AMP models; refer to https://pytorch.org/docs/stable/amp.html#ops-that-can-autocast-to-float32. Preserving BN in fp32 makes SuperBench reflect real workloads more accurately.
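One common way to realize this is to cast the model to fp16 but flip BatchNorm modules back to fp32; a minimal sketch, not necessarily the exact change made here:

```python
import torch

def half_with_fp32_batchnorm(model: torch.nn.Module) -> torch.nn.Module:
    """Cast a model to fp16 while keeping BatchNorm layers in fp32."""
    model = model.half()
    for module in model.modules():
        # BatchNorm statistics are numerically unstable in fp16, so keep
        # these layers in fp32 as the AMP documentation recommends.
        if isinstance(module, torch.nn.modules.batchnorm._BatchNorm):
            module.float()
    return model
```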
**Description**
Add a multi-rules feature for data diagnosis to support combined checks across multiple rules.
**Major Revision**
- Revise the rule design to support combining multiple rules in one check (example rule file after this list).
- Update the related code and tests.
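For illustration, a hypothetical diagnosis_rules.yaml fragment where a machine is labeled defective only when two stored checks both fire; the field names are assumptions, not the final schema:

```yaml
superbench:
  rules:
    rule0:
      function: variance
      criteria: 'lambda x: x < -0.05'
      store: true                   # record the result, do not label alone
      metrics:
        - kernel-launch/event_overhead
    rule1:
      function: variance
      criteria: 'lambda x: x < -0.05'
      store: true
      metrics:
        - kernel-launch/wall_overhead
    rule2:
      categories: KernelLaunch
      function: multi_rules
      # combined check: defective only if both rule0 and rule1 triggered
      criteria: 'lambda label: label["rule0"] and label["rule1"]'
```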
**Description**
Add support for `init_process_group` with pytorch>=1.9.0.
**Major Revision**
- Use PrefixStore (on top of TCPStore) to call `init_process_group` manually for each model run (sketched below).
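A condensed sketch of that approach using the torch.distributed store APIs (host/port plumbing simplified):

```python
import torch.distributed as dist

def init_process_group_manually(run_id, rank, world_size, addr, port):
    """Initialize a process group through an explicit store.

    A per-run PrefixStore over one shared TCPStore gives every model run
    an isolated key namespace, which works across pytorch>=1.9.0.
    """
    store = dist.TCPStore(addr, port, world_size, rank == 0)
    prefix_store = dist.PrefixStore(str(run_id), store)
    dist.init_process_group(
        backend='nccl', store=prefix_store, rank=rank, world_size=world_size
    )
```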
**Description**
This commit removes NUMA binding for device-to-device tests, because NUMA does not affect their performance, and revises the benchmark metrics accordingly.
**Description**
This commit does the following to reduce result variance in the gpu_copy benchmark (items 1 and 2 are sketched after the list):
1) Add a warmup phase to avoid timing instability caused by first-time CUDA kernel launch overhead;
2) Use CUDA events for timing instead of CPU timestamps;
3) Make data checking an option, which should stay disabled in performance tests;
4) Enlarge the message size in the performance benchmark.
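The warmup-plus-CUDA-event pattern, shown here in Python with torch for brevity (the actual benchmark is native CUDA code, so this is only an analogy):

```python
import torch

def time_d2d_copy(src, dst, num_warmup=20, num_iters=100):
    """Average device-to-device copy time in ms, using CUDA events."""
    for _ in range(num_warmup):
        dst.copy_(src)                     # absorb first-launch overhead
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(num_iters):
        dst.copy_(src)
    end.record()
    torch.cuda.synchronize()               # timing on device, not CPU clocks
    return start.elapsed_time(end) / num_iters
```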
**Description**
Sync the E2E training results among all workers and clean up metric naming for single-rank output.
**Major Revision**
- Sync (do an allreduce max on) the E2E training results among all workers (sketched below).
- Avoid the ':0' suffix in the metric name when only one rank has output.
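The sync itself reduces to a max all-reduce over the result tensor, roughly (tensor layout is assumed):

```python
import torch
import torch.distributed as dist

def sync_results_on_all_ranks(step_times):
    """Element-wise max of per-step results across all workers (sketch)."""
    tensor = torch.tensor(step_times, dtype=torch.float32, device='cuda')
    dist.all_reduce(tensor, op=dist.ReduceOp.MAX)
    return tensor.tolist()
```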
**Description**
Add timeout feature for each benchmark.
**Major Revision**
- Add a `timeout` config for each benchmark. In the current config files, only kernel-launch sets the timeout, as an example (see the snippet after this list); other benchmarks can be set in the future.
- Set the timeout config for `ansible_runner.run()`. The runner will get return code 254 on timeout:
  [ansible.py:80][WARNING] Run failed, return code 254.
- Use the `timeout` command to terminate the client process.
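For example, the kernel-launch entry in the config could carry the timeout like this (the nesting is an assumption based on the description above):

```yaml
superbench:
  benchmarks:
    kernel-launch:
      enable: true
      timeout: 120   # seconds; the runner aborts the benchmark after this
```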
**Description**
Update benchmark naming to support annotations (example config at the end).
**Major Revisions**
- Update the name handling for `create_benchmark_context` in the executor.
- Keep backward compatibility for model benchmarks using the "_models" suffix.
- Update the documentation.
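For example, a hypothetical config running the same benchmark twice under different annotations; the annotation then shows up in metric names such as 'nccl-bw:default/allreduce_8_bw:0':

```yaml
superbench:
  benchmarks:
    nccl-bw:default:     # annotation after ':' distinguishes the run
      enable: true
    nccl-bw:gdr-only:    # same benchmark, different annotation
      enable: true
```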