**Description**
Cherry-pick bug fixes from v0.8.0 to main.
**Major Revisions**
* Monitor - Fix the cgroup version checking logic (#502)
* Benchmark - Fix matrix size overflow issue in cuBLASLt GEMM (#503)
* Fix incorrect torch usage in communication wrapper for Distributed Inference Benchmark (#505)
* Analyzer - Fix bug in Python 3.8 due to pandas API change (#504)
* Bug - Fix metric retrieval from command output when an error happens (#506)
* Monitor - Collect real-time GPU power when benchmarking (#507)
* Add num_workers argument in model benchmark (#511)
* Remove unreachable condition when writing host list (#512)
* Update cuda11.8 image to cuda12.1 based on nvcr23.03 (#513)
* Doc - Fix wrong unit of cpu-memory-bw-latency in doc (#515)
* Docs - Upgrade version and release note (#508)
Co-authored-by: guoshzhao <guzhao@microsoft.com>
Co-authored-by: Ziyue Yang <ziyyang@microsoft.com>
Co-authored-by: Yuting Jiang <yutingjiang@microsoft.com>
**Description**
This PR adds a micro-benchmark of distributed model inference workloads.
**Major Revision**
- Add a new micro-benchmark dist-inference.
- Add corresponding example and unit tests.
- Update configuration files to include this new micro-benchmark.
- Update micro-benchmark README.
---------
Co-authored-by: Peng Cheng <chengpeng5555@outlook.com>
Fix potential barrier timeout in init_process_group caused by a race
condition when reusing the same port. Change to different ports when running
multiple models sequentially in one process.
For example, when running vgg11/13/16/19, ports 29501-29504 will be used
respectively.
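A minimal sketch of the idea, assuming a single-node TCP rendezvous; the helper name and `base_port` are illustrative, not the benchmark's actual code:

```python
import torch.distributed as dist

def init_for_model(model_index, rank, world_size, base_port=29500):
    """Hypothetical helper: pick a distinct rendezvous port per model so that
    sequential runs in one process never race on a port that the previous
    model's process group may not have released yet."""
    port = base_port + 1 + model_index  # vgg11 -> 29501, vgg13 -> 29502, ...
    dist.init_process_group(
        backend='nccl',
        init_method=f'tcp://127.0.0.1:{port}',
        rank=rank,
        world_size=world_size,
    )
```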
**Description**
Support error tolerance in the cuDNN function micro-benchmark
**Major Revision**
- Revise micro_base so that the remaining commands keep running when one
command fails in the micro-benchmark (see the sketch after this list)
- Enable error tolerance by default in the cuDNN functions benchmark
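A minimal sketch of the "continue on failure" behavior described above; the function name and `error_tolerance` flag are illustrative, not the framework's actual attributes:

```python
import subprocess

def run_commands(commands, error_tolerance=True):
    """Run each benchmark command; with error tolerance enabled, a failing
    command is recorded but does not stop the remaining commands."""
    failures = []
    for cmd in commands:
        result = subprocess.run(cmd, shell=True, capture_output=True, text=True)
        if result.returncode != 0:
            failures.append((cmd, result.returncode))
            if not error_tolerance:
                break
    return failures
```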
**Description**
Revise cublas-benchmark to support flexible warm-up and to fill input data
with a fixed number for performance tests, improving running efficiency.
**Major Revision**
- Remove num_in_steps from warm-up to give users a more flexible warm-up
setting
- Add support for generating input filled with a fixed number for performance
tests (see the sketch after this list)
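A minimal sketch of the idea only, not the benchmark's actual C++ code: filling inputs with a constant removes RNG cost from setup when only timing matters, while random data is kept for correctness runs. The function name and dtype are assumptions.

```python
import numpy as np

def make_input(shape, perf_mode=True, fill_value=1.0):
    """For performance runs, fill the input with a constant instead of
    generating random values; for correctness runs, keep random data so
    the results are meaningful."""
    if perf_mode:
        return np.full(shape, fill_value, dtype=np.float16)
    return np.random.randn(*shape).astype(np.float16)
```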
**Description**
The commit (e08b6d3a1c) installs an RCCL
version that causes an "undefined symbol: ncclGetLastError" error while
trying to import torch. Revert it to avoid the error.
**Description**
Cherry-pick bug fixes from v0.7.0 to main.
**Major Revisions**
* Benchmarks - Fix missing include in FP8 benchmark (#460)
* Fix bug in TE BERT model (#461)
* Doc - Update benchmark doc (#465)
* Bug - Fix incorrect data type detection in cublas-function source code (#464)
* Support `sb deploy` without pulling image (#466)
* Docs - Upgrade version and release note (#467)
Co-authored-by: Russell J. Hewett <russell.j.hewett@gmail.com>
Co-authored-by: Yuting Jiang <yutingjiang@microsoft.com>
**Major Revision**
- Add an option for pattern mode to generate an mpi_pattern.txt file when a
path is specified.
- In MPI pattern mode, serial_index and parallel_index are added to each
benchmark as environment variables (see the sketch below).
**Minor Revision**
- Fix typo
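A minimal sketch of how a benchmark could read these indices; the environment variable names here are hypothetical, since the PR only states that serial_index and parallel_index are exposed to each benchmark:

```python
import os

# Hypothetical variable names; only serial_index and parallel_index
# themselves are stated in the PR description.
serial_index = int(os.environ.get('SB_MODE_SERIAL_INDEX', 0))
parallel_index = int(os.environ.get('SB_MODE_PARALLEL_INDEX', 0))
print(f'running pattern slot: serial={serial_index}, parallel={parallel_index}')
```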
Support FP8 in PyTorch BERT models:
* add fp8 hybrid/e4m3/e5m2 in precision arguments
* build BERT encoders with `te.TransformerLayer` to replace
`transformers.BertModel`
* wrap forward steps with fp8 autocast
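A minimal sketch of these three points using NVIDIA Transformer Engine; the layer sizes and input shape are illustrative assumptions, not the benchmark's actual BERT configuration:

```python
import torch
import transformer_engine.pytorch as te
from transformer_engine.common.recipe import DelayedScaling, Format

# Illustrative sizes only; the real benchmark takes them from the BERT config.
hidden_size, ffn_hidden_size, num_heads, num_layers = 1024, 4096, 16, 24

# Build the encoder stack from te.TransformerLayer instead of transformers.BertModel.
encoders = torch.nn.ModuleList(
    te.TransformerLayer(hidden_size, ffn_hidden_size, num_heads)
    for _ in range(num_layers)
).cuda()

# "hybrid" recipe: E4M3 for the forward pass, E5M2 for the backward pass.
fp8_recipe = DelayedScaling(fp8_format=Format.HYBRID)

x = torch.randn(512, 16, hidden_size, device='cuda')  # (seq, batch, hidden)
with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
    for layer in encoders:
        x = layer(x)
```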
**Description**
Add correctness check in cublas-function benchmark.
**Major Revision**
- Add Python code for the correctness check in the cublas-function benchmark,
along with tests (a sketch follows below)
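A minimal sketch of what such a check could look like for a GEMM call, assuming host copies of the matrices; the function name and tolerances are assumptions, not the benchmark's actual code:

```python
import numpy as np

def check_gemm_correctness(a, b, result, rtol=1e-3, atol=1e-3):
    """Compare the benchmark's GEMM output against a numpy reference.

    `a`, `b`, and `result` are host copies of the input and output matrices;
    returns True when every element is within the given tolerances.
    """
    expected = a.astype(np.float32) @ b.astype(np.float32)
    return np.allclose(result.astype(np.float32), expected, rtol=rtol, atol=atol)
```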
**Description**
Add c source code of correctness check for cublas functions.
**Major Revision**
- Add correctness check for all supported cublas functions
- Add --correctness option to the binary
**Minor Revision**
- Fix a bug, and template fill_data and prepare_tensor to get correctly
memory-aligned output matrices for different data types
**Description**
Add stdout logging util module and enable real-time log flushing in executor
**Major Revision**
- Add a stdout logging util module to redirect stdout into the file log
- Enable stdout logging in the executor to write benchmark output to both
stdout and the file `sb-bench.log` (see the sketch below)
- Enable real-time log flushing in run_command of micro-benchmarks through
the `log_flushing` config
**Minor Revision**
- Add log_n_step args to enable regular step-time logging in model benchmarks
- Update related docs
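A minimal sketch of the redirection idea; the class name and behavior are illustrative, not SuperBench's actual logging util:

```python
import sys

class StdoutTee:
    """Everything written to stdout is also appended to a log file and
    flushed immediately for real-time output (illustrative only)."""

    def __init__(self, log_path):
        self._stdout = sys.stdout
        self._log = open(log_path, 'a')

    def write(self, message):
        self._stdout.write(message)
        self._log.write(message)
        self.flush()  # real-time flushing, analogous to the log_flushing config

    def flush(self):
        self._stdout.flush()
        self._log.flush()

sys.stdout = StdoutTee('sb-bench.log')
print('benchmark output goes to both the console and sb-bench.log')
```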
Add a non-zero return code for the `sb deploy` and `sb run` commands when
there are Ansible failures in the control plane.
The return code is set to the count of failures.
For failures caused by benchmarks, the return code is still set per benchmark
in the results JSON file.
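A minimal sketch of the exit behavior described above; the helper name is hypothetical, not the actual `sb` CLI code:

```python
import sys

def finalize_cli(ansible_failures):
    """Illustrative only: exit `sb deploy` / `sb run` with a return code equal
    to the number of Ansible failures; zero means the control plane succeeded.
    Benchmark-level failures are still reported per benchmark in the results
    JSON, not through this code path."""
    sys.exit(len(ansible_failures))
```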
Update the version to include the revision hash and date in the "{last
tag}+g{git hash}.d{date}" format; here are the examples:
* exact tag: 0.6.0
* commit after tag: 0.6.0+gcbb1b34
* commit after tag with local changes: 0.6.0+gcbb1b34.d20221028
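A minimal sketch that composes the format and reproduces the examples above; the helper name and arguments are illustrative, not the project's actual versioning code:

```python
from datetime import date

def build_version(last_tag, git_hash=None, dirty=False):
    """Compose "{last tag}+g{git hash}.d{date}" as described above."""
    version = last_tag
    if git_hash:
        version += f'+g{git_hash}'
    if dirty:
        version += f'.d{date.today():%Y%m%d}'
    return version

# build_version('0.6.0')                        -> '0.6.0'
# build_version('0.6.0', 'cbb1b34')             -> '0.6.0+gcbb1b34'
# build_version('0.6.0', 'cbb1b34', dirty=True) -> '0.6.0+gcbb1b34.d20221028' (on 2022-10-28)
```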