**Description**
Add auto algorithm selection support for cuDNN functions.
**Major Revision**
- Add auto algorithm selection support for cuDNN functions in the source code
- Add the `auto_algo` option in the benchmark, as shown in the sketch below
- Add related tests
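A minimal sketch of launching the cudnn-function micro-benchmark with the new option through the benchmark registry API; the exact `--auto_algo` flag form is an assumption based on this description, so check the benchmark's argument parser for the real name and default.

```python
# Hedged sketch, not the test added in this PR: launch the cudnn-function
# micro-benchmark with the new option. The '--auto_algo' flag form is an
# assumption based on this PR's description.
from superbench.benchmarks import BenchmarkRegistry, Platform

context = BenchmarkRegistry.create_benchmark_context(
    'cudnn-function', platform=Platform.CUDA, parameters='--auto_algo'
)
benchmark = BenchmarkRegistry.launch_benchmark(context)
if benchmark:
    print(benchmark.name, benchmark.return_code, benchmark.result)
```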
**Description**
Add a runner for system info to automatically collect it on multiple nodes, and update related docs.
**Major Revision**
- Add a runner for system info that checks the Docker status, runs `sb node info` inside Docker on all nodes, and fetches the results from all nodes
**Minor Revision**
- Update the CLI and system-info docs
- Update `sb node info` to save its output to `output-dir/sys-info.json` (see the sketch below)
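A hedged sketch of collecting and consuming the system info on one node; the `--output-dir` option name is an assumption here, so see the updated CLI doc for the exact flag. Per this PR, the JSON lands at `<output-dir>/sys-info.json`.

```python
# Hedged sketch: run 'sb node info' and load the generated sys-info.json.
# The '--output-dir' flag name is assumed; check the CLI doc for the exact option.
import json
import subprocess

output_dir = './outputs/node-info'
subprocess.run(['sb', 'node', 'info', '--output-dir', output_dir], check=True)

with open(f'{output_dir}/sys-info.json') as f:
    sys_info = json.load(f)
print(list(sys_info.keys()))  # top-level hardware/software sections
```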
**Description**
Add a Dockerfile for Win10 and a build script for directx_benchmarks.
**Major Revision**
- Add a Dockerfile for Win10 and the scripts required to install the dependencies
- Add a build script that builds all DirectX Visual Studio benchmarks
- Call the build script from the Makefile
---------
Co-authored-by: yukirora <yuting.jiang@microsoft.com>
Co-authored-by: Yifan Xiong <yifan.xiong@microsoft.com>
**Description**
Add support for DirectX GPU platform.
**Major Revision**
- Add the DirectX platform to the benchmark registry
- Add `gpu_vendor` identification for AMD and NVIDIA via the Windows driver (see the sketch below)
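A hypothetical sketch of vendor detection on Windows; the actual change keys off the installed GPU driver, so the WMI query and helper name below are illustrative only.

```python
# Hypothetical sketch: identify the GPU vendor from the Windows display
# adapter name via WMI. Not the exact mechanism used in this PR.
import subprocess


def get_gpu_vendor():
    """Return 'nvidia' or 'amd' based on the Windows display adapter name."""
    output = subprocess.run(
        ['wmic', 'path', 'win32_VideoController', 'get', 'name'],
        capture_output=True, text=True, check=True
    ).stdout.lower()
    if 'nvidia' in output:
        return 'nvidia'
    if 'amd' in output or 'radeon' in output:
        return 'amd'
    return 'unknown'


print(get_gpu_vendor())
```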
**Description**
Model benchmarks can now stop based on the `num_steps` or `duration` config option, each of which takes effect when set to a value greater than 0. If both are greater than 0, whichever condition is reached first stops the benchmark.
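An illustrative sketch of this stop condition, not the benchmark's actual training loop: a value of 0 or less disables that condition, and whichever enabled condition is hit first ends the run.

```python
# Illustrative sketch of the num_steps / duration stop logic described above.
import time


def run_model_steps(run_one_step, num_steps=0, duration=0):
    start = time.time()
    step = 0
    while True:
        run_one_step()
        step += 1
        if num_steps > 0 and step >= num_steps:
            break  # stopped by num_steps
        if duration > 0 and time.time() - start >= duration:
            break  # stopped by duration (seconds)


# Example: stop after 100 steps or 60 seconds, whichever comes first.
run_model_steps(lambda: None, num_steps=100, duration=60)
```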
**Description**
Cherry-pick bug fixes from v0.8.0 to main.
**Major Revisions**
* Monitor - Fix the cgroup version checking logic (#502)
* Benchmark - Fix matrix size overflow issue in cuBLASLt GEMM (#503)
* Fix wrong torch usage in communication wrapper for Distributed
Inference Benchmark (#505)
* Analyzer: Fix bug in Python 3.8 due to pandas API change (#504)
* Bug - Fix bug to get metric from cmd when error happens (#506)
* Monitor - Collect realtime GPU power when benchmarking (#507)
* Add num_workers argument in model benchmark (#511)
* Remove unreachable condition when writing the host list (#512)
* Update cuda11.8 image to cuda12.1 based on nvcr23.03 (#513)
* Doc - Fix wrong unit of cpu-memory-bw-latency in doc (#515)
* Docs - Upgrade version and release note (#508)
Co-authored-by: guoshzhao <guzhao@microsoft.com>
Co-authored-by: Ziyue Yang <ziyyang@microsoft.com>
Co-authored-by: Yuting Jiang <yutingjiang@microsoft.com>
**Description**
This PR adds a micro-benchmark of distributed model inference workloads.
**Major Revision**
- Add a new micro-benchmark, `dist-inference`.
- Add corresponding example and unit tests.
- Update configuration files to include this new micro-benchmark.
- Update micro-benchmark README.
---------
Co-authored-by: Peng Cheng <chengpeng5555@outlook.com>
Fix a potential barrier timeout in `init_process_group` caused by a race condition when reusing the same port. Switch to different ports when running multiple models sequentially in one process. For example, when running vgg11/13/16/19, ports 29501~29504 will be used respectively.
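An illustrative sketch of the fix: give each model in the same process its own master port so sequential `init_process_group` calls do not race on one port. A single-process setup is shown for brevity; the real benchmark runs distributed.

```python
# Illustrative sketch: assign a distinct master port per model before
# init_process_group, following the vgg11/13/16/19 example above.
import os
import torch.distributed as dist

models = ['vgg11', 'vgg13', 'vgg16', 'vgg19']
base_port = 29500

os.environ.setdefault('MASTER_ADDR', 'localhost')
os.environ.setdefault('RANK', '0')
os.environ.setdefault('WORLD_SIZE', '1')

for index, model in enumerate(models, start=1):
    os.environ['MASTER_PORT'] = str(base_port + index)  # 29501..29504
    dist.init_process_group(backend='gloo')
    # ... run the benchmark for `model` here ...
    dist.destroy_process_group()
```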
**Description**
Support error tolerance in the micro-benchmark for cuDNN functions
**Major Revision**
- Revise micro_base to keep running the remaining commands when one command fails in the micro-benchmark (see the sketch below)
- Set error tolerance to true in the cuDNN functions benchmark
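A simplified sketch of the error-tolerance behavior, not the actual micro_base code: with tolerance enabled, a failing command is recorded and the remaining commands still run; with it disabled, the first failure aborts the run.

```python
# Simplified sketch of error tolerance across a list of benchmark commands.
import subprocess


def run_commands(commands, error_tolerant=True):
    results = []
    for cmd in commands:
        proc = subprocess.run(cmd, shell=True, capture_output=True, text=True)
        results.append((cmd, proc.returncode, proc.stdout))
        if proc.returncode != 0 and not error_tolerant:
            break  # abort on the first failure when tolerance is off
    return results
```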
**Description**
Revise the cuBLAS benchmark to support flexible warmup and to fill input data with a fixed number in performance tests, improving running efficiency.
**Major Revision**
- Remove `num_in_steps` from warmup to give users a more flexible warmup setting
- Add support for generating input filled with a fixed number for the performance test (see the sketch below)
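An illustrative sketch (in Python/NumPy rather than the benchmark's CUDA/C++ source) of the idea: for a pure performance test the GEMM timing does not depend on the input values, so filling buffers with one fixed number avoids the per-buffer cost of random-number generation.

```python
# Sketch: fixed-value fill vs. random fill for performance-test inputs.
import numpy as np

shape = (4096, 4096)
a_random = np.random.rand(*shape).astype(np.float32)  # correctness test: random data
a_fixed = np.full(shape, 0.5, dtype=np.float32)       # perf test: fixed number, cheap to fill
```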
**Description**
Commit e08b6d3a1c installs an RCCL version that causes an "undefined symbol: ncclGetLastError" error when importing torch. Revert it to avoid the error.
**Description**
Cherry-pick bug fixes from v0.7.0 to main.
**Major Revisions**
* Benchmarks - Fix missing include in FP8 benchmark (#460)
* Fix bug in TE BERT model (#461)
* Doc - Update benchmark doc (#465)
* Bug: Fix bug for incorrect datatype judgement in cublas-function
source code (#464)
* Support `sb deploy` without pulling image (#466)
* Docs - Upgrade version and release note (#467)
Co-authored-by: Russell J. Hewett <russell.j.hewett@gmail.com>
Co-authored-by: Yuting Jiang <yutingjiang@microsoft.com>
**Major Revision**
- Add a pattern option to generate the mpi_pattern.txt file when a path is specified.
- In MPI pattern mode, `serial_index` and `parallel_index` are added to each benchmark as environment variables (see the sketch below).
**Minor Revision**
- Fix typo
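A hypothetical sketch of reading the indices inside a benchmark process; the exact environment variable names exported by the runner are assumptions, so check the runner code and the generated mpi_pattern.txt for the real names.

```python
# Hypothetical sketch: read the pattern indices from the environment.
# Variable names below are assumed, not confirmed by this PR.
import os

serial_index = int(os.environ.get('SB_MODE_SERIAL_INDEX', 0))
parallel_index = int(os.environ.get('SB_MODE_PARALLEL_INDEX', 0))
print(f'serial_index={serial_index}, parallel_index={parallel_index}')
```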
Support FP8 in PyTorch BERT models:
* Add fp8 hybrid/e4m3/e5m2 to the precision arguments
* Build BERT encoders with `te.TransformerLayer` to replace `transformers.BertModel`
* Wrap forward steps with fp8 autocast (see the sketch below)
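A minimal sketch of the approach using NVIDIA Transformer Engine: build an encoder layer with `te.TransformerLayer` and wrap the forward pass in `fp8_autocast`. The layer sizes are illustrative, and FP8 execution needs a GPU with FP8 support (e.g. Hopper).

```python
# Minimal sketch: one TE encoder layer under fp8 autocast, not the full BERT
# model used in the benchmark.
import torch
import transformer_engine.pytorch as te
from transformer_engine.common.recipe import DelayedScaling, Format

hidden_size, ffn_hidden_size, num_heads = 1024, 4096, 16
layer = te.TransformerLayer(hidden_size, ffn_hidden_size, num_heads).cuda()

# HYBRID keeps E4M3 for the forward pass and E5M2 for the backward pass;
# Format.E4M3 / Format.E5M2 correspond to the other precision choices above.
recipe = DelayedScaling(fp8_format=Format.HYBRID)

x = torch.randn(128, 8, hidden_size, device='cuda')  # (seq_len, batch, hidden)
with te.fp8_autocast(enabled=True, fp8_recipe=recipe):
    y = layer(x)
y.sum().backward()
```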