**Description**
Add memory bus bandwidth performance microbenchmark for AMD.
**Major Revision**
- Add memory bus bandwidth performance microbenchmark for AMD.
- Add related example and test file.
**Description**
Fix microbenchmark build bug for cuBLAS and cuDNN on AMD.
**Major Revision**
- Remove CUDA from `LANGUAGES` in `project()`.
- Check for CUDAToolkit quietly and build only when it is found.
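The two steps above might look like the following CMake sketch; it uses the standard `FindCUDAToolkit` module (CMake 3.17+), and the subdirectory names are illustrative assumptions, not the project's actual layout.

```cmake
# Do not declare CUDA in project(), so configuration succeeds on AMD hosts.
cmake_minimum_required(VERSION 3.18)
project(micro_benchmarks LANGUAGES CXX)

# Probe for the CUDA toolkit without failing or printing noise.
find_package(CUDAToolkit QUIET)
if(CUDAToolkit_FOUND)
    # Only enable CUDA and build the CUDA-dependent benchmarks when found.
    enable_language(CUDA)
    add_subdirectory(cublas_function)   # hypothetical directory names
    add_subdirectory(cudnn_function)
endif()
```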
**Description**
Extract a base class for memory bandwidth microbenchmarks.
**Major Revision**
- Revise and optimize `cuda_memory_bandwidth_performance`.
- Extract a base class for memory bandwidth microbenchmarks.
- Add a test for the base class.
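A minimal sketch of what such an extraction could look like, assuming the base class holds the shared parsing logic while vendor subclasses only supply the command; all class and method names here are illustrative, not the project's actual API.

```python
class MemBwBenchmark:
    """Hypothetical base class shared by CUDA and ROCm memory bandwidth benchmarks."""

    def __init__(self, name, parameters=''):
        self._name = name
        self._parameters = parameters
        # Common copy directions both vendors report.
        self._mem_types = ['htod', 'dtoh', 'dtod']

    def _parse_bandwidth(self, raw_output, mem_type):
        """Placeholder parser: pull a bandwidth number from tool output lines."""
        for line in raw_output.splitlines():
            if mem_type in line:
                return float(line.split()[-1])
        return None


class CudaMemBwBenchmark(MemBwBenchmark):
    """Vendor-specific subclass only needs to provide the command to run."""

    def _get_command(self):
        return 'bandwidthTest --csv'
```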
**Description**
Generate the summarized output files from all nodes. For each metric, perform the reduce operation according to its `reduce_op`.
**Major Revision**
- Generate the summarized JSON file per node:
  For microbenchmarks, the format is `{benchmark_name}/[{run_count}/]{metric_name}[:rank]`.
  For model benchmarks, the format is `{benchmark_name}/{sub_benchmark_name}/[{run_count}/]{metric_name}`.
  `[]` means the part is optional.
```json
{
  "kernel-launch/overhead_event:0": 0.00583,
  "kernel-launch/overhead_event:1": 0.00545,
  "kernel-launch/overhead_event:2": 0.00581,
  "kernel-launch/overhead_event:3": 0.00572,
  "kernel-launch/overhead_event:4": 0.00559,
  "kernel-launch/overhead_event:5": 0.00591,
  "kernel-launch/overhead_event:6": 0.00562,
  "kernel-launch/overhead_event:7": 0.00586,
  "resnet_models/pytorch-resnet50/steptime-train-float32": 544.0827468410134,
  "resnet_models/pytorch-resnet50/throughput-train-float32": 353.7607016465773,
  "resnet_models/pytorch-resnet50/steptime-train-float16": 425.40482617914677,
  "resnet_models/pytorch-resnet50/throughput-train-float16": 454.0142363793973,
  "pytorch-sharding-matmul/0/allreduce": 10.561786651611328,
  "pytorch-sharding-matmul/1/allreduce": 10.561786651611328,
  "pytorch-sharding-matmul/0/allgather": 10.088025093078613,
  "pytorch-sharding-matmul/1/allgather": 10.088025093078613
}
```
- Generate the summarized JSONL file for all nodes; each line is the result from one node in JSON format.
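The per-metric reduction described above can be sketched as follows; the function names and the `reduce_op` mapping are assumptions for illustration, not the project's actual implementation.

```python
import statistics

# Map a reduce_op name to the function applied across ranks/nodes.
REDUCE_OPS = {
    'min': min,
    'max': max,
    'sum': sum,
    'avg': statistics.mean,
}


def summarize(node_results, reduce_op='avg'):
    """Merge metric dicts (one per rank or node) and reduce each metric.

    node_results: list of {metric_name: value} dicts like the JSON above.
    """
    merged = {}
    for result in node_results:
        for metric, value in result.items():
            merged.setdefault(metric, []).append(value)
    return {m: REDUCE_OPS[reduce_op](v) for m, v in merged.items()}
```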
__Description__
Add config and docs for the development experience.
__Major Revision__
- Add settings and extensions config for VSCode.
- Add devcontainer config for Codespaces.
- Update documentation accordingly.
**Description**
Add reduce function support for output summary.
**Major Revision**
- Add a reducer class to maintain all reduce functions.
- Save the reduce type of each metric into `BenchmarkResult`.
- Fix unit tests.
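A minimal sketch of a reducer class that maintains the reduce functions, as described above; `ReduceType` and `Reducer` are assumed names for illustration, not necessarily the project's actual identifiers.

```python
from enum import Enum
from statistics import mean


class ReduceType(Enum):
    """Reduce operations a metric can declare."""
    MIN = 'min'
    MAX = 'max'
    AVG = 'avg'
    SUM = 'sum'


class Reducer:
    """Central registry mapping each reduce type to its function."""

    _funcs = {
        ReduceType.MIN: min,
        ReduceType.MAX: max,
        ReduceType.AVG: mean,
        ReduceType.SUM: sum,
    }

    @classmethod
    def get_function(cls, reduce_type):
        # Look up the callable for a given ReduceType.
        return cls._funcs[reduce_type]
```

A benchmark result could then store only the lightweight `ReduceType` value per metric, with the actual callable resolved at summary time.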
**Description**
Add rocBLAS building logic in third_party.
**Major Revision**
- Add rocm_rocblas target in third_party/Makefile.
- Add rocBLAS building logic.
**Description**
Support ROCm in third_party/Makefile and add rccl-tests as a submodule with building logic.
**Major Revision**
- Support ROCm in third_party/Makefile.
- Add rccl-tests as a submodule.
- Add build logic for rccl-tests in third_party/Makefile.
__Description__
Cherry-pick bug fixes from v0.2.1 to main.
__Major Revisions__
* Fix bug where VGG models failed on A100 GPU with `batch_size=128`.
* Fix Ansible connection issue when running on localhost.
* Update version in packages and docs.
**Description**
Add the source code of the ROCm kernel launch overhead benchmark.
**Major Revision**
- Revise CMake build logic to support both CUDA and ROCm.
**Description**
Add disk performance microbenchmark.
**Major Revision**
- Add microbenchmark, example, test, config for disk performance.
**Minor Revision**
- Fix bugs in executor unit test related to default enabled tests.
**Description**
Add FIO benchmark tool into third-party dependency.
**Major Revision**
- Add FIO submodule into third-party directory and modify Makefile to enable it.
**Description**
Add microbenchmark, example, test, and config for CUDA memory performance; add cuda-samples (tagged with the corresponding CUDA version) as a git submodule and update the related Makefile.
**Description**
Support both NVIDIA and AMD GPUs and check GPU vendor during deployment and execution.
**Major Revision**
* Add GPU environment check in sb deploy.
* Check GPU vendor in executor.
**Description**
Support `--host-list` for deploy and run commands.
Before this change, an inventory file was required to use `sb deploy/run`; now `--host-list localhost` or `-l localhost` is sufficient for a quick try.
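For a quick try on a machine with SuperBench installed, the flags described above could be used like this (command names taken from the description; output not shown):

```
# Deploy to and run on the local machine without an inventory file.
sb deploy --host-list localhost
sb run -l localhost
```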