Commit Graph

53 Commits

Author SHA1 Message Date
pdr 479491279e
Dockerfile - Add support for arm64 build (#660)
Add support for arm64 build:

- Updated dockerfile for arm64 build
- extend cpu stream compilation for neoverse 
- handle onnxruntime-gpu installation
- third party builds filtering based on arch
- disable cuda decode perf build for non x86
2024-11-06 23:16:12 +00:00
Yuting Jiang 949f9cb406
Release - SuperBench v0.11.0 (#654)
**Description**
Cherry pick bug fixes from v0.11.0 to main

**Major Revision**
* #645 
* #648 
* #646 
* #647 
* #651 
* #652 
* #650

---------

Co-authored-by: hongtaozhang <hongtaozhang@microsoft.com>
Co-authored-by: Yifan Xiong <yifan.xiong@microsoft.com>
2024-10-10 09:59:47 +08:00
Yang Wang 9de841bc95
Use `types-setuptools` as `types-pkg_resources` is Yanked (#637)
* https://pypi.org/project/types-pkg-resources/
* Use types-setuptools instead
2024-08-08 22:30:37 +08:00
Yang Wang 9a3ce39d5a
Update omegaconf version to 2.3.0 (#631)
Update `omegaconf` version to
[2.3.0](https://pypi.org/project/omegaconf/2.3.0/) as omegaconf 2.0.6
has a non-standard dependency specifier PyYAML>=5.1.*. pip 24.1 will
enforce this behaviour change.
Discussion can be found at https://github.com/pypa/pip/issues/12063.
2024-07-23 14:46:28 -07:00
Yuting Jiang 1f5031bd74
Dockerfile - Upgrade to rocm5.7 dockerfile (#587)
**Description**
upgrade to rocm5.7 dockerfile.

---------

Co-authored-by: yukirora <yuting.jiang@microsoft.com>
2023-12-09 17:41:12 +00:00
Yuting Jiang dd5a6329ed
Benchmarks: Add benchmark: Megatron-LM/Megatron-Deepspeed GPT pretrain benchmark (#582)
**Description**
Megatron-LM/Megatron-Deepspeed GPT pretrain benchmark
2023-12-07 09:37:09 +08:00
guoshzhao 028819b388
Monitor - Add support for AMD GPU. (#580)
**Description**
Add AMD support in monitor.

**Major Revision**
- Add library pyrsmi to collect metrics.
- Currently can get device_utilization, device_power, device_used_memory
and device_total_memory.
2023-11-27 18:45:56 +08:00
Yifan Xiong 1ad1c21c38
Dockerfile - Upgrade Docker image to CUDA 12.2 (#577)
Upgrade Docker image to CUDA 12.2 for H100:
* upgrade base image to 23.10
* fix onnxruntime version in python3.10
* fix compilation errors
2023-11-22 13:48:18 +00:00
Yifan Xiong 7184bdd1ed
Benchmarks - Update result parsing in tensorrt inference (#541)
* Update result parsing for newer tensorrt versions
* Update arguments when loading torchvision models
2023-06-30 11:22:46 +08:00
Yifan Xiong f4dab9f7ba
Update error message in setup (#538)
Update error message in setup, require wheel for pip>=23.1.
2023-06-14 10:51:45 +08:00
Yifan Xiong a1cd3c9475
Runner - Add signal handler in runner (#530)
Add signal handler in runner to gracefully exit when receiving SIGINT
(<kbd>Ctrl</kbd>+<kbd>C</kbd>) or SIGTERM during benchmark execution.
2023-05-23 17:25:35 +08:00
Yifan Xiong 35f5390512
Pin setuptools version to v65.7.0 (#483)
Pin setuptools version to
[v65.7.0](https://setuptools.pypa.io/en/latest/history.html#v65-7-0) to
avoid breaking changes since v66.0.0.
2023-03-06 11:43:44 +00:00
Yifan Xiong 2cc4cd03e2
Limit ansible_runner version for Python3.6 (#485)
Limit ansible_runner version to less than 2.3.2 for Python3.6.
2023-03-06 18:54:45 +08:00
Yuting Jiang ec7f502c93
CI/CD - Upgrade networkx version to fix installation compatibility issue (#478)
**Description**
Upgrade networkx version to fix installation compatibility issue.
2023-02-17 05:36:21 +00:00
Yuting Jiang 1deb2eaa29
Downgrade transformers version to fix tensorrt (#441)
**Description**
Downgrade transformers version to fix tensorrt test failure.
2022-12-14 14:19:32 +08:00
Yang Wang e4eeda0afd
Runner - support 'pattern' in 'mpi' mode to run tasks in parallel (#430)
* add mpi-parallels mode

* update according to comments

* fix and update doc

* update

* merge into 'mpi' mode

* update according to comments

* fix testcases

* fix ansible

* regard pattern as field

* update

* fix flake8 version

* add flake8 range

* remove map-by from host config

* update comments
2022-11-29 12:30:10 +08:00
Yang Wang 57f7403c47
Update typing-extensions version to fix pipeline issue (#432)
2022-11-17 19:39:52 +08:00
Yifan Xiong d7bb8303fb
CLI - Update version to include revision hash and date (#427)
Update the version to include the revision hash and date, in the format
"{last tag}+g{git hash}.d{date}". Examples:
* exact tag: 0.6.0
* commit after tag: 0.6.0+gcbb1b34
* commit after tag with local changes: 0.6.0+gcbb1b34.d20221028
2022-10-31 10:44:41 +08:00
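The version format in the commit above can be illustrated with a small helper. This is a sketch under stated assumptions, not the project's actual versioning code: `format_version` and its parameters are hypothetical, and the behavior for local changes made directly on a tag is not stated in the commit message, so the sketch simply ignores the date in that case.

```python
def format_version(last_tag, git_hash=None, dirty_date=None):
    """Build '{last tag}+g{git hash}.d{date}', omitting parts that do not apply."""
    version = last_tag
    if git_hash:
        # Commit after the tag: append the abbreviated revision hash.
        version += f'+g{git_hash}'
        if dirty_date:
            # Local changes on top of that commit: append the date.
            version += f'.d{dirty_date}'
    # Assumption: a date without a hash is ignored (case not covered by the commit).
    return version


exact_tag = format_version('0.6.0')
after_tag = format_version('0.6.0', 'cbb1b34')
with_local_changes = format_version('0.6.0', 'cbb1b34', '20221028')
```

The three calls reproduce the three examples listed in the commit message.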
Yuting Jiang 3367c4f6cc
Benchmarks - Add support to allow list of custom config string in cudnn-functions and cublas-functions (#414)
**Description**
Add support to allow list of custom config string in cudnn-functions and cublas-functions.
2022-10-18 09:59:51 +08:00
Yifan Xiong 63e9b2d1bc
Release - SuperBench v0.6.0 (#409)
**Description**

Cherry-pick bug fixes from v0.6.0 to main.

**Major Revisions**

* Enable latency test in ib traffic validation distributed benchmark (#396)
* Enhance parameter parsing to allow spaces in value (#397)
* Update apt packages in dockerfile (#398)
* Upgrade colorlog for NO_COLOR support (#404)
* Analyzer - Update error handling to support exit code of sb result diagnosis (#403)
* Analyzer - Make baseline file optional in data diagnosis and fix bugs (#399)
* Enhance timeout cleanup to avoid possible hanging (#405)
* Auto generate ibstat file by pssh (#402)
* Analyzer - Format int type and unify empty value to N/A in diagnosis output file (#406)
* Docs - Upgrade version and release note (#407)
* Docs - Fix issues in document (#408)

Co-authored-by: Yang Wang <yangwang1@microsoft.com>
Co-authored-by: Yuting Jiang <yutingjiang@microsoft.com>
2022-09-06 18:06:05 +08:00
Yifan Xiong 626ac0a463
Update Python setup for required packages (#387)
__Description__

Update Python setup for required packages.

__Major Revisions__
* downgrade requests version to be compatible with python 3.6, add corresponding pipeline for 3.6
* add extra entry in extras_require for nested packages
* update `pip install` contents accordingly
2022-08-17 11:33:57 +08:00
Yang Wang faeee0a7cc
Auto generate ibstat file for topo aware traffic pattern (#381)
An enhancement for topo-aware IB performance validation #373.
This PR auto-generates the required ibstat file `ib_traffic_topo_aware_ibstat.txt`, which is used as input to build a graph.
2022-08-13 18:20:42 +08:00
Jie Zhang ef4d65745b
Support topo-aware IB performance validation (#373)
* Support topo-aware IB performance validation

Add a new pattern `topo-aware`, so the user can run IB performance
test based on VM's topology information. This way, the user can
validate the IB performance across VM pairs with different distance
as a quick test instead of pair-wise test.

To run with topo-aware pattern, user needs to specify three required
(and two optional) parameters in YAML config file:
--pattern	topo-aware
--ibstat	path to ibstat output
--ibnetdiscover	path to ibnetdiscover output
--min_dist	minimum distance of VM pairs (optional, default 2)
--max_dist	maximum distance of VM pairs (optional, default 6)

The newly added topo_aware module then parses the topology
information, builds a graph, and generates the VM pairs with
the specified distance (# hops).

The specified IB test then runs across these
generated VM pairs.

Signed-off-by: Jie Zhang <jessezhang1010@gmail.com>

* Add description about topology aware ib traffic tests

Signed-off-by: Jie Zhang <jessezhang1010@gmail.com>

* Add unit test to verify generated topology aware config file

This commit adds unit test to verify the generated topology aware
config file is correct. To do so, four new data files are added in
order to invoke gen_topo_aware_config function to generate topology
aware config file, then compares it with the expected config file.

Signed-off-by: Jie Zhang <jessezhang1010@gmail.com>

* Fix lint issue on Azure pipeline

Signed-off-by: Jie Zhang <jessezhang1010@gmail.com>
2022-07-26 16:56:19 -07:00
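The pair-generation idea in the commit above (build a graph, then select VM pairs by hop count) can be sketched with a breadth-first search. This is an illustration only: the graph is a hand-written toy rather than parsed `ibstat`/`ibnetdiscover` output, the function names are hypothetical, and the real `topo_aware` module may select pairs differently (this sketch enumerates every pair in range).

```python
from collections import deque


def hop_distances(graph, source):
    """Breadth-first search: number of hops from source to each reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in graph[node]:
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist


def pairs_by_distance(graph, vms, min_dist=2, max_dist=6):
    """Group VM pairs by hop count, keeping only min_dist <= hops <= max_dist."""
    result = {}
    ordered = sorted(vms)
    for i, a in enumerate(ordered):
        dist = hop_distances(graph, a)
        for b in ordered[i + 1:]:
            hops = dist.get(b)
            if hops is not None and min_dist <= hops <= max_dist:
                result.setdefault(hops, []).append((a, b))
    return result


# Toy topology (assumption): two leaf switches under one spine, two VMs per leaf.
toy_graph = {
    'spine': ['leaf0', 'leaf1'],
    'leaf0': ['spine', 'vm0', 'vm1'],
    'leaf1': ['spine', 'vm2', 'vm3'],
    'vm0': ['leaf0'], 'vm1': ['leaf0'],
    'vm2': ['leaf1'], 'vm3': ['leaf1'],
}
toy_pairs = pairs_by_distance(toy_graph, ['vm0', 'vm1', 'vm2', 'vm3'])
```

In the toy topology, VMs under the same leaf switch are 2 hops apart and VMs under different leaves are 4 hops apart, so the defaults of 2 and 6 keep both groups.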
Yifan Xiong 16b6385dee
Add dependencies (#374)
Add dependencies

* include ndv4-topo.xml in cuda docker images
* require requests version to avoid RequestsDependencyWarning
2022-07-13 08:42:53 +00:00
Yifan Xiong a94ead34b0
CLI - Support SKU auto detect if running on Azure VM (#365)
Support SKU auto-detection and use the corresponding benchmark config when running on an Azure VM.
2022-07-05 10:52:39 +08:00
Yifan Xiong a4937e95c6
Support `sb run` on host directly without Docker (#358)
**Description**

Support `sb run` on host directly without Docker

**Major Revisions**
- Add `--no-docker` argument for `sb run`.
- Run on host directly if `--no-docker` is specified.
- Update docs and tests correspondingly.
2022-06-14 10:57:01 +08:00
Yifan Xiong 6681c72043
Release - SuperBench v0.5.0 (#350)
**Description**

Cherry-pick bug fixes from v0.5.0 to main.

**Major Revisions**

* Bug - Force to fix ort version as '1.10.0' (#343)
* Bug - Support no matching rules and unify the output name in result_summary (#345)
* Analyzer - Support regex in annotations of benchmark naming for metrics in rules (#344)
* Bug - Fix bugs in sync results on root rank for e2e model benchmarks (#342)
* Bug - Fix bug of duration feature for model benchmarks in distributed mode (#347)
* Docs - Upgrade version and release note (#348)

Co-authored-by: Yuting Jiang <v-yutjiang@microsoft.com>
2022-04-29 16:22:55 +08:00
Yuting Jiang b3c95f1827
Analyzer - Add md and html output format for DataDiagnosis (#325)
**Description**
Add md and html output format for DataDiagnosis.

**Major Revision**
- add md and html support in file_handler
- add interface in DataDiagnosis for md and HTML output

**Minor Revision**
- move excel and json output interface into DataDiagnosis
2022-03-15 18:04:11 +08:00
guoshzhao fd2bc9e048
Benchmarks: Add Feature - Add percentile metrics for ort and pytorch inference benchmarks (#283)
**Description**
Add 50th, 90th, 95th, 99th, 99.9th latency metrics for ORT and pytorch inference benchmarks.
2022-01-19 10:49:56 +08:00
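Latency percentiles like those named in the commit above can be computed from raw samples as sketched below. The commit does not say which percentile definition the benchmarks use; this sketch assumes the nearest-rank method, and the function names and the `p50`-style keys are hypothetical.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: smallest sample with at least p% of samples <= it."""
    if not samples:
        raise ValueError('samples must not be empty')
    ordered = sorted(samples)
    # Multiply before dividing to limit floating-point error for p such as 99.9.
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def latency_percentiles(latencies_ms, points=(50, 90, 95, 99, 99.9)):
    """Return a dict such as {'p50': ..., 'p99.9': ...} for the given samples."""
    return {f'p{p}': percentile(latencies_ms, p) for p in points}


example = latency_percentiles([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
```

With only ten samples the upper percentiles all collapse onto the maximum, which is why tail percentiles such as 99.9 are only meaningful with a large number of measurements.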
Yifan Xiong ff563b66af
Release - SuperBench v0.4.0 (#278)
__Description__

Cherry-pick bug fixes from v0.4.0 to main.

__Major Revisions__

* Bug - Fix issues for Ansible and benchmarks (#267)
* Tests - Refine test cases for microbenchmark (#268)
* Bug - Build openmpi with ucx support in rocm dockerfiles (#269)
* Benchmarks: Fix Bug - Fix fio build issue (#272)
* Docs - Unify metric and add doc for cublas and cudnn functions (#271)
* Monitor: Revision - Add 'monitor/' prefix to monitor metrics in result summary (#274)
* Bug - Fix bug of detecting if gpu_index is none (#275)
* Bug - Fix bugs in data diagnosis (#273)
* Bug - Fix issue that the root mpi rank may not be the first in the hostfile (#270)
* Benchmarks: Configuration - Update inference and network benchmarks in configs (#276)
* Docs - Upgrade version and release note (#277)

Co-authored-by: Yuting Jiang <v-yutjiang@microsoft.com>
2021-12-30 16:24:00 +08:00
guoshzhao 4d85630abb
Benchmarks: Add Benchmark - Add ONNXRuntime inference benchmark based on ORT python API (#245)
**Description**
Add ONNXRuntime inference benchmark based on ORT python API.

**Major Revision**
- Add `ORTInferenceBenchmark` class to export pytorch model to onnx model and do inference
- Add tests and example for `ort-inference` benchmark
- Update the introduction docs.
2021-12-10 13:53:11 +00:00
Yuting Jiang c2f942cb6f
Analyzer: Add Feature - Add basic analysis features (#248)
**Description**
Add basic analysis features.

**Major Revision**
- Add statistics, correlations of the raw data
- Add numeric outlier detection (inter_quartile_range)
- Add boxplot for selected metric
2021-12-10 11:01:59 +00:00
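The interquartile-range outlier detection mentioned in the commit above can be sketched with the standard library. This is an illustration, not the analyzer's code: `iqr_outliers` is a hypothetical name, the conventional multiplier of 1.5 is an assumption, and `statistics.quantiles` uses its default "exclusive" method here.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _median, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


outliers = iqr_outliers([10, 11, 12, 13, 14, 15, 16, 17, 100])
```

For this sample the quartiles are 11.5 and 16.5, so the accepted range is 4 to 24 and only the value 100 is flagged.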
Yuting Jiang c13ed2a297
Analyzer: Initialization - Add baseline-based data diagnosis module (#242)
**Description**
Add data diagnosis module.

**Major Revision**
- Add DataDiagnosis class to support rule-based data diagnosis for the result summary jsonl file of multiple nodes
- Add RuleOp class to define rule operators
2021-12-08 18:22:00 +08:00
Yuting Jiang 49cc8f9a8c
Benchmarks: Add Benchmark - Add tcp connectivity validation microbenchmark (#217)
**Description**
Add tcp connectivity validation microbenchmark, which validates TCP connectivity between the current node and several nodes in the hostfile.

**Major Revision**
- Add tcp connectivity validation microbenchmark and related test, example
2021-10-12 23:42:12 +00:00
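A TCP connectivity check of the kind described in the commit above can be sketched with the standard `socket` module. How the microbenchmark actually probes hosts is not stated in the commit message; `can_connect`, `check_hosts`, the port, and the timeout are all assumptions made for illustration.

```python
import socket


def can_connect(host, port, timeout=1.0):
    """Return True if a TCP connection to (host, port) succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


def check_hosts(hosts, port, timeout=1.0):
    """Map each host (for example, one per hostfile line) to its reachability."""
    return {host: can_connect(host, port, timeout) for host in hosts}
```

A call such as `check_hosts(['node0', 'node1'], 22)` would report which of the hypothetical hosts accept connections on port 22.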
Yuting Jiang 37b15db92c
Tools: Add Feature - Add script to generate system config info. (#160)
**Description**
Add script to generate system config info.

**Major Revision**
- Add script to generate system config info into the dict in superbench/tools.
2021-09-06 17:48:36 +08:00
guoshzhao c8357f4e7a
Setup: Revision - Revise torch extra_require (#177)
**Description**
Change the minimum version requirements for superbench:
```
'torch>=1.7.0a0',
'torchvision>=0.8.0a0',
```
2021-08-31 16:14:45 +08:00
guoshzhao 7595d79434
Runner: Add Feature - Generate summarized output files. (#157)
**Description**
Generate the summarized output files from all nodes. For each metric, apply the reduce operation specified by `reduce_op`.

**Major Revision**
- Generate the summarized json file per node:
For microbenchmark, the format is `{benchmark_name}/[{run_count}/]{metric_name}[:rank]`
For modelbenchmark, the format is `{benchmark_name}/{sub_benchmark_name}/[{run_count}/]{metric_name}`
`[]` means optional.
```
{
  "kernel-launch/overhead_event:0": 0.00583,
  "kernel-launch/overhead_event:1": 0.00545,
  "kernel-launch/overhead_event:2": 0.00581,
  "kernel-launch/overhead_event:3": 0.00572,
  "kernel-launch/overhead_event:4": 0.00559,
  "kernel-launch/overhead_event:5": 0.00591,
  "kernel-launch/overhead_event:6": 0.00562,
  "kernel-launch/overhead_event:7": 0.00586,
  "resnet_models/pytorch-resnet50/steptime-train-float32": 544.0827468410134,
  "resnet_models/pytorch-resnet50/throughput-train-float32": 353.7607016465773,
  "resnet_models/pytorch-resnet50/steptime-train-float16": 425.40482617914677,
  "resnet_models/pytorch-resnet50/throughput-train-float16": 454.0142363793973,
  "pytorch-sharding-matmul/0/allreduce": 10.561786651611328,
  "pytorch-sharding-matmul/1/allreduce": 10.561786651611328,
  "pytorch-sharding-matmul/0/allgather": 10.088025093078613,
  "pytorch-sharding-matmul/1/allgather": 10.088025093078613
}
```
- Generate the summarized jsonl file for all nodes; each line is the result from one node in json format.
2021-08-20 16:48:40 +08:00
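The two metric-name formats given in the commit above can be illustrated with small key builders. This is a sketch of the naming scheme only, with hypothetical function names; it does not reproduce the runner's summarizing or `reduce_op` logic.

```python
def micro_metric_key(benchmark_name, metric_name, run_count=None, rank=None):
    """Build '{benchmark_name}/[{run_count}/]{metric_name}[:rank]'."""
    key = benchmark_name
    if run_count is not None:
        key += f'/{run_count}'
    key += f'/{metric_name}'
    if rank is not None:
        key += f':{rank}'
    return key


def model_metric_key(benchmark_name, sub_benchmark_name, metric_name, run_count=None):
    """Build '{benchmark_name}/{sub_benchmark_name}/[{run_count}/]{metric_name}'."""
    key = f'{benchmark_name}/{sub_benchmark_name}'
    if run_count is not None:
        key += f'/{run_count}'
    return f'{key}/{metric_name}'


kernel_key = micro_metric_key('kernel-launch', 'overhead_event', rank=0)
matmul_key = micro_metric_key('pytorch-sharding-matmul', 'allreduce', run_count=0)
resnet_key = model_metric_key('resnet_models', 'pytorch-resnet50', 'steptime-train-float32')
```

The three example keys match entries in the JSON sample from the commit message.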
Yifan Xiong c0c43b8f81
Bug bash - Fix bugs in multi GPU benchmarks (#98)
* Add `sb deploy` command content.
* Fix inline if-expression syntax in playbook.
* Fix quote escape issue in bash command.
* Add custom env in config.
* Update default config for multi GPU benchmarks.
* Update MANIFEST.in to include jinja2 template.
* Require jinja2 minimum version.
* Fix occasional duplicate output in Ansible runner.
* Fix mixed color from Ansible and Python colorlog.
* Update according to comments.
* Change superbench.env from list to dict in config file.
2021-06-23 18:16:43 +08:00
Yifan Xiong ddbc51a135
Bug bash - Fix bugs and refine log in single GPU benchmarks (#97)
Fix bugs and refine log in single GPU benchmarks:

* Fix none framework issue
* Fix empty parameter bug
* Remove missed mobilenet_v3 models
* Change benchmark registration log to debug level
* Add pid in logging
* Add missing benchmarks in default config
* Fix deprecated logging warn
2021-06-16 13:51:22 +08:00
Yifan Xiong 6b0ca1cb05
Runner - Support local mode in runner (#88)
* Support local mode in runner.
2021-06-02 23:58:44 +08:00
guoshzhao 331c740a15
Benchmarks: Add Feature - Add nvml package to provide python interfaces of nvidia. (#91)
2021-06-01 23:31:07 +08:00
Yifan Xiong c05e173b3d
Runner - Implement ansible client and runner (#69)
Implement ansible client and runner:
* add ansible client
* add deploy and check_env playbooks
2021-05-23 23:53:37 +08:00
Yifan Xiong 977b1a7355
CLI - Refine CLI handlers (#68)
* use absolute path of input file
* parse registry uri from image
* merge common parts for arguments processing
2021-05-18 11:34:15 +08:00
Yifan Xiong 5711429403
CLI - Integration with Executor and Runner (#26)
* CLI integration with Executor and Runner
2021-04-12 17:38:17 +08:00
Yifan Xiong 67053d9a1f
Add CUDA dockerfile for superbench (#43)
* add cuda11.1.1 dockerfile
2021-04-12 14:17:10 +08:00
Yifan Xiong 0e2b2b0829
Update logger (#28)
Update logger class.
* add file handler along with stream handler
* add colored formatter
2021-03-29 14:06:55 +08:00
Yifan Xiong 91b44bc5a1
CLI: Code Revision - Use omegaconf to replace hydra for configuration (#27)
Use omegaconf to replace hydra for configuration system:
* remove hydra
* use omegaconf to merge configurations
2021-03-26 21:19:17 +08:00
Yifan Xiong 5d11579a10
CLI - Add command sb [version,deploy,exec,run] (#10)
- Add CLI commands
  * sb version
  * sb deploy
  * sb exec
  * sb run
- Add interface with executor and runner
- Add cli test cases
2021-03-12 13:16:43 +08:00
guoshzhao ebea2d5053
Benchmarks: Add Feature - Add random dataset for Pytorch. (#17)
* add random dataset.

* install pytorch-cpu for test docker.

* fix typo

* add more test cases.

* address comments.

Co-authored-by: Guoshuai Zhao <guzhao@microsoft.com>
2021-03-12 02:30:44 +08:00
Yifan Xiong d32b96eb98
Setup: Add Test - Add Codecov (#9)
Add code coverage configuration.
2021-02-04 10:43:43 +08:00