Add support for arm64 build:
- Update Dockerfile for arm64 build
- Extend CPU STREAM compilation for Neoverse
- Handle onnxruntime-gpu installation
- Filter third-party builds based on architecture
- Disable CUDA decode perf build for non-x86
**Description**
Add AMD support in monitor.
**Major Revision**
- Add the pyrsmi library to collect metrics (see the sketch below).
- Currently it can collect device_utilization, device_power, device_used_memory,
and device_total_memory.
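A minimal sketch of reading these metrics through pyrsmi's `rocml` module; the function names reflect my understanding of the pyrsmi API and the loop is illustrative, not the monitor's actual implementation.
```
# Illustrative only: not the monitor's code; pyrsmi function names are assumptions.
from pyrsmi import rocml

rocml.smi_initialize()
try:
    for dev in range(rocml.smi_get_device_count()):
        metrics = {
            'device_utilization': rocml.smi_get_device_utilization(dev),
            'device_power': rocml.smi_get_device_average_power(dev),
            'device_used_memory': rocml.smi_get_device_memory_used(dev),
            'device_total_memory': rocml.smi_get_device_memory_total(dev),
        }
        print(f'gpu {dev}: {metrics}')
finally:
    rocml.smi_shutdown()
```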
Update version to include revision hash and date in the "{last tag}+g{git
hash}.d{date}" format; here are some examples, with a derivation sketch after the list:
* exact tag: 0.6.0
* commit after tag: 0.6.0+gcbb1b34
* commit after tag with local changes: 0.6.0+gcbb1b34.d20221028
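A minimal sketch of how such a version string can be derived from `git describe`; this is an assumption about the mechanism (the project may use a setuptools_scm-style tool instead), not the actual implementation.
```
# Sketch only: derives "{last tag}+g{hash}.d{date}" from `git describe` output.
import subprocess
from datetime import date


def get_version():
    # e.g. '0.6.0-0-gcbb1b34' (exact tag) or '0.6.0-2-gcbb1b34-dirty'
    describe = subprocess.check_output(
        ['git', 'describe', '--tags', '--long', '--dirty'], text=True
    ).strip()
    dirty = describe.endswith('-dirty')
    if dirty:
        describe = describe[:-len('-dirty')]
    tag, distance, ghash = describe.rsplit('-', 2)
    if int(distance) == 0 and not dirty:
        return tag                                 # 0.6.0
    version = f'{tag}+{ghash}'                     # 0.6.0+gcbb1b34
    if dirty:
        version += f'.d{date.today():%Y%m%d}'      # 0.6.0+gcbb1b34.d20221028
    return version


print(get_version())
```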
**Description**
Cherry-pick bug fixes from v0.6.0 to main.
**Major Revisions**
* Enable latency test in ib traffic validation distributed benchmark (#396)
* Enhance parameter parsing to allow spaces in value (#397)
* Update apt packages in dockerfile (#398)
* Upgrade colorlog for NO_COLOR support (#404)
* Analyzer - Update error handling to support exit code of sb result diagnosis (#403)
* Analyzer - Make baseline file optional in data diagnosis and fix bugs (#399)
* Enhance timeout cleanup to avoid possible hanging (#405)
* Auto generate ibstat file by pssh (#402)
* Analyzer - Format int type and unify empty value to N/A in diagnosis output file (#406)
* Docs - Upgrade version and release note (#407)
* Docs - Fix issues in document (#408)
Co-authored-by: Yang Wang <yangwang1@microsoft.com>
Co-authored-by: Yuting Jiang <yutingjiang@microsoft.com>
__Description__
Update Python setup for required packages.
__Major Revisions__
* Downgrade requests version to be compatible with Python 3.6 and add a corresponding pipeline for 3.6
* Add an extra entry in extras_require for nested packages (see the sketch after this list)
* Update `pip install` contents accordingly
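A minimal sketch of the nested-extras idea; the extra names and version pins below are hypothetical, not the project's actual dependency list, and the aggregate entry is just one way to group nested packages under a single extra.
```
# Illustrative setup.py fragment; extra names and pins are hypothetical.
from setuptools import setup

extras = {
    'torch': ['torch>=1.7'],
    'ort': ['onnxruntime-gpu'],
}
# Nested entry that aggregates the packages of the other extras.
extras['all'] = sorted({pkg for deps in extras.values() for pkg in deps})

setup(
    name='example-package',
    version='0.0.1',
    install_requires=['requests<2.28'],  # hypothetical pin kept compatible with Python 3.6
    extras_require=extras,
)
```
With such a layout, `pip install 'example-package[all]'` pulls in everything the individual extras would install.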
An enhancement for topo-aware IB performance validation (#373).
This PR auto-generates the required ibstat file `ib_traffic_topo_aware_ibstat.txt`, which is used as input to build a graph.
* Support topo-aware IB performance validation
Add a new pattern `topo-aware`, so the user can run IB performance
tests based on the VMs' topology information. This way, the user can
validate IB performance across VM pairs at different distances
as a quick test instead of a full pair-wise test.
To run with the topo-aware pattern, the user needs to specify three required
(and two optional) parameters in the YAML config file:
- `--pattern` topo-aware
- `--ibstat` path to ibstat output
- `--ibnetdiscover` path to ibnetdiscover output
- `--min_dist` minimum distance of VM pairs (optional, default 2)
- `--max_dist` maximum distance of VM pairs (optional, default 6)
The newly added topo_aware module then parses the topology
information, builds a graph, and generates the VM pairs at the
specified distance (number of hops), as sketched below.
The specified IB test then runs across these
generated VM pairs.
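A minimal sketch of the pair-generation idea under the assumption that the topology is an undirected graph of VMs and switches; this illustrates selecting pairs by hop distance and is not the actual topo_aware module.
```
# Sketch only: generate VM pairs whose topology distance is within [min_dist, max_dist].
from collections import deque


def bfs_distances(graph, src):
    """Hop distance from src to every reachable node."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        for nbr in graph[node]:
            if nbr not in dist:
                dist[nbr] = dist[node] + 1
                queue.append(nbr)
    return dist


def gen_pairs(graph, vms, min_dist=2, max_dist=6):
    for i, src in enumerate(vms):
        dist = bfs_distances(graph, src)
        for dst in vms[i + 1:]:
            if min_dist <= dist.get(dst, float('inf')) <= max_dist:
                yield src, dst, dist[dst]


# Tiny example: vm1/vm2 share a leaf switch (2 hops), vm3 sits behind a spine (4 hops).
topology = {
    'vm1': ['leaf1'], 'vm2': ['leaf1'], 'vm3': ['leaf2'],
    'leaf1': ['vm1', 'vm2', 'spine'], 'leaf2': ['vm3', 'spine'],
    'spine': ['leaf1', 'leaf2'],
}
print(list(gen_pairs(topology, ['vm1', 'vm2', 'vm3'])))
```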
Signed-off-by: Jie Zhang <jessezhang1010@gmail.com>
* Add description about topology aware ib traffic tests
Signed-off-by: Jie Zhang <jessezhang1010@gmail.com>
* Add unit test to verify generated topology aware config file
This commit adds a unit test to verify that the generated topology-aware
config file is correct. To do so, four new data files are added in
order to invoke the gen_topo_aware_config function, generate the topology-aware
config file, and compare it with the expected config file.
Signed-off-by: Jie Zhang <jessezhang1010@gmail.com>
* Fix lint issue on Azure pipeline
Signed-off-by: Jie Zhang <jessezhang1010@gmail.com>
**Description**
Support `sb run` on host directly without Docker
**Major Revisions**
- Add `--no-docker` argument for `sb run`.
- Run on the host directly if `--no-docker` is specified.
- Update docs and tests correspondingly.
**Description**
Cherry-pick bug fixes from v0.5.0 to main.
**Major Revisions**
* Bug - Force to fix ort version as '1.10.0' (#343)
* Bug - Support no matching rules and unify the output name in result_summary (#345)
* Analyzer - Support regex in annotations of benchmark naming for metrics in rules (#344)
* Bug - Fix bugs in sync results on root rank for e2e model benchmarks (#342)
* Bug - Fix bug of duration feature for model benchmarks in distributed mode (#347)
* Docs - Upgrade version and release note (#348)
Co-authored-by: Yuting Jiang <v-yutjiang@microsoft.com>
**Description**
Add md and html output formats for DataDiagnosis.
**Major Revision**
- Add md and html support in file_handler (see the sketch below)
- Add interfaces in DataDiagnosis for md and html output
**Minor Revision**
- Move the excel and json output interfaces into DataDiagnosis
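A minimal sketch of the markdown/HTML output, assuming the diagnosis summary is held in a pandas DataFrame; `output_md` and `output_html` are illustrative names, not the actual file_handler interface.
```
# Illustrative only: function names and DataFrame layout are assumptions.
import pandas as pd


def output_md(df: pd.DataFrame, path: str) -> None:
    """Write the diagnosis summary as a markdown table (requires `tabulate`)."""
    with open(path, 'w') as f:
        f.write(df.to_markdown())


def output_html(df: pd.DataFrame, path: str) -> None:
    """Write the diagnosis summary as an HTML table."""
    df.to_html(path, na_rep='N/A')


summary = pd.DataFrame(
    {'Category': ['KernelLaunch'], 'kernel-launch/overhead_event:0': [0.00583]},
    index=['node-0'],
)
output_md(summary, 'diagnosis_summary.md')
output_html(summary, 'diagnosis_summary.html')
```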
__Description__
Cherry-pick bug fixes from v0.4.0 to main.
__Major Revisions__
* Bug - Fix issues for Ansible and benchmarks (#267)
* Tests - Refine test cases for microbenchmark (#268)
* Bug - Build openmpi with ucx support in rocm dockerfiles (#269)
* Benchmarks: Fix Bug - Fix fio build issue (#272)
* Docs - Unify metric and add doc for cublas and cudnn functions (#271)
* Monitor: Revision - Add 'monitor/' prefix to monitor metrics in result summary (#274)
* Bug - Fix bug of detecting if gpu_index is none (#275)
* Bug - Fix bugs in data diagnosis (#273)
* Bug - Fix issue that the root mpi rank may not be the first in the hostfile (#270)
* Benchmarks: Configuration - Update inference and network benchmarks in configs (#276)
* Docs - Upgrade version and release note (#277)
Co-authored-by: Yuting Jiang <v-yutjiang@microsoft.com>
**Description**
Add an ONNXRuntime inference benchmark based on the ORT Python API.
**Major Revision**
- Add `ORTInferenceBenchmark` class to export PyTorch models to ONNX and run inference (the flow is sketched after this list)
- Add tests and an example for the `ort-inference` benchmark
- Update the introduction docs.
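A minimal sketch of the export-then-infer flow the benchmark builds on, using the public torch.onnx and onnxruntime APIs; the model choice and file names are illustrative, and this is not the `ORTInferenceBenchmark` implementation itself.
```
# Illustrative only: export a PyTorch model to ONNX, then run it with ONNX Runtime.
import numpy as np
import torch
import torchvision.models as models
import onnxruntime as ort

model = models.resnet50().eval()
dummy = torch.randn(1, 3, 224, 224)
torch.onnx.export(model, dummy, 'resnet50.onnx', input_names=['input'], output_names=['output'])

# Use CUDAExecutionProvider instead when onnxruntime-gpu is installed.
session = ort.InferenceSession('resnet50.onnx', providers=['CPUExecutionProvider'])
outputs = session.run(None, {'input': np.random.rand(1, 3, 224, 224).astype(np.float32)})
print(outputs[0].shape)  # (1, 1000)
```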
**Description**
Add data diagnosis module.
**Major Revision**
- Add DataDiagnosis class to support rule-based data diagnosis for the result summary jsonl file of multiple nodes
- Add RuleOp class to define rule operators (a toy rule operator is sketched below)
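A toy rule operator, to illustrate the idea only; the actual RuleOp interface and thresholds differ.
```
# Illustrative only: flag a metric that degrades beyond a relative threshold from its baseline.
def variance_rule(value, baseline, threshold=-0.05):
    """Return True (rule violated) when (value - baseline) / baseline drops below threshold."""
    if baseline == 0:
        return False
    return (value - baseline) / baseline < threshold


# Throughput dropped ~12% from its baseline, so the rule fires.
print(variance_rule(400.0, 454.0))  # True
```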
**Description**
Add a tcp connectivity validation microbenchmark, which validates TCP connectivity between the current node and other nodes in the hostfile.
**Major Revision**
- Add the tcp connectivity validation microbenchmark with a related test and example (a minimal connectivity check is sketched below)
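A minimal sketch of the underlying connectivity check; the host names and port are hypothetical and this is not the benchmark's actual code.
```
# Illustrative only: probe TCP reachability of each host in a list.
import socket


def check_tcp(host, port=22, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


for host in ['node-0', 'node-1']:
    print(host, 'reachable' if check_tcp(host) else 'unreachable')
```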
**Description**
Add script to generate system config info.
**Major Revision**
- Add a script under superbench/tools to collect system config info into a dict (sketched below).
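A minimal sketch of collecting such info into a dict; the keys below are illustrative and the real script gathers far more (e.g. GPU, NIC, and memory details).
```
# Illustrative only: gather a few system facts into a dict.
import json
import platform
import subprocess


def get_system_config():
    info = {
        'hostname': platform.node(),
        'os': platform.platform(),
        'kernel': platform.release(),
        'cpu_arch': platform.machine(),
    }
    try:
        info['lscpu'] = subprocess.check_output(['lscpu'], text=True)
    except (OSError, subprocess.CalledProcessError):
        info['lscpu'] = None
    return info


print(json.dumps(get_system_config(), indent=2))
```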
**Description**
Generate the summarized output files from all nodes. For each metric, perform the reduce operation according to the `reduce_op` (a reduce sketch follows the example output below).
**Major Revision**
- Generate the summarized json file per node:
  - For microbenchmarks, the format is `{benchmark_name}/[{run_count}/]{metric_name}[:rank]`
  - For model benchmarks, the format is `{benchmark_name}/{sub_benchmark_name}/[{run_count}/]{metric_name}`
  - `[]` means the part is optional.
```
{
"kernel-launch/overhead_event:0": 0.00583,
"kernel-launch/overhead_event:1": 0.00545,
"kernel-launch/overhead_event:2": 0.00581,
"kernel-launch/overhead_event:3": 0.00572,
"kernel-launch/overhead_event:4": 0.00559,
"kernel-launch/overhead_event:5": 0.00591,
"kernel-launch/overhead_event:6": 0.00562,
"kernel-launch/overhead_event:7": 0.00586,
"resnet_models/pytorch-resnet50/steptime-train-float32": 544.0827468410134,
"resnet_models/pytorch-resnet50/throughput-train-float32": 353.7607016465773,
"resnet_models/pytorch-resnet50/steptime-train-float16": 425.40482617914677,
"resnet_models/pytorch-resnet50/throughput-train-float16": 454.0142363793973,
"pytorch-sharding-matmul/0/allreduce": 10.561786651611328,
"pytorch-sharding-matmul/1/allreduce": 10.561786651611328,
"pytorch-sharding-matmul/0/allgather": 10.088025093078613,
"pytorch-sharding-matmul/1/allgather": 10.088025093078613
}
```
- Generate the summarized jsonl file for all nodes; each line is the result from one node in json format.
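A minimal sketch of the per-metric reduce step; the op names are illustrative and this is not the actual summary code.
```
# Illustrative only: collapse per-rank values of one metric with the configured reduce_op.
from statistics import mean

REDUCE_OPS = {'min': min, 'max': max, 'sum': sum, 'avg': mean}


def reduce_metric(values, reduce_op='max'):
    return REDUCE_OPS[reduce_op](values)


per_rank = [0.00583, 0.00545, 0.00581, 0.00572]
print(reduce_metric(per_rank, 'max'))  # 0.00583
```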
* Add `sb deploy` command content.
* Fix inline if-expression syntax in playbook.
* Fix quote escape issue in bash command.
* Add custom env in config.
* Update default config for multi GPU benchmarks.
* Update MANIFEST.in to include jinja2 template.
* Require jinja2 minimum version.
* Fix occasional duplicate output in Ansible runner.
* Fix mixed color from Ansible and Python colorlog.
* Update according to comments.
* Change superbench.env from list to dict in config file.
* Add random dataset.
* Install pytorch-cpu for test docker.
* Fix typo.
* Add more test cases.
* Address comments.
Co-authored-by: Guoshuai Zhao <guzhao@microsoft.com>