**Description**
Fix bugs in data diagnosis.
**Major Revision**
- Add support for getting the baseline of a metric that uses custom benchmark naming with ':', such as 'nccl-bw:default/allreduce_8_bw:0' (see the sketch after this list).
- Save the raw data of all metrics, rather than only the metrics defined in diagnosis_rules.yaml, when output_all is True.
- Fix a bug of using the wrong column index when applying formatting (red color and percentile) in the Excel output.
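A minimal sketch of the intended lookup, assuming a flat baseline dict keyed by 'benchmark/metric'; the helper name and baseline layout are hypothetical, not the actual implementation:

```python
def get_metric_baseline(metric, baseline):
    """Find the baseline for a metric whose benchmark part may contain ':'.

    Metric names look like '<benchmark>/<metric>:<rank>', and the benchmark
    part may itself use custom naming with ':', e.g.
    'nccl-bw:default/allreduce_8_bw:0'.
    """
    benchmark, _, rest = metric.partition('/')
    # Strip the rank suffix from the metric part only, so that a ':' inside
    # the benchmark name (the custom annotation) is left untouched.
    metric_name, sep, _rank = rest.rpartition(':')
    if not sep:
        metric_name = rest
    return baseline.get('{}/{}'.format(benchmark, metric_name))


baseline = {'nccl-bw:default/allreduce_8_bw': 180.0}
print(get_metric_baseline('nccl-bw:default/allreduce_8_bw:0', baseline))  # 180.0
```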
**Description**
Cherry-pick bug fixes from v0.5.0 to main.
**Major Revisions**
* Bug - Force to fix ort version as '1.10.0' (#343)
* Bug - Support no matching rules and unify the output name in result_summary (#345)
* Analyzer - Support regex in annotations of benchmark naming for metrics in rules (#344)
* Bug - Fix bugs in sync results on root rank for e2e model benchmarks (#342)
* Bug - Fix bug of duration feature for model benchmarks in distributed mode (#347)
* Docs - Upgrade version and release note (#348)
Co-authored-by: Yuting Jiang <v-yutjiang@microsoft.com>
**Description**
Integrate result summary and update the output format of data diagnosis.
**Major Revision**
- Integrate result summary.
- Add MD and HTML formats for data diagnosis.
**Description**
Use the config `log_raw_data` to control whether to log the raw data into a file. The default value is `no`. It can be set to `yes` for particular benchmarks, such as the NCCL/RCCL tests, to save their raw output into a file (see the example below).
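For example, a config entry along these lines could enable it for the NCCL test (the exact nesting under `parameters` is an assumption, not the verified schema):

```yaml
superbench:
  benchmarks:
    nccl-bw:
      enable: true
      parameters:
        log_raw_data: yes   # default is no; save raw NCCL output to file
```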
**Description**
Add result summary in Excel, MD, and HTML formats.
**Major Revision**
- Add a ResultSummary class to support result summaries in Excel, MD, and HTML formats.
- Abstract a RuleBase class for the functions shared by DataDiagnosis and ResultSummary (sketched after this list).
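A rough sketch of that class layout; the method names here are illustrative, not the exact API:

```python
class RuleBase:
    """Functions shared by DataDiagnosis and ResultSummary (illustrative)."""
    def __init__(self, rule_file):
        self._rules = self._parse_rules(rule_file)

    def _parse_rules(self, rule_file):
        """Read and validate rules from the YAML rule file."""
        ...

    def _get_metrics_by_benchmark(self, raw_data_df):
        """Group metric columns by the benchmark they belong to."""
        ...


class DataDiagnosis(RuleBase):
    def run_diagnosis_rules(self, raw_data_df):
        """Apply each rule and collect defective machines."""
        ...


class ResultSummary(RuleBase):
    def generate_summary(self, raw_data_df, output_format='excel'):
        """Aggregate metrics per rule and render excel/md/html output."""
        ...
```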
**Description**
Modifications adding GPU-Burn to SuperBench.
- Added the gpu-burn third-party submodule.
- Modified the Makefile to build the gpu-burn binary.
- Added/modified micro-benchmarks to add the gpu-burn Python scripts.
- Modified the default and azure_ndv4 configs to add gpu-burn.
**Description**
Add MD and HTML output formats for DataDiagnosis.
**Major Revision**
- Add MD and HTML support in file_handler (rendering sketched after this list).
- Add interfaces in DataDiagnosis for MD and HTML output.
**Minor Revision**
- Move the Excel and JSON output interfaces into DataDiagnosis.
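The rendering itself can delegate to pandas, roughly as below; a sketch, assuming the diagnosis results live in a DataFrame (note that `to_markdown` needs the optional `tabulate` package):

```python
import pandas as pd

def output_diagnosis_files(data_df: pd.DataFrame, output_dir: str) -> None:
    """Write the diagnosis DataFrame in markdown and HTML (illustrative)."""
    with open('{}/diagnosis_summary.md'.format(output_dir), 'w') as f:
        f.write(data_df.to_markdown())
    data_df.to_html('{}/diagnosis_summary.html'.format(output_dir))
```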
**Description**
The BatchNorm operator is not numerically stable in fp16. The PyTorch documentation recommends keeping the BN op in fp32 for fp16 AMP models; refer to https://pytorch.org/docs/stable/amp.html#ops-that-can-autocast-to-float32. Preserving BN in fp32 makes SuperBench reflect real workloads more accurately.
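One common way to realize this is to cast the model to fp16 but flip BatchNorm modules back to fp32; a minimal sketch, not necessarily the exact change made here:

```python
import torch

def half_with_fp32_batchnorm(model: torch.nn.Module) -> torch.nn.Module:
    """Cast a model to fp16 while keeping BatchNorm layers in fp32."""
    model = model.half()
    for module in model.modules():
        # BatchNorm statistics are numerically unstable in fp16, so keep
        # these layers in fp32 as the AMP documentation recommends.
        if isinstance(module, torch.nn.modules.batchnorm._BatchNorm):
            module.float()
    return model
```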
**Description**
Add a multi-rules feature for data diagnosis to support combined checks across multiple rules.
**Major Revision**
- Revise the rule design to support combining multiple rules in one check (example rule file after this list).
- Update the related code and tests.
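For illustration, a hypothetical diagnosis_rules.yaml fragment where a machine is labeled defective only when two stored checks both fire; the field names are assumptions, not the final schema:

```yaml
superbench:
  rules:
    rule0:
      function: variance
      criteria: 'lambda x: x < -0.05'
      store: true                   # record the result, do not label alone
      metrics:
        - kernel-launch/event_overhead
    rule1:
      function: variance
      criteria: 'lambda x: x < -0.05'
      store: true
      metrics:
        - kernel-launch/wall_overhead
    rule2:
      categories: KernelLaunch
      function: multi_rules
      # combined check: defective only if both rule0 and rule1 triggered
      criteria: 'lambda label: label["rule0"] and label["rule1"]'
```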
**Description**
Add support for `init_process_group` with pytorch>=1.9.0.
**Major Revision**
- Use PrefixStore (on top of TCPStore) to call `init_process_group` manually for each model run (sketched below).
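A condensed sketch of that approach using the torch.distributed store APIs (host/port plumbing simplified):

```python
import torch.distributed as dist

def init_process_group_manually(run_id, rank, world_size, addr, port):
    """Initialize a process group through an explicit store.

    A per-run PrefixStore over one shared TCPStore gives every model run
    an isolated key namespace, which works across pytorch>=1.9.0.
    """
    store = dist.TCPStore(addr, port, world_size, rank == 0)
    prefix_store = dist.PrefixStore(str(run_id), store)
    dist.init_process_group(
        backend='nccl', store=prefix_store, rank=rank, world_size=world_size
    )
```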
**Description**
This commit removes NUMA binding for device-to-device tests, because NUMA does not affect their performance, and revises the benchmark metrics accordingly.
**Description**
This commit does the following to reduce result variance in the gpu_copy benchmark (items 1 and 2 are sketched after the list):
1) Add a warmup phase to avoid timing instability caused by first-time CUDA kernel launch overhead;
2) Use CUDA events for timing instead of CPU timestamps;
3) Make data checking an option, which should stay disabled in performance tests;
4) Enlarge the message size in the performance benchmark.
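The warmup-plus-CUDA-event pattern, shown here in Python with torch for brevity (the actual benchmark is native CUDA code, so this is only an analogy):

```python
import torch

def time_d2d_copy(src, dst, num_warmup=20, num_iters=100):
    """Average device-to-device copy time in ms, using CUDA events."""
    for _ in range(num_warmup):
        dst.copy_(src)                     # absorb first-launch overhead
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(num_iters):
        dst.copy_(src)
    end.record()
    torch.cuda.synchronize()               # timing on device, not CPU clocks
    return start.elapsed_time(end) / num_iters
```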
**Description**
Sync the E2E training results among all workers and clean up metric naming for single-rank output.
**Major Revision**
- Sync (do an allreduce max on) the E2E training results among all workers (sketched below).
- Avoid the ':0' suffix in the metric name when only one rank has output.
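The sync itself reduces to a max all-reduce over the result tensor, roughly (tensor layout is assumed):

```python
import torch
import torch.distributed as dist

def sync_results_on_all_ranks(step_times):
    """Element-wise max of per-step results across all workers (sketch)."""
    tensor = torch.tensor(step_times, dtype=torch.float32, device='cuda')
    dist.all_reduce(tensor, op=dist.ReduceOp.MAX)
    return tensor.tolist()
```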
**Description**
Add timeout feature for each benchmark.
**Major Revision**
- Add a `timeout` config for each benchmark. In the current config files, only kernel-launch sets the timeout, as an example (see the snippet after this list); other benchmarks can be set in the future.
- Set the timeout config for `ansible_runner.run()`. The runner will get return code 254 on timeout:
  [ansible.py:80][WARNING] Run failed, return code 254.
- Use the `timeout` command to terminate the client process.
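For example, the kernel-launch entry in the config could carry the timeout like this (the nesting is an assumption based on the description above):

```yaml
superbench:
  benchmarks:
    kernel-launch:
      enable: true
      timeout: 120   # seconds; the runner aborts the benchmark after this
```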
**Description**
Update benchmark naming to support annotations (example config at the end).
**Major Revisions**
- Update the name handling for `create_benchmark_context` in the executor.
- Keep backward compatibility for model benchmarks using the "_models" suffix.
- Update the documentation.
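For example, a hypothetical config running the same benchmark twice under different annotations; the annotation then shows up in metric names such as 'nccl-bw:default/allreduce_8_bw:0':

```yaml
superbench:
  benchmarks:
    nccl-bw:default:     # annotation after ':' distinguishes the run
      enable: true
    nccl-bw:gdr-only:    # same benchmark, different annotation
      enable: true
```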