superbenchmark

Граф коммитов

Автор	SHA1	Сообщение	Дата
Yuting Jiang	75573f59da	Benchmarks: Micro benchmarks - Add correctness check in cublas-function benchmark (#452 ) Description Add correctness check in cublas-function benchmark. Major Revision - add python code of correctness check in cublas-function benchmark and test	2023-01-03 14:59:30 +08:00
Yifan Xiong	0591da5f49	Benchmarks - Add cuBLASLt FP16 and FP8 GEMM micro-benchmark (#451 ) Add micro-benchmark for cublaslt fp8 gemm.	2023-01-03 05:28:56 +00:00
Yuting Jiang	678b1251f1	Benchmarks: Micro benchmarks - add source code of correctness check for cublas functions (#450 ) Description Add c source code of correctness check for cublas functions. Major Revision - add correctness check for all supported cublas functions - add --correctness option into binary Minor Revision - fix bug and template fill_data and prepare_tensor to get right memory-alignment output matrix for different datatype	2023-01-03 04:20:10 +00:00
Yuting Jiang	9dfefce350	Executor - Add stdout logging util module and enable real-time logging flushing in executor (#445 ) Description Add stdout logging util module and enable real-time logging flushing in executor Major Revision - Add stdout logging util module to redirect stdout into file log - enable stdout logging in executor to write benchmark output into both stdout and file `sb-bench.log` - enable real-time log flushing in run_command of microbenchmarks through config `log_flushing` Minor Revision - add log_n_step args to enable regular step time log in model benchmarks - udpate related docs	2022-12-30 09:40:28 +00:00
Yang Wang	f2634d8608	Benchmarks - Support `pair-wise` pattern in IB validation benchmark (#453 ) Description * Reuse `gen_pair_wise_config` in micro-benchmark	2022-12-30 13:02:52 +08:00
Yifan Xiong	a3c65b2a57	Dockerfile - Add CUDA11.8 Docker image for Nvidia arch90 GPUs (#449 ) Add Docker image for arch90 NVIDIA GPUs: * add CUDA11.8 Dockerfile * update archs in Makefile and benchmarks accordingly * update image build pipeline	2022-12-29 12:19:38 +00:00
Yang Wang	7838b6b154	Runner - Support `pair-wise` pattern in `mpi` mode (#447 ) * Extract pair-wise pattern from ib_validation	2022-12-29 08:23:36 +00:00
dependabot[bot]	6186146d59	Bump qs and express in /website (#440 ) Bumps [qs](https://github.com/ljharb/qs) and [express](https://github.com/expressjs/express). These dependencies needed to be updated together. Updates `qs` from 6.7.0 to 6.11.0 - [Release notes](https://github.com/ljharb/qs/releases) - [Changelog](https://github.com/ljharb/qs/blob/main/CHANGELOG.md) - [Commits](https://github.com/ljharb/qs/compare/v6.7.0...v6.11.0) Updates `express` from 4.17.1 to 4.18.2 - [Release notes](https://github.com/expressjs/express/releases) - [Changelog](https://github.com/expressjs/express/blob/master/History.md) - [Commits](https://github.com/expressjs/express/compare/4.17.1...4.18.2) --- updated-dependencies: - dependency-name: qs dependency-type: indirect - dependency-name: express dependency-type: indirect ... Signed-off-by: dependabot[bot] <support@github.com> Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>	2022-12-28 13:47:06 +08:00
dependabot[bot]	de6deb0e2d	Bump decode-uri-component from 0.2.0 to 0.2.2 in /website (#439 ) Bumps [decode-uri-component](https://github.com/SamVerschueren/decode-uri-component) from 0.2.0 to 0.2.2. - [Release notes](https://github.com/SamVerschueren/decode-uri-component/releases) - [Commits](https://github.com/SamVerschueren/decode-uri-component/compare/v0.2.0...v0.2.2) --- updated-dependencies: - dependency-name: decode-uri-component dependency-type: indirect ... Signed-off-by: dependabot[bot] <support@github.com> Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>	2022-12-27 13:40:49 +08:00
Yuting Jiang	6583ba2e40	Benchmark: Revision - Add wait time option to resolve mem-bw unstable issue (#438 ) Description Add wait time option to resolve mem-bw unstable issue.	2022-12-14 17:21:02 +08:00
Yuting Jiang	1deb2eaa29	downgrage transformers version to fix tersorrt (#441 ) Description Downgrage transformers version to fix tersorrt test failure.	2022-12-14 14:19:32 +08:00
Yang Wang	e4eeda0afd	Runner - support 'pattern' in 'mpi' mode to run tasks in parallel (#430 ) * add mpi-parallels mode * update according to comments * fix and update doc * update * merge into 'mpi' mode * udpate according to comments * fix testcases * fix ansible * regard pattern as field * udpate * fix flake8 version * add flake8 range * remove map-by from host config * udpate comments	2022-11-29 12:30:10 +08:00
dependabot[bot]	3c97381fd2	Bump loader-utils from 1.4.0 to 1.4.2 in /website (#431 ) Bumps [loader-utils](https://github.com/webpack/loader-utils) from 1.4.0 to 1.4.2. - [Release notes](https://github.com/webpack/loader-utils/releases) - [Changelog](https://github.com/webpack/loader-utils/blob/v1.4.2/CHANGELOG.md) - [Commits](https://github.com/webpack/loader-utils/compare/v1.4.0...v1.4.2) --- updated-dependencies: - dependency-name: loader-utils dependency-type: indirect ... Signed-off-by: dependabot[bot] <support@github.com> Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>	2022-11-18 15:13:04 +08:00
Yang Wang	57f7403c47	Update typing-extensions version to fix pipeline issue (#432 )	2022-11-17 19:39:52 +08:00
Yifan Xiong	1b86503d1e	CLI - Add non-zero return code for `sb [deploy,run]` (#425 ) Add non-zero return code for `sb deploy` and `sb run` command when there're Ansible failures in control plane. Return code is set to count of failure. For failures caused by benchmarks, return code is still set per benchmark in results json file.	2022-11-01 10:46:19 +08:00
Yifan Xiong	d7bb8303fb	CLI - Update version to include revision hash and date (#427 ) Update version to include revision hash and date in "{last tag}+g{git hash}.d{date}" format, here're the examples: * exact tag: 0.6.0 * commit after tag: 0.6.0+gcbb1b34 * commit after tag with local changes: 0.6.0+gcbb1b34.d20221028	2022-10-31 10:44:41 +08:00
Yifan Xiong	7a27732e97	CI/CD - Fix codecov status issue (#426 ) Fix codecov status issue due to service upgrade.	2022-10-27 17:41:06 +08:00
dependabot[bot]	12c48627d7	Bump ansi-html and webpack-dev-server in /website (#419 ) Removes [ansi-html](https://github.com/Tjatse/ansi-html). It's no longer used after updating ancestor dependency [webpack-dev-server](https://github.com/webpack/webpack-dev-server). These dependencies need to be updated together. Removes `ansi-html` Updates `webpack-dev-server` from 3.11.2 to 3.11.3 - [Release notes](https://github.com/webpack/webpack-dev-server/releases) - [Changelog](https://github.com/webpack/webpack-dev-server/blob/v3.11.3/CHANGELOG.md) - [Commits](https://github.com/webpack/webpack-dev-server/compare/v3.11.2...v3.11.3) --- updated-dependencies: - dependency-name: ansi-html dependency-type: indirect - dependency-name: webpack-dev-server dependency-type: indirect ... Signed-off-by: dependabot[bot] <support@github.com> Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>	2022-10-20 20:39:20 +08:00
Yifan Xiong	13c6919553	CI/CD - Update Codecov config (#418 ) Update codecov flags in configuration.	2022-10-18 12:30:59 +08:00
Yuting Jiang	3367c4f6cc	Benchmarks - Add support to allow list of custom config string in cudnn-functions and cublas-functions (#414 ) Description Add support to allow list of custom config string in cudnn-functions and cublas-functions.	2022-10-18 09:59:51 +08:00
Yifan Xiong	63e9b2d1bc	Release - SuperBench v0.6.0 (#409 ) Description Cherry-pick bug fixes from v0.6.0 to main. Major Revisions * Enable latency test in ib traffic validation distributed benchmark (#396) * Enhance parameter parsing to allow spaces in value (#397) * Update apt packages in dockerfile (#398) * Upgrade colorlog for NO_COLOR support (#404) * Analyzer - Update error handling to support exit code of sb result diagnosis (#403) * Analyzer - Make baseline file optional in data diagnosis and fix bugs (#399) * Enhance timeout cleanup to avoid possible hanging (#405) * Auto generate ibstat file by pssh (#402) * Analyzer - Format int type and unify empty value to N/A in diagnosis output file (#406) * Docs - Upgrade version and release note (#407) * Docs - Fix issues in document (#408) Co-authored-by: Yang Wang <yangwang1@microsoft.com> Co-authored-by: Yuting Jiang <yutingjiang@microsoft.com>	2022-09-06 18:06:05 +08:00
Yuting Jiang	733860d715	Analyzer - Add support to store values of metrics in data diagnosis (#392 ) Description Add support to store values of metrics in data diagnosis. Take the following rules as example: ``` nccl_store_rule: categories: NCCL_DIS store: True metrics: - nccl-bw:allreduce-run0/allreduce_1073741824_busbw - nccl-bw:allreduce-run1/allreduce_1073741824_busbw - nccl-bw:allreduce-run2/allreduce_1073741824_busbw - nccl-bw:allreduce-run3/allreduce_1073741824_busbw - nccl-bw:allreduce-run4/allreduce_1073741824_busbw nccl_rule: function: multi_rules criteria: 'lambda label:True if min(label["nccl_store_rule"].values())/max(label["nccl_store_rule"].values())<0.95 else False' categories: NCCL_DIS ``` nccl_store_rule will store the values of the metrics in dict and save them into `label["nccl_store_rule"]` , and then rccl_rule can use the values of metrics through `label["nccl_store_rule"].values()` in criteria	2022-08-23 03:25:32 +00:00
Yuting Jiang	10a79c4ea8	Analyzer - Add support for both jsonl and json format in data diagnosis (#388 ) Description Add support for both jsonl and json format in data diagnosis. Major Revision - Add support for both jsonl and json format in data diagnosis Minor Revision - change related doc - add jsonl support in cli	2022-08-22 10:57:00 +08:00
Yifan Xiong	626ac0a463	Update Python setup for require packages (#387 ) __Description__ Update Python setup for require packages. __Major Revisions__ * downgrade requests version to be compatible with python 3.6, add corresponding pipeline for 3.6 * add extra entry in extras_require for nested packages * update `pip install` contents accordingly	2022-08-17 11:33:57 +08:00
Yuting Jiang	e335556d7a	Benchmarks: Build Pipeline - Degrade perftest submodule to fix stability issue (#386 ) Description Degrade perftest submodule to v4.4-0.37 to fix stability issue. Issue: rdma-loopback is not stable on public version(v0.5/v0.6-rc1) Docker Version: v0.6-rc1-cuda11.1 Testbed: 8 A100 40GB GPUs (1 NDv4 node) Result: New perftest version introduce the variance, max-min/mean = 2% for v4.4-0.37, 8% for v4.5-0.2	2022-08-16 11:49:10 +08:00
Yang Wang	faeee0a7cc	Auto generate ibstat file for topo aware traffic pattern (#381 ) An enhancement for topo-aware IB performance validation #373. This PR will auto-generate a required ibstate file `ib_traffic_topo_aware_ibstat.txt` which is used as input to build a graph.	2022-08-13 18:20:42 +08:00
Yuting Jiang	b5c7c85d17	Analyzer: Rename fields in json of data diagnosis to be more readable (#382 ) Description Rename field in data diagnosis to be more readable. Major Revision - rename fields according to diagnosis/metric format Minor Revision - change type of diagnosis/issue_num to be int	2022-08-09 10:03:50 +08:00
Yifan Xiong	9c29c93114	Runner - Fix minimum timeout (#385 ) Fix minimum timeout: use 60s if config is shorter.	2022-08-08 11:22:24 +08:00
Yifan Xiong	9b8df883ae	Gracefully exit when timeout (#383 ) * Gracefully exit when timeout, add corresponding log and return code. * Set minimum timeout to 1 minute and enlarge Ansible timeout.	2022-08-04 13:05:34 +08:00
Yuting Jiang	ec16d42564	Analyzer - Add failure check feature in data diagnosis (#378 ) Description Add failure check feature in data diagnosis. Major Revision - Add failure check rule op to support that if there exists metric_regex not been matched by any metric in result, label as failedtest - Split performance issue and failedtest in categories Minor Revision - replace DataFrame.append() with pd.concat since append() will be removed in later version of pandas	2022-08-01 12:35:35 +08:00
Jie Zhang	ef4d65745b	Support topo-aware IB performance validation (#373 ) * Support topo-aware IB performance validation Add a new pattern `topo-aware`, so the user can run IB performance test based on VM's topology information. This way, the user can validate the IB performance across VM pairs with different distance as a quick test instead of pair-wise test. To run with topo-aware pattern, user needs to specify three required (and two optional) parameters in YAML config file: --pattern topo-aware --ibstat path to ibstat output --ibnetdiscover path to ibnetdiscover output --min_dist minimum distance of VM pairs (optional, default 2) --max_dist maximum distance of VM pairs (optional, default 6) The newly added topo_aware module then parses the topology information, builds a graph, and generates the VM pairs with the specified distance (# hops). The specified IB test will then be running across these generated VM pairs. Signed-off-by: Jie Zhang <jessezhang1010@gmail.com> * Add description about topology aware ib traffic tests Signed-off-by: Jie Zhang <jessezhang1010@gmail.com> * Add unit test to verify generated topology aware config file This commit adds unit test to verify the generated topology aware config file is correct. To do so, four new data files are added in order to invoke gen_topo_aware_config function to generate topology aware config file, then compares it with the expected config file. Signed-off-by: Jie Zhang <jessezhang1010@gmail.com> * Fix lint issue on Azure pipeline Signed-off-by: Jie Zhang <jessezhang1010@gmail.com>	2022-07-26 16:56:19 -07:00
Yang Wang	5d448eedbf	Fix unexpected base conversion when the result value is negative (#377 ) Fix an unexpected result value (`-0.125`) issue in ib traffic benchmark when encountering `-1` in raw output * Check if the value is valid before the base conversion * Add a test case to cover this situation	2022-07-25 15:27:46 +08:00
dependabot[bot]	02941e6e09	Bump terser from 4.8.0 to 4.8.1 in /website (#376 ) Bumps [terser](https://github.com/terser/terser) from 4.8.0 to 4.8.1. - [Release notes](https://github.com/terser/terser/releases) - [Changelog](https://github.com/terser/terser/blob/master/CHANGELOG.md) - [Commits](https://github.com/terser/terser/commits) --- updated-dependencies: - dependency-name: terser dependency-type: indirect ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>	2022-07-22 11:52:43 +08:00
Yifan Xiong	352ae0c95f	Fix port conflict in ib loopback (#375 ) Fix potential port conflict due to race condition between time-to-check to time-to-use, by binding the port all through. Modify the function to resolve flake8 C901 while keeping the logic same.	2022-07-20 11:30:00 +08:00
Yifan Xiong	16b6385dee	Add dependencies (#374 ) Add dependencies * include ndv4-topo.xml in cuda docker images * require requests version to avoid RequestsDependencyWarning	2022-07-13 08:42:53 +00:00
Yifan Xiong	b2875179bf	Fix issues in ib validation benchmark (#370 ) Fix several issues in ib validation benchmark: * continue running when timeout in the middle, instead of aborting whole mpi process * make timeout parameter configurable, set default to 120 seconds * avoid mixture of stdio and iostream when print to stdout * set default message size to 8M which will saturate ib in most cases * fix hostfile path issue so that it can be auto found in different cases	2022-07-09 19:57:11 +08:00
Yifan Xiong	e00a8180f6	Support node_num=1 in mpi mode (#372 ) Support `node_num: 1` in mpi mode, so that we can run mpi benchmarks in both 1 node and all nodes in one config by changing `node_num`. Update docs and add test case accordingly.	2022-07-08 09:24:17 +08:00
Yifan Xiong	9f03d5687a	Update dependencies and Dockerfile (#371 ) Update dependencies and Dockerfile: * upgrade nccl-tests and rccl-tests to current latest version to match NCCL/RCCL versions * unify image tag names on DockerHub * remove verbose output in Dockerfile and minor fix some flags	2022-07-06 10:31:41 +00:00
Yifan Xiong	a94ead34b0	CLI - Support SKU auto detect if running on Azure VM (#365 ) Support SKU auto detect and using corresponding benchmark config if running on Azure VM.	2022-07-05 10:52:39 +08:00
Yifan Xiong	620192a242	Fix issues in ib loopback benchmark (#369 ) Fix several issues in ib loopback benchmark: * use `--report_gbits` and divide by 8 to get GB/s, previous results are MiB/s / 1000 * use the ib_write_bw binary built in third_party instead of system path * update the metrics name so that different hca indices have same metric	2022-06-29 17:53:02 +00:00
Yifan Xiong	8ef7163a18	Deployment - Refine error message when GPU is not detected (#368 ) Refine error message when GPU is not detected. Possible solutions if hardware exists and drivers are already installed: * nvidia gpus: ```sh /sbin/modprobe nvidia-uvm D=`grep nvidia-uvm /proc/devices \| awk '{print $1}'` mknod -m 666 /dev/nvidia-uvm c $D 0 ``` * amd gpus ```sh modprobe amdgpu ```	2022-06-30 01:12:25 +08:00
Yifan Xiong	325a7338bf	Fix incorrect ulimit config in Dockerfile (#364 ) Fix incorrect ulimit nofile config in Dockerfile. Instead of bash, sh is used by default where `echo` does not accept any parameters and `-e` is written into /etc/security/limits.conf.	2022-06-24 14:14:00 +00:00
Yifan Xiong	bfaa1c837b	Support multiple IB/GPU in ib validation (#363 ) Description Support multiple IB/GPU devices run simultaneously in ib validation benchmark. Major Revisions - Revise ib_validation_performance.cc so that multiple processes per node could be used to launch multiple perftest commands simultaneously. For each node pair in the config, number of processes per node will run in parallel. - Revise ib_validation_performance.py to correct file paths and adjust parameters to specify different NICs/GPUs/NUMA nodes. - Fix env issues in Dockerfile for end-to-end test. - Update ib-traffic configuration examples in config files. - Update unit tests and docs accordingly. Closes #326.	2022-06-24 08:35:20 +00:00
Yifan Xiong	0f7b057a2d	Runner - Fix sudo issue when running without Docker (#362 ) Fix sudo issue when running without Docker, user account could be arbitrary in such case.	2022-06-19 11:56:36 +00:00
Yifan Xiong	483bf782e1	Update ROCm Dockerfile (#361 ) Description Update ROCm Dockerfile. Major Revisions - Add dockerfile for ROCm 5.1.3 - Merge 5.1.x and 5.0.x dockerfile - Remove 4.2 and 4.0 legacy - Update build pipeline accordingly	2022-06-19 17:26:39 +08:00
Yifan Xiong	60a3c74306	Fix cmake and build issues (#360 ) Description Fix cmake and build issues. Major Revision * Remove unnecessary boost build * Remove user-agent for mlc * Remove -j for third party to build each project in sequence * Fix ansible collections installation path	2022-06-15 13:07:57 +08:00
Yifan Xiong	a4937e95c6	Support `sb run` on host directly without Docker (#358 ) Description Support `sb run` on host directly without Docker Major Revisions - Add `--no-docker` argument for `sb run`. - Run on host directly if `--no-docker` if specified. - Update docs and tests correspondingly.	2022-06-14 10:57:01 +08:00
dependabot[bot]	528d69bd13	Bump eventsource from 1.1.0 to 1.1.1 in /website (#357 ) Bumps [eventsource](https://github.com/EventSource/eventsource) from 1.1.0 to 1.1.1. - [Release notes](https://github.com/EventSource/eventsource/releases) - [Changelog](https://github.com/EventSource/eventsource/blob/master/HISTORY.md) - [Commits](https://github.com/EventSource/eventsource/compare/v1.1.0...v1.1.1) --- updated-dependencies: - dependency-name: eventsource dependency-type: indirect ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>	2022-06-06 11:39:35 +08:00
dependabot[bot]	77f8048ad8	Bump cross-fetch from 3.1.4 to 3.1.5 in /website (#349 ) Bumps [cross-fetch](https://github.com/lquixada/cross-fetch) from 3.1.4 to 3.1.5. - [Release notes](https://github.com/lquixada/cross-fetch/releases) - [Commits](https://github.com/lquixada/cross-fetch/compare/v3.1.4...v3.1.5) --- updated-dependencies: - dependency-name: cross-fetch dependency-type: indirect ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>	2022-06-02 19:06:54 +08:00
dependabot[bot]	cdd19e6f30	Bump async from 2.6.3 to 2.6.4 in /website (#351 ) Bumps [async](https://github.com/caolan/async) from 2.6.3 to 2.6.4. - [Release notes](https://github.com/caolan/async/releases) - [Changelog](https://github.com/caolan/async/blob/v2.6.4/CHANGELOG.md) - [Commits](https://github.com/caolan/async/compare/v2.6.3...v2.6.4) --- updated-dependencies: - dependency-name: async dependency-type: indirect ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>	2022-06-02 13:26:35 +08:00

1 2 3 4 5 ...

437 Коммитов Все ветки Поиск

437 Коммитов

Все ветки