superbenchmark

Граф коммитов

Автор	SHA1	Сообщение	Дата
Yifan Xiong	352ae0c95f	Fix port conflict in ib loopback (#375 ) Fix potential port conflict due to race condition between time-to-check to time-to-use, by binding the port all through. Modify the function to resolve flake8 C901 while keeping the logic same.	2022-07-20 11:30:00 +08:00
Yifan Xiong	b2875179bf	Fix issues in ib validation benchmark (#370 ) Fix several issues in ib validation benchmark: * continue running when timeout in the middle, instead of aborting whole mpi process * make timeout parameter configurable, set default to 120 seconds * avoid mixture of stdio and iostream when print to stdout * set default message size to 8M which will saturate ib in most cases * fix hostfile path issue so that it can be auto found in different cases	2022-07-09 19:57:11 +08:00
Yifan Xiong	e00a8180f6	Support node_num=1 in mpi mode (#372 ) Support `node_num: 1` in mpi mode, so that we can run mpi benchmarks in both 1 node and all nodes in one config by changing `node_num`. Update docs and add test case accordingly.	2022-07-08 09:24:17 +08:00
Yifan Xiong	a94ead34b0	CLI - Support SKU auto detect if running on Azure VM (#365 ) Support SKU auto detect and using corresponding benchmark config if running on Azure VM.	2022-07-05 10:52:39 +08:00
Yifan Xiong	620192a242	Fix issues in ib loopback benchmark (#369 ) Fix several issues in ib loopback benchmark: * use `--report_gbits` and divide by 8 to get GB/s, previous results are MiB/s / 1000 * use the ib_write_bw binary built in third_party instead of system path * update the metrics name so that different hca indices have same metric	2022-06-29 17:53:02 +00:00
Yifan Xiong	bfaa1c837b	Support multiple IB/GPU in ib validation (#363 ) Description Support multiple IB/GPU devices run simultaneously in ib validation benchmark. Major Revisions - Revise ib_validation_performance.cc so that multiple processes per node could be used to launch multiple perftest commands simultaneously. For each node pair in the config, number of processes per node will run in parallel. - Revise ib_validation_performance.py to correct file paths and adjust parameters to specify different NICs/GPUs/NUMA nodes. - Fix env issues in Dockerfile for end-to-end test. - Update ib-traffic configuration examples in config files. - Update unit tests and docs accordingly. Closes #326.	2022-06-24 08:35:20 +00:00
Yifan Xiong	a4937e95c6	Support `sb run` on host directly without Docker (#358 ) Description Support `sb run` on host directly without Docker Major Revisions - Add `--no-docker` argument for `sb run`. - Run on host directly if `--no-docker` if specified. - Update docs and tests correspondingly.	2022-06-14 10:57:01 +08:00
Yuting Jiang	54da021b4d	Analyzer - Fix bugs in data diagnosis (#355 ) Description Fix bugs in data diagnosis. Major Revision - add support to get baseline of the metric which uses custom benchmark naming with ':' like 'nccl-bw:default/allreduce_8_bw:0' - save raw data of all metrics rather than metrics defined in diagnosis_rules.yaml when output_all is True - fix bug of using wrong column index when applying format(red color and percentile) in the excel	2022-06-01 17:12:38 +08:00
Yifan Xiong	6681c72043	Release - SuperBench v0.5.0 (#350 ) Description Cherry-pick bug fixes from v0.5.0 to main. Major Revisions * Bug - Force to fix ort version as '1.10.0' (#343) * Bug - Support no matching rules and unify the output name in result_summary (#345) * Analyzer - Support regex in annotations of benchmark naming for metrics in rules (#344) * Bug - Fix bugs in sync results on root rank for e2e model benchmarks (#342) * Bug - Fix bug of duration feature for model benchmarks in distributed mode (#347) * Docs - Upgrade version and release note (#348) Co-authored-by: Yuting Jiang <v-yutjiang@microsoft.com>	2022-04-29 16:22:55 +08:00
guoshzhao	80dcc8aaec	Benchmarks: Add Benchmark - Add FAMBench based on docker benchmark (#338 ) Description Integrate FAMBench into superbench based on docker implementation: https://github.com/facebookresearch/FAMBench The script to run all benchmarks is: https://github.com/facebookresearch/FAMBench/blob/main/benchmarks/run_all.sh	2022-04-11 15:31:07 +08:00
Yuting Jiang	8dc19ca4af	CLI - Integrate output all nodes diagnosis results (#339 ) Description Integrate output all nodes diagnosis results.	2022-04-11 13:42:04 +08:00
Yuting Jiang	55b0f9d239	Analyzer: Add Feature - Output results of all nodes in data diagnosis (#336 ) Description Output results of all nodes in data diagnosis.	2022-04-10 18:57:15 +08:00
Yuting Jiang	f15da60b2b	CLI - Integrage result summary and update output format of data diagnosis (#335 ) Description Integrage result summary and update output format of data diagnosis. Major Revision - integrage result summary - add md and html format for data diagnosis	2022-04-08 18:48:43 +08:00
guoshzhao	6d895da83c	Benchmarks: Add Feature - Provide option to save raw data into file. (#333 ) Description Use config `log_raw_data` to control whether log the raw data into file or not. The default value is `no`. We can set it as `yes` for some particular benchmarks to save the raw data into file, such as NCCL/RCCL test.	2022-04-01 16:26:09 +08:00
Yuting Jiang	84fed1ce18	Analyzer: Add feature - Add result summary in excel,md,html format (#320 ) Description Add result summary in excel,md,html format. Major Revision - Add ResultSummary class to support result summary in excel,md,html format. - Abstract RuleBase class for common-used functions in DataDiagnosis and ResultSummary.	2022-03-24 15:32:01 +08:00
rafsalas19	ff51a3cee9	Benchmarks: Add Feature - Add GPU-Burn as microbenchmark (#324 ) Description Modifications adding GPU-Burn to SuperBench. - added third party submodule - modified Makefile to make gpu-burn binary - added/modified microbenchmarks to add gpu-burn python scripts - modified default and azure_ndv4 configs to add gpu-burn	2022-03-16 16:20:11 +08:00
Yuting Jiang	b3c95f1827	Analyzer - Add md and html output format for DataDiagnosis (#325 ) Description Add md and html output format for DataDiagnosis. Major Revision - add md and html support in file_handler - add interface in DataDiagnosis for md and HTML output Minor Revision - move excel and json output interface into DataDiagnosis	2022-03-15 18:04:11 +08:00
Yuting Jiang	1ec055e1c2	Analyzer: Revise - Abstract RuleBase from DataDiagnosis (#321 ) Description Abstract RuleBase from DataDiagnosis.	2022-03-07 17:25:07 +08:00
Yuting Jiang	97ed12f97f	Analyzer: Add Feature - Add multi-rules feature for data diagnosis (#289 ) Description Add multi-rules feature for data diagnosis to support multiple rules' combined check. Major Revision - revise rule design to support multiple rules combination check - update related codes and tests	2022-02-20 16:59:38 +08:00
Ziyue Yang	6cdf759543	Benchmarks: Revise Code - Eliminate NUMA binding for device-to-device tests in gpu_copy (#302 ) Description This commit remove NUMA binding for device-to-device tests because NUMA doesn't affect performance, and revise benchmark metrics accordingly.	2022-02-09 20:30:42 +08:00
Ziyue Yang	682b2c120d	Benchmarks: Revise Code - Make data checking in gpu_copy optional (#301 ) This commit makes data checking in gpu_copy optional, because it will take too long time if message size is large.	2022-02-08 10:59:27 +08:00
Ziyue Yang	853890559a	Benchmarks: Revise Code - Reduce result variance in gpu_copy benchmark (#298 ) Description This commit does the following to optimize result variance in gpu_copy benchmark: 1) Add warmup phase for gpu_copy benchmark to avoid timing instability caused by first-time CUDA kernel launch overhead; 2) Use CUDA events for timing instead of CPU timestamps; 3) Make data checking an option that is not preferred to be enabled in performance test; 4) Enlarge message size in performance benchmark.	2022-02-07 13:16:13 +08:00
Yifan Xiong	3419447c11	Benchmarks - Support T4 and A10 in GEMM benchmark (#294 ) Support T4 and A10 in GEMM benchmark.	2022-01-29 13:26:00 +00:00
Yifan Xiong	3524975cfc	Config - Support customized env for all modes (#295 ) Support customized env for all modes in configuration.	2022-01-29 08:19:48 +00:00
guoshzhao	d03d110f55	Benchmarks: Add Feature - Sync the E2E training results among all workers for each step. (#287 ) Description Please write a brief description and link the related issue if have. Major Revision - Sync (do allreduce max) the E2E training results among all workers. - Avoid using ':0' in metric name if there has only one rank having output.	2022-01-28 20:35:53 +08:00
guoshzhao	d877ca2322	Benchmarks: Add Feature - Add timeout feature for each benchmark. (#288 ) Description Add timeout feature for each benchmark. Major Revision - Add `timeout` config for each benchmark. In current config files, only set the timeout for kernel-launch as example. Other benchmarks can be set in the future. - Set the timeout config for `ansible_runner.run()`. Runner will get the return code 254: [ansible.py:80][WARNING] Run failed, return code 254. - Using `timeout` command to terminate the client process.	2022-01-28 08:16:32 +00:00
Yifan Xiong	7d7cd3dc63	Config - Update benchmark naming to support annotations (#284 ) __Description__ Update benchmark naming to support annotations. __Major Revisions__ - Update name for `create_benchmark_context` in executor. - Backward compatibility for model benchmarks using "_models" suffix. - Update documents.	2022-01-25 09:54:58 +00:00
Ziyue Yang	74421ffee0	Benchmarks: Add Feature - Add bidirectional test support in gpu_copy benchmark (#285 ) Description This commit adds bidirectional tests in gpu_copy benchmark for both device-host transfer and device-device transfer, and revises related tests.	2022-01-21 13:45:37 +08:00
guoshzhao	fd2bc9e048	Benchmarks: Add Feature - Add percentile metrics for ort and pytorch inference benchmarks (#283 ) Description Add 50th, 90th, 95th, 99th, 99.9th latency metrics for ORT and pytorch inference benchmarks.	2022-01-19 10:49:56 +08:00
Yifan Xiong	f7ffc54522	CLI - Add command sb benchmark [list,list-parameters] (#279 ) __Description__ Add command `sb benchmark list` and `sb benchmark list-parameters` to support listing all optional parameters for benchmarks. <details> <summary>Examples</summary> <pre> $ sb benchmark list -n [a-z]+-bw -o table Result -------- mem-bw nccl-bw rccl-bw </pre> <pre> $ sb benchmark list-parameters -n mem-bw === mem-bw === optional arguments: --bin_dir str Specify the directory of the benchmark binary. --duration int The elapsed time of benchmark in seconds. --mem_type str [str ...] Memory types to benchmark. E.g. htod dtoh dtod. --memory str Memory argument for bandwidthtest. E.g. pinned unpinned. --run_count int The run count of benchmark. --shmoo_mode Enable shmoo mode for bandwidthtest. default values: {'bin_dir': None, 'duration': 0, 'mem_type': ['htod', 'dtoh'], 'memory': 'pinned', 'run_count': 1} </pre> </details> __Major Revisions__ * Add `sb benchmark list` to list benchmarks matching given name. * Add `sb benchmark list-parameters` to list parameters for benchmarks which match given name. __Minor Revisions__ * Sort format help text for argparse.	2022-01-18 08:40:03 +00:00
Yifan Xiong	ff563b66af	Release - SuperBench v0.4.0 (#278 ) __Description__ Cherry-pick bug fixes from v0.4.0 to main. __Major Revisions__ * Bug - Fix issues for Ansible and benchmarks (#267) * Tests - Refine test cases for microbenchmark (#268) * Bug - Build openmpi with ucx support in rocm dockerfiles (#269) * Benchmarks: Fix Bug - Fix fio build issue (#272) * Docs - Unify metric and add doc for cublas and cudnn functions (#271) * Monitor: Revision - Add 'monitor/' prefix to monitor metrics in result summary (#274) * Bug - Fix bug of detecting if gpu_index is none (#275) * Bug - Fix bugs in data diagnosis (#273) * Bug - Fix issue that the root mpi rank may not be the first in the hostfile (#270) * Benchmarks: Configuration - Update inference and network benchmarks in configs (#276) * Docs - Upgrade version and release note (#277) Co-authored-by: Yuting Jiang <v-yutjiang@microsoft.com>	2021-12-30 16:24:00 +08:00
Yifan Xiong	cb8a3cfb15	Benchmarks - Add transformers for TensorRT inference (#254 ) Add transformers for TensorRT inference.	2021-12-13 13:21:32 +00:00
Ziyue Yang	10012a0a47	Docs - Add benchmark metrics for cpu-memory-bw-latency (#264 ) Description Add benchmark metrics for cpu-memory-bw-latency.	2021-12-13 19:08:19 +08:00
Ziyue Yang	b6781968f2	Benchmarks: Fix Comment - Correct benchmark name in test_gpu_copy_bw_performance.py #263 Description Benchmarks: Fix Comment - Correct benchmark name in test_gpu_copy_bw_performance.py.	2021-12-13 07:02:39 +00:00
Hossein Pourreza	b590409e0f	Benchmarks: Add Benchmark - Add mlc benchmark to superbench (#216 ) Description Add mlc memory bandwidth and latency micro benchmark to Superbench. Major Revision - Add mlc benchmark with test and example files	2021-12-13 13:47:42 +08:00
guoshzhao	4d85630abb	Benchmarks: Add Benchmark - Add ONNXRuntime inference benchmark based on ORT python API (#245 ) Description Add ONNXRuntime inference benchmark based on ORT python API. Major Revision - Add `ORTInferenceBenchmark` class to export pytorch model to onnx model and do inference - Add tests and example for `ort-inference` benchmark - Update the introduction docs.	2021-12-10 13:53:11 +00:00
Yuting Jiang	c2f942cb6f	Analyzer: Add Feature - Add basic analysis features (#248 ) Description Add basic analysis features. Major Revision - Add statistics, correlations of the raw data - Add numeric outlier detection(inter_quartile_range) - Add boxplot for selected metric	2021-12-10 11:01:59 +00:00
guoshzhao	6e357fb9d2	Monitor: Integration - Integrate monitor into Superbench (#259 ) Description Integrate monitor into Superbench. Major Revision - Initialize, start and stop monitor in SB executor. - Parse the monitor data in SB runner and merge into benchmark results. - Specify ReduceType for monitor metrics, such as MAX, MIN and LAST. - Add monitor configs into config file.	2021-12-10 09:33:13 +00:00
guoshzhao	afea9913ae	Benchmarks: Fix Bug - Set reduce_op type for metirc return_code (#261 ) Description Set the `reduce_op` type for metirc `return_code` as `None`.	2021-12-10 16:02:29 +08:00
Yuting Jiang	ed2f3c3c82	CLI - Integrate data diagnosis (#260 ) Description Add cli to integrate data diagnosis module.	2021-12-10 06:11:00 +00:00
Yuting Jiang	9f56b2198f	Benchmarks: Unify metric names of benchmarks (#252 ) Description Unify metric names of benchmarks.	2021-12-09 04:48:42 +00:00
Yuting Jiang	c13ed2a297	Analyzer: Initialization - Add baseline-based data diagnosis module (#242 ) Description Add data diagnosis module. Major Revision - Add DataDiagnosis class to support rule-based data diagnosis for result summary jsonl file of multi nodes - Add RuleOp class to define rule operators	2021-12-08 18:22:00 +08:00
guoshzhao	44f0270ec4	Benchmarks: Add Feature - Add return_code metric into result (#256 ) Description Add return_code metric into result and revise unit tests.	2021-12-07 07:32:37 +00:00
guoshzhao	371fd61cea	Benchmarks: Add Feature - Add 'ignore_invalid' option when register benchmarks. (#247 ) Description If `ignore_invalid` is True, and 'required' arguments are not set when register the benchmark, the arguments should be provided by user in config and skip the arguments checking.	2021-12-02 10:26:56 +00:00
Yifan Xiong	b4ea97bfa4	Benchmark: Replace `-c` argument with `-N` for `numactl` in Configuration (#250 ) Description Replace `-c` argument with `-N` for `numactl` since the old `-c`/`--cpubind` argument is deprecated.	2021-12-02 09:27:03 +00:00
guoshzhao	4074f12c1c	Monitor: Initialization - Add Monitor and MonitorRecord class (#240 ) Description Add the initial version of Monitor. Major Revision - Add `Monitor` class to launch background process for monitoring. - Add `MonitorRecord` class to save the data one time capturing.	2021-11-18 15:54:18 +08:00
guoshzhao	cc70f9c18c	Benchmarks: Add Feature - Extend the device manager utility to support more functions. (#239 ) Description Rename `nvidia_helper` utility as `device_manager` module and support more functions: ``` device_manager.get_device_count() device_manager.get_device_utilization(idx) device_manager.get_device_temperature(idx) device_manager.get_device_power_limit(idx) device_manager.get_device_memory(idx) device_manager.get_device_row_remapped_info(idx) device_manager.get_device_ecc_error(idx) ```	2021-11-15 14:24:04 +08:00
Yifan Xiong	8a00c8a03b	Benchmarks - Add TensorRT inference benchmark (#236 ) __Description__ Add TensorRT inference benchmark for torchvision models. __Major Revision__ - Measure TensorRT inference performance.	2021-11-12 15:27:16 +08:00
Yuting Jiang	54919424c3	Benchmarks: Add Benchmark - Add ib traffic validation distributed benchmark (#215 ) Description Add ib traffic validation distributed benchmark. Major Revision - Add ib traffic validation distributed benchmark, example and test	2021-11-10 01:18:41 +08:00
Ziyue Yang	008e0fe1d8	Benchmarks: Add Feature - Add CPU-initiated copy and dtod support to gpu-sm-copy benchmark (#230 ) Description This commit does the following: 1) Adds CPU-initiated copy benchmark; 2) Adds dtod benchmark; 3) Support scanning NUMA nodes and GPUs inside the benchmark program; 4) Change the name of gpu-sm-copy to gpu-copy.	2021-10-30 11:19:09 +08:00
guoshzhao	e98a68124e	Benchmarks: Add Benchmark - Add onnx model benchmarks based on docker image. (#227 ) Add RocmOnnxModelBenchmark class to run benchmarks packaged in superbench/benchmark:rocm4.3.1-onnxruntime1.9.0	2021-10-27 18:41:40 +08:00
Yuting Jiang	6003f2c2a2	Benchmarks: Add Benchmark - Add gpcnet microbenchmark (#229 ) Description Add gpcnet microbenchmark Major Revision - add 2 microbenmark for gpcnet, gpc-network-test, gpc-network-load-test - add related test and example file	2021-10-22 08:40:01 +00:00
guoshzhao	f841c8f466	Benchmarks: Add Feature - Support AMD and CUDA platform for DockerBenchmark. (#226 ) Description Add CudaDockerBenchmark and RocmDockerBenchmark to support amd and cuda platform for DockerBenchmark.	2021-10-22 15:22:15 +08:00
guoshzhao	455ad1f873	revise the term onnx to onnxruntime. (#232 ) Description Revise the all the term `onnx` to `onnxruntime`.	2021-10-21 04:29:27 +00:00
Yuting Jiang	49cc8f9a8c	Benchmarks: Add Benchmark - Add tcp connectivity validation microbenchmark (#217 ) Description Add tcp connectivity validation microbenchmark which is to validate TCP connectivity between current node and several nodes in the hostfile. Major Revision - Add tcp connectivity validation microbenchmark and related test, example	2021-10-12 23:42:12 +00:00
guoshzhao	f944245694	Benchmarks: Add Feature - Add option to use fp32 instead of tf32 (#213 ) Description Add option `force_fp32` to use fp32 instead of tf32, only takes effect on Ampere or newer GPUs.	2021-09-28 05:53:01 +08:00
Yifan Xiong	dfbd70b129	Release - SuperBench v0.3.0 (#212 ) Description Cherry-pick bug fixes from v0.3.0 to main. Major Revisions * Docs - Upgrade version and release note (#209) * Benchmarks: Build Pipeline - Update rccl-test git submodule to dc1ad48 (#210) * Benchmarks: Update - Update benchmarks in configuration file (#208) * CI/CD - Update GitHub Action VM (#211) * Benchmarks: Fix Bug - Fix wrong parameters for gpu-sm-copy-bw in configuration examples (#203) * CI/CD - Fix bug in build image for push event (#205) * Benchmark: Fix Bug - fix error message of communication-computation-overlap (#204) * Tool: Fix bug - Fix function naming issue in system info (#200) * CI/CD - Push images in GitHub Action (#202) * Bug - Fix torch.distributed command for single node (#201) * CLI - Integrate system info for node (#199) * Benchmarks: Code Revision - Revise CMake files for microbenchmarks. (#196) * CI/CD - Add ROCm image build in GitHub Actions (#194) * Bug: Fix bug - fix bug of hipBusBandwidth build (#193) * Benchmarks: Build Pipeline - Restore rocblas build logic (#197) * Bug: Fix Bug - Add barrier before 'destroy_process_group' in model benchmarks (#198) * Bug - Revise 'docker run' in sb deploy (#195) * Bug - Fix Bug : fix bug of error param operations to operation in rccl-bw of hpe config (#190) Co-authored-by: Yuting Jiang <v-yujiang@microsoft.com> Co-authored-by: Guoshuai Zhao <guzhao@microsoft.com> Co-authored-by: Ziyue Yang <ziyyang@microsoft.com>	2021-09-26 09:30:31 +08:00
Yuting Jiang	6076251816	Benchmarks: Code Revision - Revise arguments of nccl/rccl to support mpi mode and rename metric (#189 ) Description Revise arguments of nccl/rccl to support mpi mode for (mpi can not run in nccl/rccl due to multiple operators run in sequence without barrier) and rename metric . Major Revision - revise argument operators to be a single one Minor Revision - rename metric to remove benchmark name info - change argument ngpus default value to be 1	2021-09-03 14:23:19 +08:00
Yifan Xiong	e2453e1cae	Runner - Fix inventory issue in ansible_runner (#185 ) __Description__ Fix inventory bug in ansible_runner when host list is provided with multiple hosts. It ought to be handled by ansible_runner lib, workaround by using `--inventory` arg in cmdline.	2021-09-02 13:24:48 +08:00
guoshzhao	37d5dfd5ed	Benchmarks: Code Revision - revise the DockerBenchmark base class (#179 ) Description Revise the DockerBenchmark base to support image pull, image rm etc. Major Revision - image pull in _preprocess() - image clean in _postprocess() - execute customized commands in _benchmark() - add unit tests	2021-09-01 22:15:42 +08:00
Ziyue Yang	024a870be1	Benchmarks: Code Revision - Revise metric name generation and default config for disk performance benchmark (#175 ) Description This commit revises disk performance benchmark, including: 1) Add missing benchmark name in default config; 2) Avoid using reserved character ':' in metric name.	2021-08-31 19:21:42 +08:00
Ziyue Yang	b97197f08e	Benchmarks: Add Benchmark - Add GPU SM copy benchmark (#169 ) Description This commit adds gpu_sm_copy benchmark and related tests.	2021-08-30 18:54:26 +08:00
Yuting Jiang	f3d53c3d5f	Benchmarks: Add Benchmark - Add gemm flops microbenchmark for amd (#152 ) Description Add gemm flops microbenchmark for amd. Major Revision - Add gemm flops microbenchmark for amd. - Add related example and test file.	2021-08-30 13:40:46 +08:00
Yuting Jiang	b0df66f7a2	Benchmarks: Code Revision - Extract base class for gemm flops microbenchmark (#165 ) Description Extract base class for gemm flops microbenchmark. Major Revision - extract base class for gemm flops microbenchmark and add related test. - revise gemm_flops_performance for cuda.	2021-08-30 10:01:28 +08:00
guoshzhao	35114bae9d	Benchmarks: Code Revision - Rename kernel_launch_overhead metrics (#171 ) Description Rename `kernel_launch_overhead_event` to `event_overhead`, `kernel_launch_overhead_wall` to `wall_overhead`.	2021-08-28 06:36:41 +08:00
Yuting Jiang	666e3a9471	Benchmarks: Add Benchmark - Add memory bus bandwidth performance microbenchmark for amd (#153 ) Description Add memory bus bandwidth performance microbenchmark for amd. Major Revision - Add memory bus bandwidth performance microbenchmark for amd. - Add related example and test file.	2021-08-27 21:17:39 +08:00
Yuting Jiang	e5e84a2ece	Benchmarks: Code Revision - Extract base class for memory bandwidth microbenchmark (#159 ) Description extract base class for memory bandwidth microbenchmark. Major Revision - revise and optimize cuda_memory_bandwidth_performance - extract base class for memory bandwidth microbenchmark - add test for base class	2021-08-26 07:48:07 +08:00
Yuting Jiang	0583862d2d	Benchmarks: Code Revision - fix typo in test of nccl microbenchmark. (#163 ) Description fix typo in test_nccl_bw_performance.py. Major Revision - fix typo in test_nccl_bw_performance.py.	2021-08-23 13:53:47 +08:00
Ziyue Yang	6774d7b702	Benchmarks: Revise Benchmark - Add readwrite I/O pattern (#161 ) Description This commit adds readwrite I/O pattern for FIO benchmark. Read/write ratio is fixed at 4:1.	2021-08-22 22:38:25 +08:00
guoshzhao	7595d79434	Runner: Add Feature - Generate summarized output files. (#157 ) Description Generate the summarized output files from all nodes. For each metric, do the reduce operation according to the `reduce_op` Major Revision - Generate the summarized json file per node: For microbenchmark, the format is `{benchmark_name}/[{run_count}/]{metric_name}[:rank]` For modelbenchmark, the format is `{benchmark_name}/{sub_benchmark_name}/[{run_count}/]{metric_name}` `[]` means optional. ``` { "kernel-launch/overhead_event:0": 0.00583, "kernel-launch/overhead_event:1": 0.00545, "kernel-launch/overhead_event:2": 0.00581, "kernel-launch/overhead_event:3": 0.00572, "kernel-launch/overhead_event:4": 0.00559, "kernel-launch/overhead_event:5": 0.00591, "kernel-launch/overhead_event:6": 0.00562, "kernel-launch/overhead_event:7": 0.00586, "resnet_models/pytorch-resnet50/steptime-train-float32": 544.0827468410134, "resnet_models/pytorch-resnet50/throughput-train-float32": 353.7607016465773, "resnet_models/pytorch-resnet50/steptime-train-float16": 425.40482617914677, "resnet_models/pytorch-resnet50/throughput-train-float16": 454.0142363793973, "pytorch-sharding-matmul/0/allreduce": 10.561786651611328, "pytorch-sharding-matmul/1/allreduce": 10.561786651611328, "pytorch-sharding-matmul/0/allgather": 10.088025093078613, "pytorch-sharding-matmul/1/allgather": 10.088025093078613 } ``` - Generate the summarized jsonl file for all nodes, each line is the result from one node in json format.	2021-08-20 16:48:40 +08:00
Yifan Xiong	98b6c0e3ca	Runner - Support mpi mode (#146 ) Support mpi mode in runner: * concate mpirun command * support mca and env config * prepare hostfile and update Ansible host pattern Co-authored-by: Peng Cheng <chengpeng5555@outlook.com>	2021-08-19 15:59:17 +08:00
guoshzhao	7293e783f1	Benchmarks: Code Revision - change 'reduce' to 'reduce_op' (#156 ) Description Change the field name `reduce` to `reduce_op`.	2021-08-16 11:33:39 +08:00
guoshzhao	acf365a856	Benchmarks: Add Feature - Set reduce type for current benchmarks' metrics. (#149 ) Description Set reduce type for current benchmarks' metrics, including model benchmarks and ShardingMatmul.	2021-08-06 17:23:14 +08:00
guoshzhao	bc1a61b91a	Benchmarks: Code Revision - Calculate average value by using statistics module. (#148 ) Description Replace `sum(results) / len(results)` with `statistics.mean(results)`	2021-08-06 13:37:18 +08:00
guoshzhao	e41b1f6225	Benchmarks: Add Feature - Add reduce function support for output summary. (#147 ) Description Add reduce function support for output summary. Major Revision - Add reducer class to maintain all reduce functions. - Save reduce type of each metric into `BenchmarkResult` - Fix UT.	2021-08-05 16:52:49 +08:00
Yuting Jiang	e083a598cf	Benchmarks: Add Benchmark - Add NCCL performance benchmark (#113 ) Description Add NCCL performance microbenchmark. Major Revision - Add microbenchmark, example, test, config for NCCL	2021-07-26 10:54:47 +08:00
Yuting Jiang	b0c5addcac	Benchmarks: Add Benchmark - Add IB Loopback performance benchmark. (#112 ) Description Add RDMA Loopback performance microbenchmark. Major Revision - Add microbenchmark, example, test, config for RDMA Loopback	2021-07-24 03:40:24 +08:00
Ziyue Yang	db297fb4ed	Benchmarks: Add Benchmark - Add disk performance benchmark (#132 ) Description Add disk performance microbenchmark. Major Revision - Add microbenchmark, example, test, config for disk performance. Minor Revision - Fix bugs in executor unit test related to default enabled tests.	2021-07-23 14:49:05 +08:00
Ziyue Yang	477fbb0ad2	Benchmarks: Fix bug - fix bug in test_executor.py to test default enabled tests only (#133 ) Description Fix bug of tests/executor/test_executor.py. Major Revision - Test default enabled benchmarks only instead of all benchmarks.	2021-07-20 20:11:08 +08:00
Yuting Jiang	f9550bd693	Benchmarks: Add Benchmark - Add memory bandwidth benchmark for cuda. (#114 ) Add microbenchmark, example, test, config for cuda memory performance and Add cuda-samples(tag with cuda version) as git submodule and update related makefile	2021-07-13 17:30:19 +08:00
Yuting Jiang	71c1617b2e	Utils: Code Revision - Update network common utils (#118 ) Update network common utils. Add get_ib_devices in network common utils and move get_free_port from test utils to network common utils	2021-07-13 16:05:01 +08:00
guoshzhao	9c984c7eb0	Bug bash - Merge fix from release/0.2 to main (#124 ) * Bug Fix - Fix race condition issue for multi ranks (#117) Fix race condition issue when multi ranks rotating the same directory. * Update pipeline for release branch (#122) * Bug Fix - Fix bug when convert bool config to store_true argument. (#120) Co-authored-by: Yifan Xiong <yifan.xiong@microsoft.com>	2021-07-09 16:54:42 +08:00
Yifan Xiong	7458f83a9b	Runner & Executor - Support AMD GPU (#119 ) Support both NVIDIA and AMD GPU and check GPU vendor during deployment and execution. * Add GPU environment check in sb deploy. * Check GPU vendor in executor.	2021-07-09 00:42:49 +08:00
Yifan Xiong	fb7d4a7396	Runner - Fetch benchmarks results on all nodes (#116 ) Fetch benchmarks results on all nodes, will rsync after each benchmark. The results directory structure on control node is as follows: ``` outputs/ └── datetime ├── nodes │ └── node-0 │ ├── benchmarks │ │ ├── benchmark-0 │ │ │ ├── rank-0 │ │ │ │ └── results.json │ └── sb-exec.log ├── sb-run.log └── sb.config.yaml ```	2021-07-02 21:45:56 +08:00
Yifan Xiong	7b0b0e9add	CLI - Support custom output directory (#110 ) * Support custom output directory. * Update document.	2021-07-01 21:10:12 +08:00
guoshzhao	8ffaddfaef	Benchmarks: Fix Bug - Fix gemm kernel bug for nvidia v100. (#105 ) * fix bug for nvidia v100 * hard code the supported dict for different arch.	2021-06-29 18:46:44 +08:00
guoshzhao	9c7485276b	Benchmarks: Code Revision - Replace torch.optim.AdamW with transformers.AdamW. (#106 ) * replace torch.optim.AdamW with transformers.AdamW.	2021-06-28 15:24:39 +08:00
Yifan Xiong	c0c43b8f81	Bug bash - Fix bugs in multi GPU benchmarks (#98 ) * Add `sb deploy` command content. * Fix inline if-expression syntax in playbook. * Fix quote escape issue in bash command. * Add custom env in config. * Update default config for multi GPU benchmarks. * Update MANIFEST.in to include jinja2 template. * Require jinja2 minimum version. * Fix occasional duplicate output in Ansible runner. * Fix mixed color from Ansible and Python colorlog. * Update according to comments. * Change superbench.env from list to dict in config file.	2021-06-23 18:16:43 +08:00
Yifan Xiong	ddbc51a135	Bug bash - Fix bugs and refine log in single GPU benchmarks (#97 ) Fix bugs and refine log in single GPU benchmarks: * Fix none framework issue * Fix empty parameter bug * Remove missed mobilenet_v3 models * Change benchmark registration log to debug level * Add pid in logging * Add missing benchmarks in default config * Fix deprecated logging warn	2021-06-16 13:51:22 +08:00
guoshzhao	03b41be145	Benchmarks: Fix Bug - Fix OOM issue when run pytorch models sequentially. (#93 ) * Clean up the cache.	2021-06-07 10:19:05 +08:00
guoshzhao	2d9be807a9	Benchmarks: Fix Bug - Fix return code overwrite issue (#94 ) * fix return code reset issue	2021-06-04 18:02:12 +08:00
Yifan Xiong	6b0ca1cb05	Runner - Support local mode in runner (#88 ) * Support local mode in runner.	2021-06-02 23:58:44 +08:00
guoshzhao	44c5103b5c	Benchmarks: Code Revision - Change default shape of sharding-matmul. (#92 ) * Change default shape of sharding-matmul.	2021-06-02 10:50:09 +08:00
guoshzhao	6c6f526937	Benchmarks: Add Benchmark - Add FLOPs performance benchmark for cuda. (#87 ) * add cuda flops performance benchmark.	2021-06-02 09:15:58 +08:00
Yuting Jiang	83235433b2	Benchmarks: Add benchmark - add micro benchmark for cudnn test (#89 ) * add python related cudnn microbenchmark	2021-06-01 22:24:35 +08:00
Yifan Xiong	5e9f948df2	Executor - Save benchmark results to file (#86 ) * Save benchmark results to json file.	2021-05-31 13:05:12 +08:00
Yuting Jiang	18398fbaa2	Benchmarks: Add benchmark - add micro benchmark for cublas test (#80 ) * add benchmark for cublas test * format * revise error handling and test * add interface to read json file, revise json file path and include .json in packaging * add random_seed in arguments * revise preprocess of cublas benchmark * fix lint error and note error in source code * update according comments * revise input arguments from json file to custom str and convert json file to built-in dict list * restore package config * fit lint issue * update platform and comments * rename files to match source code dir and fix comments error Co-authored-by: root <root@sb-validation-000001.51z1chmys5fuzfqyo4niepozre.bx.internal.cloudapp.net>	2021-05-31 10:31:53 +08:00
Yifan Xiong	8b4f613a76	Runner - Support torch.distributed mode in runner (#81 ) * Support `torch.distributed` mode in runner. * Support given `proc_num` and `node_num` in `torch.distributed` mode.	2021-05-28 12:29:39 +08:00
Yifan Xiong	e7f6d8ba78	CI/CD - Add integration tests for Ansible playbooks (#82 ) * Add integration tests for Ansible playbooks * Add `gpu_vendor` var to bypass gpu mount	2021-05-26 20:04:49 +08:00
Yifan Xiong	c05e173b3d	Runner - Implement ansible client and runner (#69 ) Implement ansible client and runner: * add ansible client * add deploy and check_env playbooks	2021-05-23 23:53:37 +08:00

1 2 3 4

183 Коммитов