DeepSpeed

Commit graph

Author	SHA1	Message	Date
Nadav Elyahu	3b09d945ea	fix pipeline eval_batch micro_batches argument for schedule (#6484 ) Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>	2024-09-05 22:07:57 +00:00
Jinxing Pan	4f803852ac	Op_builder->is_compatible quite warning (#6093 ) Set the default value of op_builder/xxx.py/is_compatible()/verbose to False for quite warning. Add verbose judgement before op_builder/xxx.py/is_compatible()/self.warning(...). Otherwise the verbose arg will not work. --------- Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>	2024-09-05 17:03:34 +00:00
Nadav Elyahu	857780a85a	HPU: add required ENV vars to acccelerator init (#6495 ) Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>	2024-09-05 15:43:08 +00:00
Logan Adams	c210e601e3	Update version.txt after 0.15.1 release (#6493 ) Auto-generated PR to update version.txt after a DeepSpeed release Released version - 0.15.1 Author - @loadams Co-authored-by: loadams <loadams@users.noreply.github.com>	2024-09-04 18:34:42 -07:00
Alex Morehead	10ba3dde84	Handle an edge case where `CUDA_HOME` is not defined on ROCm systems (#6488 ) * Handles an edge case when building `gds` where `CUDA_HOME` is not defined on ROCm systems	2024-09-04 22:28:13 +00:00
Olatunji Ruwase	662a421b05	Safe usage of popen (#6490 ) Avoid shell=True security issues with Popen	2024-09-04 21:06:04 +00:00
Jiancheng Liu	ddd3571823	Add default value to "checkpoint_folder" in "load_state_dict" of bf16_optimizer (#6446 ) Add default value `checkpoint_folder=None` for compatibility. Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>	2024-09-04 20:32:02 +00:00
Olatunji Ruwase	5d1a30c033	DS_BUILD_OPS should build only compatible ops (#6489 ) Currently DS_BUILD_OPS=1 fails on incompatible ops. This is a deviation from [documentation](https://www.deepspeed.ai/tutorials/advanced-install/#pre-install-deepspeed-ops) which states that only compatible ops are built.	2024-09-04 20:30:56 +00:00
Masahiro Tanaka	ddeb0c19a0	Fix patch for parameter partitioning in zero.Init() (#6388 ) This PR fixes an issue addressed in #5921. With this change, we only apply the patch for parameter partitioning to classes that have `__init__` so that we can avoid applying the patch multiple times. The class that does not have `__init__` now uses its superclass's one. So this PR also applies the patch to the root class, `torch.nn.modules.module.Module`. Thanks @VeryLazyBoy for the report and initial solution. --------- Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>	2024-09-04 18:43:20 +00:00
Joshua C. Randall	9d17116fcd	print warning if actual triton cache dir is on NFS, not just for default (#6487 ) move the logic that prints a warning when triton cache dir is on NFS to act on the actual calculated cache_dir rather than on the default. this means that: - when the default directory (in the user's home directory) is on NFS but `TRITON_CACHE_DIR` is set to a non-NFS directory, no warning will be printed whereas prior to this change a spurious and confusing warning was printed - when the user's home directory is not on NFS but `TRITON_CACHE_DIR` is set to an NFS directory, a warning will be printed whereas prior to this change no warning would be printed fixes #6486	2024-09-04 18:22:07 +00:00
Olatunji Ruwase	5df12a4a85	DeepNVMe tutorial (#6449 ) Co-authored-by: Logan Adams <loadams@microsoft.com> Co-authored-by: jomayeri <deepspeed@H100-VM2.shlnn55tgwve1eacvp21ie45dg.jx.internal.cloudapp.net>	2024-09-04 15:31:31 +00:00
Nadav Elyahu	cfc6ed3722	bf16_optimizer: fixes to different grad acc dtype (#6485 ) - fix step function to cast to FP32 before step in case of different gradient accumulation data type - remove redundant function initialize_optimizer_states()	2024-09-04 15:27:26 +00:00
Logan Adams	9b7fc54524	Add workflow to build DS without torch to better test before releases (#6450 ) - Adds a nightly workflow that tests to confirm we can build DeepSpeed without torch as a dependency, as this often only surfaces when doing a release.	2024-08-29 23:43:21 +00:00
Raza Sikander	89c4d9f5a7	TestLowCpuMemUsage UT get device by device_name (#6397 ) Co-authored-by: Shaik Raza Sikander <srsikander@habana.ai> Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>	2024-08-29 22:05:20 +00:00
Ramya Ramineni	a7ffe540fc	Avoid gds build errors on ROCm (#6456 ) This PR is to avoid the below error during DeepSpeed build on ROCm. The error is because of the incompatibility of GDSBuilder extension on ROCm. ``` Traceback (most recent call last): File "<string>", line 1, in <module> File "/tmp/pip-req-build-lv1v39xc/setup.py", line 180, in <module> op_compatible = builder.is_compatible() File "/tmp/pip-req-build-lv1v39xc/op_builder/gds.py", line 47, in is_compatible CUDA_LIB64 = os.path.join(CUDA_HOME, "lib64") File "/opt/conda/envs/py_3.9/lib/python3.9/posixpath.py", line 76, in join a = os.fspath(a) TypeError: expected str, bytes or os.PathLike object, not NoneType Total number of unsupported CUDA function calls: 0 Total number of replaced kernel launches: 1 ---------------------------------------- ERROR: Command errored out with exit status 1: python setup.py egg_info Check the logs for full command output ``` cc: @jithunnair-amd --------- Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>	2024-08-29 17:15:52 +00:00
Yizhou Wang	0cd9bf5978	[CCL] fix condition issue in ccl.py (#6443 ) previous condition check is not right, it would cause this condition always be True. --------- Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>	2024-08-29 16:42:53 +00:00
Joe Mayer	f2739b4f72	Change GDS to 1 AIO thread (#6459 ) The `numThreads` config option determines how many threads are used to read from the file. In the CPU case these threads are created via AIO, in the GDS case they are handled by the GDS library via the cufile.json. If we were to also create AIO threads it would have a multiplicative effect. Example 8 AIO threads * 8 GDS threads would be 64 threads reading from the file when the user really only intended for 8 threads. Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>	2024-08-29 15:59:32 +00:00
Siddartha Naidu	4864991f53	Allow triton==3.0.x for fp_quantizer (#6447 ) Tested with triton==3.0.x and the kernel tests pass so adding as an allowed version. Triton 2.3.x is not well supported on arm64. Triton 3.0.0 is supported on arm64 and it appears the fp8 kernel works fine with triton==3.0.0 so this simplifies usage on arm hosts (GH200).	2024-08-28 18:04:40 +00:00
Roger Feng	405b6d5e33	Add the accelerator setup guide link in Getting Started page (#6452 ) Add the link of https://www.deepspeed.ai/tutorials/accelerator-setup-guide/ into the installation section in Getting Started page so that users can easily find the doc. Signed-off-by: roger feng <roger.feng@intel.com>	2024-08-28 16:55:33 +00:00
Raza Sikander	9bc4cd01b7	Store/Load CIFAR from local/offline (#6390 ) CIFAR10_DATASET_PATH -> Path where the dataset is stored STORE_CIFAR10 -> Store the dataset 1/0 CIFAR10_OFFLINE -> To use offline dataset 1/0 MISC: Added getDeviceId to get device if by name in case of accelerator --------- Co-authored-by: Shaik Raza Sikander <srsikander@habana.ai> Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com> Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com> Co-authored-by: Logan Adams <loadams@microsoft.com>	2024-08-28 16:28:22 +00:00
Raza Sikander	b5cf30a085	Dtype support check for accelerator in UTs (#6360 ) Check if the dtype is supported by the accelerator; if not, then skip --------- Co-authored-by: Shaik Raza Sikander <srsikander@habana.ai> Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com> Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>	2024-08-28 16:02:51 +00:00
Dogacan Colak	1041c8a172	Add documentation for launcher without SSH (#6455 ) #5728 --------- Co-authored-by: Logan Adams <loadams@microsoft.com>	2024-08-28 15:28:10 +00:00
Logan Adams	eb37cacf22	Revert "Revert "Fix torch check (#6402 )"" This reverts commit `77f61e6771`.	2024-08-27 13:02:44 -07:00
Logan Adams	77f61e6771	Revert "Fix torch check (#6402 )" This reverts commit `55b4cae80f`.	2024-08-27 13:02:07 -07:00
jiahao su	1bfa341bbd	add Huawei Ascend NPU setup guide (#6445 ) This PR adds the setup instructions for Huawei Ascend NPU. Please refer to the remainder of the guide for instructions on other devices. --------- Co-authored-by: sjh <sjh1270@163.com> Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com> Co-authored-by: Logan Adams <loadams@microsoft.com>	2024-08-27 18:15:48 +00:00
Sam Ade Jacobs	8ac42ed73a	Fix redundant seq data parallel grp argument in Z3/MiCS (#5352 ) Deprecate redundant sequence_data_parallel_group argument. Users/client code will control across which process group Z3 parameters will be partitioned from one of [None, data_parallel_group, sequence_data_parallel]. --------- Co-authored-by: Masahiro Tanaka <81312776+tohtana@users.noreply.github.com> Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>	2024-08-27 01:10:21 +00:00
Joe Mayer	e2654bfd1a	Fix Type Mismatch (#6410 ) `num_bytes_per_thread` was a smaller type than `file_num_bytes`, this caused issues when dividing by `num_threads`. Co-authored-by: jomayeri <deepspeed@H100-VM2.shlnn55tgwve1eacvp21ie45dg.jx.internal.cloudapp.net> Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>	2024-08-23 23:17:38 +00:00
Logan Adams	ca4449e843	Update version.txt after 0.15.0 release (#6403 ) Auto-generated PR to update version.txt after a DeepSpeed release Released version - 0.15.0 Author - @loadams Co-authored-by: loadams <loadams@users.noreply.github.com>	2024-08-22 15:48:34 -07:00
Logan Adams	55b4cae80f	Fix torch check (#6402 )	2024-08-22 15:46:10 -07:00
Michael Wyatt	0a4457cc48	Pydantic v2 migration (#5167 ) Pydantic v2 has been out for some time now. We have been relying on using the v1 API available in v2 until now. This is a refresh of #3902 to bring proper v2 support to DeepSpeed. Corresponding DeepSpeed-MII PR [here](https://github.com/microsoft/DeepSpeed-MII/pull/423). @loadams --------- Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com> Co-authored-by: Logan Adams <loadams@microsoft.com> Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com> Co-authored-by: Abhishek Kulkarni <11399+adk9@users.noreply.github.com> Co-authored-by: Abhishek Kulkarni <abkulkarni@microsoft.com> Co-authored-by: Lev Kurilenko <113481193+lekurile@users.noreply.github.com>	2024-08-22 15:38:13 -07:00
Logan Adams	8c2be7e942	Update to prepare for next release	2024-08-22 15:37:17 -07:00
Yizhou Wang	b81b197ec3	[XPU] API align with new intel pytorch extension release (#6395 ) with intel_extension_for_pytorch new release, we changed deepspeed kernel API. Need to align with upstream DeepSpeed xpu op builder. Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>	2024-08-22 17:24:38 +00:00
Guanhua Wang	51da191eb5	add pip install cutlass version check (#6393 ) fix this issue: https://github.com/microsoft/DeepSpeed/issues/6006 cc. @tjruwase	2024-08-22 17:18:17 +00:00
Logan Adams	0f0f231e8a	Correct op_builder path to xpu files for trigger XPU tests (#6398 )	2024-08-22 08:38:42 -07:00
Masahiro Tanaka	649b078571	Add Japanese translation of Windows support blog (#6394 ) This PR adds the Japanese translation of the release blog of Windows support.	2024-08-21 18:24:27 -07:00
Perry Zou	e6fcc226c7	fix fp16 Qwen2 series model to DeepSpeed-FastGen (#6028 ) based on PR #5403 (Qwen1.5-MOE) and #5219 (Qwen1.5), support Qwen2 series model. including: 0.5B, 1.5B, 7B, 57B-A14B, and 72B models. Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>	2024-08-21 23:59:39 +00:00
ranzhejiang	7260890452	reduce cpu host overhead when using moe (#5578 ) The operation `.to('cpu') `is not necessary for exp_counts, and it will cause device to host synchronization which damage performance. Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>	2024-08-21 21:52:48 +00:00
Sam Ade Jacobs	8b191d7ccf	Long sequence parallelism (Ulysses) integration with HuggingFace (#5774 ) This PR enhances capabilities of [DeepSpeed long sequence (context) parallelism (aka DS Ulysses)](https://dl.acm.org/doi/10.1145/3662158.3662806) with support for HuggingFace (and by extension other frameworks) models. With HF integration, users can use sequence parallelism for model pre/mid/post-training, finetuning etc. Usage requires both _torch >=2.2.2 and flash-attention_. ZeRO-1 and 2 are supported, ZeRO-3 and SPDA support in progress. Corresponding PR in HF is [PR32305](https://github.com/huggingface/transformers/pull/32305). --------- Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>	2024-08-21 01:46:50 +00:00
Joe Mayer	b65ea50631	GDS Swapping Fix (#6386 ) Fixing gds api call Co-authored-by: Ubuntu <deepspeed@H100-VM2.shlnn55tgwve1eacvp21ie45dg.jx.internal.cloudapp.net>	2024-08-20 23:25:37 +00:00
Jinxing Pan	96393f561d	Update linear.py compatible with torch 2.4.0 (#5811 ) deepspeed/runtime/zero/linear.py:67: FutureWarning: `torch.cuda.amp.custom_bwd(args...)` is deprecated. Please use `torch.amp.custom_bwd(args..., device_type='cuda')` instead. Fixes: #5682 --------- Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com> Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>	2024-08-20 17:22:02 +00:00
Joe Mayer	3831d4b57a	Bug Fix 5880 (#6378 ) Allowing hf args to be passed through class to AutoConfig.pretrained. Co-authored-by: Ubuntu <deepspeed@H100-VM2.shlnn55tgwve1eacvp21ie45dg.jx.internal.cloudapp.net> Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>	2024-08-20 16:04:37 +00:00
Olatunji Ruwase	01fe65b300	DeepSpeed on Window blog (#6364 ) DeepSpeed on Windows blog --------- Co-authored-by: Logan Adams <loadams@microsoft.com> Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>	2024-08-19 11:16:22 -07:00
Logan Adams	ba0ab7138d	Pin transformers version on nv-nightly (#6002 ) nv-nightly was failing due to updates in transformers, we will need to introduce a real fix for these, but this at least gets the test passing and we need to update transformers support for MII too.	2024-08-19 09:51:31 -07:00
Joe Mayer	5f0d177fd7	DeepNVMe GDS (#5852 ) PR for the GDS AIO code. --------- Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com> Co-authored-by: Logan Adams <loadams@microsoft.com> Co-authored-by: Ubuntu <deepspeed@H100-VM2.shlnn55tgwve1eacvp21ie45dg.jx.internal.cloudapp.net> Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>	2024-08-19 04:28:50 +00:00
Joe Mayer	c2e3a706b5	Add and Remove ZeRO 3 Hooks (#5658 ) Gives the ability to add and remove the forward hooks in ZeRO 3 by using a context manager. These code changes were taken from a Huggingface [PR](https://github.com/huggingface/trl/pull/1617) and integrated for direct support in DeepSpeed. This is useful in the inference case and the speedup can be observed [here](https://github.com/huggingface/trl/pull/1483). --------- Co-authored-by: root <root@deepspeed-c000004.2d1icxc5dsxehnpuwt3ifc34ph.gvxx.internal.cloudapp.net> Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com> Co-authored-by: Heyang Qin <heyangqin@microsoft.com> Co-authored-by: Logan Adams <loadams@microsoft.com>	2024-08-16 18:40:25 +00:00
Masahiro Tanaka	1ab1928d79	Enable dynamic shapes for pipeline parallel engine inputs (#5481 ) This PR enables dynamic shapes for inputs to pipeline parallel (PP) engine. Currently PP engine checks tensor shapes and allocates communication buffer at the first forward/backward passes. This causes a tensor shape mismatch error when input tensor shapes changed. This PR adds an option to check tensor shapes at every iteration and allocate buffer based on the shapes. As shown below, you can enable this feature by passing `dynamic_shape=True` to `PipelineModule`. Note that this might have a performance impact and the option is set to False as default. ```python model = PipelineModule( ... dynamic_shape=True ) ``` This will increase the overhead of buffer allocation and communication for tensor metadata. To mitigate the overhead, this PR also includes these improvements: - Consolidate multiple communication calls to send/recv tensor shapes `9f96ad4049` - Reuse (extend) communication buffer instead of creating a new one `b3c07504be` --------- Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>	2024-08-16 15:19:12 +00:00
Liran Bachar	4d4ff0eddd	Move inf_or_nan_tracker to cpu for cpu offload (#5826 ) Must use the same device as grad_partitions_flat_buffer --------- Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com> Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>	2024-08-15 22:42:32 +00:00
inkcherry	9a3ede7079	add moe topk(k>2) gate support (#5881 ) Notice some users need to use topk > 2 to train MoE models. For example: https://huggingface.co/Qwen/Qwen2-57B-A14B/blob/main/config.json, this PR adds support for topk (k > 2) gates. - add topk (k>2) support - add drop token policy based on position and probabilities. - unit tests --------- Co-authored-by: Kurt Chen <kurt.chen@intel.com> Co-authored-by: Jin, Youzhi <youzhi.jin@intel.com> Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com> Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com> Co-authored-by: Masahiro Tanaka <81312776+tohtana@users.noreply.github.com>	2024-08-15 17:43:45 +00:00
Rohan Potdar	30428d0318	move pynvml install to setup.py (#5840 ) Only install pynvml on nvidia gpus; not all accelerators	2024-08-15 16:27:10 +00:00
Logan Adams	3f6df9a236	Update version.txt after 0.14.5 release (#5982 ) Auto-generated PR to update version.txt after a DeepSpeed release Released version - 0.14.5 Author - @loadams Co-authored-by: loadams <loadams@users.noreply.github.com>	2024-08-15 11:06:46 -07:00

2573 commits

All branches