Commit Graph

140 Commits

Author SHA1 Message Date
Logan Adams e238351101
ROCm 6.0 prep changes (#4537)
* ROCm 6.0 prep changes

* PR feedback

* Try updating apex
2023-10-20 19:05:54 +00:00
Logan Adams 427253b94b
Update ROCm version (#4486)
* Update ROCm version

* Update .github/workflows/amd-mi200.yml

---------

Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com>
2023-10-10 17:41:28 +00:00
Michael Wyatt e9503fe40e
fix missing package 2023-10-09 09:53:41 -07:00
Michael Wyatt 923f3590ee
fix bad build command (#4483) 2023-10-09 09:44:39 -07:00
Logan Adams 2118c63591
Add release flow (#4467) 2023-10-06 14:44:27 -07:00
Michael Wyatt 4294ea172c
CI fix for torch 2.1 release (#4452)
* Fix for torch 2.1 release
Co-authored-by: Logan Adams <loadams@microsoft.com>
2023-10-05 15:31:24 -07:00
Logan Adams cd0d2ba2df
Enable ad-hoc running of cpu_inference (#4444) 2023-10-03 10:49:39 -07:00
Logan Adams 0636c74c5e
Update cp_inf workflow (#4424) 2023-09-29 15:02:47 -07:00
Yejing-Lai 388c84834f
add CPU autotp UT (#4263) 2023-09-27 22:39:24 +00:00
Logan Adams 58619402b5
Update nv-transformers workflow to use cu11.6 (#4412) 2023-09-27 19:51:20 +00:00
Logan Adams dcd3ae1954
Enable workflow dispatch on Torch 1.10 CI tests (#4361) 2023-09-19 15:24:13 -07:00
Conglong Li f876d81d34
DeepSpeed4Science (#4357)
* zero++ tutorial PR (#3783)

* [Fix] _conv_flops_compute when padding is a str and stride=1 (#3169)

* fix conv_flops_compute when padding is a str when stride=1

* fix error

* change type of paddings to tuple

* fix padding calculation

* apply formatting check

---------

Co-authored-by: Cheng Li <pistasable@gmail.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>

* fix interpolate flops compute (#3782)

* use `Flops Profiler` to test `model.generate()` (#2515)

* Update profiler.py

* pre-commit run --all-files

* Delete .DS_Store

* Delete .DS_Store

* Delete .DS_Store

---------

Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
Co-authored-by: Cheng Li <pistasable@gmail.com>

* revert PR #3611 (#3786)

* bump to 0.9.6

* ZeRO++ chinese blog (#3793)

* zeropp chinese blog

* try better quality images

* make title larger

* even larger...

* various fix

* center captions

* more fixes

* fix format

* remove staging trigger (#3792)

* DeepSpeed-Triton for Inference (#3748)

Co-authored-by: Stephen Youn <styoun@microsoft.com>
Co-authored-by: Arash Bakhtiari <arash@bakhtiari.org>
Co-authored-by: Cheng Li <pistasable@gmail.com>
Co-authored-by: Ethan Doe <yidoe@microsoft.com>
Co-authored-by: yidoe <68296935+yidoe@users.noreply.github.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

* ZeRO++ (#3784)

Co-authored-by: HeyangQin <heyangqin@microsoft.com>
Co-authored-by: GuanhuaWang <alexwgh333@gmail.com>
Co-authored-by: cmikeh2 <connorholmes@microsoft.com>
Co-authored-by: Ammar Ahmad Awan <ammar.awan@microsoft.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Reza Yazdani <reyazda@microsoft.com>

* adding zero++ to navigation panel of deepspeed.ai (#3796)

* Add ZeRO++ Japanese blog (#3797)

* zeropp chinese blog

* try better quality images

* make title larger

* even larger...

* various fix

* center captions

* more fixes

* fix format

* add ZeRO++ Japanese blog

* add links

---------

Co-authored-by: HeyangQin <heyangqin@microsoft.com>
Co-authored-by: Conglong Li <conglong.li@gmail.com>

* Bug Fixes for autotuner and flops profiler (#1880)

* fix autotuner when backward is not called

* fix format

---------

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>

* Missing strided copy for gated MLP (#3788)

Co-authored-by: Ammar Ahmad Awan <ammar.awan@microsoft.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>

* Requires grad checking. (#3789)

Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

* bump to 0.10.0

* Fix Bug in transform.cu (#3534)

* Bug fix

* Fixed formatting error

---------

Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>

* bug fix: triton importing error (#3799)

Co-authored-by: Stephen Youn <styoun@microsoft.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

* DeepSpeed4Science (#569)

* Integrating evoformer attention

* add cutlass version check

* Update error message

* add benchmark

* Update

* Update evoformer_attn.py

* Update run_evoformer_test.py

* Update evoformer_attn.py

* Update run_evoformer_test.py

* support more GPU archs

* add copyright

* add tests

* Fix bugs

* Update benchmark

* update

* Fix nvcc macro

* clean code

* fix formatting

* fix yaml import

* skip unit test when not compatible

* fix yaml requirement

* revert changes

* update tutorial

* update

* fix formatting

* fix format

* skip evoformer attn in pre-compile-ops

* revert changes

* update tutorial

* fix cutlass check

* update tutorial

* refactor tutorial

* revise

* Updated the Megatron-DS section (#565)

* Updated the Megatron-DS section

* minor fix

* minor fix

* minor fix

* separate evoformer tutorial

* Revised the ds4science landing page (#566)

* Updated the Megatron-DS section

* minor fix

* minor fix

* minor fix

* Revised the landing page

* Revised the landing page

* Removing unused file

* fix links image position

* modify main page

* fix doc

---------

Co-authored-by: Shiyang Chen <csycfl@gmail.com>
Co-authored-by: Minjia Zhang <33713995+minjiaz@users.noreply.github.com>

---------

Co-authored-by: Heyang Qin <heyangqin@microsoft.com>
Co-authored-by: Bill Luo <50068224+zhiruiluo@users.noreply.github.com>
Co-authored-by: Cheng Li <pistasable@gmail.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Guorun <84232793+CaffreyR@users.noreply.github.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
Co-authored-by: stephen youn <13525892+stephen-youn@users.noreply.github.com>
Co-authored-by: Stephen Youn <styoun@microsoft.com>
Co-authored-by: Arash Bakhtiari <arash@bakhtiari.org>
Co-authored-by: Ethan Doe <yidoe@microsoft.com>
Co-authored-by: yidoe <68296935+yidoe@users.noreply.github.com>
Co-authored-by: GuanhuaWang <alexwgh333@gmail.com>
Co-authored-by: cmikeh2 <connorholmes@microsoft.com>
Co-authored-by: Ammar Ahmad Awan <ammar.awan@microsoft.com>
Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com>
Co-authored-by: Reza Yazdani <reyazda@microsoft.com>
Co-authored-by: Masahiro Tanaka <81312776+tohtana@users.noreply.github.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Joe Mayer <114769929+jomayeri@users.noreply.github.com>
Co-authored-by: Ramya Ramineni <62723901+rraminen@users.noreply.github.com>
Co-authored-by: Shiyang Chen <csycfl@gmail.com>
Co-authored-by: Minjia Zhang <33713995+minjiaz@users.noreply.github.com>
2023-09-18 22:16:08 +00:00
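The DeepSpeed4Science entry above folds in the flops-profiler fix for `_conv_flops_compute` when `padding` is a string and `stride=1`. As a hypothetical sketch (not DeepSpeed's actual `_conv_flops_compute`), the pattern is to normalize string padding to an integer before computing output size; the `"same"` conversion below is only exact for stride 1 and an odd kernel size:

```python
def conv2d_flops(in_c, out_c, h, w, k, stride=1, padding=0):
    """FLOPs of a square-kernel 2D conv; padding may be an int or 'same'/'valid'."""
    if isinstance(padding, str):
        if padding == "same":
            pad = (k - 1) // 2  # exact only for stride=1 and odd k
        elif padding == "valid":
            pad = 0
        else:
            raise ValueError(f"unknown padding: {padding!r}")
    else:
        pad = padding
    out_h = (h + 2 * pad - k) // stride + 1
    out_w = (w + 2 * pad - k) // stride + 1
    macs = k * k * in_c * out_c * out_h * out_w  # one MAC per weight per output element
    return 2 * macs  # count multiply and add separately
```

With `padding="same"` and stride 1, output spatial size equals input size, which is the case the fix targets.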
Logan Adams 55d9964c59
Fix nv-inference/un-pin transformers (#4269)
* Fix bump_patch_version.py to update version.txt post GH release

* Un-pin transformers

* Unmerge changes from another branch
2023-09-05 23:51:23 +00:00
Logan Adams c93e89a38c
Add check that opening issues on CI failure requires schedule (#4242)
* Add check that opening issues on CI failure requires build to be scheduled

* Update ()
2023-09-05 16:06:59 +00:00
Lev Kurilenko f96c1c0a78
Pin Triton version to >=2.0.0 and <2.1.0 (#4251)
* Pin Triton version to 2.0.0

* Pin Triton version to < 2.1.0

* Add >=2.0.0

* pin transformers version

---------

Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com>
2023-09-01 16:47:52 -07:00
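The pin above constrains Triton to the half-open range `>=2.0.0,<2.1.0`. A minimal sketch of checking such a pin at runtime, assuming version strings like `"2.0.0"` or `"2.0.0+cu118"` (the real CI pin lives in requirements files, not code):

```python
def in_pinned_range(version, lo=(2, 0, 0), hi=(2, 1, 0)):
    """Return True if version satisfies >=lo,<hi (half-open range)."""
    core = version.split("+")[0]  # drop local version labels like "+cu118"
    parts = tuple(int(p) for p in core.split(".")[:3])
    return lo <= parts < hi
```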
Logan Adams 5dbc531328
Enable AMD MI200 and H100 to run on branches for testing (#4238) 2023-08-30 23:28:23 +00:00
Michael Wyatt 46d859a75d
pin transformers to last known good commit (#4174) 2023-08-18 14:46:09 -07:00
Lev Kurilenko a3540f17f8
Add DSE branch input to nv-ds-chat (#4173)
* Add DSE branch input to nv-ds-chat

* Use provided DSE branch

* Echo DSE branch
2023-08-18 14:37:19 -07:00
Lev Kurilenko 64c670ef02
Add DS-Chat CI workflow (#4127)
* Add DS Chat CI workflow

* Add CRITIC_CKPT_DIR env variable to actions.yml

* Update step 2 opt 125m ckpt dir name

* Update test dir

* Add workflow_dispatch

* Add :

* Add nv-ds-chat badge to main README

* Open GH issue if DS Chat CI fails

* Remove pull_request and merge_group conditions

* Update and test torch version

* Remove PR trigger

---------

Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
2023-08-17 04:11:35 +00:00
Logan Adams ff7d5275f2
Update torch1.9 tests to 1.10 to match latest accelerate. (#4126)
* Fix torch19 tests

* test pip list and --no-build-isolation

* Enable verbosity

* pin to older accelerate version

* Update oldest tested torch to 1.10

* Properly rename directories

* Return PR tests to CI again.

* Remove -vv
2023-08-10 22:49:04 +00:00
Logan Adams 0c75f4a3f9
Update nightly workflows to open an issue if CI fails (#3952)
* Update H100 workflow to open an issue if nightly CI fails

* Test running as not CI

* Add all nightly/switch envvar name

* Test with AMD

* Add way to get url, switch path of template

* Add additional checkout step

* Move actions checkout step

* Try absolute path with github workspace

* Create issue without template/path

* Re-enable and add debug logic

* add if failed()

* More debug

* Try without checkout action uses

* Rename file

* Update variables

* Update issue template

* Confirm removing permissions still work

* Revert "Confirm removing permissions still work"

This reverts commit e7c2915adc.

* Re-enable permissions

* Remove PR trigger for AMD MI200 tests

* Revert "Remove PR trigger for AMD MI200 tests"

This reverts commit 5c5c5fd67b.

* Test update_existing

* Switch to composite action

* Fix line ending encoding issue

* Switch failure to be a variable

* Test with second workflow

* Format fix

* Switch failure to always

* Switch back to previously working way

* Test permission changes

* Revert "Test permission changes"

This reverts commit e051da759b.

* Update existing bugs with newest build failure link

* Remove PR triggers for that were used for testing.
2023-08-09 23:44:22 +00:00
Michael Wyatt a7fe3bcc35
unpin datasets in UT (#4079) 2023-08-02 21:15:33 +00:00
Michael Wyatt 8e808392c8
Specify triton 2.0.0 requirement (#4008)
* specify triton 2.0.0 requirement

* fix for setup-venv action

* fix for install error

* fix torch install error
2023-07-21 18:13:08 +00:00
Logan Adams ceccfa3ef6
Make AMD/ROCm apex install to /blob to save test/compile time. (#3997)
* Re-use apex builds for AMD

* Build MI200 tests on PR

* Install on /blob for the first time

* Add the venv version

* Test

* Remove AMD tests from PR - were only needed for testing.
2023-07-19 23:32:14 +00:00
Yejing-Lai 7290aace9b
[CPU] Skip CPU support unimplemented error (#3633)
* skip cpu support unimplemented error and update cpu inference workflow

* add torch.bfloat16 to cuda_accelerator

* remove UtilsBuilder skip

* fused adam can build

* use cpu adam to implement fused adam

* enable zero stage 1 and 2 for synchronized accelerator (a.k.a. CPU)

* remove unused parameters

* remove skip FusedAdamBuilder; add supported_dtypes

* fix format

* Revert "fix format"

Revert "remove skip FusedAdamBuilder; add supported_dtypes"

Revert "remove unused parameters"

Revert "enable zero stage 1 and 2 for synchronized accelerator (a.k.a. CPU)"

Revert "use cpu adam to implement fused adam"

Revert "fused adam can build"

---------

Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Ma, Guokai <guokai.ma@intel.com>
2023-07-19 19:58:38 +00:00
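The "[CPU] Skip CPU support unimplemented error" entry above reflects a common accelerator-abstraction pattern: instead of failing at lookup time, a backend hands back a placeholder builder for ops it has not implemented, so the error surfaces only if the op is actually loaded. A hedged sketch with illustrative names (not DeepSpeed's real classes):

```python
class NotImplementedBuilder:
    """Placeholder returned for ops a backend does not implement."""
    def __init__(self, name):
        self.name = name

    def load(self):
        raise NotImplementedError(
            f"op '{self.name}' is not implemented on this accelerator")


def get_op_builder(name, implemented):
    """Look up a builder; fall back to a NotImplementedBuilder instead of raising."""
    builder = implemented.get(name)
    return builder if builder is not None else NotImplementedBuilder(name)
```

This lets unit tests skip cleanly on unsupported ops rather than erroring during collection.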
Michael Wyatt aef6c65ce3
Reduce Unit Test Times (Part 3) (#3850)
* add coverage report

* define env vars in shared action

* reduce time for longest running tests

* fix broken shared action

* reduce test time

* reducing Pipeline test times

* further reducing test times

* rework Z3 test

* testing new mp.pool and persistent dist envs

* fix import

* reuse distributed environment for tests with lots of param combos

* fix for dist teardown

* fix pickling issue with pool cache

* actually fix pickling problem

* avoid running pool cache stuff on non-distributed tests

* fix issues with nested mp.pool

* fix for nested pools in Pipeline Engine

* re-add params

* update workflows with pytest opts

* implement feedback

* resolve race condition with port selection

* Update tests/unit/common.py

---------

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
2023-07-12 00:35:49 +00:00
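The "resolve race condition with port selection" step in the entry above points at a familiar problem when many distributed test groups run in parallel on one host: hard-coded master ports collide. A minimal sketch of the usual remedy, asking the OS for an ephemeral port (a small race remains between closing the probe socket and the test binding it, so callers typically retry on failure):

```python
import socket

def get_free_port():
    """Ask the OS for a currently-free TCP port on localhost."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("127.0.0.1", 0))  # port 0 = let the OS pick
        return s.getsockname()[1]
```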
Michael Wyatt 52844f4956
Update workflows for merge queue (#3892)
* update workflow triggers for merge queue

* add branch specifier to trigger
2023-07-06 19:03:35 +00:00
Logan Adams 59c9b0914f
Update apex installation to resolve apex's pyproject.toml issues. (#3745) 2023-07-05 15:20:05 -07:00
Michael Wyatt 7c126f431c
update lightning version in CI (#3882) 2023-07-05 10:45:24 -07:00
Michael Wyatt fd1d2c6447
Reduce Unit Test Time (Part 2) (#3838)
* utilize shorter tests for MII

* use cached torch download

* rework zero++ unit tests

* formatting

---------

Co-authored-by: HeyangQin <heyangqin@microsoft.com>
2023-06-29 13:54:49 -07:00
Logan Adams c973e15711
Disable AMD test flows in YML (#3847)
* Disable AMD workflows in the YML

* Switch from PR to nightly so we can enable the flows here
2023-06-29 11:14:21 -07:00
Michael Wyatt 7726fc8d54
Reduce Unit Test Times (Part 1) (#3829)
* move torch19 tests to nightly

* make megatron apex install persistent on blob storage
2023-06-28 18:29:47 +00:00
Jeff Rasley 6102d128f2
Revert "Prevent hangs in CI during parallel run compilation (#2844)" (#3817)
This reverts commit 2b2be85f43.
2023-06-26 10:10:12 -07:00
Michael Wyatt 2b2be85f43
Prevent hangs in CI during parallel run compilation (#2844) 2023-06-26 16:11:29 +00:00
stephen youn 69d1b9f978
DeepSpeed-Triton for Inference (#3748)
Co-authored-by: Stephen Youn <styoun@microsoft.com>
Co-authored-by: Arash Bakhtiari <arash@bakhtiari.org>
Co-authored-by: Cheng Li <pistasable@gmail.com>
Co-authored-by: Ethan Doe <yidoe@microsoft.com>
Co-authored-by: yidoe <68296935+yidoe@users.noreply.github.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
2023-06-23 14:30:49 -07:00
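Triton-backed inference kernels like those in the entry above are typically optional, guarded by an import check with a plain fallback (a sketch of the pattern, not DeepSpeed's actual code; the softmax fallback here is illustrative):

```python
import math

try:
    import triton  # noqa: F401
    HAS_TRITON = True
except ImportError:
    HAS_TRITON = False


def softmax(xs, use_triton=HAS_TRITON):
    """Numerically stable softmax; dispatches to Triton when available."""
    if use_triton:
        raise NotImplementedError("a triton kernel would be dispatched here")
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]
```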
Jeff Rasley 52c6baa933
remove staging trigger (#3792) 2023-06-23 14:30:49 -07:00
Logan Adams dd59341001
Add H100 workflow and status badge. (#3754)
* Add H100 workflow

* Switch to nightly from PR - will let us check stability first
2023-06-21 16:48:08 +00:00
Logan Adams 1b40182312
Fix apex install bugs (#3741)
* Fix apex installation

* Switch install flag from build-opt to global-opt to fix missing cpp_ext

* Try installing with support for newer pip

* Add build packaging

* Update to latest

* Pin to specific commit while pyproject.toml is fixed
2023-06-13 09:57:29 -07:00
Michael Wyatt 8b8c7031fb
Skip tests on docs-only changes (#3651)
* skip test for docs-only changes

* add missing skip to blog changes
2023-05-31 10:53:31 -07:00
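The docs-only skip above is implemented in CI with path filters, but the decision itself is simple: run tests unless every changed file sits under an ignored prefix. A sketch assuming `docs/` and `blogs/` as the ignored prefixes (the real workflow uses GitHub's `paths-ignore` filters rather than Python):

```python
def is_docs_only(changed_files, ignored=("docs/", "blogs/")):
    """True if the change set is non-empty and touches only ignored paths."""
    return bool(changed_files) and all(
        f.startswith(ignored) for f in changed_files
    )
```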
Ma, Guokai 1f72082fc0
[CPU] Support Intel CPU inference (#3041)
* add fallback path for kernels used in megatron

* temporary numactl WA for SPR 56core

* adapt core allocation according to number of ranks

* add switch to turn on numactl

* detect number of cores on the system

* allow select a subset of the cores on the system to bind

* remove unneeded changes

* add ccl backend

* change nccl to ccl

* remove unused code

* add comm/ccl to ops

* initial ccl comm support

* first broadcast case passed

* add CCL_Backend to DeepSpeed

* support comm timer for CPU

* support barrier for comm backend

* support specify master address from deepspeed command line

* support pytorch 2.0

* remove 'block' from api

* Tweak for debug

Signed-off-by: Cao, Zhong Z <zhong.z.cao@intel.com>

* Remove unnecessary directory

Signed-off-by: Cao, Zhong Z <zhong.z.cao@intel.com>

* Add bf16 kernel support for inference

* Add temporary torch implement for cpu inference

* Add softmax ops cpu fallback for inference

* bind cores to numa domain as well

* merge latest change in gma/numactl

* initial bf16 kernel support with fallback path

* initial fallback path for bloom kernel injection

* fix softmax attn mask

* check KMP_AFFINITY to avoid conflict with numactl

* New CCLBackend which utilize TorchBackend for initialization

* rollback last change because there is result error

* fix bloom injection policy TP could not work issue.

injection_policy={BloomBlock: ("self_attention.dense", "mlp.dense_4h_to_h")}

* Use TorchBackend to initialize CCLBackend, make behavior consistent

* remove comm under deepspeed/ops

* add license header

* code clean up

* fix format issue

* remove magic number in main address

* add caching support but not turn on by default

* change name of inference_cuda_module to inference_module

* Check for is_synchronized_device in accelerator before get Event

* fix typo

* Fix fallback path of softmax kernel on CUDA device for the BF16 data type: because CUDA tril does not support BF16, enforce the fp32 data type

* add cpu backend files

* change CPU_Accelerator op_builder_dir

* remove cpu_kernel_path

* using CPU_Accelerator on non-cuda device

* fix deepspeed.op_builder => deepspeed.ops.op_builder

* add alias for num_gpus: num_accelerators

* allow loading cpu_builder in build stage

* Assume cuda available if torch not installed

* add oneccl_binding_pt to requirements

* move oneccl-binding-pt to separate requirements-cpu.txt

* add missing file

* use dependency_links in setuptools.setup() call for additional dependency links

* install oneccl_bind_pt in workflows

* change oneccl_bind_pt's version from 1.13 to 2.0

* use intel_extension_for_pytorch as indicator that CPU_Accelerator should be used

* Add indicator for Accelerator used

* change foo.c to foo.cpp

* exclude 'cpu' directory in CUDA op builder reflection

* add a cpu-inference workflow

* run cpu-inference workflow on self-hosted instance

* change cpu runs-on node to v100 node

* print out python version in workflow

* add verbose in pip command to understand oneccl_bind_pt install issue

* update cpu-inference workflow

* add a stage to detect instance instruction sets

* add back bf16 support for CPU inference

* enable autoTP for bloom

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* update workflow to detect cpu instruction sets

* temporary WA for Intel Extension for PyTorch AVX2 instruction set detection

* change cpu-inference workflow machine to ubuntu-20.04

* add sharded checkpoint loading for AutoTP path to reduce the peak memory in initialization stage

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* enable policy for llama

* use a special build ipex to test avx2 detection fix

* fix format

* fix test fail issue

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* fix gptj sharded checkpoint loading problem

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* return a not implemented build in get_op_builder in cpu_backend

* support cpu device in tests

* use cpuinfo to extract number of CPUs

* use ~/tmp as transformer cache rather than /blob/

* Add support for mpich launcher with prefer_deepspeed_comm

* add missing modification in accelerator

* enable IMPI launcher

* remove unused file and fix formatting

* clean up ccl.cpp

* Less confusing error message when certain op builders are not implemented

* Fix license header

* Add license header

* add license headers

* add license header

* fix cuda specific code in test

* update CPU workflow

* use numactl to bind to core

* allow bind_cores_to_rank in multi-node impi runner

* fix format error

* Remove InferenceBuilder

* fix format error in numa.py

* check whether op is in installed ops in ds_report.py

* allow override accelerator with DS_ACCELERATOR='cuda','cpu' or 'xpu'

* lazy init class_dict in CUDA_Accelerator to avoid cyclic initialization of CUDA_Accelerator

* put short path in the beginning in real_accelerator.py

* device_count return number of NUMA nodes

* fix typo

* install numactl in cpu workflow

* Follow comments

* Better implementation of device_count() and current_device()

* remove dependency_link for Intel Extension for DeepSpeed

* use check is_synchronized_device in timer only once

* remove env mapping WA in cpu_accelerator

* fix duplicate definition

* fix format error

* refine ccl backend selection

* move comments to the right place

* remove prefer_deepspeed_comm, use CCLBackend by default

* refactor fallback path

* Fix execution failure in kernel injection path

* do not refactor kernel injection fallback path in residual_add because it contains a function call with side effects

* guard residual_add fallback path with environ DS_KI_FALLBACK=True

* fix format error

* add test for allreduce on CPU workflow

* fix format error

* Fallback to TorchBackend if CCLBackend kernel are not implemented

* Update Intel Extension for Pytorch installation link

* Don't specify version number of Intel Extension for PyTorch

* install oneCCL for CCLBackend

* fix link path for CPU comm kernels

* fix source oneCCL environment

* source oneCCL env before run UT

* Give more specific instruction when CCL_ROOT not defined

---------

Signed-off-by: Cao, Zhong Z <zhong.z.cao@intel.com>
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
Co-authored-by: sdp <sdp@aia-sdp-spr-108864.jf.intel.com>
Co-authored-by: Cao, Zhong Z <zhong.z.cao@intel.com>
Co-authored-by: Zhenhuan Chen <zhenhuan.chen@intel.com>
Co-authored-by: baodii <di.bao@intel.com>
Co-authored-by: Wang, Yi A <yi.a.wang@intel.com>
Co-authored-by: jianan-gu <jianan.gu@intel.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
2023-05-16 11:59:22 -04:00
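The CPU-support entry above adds an override so users can force an accelerator with `DS_ACCELERATOR='cuda'`, `'cpu'`, or `'xpu'`. A hedged sketch of that resolution logic; the accepted names come from the commit message, while the detection fallback and function name here are illustrative only:

```python
import os

def resolve_accelerator(env=None, detected="cuda"):
    """Honor a DS_ACCELERATOR override, else fall back to the detected device."""
    env = os.environ if env is None else env
    name = env.get("DS_ACCELERATOR")
    if name is None:
        return detected
    if name not in ("cuda", "cpu", "xpu"):
        raise ValueError(f"unsupported DS_ACCELERATOR: {name!r}")
    return name
```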
digger-yu b3956dc9e3
change actions/checkout@v2 to v3 (#3526) 2023-05-12 08:47:05 -07:00
Connor Holmes 0a61d5d664
Hybrid Engine Refactor and Llama Inference Support (#3425)
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
2023-05-03 17:20:07 -07:00
Michael Wyatt 3031eec44e
Update DS-Chat issue template (#3368)
* request log output

* add more details
2023-04-24 17:45:36 -04:00
Logan Adams 4de4d2acc6
Add pre-compiling ops test (#3277)
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
2023-04-19 07:23:20 -07:00
Logan Adams 61a8b3a0ee
Update MI200 workflow to install apex with changes from pip (#3294) 2023-04-19 05:41:21 +00:00
Michael Wyatt bcccee4d85
Fix cupy install version detection (#3276)
* updated cupy install

* do non-isolated pip install

* Update action.yml
2023-04-18 17:13:35 +00:00
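The cupy version-detection fix above hinges on mapping the CUDA toolkit version (e.g. `torch.version.cuda`) to the matching CuPy wheel name. A sketch assuming the modern `cupy-cuda11x`/`cupy-cuda12x` package naming; older CuPy releases used per-minor-version names instead, which is exactly the kind of detail such detection has to get right:

```python
def cupy_package(cuda_version):
    """Pick a CuPy wheel name for a CUDA version string like '11.7', or None for no CUDA."""
    if cuda_version is None:
        return "cupy"  # source/CPU build
    major = int(cuda_version.split(".")[0])
    return f"cupy-cuda{major}x"
```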
Michael Wyatt 6fc8e33c12
Create deepspeed_chat_bug_report.md 2023-04-14 09:44:22 -07:00
Michael Wyatt a8f999e3c4
Update DeepSpeed-Chat docs with latest changes to scripts (#3219)
* update docs to reflect changes in deepspeed-chat training script

* add blogs to ignored changes in unit tests
2023-04-13 16:35:30 -07:00
Logan Adams 9408a8666c
Update AMD workflows (#3179)
* Update AMD workflows

* Update MI200 test flow to use torch latest

* Update tolerances to values that pass (will fix before completing PR)

* Revert changes to atol

* Rename workflows

* Fix CI badges
2023-04-12 14:20:59 -07:00
Stas Bekman 20ed15be04
[ci] `nv-transformers-v100` - use the same torch version as transformers CI (#3096)
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
2023-04-05 10:03:27 -07:00