**Description**

Cherry-pick bug fixes from v0.10.0 to main.

**Major Revisions**

* Benchmarks: Microbenchmark - Support different hipblasLt data types in dist_inference #590
* Benchmarks: Microbenchmark - Support in-place for NCCL/RCCL benchmark #591
* Bug Fix - Fix NUMA domains swap issue in NDv4 topology file #592
* Benchmarks: Microbenchmark - Add data type option for NCCL and RCCL tests #595
* Benchmarks: Bug Fix - Align metrics of dist-inference-cpp with the PyTorch version #596
* CI/CD - Add NDv5 topology file #597
* Benchmarks: Microbenchmark - Improve AMD GPU P2P performance with fine-grained GPU memory #593
* Benchmarks: Build Pipeline - Fix NCCL and NCCL test versions to 2.18.3 to resolve the hang issue in the CUDA 12.2 docker image #599
* Dockerfile - Bug fix for ROCm docker build and deploy #598
* Benchmarks: Microbenchmark - Adapt to hipblasLt data type changes #603
* Benchmarks: Microbenchmark - Update hipblasLt metric unit to TFLOPS #604
* Monitor - Upgrade pyrsmi to the amdsmi Python library #601
* Benchmarks: Microbenchmark - Add FP8 and initialization for hipblasLt benchmark #605
* Dockerfile - Add ROCm 6.0 dockerfile #602
* Bug Fix - Bug fix for the latest Megatron-LM benchmark #600
* Docs - Upgrade version and release note #606

Co-authored-by: Ziyue Yang <ziyyang@microsoft.com>
Co-authored-by: Yang Wang <yangwang1@microsoft.com>
Co-authored-by: Yuting Jiang <yutingjiang@microsoft.com>
Co-authored-by: guoshzhao <guzhao@microsoft.com>
This commit is contained in:
Yifan Xiong authored on 2024-01-07 21:40:52 -08:00; committed by GitHub
Parent: 2c2096ed83
Commit: 2c88db907f
No key found matching this signature
GPG key ID: 4AEE18F83AFDEB23
56 changed files: 920 additions and 241 deletions

.github/workflows/build-image.yml (26 changed lines)

@@ -18,6 +18,7 @@ jobs:
   docker:
     name: Docker build ${{ matrix.name }}
     runs-on: ${{ matrix.runner }}
+    timeout-minutes: 600
     permissions:
       contents: read
       packages: write
@@ -27,15 +28,23 @@ jobs:
         - name: cuda12.2
           dockerfile: cuda12.2
           tags: superbench/main:cuda12.2
-          runner: ubuntu-latest
+          runner: [self-hosted, rocm-build]
+          build_args: "NUM_MAKE_JOBS=64"
         - name: cuda11.1.1
           dockerfile: cuda11.1.1
           tags: superbench/main:cuda11.1.1,superbench/superbench:latest
           runner: ubuntu-latest
+          build_args: "NUM_MAKE_JOBS=8"
         - name: rocm5.7
           dockerfile: rocm5.7.x
           tags: superbench/main:rocm5.7
           runner: [self-hosted, rocm-build]
+          build_args: "NUM_MAKE_JOBS=64"
+        - name: rocm6.0
+          dockerfile: rocm6.0.x
+          tags: superbench/main:rocm6.0
+          runner: [self-hosted, rocm-build]
+          build_args: "NUM_MAKE_JOBS=64"
     steps:
       - name: Checkout
         uses: actions/checkout@v2
@@ -75,7 +84,7 @@ jobs:
           fi
           DOCKERFILE=dockerfile/${{ matrix.dockerfile }}.dockerfile
-          BUILD_ARGS="NUM_MAKE_JOBS=8"
+          BUILD_ARGS=${{ matrix.build_args }}
           if [[ "${{ matrix.extra_args }}" ]]; then
             BUILD_ARGS="${BUILD_ARGS} ${{ matrix.extra_args }}"
           fi
@@ -87,11 +96,11 @@ jobs:
             CACHE_TO="type=inline,mode=max"
           fi
-          echo ::set-output name=dockerfile::${DOCKERFILE}
-          echo ::set-output name=build_args::${BUILD_ARGS}
-          echo ::set-output name=tags::${TAGS}
-          echo ::set-output name=cache_from::${CACHE_FROM}
-          echo ::set-output name=cache_to::${CACHE_TO}
+          echo "dockerfile=${DOCKERFILE}" >> "$GITHUB_OUTPUT"
+          echo "build_args=${BUILD_ARGS}" >> "$GITHUB_OUTPUT"
+          echo "tags=${TAGS}" >> "$GITHUB_OUTPUT"
+          echo "cache_from=${CACHE_FROM}" >> "$GITHUB_OUTPUT"
+          echo "cache_to=${CACHE_TO}" >> "$GITHUB_OUTPUT"
       - name: Echo build args
         run: echo ${{ steps.metadata.outputs.build_args }}
       - name: Echo image tag
@@ -106,6 +115,9 @@ jobs:
         with:
           username: ${{ secrets.DOCKERHUB_USERNAME }}
           password: ${{ secrets.DOCKERHUB_TOKEN }}
+      - name: Pull cache image
+        run: sudo docker pull ${{ steps.metadata.outputs.tags }}
+        continue-on-error: true
       - name: Login to the GitHub Container Registry
         uses: docker/login-action@v1
         if: ${{ github.event_name == 'release' }}
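The hunk above migrates the deprecated `::set-output` workflow command to the `$GITHUB_OUTPUT` file mechanism. A minimal sketch of how the new mechanism works (simulated here with a temp file; on a real runner, GitHub Actions provides the file path and later steps read values via `steps.<id>.outputs.<key>`):

```shell
# Each step output becomes a key=value line appended to the file that
# $GITHUB_OUTPUT points at.
GITHUB_OUTPUT="$(mktemp)"
echo "dockerfile=dockerfile/rocm6.0.x.dockerfile" >> "$GITHUB_OUTPUT"
echo "tags=superbench/main:rocm6.0" >> "$GITHUB_OUTPUT"
cat "$GITHUB_OUTPUT"
```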

.gitmodules (6 changed lines)

@@ -24,3 +24,9 @@
 [submodule "third_party/msccl"]
 	path = third_party/msccl
 	url = https://github.com/Azure/msccl
+[submodule "third_party/Megatron/Megatron-LM"]
+	path = third_party/Megatron/Megatron-LM
+	url = https://github.com/NVIDIA/Megatron-LM.git
+[submodule "third_party/Megatron/Megatron-DeepSpeed"]
+	path = third_party/Megatron/Megatron-DeepSpeed
+	url = https://github.com/microsoft/Megatron-DeepSpeed.git
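The new entries can be inspected with `git config`, which parses the `.gitmodules` format directly. The sketch below writes a sample copy of one new entry so it runs outside the repo checkout; inside the checkout you would pass the real `.gitmodules`:

```shell
# Write a sample .gitmodules entry (copied from the hunk above) and list the
# recorded submodule keys and values.
cat > /tmp/gitmodules.sample <<'EOF'
[submodule "third_party/Megatron/Megatron-LM"]
	path = third_party/Megatron/Megatron-LM
	url = https://github.com/NVIDIA/Megatron-LM.git
EOF
git config -f /tmp/gitmodules.sample --get-regexp submodule
```

After pulling this change, `git submodule update --init` is needed to actually populate the new paths.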

@@ -15,7 +15,7 @@
 __SuperBench__ is a validation and profiling tool for AI infrastructure.
 
-📢 [v0.9.0](https://github.com/microsoft/superbenchmark/releases/tag/v0.9.0) has been released!
+📢 [v0.10.0](https://github.com/microsoft/superbenchmark/releases/tag/v0.10.0) has been released!
 
 ## _Check [aka.ms/superbench](https://aka.ms/superbench) for more details._

@@ -7,7 +7,7 @@ FROM nvcr.io/nvidia/pytorch:23.10-py3
 # NVIDIA:
 #   - CUDA: 12.2.2
 #   - cuDNN: 8.9.5
-#   - NCCL: v2.19.3-1
+#   - NCCL: v2.18.3-1
 # Mellanox:
 #   - OFED: 23.07-0.5.1.2
 #   - HPC-X: v2.16
@@ -113,6 +113,13 @@ RUN cd /tmp && \
     mv amd-blis /opt/AMD && \
     rm -rf aocl-blis-linux-aocc-4.0.tar.gz
 
+# Install NCCL 2.18.3
+RUN cd /tmp && \
+    git clone -b v2.18.3-1 https://github.com/NVIDIA/nccl.git && \
+    cd nccl && \
+    make -j src.build && \
+    make install && \
+    rm -rf /tmp/nccl
+
 ENV PATH="${PATH}" \
     LD_LIBRARY_PATH="/usr/local/lib:${LD_LIBRARY_PATH}" \

@@ -54,6 +54,8 @@ RUN curl -s -L https://dist.nuget.org/win-x86-commandline/latest/nuget.exe -o "%
 # Run the setup script to install the visual studio components
 RUN "%SB_HOME%\\dockerfile\\directx\\install-components.bat"
 
+RUN powershell -Command "Set-ItemProperty -Path HKLM:\SYSTEM\CurrentControlSet\Control\FileSystem -Name LongPathsEnabled -Value 1;"
+RUN git config --system core.longpaths true
+
 # Install Superbench
 RUN python -m pip install setuptools==65.0.0 && \
     python -m pip install --no-cache-dir .[amdworker] && \

@@ -1,34 +1,34 @@
 <system version="1">
   <cpu numaid="0" affinity="0000ffff,0000ffff" arch="x86_64" vendor="AuthenticAMD" familyid="23" modelid="49">
     <pci busid="ffff:ff:01.0" class="0x060400" link_speed="16 GT/s" link_width="16">
-      <pci busid="0001:00:00.0" class="0x030200" link_speed="16 GT/s" link_width="16"/>
-      <pci busid="0101:00:00.0" class="0x020700" link_speed="16 GT/s" link_width="16"/>
-      <pci busid="0002:00:00.0" class="0x030200" link_speed="16 GT/s" link_width="16"/>
-      <pci busid="0102:00:00.0" class="0x020700" link_speed="16 GT/s" link_width="16"/>
-    </pci>
-  </cpu>
-  <cpu numaid="1" affinity="0000ffff,0000ffff" arch="x86_64" vendor="AuthenticAMD" familyid="23" modelid="49">
-    <pci busid="ffff:ff:02.0" class="0x060400" link_speed="16 GT/s" link_width="16">
       <pci busid="0003:00:00.0" class="0x030200" link_speed="16 GT/s" link_width="16"/>
       <pci busid="0103:00:00.0" class="0x020700" link_speed="16 GT/s" link_width="16"/>
       <pci busid="0004:00:00.0" class="0x030200" link_speed="16 GT/s" link_width="16"/>
       <pci busid="0104:00:00.0" class="0x020700" link_speed="16 GT/s" link_width="16"/>
     </pci>
   </cpu>
-  <cpu numaid="2" affinity="0000ffff,0000ffff" arch="x86_64" vendor="AuthenticAMD" familyid="23" modelid="49">
-    <pci busid="ffff:ff:03.0" class="0x060400" link_speed="16 GT/s" link_width="16">
-      <pci busid="000b:00:00.0" class="0x030200" link_speed="16 GT/s" link_width="16"/>
-      <pci busid="0105:00:00.0" class="0x020700" link_speed="16 GT/s" link_width="16"/>
-      <pci busid="000c:00:00.0" class="0x030200" link_speed="16 GT/s" link_width="16"/>
-      <pci busid="0106:00:00.0" class="0x020700" link_speed="16 GT/s" link_width="16"/>
+  <cpu numaid="1" affinity="0000ffff,0000ffff" arch="x86_64" vendor="AuthenticAMD" familyid="23" modelid="49">
+    <pci busid="ffff:ff:02.0" class="0x060400" link_speed="16 GT/s" link_width="16">
+      <pci busid="0001:00:00.0" class="0x030200" link_speed="16 GT/s" link_width="16"/>
+      <pci busid="0101:00:00.0" class="0x020700" link_speed="16 GT/s" link_width="16"/>
+      <pci busid="0002:00:00.0" class="0x030200" link_speed="16 GT/s" link_width="16"/>
+      <pci busid="0102:00:00.0" class="0x020700" link_speed="16 GT/s" link_width="16"/>
     </pci>
   </cpu>
-  <cpu numaid="3" affinity="0000ffff,0000ffff" arch="x86_64" vendor="AuthenticAMD" familyid="23" modelid="49">
-    <pci busid="ffff:ff:04.0" class="0x060400" link_speed="16 GT/s" link_width="16">
+  <cpu numaid="2" affinity="0000ffff,0000ffff" arch="x86_64" vendor="AuthenticAMD" familyid="23" modelid="49">
+    <pci busid="ffff:ff:03.0" class="0x060400" link_speed="16 GT/s" link_width="16">
       <pci busid="000d:00:00.0" class="0x030200" link_speed="16 GT/s" link_width="16"/>
       <pci busid="0107:00:00.0" class="0x020700" link_speed="16 GT/s" link_width="16"/>
       <pci busid="000e:00:00.0" class="0x030200" link_speed="16 GT/s" link_width="16"/>
       <pci busid="0108:00:00.0" class="0x020700" link_speed="16 GT/s" link_width="16"/>
     </pci>
   </cpu>
+  <cpu numaid="3" affinity="0000ffff,0000ffff" arch="x86_64" vendor="AuthenticAMD" familyid="23" modelid="49">
+    <pci busid="ffff:ff:04.0" class="0x060400" link_speed="16 GT/s" link_width="16">
+      <pci busid="000b:00:00.0" class="0x030200" link_speed="16 GT/s" link_width="16"/>
+      <pci busid="0105:00:00.0" class="0x020700" link_speed="16 GT/s" link_width="16"/>
+      <pci busid="000c:00:00.0" class="0x030200" link_speed="16 GT/s" link_width="16"/>
+      <pci busid="0106:00:00.0" class="0x020700" link_speed="16 GT/s" link_width="16"/>
+    </pci>
+  </cpu>
 </system>
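A quick way to sanity-check a topology file like the one above is to print which GPU bus IDs (PCI class `0x030200`) fall under each `numaid`. The sketch below runs on a trimmed two-node excerpt of the corrected layout; the `awk` one-liner is illustrative, not part of SuperBench:

```shell
# Embed a minimal excerpt of the corrected topology for demonstration.
cat > /tmp/ndv4_excerpt.xml <<'EOF'
<system version="1">
  <cpu numaid="0">
    <pci busid="ffff:ff:01.0" class="0x060400">
      <pci busid="0003:00:00.0" class="0x030200"/>
    </pci>
  </cpu>
  <cpu numaid="1">
    <pci busid="ffff:ff:02.0" class="0x060400">
      <pci busid="0001:00:00.0" class="0x030200"/>
    </pci>
  </cpu>
</system>
EOF
# Track the current numaid, then print the bus ID of every GPU-class device.
awk 'match($0, /numaid="[0-9]+"/) { numa = substr($0, RSTART + 8, RLENGTH - 9) }
     /class="0x030200"/ { match($0, /busid="[^"]+"/)
                          print "numa " numa ": " substr($0, RSTART + 7, RLENGTH - 8) }' \
    /tmp/ndv4_excerpt.xml
# prints:
# numa 0: 0003:00:00.0
# numa 1: 0001:00:00.0
```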

@@ -0,0 +1,38 @@
<system version="1">
<cpu numaid="0" affinity="ffffffff,ffff0000,00000000" arch="x86_64" vendor="GenuineIntel" familyid="6" modelid="143">
<pci busid="ffff:ff:01.0" class="0x060400" link_speed="32.0 GT/s PCIe" link_width="16" vendor="0x0000" device="0x0000" subsystem_vendor="0x0000" subsystem_device="0x0000">
<pci busid="0001:00:00.0" class="0x030200" link_speed="32.0 GT/s PCIe" link_width="16"/>
<pci busid="0101:00:00.0" class="0x020700" link_speed="32.0 GT/s PCIe" link_width="16"/>
</pci>
<pci busid="ffff:ff:02.0" class="0x060400" link_speed="32.0 GT/s PCIe" link_width="16" vendor="0x0000" device="0x0000" subsystem_vendor="0x0000" subsystem_device="0x0000">
<pci busid="0002:00:00.0" class="0x030200" link_speed="32.0 GT/s PCIe" link_width="16"/>
<pci busid="0102:00:00.0" class="0x020700" link_speed="32.0 GT/s PCIe" link_width="16"/>
</pci>
<pci busid="ffff:ff:03.0" class="0x060400" link_speed="32.0 GT/s PCIe" link_width="16" vendor="0x0000" device="0x0000" subsystem_vendor="0x0000" subsystem_device="0x0000">
<pci busid="0003:00:00.0" class="0x030200" link_speed="32.0 GT/s PCIe" link_width="16"/>
<pci busid="0103:00:00.0" class="0x020700" link_speed="32.0 GT/s PCIe" link_width="16"/>
</pci>
<pci busid="ffff:ff:04.0" class="0x060400" link_speed="32.0 GT/s PCIe" link_width="16" vendor="0x0000" device="0x0000" subsystem_vendor="0x0000" subsystem_device="0x0000">
<pci busid="0008:00:00.0" class="0x030200" link_speed="32.0 GT/s PCIe" link_width="16"/>
<pci busid="0104:00:00.0" class="0x020700" link_speed="32.0 GT/s PCIe" link_width="16"/>
</pci>
</cpu>
<cpu numaid="1" affinity="00000000,0000ffff,ffffffff" arch="x86_64" vendor="GenuineIntel" familyid="6" modelid="143">
<pci busid="ffff:ff:05.0" class="0x060400" link_speed="32.0 GT/s PCIe" link_width="16" vendor="0x0000" device="0x0000" subsystem_vendor="0x0000" subsystem_device="0x0000">
<pci busid="0009:00:00.0" class="0x030200" link_speed="32.0 GT/s PCIe" link_width="16"/>
<pci busid="0105:00:00.0" class="0x020700" link_speed="32.0 GT/s PCIe" link_width="16"/>
</pci>
<pci busid="ffff:ff:06.0" class="0x060400" link_speed="32.0 GT/s PCIe" link_width="16" vendor="0x0000" device="0x0000" subsystem_vendor="0x0000" subsystem_device="0x0000">
<pci busid="000a:00:00.0" class="0x030200" link_speed="32.0 GT/s PCIe" link_width="16"/>
<pci busid="0106:00:00.0" class="0x020700" link_speed="32.0 GT/s PCIe" link_width="16"/>
</pci>
<pci busid="ffff:ff:07.0" class="0x060400" link_speed="32.0 GT/s PCIe" link_width="16" vendor="0x0000" device="0x0000" subsystem_vendor="0x0000" subsystem_device="0x0000">
<pci busid="000b:00:00.0" class="0x030200" link_speed="32.0 GT/s PCIe" link_width="16"/>
<pci busid="0107:00:00.0" class="0x020700" link_speed="32.0 GT/s PCIe" link_width="16"/>
</pci>
<pci busid="ffff:ff:08.0" class="0x060400" link_speed="32.0 GT/s PCIe" link_width="16" vendor="0x0000" device="0x0000" subsystem_vendor="0x0000" subsystem_device="0x0000">
<pci busid="000c:00:00.0" class="0x030200" link_speed="32.0 GT/s PCIe" link_width="16"/>
<pci busid="0108:00:00.0" class="0x020700" link_speed="32.0 GT/s PCIe" link_width="16"/>
</pci>
</cpu>
</system>

@@ -17,6 +17,7 @@ RUN apt-get update && \
     apt-get -q install -y --no-install-recommends \
     autoconf \
     automake \
+    bc \
     build-essential \
     curl \
     dmidecode \
@@ -27,6 +28,7 @@ RUN apt-get update && \
     libaio-dev \
     libboost-program-options-dev \
     libcap2 \
+    libcurl4-openssl-dev \
     libnuma-dev \
     libpci-dev \
     libssl-dev \
@@ -38,6 +40,7 @@ RUN apt-get update && \
     openssh-client \
     openssh-server \
     pciutils \
+    python3-mpi4py \
     rsync \
     sudo \
     util-linux \
@@ -46,11 +49,11 @@ RUN apt-get update && \
     && \
     rm -rf /tmp/*
 
-ARG NUM_MAKE_JOBS=16
+ARG NUM_MAKE_JOBS=
 
 # Check if CMake is installed and its version
 RUN cmake_version=$(cmake --version 2>/dev/null | grep -oP "(?<=cmake version )(\d+\.\d+)" || echo "0.0") && \
-    required_version="3.26.4" && \
+    required_version="3.24.1" && \
     if [ "$(printf "%s\n" "$required_version" "$cmake_version" | sort -V | head -n 1)" != "$required_version" ]; then \
     echo "existing cmake version is ${cmake_version}" && \
     cd /tmp && \
@@ -100,40 +103,26 @@ RUN if ! command -v ofed_info >/dev/null 2>&1; then \
     rm -rf MLNX_OFED_LINUX-${OFED_VERSION}* ; \
     fi
 
-# Install UCX
-ENV UCX_VERSION=1.14.1
-RUN if [ -z "$(ls -A /opt/ucx)" ]; then \
-    echo "/opt/ucx is empty. Installing UCX..."; \
-    cd /tmp && \
-    git clone https://github.com/openucx/ucx.git -b v${UCX_VERSION} && \
-    cd ucx && \
-    ./autogen.sh && \
-    mkdir build && \
-    cd build && \
-    ../configure -prefix=$UCX_DIR --with-rocm=/opt/rocm --without-knem && \
-    make -j $(nproc) && make -j $(nproc) install && rm -rf /tmp/ucx-${UCX_VERSION} ; \
-    else \
-    echo "/opt/ucx is not empty. Skipping UCX installation."; \
-    fi
+# Add target file to help determine which device(s) to build for
+ENV ROCM_PATH=/opt/rocm
+RUN bash -c 'echo -e "gfx90a:xnack-\ngfx90a:xnac+\ngfx940\ngfx941\ngfx942\ngfx1030\ngfx1100\ngfx1101\ngfx1102\n" >> ${ROCM_PATH}/bin/target.lst'
 
 # Install OpenMPI
 ENV OPENMPI_VERSION=4.1.x
+ENV MPI_HOME=/usr/local/mpi
 
 # Check if Open MPI is installed
-RUN [ -d /usr/local/bin/mpirun ] || { \
-    echo "Open MPI not found. Installing Open MPI..." && \
-    cd /tmp && \
+RUN cd /tmp && \
     git clone --recursive https://github.com/open-mpi/ompi.git -b v${OPENMPI_VERSION} && \
     cd ompi && \
     ./autogen.pl && \
     mkdir build && \
     cd build && \
-    ../configure --prefix=/usr/local --enable-orterun-prefix-by-default --enable-mpirun-prefix-by-default --enable-prte-prefix-by-default --enable-mca-no-build=btl-uct --with-ucx=/opt/ucx --with-rocm=/opt/rocm && \
+    ../configure --prefix=/usr/local/mpi --enable-orterun-prefix-by-default --enable-mpirun-prefix-by-default --enable-prte-prefix-by-default --with-rocm=/opt/rocm && \
     make -j $(nproc) && \
    make -j $(nproc) install && \
     ldconfig && \
     cd / && \
-    rm -rf /tmp/openmpi-${OPENMPI_VERSION}* ;\
-    }
+    rm -rf /tmp/openmpi-${OPENMPI_VERSION}*
 
 # Install Intel MLC
 RUN cd /tmp && \
@@ -148,12 +137,18 @@ RUN cd /opt/ && \
     cd rccl && \
     mkdir build && \
     cd build && \
-    CXX=/opt/rocm/bin/hipcc cmake -DCMAKE_PREFIX_PATH=/opt/rocm/ .. && \
+    CXX=/opt/rocm/bin/hipcc cmake -DHIP_COMPILER=clang -DCMAKE_BUILD_TYPE=Release -DCMAKE_VERBOSE_MAKEFILE=1 \
+    -DCMAKE_PREFIX_PATH="${ROCM_PATH}/hsa;${ROCM_PATH}/hip;${ROCM_PATH}/share/rocm/cmake/;${ROCM_PATH}" \
+    .. && \
     make -j${NUM_MAKE_JOBS}
 
+# Install AMD SMI Python Library
+RUN cd /opt/rocm/share/amd_smi && \
+    python3 -m pip install --user .
+
-ENV PATH="/opt/superbench/bin:/usr/local/bin/:/opt/rocm/hip/bin/:/opt/rocm/bin/:${PATH}" \
+ENV PATH="/usr/local/mpi/bin:/opt/superbench/bin:/usr/local/bin/:/opt/rocm/hip/bin/:/opt/rocm/bin/:${PATH}" \
     LD_PRELOAD="/opt/rccl/build/librccl.so:$LD_PRELOAD" \
-    LD_LIBRARY_PATH="/opt/ucx/lib:/usr/local/lib/:/opt/rocm/lib:${LD_LIBRARY_PATH}" \
+    LD_LIBRARY_PATH="/usr/local/mpi/lib:/usr/lib/x86_64-linux-gnu/:/usr/local/lib/:/opt/rocm/lib:${LD_LIBRARY_PATH}" \
     SB_HOME=/opt/superbench \
     SB_MICRO_PATH=/opt/superbench \
     ANSIBLE_DEPRECATION_WARNINGS=FALSE \
@@ -163,13 +158,19 @@ RUN echo PATH="$PATH" > /etc/environment && \
     echo LD_LIBRARY_PATH="$LD_LIBRARY_PATH" >> /etc/environment && \
     echo SB_MICRO_PATH="$SB_MICRO_PATH" >> /etc/environment
 
+RUN apt install rocm-cmake -y && \
+    python3 -m pip install --upgrade pip wheel setuptools==65.7
+
 WORKDIR ${SB_HOME}
 
-ADD . .
-RUN apt install rocm-cmake -y && \
-    python3 -m pip install --upgrade pip wheel setuptools==65.7 && \
-    python3 -m pip install .[amdworker] && \
-    make postinstall
-RUN make cppbuild
 ADD third_party third_party
-RUN make RCCL_HOME=/opt/rccl/build/ ROCBLAS_BRANCH=release/rocm-rel-5.7.1.1 HIPBLASLT_BRANCH=release-staging/rocm-rel-5.7 ROCM_VER=rocm-5.5.0 -C third_party rocm -o cpu_hpl -o cpu_stream -o megatron_lm
+# Apply patch
+RUN cd third_party/perftest && \
+    git apply ../perftest_rocm6.patch
+RUN make RCCL_HOME=/opt/rccl/build/ ROCBLAS_BRANCH=release/rocm-rel-5.7.1.1 HIPBLASLT_BRANCH=release/rocm-rel-5.7 ROCM_VER=rocm-5.5.0 -C third_party rocm -o cpu_hpl -o cpu_stream -o megatron_lm
+ADD . .
+#ENV USE_HIPBLASLT_DATATYPE=1
+RUN python3 -m pip install .[amdworker] && \
+    CXX=/opt/rocm/bin/hipcc make cppbuild && \
+    make postinstall

@@ -0,0 +1,181 @@
ARG BASE_IMAGE=rocm/pytorch:rocm6.0_ubuntu22.04_py3.9_pytorch_2.0.1
FROM ${BASE_IMAGE}
# OS:
# - Ubuntu: 22.04
# - Docker Client: 20.10.8
# ROCm:
# - ROCm: 6.0
# Lib:
# - torch: 2.0.1
# - rccl: 2.18.3+hip6.0 develop:7e1cbb4
# - hipblaslt: release/rocm-rel-6.0
# - openmpi: 4.1.x
# - apex: 1.0.0
# Intel:
# - mlc: v3.10
LABEL maintainer="SuperBench"
ENV DEBIAN_FRONTEND=noninteractive
RUN apt-get update && \
apt-get -q install -y --no-install-recommends \
autoconf \
automake \
bc \
build-essential \
curl \
dmidecode \
git \
hipify-clang \
iproute2 \
jq \
libaio-dev \
libboost-program-options-dev \
libcap2 \
libcurl4-openssl-dev \
libnuma-dev \
libpci-dev \
libssl-dev \
libtinfo5 \
libtool \
lshw \
net-tools \
numactl \
openssh-client \
openssh-server \
pciutils \
python3-mpi4py \
rsync \
sudo \
util-linux \
vim \
wget \
&& \
rm -rf /tmp/*
ARG NUM_MAKE_JOBS=64
# Check if CMake is installed and its version
RUN cmake_version=$(cmake --version 2>/dev/null | grep -oP "(?<=cmake version )(\d+\.\d+)" || echo "0.0") && \
required_version="3.24.1" && \
if [ "$(printf "%s\n" "$required_version" "$cmake_version" | sort -V | head -n 1)" != "$required_version" ]; then \
echo "existing cmake version is ${cmake_version}" && \
cd /tmp && \
wget -q https://github.com/Kitware/CMake/releases/download/v${required_version}/cmake-${required_version}.tar.gz && \
tar xzf cmake-${required_version}.tar.gz && \
cd cmake-${required_version} && \
./bootstrap --prefix=/usr --no-system-curl --parallel=16 && \
make -j ${NUM_MAKE_JOBS} && \
make install && \
rm -rf /tmp/cmake-${required_version}* \
else \
echo "CMake version is greater than or equal to 3.23"; \
fi
# Install Docker
ENV DOCKER_VERSION=20.10.8
RUN cd /tmp && \
wget -q https://download.docker.com/linux/static/stable/x86_64/docker-${DOCKER_VERSION}.tgz -O docker.tgz && \
tar --extract --file docker.tgz --strip-components 1 --directory /usr/local/bin/ && \
rm docker.tgz
# Update system config
RUN mkdir -p /root/.ssh && \
touch /root/.ssh/authorized_keys && \
mkdir -p /var/run/sshd && \
sed -i "s/[# ]*PermitRootLogin prohibit-password/PermitRootLogin yes/" /etc/ssh/sshd_config && \
sed -i "s/[# ]*PermitUserEnvironment no/PermitUserEnvironment yes/" /etc/ssh/sshd_config && \
sed -i "s/[# ]*Port.*/Port 22/" /etc/ssh/sshd_config && \
echo "* soft nofile 1048576\n* hard nofile 1048576" >> /etc/security/limits.conf && \
echo "root soft nofile 1048576\nroot hard nofile 1048576" >> /etc/security/limits.conf
# Get Ubuntu version and set as an environment variable
RUN export UBUNTU_VERSION=$(lsb_release -r -s)
RUN echo "Ubuntu version: $UBUNTU_VERSION"
ENV UBUNTU_VERSION=${UBUNTU_VERSION}
# Install OFED
ENV OFED_VERSION=5.9-0.5.6.0
# Check if ofed_info is present and has a version
RUN if ! command -v ofed_info >/dev/null 2>&1; then \
echo "OFED not found. Installing OFED..."; \
cd /tmp && \
wget -q http://content.mellanox.com/ofed/MLNX_OFED-${OFED_VERSION}/MLNX_OFED_LINUX-${OFED_VERSION}-ubuntu${UBUNTU_VERSION}-x86_64.tgz && \
tar xzf MLNX_OFED_LINUX-${OFED_VERSION}-ubuntu${UBUNTU_VERSION}-x86_64.tgz && \
PATH=/usr/bin:${PATH} MLNX_OFED_LINUX-${OFED_VERSION}-ubuntu${UBUNTU_VERSION}-x86_64/mlnxofedinstall --user-space-only --without-fw-update --force --all && \
rm -rf MLNX_OFED_LINUX-${OFED_VERSION}* ; \
fi
# Add target file to help determine which device(s) to build for
ENV ROCM_PATH=/opt/rocm
RUN bash -c 'echo -e "gfx90a:xnack-\ngfx90a:xnac+\ngfx940\ngfx941\ngfx942:sramecc+:xnack-\n" >> ${ROCM_PATH}/bin/target.lst'
# Install OpenMPI
ENV OPENMPI_VERSION=4.1.x
ENV MPI_HOME=/usr/local/mpi
# Check if Open MPI is installed
RUN cd /tmp && \
git clone --recursive https://github.com/open-mpi/ompi.git -b v${OPENMPI_VERSION} && \
cd ompi && \
./autogen.pl && \
mkdir build && \
cd build && \
../configure --prefix=/usr/local/mpi --enable-orterun-prefix-by-default --enable-mpirun-prefix-by-default --enable-prte-prefix-by-default --with-rocm=/opt/rocm && \
make -j $(nproc) && \
make -j $(nproc) install && \
ldconfig && \
cd / && \
rm -rf /tmp/openmpi-${OPENMPI_VERSION}*
# Install Intel MLC
RUN cd /tmp && \
wget -q https://downloadmirror.intel.com/763324/mlc_v3.10.tgz -O mlc.tgz && \
tar xzf mlc.tgz Linux/mlc && \
cp ./Linux/mlc /usr/local/bin/ && \
rm -rf ./Linux mlc.tgz
# Install RCCL
RUN cd /opt/ && \
git clone https://github.com/ROCmSoftwarePlatform/rccl.git && \
cd rccl && \
mkdir build && \
cd build && \
CXX=/opt/rocm/bin/hipcc cmake -DHIP_COMPILER=clang -DCMAKE_BUILD_TYPE=Release -DCMAKE_VERBOSE_MAKEFILE=1 \
-DCMAKE_PREFIX_PATH="${ROCM_PATH}/hsa;${ROCM_PATH}/hip;${ROCM_PATH}/share/rocm/cmake/;${ROCM_PATH}" \
.. && \
make -j${NUM_MAKE_JOBS}
ENV PATH="/usr/local/mpi/bin:/opt/superbench/bin:/usr/local/bin/:/opt/rocm/hip/bin/:/opt/rocm/bin/:${PATH}" \
LD_PRELOAD="/opt/rccl/build/librccl.so:$LD_PRELOAD" \
LD_LIBRARY_PATH="/usr/local/mpi/lib:/usr/lib/x86_64-linux-gnu/:/usr/local/lib/:/opt/rocm/lib:${LD_LIBRARY_PATH}" \
SB_HOME=/opt/superbench \
SB_MICRO_PATH=/opt/superbench \
ANSIBLE_DEPRECATION_WARNINGS=FALSE \
ANSIBLE_COLLECTIONS_PATH=/usr/share/ansible/collections
RUN echo PATH="$PATH" > /etc/environment && \
echo LD_LIBRARY_PATH="$LD_LIBRARY_PATH" >> /etc/environment && \
echo SB_MICRO_PATH="$SB_MICRO_PATH" >> /etc/environment
RUN apt install rocm-cmake -y && \
python3 -m pip install --upgrade pip wheel setuptools==65.7
WORKDIR ${SB_HOME}
ADD third_party third_party
# Apply patch
RUN cd third_party/perftest && \
git apply ../perftest_rocm6.patch
RUN make RCCL_HOME=/opt/rccl/build/ ROCBLAS_BRANCH=release/rocm-rel-6.0 HIPBLASLT_BRANCH=release/rocm-rel-6.0 ROCM_VER=rocm-5.5.0 -C third_party rocm -o cpu_hpl -o cpu_stream -o megatron_lm
RUN cd third_party/Megatron/Megatron-DeepSpeed && \
git apply ../megatron_deepspeed_rocm6.patch
ADD . .
ENV USE_HIP_DATATYPE=1
ENV USE_HIPBLAS_COMPUTETYPE=1
RUN python3 -m pip install .[amdworker] && \
CXX=/opt/rocm/bin/hipcc make cppbuild && \
make postinstall

@@ -29,7 +29,7 @@ You need to [clone the code](./development.md#set-up) first before building the
 export DOCKER_BUILDKIT=1
 docker buildx build \
   --platform linux/amd64 --cache-to type=inline,mode=max \
-  --tag superbench-dev --file dockerfile/cuda12.1.dockerfile .
+  --tag superbench-dev --file dockerfile/cuda12.2.dockerfile .
 ```
 
 </TabItem>

@@ -61,7 +61,7 @@ You can clone the source from GitHub and build it.
 :::note Note
 You should checkout corresponding tag to use release version, for example,
-`git clone -b v0.9.0 https://github.com/microsoft/superbenchmark`
+`git clone -b v0.10.0 https://github.com/microsoft/superbenchmark`
 :::
 
 ```bash

@@ -27,7 +27,7 @@ sb deploy -f remote.ini --host-password [password]
 :::note Note
 You should deploy corresponding Docker image to use release version, for example,
-`sb deploy -f local.ini -i superbench/superbench:v0.9.0-cuda12.1`
+`sb deploy -f local.ini -i superbench/superbench:v0.10.0-cuda12.2`
+Note that the version of the git repo determines only the version of the sb CLI, not the sb container; you should specify the container version even if you checked out a release tag for the git clone.

@@ -70,7 +70,7 @@ superbench:
 <TabItem value='example'>
 
 ```yaml
-version: v0.9
+version: v0.10
 superbench:
   enable: benchmark_1
   monitor:

@@ -58,17 +58,18 @@ Large scale matmul operation using `torch.matmul` with one GPU.
 |--------------------------------|-----------|--------------------------------|
 | pytorch-matmul/nosharding_time | time (ms) | Time of pure matmul operation. |
 
-### `cublaslt-gemm`
+### `cublaslt-gemm` / `hipblaslt-gemm`
 
 #### Introduction
 
-Measure the GEMM performance of [`cublasLtMatmul`](https://docs.nvidia.com/cuda/cublas/#cublasltmatmul).
+Measure the GEMM performance of [`cublasLtMatmul`](https://docs.nvidia.com/cuda/cublas/#cublasltmatmul) or [`hipblasLt-bench`](https://github.com/ROCm/hipBLASLt/blob/develop/clients/benchmarks/README.md).
 
 #### Metrics
 
-| Name                                                     | Unit           | Description                     |
-|----------------------------------------------------------|----------------|---------------------------------|
-| cublaslt-gemm/${dtype}\_${batch}\_${m}\_${n}\_${k}_flops | FLOPS (TFLOPS) | TFLOPS of measured GEMM kernel. |
+| Name                                                      | Unit           | Description                     |
+|-----------------------------------------------------------|----------------|---------------------------------|
+| cublaslt-gemm/${dtype}\_${batch}\_${m}\_${n}\_${k}_flops  | FLOPS (TFLOPS) | TFLOPS of measured GEMM kernel. |
+| hipblaslt-gemm/${dtype}\_${batch}\_${m}\_${n}\_${k}_flops | FLOPS (TFLOPS) | TFLOPS of measured GEMM kernel. |
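For reference, the `_flops` metrics above follow the usual GEMM operation count of 2·m·n·k (times the batch dimension). A back-of-envelope check, with illustrative numbers rather than benchmark output:

```shell
# TFLOPS = 2 * batch * m * n * k / time_seconds / 1e12; values are hypothetical.
awk 'BEGIN {
    m = 4096; n = 4096; k = 4096; batch = 1
    t = 0.0005                      # measured kernel time in seconds
    printf "%.1f TFLOPS\n", 2 * batch * m * n * k / t / 1e12
}'
# prints: 274.9 TFLOPS
```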
 ### `cublas-function`
 
@@ -243,6 +244,7 @@ or [AMD](https://github.com/ROCm-Developer-Tools/HIP/tree/master/samples/1_Utils
 ### `gpu-copy-bw`
 
 Measure the memory copy bandwidth performed by GPU SM/DMA engine, including device-to-host, host-to-device and device-to-device.
+For measurements of peer-to-peer communication performance between AMD GPUs, GPU memory buffers are allocated in `hipDeviceMallocUncached` (previously `hipDeviceMallocFinegrained`) mode to maximize performance.
 
 #### Metrics
 
@@ -283,6 +285,7 @@ Measure the performance of NCCL/RCCL operations under multi nodes' traffic patterns,
 performed by [nccl-tests](https://github.com/NVIDIA/nccl-tests/tree/44df0bf010dcc95e840ca0fb7466c67cff3f1f0f)
 or [rccl-tests](https://github.com/ROCmSoftwarePlatform/rccl-tests/tree/dc1ad4853d7ec738387d42a75a58a98d7af00c7b).
 Support the following operations currently: allreduce, allgather, broadcast, reduce, reducescatter, alltoall.
+Support both in-place and out-of-place measurements.
 Support the following traffic patterns:
 
 * `all-nodes`, validate the NCCL/RCCL performance across all VM nodes simultaneously.


@@ -28,26 +28,29 @@ available tags are listed below for all stable versions.
 }>
 <TabItem value='cuda'>

 | Tag | Description |
-|-------------------|------------------------------------|
-| v0.9.0-cuda12.1   | SuperBench v0.9.0 with CUDA 12.1   |
-| v0.9.0-cuda11.1.1 | SuperBench v0.9.0 with CUDA 11.1.1 |
-| v0.8.0-cuda12.1   | SuperBench v0.8.0 with CUDA 12.1   |
-| v0.8.0-cuda11.1.1 | SuperBench v0.8.0 with CUDA 11.1.1 |
-| v0.7.0-cuda11.8   | SuperBench v0.7.0 with CUDA 11.8   |
-| v0.7.0-cuda11.1.1 | SuperBench v0.7.0 with CUDA 11.1.1 |
-| v0.6.0-cuda11.1.1 | SuperBench v0.6.0 with CUDA 11.1.1 |
-| v0.5.0-cuda11.1.1 | SuperBench v0.5.0 with CUDA 11.1.1 |
-| v0.4.0-cuda11.1.1 | SuperBench v0.4.0 with CUDA 11.1.1 |
-| v0.3.0-cuda11.1.1 | SuperBench v0.3.0 with CUDA 11.1.1 |
-| v0.2.1-cuda11.1.1 | SuperBench v0.2.1 with CUDA 11.1.1 |
-| v0.2.0-cuda11.1.1 | SuperBench v0.2.0 with CUDA 11.1.1 |
+|--------------------|-------------------------------------|
+| v0.10.0-cuda12.2   | SuperBench v0.10.0 with CUDA 12.2   |
+| v0.10.0-cuda11.1.1 | SuperBench v0.10.0 with CUDA 11.1.1 |
+| v0.9.0-cuda12.1    | SuperBench v0.9.0 with CUDA 12.1    |
+| v0.9.0-cuda11.1.1  | SuperBench v0.9.0 with CUDA 11.1.1  |
+| v0.8.0-cuda12.1    | SuperBench v0.8.0 with CUDA 12.1    |
+| v0.8.0-cuda11.1.1  | SuperBench v0.8.0 with CUDA 11.1.1  |
+| v0.7.0-cuda11.8    | SuperBench v0.7.0 with CUDA 11.8    |
+| v0.7.0-cuda11.1.1  | SuperBench v0.7.0 with CUDA 11.1.1  |
+| v0.6.0-cuda11.1.1  | SuperBench v0.6.0 with CUDA 11.1.1  |
+| v0.5.0-cuda11.1.1  | SuperBench v0.5.0 with CUDA 11.1.1  |
+| v0.4.0-cuda11.1.1  | SuperBench v0.4.0 with CUDA 11.1.1  |
+| v0.3.0-cuda11.1.1  | SuperBench v0.3.0 with CUDA 11.1.1  |
+| v0.2.1-cuda11.1.1  | SuperBench v0.2.1 with CUDA 11.1.1  |
+| v0.2.0-cuda11.1.1  | SuperBench v0.2.0 with CUDA 11.1.1  |

 </TabItem>
 <TabItem value='rocm'>

 | Tag | Description |
 |-------------------------------|--------------------------------------------------|
+| v0.10.0-rocm5.7               | SuperBench v0.10.0 with ROCm 5.7                 |
 | v0.9.0-rocm5.1.3              | SuperBench v0.9.0 with ROCm 5.1.3                |
 | v0.9.0-rocm5.1.1              | SuperBench v0.9.0 with ROCm 5.1.1                |
 | v0.9.0-rocm5.0.1              | SuperBench v0.9.0 with ROCm 5.0.1                |


@@ -65,7 +65,7 @@ superbench:
 example:

 ```yaml
 # SuperBench rules
-version: v0.9
+version: v0.10
 superbench:
   rules:
     failure-rule:


@@ -58,7 +58,7 @@ superbench:

 ```yaml title="Example"
 # SuperBench rules
-version: v0.9
+version: v0.10
 superbench:
   rules:
     kernel_launch:


@@ -6,5 +6,5 @@

 Provide hardware and software benchmarks for AI systems.
 """

-__version__ = '0.9.0'
+__version__ = '0.10.0'
 __author__ = 'Microsoft'


@@ -94,6 +94,17 @@ class CudaNcclBwBenchmark(MicroBenchmarkWithInvoke):
             default=0,
             help='Number of graph launch iterations. Set to 0 to disable graph mode. Default: 0.',
         )
+        self._parser.add_argument(
+            '--in_place',
+            action='store_true',
+            help='If specified, collect in-place numbers, else collect out-of-place numbers.',
+        )
+        self._parser.add_argument(
+            '--data_type',
+            type=str,
+            default='float',
+            help='Data type used in NCCL operations. Default: float.',
+        )

     def _preprocess(self):
         """Preprocess/preparation operations before the benchmarking.
@@ -123,9 +134,10 @@ class CudaNcclBwBenchmark(MicroBenchmarkWithInvoke):
             return False

         command = os.path.join(self._args.bin_dir, self._bin_name)
-        command += ' -b {} -e {} -f {} -g {} -c {} -n {} -w {} -G {}'.format(
+        command += ' -b {} -e {} -f {} -g {} -c {} -n {} -w {} -G {} -d {}'.format(
             self._args.minbytes, self._args.maxbytes, str(self._args.stepfactor), str(self._args.ngpus),
-            str(self._args.check), str(self._args.iters), str(self._args.warmup_iters), str(self._args.graph_iters)
+            str(self._args.check), str(self._args.iters), str(self._args.warmup_iters), str(self._args.graph_iters),
+            self._args.data_type
         )
         self._commands.append(command)

@@ -171,9 +183,9 @@ class CudaNcclBwBenchmark(MicroBenchmarkWithInvoke):
         content = content[out_of_place_index + 1:out_of_bound_index]

         # Parse max out of bound bus bw as the result
         size_index = -1
-        time_index = -1
-        busbw_index = -1
-        algbw_index = -1
+        time_index = None
+        busbw_index = None
+        algbw_index = None
         for line in content:
             if 'time' in line and 'busbw' in line:
                 # Get index of selected column
@@ -181,11 +193,17 @@ class CudaNcclBwBenchmark(MicroBenchmarkWithInvoke):
                 line = re.sub(r' +', ' ', line).split(' ')
                 # Get first index of condition in list, if it not existing, raise exception
                 size_index = line.index('size')
-                time_index = line.index('time') - len(line)
-                busbw_index = line.index('busbw') - len(line)
-                algbw_index = line.index('algbw') - len(line)
+                # Need index from the end because sometimes previous fields (like redop) can be empty
+                if self._args.in_place:
+                    time_index = -1 - list(reversed(line)).index('time')
+                    busbw_index = -1 - list(reversed(line)).index('busbw')
+                    algbw_index = -1 - list(reversed(line)).index('algbw')
+                else:
+                    time_index = line.index('time') - len(line)
+                    busbw_index = line.index('busbw') - len(line)
+                    algbw_index = line.index('algbw') - len(line)
                 break

-        if size_index != -1 and busbw_index != -1 and time_index != -1 and algbw_index != -1:
+        if size_index != -1 and busbw_index is not None and time_index is not None and algbw_index is not None:
             for line in content:
                 line = line.strip(' ')
                 line = re.sub(r' +', ' ', line).split(' ')
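The column lookup above finds the in-place `time`/`algbw`/`busbw` columns by scanning the header from the end, since those names occur twice in a nccl-tests header row. A minimal standalone sketch, using a hypothetical header line (not copied from real nccl-tests output):

```python
# Hypothetical nccl-tests header: out-of-place columns come first,
# then the in-place time/algbw/busbw columns repeat near the end.
header = 'size count type redop root time algbw busbw #wrong time algbw busbw #wrong'
cols = header.split()

# First occurrence from the left (out-of-place), expressed as a negative index
# so it stays valid even if leading fields are empty on data rows:
oop_time = cols.index('time') - len(cols)

# Last occurrence (in-place), found by scanning the reversed list:
ip_time = -1 - list(reversed(cols)).index('time')

print(oop_time, ip_time)
```

Both indices address the same list from the end, so `cols[oop_time]` and `cols[ip_time]` each resolve to a `time` column regardless of how many leading fields a data row actually has.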


@@ -493,13 +493,12 @@ class DistInference(MicroBenchmarkWithInvoke):
         try:
             output_lines = [x.strip() for x in raw_output.strip().splitlines()]
-            step_time = None
+            step_times = []
             for output_line in output_lines:
-                if ' ms per iteration' in output_line:
-                    step_time = float(output_line.split(' ms per iteration')[0].split()[-1])
-                    break
+                if output_line.startswith('Latency of step'):
+                    step_times.append(float(output_line.split(' ms')[0].split()[-1]))
             return self._process_numeric_result(
-                'step_times', [step_time], reduce_type=ReduceType.MAX, cal_percentile=True
+                'step_times', step_times, reduce_type=ReduceType.MAX, cal_percentile=True
             )
         except BaseException as e:
             return self._set_error_code_and_print_error_msg(
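The parsing change above collects one latency per step instead of a single averaged value, so percentiles become meaningful. A minimal sketch of the new loop, fed with hypothetical benchmark output:

```python
# Hypothetical dist_inference output in the new per-step format.
raw_output = """\
Latency of step 0: 1.912 ms
Latency of step 1: 1.875 ms
Latency of step 2: 1.901 ms
"""

step_times = []
for line in raw_output.strip().splitlines():
    line = line.strip()
    if line.startswith('Latency of step'):
        # 'Latency of step 0: 1.912 ms' -> '1.912'
        step_times.append(float(line.split(' ms')[0].split()[-1]))

print(step_times)  # every step is kept, not just an aggregate
```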


@@ -31,6 +31,14 @@ else()
         # link hip device lib
         add_executable(dist_inference dist_inference.cpp)
         set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -O2 -DROCM_USE_FLOAT16=1")
+        if(DEFINED ENV{USE_HIPBLASLT_DATATYPE})
+            set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -DUSE_HIPBLASLT_DATATYPE=1")
+        elseif(DEFINED ENV{USE_HIP_DATATYPE})
+            set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -DUSE_HIP_DATATYPE=1")
+        endif()
+        if(DEFINED ENV{USE_HIPBLAS_COMPUTETYPE})
+            set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -DUSE_HIPBLAS_COMPUTETYPE=1")
+        endif()
         target_link_libraries(dist_inference MPI::MPI_CXX rccl hipblaslt hip::device)
     else()
         message(FATAL_ERROR "No CUDA or ROCm environment found.")


@@ -45,6 +45,21 @@
 #include <hipblaslt/hipblaslt.h>
 #include <rccl/rccl.h>
 using cublasLtHalf = hipblasLtHalf;
+#if defined(USE_HIPBLASLT_DATATYPE)
+#define DIST_INF_HIP_DATATYPE_R_16F HIPBLASLT_R_16F
+#define DIST_INF_HIP_DATATYPE_R_32F HIPBLASLT_R_32F
+#elif defined(USE_HIP_DATATYPE)
+#define DIST_INF_HIP_DATATYPE_R_16F HIP_R_16F
+#define DIST_INF_HIP_DATATYPE_R_32F HIP_R_32F
+#else
+#define DIST_INF_HIP_DATATYPE_R_16F HIPBLAS_R_16F
+#define DIST_INF_HIP_DATATYPE_R_32F HIPBLAS_R_32F
+#endif
+#if defined(USE_HIPBLAS_COMPUTETYPE)
+#define DIST_INF_HIP_COMPUTETYPE_F32 HIPBLAS_COMPUTE_32F
+#else
+#define DIST_INF_HIP_COMPUTETYPE_F32 HIPBLASLT_COMPUTE_F32
+#endif
 #else
 #include <cublasLt.h>
 #include <cuda_fp16.h>
@@ -229,16 +244,18 @@ void TestModel(int64_t m, int64_t n, int64_t k, float alpha, float beta, int32_t
     CHECK_CUBLASLT_ERROR(hipblasLtCreate(&handle));

-    CHECK_CUBLASLT_ERROR(hipblasLtMatrixLayoutCreate(&matA, HIPBLAS_R_16F, k, n, k));
-    CHECK_CUBLASLT_ERROR(hipblasLtMatrixLayoutCreate(&matB, HIPBLAS_R_16F, m, k, m));
-    CHECK_CUBLASLT_ERROR(hipblasLtMatrixLayoutCreate(&matC, HIPBLAS_R_16F, m, n, m));
-    CHECK_CUBLASLT_ERROR(hipblasLtMatrixLayoutCreate(&matD, HIPBLAS_R_16F, m, n, m));
-    CHECK_CUBLASLT_ERROR(hipblasLtMatrixLayoutCreate(&matE, HIPBLAS_R_16F, k, m, k));
-    CHECK_CUBLASLT_ERROR(hipblasLtMatrixLayoutCreate(&matF, HIPBLAS_R_16F, k, n, k));
-    CHECK_CUBLASLT_ERROR(hipblasLtMatrixLayoutCreate(&matG, HIPBLAS_R_16F, k, n, k));
-    CHECK_CUBLASLT_ERROR(hipblasLtMatmulDescCreate(&matmul1, HIPBLASLT_COMPUTE_F32, HIPBLAS_R_32F));
-    CHECK_CUBLASLT_ERROR(hipblasLtMatmulDescCreate(&matmul2, HIPBLASLT_COMPUTE_F32, HIPBLAS_R_32F));
+    CHECK_CUBLASLT_ERROR(hipblasLtMatrixLayoutCreate(&matA, DIST_INF_HIP_DATATYPE_R_16F, k, n, k));
+    CHECK_CUBLASLT_ERROR(hipblasLtMatrixLayoutCreate(&matB, DIST_INF_HIP_DATATYPE_R_16F, m, k, m));
+    CHECK_CUBLASLT_ERROR(hipblasLtMatrixLayoutCreate(&matC, DIST_INF_HIP_DATATYPE_R_16F, m, n, m));
+    CHECK_CUBLASLT_ERROR(hipblasLtMatrixLayoutCreate(&matD, DIST_INF_HIP_DATATYPE_R_16F, m, n, m));
+    CHECK_CUBLASLT_ERROR(hipblasLtMatrixLayoutCreate(&matE, DIST_INF_HIP_DATATYPE_R_16F, k, m, k));
+    CHECK_CUBLASLT_ERROR(hipblasLtMatrixLayoutCreate(&matF, DIST_INF_HIP_DATATYPE_R_16F, k, n, k));
+    CHECK_CUBLASLT_ERROR(hipblasLtMatrixLayoutCreate(&matG, DIST_INF_HIP_DATATYPE_R_16F, k, n, k));
+    CHECK_CUBLASLT_ERROR(
+        hipblasLtMatmulDescCreate(&matmul1, DIST_INF_HIP_COMPUTETYPE_F32, DIST_INF_HIP_DATATYPE_R_32F));
+    CHECK_CUBLASLT_ERROR(
+        hipblasLtMatmulDescCreate(&matmul2, DIST_INF_HIP_COMPUTETYPE_F32, DIST_INF_HIP_DATATYPE_R_32F));

     hipblasOperation_t trans = HIPBLAS_OP_N;
     CHECK_CUBLASLT_ERROR(
@@ -336,8 +353,9 @@ void TestModel(int64_t m, int64_t n, int64_t k, float alpha, float beta, int32_t
 #endif

     std::chrono::steady_clock::time_point start_time, stop_time;
+    std::vector<double> step_times(num_iters, 0.);
     for (int i = 0; i < num_warmups + num_iters; ++i) {
-        if (i == num_warmups) {
+        if (i >= num_warmups) {
             start_time = std::chrono::steady_clock::now();
         }
 #if (NCCL_MAJOR > 2 || (NCCL_MAJOR >= 2 && NCCL_MINOR >= 9)) && (CUDART_VERSION >= 11030 || HIP_VERSION >= 50221310)
@@ -350,11 +368,15 @@ void TestModel(int64_t m, int64_t n, int64_t k, float alpha, float beta, int32_t
         model_forward();
 #endif
         CHECK_CUDA_ERROR(cudaStreamSynchronize(stream));
+        if (i >= num_warmups) {
+            stop_time = std::chrono::steady_clock::now();
+            double step_time = std::chrono::duration_cast<std::chrono::nanoseconds>(stop_time - start_time).count();
+            step_times[i - num_warmups] = step_time;
+        }
+    }
+    for (int i = 0; i < num_iters; i++) {
+        fprintf(stdout, "Latency of step %d: %g ms\n", i, step_times[i] / 1e6);
     }
-    stop_time = std::chrono::steady_clock::now();
-    double duration = std::chrono::duration_cast<std::chrono::milliseconds>(stop_time - start_time).count();
-    fprintf(stdout, "Time: %g ms in total, %g ms per iteration, %g ms per layer\n", duration, duration / num_iters,
-            duration / num_iters / num_layers);

 #if (NCCL_MAJOR > 2 || (NCCL_MAJOR >= 2 && NCCL_MINOR >= 9)) && (CUDART_VERSION >= 11030 || HIP_VERSION >= 50221310)
     // Destroy graph
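Once per-step latencies are printed instead of a single average, they can be reduced to the same summary statistics the PyTorch version reports. A sketch of that reduction (not the repo's actual helper; the values and the nearest-rank percentile below are illustrative assumptions):

```python
# Hypothetical per-step latencies in milliseconds.
step_times = [1.91, 1.88, 1.90, 2.05, 1.87]

def percentile(values, p):
    # Nearest-rank percentile on a sorted copy of the samples.
    s = sorted(values)
    idx = min(len(s) - 1, int(round(p / 100 * (len(s) - 1))))
    return s[idx]

summary = {
    'mean': sum(step_times) / len(step_times),
    'max': max(step_times),
    'p50': percentile(step_times, 50),
    'p90': percentile(step_times, 90),
}
print(summary)
```

Keeping the raw samples is what makes tail metrics like p90 possible; a single "ms per iteration" average cannot recover them.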


@@ -27,6 +27,13 @@ else()
         # link hip device lib
         add_executable(gpu_copy gpu_copy.cpp)
+        include(CheckSymbolExists)
+        check_symbol_exists("hipDeviceMallocUncached" "hip/hip_runtime_api.h" HIP_UNCACHED_MEMORY)
+        if(${HIP_UNCACHED_MEMORY})
+            target_compile_definitions(gpu_copy PRIVATE HIP_UNCACHED_MEMORY)
+        endif()
         set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -O2")
         target_link_libraries(gpu_copy numa hip::device)
     else()


@@ -313,6 +313,25 @@ int SetGpu(int gpu_id) {
     return 0;
 }

+#if defined(__HIP_PLATFORM_AMD__)
+bool UseFineGrained(const SubBenchArgs &args) {
+    return args.is_src_dev_gpu && args.is_dst_dev_gpu && args.src_gpu_id != args.dst_gpu_id;
+}
+cudaError_t GpuMallocDataBuf(uint8_t **ptr, uint64_t size, bool use_fine_grained) {
+    if (use_fine_grained) {
+#if defined(HIP_UNCACHED_MEMORY)
+        return hipExtMallocWithFlags((void **)ptr, size, hipDeviceMallocUncached);
+#else
+        return hipExtMallocWithFlags((void **)ptr, size, hipDeviceMallocFinegrained);
+#endif
+    } else {
+        return cudaMalloc(ptr, size);
+    }
+}
+#else
+cudaError_t GpuMallocDataBuf(uint8_t **ptr, uint64_t size) { return cudaMalloc(ptr, size); }
+#endif
+
 // Prepare data buffers and streams to be used.
 int PrepareBufAndStream(BenchArgs *args) {
     cudaError_t cuda_err = cudaSuccess;
@@ -346,7 +365,11 @@ int PrepareBufAndStream(BenchArgs *args) {
             return -1;
         }
         *(host_buf_ptrs[j]) = nullptr;
-        cuda_err = cudaMalloc(gpu_buf_ptrs[j], args->size);
+#if defined(__HIP_PLATFORM_AMD__)
+        cuda_err = GpuMallocDataBuf(gpu_buf_ptrs[j], args->size, UseFineGrained(sub));
+#else
+        cuda_err = GpuMallocDataBuf(gpu_buf_ptrs[j], args->size);
+#endif
         if (cuda_err != cudaSuccess) {
             fprintf(stderr, "PrepareBufAndStream::cudaMalloc error: %d\n", cuda_err);
             return -1;
@@ -876,7 +899,11 @@ int RunAllToAllBench(const Opts &opts, int gpu_count, int src_rank, int dst_rank
         }

         // Prepare source buffers
-        cuda_err = cudaMalloc(&(src_buffers_gpu[rank]), opts.size);
+#if defined(__HIP_PLATFORM_AMD__)
+        cuda_err = GpuMallocDataBuf(&(src_buffers_gpu[rank]), opts.size, true);
+#else
+        cuda_err = GpuMallocDataBuf(&(src_buffers_gpu[rank]), opts.size);
+#endif
         if (cuda_err != cudaSuccess) {
             fprintf(stderr, "RunAllToAllBench::cudaMalloc for src_buffers_gpu[%d] error: %d\n", cuda_err, rank);
             return -1;
@@ -893,7 +920,11 @@ int RunAllToAllBench(const Opts &opts, int gpu_count, int src_rank, int dst_rank
         }

         // Prepare destination buffers
-        cuda_err = cudaMalloc(&(dst_buffers_gpu[rank]), opts.size);
+#if defined(__HIP_PLATFORM_AMD__)
+        cuda_err = GpuMallocDataBuf(&(dst_buffers_gpu[rank]), opts.size, true);
+#else
+        cuda_err = GpuMallocDataBuf(&(dst_buffers_gpu[rank]), opts.size);
+#endif
         if (cuda_err != cudaSuccess) {
             fprintf(stderr, "RunAllToAllBench::cudaMalloc for dst_buffers_gpu[%d] error: %d\n", cuda_err, rank);
             return -1;


@@ -4,7 +4,6 @@
 """Module of the hipBlasLt GEMM benchmark."""

 import os
-import re

 from superbench.common.utils import logger
 from superbench.benchmarks import BenchmarkRegistry, Platform, ReturnCode
@@ -23,11 +22,12 @@ class HipBlasLtBenchmark(BlasLtBaseBenchmark):
         super().__init__(name, parameters)

         self._bin_name = 'hipblaslt-bench'
-        self._in_types = ['fp32', 'fp16', 'bf16']
+        self._in_types = ['fp32', 'fp16', 'bf16', 'fp8']
         self._in_type_map = {
             'fp16': '--a_type f16_r --b_type f16_r --c_type f16_r --d_type f16_r --compute_type f32_r',
             'fp32': '--a_type f32_r --b_type f32_r --c_type f32_r --d_type f32_r --compute_type f32_r',
             'bf16': '--a_type bf16_r --b_type bf16_r --c_type bf16_r --d_type bf16_r --compute_type f32_r',
+            'fp8': '--a_type f8_r --b_type f8_r --c_type f8_r --d_type f8_r --compute_type f32_r',
         }

     def add_parser_arguments(self):
@@ -42,6 +42,30 @@ class HipBlasLtBenchmark(BlasLtBaseBenchmark):
             required=False,
             help='List of input data types, support {}.'.format(' '.join(self._in_types)),
         )
+        self._parser.add_argument(
+            '--initialization',
+            type=str,
+            default='rand_int',
+            choices=['trig_float', 'rand_int', 'hpl'],
+            required=False,
+            help='Initialize matrix data.',
+        )
+        self._parser.add_argument(
+            '--transA',
+            type=str,
+            default='N',
+            choices=['N', 'T', 'C'],
+            required=False,
+            help='Transpose matrix A.',
+        )
+        self._parser.add_argument(
+            '--transB',
+            type=str,
+            default='N',
+            choices=['N', 'T', 'C'],
+            required=False,
+            help='Transpose matrix B.',
+        )

     def _preprocess(self):
         """Preprocess/preparation operations before the benchmarking.
@@ -58,7 +82,9 @@ class HipBlasLtBenchmark(BlasLtBaseBenchmark):
         self._precision_in_commands = []
         for (_m, _n, _k, _b, _in_type) in self._shapes_to_run:
             command = f'{self.__bin_path} -m {_m} -n {_n} -k {_k} -j {self._args.num_warmup}' + \
-                f' -i {self._args.num_steps} {self._in_type_map[_in_type]}'
+                f' -i {self._args.num_steps} {self._in_type_map[_in_type]}' + \
+                f' --transA {self._args.transA} --transB {self._args.transB}' + \
+                f' --initialization {self._args.initialization}'
             command = command + f' -b {str(_b)}' if _b > 0 else command
             logger.info(command)
             self._commands.append(command)
@@ -97,13 +123,12 @@ class HipBlasLtBenchmark(BlasLtBaseBenchmark):
             fields = lines[index + 1].strip().split(',')

             # Check the number of fields and the format of the first two fields
-            if len(fields) != 23 or not all(
-                re.match(r'\d*\.\d*$', item.strip()) or item.strip().isdigit() for item in fields[-2:]
-            ):
+            if len(fields) != 23:
                 raise ValueError('Invalid result')
             self._result.add_result(
-                f'{self._precision_in_commands[cmd_idx]}_{fields[3]}_{"_".join(fields[4:7])}_flops', float(fields[-2])
+                f'{self._precision_in_commands[cmd_idx]}_{fields[3]}_{"_".join(fields[4:7])}_flops',
+                float(fields[-2]) / 1000
             )
         except BaseException as e:
            self._result.set_return_code(ReturnCode.MICROBENCHMARK_RESULT_PARSING_FAILURE)
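The `/ 1000` above implements the metric-unit change from this release: the tool reports GFLOPS, and the metric is now emitted in TFLOPS. A sketch with a hypothetical, truncated CSV result row (real `hipblaslt-bench` rows carry 23 fields; only the second-to-last matters here):

```python
# Hypothetical result row; '...' stands in for the fields elided here.
line = 'N,N,1024,1024,1024,...,123456.7,0.123'
fields = line.strip().split(',')

gflops = float(fields[-2])
tflops = gflops / 1000  # convert reported GFLOPS to the TFLOPS metric unit

print(round(tflops, 4))
```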


@@ -45,8 +45,7 @@ message(STATUS "CMAKE HIP ARCHITECTURES: ${CMAKE_HIP_ARCHITECTURES}")
 if(EXISTS ${HIP_PATH})
     # Search for hip in common locations
-    list(APPEND CMAKE_PREFIX_PATH ${HIP_PATH} ${ROCM_PATH})
-    set(CMAKE_PREFIX_PATH /opt/rocm ROCM_PATH)
+    list(APPEND CMAKE_PREFIX_PATH ${HIP_PATH} ${ROCM_PATH} ${ROCM_PATH}/hsa ${ROCM_PATH}/hip ${ROCM_PATH}/share/rocm/cmake/)
     set(CMAKE_CXX_COMPILER "${HIP_PATH}/bin/hipcc")
     set(CMAKE_MODULE_PATH "${HIP_PATH}/cmake" ${CMAKE_MODULE_PATH})
     set(CMAKE_MODULE_PATH "${HIP_PATH}/lib/cmake/hip" ${CMAKE_MODULE_PATH})


@@ -116,6 +116,9 @@ class MegatronGPT(ModelBenchmark):
         self._parser.add_argument('--data_home', type=str, default='/tmp', help='Data home.')
         self._parser.add_argument('--vocab_path', type=str, default='/tmp/gpt2-vocab.json', help='Vocab path.')
         self._parser.add_argument('--merge_path', type=str, default='/tmp/gpt2-merges.txt', help='Merge path.')
+        self._parser.add_argument(
+            '--split', type=str, default='949,50,1', help='Split dataset ratio for train/val/test.'
+        )
         self._parser.add_argument('--prescale_grad', action='store_true', help='Prescale grad.')
         self._parser.add_argument(
             '--hostfile', type=str, default=None, help='Hostfile to run the mutli-node benchmark.'
@@ -128,6 +131,13 @@ class MegatronGPT(ModelBenchmark):
     def _preprocess(self):
         if not super()._preprocess():
             return False

+        if not self._args.code_base:
+            if self._args.deepspeed:
+                self._args.code_base = os.path.join(
+                    os.getenv('SB_MICRO_PATH'), 'third_party/Megatron/Megatron-DeepSpeed/'
+                )
+            else:
+                self._args.code_base = os.path.join(os.getenv('SB_MICRO_PATH'), 'third_party/Megatron/Megatron-LM')

         if not os.path.exists(self._args.code_base) or \
            not os.path.exists(os.path.join(self._args.code_base, 'pretrain_gpt.py')):
@@ -156,35 +166,35 @@ class MegatronGPT(ModelBenchmark):
     def _parse_log(self, output):
         """Parse log output and get the performance."""
-        tflops_pattern = re.compile(r'TFLOPs: (\d+\.\d+)')
+        tflops_pattern = re.compile(r'(TFLOPs|TFLOP/s/GPU\)): (\d+\.\d+)')
         elapsed_time_pattern = re.compile(r'elapsed time per iteration \(ms\): (\d+\.\d+)')
-        mem_allocated_pattern = re.compile(r'MemAllocated=([\d.]+)[KMGTPEZY]?B')
-        max_mem_allocated_pattern = re.compile(r'MaxMemAllocated=([\d.]+)[KMGTPEZY]?B')
+        mem_allocated_pattern = re.compile(r'allocated: (\d+\.\d+)')
+        max_mem_allocated_pattern = re.compile(r'max allocated: (\d+\.\d+)')
         lines = output.splitlines()
         tflops = []
         mem_allocated = []
         max_mem_allocated = []
         iteration_times = []
         for line in lines:
-            if 'TFLOPs' in line:
+            if 'elapsed time per iteration' in line:
                 tflops_matches = tflops_pattern.search(line)
                 elapsed_time_match = elapsed_time_pattern.search(line)
                 if tflops_matches:
-                    tflops_values = float(tflops_matches.group(1))
+                    tflops_values = float(tflops_matches.group(2))
                     tflops.append(tflops_values)
                 if elapsed_time_match:
                     elapsed_time_value = float(elapsed_time_match.group(1))
                     iteration_times.append(elapsed_time_value)

-            if 'MaxMemAllocated' in line:
+            if 'max allocated' in line:
                 mem_allocated_match = mem_allocated_pattern.search(line)
                 max_mem_allocated_match = max_mem_allocated_pattern.search(line)
                 if mem_allocated_match:
-                    mem_allocated_value = float(mem_allocated_match.group(1))
+                    mem_allocated_value = float(mem_allocated_match.group(1)) / 1024
                     mem_allocated.append(mem_allocated_value)
                 if max_mem_allocated_match:
-                    max_mem_allocated_value = float(max_mem_allocated_match.group(1))
+                    max_mem_allocated_value = float(max_mem_allocated_match.group(1)) / 1024
                     max_mem_allocated.append(max_mem_allocated_value)

         return iteration_times, tflops, mem_allocated, max_mem_allocated
@@ -224,7 +234,9 @@ class MegatronGPT(ModelBenchmark):
             --deepspeed \
             --deepspeed_config {self._config_json_path} \
             --zero-stage {self._args.zero_stage} \
-            --pipeline-model-parallel-size {self._args.pipeline_model_parallel_size}'
+            --pipeline-model-parallel-size {self._args.pipeline_model_parallel_size} \
+            --train-tokens {self._args.train_tokens} \
+            --data-impl {self._args.data_impl}'

         if self._args.pipeline_model_parallel_size <= 1:
             deepspeed_options = f'{deepspeed_options} --no-pipeline-parallel'
@@ -255,11 +267,10 @@ class MegatronGPT(ModelBenchmark):
             --num-attention-heads {self._args.num_attn_heads} \
             --seq-length {self._args.seq_len} \
             --max-position-embeddings {self._args.seq_len} \
-            --train-tokens {self._args.train_tokens} \
             --train-samples {self._args.num_steps * self._args.batch_size} \
             --lr {self._args.lr} \
             --min-lr {self._args.min_lr} \
-            --split 949,50,1 \
+            --split {self._args.split} \
             --log-interval {self._args.log_interval} \
             --eval-interval {self._args.eval_interval} \
             --eval-iters {self._args.eval_iters} \
@@ -273,7 +284,8 @@ class MegatronGPT(ModelBenchmark):
             --optimizer adam \
             --use-distributed-optimizer \
             {precision_megatron} \
-            --seed {self._args.seed}'
+            --seed {self._args.seed} \
+            --log-throughput'

         if self._args.sequence_parallel:
             megatron_options = f'{megatron_options} --sequence-parallel'
@@ -298,6 +310,8 @@ class MegatronGPT(ModelBenchmark):
         script_path = os.path.join(self._args.code_base, 'pretrain_gpt.py')
         if self._args.deepspeed:
             deepspeed_option = self.__prepare_deespeed_config(precision_megatron.lstrip('--'))
+            # No --log-throughput in Megatron-DeepSpeed by 20231219
+            megatron_options = megatron_options.replace('--log-throughput', '').strip()
             if self._num_nodes > 1:
                 command = f'torchrun {self._distributed_args} ' + \
                     f'{script_path} {megatron_options} {self._data_options} {deepspeed_option}'
@@ -379,6 +393,7 @@ class MegatronGPT(ModelBenchmark):
             return False

         self._num_nodes = int(os.getenv('OMPI_COMM_WORLD_SIZE')) // int(os.getenv('OMPI_COMM_WORLD_LOCAL_SIZE'))
+        master_addr = 'localhost'
         if self._num_nodes > 1:
             if not self._args.hostfile:
                 sb_hostfile = os.path.join(os.environ.get('SB_WORKSPACE', '.'), 'hostfile')
@@ -395,12 +410,13 @@ class MegatronGPT(ModelBenchmark):
             if self._num_nodes != len(hosts):
                 logger.error('MPI init failed since hostfile not match the MPI setting.')
                 return False
+            master_addr = hosts[0].split()[0]

-        addr = os.getenv('MASTER_ADDR', hosts[0].split()[0])
+        addr = os.getenv('MASTER_ADDR', master_addr)
         port = os.getenv('MASTER_PORT', '29500')
         node_rank = int(os.environ['OMPI_COMM_WORLD_RANK']) // int(os.environ['OMPI_COMM_WORLD_LOCAL_SIZE'])
         self._distributed_args = f'--nproc_per_node {self._args.num_gpus} --nnodes {self._num_nodes} ' + \
             f'--node_rank {node_rank} --master_addr {addr} --master_port {port}'

         return True

     def _generate_dataset(self):
@@ -448,8 +464,7 @@ class MegatronGPT(ModelBenchmark):
         self._data_options = f'\
             --vocab-file {self._vocab_path} \
             --merge-file {self._merges_path} \
-            --data-path {self._data_path} \
-            --data-impl {self._args.data_impl}'
+            --data-path {self._data_path}'

         logger.info('Dataset preparation successfully.')
         return True
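The widened TFLOPS pattern above accepts both log formats in one regex. A standalone sketch, fed with hypothetical log lines in the DeepSpeed-style (`TFLOPs: ...`) and Megatron-LM-style (`... (TFLOP/s/GPU): ...`) formats:

```python
import re

# Same alternation as the updated parser: group 2 carries the number
# regardless of which label variant matched.
tflops_pattern = re.compile(r'(TFLOPs|TFLOP/s/GPU\)): (\d+\.\d+)')

deepspeed_line = 'iteration 10 | TFLOPs: 151.32 | elapsed time per iteration (ms): 512.4'
megatron_line = 'iteration 10 | throughput per GPU (TFLOP/s/GPU): 148.77 | elapsed time per iteration (ms): 520.1'

values = [float(tflops_pattern.search(line).group(2)) for line in (deepspeed_line, megatron_line)]
print(values)
```

Keying the loop on `'elapsed time per iteration'` instead of `'TFLOPs'` is what lets the same branch handle both trainers, since that phrase appears in both log formats.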


@@ -265,8 +265,8 @@ class ModelBenchmark(Benchmark):
            # The unit of step time should be millisecond.
            step_times = self._train_step(precision)
            if isinstance(step_times, tuple):
-                step_times = step_times[0]
                info = step_times[1]
+                step_times = step_times[0]
                self._process_info(ModelAction.TRAIN, precision, info)
            step_times = self.__process_model_result(ModelAction.TRAIN, precision, step_times)
            if not step_times:
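The reordering above fixes a rebinding bug: the old code overwrote `step_times` with the first tuple element before reading `step_times[1]`, so the second index hit the times list instead of the info dict. A toy reproduction of the corrected unpacking (hypothetical helper with made-up values):

```python
def split_result(result):
    """Return (step_times, info) from a train-step result.

    The result may be a bare list of step times or a (step_times, info)
    tuple. Read info BEFORE rebinding step_times; otherwise result[1]
    would be looked up on the already-rebound list.
    """
    info = None
    if isinstance(result, tuple):
        info = result[1]        # read the second element first
        step_times = result[0]  # then rebind
    else:
        step_times = result
    return step_times, info
```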
@@ -13,7 +13,7 @@
 gpu = GPU()
 if gpu.vendor == 'nvidia' or gpu.vendor == 'nvidia-graphics':
     import py3nvml.py3nvml as nvml
 elif gpu.vendor == 'amd' or gpu.vendor == 'amd-graphics':
-    from pyrsmi import rocml
+    import amdsmi as rocml

 class DeviceManager:
@@ -150,7 +150,7 @@ class NvidiaDeviceManager(DeviceManager):
        try:
            cap = nvml.nvmlDeviceGetCudaComputeCapability(self._device_handlers[0])
        except Exception as err:
-            logger.error('Get device compute capability failed: {}'.format(str(err)))
+            logger.warning('Get device compute capability failed: {}'.format(str(err)))
            return None
        return cap
@@ -166,7 +166,7 @@ class NvidiaDeviceManager(DeviceManager):
        try:
            util = nvml.nvmlDeviceGetUtilizationRates(self._device_handlers[idx])
        except Exception as err:
-            logger.error('Get device utilization failed: {}'.format(str(err)))
+            logger.warning('Get device utilization failed: {}'.format(str(err)))
            return None
        return util.gpu
@@ -182,7 +182,7 @@ class NvidiaDeviceManager(DeviceManager):
        try:
            temp = nvml.nvmlDeviceGetTemperature(self._device_handlers[idx], nvml.NVML_TEMPERATURE_GPU)
        except Exception as err:
-            logger.error('Get device temperature failed: {}'.format(str(err)))
+            logger.warning('Get device temperature failed: {}'.format(str(err)))
            temp = None
        return temp
@@ -198,7 +198,7 @@ class NvidiaDeviceManager(DeviceManager):
        try:
            power = nvml.nvmlDeviceGetPowerUsage(self._device_handlers[idx])
        except Exception as err:
-            logger.error('Get device power failed: {}'.format(str(err)))
+            logger.warning('Get device power failed: {}'.format(str(err)))
            return None
        return int(int(power) / 1000)
@@ -214,7 +214,7 @@ class NvidiaDeviceManager(DeviceManager):
        try:
            powerlimit = nvml.nvmlDeviceGetPowerManagementLimit(self._device_handlers[idx])
        except Exception as err:
-            logger.error('Get device power limitation failed: {}'.format(str(err)))
+            logger.warning('Get device power limitation failed: {}'.format(str(err)))
            return None
        return int(int(powerlimit) / 1000)
@@ -231,7 +231,7 @@ class NvidiaDeviceManager(DeviceManager):
        try:
            mem = nvml.nvmlDeviceGetMemoryInfo(self._device_handlers[idx])
        except Exception as err:
-            logger.error('Get device memory failed: {}'.format(str(err)))
+            logger.warning('Get device memory failed: {}'.format(str(err)))
            return None, None
        return mem.used, mem.total
@@ -304,7 +304,7 @@ class NvidiaDeviceManager(DeviceManager):
        except nvml.NVMLError:
            pass
        except Exception as err:
-            logger.error('Get device ECC information failed: {}'.format(str(err)))
+            logger.warning('Get device ECC information failed: {}'.format(str(err)))
            return None, None

        try:
@@ -316,7 +316,7 @@ class NvidiaDeviceManager(DeviceManager):
        except nvml.NVMLError:
            pass
        except Exception as err:
-            logger.error('Get device ECC information failed: {}'.format(str(err)))
+            logger.warning('Get device ECC information failed: {}'.format(str(err)))
            return None, None
        return corrected_ecc, uncorrected_ecc
@@ -326,12 +326,13 @@ class AmdDeviceManager(DeviceManager):
    """Device management module for AMD."""

    def __init__(self):
        """Constructor."""
-        rocml.smi_initialize()
+        rocml.amdsmi_init()
+        self._device_handlers = rocml.amdsmi_get_processor_handles()
        super().__init__()

    def __del__(self):
        """Destructor."""
-        rocml.smi_shutdown()
+        rocml.amdsmi_shut_down()

    def get_device_count(self):
        """Get the number of device.
@@ -339,7 +340,7 @@ class AmdDeviceManager(DeviceManager):
        Return:
            count (int): count of device.
        """
-        return rocml.smi_get_device_count()
+        return len(self._device_handlers)

    def get_device_utilization(self, idx):
        """Get the utilization of device.
@@ -351,11 +352,11 @@ class AmdDeviceManager(DeviceManager):
            util (int): the utilization of device, None means failed to get the data.
        """
        try:
-            util = rocml.smi_get_device_utilization(idx)
+            engine_usage = rocml.amdsmi_get_gpu_activity(self._device_handlers[idx])
        except Exception as err:
-            logger.error('Get device utilization failed: {}'.format(str(err)))
+            logger.warning('Get device utilization failed: {}'.format(str(err)))
            return None
-        return util
+        return engine_usage['gfx_activity']

    def get_device_temperature(self, idx):
        """Get the temperature of device, unit: celsius.
@@ -366,8 +367,16 @@ class AmdDeviceManager(DeviceManager):
        Return:
            temp (int): the temperature of device, None means failed to get the data.
        """
-        # Currently no API provided in rocml.
-        return None
+        try:
+            temp = rocml.amdsmi_get_temp_metric(
+                self._device_handlers[idx], rocml.AmdSmiTemperatureType.EDGE, rocml.AmdSmiTemperatureMetric.CURRENT
+            )
+        except (rocml.AmdSmiLibraryException, rocml.AmdSmiParameterException):
+            pass
+        except Exception as err:
+            logger.warning('Get device temperature failed: {}'.format(str(err)))
+            temp = None
+        return temp

    def get_device_power(self, idx):
        """Get the realtime power of device, unit: watt.
@@ -379,11 +388,11 @@ class AmdDeviceManager(DeviceManager):
            temp (int): the realtime power of device, None means failed to get the data.
        """
        try:
-            power = rocml.smi_get_device_average_power(idx)
+            power_measure = rocml.amdsmi_get_power_info(self._device_handlers[idx])
        except Exception as err:
-            logger.error('Get device power failed: {}'.format(str(err)))
+            logger.warning('Get device power failed: {}'.format(str(err)))
            return None
-        return int(int(power) / 1000)
+        return int(power_measure['average_socket_power'])

    def get_device_power_limit(self, idx):
        """Get the power management limit of device, unit: watt.
@@ -394,8 +403,12 @@ class AmdDeviceManager(DeviceManager):
        Return:
            temp (int): the power management limit of device, None means failed to get the data.
        """
-        # Currently no API provided in rocml.
-        return None
+        try:
+            power_measure = rocml.amdsmi_get_power_info(self._device_handlers[idx])
+        except Exception as err:
+            logger.warning('Get device power limit failed: {}'.format(str(err)))
+            return None
+        return int(power_measure['power_limit'])

    def get_device_memory(self, idx):
        """Get the memory information of device, unit: byte.
@@ -408,10 +421,10 @@ class AmdDeviceManager(DeviceManager):
            total (int): the total device memory in bytes, None means failed to get the data.
        """
        try:
-            mem_used = rocml.smi_get_device_memory_used(idx)
-            mem_total = rocml.smi_get_device_memory_total(idx)
+            mem_used = rocml.amdsmi_get_gpu_memory_usage(self._device_handlers[idx], rocml.AmdSmiMemoryType.VRAM)
+            mem_total = rocml.amdsmi_get_gpu_memory_total(self._device_handlers[idx], rocml.AmdSmiMemoryType.VRAM)
        except Exception as err:
-            logger.error('Get device memory failed: {}'.format(str(err)))
+            logger.warning('Get device memory failed: {}'.format(str(err)))
            return None, None
        return mem_used, mem_total
@@ -425,8 +438,19 @@ class AmdDeviceManager(DeviceManager):
            corrected_ecc (int) : the count of single bit ecc error.
            uncorrected_ecc (int): the count of double bit ecc error.
        """
-        # Currently no API provided in rocml.
-        return None, None
+        corrected_ecc = 0
+        uncorrected_ecc = 0
+        for block in rocml.AmdSmiGpuBlock:
+            try:
+                ecc_count = rocml.amdsmi_get_gpu_ecc_count(self._device_handlers[idx], block)
+                corrected_ecc += ecc_count['correctable_count']
+                uncorrected_ecc += ecc_count['uncorrectable_count']
+            except (rocml.AmdSmiLibraryException, rocml.AmdSmiParameterException):
+                pass
+            except Exception as err:
+                logger.info('Get device ECC information failed: {}'.format(str(err)))
+        return corrected_ecc, uncorrected_ecc

 device_manager: Optional[DeviceManager] = DeviceManager()
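The pyrsmi-to-amdsmi migration above wraps every amdsmi call so a metric degrades to `None` (or is skipped) on failure. A compact sketch of that pattern (hypothetical `collect_gpu_metrics` helper; the `smi` argument stands in for the `amdsmi` module and the dictionary keys follow the calls used in the diff, e.g. `amdsmi_get_gpu_activity` returning `gfx_activity`):

```python
def collect_gpu_metrics(smi, handle):
    """Query one AMD GPU via an amdsmi-style module; None marks failures."""
    metrics = {}
    try:
        metrics['gfx_util'] = smi.amdsmi_get_gpu_activity(handle)['gfx_activity']
    except Exception:
        metrics['gfx_util'] = None
    try:
        power = smi.amdsmi_get_power_info(handle)
        metrics['power_w'] = int(power['average_socket_power'])
        metrics['power_limit_w'] = int(power['power_limit'])
    except Exception:
        metrics['power_w'] = metrics['power_limit_w'] = None
    try:
        # VRAM usage and total come from two separate calls; if either
        # fails, report both as unavailable.
        metrics['mem_used'] = smi.amdsmi_get_gpu_memory_usage(handle, smi.AmdSmiMemoryType.VRAM)
        metrics['mem_total'] = smi.amdsmi_get_gpu_memory_total(handle, smi.AmdSmiMemoryType.VRAM)
    except Exception:
        metrics['mem_used'] = metrics['mem_total'] = None
    return metrics
```

This keeps the monitor loop alive even when individual SMI queries are unsupported on a given platform.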
@@ -3,7 +3,7 @@
 # Server:
 # - Product: HPE Apollo 6500
-version: v0.9
+version: v0.10
 superbench:
   enable: null
   var:
@@ -4,7 +4,7 @@
 # - Product: G482-Z53
 # - Link: https://www.gigabyte.cn/FileUpload/Global/MicroSite/553/G482-Z53.html
-version: v0.9
+version: v0.10
 superbench:
   enable: null
   var:
@@ -1,4 +1,4 @@
-version: v0.9
+version: v0.10
 superbench:
   enable: null
   monitor:
@@ -1,4 +1,4 @@
-version: v0.9
+version: v0.10
 superbench:
   enable: null
   monitor:
@@ -1,4 +1,4 @@
-version: v0.9
+version: v0.10
 superbench:
   enable: null
   monitor:
@@ -3,7 +3,7 @@
 # Azure NDm A100 v4
 # reference: https://docs.microsoft.com/en-us/azure/virtual-machines/ndm-a100-v4-series
-version: v0.9
+version: v0.10
 superbench:
   enable: null
   monitor:
@@ -1,5 +1,5 @@
 # SuperBench Config
-version: v0.9
+version: v0.10
 superbench:
   enable: null
   monitor:
@@ -1,5 +1,5 @@
 # SuperBench Config
-version: v0.9
+version: v0.10
 superbench:
   enable: null
   monitor:
@@ -100,7 +100,7 @@
 docker run -itd --name={{ container }} \
   --privileged --net=host --ipc=host \
   {{ '--gpus=all' if nvidia_gpu_exist else '' }} \
-  {{ '--security-opt seccomp=unconfined --group-add video' if amd_gpu_exist else '' }} \
+  {{ '--security-opt seccomp=unconfined --group-add video --device=/dev/kfd --device=/dev/dri --cap-add=SYS_PTRACE --shm-size=16G' if amd_gpu_exist else '' }} \
   -w /root -v {{ workspace }}:/root -v /mnt:/mnt \
   -v /var/run/docker.sock:/var/run/docker.sock \
   --entrypoint /bin/bash {{ docker_image }} && \
@@ -66,6 +66,8 @@ class CudaNcclBwBenchmarkTest(BenchmarkTestCase, unittest.TestCase):
        assert (benchmark._args.iters == 20)
        assert (benchmark._args.warmup_iters == 5)
        assert (benchmark._args.graph_iters == 0)
+        assert (benchmark._args.in_place is False)
+        assert (benchmark._args.data_type == 'float')

        # Check command list
        bin_names = [
@@ -74,7 +76,7 @@ class CudaNcclBwBenchmarkTest(BenchmarkTestCase, unittest.TestCase):
        ]
        command = bin_names[0] + benchmark._commands[0].split(bin_names[0])[1]
-        expected_command = '{} -b 8 -e 8G -f 2 -g 8 -c 0 -n 20 -w 5 -G 0'.format(bin_names[0])
+        expected_command = '{} -b 8 -e 8G -f 2 -g 8 -c 0 -n 20 -w 5 -G 0 -d float'.format(bin_names[0])
        assert (command == expected_command)

        # Check results and metrics.
@@ -91,6 +93,11 @@ class CudaNcclBwBenchmarkTest(BenchmarkTestCase, unittest.TestCase):
            'alltoall': alltoall,
        }
+        if 'SB_MODE_SERIAL_INDEX' in os.environ:
+            os.environ.pop('SB_MODE_SERIAL_INDEX')
+        if 'SB_MODE_PARALLEL_INDEX' in os.environ:
+            os.environ.pop('SB_MODE_PARALLEL_INDEX')
+
        for op in raw_output.keys():
            benchmark._args.operation = op
            assert (benchmark._process_raw_result(0, raw_output[op]))
@@ -131,3 +138,48 @@ class CudaNcclBwBenchmarkTest(BenchmarkTestCase, unittest.TestCase):
        assert (benchmark.result['alltoall_0_0:8589934592_time'][0] == 33508.0)
        assert (benchmark.result['alltoall_0_0:8589934592_algbw'][0] == 256.36)
        assert (benchmark.result['alltoall_0_0:8589934592_busbw'][0] == 224.31)
+
+    @decorator.load_data('tests/data/nccl_allreduce.log')
+    @decorator.load_data('tests/data/nccl_alltoall.log')
+    def test_nccl_bw_performance_in_place_parsing(self, allreduce, alltoall):
+        """Test nccl-bw benchmark in-place parsing."""
+        benchmark_name = 'nccl-bw'
+        (benchmark_class,
+         predefine_params) = BenchmarkRegistry._BenchmarkRegistry__select_benchmark(benchmark_name, Platform.CUDA)
+        assert (benchmark_class)
+
+        benchmark = benchmark_class(benchmark_name, parameters='--ngpus 8 --in_place')
+        ret = benchmark._preprocess()
+        assert (ret is True)
+        assert (benchmark.return_code == ReturnCode.SUCCESS)
+        assert (benchmark._args.in_place is True)
+
+        # Case with valid raw_output
+        raw_output = {
+            'allreduce': allreduce,
+            'alltoall': alltoall,
+        }
+        if 'SB_MODE_SERIAL_INDEX' in os.environ:
+            os.environ.pop('SB_MODE_SERIAL_INDEX')
+        if 'SB_MODE_PARALLEL_INDEX' in os.environ:
+            os.environ.pop('SB_MODE_PARALLEL_INDEX')
+
+        for op in raw_output.keys():
+            benchmark._args.operation = op
+            assert (benchmark._process_raw_result(0, raw_output[op]))
+
+            for name in ['time', 'algbw', 'busbw']:
+                for size in ['8589934592', '4294967296', '2147483648', '1073741824', '536870912', '32']:
+                    metric = op + '_' + size + '_' + name
+                    assert (metric in benchmark.result)
+                    assert (len(benchmark.result[metric]) == 1)
+                    assert (isinstance(benchmark.result[metric][0], numbers.Number))
+
+        assert (benchmark.result['allreduce_8589934592_time'][0] == 63959.0)
+        assert (benchmark.result['allreduce_8589934592_algbw'][0] == 134.30)
+        assert (benchmark.result['allreduce_8589934592_busbw'][0] == 235.03)
+        assert (benchmark.result['alltoall_8589934592_time'][0] == 33234.0)
+        assert (benchmark.result['alltoall_8589934592_algbw'][0] == 258.47)
+        assert (benchmark.result['alltoall_8589934592_busbw'][0] == 226.16)
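The expected command string checked above is assembled from standard nccl-tests flags, with the new `-d` flag carrying the data type. A sketch of that flag assembly (hypothetical helper; the `-b/-e/-f/-g/-c/-n/-w/-G/-d` options mirror the expected_command in the test, and, as the in-place test suggests, `--in_place` changes which result columns are parsed rather than the command itself):

```python
def build_nccl_test_command(bin_path, ngpus=8, iters=20, warmup_iters=5,
                            graph_iters=0, data_type='float'):
    """Assemble an nccl-tests command like the expected_command above.

    -b/-e/-f sweep message sizes from 8 bytes to 8 GiB by factor 2,
    -c 0 disables result checking, and -d selects the element data type.
    """
    return (f'{bin_path} -b 8 -e 8G -f 2 -g {ngpus} -c 0 '
            f'-n {iters} -w {warmup_iters} -G {graph_iters} -d {data_type}')
```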
@@ -3,7 +3,6 @@
 """Tests for distributed inference benchmark."""

-import numbers
 import unittest

 from tests.helper import decorator
@@ -209,19 +208,17 @@ class DistInferenceCppImplTest(BenchmarkTestCase, unittest.TestCase):
        # step_times
        assert (len(benchmark.raw_data) == 2)
        # return code + (avg, 50th, 90th, 95th, 99th, 99.9th)
-        test_latency = float(test_raw_output.splitlines()[-1].split(' ms per iteration')[0].split()[-1])
        assert (7 == len(benchmark.result))
-        for output_key in benchmark.result:
-            if output_key == 'return_code':
-                assert (benchmark.result[output_key] == [0])
-            else:
-                assert (output_key.startswith('step_times'))
-                assert (len(benchmark.result[output_key]) == 1)
-                assert (isinstance(benchmark.result[output_key][0], numbers.Number))
-                assert (test_latency == benchmark.result[output_key][0])
+        assert (benchmark.result['return_code'] == [0])
+        assert (benchmark.result['step_times'] == [1.9052048])
+        assert (benchmark.result['step_times_50'] == [1.851])
+        assert (benchmark.result['step_times_90'] == [1.89637])
+        assert (benchmark.result['step_times_95'] == [2.12037])
+        assert (benchmark.result['step_times_99'] == [2.67155])
+        assert (benchmark.result['step_times_99.9'] == [4.4198])

        # Negative case - invalid raw output.
-        assert (benchmark._process_raw_result(1, 'Invalid raw output') is False)
+        assert (benchmark._process_raw_result(1, 'Latency of step: xxx ms') is False)
        assert (benchmark.return_code == ReturnCode.MICROBENCHMARK_RESULT_PARSING_FAILURE)

    @decorator.cuda_test
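The fixed expectations above come from parsing per-step latency lines and reporting an average plus percentile metrics, aligning dist-inference-cpp with the PyTorch version. A sketch of that aggregation (hypothetical parser and nearest-rank percentile; the real benchmark's percentile method may differ):

```python
import re


def parse_step_latencies(raw_output):
    """Extract latencies from lines like 'Latency of step 3: 1.85375 ms'."""
    return [float(m.group(1))
            for m in re.finditer(r'Latency of step \d+: ([\d.]+) ms', raw_output)]


def percentile(samples, pct):
    """Nearest-rank percentile over the sorted samples."""
    s = sorted(samples)
    idx = min(len(s) - 1, int(len(s) * pct / 100))
    return s[idx]
```

With the parsed list in hand, `step_times` is the mean and `step_times_50` through `step_times_99.9` are percentiles of the same samples.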
@@ -55,7 +55,7 @@ class HipblasLtBenchmarkTestCase(BenchmarkTestCase, unittest.TestCase):
        self.assertFalse(benchmark._preprocess())
        benchmark = benchmark_cls(
            self.benchmark_name,
-            parameters='--shapes 2:4,4:8,8:32 2:4,4:8,8:32:+4 --in_types fp16 fp32 bf16',
+            parameters='--shapes 2:4,4:8,8:32 2:4,4:8,8:32:+4 --in_types fp16 fp32 bf16 fp8',
        )
        self.assertTrue(benchmark._preprocess())
        self.assertEqual((2 * 2 * 3 + 2 * 2 * 7) * len(benchmark._args.in_types), len(benchmark._commands))
@@ -63,12 +63,16 @@ class HipblasLtBenchmarkTestCase(BenchmarkTestCase, unittest.TestCase):
        def cmd(t, b, m, n, k):
            if b == 0:
                return f'{benchmark._HipBlasLtBenchmark__bin_path} ' + \
-                    f'-m {m} -n {n} -k {k} -j 20 -i 50 {benchmark._in_type_map[t]}'
+                    f'-m {m} -n {n} -k {k} -j 20 -i 50 {benchmark._in_type_map[t]}' + \
+                    f' --transA {benchmark._args.transA} --transB {benchmark._args.transB}' + \
+                    f' --initialization {benchmark._args.initialization}'
            else:
                return f'{benchmark._HipBlasLtBenchmark__bin_path} ' + \
-                    f'-m {m} -n {n} -k {k} -j 20 -i 50 {benchmark._in_type_map[t]} -b {b}'
+                    f'-m {m} -n {n} -k {k} -j 20 -i 50 {benchmark._in_type_map[t]} -b {b}' + \
+                    f' --transA {benchmark._args.transA} --transB {benchmark._args.transB}' + \
+                    f' --initialization {benchmark._args.initialization}'

-        for _t in ['fp16', 'fp32', 'bf16']:
+        for _t in ['fp16', 'fp32', 'bf16', 'fp8']:
            for _m in [2, 4]:
                for _n in [4, 8]:
                    for _k in [8, 16, 32]:
@@ -102,7 +106,7 @@ N,N,0,1,896,896,896,1,896,802816,0,896,802816,896,802816,896,802816,fp16_r,f32_r
        self.assertEqual(ReturnCode.SUCCESS, benchmark.return_code)
        self.assertEqual(2, len(benchmark.result))
-        self.assertEqual(58624.5, benchmark.result['fp16_1_896_896_896_flops'][0])
+        self.assertEqual(58.6245, benchmark.result['fp16_1_896_896_896_flops'][0])

        # Negative case - invalid raw output
        self.assertFalse(benchmark._process_raw_result(1, 'HipBLAS API failed'))
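The expectation change above (58624.5 becoming 58.6245) reflects the hipblaslt metric unit moving from gflops to tflops, a factor-of-1000 rescale. A one-line conversion sketch (hypothetical helper name):

```python
def gflops_to_tflops(gflops):
    """Convert a GFLOPS reading to the TFLOPS unit used by the metric."""
    return gflops / 1000.0
```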
@@ -177,8 +177,7 @@ class MegatronGPTTest(BenchmarkTestCase, unittest.TestCase):
        benchmark._data_options = f'\
            --vocab-file {self._tmp_dir}/gpt2-vocab.json \
            --merge-file {self._tmp_dir}/gpt2-merges.txt \
-            --data-path {self._tmp_dir}/dataset_text_document \
-            --data-impl mmap'
+            --data-path {self._tmp_dir}/dataset_text_document'

        script_path = str(Path(self._tmp_dir) / 'pretrain_gpt.py')
        expected_command = 'torchrun {distributed_args} {script_path} \
@@ -197,7 +196,6 @@ class MegatronGPTTest(BenchmarkTestCase, unittest.TestCase):
            --num-attention-heads 32 \
            --seq-length 2048 \
            --max-position-embeddings 2048 \
-            --train-tokens 300000000000 \
            --train-samples 20480 \
            --lr 0.00012 \
            --min-lr 1e-06 \
@@ -215,7 +213,8 @@ class MegatronGPTTest(BenchmarkTestCase, unittest.TestCase):
            --optimizer adam \
            --use-distributed-optimizer \
            {precision} \
-            --seed 1234 {data_options}'
+            --seed 1234 \
+            --log-throughput {data_options}'

        precision = Precision.FLOAT32
        command = benchmark._megatron_command(precision)
@@ -262,12 +261,10 @@ class MegatronGPTTest(BenchmarkTestCase, unittest.TestCase):
        benchmark._data_options = f'\
            --vocab-file {self._tmp_dir}/gpt2-vocab.json \
            --merge-file {self._tmp_dir}/gpt2-merges.txt \
-            --data-path {self._tmp_dir}/dataset_text_document \
-            --data-impl mmap'
+            --data-path {self._tmp_dir}/dataset_text_document'

        command = benchmark._megatron_command(Precision.BFLOAT16)
-        expected_command = 'deepspeed {script_path} \
-            --override-opt_param-scheduler \
+        expected_command = 'deepspeed {script_path} --override-opt_param-scheduler \
            --adam-beta1 0.9 \
            --adam-beta2 0.95 \
            --tensor-model-parallel-size 1 \
@@ -282,7 +279,6 @@ class MegatronGPTTest(BenchmarkTestCase, unittest.TestCase):
            --num-attention-heads 32 \
            --seq-length 2048 \
            --max-position-embeddings 2048 \
-            --train-tokens 300000000000 \
            --train-samples 20480 \
            --lr 0.00012 \
            --min-lr 1e-06 \
@@ -306,7 +302,9 @@ class MegatronGPTTest(BenchmarkTestCase, unittest.TestCase):
            --deepspeed \
            --deepspeed_config {benchmark._config_json_path} \
            --zero-stage 1 \
-            --pipeline-model-parallel-size 1 --no-pipeline-parallel'
+            --pipeline-model-parallel-size 1 \
+            --train-tokens 300000000000 \
+            --data-impl mmap --no-pipeline-parallel'

        self.assertEqual(
            command,
@@ -346,12 +344,12 @@ class MegatronGPTTest(BenchmarkTestCase, unittest.TestCase):
        iteration_times, tflops, mem_allocated, max_mem_allocated = benchmark._parse_log(raw_output)
        assert (statistics.mean(iteration_times) == 75239.24)
        assert (statistics.mean(tflops) == 149.136)
-        assert (statistics.mean(mem_allocated) == 17.54)
-        assert (statistics.mean(max_mem_allocated) == 66.97)
+        assert (statistics.mean(mem_allocated) == 17.535637855529785)
+        assert (statistics.mean(max_mem_allocated) == 66.9744234085083)
        info = {'tflops': tflops, 'mem_allocated': mem_allocated, 'max_mem_allocated': max_mem_allocated}
        benchmark._process_info(ModelAction.TRAIN, Precision.FLOAT16, info)
        assert (benchmark.result is not None)
        assert (benchmark.result['fp16_train_tflops'][0] == 149.136)
-        assert (benchmark.result['fp16_train_mem_allocated'][0] == 17.54)
-        assert (benchmark.result['fp16_train_max_mem_allocated'][0] == 66.97)
+        assert (benchmark.result['fp16_train_mem_allocated'][0] == 17.535637855529785)
+        assert (benchmark.result['fp16_train_max_mem_allocated'][0] == 66.9744234085083)
@@ -1,2 +1,100 @@
-Parameters: m=80, n=128, k=128, alpha=1.000000, beta=1.000000, num_layers=50, num_warmups=20, num_iters=100, use_cuda_graph=0
-Time: 173 ms in total, 1.73 ms per iteration, 0.0346 ms per layer
Latency of step 0: 1.8339 ms
Latency of step 1: 1.84222 ms
Latency of step 2: 1.90869 ms
Latency of step 3: 1.85375 ms
Latency of step 4: 1.87192 ms
Latency of step 5: 1.84254 ms
Latency of step 6: 1.91165 ms
Latency of step 7: 1.8214 ms
Latency of step 8: 1.91427 ms
Latency of step 9: 1.89586 ms
Latency of step 10: 1.86816 ms
Latency of step 11: 1.85105 ms
Latency of step 12: 1.84486 ms
Latency of step 13: 1.84915 ms
Latency of step 14: 1.82332 ms
Latency of step 15: 1.91444 ms
Latency of step 16: 1.85073 ms
Latency of step 17: 1.81812 ms
Latency of step 18: 2.67155 ms
Latency of step 19: 1.85119 ms
Latency of step 20: 1.87989 ms
Latency of step 21: 1.83932 ms
Latency of step 22: 1.84041 ms
Latency of step 23: 1.84789 ms
Latency of step 24: 1.85079 ms
Latency of step 25: 1.82229 ms
Latency of step 26: 1.83376 ms
Latency of step 27: 1.851 ms
Latency of step 28: 1.86246 ms
Latency of step 29: 1.8371 ms
Latency of step 30: 1.88932 ms
Latency of step 31: 1.84459 ms
Latency of step 32: 1.82725 ms
Latency of step 33: 1.83566 ms
Latency of step 34: 1.84041 ms
Latency of step 35: 1.87058 ms
Latency of step 36: 1.84038 ms
Latency of step 37: 1.85555 ms
Latency of step 38: 1.85848 ms
Latency of step 39: 2.40561 ms
Latency of step 40: 1.85029 ms
Latency of step 41: 1.84562 ms
Latency of step 42: 1.8351 ms
Latency of step 43: 1.84196 ms
Latency of step 44: 1.86032 ms
Latency of step 45: 1.87147 ms
Latency of step 46: 1.84832 ms
Latency of step 47: 1.85715 ms
Latency of step 48: 1.86012 ms
Latency of step 49: 1.86327 ms
Latency of step 50: 1.84388 ms
Latency of step 51: 1.86396 ms
Latency of step 52: 1.85538 ms
Latency of step 53: 1.85564 ms
Latency of step 54: 1.83979 ms
Latency of step 55: 1.85334 ms
Latency of step 56: 1.85712 ms
Latency of step 57: 1.85284 ms
Latency of step 58: 1.84534 ms
Latency of step 59: 1.86041 ms
Latency of step 60: 1.86305 ms
Latency of step 61: 2.2213 ms
Latency of step 62: 1.83054 ms
Latency of step 63: 4.4198 ms
Latency of step 64: 1.87245 ms
Latency of step 65: 1.83845 ms
Latency of step 66: 1.82047 ms
Latency of step 67: 1.81191 ms
Latency of step 68: 1.83887 ms
Latency of step 69: 1.8463 ms
Latency of step 70: 2.12037 ms
Latency of step 71: 1.85782 ms
Latency of step 72: 1.84939 ms
Latency of step 73: 1.82054 ms
Latency of step 74: 1.8866 ms
Latency of step 75: 1.83937 ms
Latency of step 76: 1.84167 ms
Latency of step 77: 1.89637 ms
Latency of step 78: 1.8392 ms
Latency of step 79: 1.83754 ms
Latency of step 80: 1.84721 ms
Latency of step 81: 1.88112 ms
Latency of step 82: 1.84474 ms
Latency of step 83: 1.84084 ms
Latency of step 84: 1.85134 ms
Latency of step 85: 1.85315 ms
Latency of step 86: 1.83406 ms
Latency of step 87: 1.87803 ms
Latency of step 88: 1.8369 ms
Latency of step 89: 1.85909 ms
Latency of step 90: 1.84519 ms
Latency of step 91: 2.52689 ms
Latency of step 92: 1.86594 ms
Latency of step 93: 1.86974 ms
Latency of step 94: 1.85219 ms
Latency of step 95: 1.86255 ms
Latency of step 96: 1.82652 ms
Latency of step 97: 1.84379 ms
Latency of step 98: 1.84553 ms
Latency of step 99: 1.87082 ms
third_party/Makefile
@ -7,18 +7,20 @@ MPI_HOME ?= /usr/local/mpi
HIP_HOME ?= /opt/rocm/hip HIP_HOME ?= /opt/rocm/hip
RCCL_HOME ?= /opt/rocm/rccl RCCL_HOME ?= /opt/rocm/rccl
HPCX_HOME ?= /opt/hpcx HPCX_HOME ?= /opt/hpcx
ROCM_PATH ?= /opt/rocm
CUDA_VER ?= $(shell nvcc --version | grep 'release' | awk '{print $$6}' | cut -c2- | cut -d '.' -f1-2) CUDA_VER ?= $(shell nvcc --version | grep 'release' | awk '{print $$6}' | cut -c2- | cut -d '.' -f1-2)
ROCBLAS_BRANCH ?= rocm-$(shell dpkg -l | grep 'rocm-dev ' | awk '{print $$3}' | cut -d '.' -f1-3) ROCBLAS_BRANCH ?= rocm-$(shell dpkg -l | grep 'rocm-dev ' | awk '{print $$3}' | cut -d '.' -f1-3)
HIPBLASLT_BRANCH ?= rocm-$(shell dpkg -l | grep 'rocm-dev ' | awk '{print $$3}' | cut -d '.' -f1-3) HIPBLASLT_BRANCH ?= rocm-$(shell dpkg -l | grep 'rocm-dev ' | awk '{print $$3}' | cut -d '.' -f1-3)
ROCM_VER ?= $(shell hipconfig -R | grep -oP '\d+\.\d+\.\d+' || echo "0.0.0")
.PHONY: all cuda_with_msccl cuda rocm common cuda_cutlass cuda_bandwidthTest cuda_nccl_tests cuda_perftest cuda_msccl rocm_perftest fio rocm_rccl_tests rocm_rocblas rocm_bandwidthTest gpcnet cuda_gpuburn cpu_stream cpu_hpl directx_amf_encoding_latency directx_amd rocm_hipblaslt megatron_lm megatron_deepspeed .PHONY: all cuda_with_msccl cuda rocm common cuda_cutlass cuda_bandwidthTest cuda_nccl_tests cuda_perftest cuda_msccl rocm_perftest fio rocm_rccl_tests rocm_rocblas rocm_bandwidthTest gpcnet cuda_gpuburn cpu_stream cpu_hpl directx_amf_encoding_latency directx_amd rocm_hipblaslt megatron_lm megatron_deepspeed apex_rocm
# Build all targets. # Build all targets.
all: cuda rocm all: cuda rocm
cuda_with_msccl: cuda cuda_msccl cuda_with_msccl: cuda cuda_msccl
cuda: common cuda_cutlass cuda_bandwidthTest cuda_nccl_tests cuda_perftest gpcnet cuda_gpuburn megatron_lm megatron_deepspeed cuda: common cuda_cutlass cuda_bandwidthTest cuda_nccl_tests cuda_perftest gpcnet cuda_gpuburn megatron_lm megatron_deepspeed
rocm: common rocm_perftest rocm_rccl_tests rocm_rocblas rocm_bandwidthTest rocm_hipblaslt megatron_deepspeed rocm: common rocm_perftest rocm_rccl_tests rocm_rocblas rocm_bandwidthTest rocm_hipblaslt megatron_deepspeed apex_rocm
cpu: common cpu_perftest cpu: common cpu_perftest
common: cpu_hpl cpu_stream fio common: cpu_hpl cpu_stream fio
directx_amd: directx_amf_encoding_latency directx_amd: directx_amf_encoding_latency
@ -62,7 +64,7 @@ endif
cuda_nccl_tests: sb_micro_path cuda_nccl_tests: sb_micro_path
ifneq (,$(wildcard nccl-tests/Makefile)) ifneq (,$(wildcard nccl-tests/Makefile))
cd ./nccl-tests && make MPI=1 MPI_HOME=$(MPI_HOME) -j cd ./nccl-tests && make MPI=1 MPI_HOME=$(MPI_HOME) -j
cp -v ./nccl-tests/build/* $(SB_MICRO_PATH)/bin/ cp -v -r ./nccl-tests/build/* $(SB_MICRO_PATH)/bin/
endif endif
# Build perftest. # Build perftest.
@@ -86,11 +88,11 @@ ifneq (,$(wildcard fio/Makefile))
    cd ./fio && ./configure --prefix=$(SB_MICRO_PATH) --disable-native && make -j && make install
endif
-# Build rccl-tests from commit 2a18737 of default branch.
+# Build rccl-tests from commit 46375b1 of default branch.
rocm_rccl_tests: sb_micro_path
ifneq (, $(wildcard rccl-tests/Makefile))
-   cd ./rccl-tests && make MPI=1 MPI_HOME=$(MPI_HOME) HIP_HOME=$(HIP_HOME) RCCL_HOME=$(RCCL_HOME) -j
+   cd ./rccl-tests && make MPI=1 MPI_HOME=$(MPI_HOME) -j
-   cp -v ./rccl-tests/build/* $(SB_MICRO_PATH)/bin/
+   cp -v -r ./rccl-tests/build/* $(SB_MICRO_PATH)/bin/
endif
# Build rocblas-bench.
@@ -175,42 +177,58 @@ directx_amf_encoding_latency:
    "C:\temp\BuildTools\MSBuild\Current\Bin\MSBuild.exe" "AMF\amf\public\samples\CPPSamples_vs2019.sln" /t:EncoderLatency /p:Platform=x64 /p:Configuration=Release /p:OutDir="%SB_MICRO_PATH%\bin" \
)
-# Install Megatron-LM
+# Install requirements for Megatron-LM
megatron_lm:
-   if [ ! -d "Megatron/Megatron-LM" ]; then \
-       git clone "https://github.com/NVIDIA/Megatron-LM.git" "Megatron/Megatron-LM"; \
-   fi
    cd Megatron && \
-   python -m pip install -r requirements.txt
+   apt install -y python3-mpi4py && \
+   python -m pip install --no-cache-dir -r requirements.txt
-# Install Megatron-DeepSpeed
+# Install requirements for Megatron-DeepSpeed
megatron_deepspeed:
-   if [ ! -d "Megatron/Megatron-DeepSpeed" ]; then \
-       git clone "https://github.com/microsoft/Megatron-DeepSpeed.git" "Megatron/Megatron-DeepSpeed"; \
-   fi
    cd Megatron && \
-   python -m pip install -r requirements.txt && \
+   apt install -y python3-mpi4py && \
+   python -m pip install --no-cache-dir -r requirements.txt && \
    python -m pip install DeepSpeed
+# Install apex of ROCm due to dependency of Megatron
+apex_rocm:
+   $(eval TORCH_VERSION ?= $(shell python -c "import torch; print(torch.__version__)"))
+   $(eval TORCH_MAJOR_VERSION ?= $(word 1,$(subst ., ,$(TORCH_VERSION))))
+   $(eval TORCH_MINOR_VERSION ?= $(word 2,$(subst ., ,$(TORCH_VERSION))))
+   if [ ! -d "apex" ]; then \
+       git clone https://github.com/ROCmSoftwarePlatform/apex.git ; \
+   fi
+   cd apex && \
+   if [ "$$(expr $(TORCH_MAJOR_VERSION) \> 2)" -eq 1 ] && [ "$$(expr $(TORCH_MINOR_VERSION) \> 1)" -eq 1 ]; then \
+       git checkout master ; \
+   elif [ "$$(expr $(TORCH_MAJOR_VERSION) == 2)" -eq 1 ] && [ "$$(expr $(TORCH_MINOR_VERSION) == 1)" -eq 1 ]; then \
+       git checkout release/1.1.0 ; \
+   elif [ "$$(expr $(TORCH_MAJOR_VERSION) == 2)" -eq 1 ] && [ "$$(expr $(TORCH_MINOR_VERSION) == 0)" -eq 1 ]; then \
+       git checkout release/1.0.0 ; \
+   elif [ "$$(expr $(TORCH_MAJOR_VERSION) == 1)" -eq 1 ]; then \
+       git checkout release/1.0.0 ; \
+   fi
+   pip install -v --disable-pip-version-check --no-build-isolation --config-settings "--build-option=--cpp_ext" --config-settings "--build-option=--cuda_ext" ./apex
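The `apex_rocm` rule keys the apex checkout off the installed torch version. A standalone sketch of the same mapping, where the `apex_branch` helper is hypothetical (not part of the Makefile) and mirrors the rule's conditionals:

```shell
# Hypothetical helper: map a torch version string to the apex branch
# the apex_rocm rule would check out.
apex_branch() {
    major=${1%%.*}          # text before the first dot
    rest=${1#*.}
    minor=${rest%%.*}       # text between the first and second dots
    if [ "$major" -gt 2 ] && [ "$minor" -gt 1 ]; then
        echo master
    elif [ "$major" -eq 2 ] && [ "$minor" -eq 1 ]; then
        echo release/1.1.0
    elif [ "$major" -eq 2 ] && [ "$minor" -eq 0 ]; then
        echo release/1.0.0
    elif [ "$major" -eq 1 ]; then
        echo release/1.0.0
    else
        echo master         # fall through: the rule keeps the default branch
    fi
}
apex_branch "2.1.0"   # -> release/1.1.0
apex_branch "1.13.1"  # -> release/1.0.0
```

Note the fall-through case: for a torch version matching none of the branches (e.g. 2.2), the Makefile rule performs no checkout and builds whatever branch the clone left checked out.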
# Build MSCCL for CUDA
cuda_msccl: sb_micro_path
ifneq (,$(wildcard msccl/executor/msccl-executor-nccl/Makefile))
    cd ./msccl/executor/msccl-executor-nccl && \
-   make -j4 src.build && \
+   make -j $(shell nproc --ignore=2) src.build && \
    cd ../../..
    mkdir -p $(SB_MICRO_PATH)/lib/msccl-executor-nccl && \
    cp -r -v ./msccl/executor/msccl-executor-nccl/build/* $(SB_MICRO_PATH)/lib/msccl-executor-nccl/
endif
ifneq (,$(wildcard msccl/scheduler/msccl-scheduler/Makefile))
    cd ./msccl/scheduler/msccl-scheduler && \
-   CXX=nvcc BIN_HOME=$(SB_MICRO_PATH)/lib/msccl-executor-nccl SRC_HOME=../../../msccl/executor/msccl-executor-nccl make -j4 && \
+   CXX=nvcc BIN_HOME=$(SB_MICRO_PATH)/lib/msccl-executor-nccl SRC_HOME=../../../msccl/executor/msccl-executor-nccl make -j $(shell nproc --ignore=2) && \
    cd ../../..
    mkdir -p $(SB_MICRO_PATH)/lib/msccl-scheduler && \
    cp -r -v ./msccl/scheduler/msccl-scheduler/build/* $(SB_MICRO_PATH)/lib/msccl-scheduler/
endif
ifneq (,$(wildcard msccl/tests/msccl-tests-nccl/Makefile))
    cd ./msccl/tests/msccl-tests-nccl && \
-   make MPI=1 MPI_HOME=$(MPI_HOME) NCCL_HOME=$(SB_MICRO_PATH)/lib/msccl-executor-nccl -j4 && cd ../../..
+   make MPI=1 MPI_HOME=$(MPI_HOME) NCCL_HOME=$(SB_MICRO_PATH)/lib/msccl-executor-nccl -j $(shell nproc --ignore=2) && cd ../../..
    mkdir -p $(SB_MICRO_PATH)/bin/msccl-tests-nccl && \
    cp -r -v ./msccl/tests/msccl-tests-nccl/build/* $(SB_MICRO_PATH)/bin/msccl-tests-nccl/
endif
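The `-j4` → `-j $(shell nproc --ignore=2)` changes above scale build parallelism to the host instead of a fixed four jobs. A quick illustration of what `nproc --ignore=2` yields (GNU coreutils; it never reports less than 1, so builds still proceed on small VMs):

```shell
# Reserve two cores for the rest of the system when choosing make -j.
jobs=$(nproc --ignore=2)
echo "parallel make jobs: $jobs"
```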

third_party/Megatron/Megatron-DeepSpeed (new vendored submodule)

@@ -0,0 +1 @@
+Subproject commit 71e8407c98bacacb002823ea587c321fe58b28a6

third_party/Megatron/Megatron-LM (new vendored submodule)

@@ -0,0 +1 @@
+Subproject commit 52b7a18a00bced8b3670eededfd58ee0c4bd7d06

third_party/Megatron/megatron_deepspeed_rocm6.patch (new vendored file)

@@ -0,0 +1,26 @@
diff --git a/megatron/fused_kernels/scaled_softmax_cuda.cu b/megatron/fused_kernels/scaled_softmax_cuda.cu
index 90e1c9f..d217aec 100644
--- a/megatron/fused_kernels/scaled_softmax_cuda.cu
+++ b/megatron/fused_kernels/scaled_softmax_cuda.cu
@@ -4,7 +4,7 @@
#include <cuda.h>
#include <cuda_runtime.h>
#include <cuda_fp16.h>
-#ifndef __HIP_PLATFORM_HCC__
+#ifndef __HIP_PLATFORM_AMD__
#include <cuda_profiler_api.h>
#endif
#include <ATen/cuda/CUDAContext.h>
diff --git a/megatron/fused_kernels/scaled_upper_triang_masked_softmax_cuda.cu b/megatron/fused_kernels/scaled_upper_triang_masked_softmax_cuda.cu
index 74c9f3d..03b5fc8 100644
--- a/megatron/fused_kernels/scaled_upper_triang_masked_softmax_cuda.cu
+++ b/megatron/fused_kernels/scaled_upper_triang_masked_softmax_cuda.cu
@@ -4,7 +4,7 @@
#include <cuda.h>
#include <cuda_runtime.h>
#include <cuda_fp16.h>
-#ifndef __HIP_PLATFORM_HCC__
+#ifndef __HIP_PLATFORM_AMD__
#include <cuda_profiler_api.h>
#endif
#include <ATen/cuda/CUDAContext.h>

third_party/Megatron/requirements.txt (vendored)

@@ -10,4 +10,6 @@ tqdm
sentencepiece
wandb
einops
-typing_extensions==4.5.0
+typing_extensions==4.9.0
+apex
+mpi4py

third_party/nccl-tests (vendored submodule)

@@ -1 +1 @@
-Subproject commit 8274cb47b6dc70ce4411e7f114b77173d3892414
+Subproject commit 1292b25553bd0384f2faa2965f9d82b99797a348

third_party/perftest (vendored submodule)

@@ -1 +1 @@
-Subproject commit 5fb4f10a7e7827ed15e53c25810a10be279d6e23
+Subproject commit dffd1dd8b8a26dad2634a546e7e4d082dc882fbc

third_party/perftest_rocm6.patch (new vendored file)

@@ -0,0 +1,28 @@
diff --git a/configure.ac b/configure.ac
index 20eceda..c8f0c07 100755
--- a/configure.ac
+++ b/configure.ac
@@ -237,7 +237,7 @@ AC_ARG_WITH([rocm],
],
[AS_CASE([$with_rocm],
[yes|no], [],
- [CPPFLAGS="-I$with_rocm/include $CPPFLAGS"
+ [CPPFLAGS="-I$with_rocm/include -D__HIP_PLATFORM_AMD__=1 $CPPFLAGS"
LDFLAGS="-L$with_rocm/lib64 -Wl,-rpath=$with_rocm/lib64 -L$with_rocm/lib -Wl,-rpath=$with_rocm/lib -lamdhip64 $LDFLAGS"])
])
diff --git a/src/rocm_memory.c b/src/rocm_memory.c
index e9a9136..b6cb23a 100644
--- a/src/rocm_memory.c
+++ b/src/rocm_memory.c
@@ -44,8 +44,8 @@ static int init_rocm(int device_id) {
hipDeviceProp_t prop = {0};
ROCM_CHECK(hipGetDeviceProperties(&prop, device_id));
- printf("Using ROCm Device with ID: %d, Name: %s, PCI Bus ID: 0x%x, GCN Arch: %d\n",
- device_id, prop.name, prop.pciBusID, prop.gcnArch);
+ printf("Using ROCm Device with ID: %d, Name: %s, PCI Bus ID: 0x%x, GCN Arch: %s\n",
+ device_id, prop.name, prop.pciBusID, prop.gcnArchName);
return SUCCESS;
}

2
third_party/rccl-tests поставляемый

@ -1 +1 @@
Subproject commit 2a18737dc681e03ce82c046caa71b28db65017b5 Subproject commit 46375b1c527b2e3afe80fdd6dd136151bd939675

@@ -0,0 +1,53 @@
---
slug: release-sb-v0.10
title: Releasing SuperBench v0.10
author: Peng Cheng
author_title: SuperBench Team
author_url: https://github.com/cp5555
author_image_url: https://github.com/cp5555.png
tags: [superbench, announcement, release]
---
We are very happy to announce that **SuperBench v0.10.0** is officially released today!
You can install and try SuperBench by following the [Getting Started Tutorial](https://microsoft.github.io/superbenchmark/docs/getting-started/installation).
## SuperBench 0.10.0 Release Notes
### SuperBench Improvements
- Support monitoring for AMD GPUs.
- Support ROCm 5.7 and ROCm 6.0 dockerfile.
- Add MSCCL support for NVIDIA GPUs.
- Fix NUMA domains swap issue in NDv4 topology file.
- Add NDv5 topo file.
- Pin NCCL and nccl-tests to 2.18.3 to resolve the hang issue in CUDA 12.2.
### Micro-benchmark Improvements
- Add HPL random generator to gemm-flops with ROCm.
- Add DirectXGPURenderFPS benchmark to measure the FPS of rendering simple frames.
- Add HWDecoderFPS benchmark to measure the FPS of hardware decoder performance.
- Update Docker image for H100 support.
- Update MLC version to 3.10 for CUDA/ROCm dockerfiles.
- Bug fix for GPU Burn test.
- Support INT8 in cublaslt function.
- Add hipBLASLt function benchmark.
- Support cpu-gpu and gpu-cpu in ib-validation.
- Support graph mode in NCCL/RCCL benchmarks for latency metrics.
- Support cpp implementation in distributed inference benchmark.
- Add O2 option for gpu copy ROCm build.
- Support different hipblasLt data types in dist inference.
- Support in-place in NCCL/RCCL benchmark.
- Support data type option in NCCL/RCCL benchmark.
- Improve P2P performance with fine-grained GPU memory in GPU-copy test for AMD GPUs.
- Update hipblaslt GEMM metric unit to tflops.
- Support FP8 for hipblaslt benchmark.
### Model Benchmark Improvements
- Change torch.distributed.launch to torchrun.
- Support Megatron-LM/Megatron-DeepSpeed GPT pretraining benchmark.
### Result Analysis
- Support baseline generation from multiple nodes.

@@ -101,7 +101,7 @@ module.exports = {
    announcementBar: {
      id: 'supportus',
      content:
-        '📢 <a href="https://microsoft.github.io/superbenchmark/blog/release-sb-v0.9">v0.9.0</a> has been released! ' +
+        '📢 <a href="https://microsoft.github.io/superbenchmark/blog/release-sb-v0.10">v0.10.0</a> has been released! ' +
        '⭐️ If you like SuperBench, give it a star on <a target="_blank" rel="noopener noreferrer" href="https://github.com/microsoft/superbenchmark">GitHub</a>! ⭐️',
    },
    algolia: {

website/package-lock.json (generated)

@@ -1,6 +1,6 @@
{
  "name": "superbench-website",
-  "version": "0.9.0",
+  "version": "0.10.0",
  "lockfileVersion": 1,
  "requires": true,
  "dependencies": {
@@ -11678,4 +11678,4 @@
      "integrity": "sha512-V50KMwwzqJV0NpZIZFwfOD5/lyny3WlSzRiXgA0G7VUnRlqttta1L6UQIHzd6EuBY/cHGfwTIck7w1yH6Q5zUw=="
    }
  }
}

@@ -1,6 +1,6 @@
{
  "name": "superbench-website",
-  "version": "0.9.0",
+  "version": "0.10.0",
  "private": true,
  "scripts": {
    "docusaurus": "docusaurus",
@@ -38,4 +38,4 @@
    "last 1 safari version"
  ]
}
}