**Description**

Cherry-pick bug fixes from v0.3.0 to main.

**Major Revisions**
* Docs - Upgrade version and release note (#209)
* Benchmarks: Build Pipeline - Update rccl-tests git submodule to dc1ad48 (#210)
* Benchmarks: Update - Update benchmarks in configuration file (#208)
* CI/CD - Update GitHub Action VM (#211)
* Benchmarks: Fix Bug - Fix wrong parameters for gpu-sm-copy-bw in configuration examples (#203)
* CI/CD - Fix bug in build image for push event (#205)
* Benchmarks: Fix Bug - Fix error message of computation-communication-overlap (#204)
* Tool: Fix Bug - Fix function naming issue in system info (#200)
* CI/CD - Push images in GitHub Action (#202)
* Bug - Fix torch.distributed command for single node (#201)
* CLI - Integrate system info for node (#199)
* Benchmarks: Code Revision - Revise CMake files for microbenchmarks (#196)
* CI/CD - Add ROCm image build in GitHub Actions (#194)
* Bug: Fix Bug - Fix hipBusBandwidth build (#193)
* Benchmarks: Build Pipeline - Restore rocBLAS build logic (#197)
* Bug: Fix Bug - Add barrier before 'destroy_process_group' in model benchmarks (#198)
* Bug - Revise 'docker run' in sb deploy (#195)
* Bug: Fix Bug - Fix wrong parameter name 'operations' to 'operation' for rccl-bw in HPE config (#190)

Co-authored-by: Yuting Jiang <v-yujiang@microsoft.com>
Co-authored-by: Guoshuai Zhao <guzhao@microsoft.com>
Co-authored-by: Ziyue Yang <ziyyang@microsoft.com>
This commit is contained in:
Yifan Xiong 2021-09-26 09:30:31 +08:00 committed by GitHub
Parent 37b15db92c
Commit dfbd70b129
No key found matching this signature
GPG key ID: 4AEE18F83AFDEB23
35 changed files: 538 additions and 251 deletions

.github/workflows/build-image.yml

@@ -4,15 +4,32 @@ on:
push:
branches:
- main
- release/*
pull_request:
branches:
- main
- release/*
release:
types:
- published
workflow_dispatch:
jobs:
docker:
name: Docker build
name: Docker build ${{ matrix.name }}
runs-on: ubuntu-latest
permissions:
contents: read
packages: write
strategy:
matrix:
include:
- name: cuda11.1.1
tags: superbench/main:cuda11.1.1,superbench/superbench:latest
- name: rocm4.2-pytorch1.7.0
tags: superbench/main:rocm4.2-pytorch1.7.0
- name: rocm4.0-pytorch1.7.0
tags: superbench/main:rocm4.0-pytorch1.7.0
steps:
- name: Checkout
uses: actions/checkout@v2
@@ -26,18 +43,29 @@ jobs:
done
sudo apt-get clean
df -h
echo 'nproc: '$(nproc)
- name: Prepare metadata
id: metadata
run: |
DOCKER_IMAGE=superbench/superbench
IMAGE_TAG=latest
TAGS=${{ matrix.tags }}
if [[ "${{ github.event_name }}" == "push" ]] && [[ "${{ github.ref }}" == "refs/heads/release/"* ]]; then
TAGS=$(sed "s/main:/release:${GITHUB_REF##*/}-/g" <<< ${TAGS})
fi
if [[ "${{ github.event_name }}" == "pull_request" ]] && [[ "${{ github.base_ref }}" == "release/"* ]]; then
TAGS=$(sed "s/main:/release:${GITHUB_BASE_REF##*/}-/g" <<< ${TAGS})
fi
if [[ "${{ github.event_name }}" == "release" ]]; then
TAGS=$(sed "s/main:/superbench:${GITHUB_REF##*/}-/g" <<< ${TAGS})
GHCR_TAG=$(cut -d, -f1 <<< ${TAGS} | sed "s#superbench/superbench#ghcr.io/${{ github.repository }}/superbench#g")
TAGS="${TAGS},${GHCR_TAG}"
fi
if [[ "${{ github.event_name }}" == "workflow_dispatch" ]]; then
TAGS=$(sed "s/main:/dev:/g" <<< ${TAGS})
fi
DOCKERFILE=dockerfile/${{ matrix.name }}.dockerfile
DOCKERFILE=dockerfile/cuda11.1.1.dockerfile
TAGS="${DOCKER_IMAGE}:${IMAGE_TAG}"
CACHE_FROM="type=registry,ref=${DOCKER_IMAGE}:${IMAGE_TAG}"
CACHE_FROM="type=registry,ref=$(cut -d, -f1 <<< ${TAGS})"
CACHE_TO=""
if [ "${{ github.event_name }}" = "push" ]; then
if [[ "${{ github.event_name }}" != "pull_request" ]]; then
CACHE_TO="type=inline,mode=max"
fi
@@ -45,16 +73,25 @@ jobs:
echo ::set-output name=tags::${TAGS}
echo ::set-output name=cache_from::${CACHE_FROM}
echo ::set-output name=cache_to::${CACHE_TO}
- name: Echo image tag
run: echo ${{ steps.metadata.outputs.tags }}
- name: Set up QEMU
uses: docker/setup-qemu-action@v1
- name: Set up Docker Buildx
uses: docker/setup-buildx-action@v1
- name: Login to Docker Hub
uses: docker/login-action@v1
if: ${{ github.event_name == 'push' }}
if: ${{ github.event_name != 'pull_request' }}
with:
username: ${{ secrets.DOCKERHUB_USERNAME }}
password: ${{ secrets.DOCKERHUB_TOKEN }}
- name: Login to the GitHub Container Registry
uses: docker/login-action@v1
if: ${{ github.event_name == 'release' }}
with:
registry: ghcr.io
username: ${{ github.actor }}
password: ${{ secrets.GITHUB_TOKEN }}
- name: Build and push
id: docker_build
uses: docker/build-push-action@v2
@@ -62,7 +99,7 @@ jobs:
platforms: linux/amd64
context: .
file: ${{ steps.metadata.outputs.dockerfile }}
push: ${{ github.event_name == 'push' }}
push: ${{ github.event_name != 'pull_request' }}
tags: ${{ steps.metadata.outputs.tags }}
cache-from: ${{ steps.metadata.outputs.cache_from }}
cache-to: ${{ steps.metadata.outputs.cache_to }}
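In summary, the `Prepare metadata` step above rewrites the `main:` prefix of the matrix tags per CI event. A rough Python sketch of the sed logic (illustrative only; the release event additionally appends a ghcr.io tag, omitted here):

```python
def rewrite_tags(tags: str, event: str, ref: str = '', base_ref: str = '') -> str:
    """Mirror the sed commands in the workflow's metadata step."""
    if event == 'push' and ref.startswith('refs/heads/release/'):
        # ${GITHUB_REF##*/} keeps only the last path component.
        tags = tags.replace('main:', f"release:{ref.rsplit('/', 1)[-1]}-")
    elif event == 'pull_request' and base_ref.startswith('release/'):
        tags = tags.replace('main:', f"release:{base_ref.rsplit('/', 1)[-1]}-")
    elif event == 'release':
        tags = tags.replace('main:', f"superbench:{ref.rsplit('/', 1)[-1]}-")
    elif event == 'workflow_dispatch':
        tags = tags.replace('main:', 'dev:')
    return tags

print(rewrite_tags('superbench/main:cuda11.1.1,superbench/superbench:latest',
                   'release', 'refs/tags/v0.3.0'))
# superbench/superbench:v0.3.0-cuda11.1.1,superbench/superbench:latest
```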

.github/workflows/lint.yml

@@ -9,7 +9,7 @@ on:
jobs:
spelling:
name: Spelling check
runs-on: ubuntu-16.04
runs-on: ubuntu-latest
steps:
- name: Checkout
uses: actions/checkout@v2


@@ -15,7 +15,7 @@
__SuperBench__ is a validation and profiling tool for AI infrastructure.
📢 [v0.2.1](https://github.com/microsoft/superbenchmark/releases/tag/v0.2.1) has been released!
📢 [v0.3.0](https://github.com/microsoft/superbenchmark/releases/tag/v0.3.0) has been released!
## _Check [aka.ms/superbench](https://aka.ms/superbench) for more details._


@@ -88,7 +88,7 @@ ENV PATH="${PATH}" \
WORKDIR ${SB_HOME}
ADD third_party third_party
RUN ROCM_VERSION=rocm-4.0.0 make -j -C third_party rocm
RUN ROCM_VERSION=rocm-4.0.0 make -j -C third_party -o rocm_rocblas rocm
# Workaround for image having package installed in user path
RUN mv /root/.local/bin/* /opt/conda/bin/ && \


@@ -36,7 +36,10 @@ docker buildx build \
<TabItem value='rocm'>
```bash
# coming soon
export DOCKER_BUILDKIT=1
docker buildx build \
--platform linux/amd64 --cache-to type=inline,mode=max \
--tag superbench-dev --file dockerfile/rocm4.2-pytorch1.7.0.dockerfile .
```
</TabItem>


@@ -57,7 +57,7 @@ You can clone the source from GitHub and build it.
:::note Note
You should checkout corresponding tag to use release version, for example,
`git clone -b v0.2.1 https://github.com/microsoft/superbenchmark`
`git clone -b v0.3.0 https://github.com/microsoft/superbenchmark`
:::
```bash


@@ -27,7 +27,7 @@ sb deploy -f remote.ini --host-password [password]
:::note Note
You should deploy corresponding Docker image to use release version, for example,
`sb deploy -f local.ini -i superbench/superbench:v0.2.1-cuda11.1.1`
`sb deploy -f local.ini -i superbench/superbench:v0.3.0-cuda11.1.1`
:::
## Run


@@ -66,7 +66,7 @@ superbench:
<TabItem value='example'>
```yaml
version: v0.2
version: v0.3
superbench:
enable: benchmark_1
var:


@@ -29,13 +29,17 @@ available tags are listed below for all stable versions.
| Tag | Description |
| ----------------- | ---------------------------------- |
| v0.3.0-cuda11.1.1 | SuperBench v0.3.0 with CUDA 11.1.1 |
| v0.2.1-cuda11.1.1 | SuperBench v0.2.1 with CUDA 11.1.1 |
| v0.2.0-cuda11.1.1 | SuperBench v0.2.0 with CUDA 11.1.1 |
</TabItem>
<TabItem value='rocm'>
Coming soon.
| Tag | Description |
| --------------------------- | ---------------------------------------------- |
| v0.3.0-rocm4.2-pytorch1.7.0 | SuperBench v0.3.0 with ROCm 4.2, PyTorch 1.7.0 |
| v0.3.0-rocm4.0-pytorch1.7.0 | SuperBench v0.3.0 with ROCm 4.0, PyTorch 1.7.0 |
</TabItem>
</Tabs>


@@ -6,5 +6,5 @@
Provide hardware and software benchmarks for AI systems.
"""
__version__ = '0.2.1'
__version__ = '0.3.0'
__author__ = 'Microsoft'


@@ -3,6 +3,7 @@
# Copyright (c) Microsoft Corporation - All rights reserved
# Licensed under the MIT License
set -e
SB_MICRO_PATH="${SB_MICRO_PATH:-/usr/local}"
@@ -12,6 +13,7 @@ for dir in micro_benchmarks/*/ ; do
BUILD_ROOT=$dir/build
mkdir -p $BUILD_ROOT
cmake -DCMAKE_INSTALL_PREFIX=$SB_MICRO_PATH -DCMAKE_BUILD_TYPE=Release -S $SOURCE_DIR -B $BUILD_ROOT
cmake --build $BUILD_ROOT --target install
cmake --build $BUILD_ROOT
cmake --install $BUILD_ROOT
fi
done


@@ -264,11 +264,7 @@ class ComputationCommunicationOverlap(MicroBenchmark):
torch.distributed.destroy_process_group()
except BaseException as e:
self._result.set_return_code(ReturnCode.DISTRIBUTED_SETTING_DESTROY_FAILURE)
logger.error(
'Post process failed - benchmark: {}, mode: {}, message: {}.'.format(
self._name, self._args.mode, str(e)
)
)
logger.error('Post process failed - benchmark: {}, message: {}.'.format(self._name, str(e)))
return False
return True


@@ -4,9 +4,9 @@
cmake_minimum_required(VERSION 3.18)
project(cublas_benchmark LANGUAGES CXX)
include(../cuda_common.cmake)
find_package(CUDAToolkit QUIET)
if(CUDAToolkit_FOUND)
include(../cuda_common.cmake)
set(SRC "cublas_helper.cpp" CACHE STRING "source file")
set(TARGET_NAME "cublas_function" CACHE STRING "target name")
@@ -25,8 +25,8 @@ if(CUDAToolkit_FOUND)
add_subdirectory(${json_SOURCE_DIR} ${json_BINARY_DIR} EXCLUDE_FROM_ALL)
endif()
add_executable(cublas_benchmark cublas_test.cpp)
target_link_libraries(cublas_benchmark ${TARGET_NAME} nlohmann_json::nlohmann_json CUDA::cudart CUDA::cublas)
add_executable(cublas_benchmark cublas_test.cpp)
target_link_libraries(cublas_benchmark ${TARGET_NAME} nlohmann_json::nlohmann_json CUDA::cudart CUDA::cublas)
install(TARGETS cublas_benchmark ${TARGET_NAME} RUNTIME DESTINATION bin LIBRARY DESTINATION lib)
endif()


@@ -6,6 +6,8 @@ if(NOT DEFINED CMAKE_CUDA_STANDARD)
set(CMAKE_CUDA_STANDARD_REQUIRED ON)
endif()
enable_language(CUDA)
if(NOT DEFINED NVCC_ARCHS_SUPPORTED)
# Reference: https://github.com/NVIDIA/cutlass/blob/0e137486498a52954eff239d874ee27ab23358e7/CMakeLists.txt#L89
set(NVCC_ARCHS_SUPPORTED "")


@@ -4,9 +4,9 @@
cmake_minimum_required(VERSION 3.18)
project(cudnn_benchmark LANGUAGES CXX)
include(../cuda_common.cmake)
find_package(CUDAToolkit QUIET)
if(CUDAToolkit_FOUND)
include(../cuda_common.cmake)
set(SRC "cudnn_helper.cpp" CACHE STRING "source file")
set(TARGET_NAME "cudnn_function" CACHE STRING "target name")
@@ -28,7 +28,7 @@ if(CUDAToolkit_FOUND)
add_subdirectory(${json_SOURCE_DIR} ${json_BINARY_DIR} EXCLUDE_FROM_ALL)
endif()
add_executable(cudnn_benchmark cudnn_test.cpp)
target_link_libraries(cudnn_benchmark ${TARGET_NAME} nlohmann_json::nlohmann_json CUDA::cudart ${CUDNN_LIBRARY})
add_executable(cudnn_benchmark cudnn_test.cpp)
target_link_libraries(cudnn_benchmark ${TARGET_NAME} nlohmann_json::nlohmann_json CUDA::cudart ${CUDNN_LIBRARY})
install(TARGETS cudnn_benchmark ${TARGET_NAME} RUNTIME DESTINATION bin LIBRARY DESTINATION lib)
endif()


@@ -5,36 +5,34 @@ cmake_minimum_required(VERSION 3.18)
project(gpu_sm_copy LANGUAGES CXX)
include(../cuda_common.cmake)
find_package(CUDAToolkit QUIET)
include(../rocm_common.cmake)
find_package(HIP QUIET)
# Cuda environment
if(CUDAToolkit_FOUND)
message(STATUS "Found CUDA: " ${CUDAToolkit_VERSION})
enable_language(CUDA)
include(../cuda_common.cmake)
add_executable(gpu_sm_copy gpu_sm_copy.cu)
set_property(TARGET gpu_sm_copy PROPERTY CUDA_ARCHITECTURES ${NVCC_ARCHS_SUPPORTED})
install(TARGETS gpu_sm_copy RUNTIME DESTINATION bin)
# ROCm environment
elseif(HIP_FOUND)
message(STATUS "Found ROCm: " ${HIP_VERSION})
# Convert cuda code to hip code inplace
execute_process(COMMAND hipify-perl -inplace -print-stats gpu_sm_copy.cu
WORKING_DIRECTORY ${CMAKE_CURRENT_SOURCE_DIR}/)
# Add HIP targets
set_source_files_properties(gpu_sm_copy.cu PROPERTIES HIP_SOURCE_PROPERTY_FORMAT 1)
# Link with HIP
hip_add_executable(gpu_sm_copy gpu_sm_copy.cu)
# Install targets
install(TARGETS gpu_sm_copy RUNTIME DESTINATION bin)
else()
message(FATAL_ERROR "No CUDA or ROCm environment found.")
endif()
# ROCm environment
include(../rocm_common.cmake)
find_package(HIP QUIET)
if(HIP_FOUND)
message(STATUS "Found ROCm: " ${HIP_VERSION})
# Convert cuda code to hip code inplace
execute_process(COMMAND hipify-perl -inplace -print-stats gpu_sm_copy.cu
WORKING_DIRECTORY ${CMAKE_CURRENT_SOURCE_DIR}/)
# Add HIP targets
set_source_files_properties(gpu_sm_copy.cu PROPERTIES HIP_SOURCE_PROPERTY_FORMAT 1)
# Link with HIP
hip_add_executable(gpu_sm_copy gpu_sm_copy.cu)
# Install targets
install(TARGETS gpu_sm_copy RUNTIME DESTINATION bin)
else()
message(FATAL_ERROR "No CUDA or ROCm environment found.")
endif()
endif()


@@ -5,36 +5,34 @@ cmake_minimum_required(VERSION 3.18)
project(kernel_launch_overhead LANGUAGES CXX)
include(../cuda_common.cmake)
find_package(CUDAToolkit QUIET)
include(../rocm_common.cmake)
find_package(HIP QUIET)
# Cuda environment
if(CUDAToolkit_FOUND)
message(STATUS "Found CUDA: " ${CUDAToolkit_VERSION})
enable_language(CUDA)
include(../cuda_common.cmake)
add_executable(kernel_launch_overhead kernel_launch.cu)
set_property(TARGET kernel_launch_overhead PROPERTY CUDA_ARCHITECTURES ${NVCC_ARCHS_SUPPORTED})
install(TARGETS kernel_launch_overhead RUNTIME DESTINATION bin)
# ROCm environment
elseif(HIP_FOUND)
message(STATUS "Found HIP: " ${HIP_VERSION})
# Convert cuda code to hip code inplace
execute_process(COMMAND hipify-perl -inplace -print-stats kernel_launch.cu
WORKING_DIRECTORY ${CMAKE_CURRENT_SOURCE_DIR}/)
# Add HIP targets
set_source_files_properties(kernel_launch.cu PROPERTIES HIP_SOURCE_PROPERTY_FORMAT 1)
# Link with HIP
hip_add_executable(kernel_launch_overhead kernel_launch.cu)
# Install targets
install(TARGETS kernel_launch_overhead RUNTIME DESTINATION bin)
else()
message(FATAL_ERROR "No CUDA or ROCm environment found.")
endif()
# ROCm environment
include(../rocm_common.cmake)
find_package(HIP QUIET)
if(HIP_FOUND)
message(STATUS "Found HIP: " ${HIP_VERSION})
# Convert cuda code to hip code inplace
execute_process(COMMAND hipify-perl -inplace -print-stats kernel_launch.cu
WORKING_DIRECTORY ${CMAKE_CURRENT_SOURCE_DIR}/)
# Add HIP targets
set_source_files_properties(kernel_launch.cu PROPERTIES HIP_SOURCE_PROPERTY_FORMAT 1)
# Link with HIP
hip_add_executable(kernel_launch_overhead kernel_launch.cu)
# Install targets
install(TARGETS kernel_launch_overhead RUNTIME DESTINATION bin)
else()
message(FATAL_ERROR "No CUDA or ROCm environment found.")
endif()
endif()


@@ -174,6 +174,7 @@ class PytorchBase(ModelBenchmark):
try:
if self._args.distributed_impl == DistributedImpl.DDP:
torch.distributed.barrier()
torch.distributed.destroy_process_group()
except BaseException as e:
self._result.set_return_code(ReturnCode.DISTRIBUTED_SETTING_DESTROY_FAILURE)
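The fix in #198 is the classic barrier-before-destroy pattern; a minimal standalone sketch (assumes an initialized default process group, not SuperBench's exact code):

```python
import torch.distributed as dist

def teardown_distributed():
    # Wait for all ranks to reach this point before tearing down the
    # process group, so a fast rank cannot destroy it while slower
    # ranks are still running collectives.
    if dist.is_available() and dist.is_initialized():
        dist.barrier()
        dist.destroy_process_group()
```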


@@ -23,6 +23,8 @@ class SuperBenchCommandsLoader(CLICommandsLoader):
g.command('deploy', 'deploy_command_handler')
g.command('exec', 'exec_command_handler')
g.command('run', 'run_command_handler')
with CommandGroup(self, 'node', 'superbench.cli._node_handler#{}') as g:
g.command('info', 'info_command_handler')
return super().load_command_table(args)
def load_arguments(self, command):


@@ -0,0 +1,19 @@
# Copyright (c) Microsoft Corporation.
# Licensed under the MIT License.
"""SuperBench CLI node subgroup command handler."""
from superbench.tools import SystemInfo
def info_command_handler():
"""Get node hardware info.
Returns:
dict: node info.
"""
try:
info = SystemInfo().get_all()
except Exception as ex:
raise RuntimeError('Failed to get node info.') from ex
return info
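A quick usage sketch for the new handler, calling it directly rather than through `sb node info` (the output keys follow `SystemInfo.get_all` further below):

```python
from superbench.cli._node_handler import info_command_handler

# SystemInfo shells out to tools like dmidecode, so run as root.
node_info = info_command_handler()
print(node_info.get('System', {}).get('os'))
```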


@@ -3,7 +3,7 @@
# Server:
# - Product: HPE Apollo 6500
version: v0.2
version: v0.3
superbench:
enable: null
var:
@@ -40,24 +40,20 @@ superbench:
rccl-bw:
enable: true
modes:
- name: mpi
proc_num: 8
env:
NCCL_SOCKET_IFNAME: ens17f0
NCCL_IB_GDR_LEVEL: 1
- name: local
proc_num: 1
parallel: no
parameters:
maxbytes: 128M
minbytes: 32M
iters: 50
ngpus: 1
operations: allreduce
maxbytes: 8G
ngpus: 8
operation: allreduce
mem-bw:
<<: *default_local_mode
gemm-flops:
<<: *default_local_mode
parameters:
m: 7680
n: 8192
m: 7680
n: 8192
k: 8192
ib-loopback:
enable: true
@@ -75,15 +71,16 @@ superbench:
parameters:
block_devices: []
gpu-sm-copy-bw:
enable: false
enable: true
modes:
- name: local
proc_num: 32
prefix: CUDA_VISIBLE_DEVICES=$(({proc_rank}%8)) numactl -N $(({proc_rank}%4)) -m $(({proc_rank}%4))
prefix: HIP_VISIBLE_DEVICES=$(({proc_rank}%8)) numactl -N $(({proc_rank}%4)) -m $(({proc_rank}%4))
parallel: no
parameters:
dtoh: true
htod: true
mem_type:
- dtoh
- htod
gpt_models:
<<: *default_pytorch_mode
models:


@@ -4,7 +4,7 @@
# - Product: G482-Z53
# - Link: https://www.gigabyte.cn/FileUpload/Global/MicroSite/553/G482-Z53.html
version: v0.2
version: v0.3
superbench:
enable: null
var:
@@ -13,7 +13,7 @@ superbench:
modes:
- name: local
proc_num: 8
prefix: CUDA_VISIBLE_DEVICES={proc_rank}
prefix: HIP_VISIBLE_DEVICES={proc_rank}
parallel: yes
default_pytorch_mode: &default_pytorch_mode
enable: true
@@ -36,6 +36,52 @@ superbench:
- train
pin_memory: yes
benchmarks:
kernel-launch:
<<: *default_local_mode
rccl-bw:
enable: true
modes:
- name: local
proc_num: 1
parallel: no
parameters:
maxbytes: 8G
ngpus: 8
operation: allreduce
mem-bw:
<<: *default_local_mode
gemm-flops:
<<: *default_local_mode
parameters:
m: 7680
n: 8192
k: 8192
ib-loopback:
enable: true
modes:
- name: local
proc_num: 2
prefix: PROC_RANK={proc_rank} IB_DEVICES=0,1
parallel: no
disk-benchmark:
enable: false
modes:
- name: local
proc_num: 1
parallel: no
parameters:
block_devices: []
gpu-sm-copy-bw:
enable: true
modes:
- name: local
proc_num: 32
prefix: HIP_VISIBLE_DEVICES=$(({proc_rank}%8)) numactl -N $(({proc_rank}%4)) -m $(({proc_rank}%4))
parallel: no
parameters:
mem_type:
- dtoh
- htod
gpt_models:
<<: *default_pytorch_mode
models:


@@ -1,5 +1,5 @@
# SuperBench Config
version: v0.2
version: v0.3
superbench:
enable: null
var:
@@ -35,6 +35,51 @@ superbench:
<<: *default_local_mode
gemm-flops:
<<: *default_local_mode
nccl-bw:
enable: true
modes:
- name: local
proc_num: 1
parallel: no
parameters:
ngpus: 8
ib-loopback:
enable: true
modes:
- name: local
proc_num: 4
prefix: PROC_RANK={proc_rank} IB_DEVICES=0,2,4,6 NUMA_NODES=1,0,3,2
parallel: yes
- name: local
proc_num: 4
prefix: PROC_RANK={proc_rank} IB_DEVICES=1,3,5,7 NUMA_NODES=1,0,3,2
parallel: yes
mem-bw:
enable: true
modes:
- name: local
proc_num: 8
prefix: CUDA_VISIBLE_DEVICES={proc_rank} numactl -c $(({proc_rank}/2))
parallel: yes
disk-benchmark:
enable: false
modes:
- name: local
proc_num: 1
parallel: no
parameters:
block_devices: []
gpu-sm-copy-bw:
enable: true
modes:
- name: local
proc_num: 32
prefix: CUDA_VISIBLE_DEVICES=$(({proc_rank}%8)) numactl -N $(({proc_rank}%4)) -m $(({proc_rank}%4))
parallel: no
parameters:
mem_type:
- dtoh
- htod
cudnn-function:
<<: *default_local_mode
cublas-function:


@@ -1,5 +1,5 @@
# SuperBench Config
version: v0.2
version: v0.3
superbench:
enable: null
var:
@@ -32,7 +32,10 @@ superbench:
enable: true
modes:
- name: local
prefix: NCCL_DEBUG=INFO NCCL_IB_DISABLE=1
proc_num: 1
parallel: no
parameters:
ngpus: 8
ib-loopback:
enable: true
modes:
@@ -61,15 +64,16 @@ superbench:
prefix: CUDA_VISIBLE_DEVICES={proc_rank} numactl -c $(({proc_rank}/2))
parallel: yes
gpu-sm-copy-bw:
enable: false
enable: true
modes:
- name: local
proc_num: 32
prefix: CUDA_VISIBLE_DEVICES=$(({proc_rank}%8)) numactl -N $(({proc_rank}%4)) -m $(({proc_rank}%4))
parallel: no
parameters:
dtoh: true
htod: true
mem_type:
- dtoh
- htod
kernel-launch:
<<: *default_local_mode
gemm-flops:


@@ -101,7 +101,7 @@
{{ '--security-opt seccomp=unconfined --group-add video' if amd_gpu_exist else '' }} \
-w /root -v {{ workspace }}:/root -v /mnt:/mnt \
-v /var/run/docker.sock:/var/run/docker.sock \
{{ docker_image }} bash && \
--entrypoint /bin/bash {{ docker_image }} && \
docker exec {{ container }} bash -c \
"chown -R root:root ~ && \
sed -i 's/[# ]*Port.*/Port {{ ssh_port }}/g' /etc/ssh/sshd_config && \


@@ -123,20 +123,13 @@ class SuperBenchRunner():
elif mode.name == 'torch.distributed':
# TODO: replace with torch.distributed.run in v1.9
# TODO: only supports node_num=1 and node_num=all currently
torch_dist_params = '' if mode.node_num == 1 else \
'--nnodes=$NNODES --node_rank=$NODE_RANK --master_addr=$MASTER_ADDR --master_port=$MASTER_PORT '
mode_command = (
'python3 -m torch.distributed.launch '
'--use_env --no_python --nproc_per_node={proc_num} '
'--nnodes={node_num} --node_rank=$NODE_RANK '
'--master_addr=$MASTER_ADDR --master_port=$MASTER_PORT '
'{command} {torch_distributed_suffix}'
).format(
proc_num=mode.proc_num,
node_num=1 if mode.node_num == 1 else '$NNODES',
command=exec_command,
torch_distributed_suffix=(
'superbench.benchmarks.{name}.parameters.distributed_impl=ddp '
'superbench.benchmarks.{name}.parameters.distributed_backend=nccl'
).format(name=benchmark_name),
f'python3 -m torch.distributed.launch'
f' --use_env --no_python --nproc_per_node={mode.proc_num} {torch_dist_params}{exec_command}'
f' superbench.benchmarks.{benchmark_name}.parameters.distributed_impl=ddp'
f' superbench.benchmarks.{benchmark_name}.parameters.distributed_backend=nccl'
)
elif mode.name == 'mpi':
mode_command = (
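The net effect of this refactor (#201): single-node runs drop the rendezvous flags entirely instead of hard-coding `--nnodes=1` with unset `$MASTER_ADDR`/`$MASTER_PORT`. A simplified sketch of the resulting command construction (illustrative, not the runner's exact code):

```python
def torch_dist_command(proc_num, node_num, exec_command, benchmark_name):
    # Multi-node runs need rendezvous parameters; single-node runs must not
    # reference $MASTER_ADDR/$MASTER_PORT, which are unset on one node.
    dist_params = '' if node_num == 1 else (
        '--nnodes=$NNODES --node_rank=$NODE_RANK '
        '--master_addr=$MASTER_ADDR --master_port=$MASTER_PORT '
    )
    return (
        f'python3 -m torch.distributed.launch'
        f' --use_env --no_python --nproc_per_node={proc_num} {dist_params}{exec_command}'
        f' superbench.benchmarks.{benchmark_name}.parameters.distributed_impl=ddp'
        f' superbench.benchmarks.{benchmark_name}.parameters.distributed_backend=nccl'
    )
```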


@@ -4,15 +4,19 @@
"""Generate system config."""
import json
import os
import subprocess
import xmltodict
from pathlib import Path
import xmltodict
from superbench.common.utils import logger
class SystemInfo(): # pragma: no cover
"""Systsem info class."""
def run_cmd(self, command):
"""Run the command as root or non-root user and return the stdout string..
def _run_cmd(self, command):
"""Run the command and return the stdout string.
Args:
command (string): the command to run in terminal.
@@ -21,11 +25,17 @@ class SystemInfo(): # pragma: no cover
string: the stdout string of the command.
"""
output = subprocess.run(
command, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, shell=True, check=False, universal_newlines=True
command,
stdout=subprocess.PIPE,
stderr=subprocess.STDOUT,
shell=True,
check=False,
universal_newlines=True,
timeout=300
)
return output.stdout
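Background for the renames in this file (#200): inside a class body, `self.__name` is rewritten by Python's name mangling to `self._ClassName__name`, so a method definition and its call sites must use the same underscore form. A minimal illustration (hypothetical classes, not SuperBench code):

```python
class Broken:
    def helper(self):
        return 'ok'

    def call(self):
        # Mangled to self._Broken__helper, which does not exist,
        # so this raises AttributeError at call time.
        return self.__helper()

class Fixed:
    def _helper(self):
        return 'ok'

    def call(self):
        # Single leading underscore: conventional private, no mangling.
        return self._helper()

assert Fixed().call() == 'ok'
# Broken().call()  # AttributeError: 'Broken' object has no attribute '_Broken__helper'
```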
def count_prefix_indent(self, content, symbol='\t'):
def __count_prefix_indent(self, content, symbol='\t'):
r"""Count the number of a specific symbol in the content.
Args:
@@ -43,7 +53,7 @@ class SystemInfo(): # pragma: no cover
break
return count
def parse_key_value_lines(self, lines, required_keywords=None, omitted_values=None, symbol=':'): # noqa: C901
def _parse_key_value_lines(self, lines, required_keywords=None, omitted_values=None, symbol=':'): # noqa: C901
"""Parse the lines like "key:value" and convert them to dict.
if required_keywords is None, include all line. Otherwise,
@@ -83,7 +93,7 @@ class SystemInfo(): # pragma: no cover
while next_indent_index < length and self.__count_prefix_indent(lines[next_indent_index]) > indent:
next_indent_index += 1
value = self.__parse_key_value_lines(lines[i + 1:next_indent_index])
value = self._parse_key_value_lines(lines[i + 1:next_indent_index])
i = next_indent_index - 1
# split line by symbol
elif symbol in line:
@@ -112,7 +122,7 @@ class SystemInfo(): # pragma: no cover
i += 1
return dict
def parse_table_lines(self, lines, key):
def _parse_table_lines(self, lines, key):
"""Parse lines like a table and extract the colomns whose table index are the same as key to list of dict.
Args:
@@ -125,22 +135,19 @@ class SystemInfo(): # pragma: no cover
index = []
list = []
valid = False
try:
for line in lines:
line = line.split()
if key[0] in line:
for i in range(len(key)):
index.append(line.index(key[i]))
valid = True
continue
if valid:
dict = {}
for i in range(len(key)):
if index[i] < len(line):
dict[key[i]] = line[index[i]]
list.append(dict)
except Exception:
print('Error: key error in __parse_table_lines')
for line in lines:
line = line.split()
if key[0] in line:
for i in range(len(key)):
index.append(line.index(key[i]))
valid = True
continue
if valid:
dict = {}
for i in range(len(key)):
if index[i] < len(line):
dict[key[i]] = line[index[i]]
list.append(dict)
return list
def get_cpu(self):
@@ -152,13 +159,13 @@ class SystemInfo(): # pragma: no cover
lscpu_dict = {}
try:
# get general cpu information from lscpu
lscpu = self.__run_cmd('lscpu').splitlines()
lscpu = self._run_cmd('lscpu').splitlines()
# get distinct max_speed and current_speed of cpus from dmidecode
speed = self.__run_cmd(r'dmidecode -t processor | grep "Speed"').splitlines()
lscpu_dict = self.__parse_key_value_lines(lscpu)
lscpu_dict.update(self.__parse_key_value_lines(speed))
speed = self._run_cmd(r'dmidecode -t processor | grep "Speed"').splitlines()
lscpu_dict = self._parse_key_value_lines(lscpu)
lscpu_dict.update(self._parse_key_value_lines(speed))
except Exception:
print('Error: get CPU info failed')
logger.exception('Error: get CPU info failed')
return lscpu_dict
def get_system(self):
@@ -169,27 +176,27 @@ class SystemInfo(): # pragma: no cover
"""
system_dict = {}
try:
lsmod = self.__run_cmd('lsmod').splitlines()
lsmod = self.__parse_table_lines(lsmod, key=['Module', 'Size', 'Used', 'by'])
sysctl = self.__run_cmd('sysctl -a').splitlines()
sysctl = self.__parse_key_value_lines(sysctl, None, None, '=')
system_dict['system_manufacturer'] = self.__run_cmd('dmidecode -s system-manufacturer').strip()
system_dict['system_product'] = self.__run_cmd('dmidecode -s system-product-name').strip()
system_dict['os'] = self.__run_cmd('cat /proc/version').strip()
system_dict['uname'] = self.__run_cmd('uname -a').strip()
system_dict['docker'] = self.__get_docker_version()
lsmod = self._run_cmd('lsmod').splitlines()
lsmod = self._parse_table_lines(lsmod, key=['Module', 'Size', 'Used', 'by'])
sysctl = self._run_cmd('sysctl -a').splitlines()
sysctl = self._parse_key_value_lines(sysctl, None, None, '=')
system_dict['system_manufacturer'] = self._run_cmd('dmidecode -s system-manufacturer').strip()
system_dict['system_product'] = self._run_cmd('dmidecode -s system-product-name').strip()
system_dict['os'] = self._run_cmd('cat /proc/version').strip()
system_dict['uname'] = self._run_cmd('uname -a').strip()
system_dict['docker'] = self.get_docker_version()
system_dict['kernel_parameters'] = sysctl
system_dict['kernel_modules'] = lsmod
system_dict['dmidecode'] = self.__run_cmd('dmidecode').strip()
system_dict['dmidecode'] = self._run_cmd('dmidecode').strip()
if system_dict['system_product'] == 'Virtual Machine':
lsvmbus = self.__run_cmd('lsvmbus').splitlines()
lsvmbus = self.__parse_key_value_lines(lsvmbus)
lsvmbus = self._run_cmd('lsvmbus').splitlines()
lsvmbus = self._parse_key_value_lines(lsvmbus)
system_dict['vmbus'] = lsvmbus
except Exception:
print('Error: get system info failed')
logger.exception('Error: get system info failed')
return system_dict
def __get_docker_version(self):
def get_docker_version(self):
"""Get docker version info.
Returns:
@@ -197,7 +204,7 @@ class SystemInfo(): # pragma: no cover
"""
docker_version_dict = {}
try:
docker_version = self.__run_cmd('docker version')
docker_version = self._run_cmd('docker version')
lines = docker_version.splitlines()
key = ''
@@ -209,7 +216,7 @@ class SystemInfo(): # pragma: no cover
elif 'Version' in line and key not in docker_version_dict:
docker_version_dict[key] = line.split(':')[1].strip().strip('\t')
except Exception:
print('Error: get docker info failed')
logger.exception('Error: get docker info failed')
return docker_version_dict
def get_memory(self):
@@ -220,14 +227,14 @@ class SystemInfo(): # pragma: no cover
"""
memory_dict = {}
try:
lsmem = self.__run_cmd('lsmem')
lsmem = self._run_cmd('lsmem')
lsmem = lsmem.splitlines()
lsmem = self.__parse_key_value_lines(lsmem)
lsmem = self._parse_key_value_lines(lsmem)
memory_dict['block_size'] = lsmem.get('Memory block size', '')
memory_dict['total_capacity'] = lsmem.get('Total online memory', '')
dmidecode_memory = self.__run_cmd('dmidecode --type memory')
dmidecode_memory = self._run_cmd('dmidecode --type memory')
dmidecode_memory = dmidecode_memory.splitlines()
model = self.__parse_key_value_lines(
model = self._parse_key_value_lines(
dmidecode_memory, ['Manufacturer', 'Part Number', 'Type', 'Speed', 'Number Of Devices'],
omitted_values=['other', 'unknown']
)
@@ -236,52 +243,47 @@ class SystemInfo(): # pragma: no cover
memory_dict['clock_frequency'] = model.get('Speed', '')
memory_dict['model'] = model.get('Manufacturer', [''])[0] + ' ' + model.get('Part Number', [''])[0]
except Exception:
print('Error: get memory info failed')
logger.exception('Error: get memory info failed')
return memory_dict
def __get_gpu_nvidia(self):
def get_gpu_nvidia(self):
"""Get nvidia gpu info.
Returns:
dict: nvidia gpu info dict.
"""
gpu_dict = {}
try:
gpu_query = self.__run_cmd('nvidia-smi -q -x')
gpu_query = xmltodict.parse(gpu_query).get('nvidia_smi_log', '')
gpu_dict['gpu_count'] = gpu_query.get('attached_gpus', '')
gpu_dict['nvidia_info'] = gpu_query
gpu_dict['topo'] = self.__run_cmd('nvidia-smi topo -m')
gpu_dict['nvidia-container-runtime_version'] = self.__run_cmd('nvidia-container-runtime -v').strip()
gpu_dict['nvidia-fabricmanager_version'] = self.__run_cmd('nv-fabricmanager --version').strip()
gpu_dict['nv_peer_mem_version'] = self.__run_cmd(
'dpkg -l | grep \'nvidia-peer-memory \' | awk \'$2=="nvidia-peer-memory" {print $3}\''
).strip()
except Exception:
print('Error: get nvidia gpu info failed')
gpu_query = self._run_cmd('nvidia-smi -q -x')
gpu_query = xmltodict.parse(gpu_query).get('nvidia_smi_log', '')
gpu_dict['gpu_count'] = gpu_query.get('attached_gpus', '')
gpu_dict['nvidia_info'] = gpu_query
gpu_dict['topo'] = self._run_cmd('nvidia-smi topo -m')
gpu_dict['nvidia-container-runtime_version'] = self._run_cmd('nvidia-container-runtime -v').strip()
gpu_dict['nvidia-fabricmanager_version'] = self._run_cmd('nv-fabricmanager --version').strip()
gpu_dict['nv_peer_mem_version'] = self._run_cmd(
'dpkg -l | grep \'nvidia-peer-memory \' | awk \'$2=="nvidia-peer-memory" {print $3}\''
).strip()
return gpu_dict
def __get_gpu_amd(self):
def get_gpu_amd(self):
"""Get amd gpu info.
Returns:
dict: amd gpu info dict.
"""
gpu_dict = {}
try:
gpu_query = self.__run_cmd('rocm-smi -a --json')
gpu_query = json.loads(gpu_query)
gpu_per_node = list(filter(lambda x: 'card' in x, gpu_query.keys()))
gpu_dict['gpu_count'] = len(gpu_per_node)
gpu_mem_info = self.__run_cmd('rocm-smi --showmeminfo vram --json')
gpu_mem_info = json.loads(gpu_mem_info)
for card in gpu_per_node:
gpu_query[card].update(gpu_mem_info.get(card))
gpu_dict['rocm_info'] = gpu_query
gpu_dict['topo'] = self.__run_cmd('rocm-smi --showtopo')
except Exception:
print('Error: get amd gpu info failed')
gpu_query = self._run_cmd('rocm-smi -a --json')
gpu_query = json.loads(gpu_query)
gpu_per_node = list(filter(lambda x: 'card' in x, gpu_query.keys()))
gpu_dict['gpu_count'] = len(gpu_per_node)
gpu_mem_info = self._run_cmd('rocm-smi --showmeminfo vram --json')
gpu_mem_info = json.loads(gpu_mem_info)
for card in gpu_per_node:
gpu_query[card].update(gpu_mem_info.get(card))
gpu_dict['rocm_info'] = gpu_query
gpu_dict['topo'] = self._run_cmd('rocm-smi --showtopo')
return gpu_dict
def get_gpu(self):
@@ -290,10 +292,13 @@ class SystemInfo(): # pragma: no cover
Returns:
dict: gpu info dict.
"""
if Path('/dev/nvidiactl').is_char_device() and Path('/dev/nvidia-uvm').is_char_device():
return self.__get_gpu_nvidia()
if Path('/dev/kfd').is_char_device() and Path('/dev/dri').is_dir():
return self.__get_gpu_amd()
try:
if Path('/dev/nvidiactl').is_char_device() and Path('/dev/nvidia-uvm').is_char_device():
return self.get_gpu_nvidia()
if Path('/dev/kfd').is_char_device() and Path('/dev/dri').is_dir():
return self.get_gpu_amd()
except Exception:
logger.exception('Error: get gpu info failed')
print('Warning: no gpu detected')
return {}
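The detection itself relies on driver device files; a condensed sketch of the logic above (same checks, hypothetical helper name):

```python
from pathlib import Path

def detect_gpu_vendor():
    # NVIDIA driver exposes control/UVM character devices.
    if Path('/dev/nvidiactl').is_char_device() and Path('/dev/nvidia-uvm').is_char_device():
        return 'nvidia'
    # ROCm exposes the KFD compute device plus the DRI directory.
    if Path('/dev/kfd').is_char_device() and Path('/dev/dri').is_dir():
        return 'amd'
    return None
```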
@@ -305,10 +310,10 @@ class SystemInfo(): # pragma: no cover
"""
pcie_dict = {}
try:
pcie_dict['pcie_topo'] = self.__run_cmd('lspci -t -vvv')
pcie_dict['pcie_info'] = self.__run_cmd('lspci -vvv')
pcie_dict['pcie_topo'] = self._run_cmd('lspci -t -vvv')
pcie_dict['pcie_info'] = self._run_cmd('lspci -vvv')
except Exception:
print('Error: get pcie gpu info failed')
logger.exception('Error: get pcie info failed')
return pcie_dict
def get_storage(self): # noqa: C901
@ -319,44 +324,45 @@ class SystemInfo(): # pragma: no cover
"""
storage_dict = {}
try:
fs_info = self.__run_cmd("df -Th | grep -v \'^/dev/loop\'").splitlines()
fs_list = self.__parse_table_lines(fs_info, key=['Filesystem', 'Type', 'Size', 'Avail', 'Mounted'])
fs_info = self._run_cmd("df -Th | grep -v \'^/dev/loop\'").splitlines()
fs_list = self._parse_table_lines(fs_info, key=['Filesystem', 'Type', 'Size', 'Avail', 'Mounted'])
for fs in fs_list:
fs_device = fs.get('Filesystem', 'UNKNOWN')
if fs_device.startswith('/dev'):
fs['Block_size'] = self.__run_cmd('blockdev --getbsz {}'.format(fs_device)).strip()
fs['Block_size'] = self._run_cmd('blockdev --getbsz {}'.format(fs_device)).strip()
fs['4k_alignment'] = ''
partition_ids = self.__run_cmd(
'parted {} print | grep -oE "^[[:blank:]]*[0-9]+"'.format(fs_device)
partition_ids = self._run_cmd(
'yes Cancel | parted {} print | grep -oE "^[[:blank:]]*[0-9]+"'.format(fs_device)
).splitlines()
for id in partition_ids:
fs['4k_alignment'] += self.__run_cmd('parted {} align-check opt {}'.format(fs_device,
id)).strip()
fs['4k_alignment'] += self._run_cmd(
'yes Cancel | parted {} align-check opt {}'.format(fs_device, id)
).strip()
storage_dict['file_system'] = fs_list
except Exception:
print('Error: get file system info failed')
logger.exception('Error: get file system info failed')
try:
disk_info = self.__run_cmd("lsblk -e 7 -o NAME,ROTA,SIZE,MODEL | grep -v \'^/dev/loop\'").splitlines()
disk_list = self.__parse_table_lines(disk_info, key=['NAME', 'ROTA', 'SIZE', 'MODEL'])
disk_info = self._run_cmd("lsblk -e 7 -o NAME,ROTA,SIZE,MODEL | grep -v \'^/dev/loop\'").splitlines()
disk_list = self._parse_table_lines(disk_info, key=['NAME', 'ROTA', 'SIZE', 'MODEL'])
for disk in disk_list:
block_device = disk.get('NAME', 'UNKNOWN').strip('\u251c\u2500').strip('\u2514\u2500')
disk['NAME'] = block_device
disk['Rotational'] = disk.pop('ROTA')
disk['Block_size'] = self.__run_cmd('fdisk -l -u /dev/{} | grep "Sector size"'.format(block_device)
).strip()
disk['Block_size'] = self._run_cmd('fdisk -l -u /dev/{} | grep "Sector size"'.format(block_device)
).strip()
if 'nvme' in block_device:
nvme_info = self.__run_cmd('nvme list | grep {}'.format(block_device)).strip().split()
nvme_info = self._run_cmd('nvme list | grep {}'.format(block_device)).strip().split()
if len(nvme_info) >= 15:
disk['Nvme_usage'] = nvme_info[-11] + nvme_info[-10]
storage_dict['block_device'] = disk_list
storage_dict['mapping_bwtween_filesystem_and_blockdevice'] = self.__run_cmd('mount')
storage_dict['mapping_bwtween_filesystem_and_blockdevice'] = self._run_cmd('mount')
except Exception:
print('Error: get block device info failed')
logger.exception('Error: get block device info failed')
return storage_dict
def __get_ib(self):
def get_ib(self):
"""Get available IB devices info.
Return:
@@ -364,19 +370,19 @@ class SystemInfo(): # pragma: no cover
"""
ib_dict = {}
try:
ibstat = self.__run_cmd('ibstat').splitlines()
ib_dict['ib_device_status'] = self.__parse_key_value_lines(ibstat)
ibv_devinfo = self.__run_cmd('ibv_devinfo -v').splitlines()
ibstat = self._run_cmd('ibstat').splitlines()
ib_dict['ib_device_status'] = self._parse_key_value_lines(ibstat)
ibv_devinfo = self._run_cmd('ibv_devinfo -v').splitlines()
for i in range(len(ibv_devinfo) - 1, -1, -1):
if ':' not in ibv_devinfo[i]:
ibv_devinfo[i - 1] = ibv_devinfo[i - 1] + ',' + ibv_devinfo[i].strip('\t')
ibv_devinfo.remove(ibv_devinfo[i])
ib_dict['ib_device_info'] = self.__parse_key_value_lines(ibv_devinfo)
except Exception as e:
print('Error: get ib info failed. message: {}.'.format(str(e)))
ib_dict['ib_device_info'] = self._parse_key_value_lines(ibv_devinfo)
except Exception:
logger.exception('Error: get ib info failed')
return ib_dict
def __get_nic(self):
def get_nic(self):
"""Get nic info.
Returns:
@@ -384,8 +390,10 @@ class SystemInfo(): # pragma: no cover
"""
nic_list = []
try:
lsnic_xml = self.__run_cmd('lshw -c network -xml')
lsnic_xml = self._run_cmd('lshw -c network -xml')
lsnic_list = xmltodict.parse(lsnic_xml).get('list', {}).get('node', [])
if not isinstance(lsnic_list, list):
lsnic_list = [lsnic_list]
lsnic_list = list(filter(lambda x: 'logicalname' in x, lsnic_list))
for nic in lsnic_list:
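The new `isinstance` guard covers an xmltodict behavior: one `<node>` element parses to a single value, while repeated elements parse to a list. A minimal illustration (toy XML, not real `lshw` output):

```python
import xmltodict

single = xmltodict.parse('<list><node>eth0</node></list>')['list']['node']
multi = xmltodict.parse('<list><node>eth0</node><node>eth1</node></list>')['list']['node']
print(type(single).__name__)  # str  -- a lone child is not wrapped in a list
print(type(multi).__name__)   # list
```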
@@ -404,15 +412,15 @@ class SystemInfo(): # pragma: no cover
'driverversion', ''
)
nic_info['firmware'] = configuration_dict.get('firmware', '')
speed = self.__run_cmd('cat /sys/class/net/{}/speed'.format(nic_info['logical_name'])).strip()
speed = self._run_cmd('cat /sys/class/net/{}/speed'.format(nic_info['logical_name'])).strip()
if speed.isdigit():
nic_info['speed'] = str(int(speed) / 1000) + ' Gbit/s'
except Exception:
print('Error: get nic device {} info failed'.format(nic_info['logical_name']))
logger.exception('Error: get nic device {} info failed'.format(nic_info['logical_name']))
nic_list.append(nic_info)
except Exception:
print('Error: get nic info failed')
return nic_list
logger.exception('Error: get nic info failed')
def get_network(self):
"""Get network info, including nic info, ib info and ofed version.
@@ -421,15 +429,21 @@ class SystemInfo(): # pragma: no cover
dict: dict of network info.
"""
network_dict = {}
network_dict['nic'] = self.__get_nic()
network_dict['ib'] = self.__get_ib()
ofed_version = self.__run_cmd('ofed_info -s').strip()
network_dict['ofed_version'] = ofed_version
try:
network_dict['nic'] = self.get_nic()
network_dict['ib'] = self.get_ib()
ofed_version = self._run_cmd('ofed_info -s').strip()
network_dict['ofed_version'] = ofed_version
except Exception:
logger.exception('Error: get network info failed')
return network_dict
def get_all(self):
"""Get all system info and save them to file in json format."""
sum_dict = {}
if os.geteuid() != 0:
logger.error('You need to be a root user to run this tool.')
return sum_dict
sum_dict['System'] = self.get_system()
sum_dict['CPU'] = self.get_cpu()
sum_dict['Memory'] = self.get_memory()


@@ -81,3 +81,7 @@ class SuperBenchCLIScenarioTest(ScenarioTest):
"""Test sb run, --host-file does not exist, should fail."""
result = self.cmd('sb run --host-file ./nonexist.yaml', expect_failure=True)
self.assertEqual(result.exit_code, 1)
def test_sb_node_info(self):
"""Test sb node info, should fail."""
self.cmd('sb node info', expect_failure=False)


@@ -116,8 +116,6 @@ class RunnerTestCase(unittest.TestCase):
'expected_command': (
'python3 -m torch.distributed.launch '
'--use_env --no_python --nproc_per_node=8 '
'--nnodes=1 --node_rank=$NODE_RANK '
'--master_addr=$MASTER_ADDR --master_port=$MASTER_PORT '
f'sb exec --output-dir {self.sb_output_dir} -c sb.config.yaml -C superbench.enable=foo '
'superbench.benchmarks.foo.parameters.distributed_impl=ddp '
'superbench.benchmarks.foo.parameters.distributed_backend=nccl'

third_party/Makefile

@@ -8,7 +8,6 @@ MPI_HOME ?= /usr/local/mpi
HIP_HOME ?= /opt/rocm/hip
RCCL_HOME ?= /opt/rocm/rccl
ROCM_VERSION ?= rocm-$(shell dpkg -l | grep 'rocm-dev ' | awk '{print $$3}' | cut -d '.' -f1-3)
ROCM_ARCH ?= $(shell rocminfo | grep " gfx" | uniq | awk '{print $$2}')
.PHONY: all cuda rocm common cuda_cutlass cuda_bandwidthTest cuda_nccl_tests cuda_perftest rocm_perftest fio rocm_rccl_tests rocm_rocblas rocm_bandwidthTest
@@ -66,7 +65,7 @@ ifneq (,$(wildcard fio/Makefile))
cd ./fio && ./configure --prefix=$(SB_MICRO_PATH) && make -j && make install
endif
# Build rccl-tests from commit cc34c5 of develop branch (default branch).
# Build rccl-tests from commit dc1ad48 of develop branch (default branch).
rocm_rccl_tests: sb_micro_path
ifneq (, $(wildcard rccl-tests/Makefile))
cd ./rccl-tests && make MPI=1 MPI_HOME=$(MPI_HOME) HIP_HOME=$(HIP_HOME) RCCL_HOME=$(RCCL_HOME) -j
@@ -81,21 +80,14 @@ rocm_rocblas: sb_micro_path
ifeq (, $(wildcard $(SB_MICRO_PATH)/bin/rocblas-bench))
if [ -d rocBLAS ]; then rm -rf rocBLAS; fi
git clone -b ${ROCM_VERSION} https://github.com/ROCmSoftwarePlatform/rocBLAS.git ./rocBLAS
ifeq (${ROCM_VERSION}, rocm-4.0.0)
sed -i '/CMAKE_MATCH_1/a\ get_filename_component(HIP_CLANG_ROOT "$${HIP_CLANG_ROOT}" DIRECTORY)' /opt/rocm/hip/lib/cmake/hip/hip-config.cmake
cd ./rocBLAS && HIPCC_COMPILE_FLAGS_APPEND="-D_OPENMP=201811 -O3 -Wno-format-nonliteral -DCMAKE_HAVE_LIBC_PTHREAD -parallel-jobs=2" HIPCC_LINK_FLAGS_APPEND="-lpthread -O3 -parallel-jobs=2" ./install.sh -idc -a ${ROCM_ARCH}
else
cd ./rocBLAS && ./install.sh -idc
endif
cd ./rocBLAS && ./install.sh --dependencies --clients-only
cp -v ./rocBLAS/build/release/clients/staging/rocblas-bench $(SB_MICRO_PATH)/bin/
endif
# Build hipBusBandwidth.
# HIP is released with ROCm, e.g. rocm-4.2.0.
# The version we use is the release tag that matches the ROCm version in the environment or Docker image.
rocm_bandwidthTest:
cp -r -v $(shell hipconfig -p) ./
ifneq (, $(wildcard hip/samples/1_Utils/hipBusBandwidth/CMakeLists.txt))
cd ./hip/samples/1_Utils/hipBusBandwidth/ && mkdir -p build && cd build && cmake .. && make
cp -v ./hip/samples/1_Utils/hipBusBandwidth/build/hipBusBandwidth $(SB_MICRO_PATH)/bin/
endif
rocm_bandwidthTest: sb_micro_path
cp -r -v $(shell hipconfig -p)/samples/1_Utils/hipBusBandwidth ./
cd ./hipBusBandwidth/ && mkdir -p build && cd build && cmake .. && make
cp -v ./hipBusBandwidth/build/hipBusBandwidth $(SB_MICRO_PATH)/bin/

third_party/rccl-tests

@@ -1 +1 @@
Subproject commit cc34c545098145bc148e5035e4c8e767b4d71ece
Subproject commit dc1ad4853d7ec738387d42a75a58a98d7af00c7b


@@ -0,0 +1,132 @@
---
slug: release-sb-v0.3
title: Releasing SuperBench v0.3
author: Peng Cheng
author_title: SuperBench Team
author_url: https://github.com/cp5555
author_image_url: https://github.com/cp5555.png
tags: [superbench, announcement, release]
---
We are very happy to announce that **SuperBench v0.3.0** is officially released today!
You can install and try SuperBench by following the [Getting Started Tutorial](https://microsoft.github.io/superbenchmark/docs/getting-started/installation).
## SuperBench 0.3.0 Release Notes
### SuperBench Framework
#### Runner
- Implement MPI mode.
#### Benchmarks
- Support Docker benchmark.
### Single-node Validation
#### Micro Benchmarks
1. Memory (Tool: NVIDIA/AMD Bandwidth Test Tool)
| Metrics | Unit | Description |
|----------------|------|-------------------------------------|
| H2D_Mem_BW_GPU | GB/s | host-to-GPU bandwidth for each GPU |
| D2H_Mem_BW_GPU | GB/s | GPU-to-host bandwidth for each GPU |
2. IBLoopback (Tool: PerfTest – Standard RDMA Test Tool)
| Metrics | Unit | Description |
|----------|------|---------------------------------------------------------------|
| IB_Write | MB/s | The IB write loopback throughput with different message sizes |
| IB_Read | MB/s | The IB read loopback throughput with different message sizes |
| IB_Send | MB/s | The IB send loopback throughput with different message sizes |
3. NCCL/RCCL (Tool: NCCL/RCCL Tests)
| Metrics | Unit | Description |
|---------------------|------|-----------------------------------------------------------------|
| NCCL_AllReduce | GB/s | The NCCL AllReduce performance with different message sizes |
| NCCL_AllGather | GB/s | The NCCL AllGather performance with different message sizes |
| NCCL_broadcast | GB/s | The NCCL Broadcast performance with different message sizes |
| NCCL_reduce | GB/s | The NCCL Reduce performance with different message sizes |
| NCCL_reduce_scatter | GB/s | The NCCL ReduceScatter performance with different message sizes |
4. Disk (Tool: FIO – Standard Disk Performance Tool)
| Metrics | Unit | Description |
|----------------|------|---------------------------------------------------------------------------------|
| Seq_Read | MB/s | Sequential read performance |
| Seq_Write | MB/s | Sequential write performance |
| Rand_Read | MB/s | Random read performance |
| Rand_Write | MB/s | Random write performance |
| Seq_R/W_Read | MB/s | Read performance in sequential read/write, fixed measurement (read:write = 4:1) |
| Seq_R/W_Write | MB/s | Write performance in sequential read/write (read:write = 4:1) |
| Rand_R/W_Read | MB/s | Read performance in random read/write (read:write = 4:1) |
| Rand_R/W_Write | MB/s | Write performance in random read/write (read:write = 4:1) |
5. H2D/D2H SM Transmission Bandwidth (Tool: MSR-A build)
| Metrics | Unit | Description |
|---------------|------|-----------------------------------------------------|
| H2D_SM_BW_GPU | GB/s | host-to-GPU bandwidth using GPU kernel for each GPU |
| D2H_SM_BW_GPU | GB/s | GPU-to-host bandwidth using GPU kernel for each GPU |
### AMD GPU Support
#### Docker Image Support
- ROCm 4.2 PyTorch 1.7.0
- ROCm 4.0 PyTorch 1.7.0
#### Micro Benchmarks
1. Kernel Launch (Tool: MSR-A build)
| Metrics | Unit | Description |
|--------------------------|-----------|--------------------------------------------------------------|
| Kernel_Launch_Event_Time | Time (ms) | Dispatch latency measured in GPU time using hipEventRecord() |
| Kernel_Launch_Wall_Time | Time (ms) | Dispatch latency measured in CPU time |
2. GEMM FLOPS (Tool: AMD rocblas-bench Tool)
| Metrics | Unit | Description |
|----------|--------|-------------------------------|
| FP64 | GFLOPS | FP64 FLOPS without MatrixCore |
| FP32(MC) | GFLOPS | FP32 FLOPS with MatrixCore    |
| FP16(MC) | GFLOPS | FP16 FLOPS with MatrixCore |
| BF16(MC) | GFLOPS | BF16 FLOPS with MatrixCore |
| INT8(MC) | GOPS   | INT8 OPS with MatrixCore      |
#### E2E Benchmarks
1. CNN models -- Use PyTorch torchvision models
- ResNet: ResNet-50, ResNet-101, ResNet-152
- DenseNet: DenseNet-169, DenseNet-201
- VGG: VGG-11, VGG-13, VGG-16, VGG-19
2. BERT -- Use huggingface Transformers
- BERT
- BERT Large
3. LSTM -- Use PyTorch
4. GPT-2 -- Use huggingface Transformers
### Bug Fix
- VGG models failed on A100 GPU with batch_size=128
### Other Improvement
1. Contribution related
- Contribute rule
- System information collection
2. Document
- Add release process doc
- Add design documents
- Add developer guide doc for coding style
- Add contribution rules
- Add docker image list
- Add initial validation results


@@ -101,7 +101,7 @@ module.exports = {
announcementBar: {
id: 'supportus',
content:
'📢 <a href="https://microsoft.github.io/superbenchmark/blog/release-sb-v0.2">v0.2.1</a> has been released! ' +
'📢 <a href="https://microsoft.github.io/superbenchmark/blog/release-sb-v0.3">v0.3.0</a> has been released! ' +
'⭐️ If you like SuperBench, give it a star on <a target="_blank" rel="noopener noreferrer" href="https://github.com/microsoft/superbenchmark">GitHub</a>! ⭐️',
},
algolia: {

website/package-lock.json

@@ -1,6 +1,6 @@
{
"name": "superbench-website",
"version": "0.2.1",
"version": "0.3.0",
"lockfileVersion": 1,
"requires": true,
"dependencies": {


@@ -1,6 +1,6 @@
{
"name": "superbench-website",
"version": "0.2.1",
"version": "0.3.0",
"private": true,
"scripts": {
"docusaurus": "docusaurus",