
PyTorch Distributed Data Parallel (DDP)

This example shows how to use Distributed Data Parallel (DDP) with PyTorch on Azure Machine Learning.
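At its core, DDP wraps a model so that gradients are all-reduced across workers during the backward pass. The sketch below is a hedged, single-process illustration (toy model and random data, not this repo's training script); in a real Azure ML job the `RANK`/`WORLD_SIZE`/`MASTER_*` variables are set per worker by the launcher, and GPU clusters would use the `nccl` backend instead of `gloo`:

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # Single-process sketch: these defaults stand in for the values the
    # distributed launcher normally provides to each worker.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    os.environ.setdefault("RANK", "0")
    os.environ.setdefault("WORLD_SIZE", "1")

    # "gloo" works on CPU; use "nccl" on GPU clusters.
    dist.init_process_group(backend="gloo")

    model = torch.nn.Linear(10, 1)   # toy model standing in for the real one
    ddp_model = DDP(model)

    optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.01)
    loss_fn = torch.nn.MSELoss()

    # One toy training step on random data.
    x = torch.randn(8, 10)
    y = torch.randn(8, 1)
    optimizer.zero_grad()
    loss = loss_fn(ddp_model(x), y)
    loss.backward()   # gradients are synchronized across workers here
    optimizer.step()

    dist.destroy_process_group()
    return loss.item()

if __name__ == "__main__":
    main()
```

With more than one worker, each process runs the same script on its own shard of the data; DDP averages the gradients so every replica ends each step with identical weights.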

Prerequisites

  • Azure Machine Learning Workspace
  • Compute Cluster with GPUs for distributed training
  • Compute Instance with the Azure ML CLI 2.0 installed

Getting Started

  1. Create the training data by running the Python script, which creates a data folder.

    python dataprep.py
    
  2. Create a job (Azure ML CLI 2.0 + YML configuration file), e.g. from a terminal in the VSCode Azure ML Extension.

    cd examples/train/pytorch-ddp/
    az ml job create --file job.yml --stream
    
  3. Open Azure ML studio and view the experiment logs.

  • In the Experiment view, parameters and metrics are logged, and you can check system performance in the Monitoring tab.
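The job.yml used in step 2 is an Azure ML CLI 2.0 command-job specification. A hedged sketch of what such a file can look like for a distributed PyTorch job is shown below; the compute name, environment reference, and process/instance counts are placeholders, not the actual values from this repo:

```yaml
$schema: https://azuremlschemas.azureedge.net/latest/commandJob.schema.json
code: src                          # local folder uploaded with the job
command: python train.py           # entry point (name is a placeholder)
environment: azureml:my-pytorch-env:1   # placeholder curated/custom environment
compute: azureml:gpu-cluster            # placeholder compute cluster name
distribution:
  type: pytorch                    # launches one process per worker with DDP env vars set
  process_count_per_instance: 4    # e.g. one process per GPU
resources:
  instance_count: 2                # number of cluster nodes
```

The `distribution` block is what makes Azure ML set `RANK`, `WORLD_SIZE`, and the `MASTER_*` variables for each process, so the training script only needs to call `init_process_group`.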
