Adds PyTorch (#16)

* Adds PyTorch examples * Adds pytorch to cookiecutter.json * Adds pytorch instructions to readme * Adds modifications for PyTorch
2019-10-09 21:24:58 +01:00 · 2019-10-09 21:24:58 +01:00 · 092a17372e
--- a/README.md
+++ b/README.md
@ -8,7 +8,12 @@ This is a demo template that allows you to easily run [tf_cnn_benchmarks](https:
 This is another demo template that shows you how to train a ResNet50 model using Imagenet on Azure. We include scripts for processing the Imagenet data, transforming them to TF Records as well as leveraging AzCopy to quickly upload the data to the cloud.
 #### Tensorflow Template 
 This is a blank template you can use for your own distributed training projects. It allows you to leverage all the tooling built around the previous two demos to speed up the time it takes to run your model in a distributed fashion on Azure.  
-
+#### PyTorch Benchmark 
 This is a demo template that allows you to easily run a simple PyTorch benchmarking script on Azure ML. This is a great way to test performance as well as compare to other platforms
 #### PyTorch Imagenet 
 This is another demo template that shows you how to train a ResNet50 model using Imagenet on Azure. We include scripts for processing the Imagenet data as well as leveraging AzCopy to quickly upload the data to the cloud.  
 #### PyTorch Template 
 This is a blank template you can use for your own distributed training projects. It allows you to leverage all the tooling built around the previous two demos to speed up the time it takes to run your model in a distributed fashion on Azure.
 # Prerequisites
 Before you get started you need a PC running Ubuntu and the following installed:  
@ -92,7 +97,7 @@ make run
 This will put you in an environment inside your container in a tmux session (for a tutorial on tmux see [here](https://www.hamvocke.com/blog/a-quick-and-easy-guide-to-tmux/)). The tmux control key has been mapped to **ctrl+a** rather than the standard ctrl+b so as not to interfere with outer tmux session if you are already a tmux user. You can alter this in the tmux.conf file in the Docker folder. The docker container will map the location you launched it from to the location /workspace inside the docker container. Therefore you can edit files outside of the container in the project folder and the changes will be reflected inside the container.
 ## Imagenet data
-If you have selected **all** or **imagenet** in the type question during cookiecutter invocation then you will need to have **ILSVRC2012_img_train.tar** and **ILSVRC2012_img_val.tar** present in the direcotry you specified as your data directory. Go to the [download page](http://www.image-net.org/download-images) (you may need to register an account), and find the page for ILSVRC2012. You will need to download the two files mentioned earlier.
+If you have selected **all**, **tensorflow_imagenet** or **pytorch_imagenet** in the type question during cookiecutter invocation then you will need to have **ILSVRC2012_img_train.tar** and **ILSVRC2012_img_val.tar** present in the direcotry you specified as your data directory. Go to the [download page](http://www.image-net.org/download-images) (you may need to register an account), and find the page for ILSVRC2012. You will need to download the two files mentioned earlier.
 ## Template selection
 Based on the option you selected for **type** during the cookiecutter invocation you will get all or one of the options below. Cookiecutter will create your project folder which will contain the tempalte folders. When inside your project folder make sure you have run the **make build** and **make run** commands as mentioned in _building environment_ section above. Once you run the run command you will be greeted by a prompt, this is now your control plane. First you will need to set everything up. To do this run
@ -116,7 +121,7 @@ inv tf-benchmark.submit.remote.synthetic
 Note that this will create the cluster if it wasn't created earlier and create the appropriate environment.
 #### Tensorflow Imagenet
-This is the second demo template that will train a ResNet50 model on imagenet. It allows the options of using synthetic data, image data as well as tfrecords. To use this you must either select **imagenet** or **all** when cookiecutter asks what type of project you want to create.
+This is the second demo template that will train a ResNet50 model on imagenet. It allows the options of using synthetic data, image data as well as tfrecords. To use this you must either select **tensorflow_imagenet** or **all** when cookiecutter asks what type of project you want to create.
 The run things locally using synthetic data simply run:
 ```
 inv tf-imagenet.submit.local.synthetic
@ -131,6 +136,35 @@ This only covers a small number of commands, to see the full list of commands si
 #### Tensorflow Experiment
 This is the option that you should use if you want to run your own training script. It is up to you to add the appropriate training scripts and modify the tensorflow_experiment.py file to run the appropriate commands. If you want to see how to invoke things simply look at the other examples.
 #### Pytorch Benchmark
 This is a demo template allows you to easily run a simple PyTorch benchmarking script on Azure ML. To use this you must either select benchmark or all when invoking cookiecutter. 
 Once setup is complete then simply run:
 ```bash
 inv pytorch-benchmark.submit.local.synthetic
 ```
 to run things locally on a single GPU. Note that the first time you run things you will have to build the environment.
 To run things on a cluster simply run:
 ```bash
 inv pytorch-benchmark.submit.remote.synthetic
 ```
 Note that this will create the cluster if it wasn't created earlier and create the appropriate environment.
 #### PyTorch Imagenet
 This is the second demo template that will train a ResNet50 model on imagenet. It allows the options of using synthetic data or image data. To use this you must either select **pytorch_imagenet** or **all** when cookiecutter asks what type of project you want to create.
 The run things locally using synthetic data simply run:
 ```
 inv pytorch-imagenet.submit.local.synthetic
 ```
 To run things on a remote cluster with real data in tfrecords format simply run:
 ```
 inv pytorch-imagenet.submit.remote.tfrecords
 ```
 #### Pytorch Experiment
 This is the option that you should use if you want to run your own training script. It is up to you to add the appropriate training scripts and modify the pytorch_experiment.py file to run the appropriate commands. If you want to see how to invoke things simply look at the other examples.
 # Architecture
 Below is a diagram that shows how the project is set up.
@ -175,6 +209,26 @@ The original project structure is as shown below.
   │  ├── storage.py <-- Invoke module for using Azure storage
   │  └── tfrecords.py <-- Invoke module for working with tf records
   ├── tasks.py <-- Main invoke module
   ├── PyTorch_benchmark<-- Template for running PyTorch benchmarks
   │  ├── environment_cpu.yml
   │  ├── environment_gpu.yml<-- Conda specification file used by Azure ML to create environment to run project in
   │  ├── pytorch_benchmark.py<-- Invoke module for running benchmarks
   │  └── src
   │     └── pytorch_synthetic_benchmark.py
   ├── PyTorch_imagenet
   │  ├── environment_cpu.yml
   │  ├── environment_gpu.yml<-- Conda specification file used by Azure ML to create environment to run project in
   │  ├── pytorch_imagenet.py<-- Invoke module for running benchmarks
   │  └── src
   │     ├── imagenet_pytorch_horovod.py
   │     ├── logging.conf
   │     └── timer.py
   ├── PyTorch_experiment<-- PyTorch distributed training template [Put your code here]
   │  ├── environment_cpu.yml
   │  ├── environment_gpu.yml<-- Conda specification file used by Azure ML to create environment to run project in
   │  ├── pytorch_experiment.py<-- Invoke module for running benchmarks
   │  └── src
   │     └── train_model.py
   ├── TensorFlow_benchmark <-- Template for running Tensorflow benchmarks
   │  ├── environment_cpu.yml 
   │  ├── environment_gpu.yml <-- Conda specification file used by Azure ML to create environment to run project in
@ -223,6 +277,12 @@ inv --list
  select-subscription                        Select Azure subscription to use
  setup                                      Setup the environment and process the imagenet data
  tensorboard                                Runs tensorboard in a seperate tmux session
  pytorch-benchmark.submit.local.synthetic    Submit PyTorch training job using synthetic data for local execution
  pytorch-benchmark.submit.remote.synthetic   Submit PyTorch training job using synthetic data to remote cluster
  pytorch-imagenet.submit.local.images        Submit PyTorch training job using real imagenet data for local execution
  pytorch-imagenet.submit.local.synthetic     Submit PyTorch training job using synthetic imagenet data for local execution
  pytorch-imagenet.submit.remote.images       Submit PyTorch training job using real imagenet data to remote cluster
  pytorch-imagenet.submit.remote.synthetic    Submit PyTorch training job using synthetic imagenet data to remote cluster
  storage.create-resource-group
  storage.store-key                          Retrieves premium storage account key from Azure and stores it in .env file
  storage.image.create-container             Creates container based on the parameters found in the .env file
--- a/cookiecutter.json
+++ b/cookiecutter.json
@ -17,9 +17,12 @@
  "container_registry": "dockerhub",
  "type": [
    "all",
-    "template",
+    "tesnorflow_template",
-    "benchmark",
+    "tesnorflow_benchmark",
-    "imagenet"
+    "tesnorflow_imagenet",
    "pytorch_benchmark",
    "pytorch_imagenet",
    "pytorch_template"
  ],
  "region": [
    "eastus",
--- a/hooks/post_gen_project.py
+++ b/hooks/post_gen_project.py
@ -16,11 +16,15 @@ def _remove_directories(*directories):
 def _copy_env_file():
    shutil.move("_dotenv_template", ".env")
 _ALL_DIRECTORIES = "TensorFlow_benchmark",  "TensorFlow_imagenet", "TensorFlow_experiment",  "PyTorch_benchmark",  "PyTorch_imagenet", "PyTorch_experiment"
 _CHOICES_DICT = {
-    "template":  ("TensorFlow_benchmark",  "TensorFlow_imagenet"),
+    "tensorflow_template":  filter(lambda x: x.lower()!="tensorflow_experiment", _ALL_DIRECTORIES),
-    "benchmark": ("TensorFlow_experiment", "TensorFlow_imagenet"),
+    "tensorflow_benchmark": filter(lambda x: x.lower()!="tensorflow_benchmark", _ALL_DIRECTORIES),
-    "imagenet":  ("TensorFlow_benchmark",  "TensorFlow_experiment")
+    "tensorflow_imagenet":  filter(lambda x: x.lower()!="tensorflow_imagenet", _ALL_DIRECTORIES),
    "pytorch_imagenet":  filter(lambda x: x.lower()!="pytorch_imagenet", _ALL_DIRECTORIES),
    "pytorch_benchmark":  filter(lambda x: x.lower()!="pytorch_benchmark", _ALL_DIRECTORIES),
    "pytorch_template":  filter(lambda x: x.lower()!="pytorch_experiment", _ALL_DIRECTORIES),
 }
 if __name__ == "__main__":
--- a/{{cookiecutter.project_name}}/PyTorch_benchmark/environment_cpu.yml
+++ b/{{cookiecutter.project_name}}/PyTorch_benchmark/environment_cpu.yml
@ -0,0 +1,15 @@
 name: project_environment
 dependencies:
  # The python interpreter version.
  # Currently Azure ML only supports 3.5.2 and later.
 - python=3.6.2
 - pandas
 - numpy
 - pip:
    # Required packages for AzureML execution, history, and data preparation.
  - azureml-defaults
  - torch-cpu==1.0.0
  - torchvision-cpu
  - horovod==0.15.2
  - pillow
  - tensorboardX
--- a/{{cookiecutter.project_name}}/PyTorch_benchmark/environment_gpu.yml
+++ b/{{cookiecutter.project_name}}/PyTorch_benchmark/environment_gpu.yml
@ -0,0 +1,15 @@
 name: project_environment
 dependencies:
  # The python interpreter version.
  # Currently Azure ML only supports 3.5.2 and later.
 - python=3.6.2
 - pandas
 - numpy
 - pip:
    # Required packages for AzureML execution, history, and data preparation.
  - azureml-defaults
  - torch==1.0.0
  - torchvision==0.2.1
  - horovod==0.15.2
  - pillow
  - tensorboardX
--- a/{{cookiecutter.project_name}}/PyTorch_benchmark/pytorch_benchmark.py
+++ b/{{cookiecutter.project_name}}/PyTorch_benchmark/pytorch_benchmark.py
@ -0,0 +1,59 @@
 """Module for running PyTorch benchmark using synthetic data
 """
 from invoke import task, Collection
 import os
 from config import load_config
 _BASE_PATH = os.path.dirname(os.path.abspath(__file__))
 env_values = load_config()
@task
 def submit_benchmark_remote(c, node_count=int(env_values["CLUSTER_MAX_NODES"])):
    """Submit PyTorch training job using synthetic data to remote cluster
    Args:
        node_count (int, optional): The number of nodes to use in cluster. Defaults to env_values['CLUSTER_MAX_NODES'].
    """
    from aml_compute import PyTorchExperimentCLI
    exp = PyTorchExperimentCLI("synthetic_benchmark_remote")
    run = exp.submit(
        os.path.join(_BASE_PATH, "src"),
        "pytorch_synthetic_benchmark.py",
        {"--model": "resnet50", "--batch-size": 64},
        node_count=node_count,
        dependencies_file=os.path.join(_BASE_PATH, "environment_gpu.yml"),
        wait_for_completion=True,
    )
    print(run)
@task
 def submit_benchmark_local(c):
    """Submit PyTorch training job using synthetic data for local execution
    """
    from aml_compute import TFExperimentCLI
    exp = TFExperimentCLI("synthetic_images_local")
    run = exp.submit_local(
        os.path.join(_BASE_PATH, "src"),
        "pytorch_synthetic_benchmark.py",
        {"--model": "resnet50", "--batch-size": 64},
        dependencies_file=os.path.join(_BASE_PATH, "environment_gpu.yml"),
        wait_for_completion=True,
    )
    print(run)
 remote_collection = Collection("remote")
 remote_collection.add_task(submit_benchmark_remote, "synthetic")
 local_collection = Collection("local")
 local_collection.add_task(submit_benchmark_local, "synthetic")
 submit_collection = Collection("submit", local_collection, remote_collection)
 namespace = Collection("pytorch_benchmark", submit_collection)
--- a/{{cookiecutter.project_name}}/PyTorch_benchmark/src/pytorch_synthetic_benchmark.py
+++ b/{{cookiecutter.project_name}}/PyTorch_benchmark/src/pytorch_synthetic_benchmark.py
@ -0,0 +1,126 @@
 from __future__ import print_function
 import argparse
 import torch.backends.cudnn as cudnn
 import torch.nn.functional as F
 import torch.optim as optim
 import torch.utils.data.distributed
 from torchvision import models
 import horovod.torch as hvd
 import timeit
 import numpy as np
 # Benchmark settings
 parser = argparse.ArgumentParser(
    description="PyTorch Synthetic Benchmark",
    formatter_class=argparse.ArgumentDefaultsHelpFormatter,
 )
 parser.add_argument(
    "--fp16-allreduce",
    action="store_true",
    default=False,
    help="use fp16 compression during allreduce",
 )
 parser.add_argument("--model", type=str, default="resnet50", help="model to benchmark")
 parser.add_argument("--batch-size", type=int, default=32, help="input batch size")
 parser.add_argument(
    "--num-warmup-batches",
    type=int,
    default=10,
    help="number of warm-up batches that don't count towards benchmark",
 )
 parser.add_argument(
    "--num-batches-per-iter",
    type=int,
    default=10,
    help="number of batches per benchmark iteration",
 )
 parser.add_argument(
    "--num-iters", type=int, default=10, help="number of benchmark iterations"
 )
 parser.add_argument(
    "--no-cuda", action="store_true", default=False, help="disables CUDA training"
 )
 args = parser.parse_args()
 args.cuda = not args.no_cuda and torch.cuda.is_available()
 hvd.init()
 if args.cuda:
    # Horovod: pin GPU to local rank.
    torch.cuda.set_device(hvd.local_rank())
 cudnn.benchmark = True
 # Set up standard model.
 model = getattr(models, args.model)()
 if args.cuda:
    # Move model to GPU.
    model.cuda()
 optimizer = optim.SGD(model.parameters(), lr=0.01)
 # Horovod: (optional) compression algorithm.
 compression = hvd.Compression.fp16 if args.fp16_allreduce else hvd.Compression.none
 # Horovod: wrap optimizer with DistributedOptimizer.
 optimizer = hvd.DistributedOptimizer(
    optimizer, named_parameters=model.named_parameters(), compression=compression
 )
 # Horovod: broadcast parameters & optimizer state.
 hvd.broadcast_parameters(model.state_dict(), root_rank=0)
 hvd.broadcast_optimizer_state(optimizer, root_rank=0)
 # Set up fixed fake data
 data = torch.randn(args.batch_size, 3, 224, 224)
 target = torch.LongTensor(args.batch_size).random_() % 1000
 if args.cuda:
    data, target = data.cuda(), target.cuda()
 def benchmark_step():
    optimizer.zero_grad()
    output = model(data)
    loss = F.cross_entropy(output, target)
    loss.backward()
    optimizer.step()
 def log(s, nl=True):
    if hvd.rank() != 0:
        return
    print(s, end="\n" if nl else "")
 log("Model: %s" % args.model)
 log("Batch size: %d" % args.batch_size)
 device = "GPU" if args.cuda else "CPU"
 log("Number of %ss: %d" % (device, hvd.size()))
 # Warm-up
 log("Running warmup...")
 timeit.timeit(benchmark_step, number=args.num_warmup_batches)
 # Benchmark
 log("Running benchmark...")
 img_secs = []
 for x in range(args.num_iters):
    time = timeit.timeit(benchmark_step, number=args.num_batches_per_iter)
    img_sec = args.batch_size * args.num_batches_per_iter / time
    log("Iter #%d: %.1f img/sec per %s" % (x, img_sec, device))
    img_secs.append(img_sec)
 # Results
 img_sec_mean = np.mean(img_secs)
 img_sec_conf = 1.96 * np.std(img_secs)
 log("Img/sec per %s: %.1f +-%.1f" % (device, img_sec_mean, img_sec_conf))
 log(
    "Total img/sec on %d %s(s): %.1f +-%.1f"
    % (hvd.size(), device, hvd.size() * img_sec_mean, hvd.size() * img_sec_conf)
 )
--- a/{{cookiecutter.project_name}}/PyTorch_hvd/environment_cpu.yml
+++ b/{{cookiecutter.project_name}}/PyTorch_hvd/environment_cpu.yml
@ -0,0 +1,16 @@
 name: project_environment
 dependencies:
  # The python interpreter version.
  # Currently Azure ML only supports 3.5.2 and later.
 - python=3.6.2
 - pandas
 - numpy
 - pip:
    # Required packages for AzureML execution, history, and data preparation.
  - azureml-defaults
  - torch-cpu==1.0.0
  - torchvision-cpu
  - horovod==0.15.2
  - pillow
  - fire
  - tensorboardX
--- a/{{cookiecutter.project_name}}/PyTorch_hvd/environment_gpu.yml
+++ b/{{cookiecutter.project_name}}/PyTorch_hvd/environment_gpu.yml
@ -0,0 +1,17 @@
 name: project_environment
 dependencies:
  # The python interpreter version.
  # Currently Azure ML only supports 3.5.2 and later.
 - python=3.6.2
 - pandas
 - numpy
 - pip:
    # Required packages for AzureML execution, history, and data preparation.
  - azureml-defaults
  - torch==1.0.0
  - torchvision
  - horovod==0.15.2
  - pillow
  - fire
  - tensorboardX
  - tqdm
--- a/{{cookiecutter.project_name}}/PyTorch_hvd/src/imagenet_pytorch_horovod.py
+++ b/{{cookiecutter.project_name}}/PyTorch_hvd/src/imagenet_pytorch_horovod.py
@ -0,0 +1,257 @@
 from __future__ import print_function
 import argparse
 import torch.backends.cudnn as cudnn
 import torch.nn.functional as F
 import torch.optim as optim
 import torch.utils.data.distributed
 from torchvision import datasets, transforms, models
 import horovod.torch as hvd
 import tensorboardX
 import os
 from tqdm import tqdm
 # Training settings
 parser = argparse.ArgumentParser(description='PyTorch ImageNet Example',
                                 formatter_class=argparse.ArgumentDefaultsHelpFormatter)
 parser.add_argument('--train-dir', default=os.path.expanduser('~/imagenet/train'),
                    help='path to training data')
 parser.add_argument('--val-dir', default=os.path.expanduser('~/imagenet/validation'),
                    help='path to validation data')
 parser.add_argument('--log-dir', default='./logs',
                    help='tensorboard log directory')
 parser.add_argument('--checkpoint-format', default='./checkpoint-{epoch}.pth.tar',
                    help='checkpoint file format')
 parser.add_argument('--fp16-allreduce', action='store_true', default=False,
                    help='use fp16 compression during allreduce')
 # Default settings from https://arxiv.org/abs/1706.02677.
 parser.add_argument('--batch-size', type=int, default=32,
                    help='input batch size for training')
 parser.add_argument('--val-batch-size', type=int, default=32,
                    help='input batch size for validation')
 parser.add_argument('--epochs', type=int, default=90,
                    help='number of epochs to train')
 parser.add_argument('--base-lr', type=float, default=0.0125,
                    help='learning rate for a single GPU')
 parser.add_argument('--warmup-epochs', type=float, default=5,
                    help='number of warmup epochs')
 parser.add_argument('--momentum', type=float, default=0.9,
                    help='SGD momentum')
 parser.add_argument('--wd', type=float, default=0.00005,
                    help='weight decay')
 parser.add_argument('--no-cuda', action='store_true', default=False,
                    help='disables CUDA training')
 parser.add_argument('--seed', type=int, default=42,
                    help='random seed')
 args = parser.parse_args()
 args.cuda = not args.no_cuda and torch.cuda.is_available()
 hvd.init()
 torch.manual_seed(args.seed)
 if args.cuda:
    # Horovod: pin GPU to local rank.
    torch.cuda.set_device(hvd.local_rank())
    torch.cuda.manual_seed(args.seed)
 cudnn.benchmark = True
 # If set > 0, will resume training from a given checkpoint.
 resume_from_epoch = 0
 for try_epoch in range(args.epochs, 0, -1):
    if os.path.exists(args.checkpoint_format.format(epoch=try_epoch)):
        resume_from_epoch = try_epoch
        break
 # Horovod: broadcast resume_from_epoch from rank 0 (which will have
 # checkpoints) to other ranks.
 resume_from_epoch = hvd.broadcast(torch.tensor(resume_from_epoch), root_rank=0,
                                  name='resume_from_epoch').item()
 # Horovod: print logs on the first worker.
 verbose = 1 if hvd.rank() == 0 else 0
 # Horovod: write TensorBoard logs on first worker.
 log_writer = tensorboardX.SummaryWriter(args.log_dir) if hvd.rank() == 0 else None
 # kwargs = {'num_workers': 4, 'pin_memory': True} if args.cuda else {}
 kwargs={'num_workers': 4}
 train_dataset = \
    datasets.ImageFolder(args.train_dir,
                         transform=transforms.Compose([
                             transforms.RandomResizedCrop(224),
                             transforms.RandomHorizontalFlip(),
                             transforms.ToTensor(),
                             transforms.Normalize(mean=[0.485, 0.456, 0.406],
                                                  std=[0.229, 0.224, 0.225])
                         ]))
 # Horovod: use DistributedSampler to partition data among workers. Manually specify
 # `num_replicas=hvd.size()` and `rank=hvd.rank()`.
 train_sampler = torch.utils.data.distributed.DistributedSampler(
    train_dataset, num_replicas=hvd.size(), rank=hvd.rank())
 train_loader = torch.utils.data.DataLoader(
    train_dataset, batch_size=args.batch_size, sampler=train_sampler, **kwargs)
 val_dataset = \
    datasets.ImageFolder(args.val_dir,
                         transform=transforms.Compose([
                             transforms.Resize(256),
                             transforms.CenterCrop(224),
                             transforms.ToTensor(),
                             transforms.Normalize(mean=[0.485, 0.456, 0.406],
                                                  std=[0.229, 0.224, 0.225])
                         ]))
 val_sampler = torch.utils.data.distributed.DistributedSampler(
    val_dataset, num_replicas=hvd.size(), rank=hvd.rank())
 val_loader = torch.utils.data.DataLoader(val_dataset, batch_size=args.val_batch_size,
                                         sampler=val_sampler, **kwargs)
 print('Model')
 # Set up standard ResNet-50 model.
 model = models.resnet50()
 if args.cuda:
    # Move model to GPU.
    model.cuda()
 # Horovod: scale learning rate by the number of GPUs.
 optimizer = optim.SGD(model.parameters(), lr=args.base_lr * hvd.size(),
                      momentum=args.momentum, weight_decay=args.wd)
 # Horovod: (optional) compression algorithm.
 compression = hvd.Compression.fp16 if args.fp16_allreduce else hvd.Compression.none
 # Horovod: wrap optimizer with DistributedOptimizer.
 optimizer = hvd.DistributedOptimizer(optimizer,
                                     named_parameters=model.named_parameters(),
                                     compression=compression)
 # Restore from a previous checkpoint, if initial_epoch is specified.
 # Horovod: restore on the first worker which will broadcast weights to other workers.
 if resume_from_epoch > 0 and hvd.rank() == 0:
    filepath = args.checkpoint_format.format(epoch=resume_from_epoch)
    checkpoint = torch.load(filepath)
    model.load_state_dict(checkpoint['model'])
    optimizer.load_state_dict(checkpoint['optimizer'])
 print('Broadcasting')
 # Horovod: broadcast parameters & optimizer state.
 hvd.broadcast_parameters(model.state_dict(), root_rank=0)
 hvd.broadcast_optimizer_state(optimizer, root_rank=0)
 def train(epoch):
    model.train()
    train_sampler.set_epoch(epoch)
    train_loss = Metric('train_loss')
    train_accuracy = Metric('train_accuracy')
    with tqdm(total=len(train_loader),
              desc='Train Epoch     #{}'.format(epoch + 1),
              disable=not verbose) as t:
        for batch_idx, (data, target) in enumerate(train_loader):
            adjust_learning_rate(epoch, batch_idx)
            if args.cuda:
                data, target = data.cuda(), target.cuda()
            optimizer.zero_grad()
            output = model(data)
            loss = F.cross_entropy(output, target)
            loss.backward()
            optimizer.step()
            train_loss.update(loss)
            train_accuracy.update(accuracy(output, target))
            t.set_postfix({'loss': train_loss.avg.item(),
                           'accuracy': 100. * train_accuracy.avg.item()})
            t.update(1)
    if log_writer:
        log_writer.add_scalar('train/loss', train_loss.avg, epoch)
        log_writer.add_scalar('train/accuracy', train_accuracy.avg, epoch)
 def validate(epoch):
    model.eval()
    val_loss = Metric('val_loss')
    val_accuracy = Metric('val_accuracy')
    with tqdm(total=len(val_loader),
              desc='Validate Epoch  #{}'.format(epoch + 1),
              disable=not verbose) as t:
        with torch.no_grad():
            for data, target in val_loader:
                if args.cuda:
                    data, target = data.cuda(), target.cuda()
                output = model(data)
                val_loss.update(F.cross_entropy(output, target))
                val_accuracy.update(accuracy(output, target))
                t.set_postfix({'loss': val_loss.avg.item(),
                               'accuracy': 100. * val_accuracy.avg.item()})
                t.update(1)
    if log_writer:
        log_writer.add_scalar('val/loss', val_loss.avg, epoch)
        log_writer.add_scalar('val/accuracy', val_accuracy.avg, epoch)
 # Horovod: using `lr = base_lr * hvd.size()` from the very beginning leads to worse final
 # accuracy. Scale the learning rate `lr = base_lr` ---> `lr = base_lr * hvd.size()` during
 # the first five epochs. See https://arxiv.org/abs/1706.02677 for details.
 # After the warmup reduce learning rate by 10 on the 30th, 60th and 80th epochs.
 def adjust_learning_rate(epoch, batch_idx):
    if epoch < args.warmup_epochs:
        epoch += float(batch_idx + 1) / len(train_loader)
        lr_adj = 1. / hvd.size() * (epoch * (hvd.size() - 1) / args.warmup_epochs + 1)
    elif epoch < 30:
        lr_adj = 1.
    elif epoch < 60:
        lr_adj = 1e-1
    elif epoch < 80:
        lr_adj = 1e-2
    else:
        lr_adj = 1e-3
    for param_group in optimizer.param_groups:
        param_group['lr'] = args.base_lr * hvd.size() * lr_adj
 def accuracy(output, target):
    # get the index of the max log-probability
    pred = output.max(1, keepdim=True)[1]
    return pred.eq(target.view_as(pred)).cpu().float().mean()
 def save_checkpoint(epoch):
    if hvd.rank() == 0:
        filepath = args.checkpoint_format.format(epoch=epoch + 1)
        state = {
            'model': model.state_dict(),
            'optimizer': optimizer.state_dict(),
        }
        torch.save(state, filepath)
 # Horovod: average metrics from distributed training.
 class Metric(object):
    def __init__(self, name):
        self.name = name
        self.sum = torch.tensor(0.)
        self.n = torch.tensor(0.)
    def update(self, val):
        self.sum += hvd.allreduce(val.detach().cpu(), name=self.name)
        self.n += 1
    @property
    def avg(self):
        return self.sum / self.n
 print('Training')
 for epoch in range(resume_from_epoch, args.epochs):
    train(epoch)
    validate(epoch)
    save_checkpoint(epoch)
--- a/{{cookiecutter.project_name}}/PyTorch_hvd/src/logging.conf
+++ b/{{cookiecutter.project_name}}/PyTorch_hvd/src/logging.conf
@ -0,0 +1,27 @@
 [loggers]
 keys=root,__main__
 [handlers]
 keys=consoleHandler
 [formatters]
 keys=simpleFormatter
 [logger_root]
 level=WARNING
 handlers=consoleHandler
 [logger___main__]
 level=DEBUG
 handlers=consoleHandler
 qualname=__main__
 propagate=0
 [handler_consoleHandler]
 class=StreamHandler
 level=DEBUG
 formatter=simpleFormatter
 args=(sys.stdout,)
 [formatter_simpleFormatter]
 format=%(asctime)s - %(name)s - %(levelname)s - %(message)s
--- a/{{cookiecutter.project_name}}/PyTorch_hvd/src/timer.py
+++ b/{{cookiecutter.project_name}}/PyTorch_hvd/src/timer.py
@ -0,0 +1,105 @@
 import collections
 import functools
 import logging
 from timeit import default_timer
 class Timer(object):
    """
    Keyword arguments:
        output:   if True, print output after exiting context.
                  if callable, pass output to callable.
        format:   str.format string to be used for output; default "took {} seconds"
        prefix:   string to prepend (plus a space) to output
                  For convenience, if you only specify this, output defaults to True.
    """
    def __init__(self,
                 timer=default_timer,
                 factor=1,
                 output=None,
                 fmt="took {:.3f} seconds",
                 prefix=""):
        self._timer = timer
        self._factor = factor
        self._output = output
        self._fmt = fmt
        self._prefix = prefix
        self._end = None
        self._start = None
    def start(self):
        self._start = self()
    def stop(self):
        self._end = self()
    def __call__(self):
        """ Return the current time """
        return self._timer()
    def __enter__(self):
        """ Set the start time """
        self.start()
        return self
    def __exit__(self, exc_type, exc_value, exc_traceback):
        """ Set the end time """
        self.stop()
        if self._output is True or (self._output is None and self._prefix):
            self._output = print
        if callable(self._output):
            output = " ".join([self._prefix, self._fmt.format(self.elapsed)])
            self._output(output)
    def __str__(self):
        return '%.3f' % (self.elapsed)
    @property
    def elapsed(self):
        """ Return the elapsed time
        """
        if self._end is None:
            # if elapsed is called in the context manager scope
            return (self() - self._start) * self._factor
        else:
            # if elapsed is called out of the context manager scope
            return (self._end - self._start) * self._factor
 def timer(logger=None,
          level=logging.INFO,
          fmt="function %(function_name)s execution time: %(execution_time).3f",
          *func_or_func_args,
          **timer_kwargs):
    """ Function decorator displaying the function execution time
    """
    def wrapped_f(f):
        @functools.wraps(f)
        def wrapped(*args, **kwargs):
            with Timer(**timer_kwargs) as t:
                out = f(*args, **kwargs)
            context = {
                'function_name': f.__name__,
                'execution_time': t.elapsed,
            }
            if logger:
                logger.log(
                    level,
                    fmt % context,
                    extra=context)
            else:
                print(fmt % context)
            return out
        return wrapped
    if (len(func_or_func_args) == 1
            and isinstance(func_or_func_args[0], collections.Callable)):
        return wrapped_f(func_or_func_args[0])
    else:
        return wrapped_f
--- a/{{cookiecutter.project_name}}/PyTorch_imagenet/environment_cpu.yml
+++ b/{{cookiecutter.project_name}}/PyTorch_imagenet/environment_cpu.yml
@ -0,0 +1,16 @@
 name: project_environment
 dependencies:
  # The python interpreter version.
  # Currently Azure ML only supports 3.5.2 and later.
 - python=3.6.2
 - pandas
 - numpy
 - pip:
    # Required packages for AzureML execution, history, and data preparation.
  - azureml-defaults
  - torch-cpu==1.0.0
  - torchvision-cpu
  - horovod==0.15.2
  - pillow
  - fire
  - tensorboardX
--- a/{{cookiecutter.project_name}}/PyTorch_imagenet/environment_gpu.yml
+++ b/{{cookiecutter.project_name}}/PyTorch_imagenet/environment_gpu.yml
@ -0,0 +1,16 @@
 name: project_environment
 dependencies:
  # The python interpreter version.
  # Currently Azure ML only supports 3.5.2 and later.
 - python=3.6.2
 - pandas
 - numpy
 - pip:
    # Required packages for AzureML execution, history, and data preparation.
  - azureml-defaults
  - torch==1.0.0
  - torchvision==0.2.1
  - horovod==0.15.2
  - pillow
  - fire
  - tensorboardX
--- a/{{cookiecutter.project_name}}/PyTorch_imagenet/pytorch_imagenet.py
+++ b/{{cookiecutter.project_name}}/PyTorch_imagenet/pytorch_imagenet.py
@ -0,0 +1,119 @@
 """Module for running PyTorch training on Imagenet data
 """
 from invoke import task, Collection
 import os
 from config import load_config
 _BASE_PATH = os.path.dirname(os.path.abspath(__file__))
 env_values = load_config()
@task
 def submit_synthetic(c, node_count=int(env_values["CLUSTER_MAX_NODES"]), epochs=1):
    """Submit PyTorch training job using synthetic imagenet data to remote cluster
    Args:
        node_count (int, optional): The number of nodes to use in cluster. Defaults to env_values['CLUSTER_MAX_NODES'].
        epochs (int, optional): Number of epochs to run training for. Defaults to 1.
    """
    from aml_compute import PyTorchExperimentCLI
    exp = PyTorchExperimentCLI("pytorch_synthetic_images_remote")
    run = exp.submit(
        os.path.join(_BASE_PATH, "src"),
        "imagenet_pytorch_horovod.py",
        {"--epochs": epochs, "--use_gpu":True},
        node_count=node_count,
        dependencies_file=os.path.join(_BASE_PATH, "environment_gpu.yml"),
        wait_for_completion=True,
    )
    print(run)
@task
 def submit_synthetic_local(c, epochs=1):
    """Submit PyTorch training job using synthetic imagenet data for local execution
    Args:
        epochs (int, optional): Number of epochs to run training for. Defaults to 1.
    """
    from aml_compute import PyTorchExperimentCLI
    exp = PyTorchExperimentCLI("pytorch_synthetic_images_local")
    run = exp.submit_local(
        os.path.join(_BASE_PATH, "src"),
        "imagenet_pytorch_horovod.py",
        {"--epochs": epochs, "--use_gpu":True},
        dependencies_file=os.path.join(_BASE_PATH, "environment_gpu.yml"),
        wait_for_completion=True,
    )
    print(run)
@task
 def submit_images(c, node_count=int(env_values["CLUSTER_MAX_NODES"]), epochs=1):
    """Submit PyTorch training job using real imagenet data to remote cluster
    Args:
        node_count (int, optional): The number of nodes to use in cluster. Defaults to env_values['CLUSTER_MAX_NODES'].
        epochs (int, optional): Number of epochs to run training for. Defaults to 1.
    """
    from aml_compute import PyTorchExperimentCLI
    exp = PyTorchExperimentCLI("pytorch_real_images_remote")
    run = exp.submit(
        os.path.join(_BASE_PATH, "src"),
         "imagenet_pytorch_horovod.py",
        {
            "--use_gpu":True, 
            "--epochs":epochs,
            "--training_data_path": "{datastore}/train",
            "--validation_data_path": "{datastore}/validation",
        },
        node_count=node_count,
        dependencies_file=os.path.join(_BASE_PATH, "environment_gpu.yml"),
        wait_for_completion=True,
    )
    print(run)
@task
 def submit_images_local(c, epochs=1):
    """Submit PyTorch training job using real imagenet data for local execution
    Args:
        epochs (int, optional): Number of epochs to run training for. Defaults to 1.
    """
    from aml_compute import PyTorchExperimentCLI
    exp = PyTorchExperimentCLI("pytorch_real_images_local")
    run = exp.submit_local(
        os.path.join(_BASE_PATH, "src"),
          "imagenet_pytorch_horovod.py",
        {
            "--epochs": epochs, 
            "--use_gpu":True, 
            "--training_data_path": "/data/train",
            "--validation_data_path": "/data/validation",
        },
        dependencies_file=os.path.join(_BASE_PATH, "environment_gpu.yml"),
        docker_args=["-v", f"{env_values['DATA']}:/data"],
        wait_for_completion=True,
    )
    print(run)
 remote_collection = Collection("remote")
 remote_collection.add_task(submit_images, "images")
 remote_collection.add_task(submit_synthetic, "synthetic")
 local_collection = Collection("local")
 local_collection.add_task(submit_images_local, "images")
 local_collection.add_task(submit_synthetic_local, "synthetic")
 submit_collection = Collection("submit", local_collection, remote_collection)
 namespace = Collection("pytorch_imagenet", submit_collection)
--- a/{{cookiecutter.project_name}}/PyTorch_imagenet/src/imagenet_pytorch_horovod.py
+++ b/{{cookiecutter.project_name}}/PyTorch_imagenet/src/imagenet_pytorch_horovod.py
@ -0,0 +1,446 @@
 """ Trains ResNet50 in PyTorch using Horovod.
 """
 import logging
 import logging.config
 import os
 import shutil
 from os import path
 import fire
 import numpy as np
 import torch.backends.cudnn as cudnn
 import torch.nn.functional as F
 import torch.optim as optim
 import torch.utils.data.distributed
 import torchvision.models as models
 from azureml.core.run import Run
 from tensorboardX import SummaryWriter
 from torch.utils.data import Dataset
 from torchvision import transforms, datasets
 from timer import Timer
 def _str_to_bool(in_str):
    if "t" in in_str.lower():
        return True
    else:
        return False
 _WIDTH = 224
 _HEIGHT = 224
 _CHANNELS = 3
 _LR = 0.001
 _EPOCHS = os.getenv("EPOCHS", 5)
 _BATCHSIZE = 64
 _RGB_MEAN = [0.485, 0.456, 0.406]
 _RGB_SD = [0.229, 0.224, 0.225]
 _SEED = 42
 # Settings from https://arxiv.org/abs/1706.02677.
 _WARMUP_EPOCHS = 5
 _WEIGHT_DECAY = 0.00005
 _DATA_LENGTH = int(
    os.getenv("FAKE_DATA_LENGTH", 1281167)  # 1281167
 )  # How much fake data to simulate, default to size of imagenet dataset
 _DISTRIBUTED = _str_to_bool(os.getenv("DISTRIBUTED", "False"))
 if _DISTRIBUTED:
    import horovod.torch as hvd
    hvd.init()
 def _get_rank():
    if _DISTRIBUTED:
        try:
            return hvd.rank()
        except:
            return 0
    else:
        return 0
 def _append_path_to(data_path, data_series):
    return data_series.apply(lambda x: path.join(data_path, x))
 def _create_data(batch_size, num_batches, dim, channels, seed=42):
    np.random.seed(seed)
    return np.random.rand(batch_size * num_batches, channels, dim[0], dim[1]).astype(
        np.float32
    )
 def _create_labels(batch_size, num_batches, n_classes):
    return np.random.choice(n_classes, batch_size * num_batches)
 class FakeData(Dataset):
    def __init__(
        self,
        batch_size=32,
        num_batches=20,
        dim=(224, 224),
        n_channels=3,
        n_classes=10,
        length=_DATA_LENGTH,
        data_transform=None,
    ):
        self.dim = dim
        self.n_channels = n_channels
        self.n_classes = n_classes
        self.num_batches = num_batches
        self._data = _create_data(
            batch_size, self.num_batches, self.dim, self.n_channels
        )
        self._labels = _create_labels(batch_size, self.num_batches, self.n_classes)
        self.translation_index = np.random.choice(len(self._labels), length)
        self._length = length
        self._data_transform = data_transform
        logger = logging.getLogger(__name__)
        logger.info(
            "Creating fake data {} labels and {} images".format(
                n_classes, len(self._data)
            )
        )
    def __getitem__(self, idx):
        logger = logging.getLogger(__name__)
        logger.debug("Retrieving samples")
        logger.debug(str(idx))
        tr_index_array = self.translation_index[idx]
        if self._data_transform is not None:
            data = self._data_transform(self._data[tr_index_array])
        else:
            data = self._data[tr_index_array]
        return data, self._labels[tr_index_array]
    def __len__(self):
        return self._length
 class AverageMeter(object):
    """Computes and stores the average and current value"""
    def __init__(self):
        self.reset()
    def reset(self):
        self._val = 0
        self._sum = 0
        self._count = 0
    def update(self, val, n=1):
        self._val = val
        self._sum += val * n
        self._count += n
    @property
    def avg(self):
        return self._sum / self._count
 def accuracy(output, target, topk=(1,)):
    """Computes the accuracy over the k top predictions for the specified values of k"""
    with torch.no_grad():
        maxk = max(topk)
        batch_size = target.size(0)
        _, pred = output.topk(maxk, 1, True, True)
        pred = pred.t()
        correct = pred.eq(target.view(1, -1).expand_as(pred))
        res = []
        for k in topk:
            correct_k = correct[:k].view(-1).float().sum(0, keepdim=True)
            res.append(correct_k.mul_(100.0 / batch_size))
    return res
 def train(train_loader, model, criterion, optimizer, base_lr, warmup_epochs, epoch):
    logger = logging.getLogger(__name__)
    batch_time = AverageMeter()
    losses = AverageMeter()
    top1 = AverageMeter()
    top5 = AverageMeter()
    msg = " duration({})  loss:{} total-samples: {}"
    t = Timer()
    t.start()
    for i, (data, target) in enumerate(train_loader):
        adjust_learning_rate(optimizer, base_lr, warmup_epochs, train_loader, epoch, i)
        data, target = data.cuda(), target.cuda()
        optimizer.zero_grad()
        # compute output
        output = model(data)
        loss = criterion(output, target)
        # compute gradient and do SGD step
        loss.backward()
        optimizer.step()
        losses.update(loss.item(), data.size(0))
        acc1, acc5 = accuracy(output, target, topk=(1, 5))
        top1.update(acc1.item(), data.size(0))
        top5.update(acc5.item(), data.size(0))
        if i % 100 == 0:
            t.stop()
            batch_time.update(t.elapsed, n=100)
            logger.info(msg.format(t.elapsed, loss.item(), i * len(data)))
            t.start()
    return {"acc": top1.avg, "loss": losses.avg, "batch_time": batch_time.avg}
 def validate(val_loader, model, criterion, device):
    losses = AverageMeter()
    top1 = AverageMeter()
    top5 = AverageMeter()
    logger = logging.getLogger(__name__)
    msg = " duration({})  loss:{} total-samples: {}"
    t = Timer()
    t.start()
    for i, (data, target) in enumerate(val_loader):
        logger.debug("bug")
        data, target = (
            data.to(device, non_blocking=True),
            target.to(device, non_blocking=True),
        )
        # compute output
        output = model(data)
        loss = criterion(output, target)
        losses.update(loss.item(), data.size(0))
        acc1, acc5 = accuracy(output, target, topk=(1, 5))
        top1.update(acc1.item(), data.size(0))
        top5.update(acc5.item(), data.size(0))
        if i % 100 == 0:
            logger.info(msg.format(t.elapsed, loss.item(), i * len(data)))
            t.start()
    return {"acc": top1.avg, "loss": losses.avg}
 def _log_summary(data_length, duration, batch_size):
    logger = logging.getLogger(__name__)
    images_per_second = data_length / duration
    logger.info("Data length:      {}".format(data_length))
    logger.info("Total duration:   {:.3f}".format(duration))
    logger.info("Total images/sec: {:.3f}".format(images_per_second))
    logger.info(
        "Batch size:       (Per GPU {}: Total {})".format(
            batch_size, hvd.size() * batch_size if _DISTRIBUTED else batch_size
        )
    )
    logger.info("Distributed:      {}".format("True" if _DISTRIBUTED else "False"))
    logger.info("Num GPUs:         {:.3f}".format(hvd.size() if _DISTRIBUTED else 1))
 def _get_sampler(dataset, is_distributed=_DISTRIBUTED):
    if is_distributed:
        return torch.utils.data.distributed.DistributedSampler(
            dataset, num_replicas=hvd.size(), rank=hvd.rank()
        )
    else:
        return torch.utils.data.sampler.RandomSampler(dataset)
 def save_checkpoint(model, optimizer, filepath):
    if hvd.rank() == 0:
        state = {"model": model.state_dict(), "optimizer": optimizer.state_dict()}
    torch.save(state, filepath)
 # Horovod: using `lr = base_lr * hvd.size()` from the very beginning leads to worse final
 # accuracy. Scale the learning rate `lr = base_lr` ---> `lr = base_lr * hvd.size()` during
 # the first five epochs. See https://arxiv.org/abs/1706.02677 for details.
 # After the warmup reduce learning rate by 10 on the 30th, 60th and 80th epochs.
 def adjust_learning_rate(
    optimizer, base_lr, warmup_epochs, data_loader, epoch, batch_idx
 ):
    logger = logging.getLogger(__name__)
    size = hvd.size() if _DISTRIBUTED else 1
    if epoch < warmup_epochs:
        epoch += float(batch_idx + 1) / len(data_loader)
        lr_adj = 1.0 / size * (epoch * (size - 1) / warmup_epochs + 1)
    elif epoch < 30:
        lr_adj = 1.0
    elif epoch < 60:
        lr_adj = 1e-1
    elif epoch < 80:
        lr_adj = 1e-2
    else:
        lr_adj = 1e-3
    for param_group in optimizer.param_groups:
        new_lr = base_lr * size * lr_adj
        if param_group["lr"]!=new_lr:
            param_group["lr"] = new_lr
            if _get_rank()==0:
                logger.info(f"setting lr to {param_group['lr']}")
 def main(
    training_data_path=None,
    validation_data_path=None,
    use_gpu=False,
    save_filepath=None,
    model="resnet50",
    epochs=_EPOCHS,
    batch_size=_BATCHSIZE,
    fp16_allreduce=False,
    base_lr=0.0125,
    warmup_epochs=5,
 ):
    logger = logging.getLogger(__name__)
    device = torch.device("cuda" if use_gpu else "cpu")
    logger.info(f"Running on {device}")
    if _DISTRIBUTED:
        # Horovod: initialize Horovod.
        logger.info("Running Distributed")
        torch.manual_seed(_SEED)
        if use_gpu:
            # Horovod: pin GPU to local rank.
            torch.cuda.set_device(hvd.local_rank())
            torch.cuda.manual_seed(_SEED)
    logger.info("PyTorch version {}".format(torch.__version__))
    # Horovod: write TensorBoard logs on first worker.
    if (_DISTRIBUTED and hvd.rank() == 0) or not _DISTRIBUTED:
        run = Run.get_context()
        run.tag("model", value=model)
        logs_dir = os.path.join(os.curdir, "logs")
        if os.path.exists(logs_dir):
            logger.debug(f"Log directory {logs_dir} found | Deleting")
            shutil.rmtree(logs_dir)
        summary_writer = SummaryWriter(logdir=logs_dir)
    if training_data_path is None:
        logger.info("Setting up fake loaders")
        train_dataset = FakeData(n_classes=1000, data_transform=torch.FloatTensor)
        validation_dataset = None
    else:
        normalize = transforms.Normalize(_RGB_MEAN, _RGB_SD)
        logger.info("Setting up loaders")
        logger.info(f"Loading training from {training_data_path}")
        train_dataset = datasets.ImageFolder(
            training_data_path,
            transforms.Compose(
                [
                    transforms.RandomResizedCrop(_WIDTH),
                    transforms.RandomHorizontalFlip(),
                    transforms.ToTensor(),
                    normalize,
                ]
            ),
        )
        if validation_data_path is not None:
            logger.info(f"Loading validation from {validation_data_path}")
            validation_dataset = datasets.ImageFolder(
                validation_data_path,
                transforms.Compose(
                    [
                        transforms.Resize(256),
                        transforms.CenterCrop(224),
                        transforms.ToTensor(),
                        normalize,
                    ]
                ),
            )
    train_sampler = _get_sampler(train_dataset)
    kwargs = {"num_workers": 5, "pin_memory": True}
    train_loader = torch.utils.data.DataLoader(
        train_dataset, batch_size=batch_size, sampler=train_sampler, **kwargs
    )
    if validation_data_path is not None:
        val_sampler = _get_sampler(validation_dataset)
        val_loader = torch.utils.data.DataLoader(
            validation_dataset, batch_size=batch_size, sampler=val_sampler, **kwargs
        )
    # Autotune
    cudnn.benchmark = True
    logger.info("Loading model")
    # Load symbol
    model = models.__dict__[model](pretrained=False)
    # model.to(device)
    if use_gpu:
        # Move model to GPU.
        model.cuda()
    # # Horovod: (optional) compression algorithm.
    # compression = hvd.Compression.fp16 if args.fp16_allreduce else hvd.Compression.none
    num_gpus = hvd.size() if _DISTRIBUTED else 1
    # Horovod: scale learning rate by the number of GPUs.
    optimizer = optim.SGD(model.parameters(), lr=_LR * num_gpus, momentum=0.9)
    if _DISTRIBUTED:
        compression = hvd.Compression.fp16 if fp16_allreduce else hvd.Compression.none
        # Horovod: wrap optimizer with DistributedOptimizer.
        optimizer = hvd.DistributedOptimizer(
            optimizer,
            named_parameters=model.named_parameters(),
            compression=compression,
        )
        # Horovod: broadcast parameters & optimizer state.
        hvd.broadcast_parameters(model.state_dict(), root_rank=0)
        hvd.broadcast_optimizer_state(optimizer, root_rank=0)
    criterion = F.cross_entropy
    # Main training-loop
    logger.info("Training ...")
    for epoch in range(epochs):
        with Timer(output=logger.info, prefix=f"Training epoch {epoch} ") as t:
            model.train()
            if _DISTRIBUTED:
                train_sampler.set_epoch(epoch)
            metrics = train(
                train_loader, model, criterion, optimizer, base_lr, warmup_epochs, epoch
            )
            if (_DISTRIBUTED and hvd.rank() == 0) or not _DISTRIBUTED:
                run.log_row("Training metrics", epoch=epoch, **metrics)
                summary_writer.add_scalar("Train/Loss", metrics["loss"], epoch)
                summary_writer.add_scalar("Train/Acc", metrics["acc"], epoch)
                summary_writer.add_scalar("Train/BatchTime", metrics["batch_time"], epoch)
        if validation_data_path is not None:
            model.eval()
            metrics = validate(val_loader, model, criterion, device)
            if (_DISTRIBUTED and hvd.rank() == 0) or not _DISTRIBUTED:
                run.log_row("Validation metrics", epoch=epoch, **metrics)
                summary_writer.add_scalar("Validation/Loss", metrics["loss"], epoch)
                summary_writer.add_scalar("Validation/Acc", metrics["acc"], epoch)
        if save_filepath is not None:
            save_checkpoint(model, optimizer, save_filepath)
    _log_summary(epochs * len(train_dataset), t.elapsed, batch_size)
 if __name__ == "__main__":
    logging.config.fileConfig(os.getenv("LOG_CONFIG", "logging.conf"))
    fire.Fire(main)
--- a/{{cookiecutter.project_name}}/PyTorch_imagenet/src/logging.conf
+++ b/{{cookiecutter.project_name}}/PyTorch_imagenet/src/logging.conf
@ -0,0 +1,27 @@
 [loggers]
 keys=root,__main__
 [handlers]
 keys=consoleHandler
 [formatters]
 keys=simpleFormatter
 [logger_root]
 level=WARNING
 handlers=consoleHandler
 [logger___main__]
 level=INFO
 handlers=consoleHandler
 qualname=__main__
 propagate=0
 [handler_consoleHandler]
 class=StreamHandler
 level=INFO
 formatter=simpleFormatter
 args=(sys.stdout,)
 [formatter_simpleFormatter]
 format=%(asctime)s - %(name)s - %(levelname)s - %(message)s
--- a/{{cookiecutter.project_name}}/PyTorch_imagenet/src/timer.py
+++ b/{{cookiecutter.project_name}}/PyTorch_imagenet/src/timer.py
@ -0,0 +1,105 @@
 import collections
 import functools
 import logging
 from timeit import default_timer
 class Timer(object):
    """
    Keyword arguments:
        output:   if True, print output after exiting context.
                  if callable, pass output to callable.
        format:   str.format string to be used for output; default "took {} seconds"
        prefix:   string to prepend (plus a space) to output
                  For convenience, if you only specify this, output defaults to True.
    """
    def __init__(self,
                 timer=default_timer,
                 factor=1,
                 output=None,
                 fmt="took {:.3f} seconds",
                 prefix=""):
        self._timer = timer
        self._factor = factor
        self._output = output
        self._fmt = fmt
        self._prefix = prefix
        self._end = None
        self._start = None
    def start(self):
        self._start = self()
    def stop(self):
        self._end = self()
    def __call__(self):
        """ Return the current time """
        return self._timer()
    def __enter__(self):
        """ Set the start time """
        self.start()
        return self
    def __exit__(self, exc_type, exc_value, exc_traceback):
        """ Set the end time """
        self.stop()
        if self._output is True or (self._output is None and self._prefix):
            self._output = print
        if callable(self._output):
            output = " ".join([self._prefix, self._fmt.format(self.elapsed)])
            self._output(output)
    def __str__(self):
        return '%.3f' % (self.elapsed)
    @property
    def elapsed(self):
        """ Return the elapsed time
        """
        if self._end is None:
            # if elapsed is called in the context manager scope
            return (self() - self._start) * self._factor
        else:
            # if elapsed is called out of the context manager scope
            return (self._end - self._start) * self._factor
 def timer(logger=None,
          level=logging.INFO,
          fmt="function %(function_name)s execution time: %(execution_time).3f",
          *func_or_func_args,
          **timer_kwargs):
    """ Function decorator displaying the function execution time
    """
    def wrapped_f(f):
        @functools.wraps(f)
        def wrapped(*args, **kwargs):
            with Timer(**timer_kwargs) as t:
                out = f(*args, **kwargs)
            context = {
                'function_name': f.__name__,
                'execution_time': t.elapsed,
            }
            if logger:
                logger.log(
                    level,
                    fmt % context,
                    extra=context)
            else:
                print(fmt % context)
            return out
        return wrapped
    if (len(func_or_func_args) == 1
            and isinstance(func_or_func_args[0], collections.Callable)):
        return wrapped_f(func_or_func_args[0])
    else:
        return wrapped_f
--- a/{{cookiecutter.project_name}}/PyTorch_template/environment_cpu.yml
+++ b/{{cookiecutter.project_name}}/PyTorch_template/environment_cpu.yml
@ -0,0 +1,15 @@
 name: project_environment
 dependencies:
  # The python interpreter version.
  # Currently Azure ML only supports 3.5.2 and later.
 - python=3.6.2
 - pandas
 - numpy
 - pip:
    # Required packages for AzureML execution, history, and data preparation.
  - azureml-defaults
  - torch-cpu==1.0.0
  - torchvision-cpu
  - horovod==0.15.2
  - pillow
  - tensorboardX
--- a/{{cookiecutter.project_name}}/PyTorch_template/environment_gpu.yml
+++ b/{{cookiecutter.project_name}}/PyTorch_template/environment_gpu.yml
@ -0,0 +1,15 @@
 name: project_environment
 dependencies:
  # The python interpreter version.
  # Currently Azure ML only supports 3.5.2 and later.
 - python=3.6.2
 - pandas
 - numpy
 - pip:
    # Required packages for AzureML execution, history, and data preparation.
  - azureml-defaults
  - torch==1.0.0
  - torchvision
  - horovod==0.15.2
  - pillow
  - tensorboardX
--- a/{{cookiecutter.project_name}}/PyTorch_template/pytorch_experiment.py
+++ b/{{cookiecutter.project_name}}/PyTorch_template/pytorch_experiment.py
@ -0,0 +1,126 @@
 """ This is an example template that you can use to create functions that you can call with invoke
 """
 from invoke import task, Collection
 import os
 from config import load_config
 _BASE_PATH = os.path.dirname(os.path.abspath( __file__ ))
 env_values = load_config()
@task
 def submit_local(c):
    """This command isn't implemented please modify to use.
    The call below will work for submitting jobs to execute locally on a GPU.
    """
    raise NotImplementedError(
        "You need to modify this call before being able to use it"
    )
    from aml_compute import PyTorchExperimentCLI
    exp = PyTorchExperimentCLI("<YOUR-EXPERIMENT-NAME>")
    run = exp.submit_local(
        os.path.join(_BASE_PATH, "src"),
        "<YOUR-TRAINING-SCRIPT>",
        {"YOUR": "ARGS"},
        dependencies_file=os.path.join(_BASE_PATH, "environment_gpu.yml"),
        wait_for_completion=True,
    )
    print(run)
@task
 def submit_remote(c, node_count=int(env_values["CLUSTER_MAX_NODES"])):
    """This command isn't implemented please modify to use.
    The call below will work for submitting jobs to execute on a remote cluster using GPUs.
    """
    raise NotImplementedError(
        "You need to modify this call before being able to use it"
    )
    from aml_compute import PyTorchExperimentCLI
    exp = PyTorchExperimentCLI("<YOUR-EXPERIMENT-NAME>")
    run = exp.submit(
        os.path.join(_BASE_PATH, "src"),
        "<YOUR-TRAINING-SCRIPT>",
        {"YOUR": "ARGS"},
        node_count=node_count,
        dependencies_file=os.path.join(_BASE_PATH, "environment_gpu.yml"),
        wait_for_completion=True,
    )
    print(run)
@task
 def submit_images_remote(c, node_count=int(env_values["CLUSTER_MAX_NODES"])):
    """This command isn't implemented please modify to use.
    The call below will work for submitting jobs to execute on a remote cluster using GPUs.
    Notive that we are passing in a {datastore} parameter to the path. This tells the submit
    method that we want the location as mapped by the datastore to be inserted here. Upon
    execution the appropriate path will be prepended to the training_data_path and validation_data_path.
    """
    raise NotImplementedError(
        "You need to modify this call before being able to use it"
    )
    from aml_compute import PyTorchExperimentCLI
    exp = PyTorchExperimentCLI("<YOUR-EXPERIMENT-NAME>")
    run = exp.submit(
        os.path.join(_BASE_PATH, "src"),
        "<YOUR-TRAINING-SCRIPT>",
        {
            "--training_data_path": "{datastore}/train",
            "--validation_data_path": "{datastore}/validation",
            "--epochs": "1",
            "--data_type": "images",
            "--data-format": "channels_first",
        },
        node_count=node_count,
        dependencies_file=os.path.join(_BASE_PATH, "environment_gpu.yml"),
        wait_for_completion=True,
    )
    print(run)
@task
 def submit_images_local(c):
    """This command isn't implemented please modify to use.
    The call below will work for submitting jobs to execute locally on a GPU.
    Here we also map a volume to the docker container executing locally. This is the 
    location we tell our script to look for our training and validation data. Feel free to 
    adjust the other arguments as required by your trainining script.
    """
    raise NotImplementedError(
        "You need to modify this call before being able to use it"
    )
    from aml_compute import PyTorchExperimentCLI
    exp = PyTorchExperimentCLI("<YOUR-EXPERIMENT-NAME>")
    run = exp.submit_local(
        os.path.join(_BASE_PATH, "src"),
        "<YOUR-TRAINING-SCRIPT>",
        {
            "--training_data_path": "/data/train",
            "--validation_data_path": "/data/validation",
            "--epochs": "1",
            "--data_type": "images",
            "--data-format": "channels_first",
        },
        dependencies_file=os.path.join(_BASE_PATH, "environment_gpu.yml"),
        docker_args=["-v", f"{env_values['data']}:/data"],
        wait_for_completion=True,
    )
    print(run)
 remote_collection = Collection("remote")
 remote_collection.add_task(submit_images_remote, "images")
 remote_collection.add_task(submit_remote, "synthetic")
 local_collection = Collection("local")
 local_collection.add_task(submit_images_local, "images")
 local_collection.add_task(submit_local, "synthetic")
 submit_collection = Collection("submit", local_collection, remote_collection)
 namespace = Collection("pytorch_experiment", submit_collection)
--- a/{{cookiecutter.project_name}}/PyTorch_template/src/train_model.py
+++ b/{{cookiecutter.project_name}}/PyTorch_template/src/train_model.py
@ -0,0 +1,3 @@
 """
 Place Script to train your PyTorch model here
 """
--- a/{{cookiecutter.project_name}}/control/Docker/dockerfile
+++ b/{{cookiecutter.project_name}}/control/Docker/dockerfile
@ -87,6 +87,12 @@ ENV PYTHONPATH /workspace/TensorFlow_benchmark:$PYTHONPATH
 # imagenet {% if cookiecutter.type == "imagenet" or cookiecutter.type == "all"%}
 ENV PYTHONPATH /workspace/TensorFlow_imagenet:$PYTHONPATH 
 # ------- {% endif %}
 # pytorch benchmark {% if cookiecutter.type == "pytorch_benchmark" or cookiecutter.type == "all"%}
 ENV PYTHONPATH /workspace/PyTorch_benchmark:$PYTHONPATH 
 # ------- {% endif %}
 # pytorch imagenet {% if cookiecutter.type == "pytorch_imagenet" or cookiecutter.type == "all"%}
 ENV PYTHONPATH /workspace/PyTorch_imagenet:$PYTHONPATH 
 # ------- {% endif %}
 # Completion script
 COPY bash.completion /etc/bash_completion.d/
 RUN echo "source /etc/bash_completion.d/bash.completion" >> /root/.bashrc
--- a/{{cookiecutter.project_name}}/control/src/aml_compute.py
+++ b/{{cookiecutter.project_name}}/control/src/aml_compute.py
@ -15,7 +15,7 @@ from azureml.core.conda_dependencies import (
 )
 from azureml.core.runconfig import EnvironmentDefinition
 from azureml.tensorboard import Tensorboard
-from azureml.train.dnn import TensorFlow
+from azureml.train.dnn import TensorFlow, PyTorch
 from config import load_config
 from toolz import curry, pipe
 from pprint import pformat
@ -71,7 +71,7 @@ def _create_cluster(
    return compute_target
-def _prepare_environment_definition(dependencies_file, distributed):
+def _prepare_environment_definition(base_image, dependencies_file, distributed):
    logger = logging.getLogger(__name__)
    env_def = EnvironmentDefinition()
    conda_dep = CondaDependencies(conda_dependencies_file_path=dependencies_file)
@ -79,7 +79,7 @@ def _prepare_environment_definition(dependencies_file, distributed):
    env_def.python.conda_dependencies = conda_dep
    env_def.docker.enabled = True
    env_def.docker.gpu_support = True
-    env_def.docker.base_image = "mcr.microsoft.com/azureml/base-gpu:intelmpi2018.3-cuda9.0-cudnn7-ubuntu16.04"
+    env_def.docker.base_image = base_image
    env_def.docker.shm_size = "8g"
    env_def.environment_variables["NCCL_SOCKET_IFNAME"] = "eth0"
    env_def.environment_variables["NCCL_IB_DISABLE"] = 1
@ -104,16 +104,18 @@ def _create_estimator(
    entry_script,
    compute_target,
    script_params,
    base_image,
    node_count=_CLUSTER_MAX_NODES,
    process_count_per_node=4,
    docker_args=(),
 ):
    logger = logging.getLogger(__name__)
    logger.debug(f"Base image {base_image}")
    logger.debug(f"Loading dependencies from {dependencies_file}")
    # If the compute target is "local" then don't run distributed
    distributed = not (isinstance(compute_target, str) and compute_target == "local")
-    env_def = _prepare_environment_definition(dependencies_file, distributed)
+    env_def = _prepare_environment_definition(base_image, dependencies_file, distributed)
    env_def.docker.arguments.extend(list(docker_args))
    estimator = estimator_class(
@ -362,6 +364,8 @@ class TFExperimentCLI(ExperimentCLI):
        wait_for_completion,
    ):
        self._logger.debug(script_params)
        base_image = "mcr.microsoft.com/azureml/base-gpu:intelmpi2018.3-cuda9.0-cudnn7-ubuntu16.04"
        estimator = _create_estimator(
            TensorFlow,
            dependencies_file,
@ -369,6 +373,7 @@ class TFExperimentCLI(ExperimentCLI):
            entry_script,
            cluster,
            script_params,
            base_image,
            node_count=node_count,
            process_count_per_node=process_count_per_node,
            docker_args=docker_args
@ -398,6 +403,139 @@ class TFExperimentCLI(ExperimentCLI):
        return {key: _replace(value) for key, value in script_params.items()}
 class PyTorchExperimentCLI(ExperimentCLI):
    """Creates Experiment object that can be used to create clusters and submit experiments
    Returns:
        PyTorchExperimentCLI: Experiment object
    """
    def submit_local(
        self,
        project_folder,
        entry_script,
        script_params,
        dependencies_file=_DEPENDENCIES_FILE,
        wait_for_completion=True,
        docker_args=(),
    ):
        """Submit experiment for local execution
        Args:
            project_folder (string): Path of you source files for the experiment
            entry_script (string): The filename of your script to run. Must be found in your project_folder
            script_params (dict): Dictionary of script parameters
            dependencies_file (string, optional): The location of your environment.yml to use to create the
                                                  environment your training script requires. 
                                                  Defaults to _DEPENDENCIES_FILE.
            wait_for_completion (bool, optional): Whether to block until experiment is done. Defaults to True.
            docker_args (tuple, optional): Docker arguments to pass. Defaults to ().
        """
        self._logger.info("Running in local mode")
        self._submit(
            dependencies_file,
            project_folder,
            entry_script,
            "local",
            script_params,
            1,
            1,
            docker_args,
            wait_for_completion,
        )
    def submit(
        self,
        project_folder,
        entry_script,
        script_params,
        dependencies_file=_DEPENDENCIES_FILE,
        node_count=_CLUSTER_MAX_NODES,
        process_count_per_node=4,
        wait_for_completion=True,
        docker_args=(),
    ):
        """Submit experiment for remote execution on AzureML clusters
        Args:
            project_folder (string): Path of you source files for the experiment
            entry_script (string): The filename of your script to run. Must be found in your project_folder
            script_params (dict): Dictionary of script parameters
            dependencies_file (string, optional): The location of your environment.yml to use to
                                                  create the environment your training script requires. 
                                                  Defaults to _DEPENDENCIES_FILE.
            node_count (int, optional): [description]. Defaults to _CLUSTER_MAX_NODES.
            process_count_per_node (int, optional): Number of precesses to run on each node. 
                                                    Usually should be the same as the number of GPU for GPU exeuction. 
                                                    Defaults to 4.
            wait_for_completion (bool, optional): Whether to block until experiment is done. Defaults to True.
            docker_args (tuple, optional): Docker arguments to pass. Defaults to ().
        Returns:
            azureml.core.Run: AzureML Run object
        """
        self._logger.debug(script_params)
        transformed_params = self._complete_datastore(script_params)
        self._logger.debug("Transformed script params")
        self._logger.debug(transformed_params)
        return self._submit(
            dependencies_file,
            project_folder,
            entry_script,
            self.cluster,
            transformed_params,
            node_count,
            process_count_per_node,
            docker_args,
            wait_for_completion,
        )
    def _submit(
        self,
        dependencies_file,
        project_folder,
        entry_script,
        cluster,
        script_params,
        node_count,
        process_count_per_node,
        docker_args,
        wait_for_completion,
    ):
        self._logger.debug(script_params)
        base_image = "mcr.microsoft.com/azureml/base-gpu:openmpi3.1.2-cuda9.0-cudnn7-ubuntu16.04"
        estimator = _create_estimator(
            PyTorch,
            dependencies_file,
            project_folder,
            entry_script,
            cluster,
            script_params,
            base_image,
            node_count=node_count,
            process_count_per_node=process_count_per_node,
            docker_args=docker_args,
        )
        self._logger.debug(estimator.conda_dependencies.__dict__)
        run = self._experiment.submit(estimator)
        if wait_for_completion:
            run.wait_for_completion(show_output=True)
        return run
    def _complete_datastore(self, script_params):
        def _replace(value):
            if isinstance(value, str) and "{datastore}" in value:
                data_path = value.replace("{datastore}/", "")
                return self.datastore.path(data_path).as_mount()
            else:
                return value
        return {key: _replace(value) for key, value in script_params.items()}
 def workspace_for_user(
    workspace_name=_WORKSPACE,
    resource_group=_RESOURCE_GROUP,
--- a/{{cookiecutter.project_name}}/tasks.py
+++ b/{{cookiecutter.project_name}}/tasks.py
@ -7,11 +7,14 @@ from config import load_config
 import os
 # Experiment imports {% if cookiecutter.type == "template" or cookiecutter.type == "all"%}
 import tensorflow_experiment # {%- endif -%} Template {% if cookiecutter.type == "benchmark" or cookiecutter.type == "all"%}
-import tensorflow_benchmark # {%- endif -%} Benchmark{% if cookiecutter.type == "imagenet" or cookiecutter.type == "all"%}
+import tensorflow_benchmark # {%- endif -%} Benchmark {% if cookiecutter.type == "imagenet" or cookiecutter.type == "all"%}
 import storage
 import image
 import tfrecords
-import tensorflow_imagenet  # Imagenet {% endif %}
+import tensorflow_imagenet  #  {%- endif -%} Imagenet {% if cookiecutter.type == "pytorch_imagenet" or cookiecutter.type == "all"%}
 import pytorch_imagenet # {%- endif -%} PyTorch Imagenet {% if cookiecutter.type == "imagenet" or cookiecutter.type == "pytorch_imagenet" or cookiecutter.type == "all"%}
 import storage 
 import image #  {%- endif -%} {% if cookiecutter.type == "pytorch_experiment" or cookiecutter.type == "all"%}
 import pytorch_experiment #  {%- endif -%} {% if cookiecutter.type == "pytorch_benchmark" or cookiecutter.type == "all"%}
 import pytorch_benchmark # {% endif %} PyTorch Benchmark 
 from invoke.executor import Executor
 logging.config.fileConfig(os.getenv("LOG_CONFIG", "logging.conf"))
@ -37,7 +40,8 @@ def _prompt_sub_id_selection(c):
    from prompt_toolkit import prompt
    results = c.run(f"az account list", pty=True, hide="out")
-    sub_dict = json.loads(results.stdout)
+    parsestr = "["+results.stdout[1:-7]+"]" #TODO: Figure why this is necessary
    sub_dict = json.loads(parsestr)
    sub_list = [
        {"Index": i, "Name": sub["name"], "id": sub["id"]}
        for i, sub in enumerate(sub_dict)
@ -192,7 +196,7 @@ namespace.add_collection(tf_exp_collection)
 tf_bench_collection = Collection.from_module(tensorflow_benchmark)
 namespace.add_collection(tf_bench_collection)
 #{%- endif -%}
-# Imagenet{% if cookiecutter.type == "imagenet" or cookiecutter.type == "all"%}
+# Imagenet {% if cookiecutter.type == "imagenet" or cookiecutter.type == "all"%}
 storage_collection = Collection.from_module(storage)
 storage_collection.add_collection(Collection.from_module(image))
 storage_collection.add_collection(Collection.from_module(tfrecords))
@ -200,6 +204,21 @@ tf_collection = Collection.from_module(tensorflow_imagenet)
 namespace.add_collection(tf_collection)
 namespace.add_collection(storage_collection) # {% endif %}
 # PyTorch Benchmark {% if cookiecutter.type == "pytorch_benchmark" or cookiecutter.type == "all"%}
 pytorch_bench_collection = Collection.from_module(pytorch_benchmark)
 namespace.add_collection(pytorch_bench_collection)
 #{%- endif %}
 # PyTorch Imagenet {% if cookiecutter.type == "pytorch_imagenet" or cookiecutter.type == "all"%}
 pytorch_imagenet_collection = Collection.from_module(pytorch_imagenet)
 namespace.add_collection(pytorch_imagenet_collection)
 #{%- endif %}
 # PyTorch Experiment {% if cookiecutter.type == "pytorch_experiment" or cookiecutter.type == "all"%}
 pytorch_exp_collection = Collection.from_module(pytorch_experiment)
 namespace.add_collection(pytorch_exp_collection)
 #{%- endif %}
 namespace.configure({
    'root_namespace': namespace,
    'invoke_execute': invoke_execute,