Bug fix: multi-GPU jobs on a VM use wrong folders (#622)

Co-authored-by: Shruthi42 <13177030+Shruthi42@users.noreply.github.com>
2021-12-15 17:25:42 +00:00 · 2021-12-15 17:25:42 +00:00 · 276e0f5253
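
The rank-dependent folder choice made in `deep_learning_config.py` can be sketched roughly as below. This is a minimal illustration, not the repository code: `resolve_run_folder` and its `is_rank_zero` argument are hypothetical names, the timestamp format is made up, and the real change calls `is_global_rank_zero()` and `create_unique_timestamp_id()` instead.

```python
from datetime import datetime, timezone
from pathlib import Path


def resolve_run_folder(root: Path, model_name: str, is_rank_zero: bool) -> Path:
    """Hypothetical helper: choose the folder that a training process writes to."""
    if is_rank_zero:
        # Only the rank 0 process derives a new timestamped folder name.
        # The timestamp format here is illustrative only.
        timestamp = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H%M%SZ")
        return root / f"{timestamp}_{model_name}"
    # Assumption taken from the code comment in this change: child processes are
    # started with their working directory set to the rank 0 output folder,
    # so they reuse it instead of creating a second timestamped folder.
    return Path.cwd().absolute()
```

Passing the rank check in as a boolean keeps the sketch independent of how rank is detected in any particular launcher.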
--- a/.idea/InnerEye-DeepLearning.iml
+++ b/.idea/InnerEye-DeepLearning.iml
@@ -3,10 +3,10 @@
  <component name="NewModuleRootManager">
    <content url="file://$MODULE_DIR$">
      <sourceFolder url="file://$MODULE_DIR$" isTestSource="false" />
-      <sourceFolder url="file://$MODULE_DIR$/fastMRI" isTestSource="false" />
      <sourceFolder url="file://$MODULE_DIR$/hi-ml/hi-ml-azure/src" isTestSource="false" />
      <sourceFolder url="file://$MODULE_DIR$/hi-ml/hi-ml/src" isTestSource="false" />
      <excludeFolder url="file://$MODULE_DIR$/InnerEye-DataQuality" />
+      <excludeFolder url="file://$MODULE_DIR$/fastMRI" />
    </content>
    <orderEntry type="jdk" jdkName="3.7 @ Ubuntu-20.04" jdkType="Python SDK" />
    <orderEntry type="sourceFolder" forTests="false" />
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -96,6 +96,7 @@ in inference-only runs when using lightning containers.
 - ([#553](https://github.com/microsoft/InnerEye-DeepLearning/pull/553)) Fix incomplete test data module setup in Lightning inference.
 - ([#557](https://github.com/microsoft/InnerEye-DeepLearning/pull/557)) Fix issue where learning rate was not set
  correctly in the SimCLR module
+- ([#622](https://github.com/microsoft/InnerEye-DeepLearning/pull/622)) Fix issue with multi-GPU jobs on a VM where each process tried to create its own output folder structure.
 - ([#558](https://github.com/microsoft/InnerEye-DeepLearning/pull/558)) Fix issue with the CovidModel config where model
  weights from a finetuning run were incompatible with the model architecture created for non-finetuning runs.
 - ([#604](https://github.com/microsoft/InnerEye-DeepLearning/pull/604)) Fix issue where runs on a VM would download the dataset even when a local dataset is provided.
--- a/InnerEye/ML/deep_learning_config.py
+++ b/InnerEye/ML/deep_learning_config.py
@@ -22,6 +22,7 @@ from InnerEye.Common.type_annotations import PathOrString, T, TupleFloat2
 from InnerEye.ML.common import CHECKPOINT_FOLDER, DATASET_CSV_FILE_NAME, \
    ModelExecutionMode, VISUALIZATION_FOLDER, \
    create_unique_timestamp_id, get_best_checkpoint_path
+from health_azure.utils import is_global_rank_zero


@unique
@@ -135,8 +136,14 @@ class DeepLearningFileSystemConfig(Parameterized):
            else:
                logging.info("All results will be written to a subfolder of the project root folder.")
                root = project_root.absolute() / DEFAULT_AML_UPLOAD_DIR
-            timestamp = create_unique_timestamp_id()
-            run_folder = root / f"{timestamp}_{model_name}"
+            if is_global_rank_zero():
+                timestamp = create_unique_timestamp_id()
+                run_folder = root / f"{timestamp}_{model_name}"
+            else:
+                # Handle the case where there are multiple DDP processes on the same machine outside AML.
+                # Each child process will be started with the current working directory set to the output
+                # folder of the rank 0 process. We want all other processes to write to that same folder.
+                run_folder = Path.cwd().absolute()
            outputs_folder = run_folder
            logs_folder = run_folder / DEFAULT_LOGS_DIR_NAME
        else: