Bug fix: multi-GPU jobs on a VM use wrong folders (#622)

Co-authored-by: Shruthi42 <13177030+Shruthi42@users.noreply.github.com>
Anton Schwaighofer 2021-12-15 17:25:42 +00:00 committed by GitHub
Parent e477c9dd83
Commit 276e0f5253
No known key found for this signature
GPG Key ID: 4AEE18F83AFDEB23
3 changed files with 11 additions and 3 deletions


@@ -3,10 +3,10 @@
<component name="NewModuleRootManager">
<content url="file://$MODULE_DIR$">
<sourceFolder url="file://$MODULE_DIR$" isTestSource="false" />
<sourceFolder url="file://$MODULE_DIR$/fastMRI" isTestSource="false" />
<sourceFolder url="file://$MODULE_DIR$/hi-ml/hi-ml-azure/src" isTestSource="false" />
<sourceFolder url="file://$MODULE_DIR$/hi-ml/hi-ml/src" isTestSource="false" />
<excludeFolder url="file://$MODULE_DIR$/InnerEye-DataQuality" />
<excludeFolder url="file://$MODULE_DIR$/fastMRI" />
</content>
<orderEntry type="jdk" jdkName="3.7 @ Ubuntu-20.04" jdkType="Python SDK" />
<orderEntry type="sourceFolder" forTests="false" />


@@ -96,6 +96,7 @@ in inference-only runs when using lightning containers.
 - ([#553](https://github.com/microsoft/InnerEye-DeepLearning/pull/553)) Fix incomplete test data module setup in Lightning inference.
 - ([#557](https://github.com/microsoft/InnerEye-DeepLearning/pull/557)) Fix issue where learning rate was not set
 correctly in the SimCLR module
+- ([#622](https://github.com/microsoft/InnerEye-DeepLearning/pull/622)) Fix issue with multi-GPU jobs on a VM: each process tries to create a folder structure
 - ([#558](https://github.com/microsoft/InnerEye-DeepLearning/pull/558)) Fix issue with the CovidModel config where model
 weights from a finetuning run were incompatible with the model architecture created for non-finetuning runs.
 - ([#604](https://github.com/microsoft/InnerEye-DeepLearning/pull/604)) Fix issue where runs on a VM would download the dataset even when a local dataset is provided.


@@ -22,6 +22,7 @@ from InnerEye.Common.type_annotations import PathOrString, T, TupleFloat2
 from InnerEye.ML.common import CHECKPOINT_FOLDER, DATASET_CSV_FILE_NAME, \
     ModelExecutionMode, VISUALIZATION_FOLDER, \
     create_unique_timestamp_id, get_best_checkpoint_path
+from health_azure.utils import is_global_rank_zero
 
 
 @unique
@@ -135,8 +136,14 @@ class DeepLearningFileSystemConfig(Parameterized):
             else:
                 logging.info("All results will be written to a subfolder of the project root folder.")
                 root = project_root.absolute() / DEFAULT_AML_UPLOAD_DIR
-            timestamp = create_unique_timestamp_id()
-            run_folder = root / f"{timestamp}_{model_name}"
+            if is_global_rank_zero():
+                timestamp = create_unique_timestamp_id()
+                run_folder = root / f"{timestamp}_{model_name}"
+            else:
+                # Handle the case where there are multiple DDP threads on the same machine outside AML.
+                # Each child process will be started with the current working directory set to be the output
+                # folder of the rank 0 process. We want all other processes to write to that same folder.
+                run_folder = Path.cwd().absolute()
             outputs_folder = run_folder
             logs_folder = run_folder / DEFAULT_LOGS_DIR_NAME
         else:
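
Note: the folder-selection logic in the hunk above can be illustrated with a minimal, self-contained sketch. It is not the InnerEye implementation. `is_rank_zero_sketch` is a hypothetical stand-in for `health_azure.utils.is_global_rank_zero` (whose internals are not shown in this commit) and assumes the launcher exposes the process rank in a `RANK` environment variable. `choose_run_folder` and the timestamp format are likewise illustrative assumptions.

```python
import os
from datetime import datetime, timezone
from pathlib import Path


def is_rank_zero_sketch() -> bool:
    """Hypothetical stand-in for is_global_rank_zero.

    Assumption: the rank is exposed via the RANK environment variable,
    and a missing variable means a single-process (rank 0) run.
    """
    return int(os.environ.get("RANK", "0")) == 0


def choose_run_folder(root: Path, model_name: str) -> Path:
    """Pick the folder that a training process should write to."""
    if is_rank_zero_sketch():
        # Only rank 0 creates a fresh, timestamped folder name.
        timestamp = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H%M%SZ")
        return root / f"{timestamp}_{model_name}"
    # Other ranks are assumed to start with their working directory set to
    # the rank 0 output folder, so they reuse it instead of creating a new one.
    return Path.cwd().absolute()
```

With this pattern only one timestamped folder is created per job. Every other process resolves to the folder it was started in, which is the behaviour the code comment in the diff describes.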