Bug fix: multi-GPU jobs on a VM use wrong folders (#622)
Co-authored-by: Shruthi42 <13177030+Shruthi42@users.noreply.github.com>
Parent: e477c9dd83
Commit: 276e0f5253
@@ -3,10 +3,10 @@
  <component name="NewModuleRootManager">
    <content url="file://$MODULE_DIR$">
      <sourceFolder url="file://$MODULE_DIR$" isTestSource="false" />
      <sourceFolder url="file://$MODULE_DIR$/fastMRI" isTestSource="false" />
      <sourceFolder url="file://$MODULE_DIR$/hi-ml/hi-ml-azure/src" isTestSource="false" />
      <sourceFolder url="file://$MODULE_DIR$/hi-ml/hi-ml/src" isTestSource="false" />
      <excludeFolder url="file://$MODULE_DIR$/InnerEye-DataQuality" />
      <excludeFolder url="file://$MODULE_DIR$/fastMRI" />
    </content>
    <orderEntry type="jdk" jdkName="3.7 @ Ubuntu-20.04" jdkType="Python SDK" />
    <orderEntry type="sourceFolder" forTests="false" />

@@ -96,6 +96,7 @@ in inference-only runs when using lightning containers.
 - ([#553](https://github.com/microsoft/InnerEye-DeepLearning/pull/553)) Fix incomplete test data module setup in Lightning inference.
 - ([#557](https://github.com/microsoft/InnerEye-DeepLearning/pull/557)) Fix issue where learning rate was not set
 correctly in the SimCLR module
+- ([#622](https://github.com/microsoft/InnerEye-DeepLearning/pull/622)) Fix issue with multi-GPU jobs on a VM: each process tries to create a folder structure
 - ([#558](https://github.com/microsoft/InnerEye-DeepLearning/pull/558)) Fix issue with the CovidModel config where model
 weights from a finetuning run were incompatible with the model architecture created for non-finetuning runs.
 - ([#604](https://github.com/microsoft/InnerEye-DeepLearning/pull/604)) Fix issue where runs on a VM would download the dataset even when a local dataset is provided.

@@ -22,6 +22,7 @@ from InnerEye.Common.type_annotations import PathOrString, T, TupleFloat2
 from InnerEye.ML.common import CHECKPOINT_FOLDER, DATASET_CSV_FILE_NAME, \
     ModelExecutionMode, VISUALIZATION_FOLDER, \
     create_unique_timestamp_id, get_best_checkpoint_path
+from health_azure.utils import is_global_rank_zero
 
 
 @unique

@@ -135,8 +136,14 @@ class DeepLearningFileSystemConfig(Parameterized):
             else:
                 logging.info("All results will be written to a subfolder of the project root folder.")
                 root = project_root.absolute() / DEFAULT_AML_UPLOAD_DIR
-            timestamp = create_unique_timestamp_id()
-            run_folder = root / f"{timestamp}_{model_name}"
+            if is_global_rank_zero():
+                timestamp = create_unique_timestamp_id()
+                run_folder = root / f"{timestamp}_{model_name}"
+            else:
+                # Handle the case where there are multiple DDP threads on the same machine outside AML.
+                # Each child process will be started with the current working directory set to be the output
+                # folder of the rank 0 process. We want all other process to write to that same folder.
+                run_folder = Path.cwd().absolute()
             outputs_folder = run_folder
             logs_folder = run_folder / DEFAULT_LOGS_DIR_NAME
         else:
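
The rank-dependent folder choice in the last hunk can be sketched in isolation. This is a minimal illustration and not the InnerEye implementation: `is_rank_zero`, `choose_run_folder`, the `LOCAL_RANK` check, and the timestamp format are assumptions made for this example. The actual change calls `is_global_rank_zero` from `health_azure.utils` and `create_unique_timestamp_id`, whose internals are not shown in this diff.

```python
from datetime import datetime, timezone
from pathlib import Path
from typing import Mapping


def is_rank_zero(env: Mapping[str, str]) -> bool:
    # Assumption: the DDP launcher exports LOCAL_RANK for child processes,
    # so an unset variable or the value "0" is treated as rank zero.
    return env.get("LOCAL_RANK", "0") == "0"


def choose_run_folder(root: Path, model_name: str,
                      env: Mapping[str, str], cwd: Path) -> Path:
    """Pick the folder that a training process should write to."""
    if is_rank_zero(env):
        # Only rank zero creates a new timestamped folder name.
        # The timestamp format here is illustrative only.
        timestamp = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H%M%SZ")
        return root / f"{timestamp}_{model_name}"
    # Other ranks are assumed to start with their working directory set to
    # the rank-zero output folder, so they reuse it instead of creating one.
    return cwd.absolute()


if __name__ == "__main__":
    root = Path.cwd() / "outputs"
    rank0_folder = choose_run_folder(root, "MyModel", {}, Path.cwd())
    rank1_folder = choose_run_folder(root, "MyModel", {"LOCAL_RANK": "1"}, rank0_folder)
    print(rank0_folder, rank1_folder)
```

Passing the environment and working directory as arguments keeps the sketch testable; the committed code reads both from the running process.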