Commit graph

18 commits

Author SHA1 Message Date
Shruthi42 aa09b9db31
Register all models after training, not only Segmentation models. (#455)
This PR changes the codepath so all models trained on AzureML are registered. The codepath previously allowed only segmentation models (subclasses of `SegmentationModelBase`) to be registered. Models are registered after a training run or if the `only_register_model` flag is set. Models may be legacy InnerEye config-based models or may be defined using the LightningContainer class.

The PR also removes the AzureRunner conda environment. The full InnerEye conda environment is needed to submit a training job to AzureML.

It splits the `TrainHelloWorldAndHelloContainer` job in the PR build into two jobs, `TrainHelloWorld` and `TrainHelloContainer`. It adds a pytest marker `after_training_hello_container` for tests that can be run after training is finished in the `TrainHelloContainer` job.

This will solve the issue of model registration in #377 and #398.
2021-05-12 15:03:35 +01:00
melanibe adffa95a14
Checkpoint recovery refactoring (#439)
* Add auto-restart

* Change handling of checkpoints and clean-up

* Save last k recovery checkpoints

* Log epoch for keeping last ckpt

* Keeping k last checkpoints

* Add possibility to recover from particular checkpoint

* Update tests

* Check k recovery

* Re-add skipif

* Correct pick up of recovery runs and add test

* Correct pick up of recovery runs and add test

* Remove all start epochs

* Remove all start epochs

* Simplify run recovery logic

* Fix it

* Merge conflicts import errors

* Fix it

* Fix tests in test_scalar_model.py

* Fix tests in test_model_util.py

* Fix tests in test_scalar_model.py

* Fix tests in test_model_training.py

* Avoid forcing the user to log epoch

* Fix test_get_checkpoints

* Fix test_checkpoint_handling.py

* Fix callback

* Update CHANGELOG.md

* Self PR review comments

* Fix more tests

* Fix argument in test

* Mypy

* Update InnerEye-DeepLearning.iml

* Update InnerEye-DeepLearning.iml

* Fix mypy errors

* Address PR comment

* Typo

* mypy fix

* just style
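
The "keep the last k recovery checkpoints" idea from this commit can be illustrated with a minimal sketch. It is not the code from the PR: the file naming scheme (`recovery_epoch=<N>.ckpt`) and the function name are assumptions made for illustration only.

```python
import re
from pathlib import Path
from typing import List

# Hypothetical file naming scheme, not taken from the repository.
RECOVERY_PATTERN = re.compile(r"recovery_epoch=(\d+)\.ckpt$")


def keep_last_k_recovery_checkpoints(folder: Path, k: int) -> List[Path]:
    """Delete all but the k most recent recovery checkpoints.

    Returns the files that were kept, newest first.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    found = []
    for path in folder.iterdir():
        match = RECOVERY_PATTERN.search(path.name)
        if match:
            found.append((int(match.group(1)), path))
    # Order by the epoch number parsed from the file name, newest first.
    found.sort(key=lambda item: item[0], reverse=True)
    for _, path in found[k:]:
        path.unlink()
    return [path for _, path in found[:k]]
```

Sorting on the parsed epoch number rather than on the file name keeps epoch 10 ahead of epoch 9, which a plain string sort would get wrong.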
2021-04-21 15:40:20 +01:00
Anton Schwaighofer 3fa74c2a2c
Remove pre-processing of source version message (#356)
Pre-processing of the source version message causes problems when the message contains shell special characters. Remove the pre-processing and rely on the git package to pick the message up.
2021-01-11 15:50:26 +00:00
Jonathan Tripp c54a7281f9
Check more locations for dataset file and fail if comparison files mi… (#348)
Check more locations for dataset file and fail if model comparison requested but comparison files missing.
2021-01-07 11:02:55 +00:00
Anton Schwaighofer 1e86bfd008
Ensure that PR builds fail on any job errors, fix component governance (#317)
- The "TrainViaSubmodule" step presently only fails if the last python call fails. Fix that.
- Component Governance was accidentally disabled in #290
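
The failure mode in the first bullet (a multi-command step that reports only the exit code of its last command) can be avoided by checking every exit code. The sketch below shows the idea in Python; it is an illustration, not the pipeline change made in this PR, and `run_all_or_fail` is a hypothetical name.

```python
import subprocess
import sys
from typing import List, Sequence


def run_all_or_fail(commands: Sequence[List[str]]) -> None:
    """Run each command in order and stop at the first non-zero exit code."""
    for command in commands:
        # check=True raises CalledProcessError as soon as a command fails,
        # so a later successful command cannot mask an earlier failure.
        subprocess.run(command, check=True)


if __name__ == "__main__":
    run_all_or_fail([
        [sys.executable, "-c", "print('first step')"],
        [sys.executable, "-c", "print('second step')"],
    ])
```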
2020-11-16 16:27:10 +00:00
Anton Schwaighofer cd4458e15c
Ensure that models are registered with consistent file structure (#276)
- Make file structure consistent across normal training and training when InnerEye is a submodule
- Add test coverage for the file structure of registered models
- Add documentation on what the model structure looks like
- If multiple Conda files are used in an InnerEye run, they are merged into one environment file for deployment. The complicated merge inside of `run_scoring` could be deprecated in principle, but it is left in place in case it is needed for legacy models.
- Add test coverage for `submit_for_inference`: Previous test was using a hardcoded legacy model, meaning that any changes to model structure could have broken the script
- The test for `submit_for_inference` is no longer submitted from the big AzureML run, shortening the runtime of that part of the PR build. Instead, it is triggered after the `TrainViaSubmodule` part of the build. The corresponding AzureML experiment is no longer `model_inference`, but the same experiment as all other AzureML runs.
- The test for `submit_for_inference` was previously running on the expensive `training-nd24` cluster, now on the cheaper `nc12`.
- `submit_for_inference` now correctly uses the `score.py` file that is inside of the model, rather than copying it from the repository root.
2020-11-16 10:05:32 +00:00
Anton Schwaighofer e7a88877c5
Switch more code to using Path (#305)
- Rename the `TestOutputDirectories` class because it is picked up by pytest as something it expects to contain tests
- Switch fields to using `Path`, rather than `str`
2020-11-02 19:49:13 +00:00
Anton Schwaighofer 9686f12728
Improve GPU resource monitoring (#296)
- Compute aggregate metrics over the whole training run
- Get allocated and reserved memory
- Store aggregate metrics in AzureML
Note, diagnostic metrics are no longer stored in AzureML. Tensorboard is better for vast amounts of metrics.
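
A minimal sketch of aggregating resource metrics over a whole run follows. The class and metric names are hypothetical and the PR's actual implementation is not shown in this log. In PyTorch, values like these can be read with `torch.cuda.memory_allocated()` and `torch.cuda.memory_reserved()`; whether the PR uses exactly those calls is an assumption.

```python
from collections import defaultdict
from typing import Dict, List


class MetricAggregator:
    """Collects per-sample metrics and reports the max and mean over the run."""

    def __init__(self) -> None:
        self._samples: Dict[str, List[float]] = defaultdict(list)

    def update(self, sample: Dict[str, float]) -> None:
        # Store one reading per metric, for example taken every few seconds.
        for name, value in sample.items():
            self._samples[name].append(value)

    def summary(self) -> Dict[str, float]:
        # Reduce all readings to a handful of aggregate values per metric.
        result: Dict[str, float] = {}
        for name, values in self._samples.items():
            result[f"max_{name}"] = max(values)
            result[f"mean_{name}"] = sum(values) / len(values)
        return result
```

Only the small dictionary returned by `summary()` would then need to be stored with the run, while the full time series stays in Tensorboard.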
2020-10-26 16:44:08 +00:00
Anton Schwaighofer 7a98d4d62d
Add user alias for notifications, add max run duration (#271)
- Supply user alias and/or email address, so that notifications can be sent
- Add argument for maximum run duration, limit PR build to 1h
2020-10-06 13:50:16 +01:00
Shruthi42 1c9d67bd55
Refactor to separate checkpoint, model and optimizer logic (#259)
- Separates the logic used to determine from what checkpoint/checkpoint path we will recover
- Separates model creation and model checkpoint loading from optimizer creation and optimizer checkpoint loading, and keeps all of this in the class `ModelAndInfo`.
- Optimizers are now created after the model is moved to the GPU. Fixes #198
- Test added to train_via_submodule.yml which continues training from a previous run using run recovery.
2020-10-02 19:40:57 +01:00
Anton Schwaighofer d2f5327c79
Allow private settings files (#245)
Add the capability to not check in the complete `settings.yml` file, and to fill in the missing settings from the file `InnerEyePrivateSettings.yml` in the repository root.
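
The fill-in behaviour can be sketched as a dictionary merge. This is an illustration under stated assumptions, not the repository's code: the function name and the setting keys are hypothetical, both files are assumed to have been parsed into dictionaries already, and a setting counts as "missing" here when it is absent or empty.

```python
from typing import Any, Dict


def fill_missing_settings(public: Dict[str, Any], private: Dict[str, Any]) -> Dict[str, Any]:
    """Return the public settings with gaps filled from the private settings."""
    merged = dict(public)
    for key, value in private.items():
        # Only fill entries that are absent or empty in the checked-in file;
        # values that are already set are left alone.
        if merged.get(key) in (None, ""):
            merged[key] = value
    return merged


# Hypothetical keys, for illustration only.
public_settings = {"cluster": "my_cluster", "subscription_id": ""}
private_settings = {"subscription_id": "1234", "cluster": "other_cluster"}
print(fill_missing_settings(public_settings, private_settings))
```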
2020-09-25 19:38:01 +01:00
Anton Schwaighofer 8b0b47941a
Update documentation (#239)
* fix

* move IDs out

* syntax
2020-09-22 14:47:56 +01:00
Anton Schwaighofer 3e8b92d0f1
Shorten the most frequent commandline options, rename settings file (#232)
Rename commandline options: --submit_to_azureml -> --azureml, --is_train -> --train, --gpu_cluster_name -> --cluster
Rename train_variables.yml -> settings.yml
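
For scripts that still use the old option names, the renames listed above can be applied mechanically. The helper below is a hypothetical sketch, not part of the repository; only the three option pairs come from this commit message, and the `--model` option in the example is illustrative.

```python
from typing import Dict, List

# Old option name -> new option name, as listed in the commit message.
RENAMED_OPTIONS: Dict[str, str] = {
    "--submit_to_azureml": "--azureml",
    "--is_train": "--train",
    "--gpu_cluster_name": "--cluster",
}


def modernize_args(args: List[str]) -> List[str]:
    """Rewrite old option names to the new ones, leaving everything else as is."""
    result = []
    for arg in args:
        # Handle both "--option value" and "--option=value" spellings.
        name, separator, value = arg.partition("=")
        result.append(RENAMED_OPTIONS.get(name, name) + separator + value)
    return result


print(modernize_args(["--submit_to_azureml=True", "--gpu_cluster_name", "my_cluster"]))
```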
2020-09-21 17:40:05 +01:00
Anton Schwaighofer b654c23e0c
Reduce logging noise (#222)
* PR builds throw repeated " mkl-service + Intel(R) MKL: MKL_THREADING_LAYER=INTEL is incompatible with libgomp.so.1 library.". Known issue with MKL, as per pytorch/pytorch#37377
* Avoid logging noise from Urllib
* Typos and documentation fixes
2020-09-16 15:31:48 +01:00
Anton Schwaighofer 95085f2d5e
Read git-related information via gitpython (#193)
Most git-related information is presently expected in commandline arguments that are populated in pipelines. Change that to read the information via gitpython, so that the branch information is also correct in runs started from a user's local machine. This fixes #151
Improve documentation around queuing training runs and run recovery. Write run recovery ID to a file for later use in pipelines.
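
Reading this kind of information with GitPython might look like the sketch below. The function names and the returned keys are hypothetical; the GitPython attributes used (`Repo.active_branch`, `Repo.head.commit`, `Repo.is_dirty()`) are part of its public API, but this is not the code from the PR.

```python
from typing import Any, Dict


def describe_repository(repo: Any) -> Dict[str, Any]:
    """Collect branch, commit and dirty state from a GitPython-style Repo object."""
    try:
        branch = repo.active_branch.name
    except TypeError:
        # GitPython raises TypeError when HEAD is detached (no active branch).
        branch = None
    commit = repo.head.commit
    return {
        "branch": branch,
        "commit_id": commit.hexsha,
        "commit_message": commit.message.strip(),
        "is_dirty": repo.is_dirty(),
    }


def read_git_information(path: str = ".") -> Dict[str, Any]:
    # Imported lazily so that describe_repository works without GitPython installed.
    import git  # provided by the `gitpython` package

    return describe_repository(git.Repo(path, search_parent_directories=True))
```

Because the information comes from the repository itself, the same call works in a build pipeline and on a local machine without extra commandline arguments.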
2020-09-04 14:16:03 +01:00
David Carter a372f49e13
Sanitize source version message (#182)
* Sanitize source version message

* Sanitize source version message

* Syntax

* Allow longer build source message

Co-authored-by: Shruthi42 <13177030+Shruthi42@users.noreply.github.com>
2020-08-28 16:08:14 +01:00
David Carter 1136e23352
Improve mypy_runner.py (#171)
This PR reworks mypy_runner.py both to ensure all files are checked, and to speed up the process (from about 3 minutes to about 12 seconds in the PR build). Rather than processing one file at a time, mypy is called repeatedly with "--verbose" set, and the logs are (silently) checked to see if files have been visited. Visited files are excluded from the set to be checked, and mypy is invoked again on the remaining ones until there are none (or until no further files are visited - though this should not and does not seem to happen).

Care is taken to ensure that this script can also be called when this repo is present as a submodule (assumed to be called innereye-deeplearning as usual). When this is the case, we do not check the files inside the submodule, as we assume they have already been checked as part of the build process here.

It is also now possible to provide the script with a specific list of files to check, by supplying them on the command line.

Running this new version turned up a couple of previously undetected type issues, which are also fixed here.
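
The iteration described in this commit message can be sketched as follows. This is not the code of `mypy_runner.py`: the checker is passed in as a callable (`run_checker`, a hypothetical name) that returns the files it visited and any problems it found, so the sketch makes no assumption about the format of mypy's verbose log.

```python
from typing import Callable, Iterable, List, Set, Tuple

# Hypothetical interface: run the checker on some files and return
# (files reported as visited, problem lines found).
Runner = Callable[[List[str]], Tuple[Set[str], List[str]]]


def check_all_files(files: Iterable[str], run_checker: Runner) -> Tuple[List[str], Set[str]]:
    """Invoke the checker repeatedly until every file has been visited.

    Returns the collected problems and any files that were never visited.
    """
    remaining = set(files)
    problems: List[str] = []
    while remaining:
        visited, new_problems = run_checker(sorted(remaining))
        problems.extend(new_problems)
        newly_visited = remaining & visited
        if not newly_visited:
            # No progress was made: stop rather than loop forever.
            break
        remaining -= newly_visited
    return problems, remaining
```

Passing many files per invocation is what makes this faster than one invocation per file, and the no-progress check corresponds to the "until no further files are visited" condition in the description.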
2020-08-14 15:41:44 +01:00
Shruthi42 d6a3d73ccf
Add source code
2020-07-29 00:30:35 +05:30