Commit graph

57 commits

Author SHA1 Message Date
Peter Hessey 9083966673
ENH: Improve regression tests (#827)
Closes #740. Updates the lung model regression test to use the latest
parameters and train for a substantial number of steps to ensure
training is progressing as expected. A small number of epochs and a
smaller data subset are used because a full training run isn't
feasible. The new test runs in < 30 minutes but on real data.

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

Bypassing branch protections for the failing PyTest run: the failing test is `Tests/SSL/test_ssl_containers.py::test_innereye_ssl_container_cifar10_resnet_simclr`, because the URL for CIFAR currently seems to be broken (returning a 500 server error).
2022-11-16 15:05:46 +00:00
Javier 3c919dca14
Replace data quality folder with a link to the commit (#692)
* Move to commit link

* Remove build

* Update paper link

* Update README

* Add changelog
2022-03-10 09:14:25 +00:00
Javier 1606729c7a
Clean up legacy code (#671)
* Remove rnns

* Fix flake8

* Edit README

* Edit README

* Remove sequence

* Remove sequence

* Fix all

* Remove more

* Remove ignore

* Fix tests

* Undo config

* Fix config

* Revert pycharm

* Fix tests

* Undo outputlogger

* Fix flake8

* Fix ignore file

* Revert hi-ml

* Disable fail on alert
2022-03-09 14:53:12 +00:00
maxilse 1600ef3ddf
Fix DeepMIL for TCGA CRCK dataset (#659)
While we updated DeepMIL for the Panda dataset to work with the latest changes, we did not update DeepMIL for the TCGA CRCK dataset.

This PR updates how the caching of the encoded tiles is done and how the checkpoints of the DeepMIL model are saved and loaded.

No additional tests are required since these are the same functions that we use for the Panda dataset. For all of them a test already exists.

Last, the PR updates the cudatoolkit version. Anton and I found that this is the root cause of all our problems with DDP.
2022-02-16 09:50:59 +00:00
Anton Schwaighofer d617c8107c
Re-enable Linux build (#655)
Also refactored tests that caused test discovery to slow down
2022-02-04 15:41:42 +00:00
Anton Schwaighofer a6b15166b7
Fix for stuck Linux build: Move pytest to Windows (#652)
Also renamed many build legs so that they can be found more easily in the UI.
2022-02-04 11:20:03 +00:00
Anton Schwaighofer 4fdeed23e4
Workaround for bug in PL 1.5.5: CombinedLoader cannot be used with DDP for training data (#646)
Fix for https://github.com/PyTorchLightning/pytorch-lightning/issues/11632
2022-02-01 14:25:21 +00:00
Anton Schwaighofer 61d9cab5cc
Cancel queued AzureML jobs when starting a PR build (#640)
AzureML jobs from failed previous PR builds do not get cancelled, consuming excessive resources. Now kill all queued and running jobs before starting new ones.
2022-01-25 09:07:18 +00:00
Anton Schwaighofer e477c9dd83
Upgrade to Pytorch Lightning 1.5.5 (#591)
2021-12-15 10:48:35 +00:00
Anton Schwaighofer 212e65c4a3
Adding pipeline cache for conda environment (#618)
Add cache that re-uses the full Conda environment
2021-12-14 12:49:20 +00:00
Daniel Coelho de Castro f5b7298c57
Add `--pl_deterministic` to build training jobs (#605)
2021-12-06 19:28:23 +00:00
Daniel Coelho de Castro 490e5bc900
Add cudatoolkit=11.1 to environment.yml (#596)
2021-12-03 16:24:26 +00:00
melanibe 94553a5c0b
Adding Active Label Cleaning code (#559)
* initial commit

* updating the build

* flake8

* update main page

* add the links

* try to fix the env

* update build, gitignore and remove duplicate license

* update gitignore again

* Adding to changelog

* conda activate

* update again

* wrong instruction

* add data quality

* rephrase

* first pass on Readme.md

* switch from our to the, and clarify the cxr datasets

* move content to a separate markdown file

* move additional content to config readme file

* finish updating dataquality readme

* Rename

* pr comment

* todos

* changed default dir for cifar10 dataset

Co-authored-by: Ozan Oktay <ozan.oktay@microsoft.com>
2021-09-21 10:22:03 +01:00
Anton Schwaighofer b35399fb84
Moving InnerEye's Azure code to hi-ml package (#548)
Moving InnerEye to use the new HI-ML package. 

See Issue 62 on the HI-ML package
2021-08-26 09:17:09 +01:00
Anton Schwaighofer 64646c5106
Update to fix issues in daily build (#545)
Daily builds of segmentation models fail with missing files for the validation set.
2021-07-19 17:45:45 +01:00
Anton Schwaighofer 53999877d0
Enable a disabled test (#536)
Enabling it required changing several stored result files because of the PL upgrade that happened in between.
2021-07-15 20:54:31 +01:00
Shruthi42 9fcc08f6cd
Run inference using checkpoints from registered models (#509)
2021-07-15 14:31:15 +00:00
Tim Regan 732ddfcb34
Dropping windows tests, but keeping cred-scan (#542)
Removing the Windows branch of our testing as we have been experiencing intermittent random failures on Windows, which could be the result of a change in the image on the test machines.

Closes #541
2021-07-15 12:30:44 +00:00
Jonathan Tripp cab68ccc61
Split validation and test infer config (#502)
Split validation, test, ensemble inference flags
2021-07-05 16:25:49 +01:00
Jonathan Tripp 43d31ce413
Comment out glaucoma job (#520)
Comment out glaucoma job
2021-07-05 14:20:29 +00:00
Anton Schwaighofer 01c31ed0e5
Regression test coverage for AzureML runs (#492)
- Enable regression tests on text and binary files, that are either produced by the job or uploaded to the run context
- Adding a large set of these regression test files to all models in PR builds
2021-06-17 20:37:57 +00:00
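The regression check on text and binary files described in #492 can be pictured roughly as below. This is only a sketch under stated assumptions, not the code from the PR: the function name `compare_folders`, the set of suffixes treated as text, and the line-by-line comparison rule are all hypothetical.

```python
import tempfile
from pathlib import Path
from typing import List

# Assumption: which suffixes count as "text" is made up for this sketch.
TEXT_SUFFIXES = {".txt", ".csv", ".json"}


def compare_folders(expected: Path, actual: Path) -> List[str]:
    """Return human-readable mismatches; an empty list means the check passed."""
    problems = []
    for expected_file in sorted(p for p in expected.rglob("*") if p.is_file()):
        relative = expected_file.relative_to(expected)
        actual_file = actual / relative
        if not actual_file.is_file():
            problems.append(f"missing: {relative}")
        elif expected_file.suffix in TEXT_SUFFIXES:
            # Text files: compare line by line, so line-ending differences are ignored.
            if expected_file.read_text().splitlines() != actual_file.read_text().splitlines():
                problems.append(f"text differs: {relative}")
        elif expected_file.read_bytes() != actual_file.read_bytes():
            # Everything else is treated as binary and must match byte for byte.
            problems.append(f"binary differs: {relative}")
    return problems


with tempfile.TemporaryDirectory() as tmp:
    expected_dir = Path(tmp) / "expected"
    actual_dir = Path(tmp) / "actual"
    expected_dir.mkdir()
    actual_dir.mkdir()
    (expected_dir / "metrics.csv").write_bytes(b"a,b\n1,2\n")
    (actual_dir / "metrics.csv").write_bytes(b"a,b\r\n1,2\r\n")
    (expected_dir / "thumbnail.png").write_bytes(b"\x00\x01")
    (actual_dir / "thumbnail.png").write_bytes(b"\x00\x02")
    problems = compare_folders(expected_dir, actual_dir)
```

Only files present in the expected folder are checked here, so a job may produce extra outputs without failing the comparison; whether the real tests behave that way is not stated in the commit message.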
Anton Schwaighofer 109e5800b1
Fix for SDK regression bug, minor doc updates (#475)
- Downgrade AML SDK to 1.23
- Reduce default number of dataloader workers
- Doc updates
2021-06-04 16:55:58 +00:00
Anton Schwaighofer 8bae42eb92
FastMRI dataset onboarding script and detailed examples (#444)
Add necessary tooling and examples for running fastMRI reconstruction models.
- Script to create and run an Azure Data Factory to download the raw data, and place them into a storage account
- Detailed examples to run the VarNet model from the fastMRI github repo
- Ability to work with fixed mounting points for datasets
2021-05-19 15:58:25 +00:00
Shruthi42 aa09b9db31
Register all models after training, not only Segmentation models. (#455)
This PR changes the codepath so all models trained on AzureML are registered. The codepath previously allowed only segmentation models (subclasses of `SegmentationModelBase`) to be registered. Models are registered after a training run or if the `only_register_model` flag is set. Models may be legacy InnerEye config-based models or may be defined using the LightningContainer class.

The PR also removes the AzureRunner conda environment. The full InnerEye conda environment is needed to submit a training job to AzureML.

It splits the `TrainHelloWorldAndHelloContainer` job in the PR build into two jobs, `TrainHelloWorld` and `TrainHelloContainer`. It adds a pytest marker `after_training_hello_container` for tests that can be run after training is finished in the `TrainHelloContainer` job.

This will solve the issue of model registration in #377 and #398.
2021-05-12 15:03:35 +01:00
Anton Schwaighofer c298155753
Fixing bugs when running container models on multiple GPUs (#445)
- The use_gpu flag for container models was not picked up correctly, always running without GPU
- When running inference for container models with the test_step method, PL would fail when running on >1 GPU
- Adds an extra test to run the HelloContainer model in AzureML
2021-04-23 17:15:37 +01:00
melanibe adffa95a14
Checkpoint recovery refactoring (#439)
* Add auto-restart

* Change handling of checkpoints and clean-up

* Save last k recovery checkpoints

* Log epoch for keeping last ckpt

* Keeping k last checkpoints

* Add possibility to recover from particular checkpoint

* Update tests

* Check k recovery

* Re-add skipif

* Correct pick up of recovery runs and add test

* Correct pick up of recovery runs and add test

* Remove all start epochs

* Remove all start epochs

* Simplify run recovery logic

* Fix it

* Merge conflicts import errors

* Fix it

* Fix tests in test_scalar_model.py

* Fix tests in test_model_util.py

* Fix tests in test_scalar_model.py

* Fix tests in test_model_training.py

* Avoid forcing the user to log epoch

* Fix test_get_checkpoints

* Fix test_checkpoint_handling.py

* Fix callback

* Update CHANGELOG.md

* Self PR review comments

* Fix more tests

* Fix argument in test

* Mypy

* Update InnerEye-DeepLearning.iml

* Update InnerEye-DeepLearning.iml

* Fix mypy errors

* Address PR comment

* Typo

* mypy fix

* just style
2021-04-21 15:40:20 +01:00
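The "keep the last k recovery checkpoints" idea from #439 can be illustrated with a short sketch. It is not the implementation from the PR: the file naming scheme `recovery_epoch=<N>.ckpt`, the function name `prune_recovery_checkpoints`, and the parameter `keep_last` are assumptions made for this example.

```python
import tempfile
from pathlib import Path
from typing import List


def prune_recovery_checkpoints(folder: Path, keep_last: int) -> List[Path]:
    """Delete all but the `keep_last` highest-epoch checkpoints and return the ones kept."""

    def epoch_of(path: Path) -> int:
        # Hypothetical naming scheme: recovery_epoch=<N>.ckpt
        return int(path.stem.split("=")[1])

    # Sort numerically by epoch, so that epoch 10 comes after epoch 2.
    checkpoints = sorted(folder.glob("recovery_epoch=*.ckpt"), key=epoch_of)
    to_delete = checkpoints[:-keep_last] if keep_last > 0 else checkpoints
    for path in to_delete:
        path.unlink()
    return checkpoints[len(to_delete):]


with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    for epoch in (1, 2, 10, 11):
        (folder / f"recovery_epoch={epoch}.ckpt").touch()
    kept = [p.name for p in prune_recovery_checkpoints(folder, keep_last=2)]
    remaining = sorted(p.name for p in folder.iterdir())
```

Sorting on the parsed epoch number rather than on the file name avoids the lexicographic trap where `epoch=10` would sort before `epoch=2`.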
Anton Schwaighofer 0d479ba3d8
Enable Bring-your-own-Lightning-model (#417)
- Enable bringing arbitrary PyTorch-Lightning models to the InnerEye toolbox
- Upgrade mypy and simplify the way we invoke it
2021-04-19 15:28:41 +00:00
melanibe a155946ea4
Fix multi-node bug in PL 1.2.8 (#437)
* Fix the bug in PL

* Add back the test

* Missing import

* CHANGELOG.md

* Fix it

* Only plugin if more than one gpu

* Only plugin if more than one gpu

* Mypy

* Mypy again
2021-04-16 18:54:55 +00:00
melanibe 28404f09d1
Fix cross validation for classification models and update pytorch-lightning (#432)
* Fix cross validation results downloading for classification models

* Fix

* Fix it

* Back to main

* CHANGELOG.md

* try this out

* Add new build step

* Update

* roll back

* push again

* Update build-pr.yml

* Update GlaucomaPublic.py

* Update plot_cross_validation.py

* Changing model import

* Write model files was not working as expected

* Wrong indent

* Fix the aggregation code and add a test

* Update the environment.yml

* Fix it

* Attempt to fix test in build PR

* Update build-pr.yml

* Update GlaucomaPublic.py

* Delete model_paper_glaucoma.py

* Update model_util.py

* Update GlaucomaPublic.py

* Just format

* Add additional tests

* Add additional tests

* Add test for check_count for ensemble

* Style

* Rename to more meaningful

* Adding cross validation fold to test metrics dict except for ensemble

* Only download ensemble to CV if segmentation model

* Add explicit possible labels for tests

* Delete unnecessary files

* Only compute val metrics if this is not ensemble run

* Adapt test

* Improve for segmentation model too

* Update again

* Fix it

* Update PR build

* Update config to avoid clashing import

* Update config to avoid clashing import

* Update config to avoid clashing import

* Flake8

* Update test

* Try out new env

* Try out spawn instead

* Back to main

* Update CHANGELOG.md

* Try out fix mentioned in PL issue

* Roll back weird fix

* Test files to match true structure of cv

* Add new tests to check the CV folder

* Roll back wrong commit

* Flake8

* Flake8

* Update doc PR comment

* Add docstring

* Fallback runs needed to be updated

* Update build-pr

* Update linux test

* Commented out by mistake

* Mypy

* dont need to change mypy

* Update InnerEye/ML/deep_learning_config.py

Co-authored-by: Anton Schwaighofer <antonsc@microsoft.com>

* Update InnerEye/ML/model_training.py

Co-authored-by: Anton Schwaighofer <antonsc@microsoft.com>

* Update azure-pipelines/build-pr.yml

Co-authored-by: Anton Schwaighofer <antonsc@microsoft.com>

* Custom type for complex signature

* Simplify signature for aggregate and create metrics

* Update

* Need to skip train 2 nodes

* Add warning in CHANGELOG.md

* Mark

* Fix multi-node with one gpu

* Update CHANGELOG.md

* Move to 1.2.7

* reformat

* reformat

* linesep

* reformat

* Type declaration beginning PR comment

Co-authored-by: Anton Schwaighofer <antonsc@microsoft.com>
2021-04-16 10:29:06 +00:00
Anton Schwaighofer 821cb3be7a
Move basic PR checks like Flake8, Mypy and HelloWorld to Github actions (#426)
At present, external contributors don't have any insight into why the PR builds fail because they run on ADO. This PR moves some of the basic checks to Github Actions, where they are fully visible: Flake8, mypy, and training the HelloWorld model.
2021-03-29 10:46:13 +05:30
Anton Schwaighofer eb5f931f20
Fix error messages in test coverage reporting (#394)
- Coverage reporting complains that it does not like the HTML output folder.
- Exclude the Tests* folders from the report, so that the overall coverage figures make more sense
2021-02-10 14:29:20 +00:00
Anton Schwaighofer b40d6d13c0
Enable multi-node training (#385)
- Enable training on multiple machines in AzureML
- Exclude private settings files from AzureML snapshot
2021-02-08 14:21:50 +00:00
Anton Schwaighofer bc90c65f0a
Migrate to Pytorch Lightning (#323)
This PR swaps out all of the previously hand-written training routines, and switches them to PyTorch Lightning.
2021-01-28 15:25:53 +00:00
Anton Schwaighofer 3fa74c2a2c
Remove pre-processing of source version message (#356)
Pre-processing of the source version message causes problems when the message contains shell special characters. Remove it and rely on the git package to pick the message up.
2021-01-11 15:50:26 +00:00
Jonathan Tripp c54a7281f9
Check more locations for dataset file and fail if comparison files mi… (#348)
Check more locations for dataset file and fail if model comparison requested but comparison files missing.
2021-01-07 11:02:55 +00:00
Anton Schwaighofer 014c74e34f
Fix ensemble checkpoint download (#326)
* test

* fix test

* download fix

* create separate model folder

* fixing tests

* making HD check better

* Tests

* inverted logic

* registering on parent run

* docu
2020-11-25 16:25:02 +01:00
Anton Schwaighofer 94f675d3ab
Fix for HelloWorld example on fresh checkout (#320)
2020-11-17 08:42:42 +01:00
Anton Schwaighofer 1e86bfd008
Ensure that PR builds fail on any job errors, fix component governance (#317)
- The "TrainViaSubmodule" step presently only fails if the last python call fails. Fix that.
- Component Governance was accidentally disabled in #290
2020-11-16 16:27:10 +00:00
Anton Schwaighofer cd4458e15c
Ensure that models are registered with consistent file structure (#276)
- Make file structure consistent across normal training and training when InnerEye is a submodule
- Add test coverage for the file structure of registered models
- Add documentation on what the model structure looks like
- If multiple Conda files are used in an InnerEye run, they are merged into one environment file for deployment. The complicated merge inside of `run_scoring` could be deprecated in principle, but it is left in place in case we need it for legacy models.
- Add test coverage for `submit_for_inference`: Previous test was using a hardcoded legacy model, meaning that any changes to model structure could have broken the script
- The test for `submit_for_inference` is no longer submitted from the big AzureML run, shortening the runtime of that part of the PR build. Instead, it is triggered after the `TrainViaSubmodule` part of the build. The corresponding AzureML experiment is no longer `model_inference`, but the same experiment as all other AzureML runs.
- The test for `submit_for_inference` was previously running on the expensive `training-nd24` cluster, now on the cheaper `nc12`.
- `submit_for_inference` now correctly uses the `score.py` file that is inside of the model, rather than copying it from the repository root.
2020-11-16 10:05:32 +00:00
Anton Schwaighofer e7a88877c5
Switch more code to using Path (#305)
- Rename the `TestOutputDirectories` class because it is picked up by pytest as something it expects to contain tests
- Switch fields to using `Path`, rather than `str`
2020-11-02 19:49:13 +00:00
Anton Schwaighofer d4b9720c81
Upgrade AzureML SDK, check framework versions (#304)
Ensure that AzureML SDK is recent enough to recognize our PyTorch version.
2020-10-30 15:13:59 +00:00
Anton Schwaighofer 9686f12728
Improve GPU resource monitoring (#296)
- Compute aggregate metrics over the whole training run
- Get allocated and reserved memory
- Store aggregate metrics in AzureML
Note, diagnostic metrics are no longer stored in AzureML. Tensorboard is better for vast amounts of metrics.
2020-10-26 16:44:08 +00:00
Anton Schwaighofer 52f5c77f81
Adding patch sampling diagnostics by default (#290)
Always create thumbnails that show patch sampling behaviour
2020-10-23 11:42:52 +01:00
Shruthi42 cee61026a5
Fix pytests (#274)
- Marks tests as `gpu`, `cpu_and_gpu` or `azureml`. Tests marked `gpu` and `azureml` are not run in the normal test set, only in the AzureML run triggered by the PR builds. Long tests like test_submit_for_inference are no longer run as part of the main set.
- Cleans up pytest.ini
2020-10-09 15:49:15 +05:30
Anton Schwaighofer 7a98d4d62d
Add user alias for notifications, add max run duration (#271)
- Supply user alias and/or email address, so that notifications can be sent
- Add argument for maximum run duration, limit PR build to 1h
2020-10-06 13:50:16 +01:00
Shruthi42 1c9d67bd55
Refactor to separate checkpoint, model and optimizer logic (#259)
- Separates the logic used to determine from what checkpoint/checkpoint path we will recover
- Separates model creation, and model checkpoint loading from optimizer creation and checkpoint loading and keeps all this under class ModelAndInfo.
- Optimizers created after model is moved to GPU - Fixes #198
- Test added to train_via_submodule.yml which continues training from a previous run using run recovery.
2020-10-02 19:40:57 +01:00
Anton Schwaighofer d2f5327c79
Allow private settings files (#245)
Add the capability to not check in the complete `settings.yml` file, and fill in the missing values via the file `InnerEyePrivateSettings.yml` in the repository root.
2020-09-25 19:38:01 +01:00
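The fill-in behaviour described in #245 might look roughly like the sketch below, with both files already parsed into dictionaries. The function name `fill_missing_settings`, the setting names, and the rule that a non-empty checked-in value wins are assumptions for illustration; the commit message does not spell out the exact precedence.

```python
from typing import Any, Dict


def fill_missing_settings(checked_in: Dict[str, Any], private: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of `checked_in` where absent or empty values are taken from `private`."""
    merged = dict(checked_in)
    for key, value in private.items():
        # Only fill gaps: a value already present in the checked-in settings is kept.
        if merged.get(key) in (None, ""):
            merged[key] = value
    return merged


# Hypothetical contents of settings.yml and InnerEyePrivateSettings.yml after parsing.
settings = {"subscription_id": "", "cluster": "my-cluster"}
private_settings = {"subscription_id": "0000-1111", "cluster": "ignored"}
merged_settings = fill_missing_settings(settings, private_settings)
```

With this rule, secrets such as a subscription ID can stay out of version control while shared defaults remain in the checked-in file.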
melanibe a112b399fe
Ignore local dataset argument when running inside AML runs (#238)
This PR modifies mount_or_download_dataset such that we ignore the `local_dataset` argument inside AML runs (only used for local runs).
2020-09-22 16:01:54 +01:00
Anton Schwaighofer 8b0b47941a
Update documentation (#239)
* fix

* move IDs out

* syntax
2020-09-22 14:47:56 +01:00
Anton Schwaighofer 3e8b92d0f1
Shorten the most frequent commandline options, rename settings file (#232)
Rename commandline options: --submit_to_azureml -> --azureml, --is_train -> --train, --gpu_cluster_name -> --cluster
Rename train_variables.yml -> settings.yml
2020-09-21 17:40:05 +01:00
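For scripts that still use the old option names, the renames listed in #232 can be applied mechanically. The helper below is a hypothetical migration aid, not part of InnerEye; only the three old/new option pairs come from the commit message, and `--model=Lung` is just a placeholder for an option that is left untouched.

```python
from typing import Dict, List

# Old name -> new name, as listed in the commit message for #232.
RENAMED_OPTIONS: Dict[str, str] = {
    "--submit_to_azureml": "--azureml",
    "--is_train": "--train",
    "--gpu_cluster_name": "--cluster",
}


def migrate_args(argv: List[str]) -> List[str]:
    """Rewrite old option names to the shortened ones, leaving values and other options alone."""
    migrated = []
    for arg in argv:
        # Handle both "--name=value" and a bare "--name" followed by a separate value.
        name, separator, value = arg.partition("=")
        migrated.append(RENAMED_OPTIONS.get(name, name) + separator + value)
    return migrated


old_command = ["--submit_to_azureml=True", "--gpu_cluster_name", "nc12", "--model=Lung"]
new_command = migrate_args(old_command)
```

The rename of `train_variables.yml` to `settings.yml` is a file rename and is not covered by this helper.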