Commit Graph

55 Commits

Author SHA1 Message Date
Melissa Bristow 8cf63c8e7a
BUG: Don't update multi-node env vars for single node training (#796)
Update environment variables for multi-node jobs
2022-09-02 16:22:14 +00:00
Peter Hessey 59214c268e
DOC: Add all `InnerEye/ML` docstrings to ReadTheDocs (#783)
* 📝 Create basic for ML API

* 📝 Add ML/configs base doc files

* 📝 Finish ML/configs API

* 📝 Update augmentations

* 📝 Add ML/dataset API docs

* 📝 Add rst skeleton for ML/models

* 📝 Fix docstring missing newlines

* Remove script

* 📝 Finish ML/models API docs

* 📝 Start ML/SSL API. Fix some formatting issues

* 📝 Correct whitespace issues in `:param`

* 📝 Fix whitespace errors on `:return` statements

* 📝 Fix :return: statements

* 📝 Finish ML/SSL API

* 📝 Add ML/utils API docs

* 📝 Add visualizer docs, fix `:raise` indents

* 📝 Fix more issues with the `:raises:` formatting

* ♻️ Restructuring folders

* 📝 Limit API `toctree` depth

* 📝 Add primary InnerEye/ML files API to docs

* 📝 Fix and add `InnerEye/ML/*.py` docs

* ⚰️ Remove weird `settings.json` change

* ♻️ 💡 Address review comments
2022-08-16 08:58:38 +00:00
Peter Hessey 92d94799f2
ENH: Add Environment Locking (#735)
* Add environment locking script

* 📝 Finish script, add documentation

* 🐛 Change Windows env file in workflow

* 📝 🐛 Add review changes + fixes

* 🚧 Temporarily alter tests and conda channels

* 🚧 Add logging

* Fix TestSubmodule env file

* 🔥 Delete env test

* 🧑‍💻 Add warning to environment.yml

* 📝 ⚰️ Update based on review comments

* 📝 Add final changes
2022-06-01 10:05:54 +00:00
Javier 1606729c7a
Clean up legacy code (#671)
* Remove rnns

* Fix flake8

* Edit README

* Edit README

* Remove sequence

* Remove sequence

* Fix all

* Remove more

* Remove ignore

* Fix tests

* Undo config

* Fix config

* Revert pycharm

* Fix tests

* Undo outputlogger

* Fix flake8

* Fix ignore file

* Revert hi-ml

* Disable fail on alert
2022-03-09 14:53:12 +00:00
Anton Schwaighofer ccb53d01ad
Improve recovery of preempted jobs (#633)
Autosaving checkpoints by default every 1 epoch to a fixed file name. Retiring the "top k" recovery checkpoint notion because that was tied to specific models that needed more than 1 checkpoint.
2022-01-17 12:05:39 +00:00
vale-salvatelli 4aa84b9f36
Vsalva/deepmil panda (#619)
* adding deepmilpanda container
2021-12-14 19:45:07 +00:00
Anton Schwaighofer cd0c685eef
Bug fix: When using local folders, datasets are downloaded nevertheless (#604)
2021-12-07 17:45:14 +00:00
Anton Schwaighofer cf3145f029
Fix for invalid path_on_compute problem (#593)
Empty string as mount_point is turned into "." by hi-ml 0.1.11, which fails the jobs
2021-11-18 08:49:55 +00:00
Anton Schwaighofer 3075b447aa
Adjust to new namespaces (#572)
Adjusting for new version hi-ml-azure==0.1.9
2021-10-20 19:40:20 +01:00
Anton Schwaighofer b35399fb84
Moving InnerEye's Azure code to hi-ml package (#548)
Moving InnerEye to use the new HI-ML package. 

See Issue 62 on the HI-ML package
2021-08-26 09:17:09 +01:00
vale-salvatelli c946c689aa
Environment and hello_world_model documentation updated (#546)
* Some additions and fixes to the docs describing how to set up the environment. The new version should be a little clearer.
* A small change in the logging info when submitting to AzureML: the workspace name is now reported
2021-07-27 09:28:17 +01:00
Anton Schwaighofer 53999877d0
Enable a disabled test (#536)
Enabling required changing several stored result files because of the PL upgrade that happened in between.
2021-07-15 20:54:31 +01:00
Shruthi42 9fcc08f6cd
Run inference using checkpoints from registered models (#509)
2021-07-15 14:31:15 +00:00
Anton Schwaighofer ae82089d79
Fixes for mounting and matplotlib problems (#515)
2021-07-15 11:41:51 +01:00
Jonathan Tripp cab68ccc61
Split validation and test infer config (#502)
Split validation, test, ensemble inference flags
2021-07-05 16:25:49 +01:00
Anton Schwaighofer 7cd7e58899
Fix timeouts when downloading multiple checkpoint files (#498)
Downloading multiple checkpoints uses a codepath that has a fixed 120sec timeout. Instead, use multiple individual download operations.
2021-06-22 13:33:33 +00:00
Anton Schwaighofer 9749954923
Fix for stuck test set inference for LightningContainer models (#494)
Jobs got stuck if they used PyTorch Lightning built-in metrics objects, which were then trying to synchronize across GPUs. Resolved by shutting down torch.distributed
2021-06-17 17:34:26 +01:00
Anton Schwaighofer 8bae42eb92
FastMRI dataset onboarding script and detailed examples (#444)
Add necessary tooling and examples for running fastMRI reconstruction models.
- Script to create and run an Azure Data Factory to download the raw data, and place them into a storage account
- Detailed examples to run the VarNet model from the fastMRI github repo
- Ability to work with fixed mounting points for datasets
2021-05-19 15:58:25 +00:00
melanibe 7b5b414d02
Add self-supervised learning capabilities to InnerEye (#440)
2021-05-07 13:27:38 +00:00
Anton Schwaighofer 0d479ba3d8
Enable Bring-your-own-Lightning-model (#417)
- Enable bringing arbitrary PyTorch-Lightning models to the InnerEye toolbox
- Upgrade mypy and simplify the way we invoke it
2021-04-19 15:28:41 +00:00
Anton Schwaighofer 88188c29ad
Upgrade to Pytorch 1.8 (#411)
Also updated `pytorch-lightning` to 1.1.8, and the AzureML libraries to 1.23
2021-03-15 13:33:25 +00:00
Anton Schwaighofer b40d6d13c0
Enable multi-node training (#385)
- Enable training on multiple machines in AzureML
- Exclude private settings files from AzureML snapshot
2021-02-08 14:21:50 +00:00
Anton Schwaighofer bc90c65f0a
Migrate to Pytorch Lightning (#323)
This PR swaps out all of the previously hand-written training routines, and switches them to PyTorch Lightning.
2021-01-28 15:25:53 +00:00
Shruthi42 be9323adbe
Remove sub fold cross validation (#357)
Removes the ability to perform sub-fold cross validation. Removes parameters `number_of_cross_validation_splits_per_fold` 
and `cross_validation_sub_fold_split_index` from ScalarModelBase.
2021-01-12 23:04:44 +05:30
Jonathan Tripp c54a7281f9
Check more locations for dataset file and fail if comparison files mi… (#348)
Check more locations for dataset file and fail if model comparison requested but comparison files missing.
2021-01-07 11:02:55 +00:00
Anton Schwaighofer 1a7b64f6c3
Remove more dead arguments (#333)
2020-12-04 11:31:37 +00:00
Javier a2c27e19d7
Remove blobxfer (#330)
* Remove blobxfer

* Update CHANGELOG.md

* Remove configs that are not required

* Remove from environment.yml

* Fix numba issue

* Improve CHANGELOG.md

* Fix tests

* Remove configs that are not required
2020-12-03 10:44:05 +00:00
Anton Schwaighofer 014c74e34f
Fix ensemble checkpoint download (#326)
* test

* fix test

* download fix

* create separate model folder

* fixing tests

* making HD check better

* Tests

* inverted logic

* registering on parent run

* docu
2020-11-25 16:25:02 +01:00
Anton Schwaighofer 94f675d3ab
Fix for HelloWorld example on fresh checkout (#320)
2020-11-17 08:42:42 +01:00
Anton Schwaighofer 1e86bfd008
Ensure that PR builds fail on any job errors, fix component governance (#317)
- The "TrainViaSubmodule" step presently only fails if the last python call fails. Fix that.
- Component Governance was accidentally disabled in #290
2020-11-16 16:27:10 +00:00
Anton Schwaighofer cd4458e15c
Ensure that models are registered with consistent file structure (#276)
- Make file structure consistent across normal training and training when InnerEye is a submodule
- Add test coverage for the file structure of registered models
- Add documentation on what the model structure looks like
- If multiple Conda files are used in an InnerEye run, they are merged into one environment file for deployment. The complicated merge inside of `run_scoring` could be deprecated in principle, but it is left in place in case it is needed for legacy models.
- Add test coverage for `submit_for_inference`: Previous test was using a hardcoded legacy model, meaning that any changes to model structure could have broken the script
- The test for `submit_for_inference` is no longer submitted from the big AzureML run, shortening the runtime of that part of the PR build. Instead, it is triggered after the `TrainViaSubmodule` part of the build. The corresponding AzureML experiment is no longer `model_inference`, but the same experiment as all other AzureML runs.
- The test for `submit_for_inference` was previously running on the expensive `training-nd24` cluster, now on the cheaper `nc12`.
- `submit_for_inference` now correctly uses the `score.py` file that is inside of the model, rather than copying it from the repository root.
2020-11-16 10:05:32 +00:00
Shruthi42 22da1928b8
Load model weights from URL or local checkpoint (#282)
- Adds a parameter `weights_url` to DeepLearningConfig to download model weights from a URL.
- Adds a parameter `local_weights_path` to DeepLearningConfig to initialize model weights from a local checkpoint. This can also be used to perform inference on a checkpoint from a local training run.
- Refactors all checkpoint logic, including recovering from run_recovery into a class CheckpointHandler
- Adds a parameter `epochs_to_test` to DeepLearningConfig which can be used to specify a list of epochs to test in a training/inference run.
- Deprecates DeepLearningConfig parameters `test_diff_epochs`, `test_step_epochs` and `test_start_epoch`.

Closes #178 
Closes #297
2020-11-03 20:17:35 +05:30
Anton Schwaighofer e7a88877c5
Switch more code to using Path (#305)
- Rename the `TestOutputDirectories` class because it is picked up by pytest as something it expects to contain tests
- Switch fields to using `Path`, rather than `str`
2020-11-02 19:49:13 +00:00
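The two changes in #305 can be illustrated with a minimal sketch. The class name `OutputFolderForTests`, its field, and its method are hypothetical and chosen only for illustration. The sketch relies on two facts: pytest's default discovery collects classes whose names start with `Test`, and `Path` objects can be joined with the `/` operator instead of string concatenation.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass
class OutputFolderForTests:
    """Hypothetical helper class. Its name does not start with `Test`,
    so pytest's default class collection does not pick it up."""
    root_dir: Path  # a Path field, where a str would have been used before

    def make_sub_dir(self, name: str) -> Path:
        # Path supports the `/` operator, so no manual string joining is needed.
        sub_dir = self.root_dir / name
        sub_dir.mkdir(parents=True, exist_ok=True)
        return sub_dir
```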
Anton Schwaighofer d4b9720c81
Upgrade AzureML SDK, check framework versions (#304)
Ensure that AzureML SDK is recent enough to recognize our PyTorch version.
2020-10-30 15:13:59 +00:00
Anton Schwaighofer 163ea0ab53
Make AzureML datastore name configurable (#300)
2020-10-27 14:32:51 +00:00
Anton Schwaighofer 9686f12728
Improve GPU resource monitoring (#296)
- Compute aggregate metrics over the whole training run
- Get allocated and reserved memory
- Store aggregate metrics in AzureML
Note, diagnostic metrics are no longer stored in AzureML. Tensorboard is better for vast amounts of metrics.
2020-10-26 16:44:08 +00:00
Anton Schwaighofer 52f5c77f81
Adding patch sampling diagnostics by default (#290)
Always create thumbnails that show patch sampling behaviour
2020-10-23 11:42:52 +01:00
Shruthi42 dd1e452509
Remove hardcoded list of secrets in secrets_handling.py (#279)
In secrets_handling.py, remove the hardcoded list SECRETS_IN_ENVIRONMENT and make the functions take this as a parameter.
2020-10-13 15:18:58 +05:30
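The refactoring described in #279 might look roughly like the sketch below: the list of secret names is passed in by the caller rather than read from a module-level constant. The function name `read_secrets` and its signature are assumptions made for illustration, not the actual contents of `secrets_handling.py`.

```python
import os
from typing import Dict, List, Mapping, Optional


def read_secrets(secrets_to_read: List[str],
                 environ: Optional[Mapping[str, str]] = None) -> Dict[str, Optional[str]]:
    """Hypothetical helper: look up each requested secret name in the environment.
    Names that are missing or set to an empty string map to None."""
    env = os.environ if environ is None else environ
    return {name: env.get(name) or None for name in secrets_to_read}
```

Passing the names as an argument keeps the helper reusable for projects that need a different set of secrets.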
Anton Schwaighofer 7a98d4d62d
Add user alias for notifications, add max run duration (#271)
- Supply user alias and/or email address, so that notifications can be sent
- Add argument for maximum run duration, limit PR build to 1h
2020-10-06 13:50:16 +01:00
Anton Schwaighofer d2f5327c79
Allow private settings files (#245)
Add the capability to not check in the complete `settings.yml` file, and fill in the missing settings via the `InnerEyePrivateSettings.yml` file in the repository root.
2020-09-25 19:38:01 +01:00
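A minimal sketch of the idea behind #245, working on already-parsed dictionaries rather than YAML files. The function name, the example keys, and the precedence rule (private values only fill entries that are absent or empty in the checked-in settings) are assumptions; the real merge logic may differ.

```python
from typing import Any, Dict


def fill_missing_settings(checked_in: Dict[str, Any], private: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical merge: start from the checked-in settings and take a value
    from the private settings only where the checked-in one is absent or empty."""
    merged = dict(checked_in)
    for key, value in private.items():
        if merged.get(key) in (None, ""):
            merged[key] = value
    return merged
```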
Anton Schwaighofer 3e8b92d0f1
Shorten the most frequent commandline options, rename settings file (#232)
Rename commandline options: --submit_to_azureml -> --azureml, --is_train -> --train, --gpu_cluster_name -> --cluster
Rename train_variables.yml -> settings.yml
2020-09-21 17:40:05 +01:00
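For scripts that still use the old option names, the renames listed in #232 can be captured in a small lookup table. The helper `migrate_args` below is a hypothetical illustration and not part of InnerEye; only the old and new option names come from the commit message.

```python
from typing import List

# Old option name -> new option name, as listed in the commit message.
RENAMED_OPTIONS = {
    "--submit_to_azureml": "--azureml",
    "--is_train": "--train",
    "--gpu_cluster_name": "--cluster",
}


def migrate_args(argv: List[str]) -> List[str]:
    """Rewrite old-style option names to the new ones, leaving values
    and unrelated options untouched."""
    migrated = []
    for arg in argv:
        # Split "--name=value" into its parts; plain "--name" yields empty sep and value.
        name, sep, value = arg.partition("=")
        migrated.append(RENAMED_OPTIONS.get(name, name) + sep + value)
    return migrated
```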
Anton Schwaighofer ad77f95d24
Simplify azure_config (#227)
- Remove the use of keyvaults for credentials storage, because the storage account keys are no longer needed in AzureML
- Remove dead config elements from azure_config
- Download datasets via AzureML FileDataset, if credentials for dataset storage account are not present.
- Add an option to specify the experiment name on the command line
2020-09-21 14:31:40 +01:00
Jay Nanavati fe04c14233
Add automatic run monitoring + refactoring + docs (#228)
Automatically monitor new runs if --monitor=True is passed in
Refactor monitor.py
Add documentation on run monitoring
2020-09-21 08:17:00 +01:00
Anton Schwaighofer 92d9d8211e
Automate deployment of Azure resources (#221)
This PR adds an Azure Resource Manager template to create the AzureML workspace and a compute cluster. The documentation has been updated to reflect that.
Also, the "location" argument in azure_config has been retired, because we assume that the workspace is already created.
2020-09-16 20:24:24 +01:00
Jay Nanavati 014433447f
Add ability to perform sub-fold cross validation for scalar models (#190)
Add ability to perform sub-fold cross validation for scalar models
2020-09-16 10:56:24 +01:00
Jay Nanavati 00050e7ec5
Use AML Run API to download run artefacts instead of blobxfer (#219)
This change migrates the current blobxfer usage to download AML run output blobs using the AML Run API.
2020-09-16 09:33:56 +01:00
Anton Schwaighofer 95085f2d5e
Read git-related information via gitpython (#193)
Most git-related information is presently expected in commandline arguments, populated in pipelines. Change that to read via gitpython, so that the branch info is also correct in runs from users' local machines. This fixes #151
Improve documentation around queuing training runs and run recovery. Write run recovery ID to a file for later use in pipelines.
2020-09-04 14:16:03 +01:00
David Carter a36d2668b3
Update most dependencies (#181)
This PR upgrades all packages to their newest versions, except for pytorch (left at 1.3) and torchvision whose version needs to match the pytorch version. The pytorch version may stay at 1.3 until 1.7 is released, as 1.4, 1.5 and 1.6 all present hard-to-fix problems.

* azureml-sdk upgraded from 1.9 to 1.12.
* Removed from environment.yml altogether, as they're either not needed or are pulled in by other packages: jupyter, mock, options.
* Estimator(...) replaced by PyTorch(...) when creating the estimator, as in future we may need to be able to set framework_version.
* Format strings (f"...") without any {variables} inside them replaced by ordinary strings, to keep new version of flake happy.
* Single-letter variable names replaced by multi-letter, for same reason.
* Remove "reorder" workaround for the python-version bug in merging conda dependencies, which is fixed in azureml-sdk 1.12.
* Remove HotFixedTensorBoard as tests pass without it.
* Add comments to lr_scheduler.py to specify changes required for get_[last_]lr when we upgrade pytorch beyond 1.3.
* A few other minor things required by new versions of packages.
2020-08-28 15:04:22 +05:30
David Carter 9662b09370
Rationalize directory structure in plotting and ensemble building (#169)
Reorganize outputs from train and test runs. For details of the new structure, see the changes to building_models.md, which should be read first.

Also:

Pass is_ensemble as an argument to lots of places from the ultimate calling context, so we can be sure when we're running an ensemble model

Add a context manager logging_section, to support named sections in the log, making the logs easier to understand.
Remove baseline-comparison results after downloading and doing the comparison.

For each training and validation epoch, log the total time taken and the portion of that taken for data loading.

Use shortened split names in box plots so X axis labels can be read (there will be more to do in future PRs to improve plot legibility).

Add a switch --create_plots, default True, to plot_cross_validation.py so that by setting it to False, the script can be run (to generate outliers and statistical test results) in environments where graphics is not available.
2020-08-18 08:59:10 +01:00
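A context manager like the `logging_section` mentioned in #169 could be sketched as follows. Only the name comes from the commit message; the banner text and the elapsed-time report are assumptions made for illustration.

```python
import logging
import time
from contextlib import contextmanager
from typing import Iterator


@contextmanager
def logging_section(name: str) -> Iterator[None]:
    """Wrap a block of work in start/finish log messages (hypothetical format)."""
    logging.info("**** STARTING: %s", name)
    start = time.time()
    try:
        yield
    finally:
        # Runs even if the wrapped block raises, so every section is closed in the log.
        logging.info("**** FINISHED: %s after %.2f seconds", name, time.time() - start)
```

Typical usage would be `with logging_section("Model training"): ...`, so that all log lines emitted inside the block appear between the two banners.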
melanibe b23a1d14cf
Change default AML experiment name to avoid reaching AML system limit (#167)
Change the default AML experiment name to username_local_branch_YYYYMM to avoid having too many runs in a single AML experiment.
2020-08-12 14:37:37 +01:00
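The naming scheme from #167 can be sketched in a few lines. The function name is hypothetical, and the sketch assumes that `local_branch` is a literal part of the name and that `YYYYMM` is the current year and month; appending the month means a new experiment is started every month, which bounds the number of runs per experiment.

```python
from datetime import date
from typing import Optional


def default_experiment_name(username: str, today: Optional[date] = None) -> str:
    """Build a name of the form username_local_branch_YYYYMM (assumed format)."""
    today = today or date.today()
    # The format spec %Y%m yields the four-digit year followed by the two-digit month.
    return f"{username}_local_branch_{today:%Y%m}"
```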