Autosave checkpoints by default after every epoch, to a fixed file name. Retire the "top k" recovery checkpoint notion, which was tied to specific models that needed more than one checkpoint.
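A minimal sketch of the autosave behaviour, using PyTorch Lightning's `ModelCheckpoint` callback (illustrative only: the InnerEye implementation may differ, and `every_n_epochs` requires a recent Lightning version - older versions used `period`):

```python
from pytorch_lightning.callbacks import ModelCheckpoint

# Save a recovery checkpoint once per epoch, always under the same file name,
# instead of keeping the k best checkpoints.
recovery_checkpoint = ModelCheckpoint(
    dirpath="outputs/checkpoints",  # illustrative output folder
    filename="recovery",            # fixed file name: recovery.ckpt
    every_n_epochs=1,               # autosave after every training epoch
    save_top_k=1,                   # with the default monitor=None, only the latest checkpoint is kept
)
```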
* Some additions and fixes to the docs describing how to set up the environment. The new version should be a little clearer.
* A small change to the logging info when submitting to AzureML - the workspace name is now reported.
Jobs got stuck if they used PyTorch Lightning's built-in metrics objects, which were trying to synchronize across GPUs. Resolved by shutting down torch.distributed.
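A minimal sketch of the shutdown, using the public `torch.distributed` API:

```python
import torch.distributed as dist

def shutdown_torch_distributed() -> None:
    # Tear down the default process group, so that no stray collective
    # operations (such as metric synchronization across GPUs) can block
    # after training has finished.
    if dist.is_available() and dist.is_initialized():
        dist.destroy_process_group()
```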
Add necessary tooling and examples for running fastMRI reconstruction models.
- Script to create and run an Azure Data Factory to download the raw data and place it into a storage account
- Detailed examples to run the VarNet model from the fastMRI GitHub repo
- Ability to work with fixed mounting points for datasets
Removes the ability to perform sub-fold cross validation. Removes parameters `number_of_cross_validation_splits_per_fold`
and `cross_validation_sub_fold_split_index` from ScalarModelBase.
* Remove blobxfer, and remove it from environment.yml
* Remove configs that are not required
* Update and improve CHANGELOG.md
* Fix numba issue
* Fix tests
* Fix download issue
* Create a separate model folder
* Improve the HD check
* Fix inverted logic
* Register the model on the parent run
* Documentation updates
- Make file structure consistent across normal training and training when InnerEye is a submodule
- Add test coverage for the file structure of registered models
- Add documentation on what the model structure looks like
- If multiple Conda files are used in an InnerEye run, they are merged into one environment file for deployment (a sketch of such a merge follows after this list). The complicated merge inside of `run_scoring` could in principle be deprecated, but it is left in place in case it is needed for legacy models.
- Add test coverage for `submit_for_inference`: the previous test used a hardcoded legacy model, meaning that any change to the model structure could have broken the script
- The test for `submit_for_inference` is no longer submitted from the big AzureML run, shortening the runtime of that part of the PR build. Instead, it is triggered after the `TrainViaSubmodule` part of the build. The corresponding AzureML experiment is no longer `model_inference`, but the same experiment as all other AzureML runs.
- The test for `submit_for_inference` previously ran on the expensive `training-nd24` cluster; it now runs on the cheaper `nc12`.
- `submit_for_inference` now correctly uses the `score.py` file that is inside of the model, rather than copying it from the repository root.
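A minimal sketch of the Conda merge mentioned in the first item of this list (a naive union of channels and dependencies; the actual InnerEye logic may resolve version conflicts differently):

```python
from pathlib import Path
from typing import List
import yaml  # PyYAML

def merge_conda_files(files: List[Path], result_file: Path) -> None:
    """Merge the channels and dependencies of several Conda environment
    files into a single environment file for deployment."""
    channels = []
    dependencies = []
    for file in files:
        env = yaml.safe_load(file.read_text())
        for channel in env.get("channels", []):
            if channel not in channels:
                channels.append(channel)
        # Dependencies can be strings ("numpy=1.19") or a dict (the pip section).
        for dependency in env.get("dependencies", []):
            if dependency not in dependencies:
                dependencies.append(dependency)
    merged = {"name": "deployment", "channels": channels, "dependencies": dependencies}
    result_file.write_text(yaml.safe_dump(merged))
```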
- Adds a parameter `weights_url` to DeepLearningConfig to download model weights from a URL.
- Adds a parameter `local_weights_path` to DeepLearningConfig to initialize model weights from a local checkpoint. This can also be used to perform inference on a checkpoint from a local training run.
- Refactors all checkpoint logic, including checkpoint recovery via `run_recovery`, into a class `CheckpointHandler`
- Adds a parameter `epochs_to_test` to DeepLearningConfig which can be used to specify a list of epochs to test in a training/inference run.
- Deprecates DeepLearningConfig parameters `test_diff_epochs`, `test_step_epochs` and `test_start_epoch`.
Closes #178, closes #297.
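A minimal sketch of how `weights_url` and `local_weights_path` could be consumed when initializing a model (illustrative only; `load_weights` is a hypothetical helper, and the actual `CheckpointHandler` logic differs in detail):

```python
from pathlib import Path
from typing import Optional
import torch
from torch.hub import load_state_dict_from_url

def load_weights(model: torch.nn.Module,
                 weights_url: Optional[str] = None,
                 local_weights_path: Optional[Path] = None) -> None:
    # Prefer a local checkpoint (for example, from a previous local training
    # run); otherwise download the weights from the given URL.
    if local_weights_path is not None:
        state_dict = torch.load(str(local_weights_path), map_location="cpu")
    elif weights_url is not None:
        state_dict = load_state_dict_from_url(weights_url, map_location="cpu")
    else:
        return  # Neither parameter is set: keep the default initialization.
    model.load_state_dict(state_dict)
```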
- Rename the `TestOutputDirectories` class because it is picked up by pytest as something it expects to contain tests
- Switch fields to using `Path`, rather than `str`
- Compute aggregate metrics over the whole training run
- Get allocated and reserved GPU memory (see the sketch after this list)
- Store aggregate metrics in AzureML
Note that diagnostic metrics are no longer stored in AzureML; TensorBoard is better suited to vast amounts of metrics.
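A minimal sketch of reading the diagnostic memory numbers (note that `torch.cuda.memory_reserved` was called `torch.cuda.memory_cached` before PyTorch 1.4):

```python
import torch

def gpu_memory_stats(device: int = 0) -> dict:
    # Bytes currently held by tensors, and bytes reserved by the CUDA
    # caching allocator on the given device.
    return {
        "allocated_bytes": torch.cuda.memory_allocated(device),
        "reserved_bytes": torch.cuda.memory_reserved(device),
    }
```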
Add the capability to not check in the complete `settings.yml` file, and fill in the missing values via an `InnerEyePrivateSettings.yml` file in the repository root.
- Remove the use of keyvaults for credentials storage, because the storage account keys are no longer needed in AzureML
- Remove dead config elements from azure_config
- Download datasets via AzureML FileDataset if credentials for the dataset storage account are not present (see the sketch after this list)
- Add an option to specify the experiment name on the command line
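A minimal sketch of the credential-free dataset download via the azureml-sdk `Dataset` API (dataset name and target folder are illustrative):

```python
from azureml.core import Dataset, Workspace

workspace = Workspace.from_config()  # reads the workspace details from config.json
# If no storage account credentials are available, fall back to a FileDataset
# that is registered in the AzureML workspace.
dataset = Dataset.get_by_name(workspace, name="my_dataset")
dataset.download(target_path="datasets/my_dataset", overwrite=True)
```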
This PR adds an Azure Resource Manager template to create the AzureML workspace and a compute cluster. The documentation has been updated to reflect that.
Also, the "location" argument in azure_config has been retired, because we assume that the workspace is already created.
Most git-related information is presently expected in commandline arguments, populated in build pipelines. Change that to read the information via gitpython, so that the branch info is also correct for runs started from a user's local machine. This fixes #151.
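A minimal sketch of reading the git information via gitpython (the returned field names are illustrative):

```python
from git import Repo

def get_git_information(path: str = ".") -> dict:
    # Read branch and commit from the local repository itself, so that the
    # information is also correct for runs submitted from a user's machine,
    # without relying on commandline arguments set by a build pipeline.
    repo = Repo(path, search_parent_directories=True)
    return {
        "branch": repo.active_branch.name,  # raises on a detached HEAD
        "commit_id": repo.head.commit.hexsha,
        "is_dirty": repo.is_dirty(),
    }
```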
Improve documentation around queuing training runs and run recovery. Write run recovery ID to a file for later use in pipelines.
This PR upgrades all packages to their newest versions, except for pytorch (left at 1.3) and torchvision whose version needs to match the pytorch version. The pytorch version may stay at 1.3 until 1.7 is released, as 1.4, 1.5 and 1.6 all present hard-to-fix problems.
* azureml-sdk upgraded from 1.9 to 1.12.
* Removed from environment.yml altogether, as they're either not needed or are pulled in by other packages: jupyter, mock, options.
* Estimator(...) replaced by PyTorch(...) when creating the estimator (see the sketch after this list), as in future we may need to be able to set framework_version.
* Format strings (f"...") without any {variables} inside them replaced by ordinary strings, to keep the new version of flake8 happy.
* Single-letter variable names replaced by multi-letter ones, for the same reason.
* Remove the "reorder" workaround for the python-version bug in merging conda dependencies, which is fixed in azureml-sdk 1.12.
* Remove HotFixedTensorBoard, as tests pass without it.
* Add comments to lr_scheduler.py to specify changes required for get_[last_]lr when we upgrade pytorch beyond 1.3.
* A few other minor things required by new versions of packages.
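A minimal sketch of the estimator change mentioned above (arguments are illustrative; in azureml-sdk 1.12, the framework-specific estimator lives in `azureml.train.dnn`):

```python
from azureml.train.dnn import PyTorch

# Previously created via the generic Estimator(...). The PyTorch estimator
# additionally allows pinning framework_version, should we need that later.
estimator = PyTorch(
    source_directory=".",
    entry_script="InnerEye/ML/runner.py",  # illustrative entry script
    compute_target="training-cluster",     # illustrative cluster name
    use_docker=True,
)
```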
Reorganize outputs from train and test runs. For details of the new structure, see the changes to building_models.md, which should be read first.
Also:
Pass is_ensemble as an argument to many places from the ultimate calling context, so we can be sure when we are running an ensemble model.
Add a context manager logging_section, to support named sections in the log and aid in understanding them (a sketch follows at the end of this entry).
Remove baseline-comparison results after downloading and doing the comparison.
For each training and validation epoch, log the total time taken and the portion of that taken for data loading.
Use shortened split names in box plots so X axis labels can be read (there will be more to do in future PRs to improve plot legibility).
Add a switch --create_plots, default True, to plot_cross_validation.py so that by setting it to False, the script can be run (to generate outliers and statistical test results) in environments where graphics is not available.
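A minimal sketch of the `logging_section` context manager mentioned above (the real implementation may differ, for example in the exact log format):

```python
import logging
import time
from contextlib import contextmanager

@contextmanager
def logging_section(name: str):
    # Emit clearly visible markers around a named section of the log,
    # making long training logs easier to navigate.
    logging.info(f"**** STARTING: {name} ****")
    start = time.time()
    try:
        yield
    finally:
        logging.info(f"**** FINISHED: {name} after {time.time() - start:0.2f} seconds ****")
```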