Autosave checkpoints by default after every epoch, to a fixed file name. Retire the "top k" recovery checkpoint notion, which was tied to specific models that needed more than one checkpoint.
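A minimal sketch of the autosave behaviour, using PyTorch Lightning's `ModelCheckpoint` callback (illustrative only: the InnerEye implementation may differ, and `every_n_epochs` requires a recent Lightning version - older versions used `period`):

```python
from pytorch_lightning.callbacks import ModelCheckpoint

# Save a recovery checkpoint once per epoch, always under the same file name,
# instead of keeping the k best checkpoints.
recovery_checkpoint = ModelCheckpoint(
    dirpath="outputs/checkpoints",  # illustrative output folder
    filename="recovery",            # fixed file name: recovery.ckpt
    every_n_epochs=1,               # autosave after every training epoch
    save_top_k=1,                   # with the default monitor=None, only the latest checkpoint is kept
)
```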
* Some additions and fixes to the docs describing how to set up the environment. The new version should be a little clearer.
* A small change to the logging info when submitting to AzureML - the workspace name is now reported.
Jobs got stuck if they used PyTorch Lightning's built-in metrics objects, which were trying to synchronize across GPUs. Resolved by shutting down torch.distributed.
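A minimal sketch of the shutdown, using the public `torch.distributed` API:

```python
import torch.distributed as dist

def shutdown_torch_distributed() -> None:
    # Tear down the default process group, so that no stray collective
    # operations (such as metric synchronization across GPUs) can block
    # after training has finished.
    if dist.is_available() and dist.is_initialized():
        dist.destroy_process_group()
```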
Add necessary tooling and examples for running fastMRI reconstruction models.
- Script to create and run an Azure Data Factory to download the raw data and place it into a storage account
- Detailed examples to run the VarNet model from the fastMRI GitHub repo
- Ability to work with fixed mounting points for datasets
Removes the ability to perform sub-fold cross validation. Removes parameters `number_of_cross_validation_splits_per_fold`
and `cross_validation_sub_fold_split_index` from ScalarModelBase.
* Remove blobxfer, and remove it from environment.yml
* Remove configs that are not required
* Update and improve CHANGELOG.md
* Fix numba issue
* Fix tests
* Fix download issue
* Create a separate model folder
* Improve the HD check
* Fix inverted logic
* Register the model on the parent run
* Documentation updates
- Make file structure consistent across normal training and training when InnerEye is a submodule
- Add test coverage for the file structure of registered models
- Add documentation on what the model structure looks like
- If multiple Conda files are used in an InnerEye run, they are merged into one environment file for deployment (a sketch of such a merge follows after this list). The complicated merge inside of `run_scoring` could in principle be deprecated, but it is left in place in case it is needed for legacy models.
- Add test coverage for `submit_for_inference`: the previous test used a hardcoded legacy model, meaning that any change to the model structure could have broken the script
- The test for `submit_for_inference` is no longer submitted from the big AzureML run, shortening the runtime of that part of the PR build. Instead, it is triggered after the `TrainViaSubmodule` part of the build. The corresponding AzureML experiment is no longer `model_inference`, but the same experiment as all other AzureML runs.
- The test for `submit_for_inference` previously ran on the expensive `training-nd24` cluster; it now runs on the cheaper `nc12`.
- `submit_for_inference` now correctly uses the `score.py` file that is inside of the model, rather than copying it from the repository root.
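A minimal sketch of the Conda merge mentioned in the first item of this list (a naive union of channels and dependencies; the actual InnerEye logic may resolve version conflicts differently):

```python
from pathlib import Path
from typing import List
import yaml  # PyYAML

def merge_conda_files(files: List[Path], result_file: Path) -> None:
    """Merge the channels and dependencies of several Conda environment
    files into a single environment file for deployment."""
    channels = []
    dependencies = []
    for file in files:
        env = yaml.safe_load(file.read_text())
        for channel in env.get("channels", []):
            if channel not in channels:
                channels.append(channel)
        # Dependencies can be strings ("numpy=1.19") or a dict (the pip section).
        for dependency in env.get("dependencies", []):
            if dependency not in dependencies:
                dependencies.append(dependency)
    merged = {"name": "deployment", "channels": channels, "dependencies": dependencies}
    result_file.write_text(yaml.safe_dump(merged))
```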
- Adds a parameter `weights_url` to DeepLearningConfig to download model weights from a URL.
- Adds a parameter `local_weights_path` to DeepLearningConfig to initialize model weights from a local checkpoint. This can also be used to perform inference on a checkpoint from a local training run.
- Refactors all checkpoint logic, including checkpoint recovery via `run_recovery`, into a class `CheckpointHandler`
- Adds a parameter `epochs_to_test` to DeepLearningConfig which can be used to specify a list of epochs to test in a training/inference run.
- Deprecates DeepLearningConfig parameters `test_diff_epochs`, `test_step_epochs` and `test_start_epoch`.
Closes #178, closes #297.
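A minimal sketch of how `weights_url` and `local_weights_path` could be consumed when initializing a model (illustrative only; `load_weights` is a hypothetical helper, and the actual `CheckpointHandler` logic differs in detail):

```python
from pathlib import Path
from typing import Optional
import torch
from torch.hub import load_state_dict_from_url

def load_weights(model: torch.nn.Module,
                 weights_url: Optional[str] = None,
                 local_weights_path: Optional[Path] = None) -> None:
    # Prefer a local checkpoint (for example, from a previous local training
    # run); otherwise download the weights from the given URL.
    if local_weights_path is not None:
        state_dict = torch.load(str(local_weights_path), map_location="cpu")
    elif weights_url is not None:
        state_dict = load_state_dict_from_url(weights_url, map_location="cpu")
    else:
        return  # Neither parameter is set: keep the default initialization.
    model.load_state_dict(state_dict)
```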
- Rename the `TestOutputDirectories` class because it is picked up by pytest as something it expects to contain tests
- Switch fields to using `Path`, rather than `str`
- Compute aggregate metrics over the whole training run
- Get allocated and reserved GPU memory (see the sketch after this list)
- Store aggregate metrics in AzureML
Note that diagnostic metrics are no longer stored in AzureML; TensorBoard is better suited to vast amounts of metrics.
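A minimal sketch of reading the diagnostic memory numbers (note that `torch.cuda.memory_reserved` was called `torch.cuda.memory_cached` before PyTorch 1.4):

```python
import torch

def gpu_memory_stats(device: int = 0) -> dict:
    # Bytes currently held by tensors, and bytes reserved by the CUDA
    # caching allocator on the given device.
    return {
        "allocated_bytes": torch.cuda.memory_allocated(device),
        "reserved_bytes": torch.cuda.memory_reserved(device),
    }
```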
Add the capability to not check in the complete `settings.yml` file, and fill in the missing values via an `InnerEyePrivateSettings.yml` file in the repository root.
- Remove the use of keyvaults for credentials storage, because the storage account keys are no longer needed in AzureML
- Remove dead config elements from azure_config
- Download datasets via AzureML FileDataset if credentials for the dataset storage account are not present (see the sketch after this list)
- Add an option to specify the experiment name on the command line
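A minimal sketch of the credential-free dataset download via the azureml-sdk `Dataset` API (dataset name and target folder are illustrative):

```python
from azureml.core import Dataset, Workspace

workspace = Workspace.from_config()  # reads the workspace details from config.json
# If no storage account credentials are available, fall back to a FileDataset
# that is registered in the AzureML workspace.
dataset = Dataset.get_by_name(workspace, name="my_dataset")
dataset.download(target_path="datasets/my_dataset", overwrite=True)
```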
This PR adds an Azure Resource Manager template to create the AzureML workspace and a compute cluster. The documentation has been updated to reflect that.
Also, the "location" argument in azure_config has been retired, because we assume that the workspace is already created.
Most git-related information is presently expected in commandline arguments, populated in build pipelines. Change that to read the information via gitpython, so that the branch info is also correct for runs started from a user's local machine. This fixes #151.
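A minimal sketch of reading the git information via gitpython (the returned field names are illustrative):

```python
from git import Repo

def get_git_information(path: str = ".") -> dict:
    # Read branch and commit from the local repository itself, so that the
    # information is also correct for runs submitted from a user's machine,
    # without relying on commandline arguments set by a build pipeline.
    repo = Repo(path, search_parent_directories=True)
    return {
        "branch": repo.active_branch.name,  # raises on a detached HEAD
        "commit_id": repo.head.commit.hexsha,
        "is_dirty": repo.is_dirty(),
    }
```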
Improve documentation around queuing training runs and run recovery. Write run recovery ID to a file for later use in pipelines.
This PR upgrades all packages to their newest versions, except for pytorch (left at 1.3) and torchvision whose version needs to match the pytorch version. The pytorch version may stay at 1.3 until 1.7 is released, as 1.4, 1.5 and 1.6 all present hard-to-fix problems.
* azureml-sdk upgraded from 1.9 to 1.12.
* Removed from environment.yml altogether, as they're either not needed or are pulled in by other packages: jupyter, mock, options.
* Estimator(...) replaced by PyTorch(...) when creating the estimator (see the sketch after this list), as in future we may need to be able to set framework_version.
* Format strings (f"...") without any {variables} inside them replaced by ordinary strings, to keep the new version of flake8 happy.
* Single-letter variable names replaced by multi-letter ones, for the same reason.
* Remove the "reorder" workaround for the python-version bug in merging conda dependencies, which is fixed in azureml-sdk 1.12.
* Remove HotFixedTensorBoard, as tests pass without it.
* Add comments to lr_scheduler.py to specify changes required for get_[last_]lr when we upgrade pytorch beyond 1.3.
* A few other minor things required by new versions of packages.
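A minimal sketch of the estimator change mentioned above (arguments are illustrative; in azureml-sdk 1.12, the framework-specific estimator lives in `azureml.train.dnn`):

```python
from azureml.train.dnn import PyTorch

# Previously created via the generic Estimator(...). The PyTorch estimator
# additionally allows pinning framework_version, should we need that later.
estimator = PyTorch(
    source_directory=".",
    entry_script="InnerEye/ML/runner.py",  # illustrative entry script
    compute_target="training-cluster",     # illustrative cluster name
    use_docker=True,
)
```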
Reorganize outputs from train and test runs. For details of the new structure, see the changes to building_models.md, which should be read first.
Also:
Pass is_ensemble as an argument to many places from the ultimate calling context, so we can be sure when we are running an ensemble model.
Add a context manager logging_section, to support named sections in the log and aid in understanding them (a sketch follows at the end of this entry).
Remove baseline-comparison results after downloading and doing the comparison.
For each training and validation epoch, log the total time taken and the portion of that taken for data loading.
Use shortened split names in box plots so X axis labels can be read (there will be more to do in future PRs to improve plot legibility).
Add a switch --create_plots, default True, to plot_cross_validation.py so that by setting it to False, the script can be run (to generate outliers and statistical test results) in environments where graphics is not available.
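A minimal sketch of the `logging_section` context manager mentioned above (the real implementation may differ, for example in the exact log format):

```python
import logging
import time
from contextlib import contextmanager

@contextmanager
def logging_section(name: str):
    # Emit clearly visible markers around a named section of the log,
    # making long training logs easier to navigate.
    logging.info(f"**** STARTING: {name} ****")
    start = time.time()
    try:
        yield
    finally:
        logging.info(f"**** FINISHED: {name} after {time.time() - start:0.2f} seconds ****")
```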