Changelog
All notable changes to this project will be documented in this file.
The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.
For each Pull Request, the affected code parts should be briefly described and added here in the "Upcoming" section. Once a release is done, the "Upcoming" section becomes the release changelog, and a new empty "Upcoming" should be created.
Upcoming
Added
- (#689) Show default argument values in help message.
- (#671) Remove sequence models and unused variables. Simplify README.
- (#693) Improve instructions for HelloWorld model in AzureML.
- (#678) Add function to get log level name and use it for logging.
- (#666) Replace RadIO with TorchIO for patch-based inference.
- (#643) Test for recovery of SSL job. Tracks learning rate and train loss.
- (#594) When supplying a "--tag" argument, the AzureML jobs use that value as the display name, to more easily distinguish runs.
- (#640) Cancel AzureML jobs from previous runs of the PR build in the same branch to reduce AML load
- (#577) Commandline switch monitor_gpu to monitor GPU utilization via Lightning's GpuStatsMonitor, switch monitor_loading to check batch loading times via BatchTimeCallback, and pl_profiler to turn on the Lightning profiler (simple, advanced, or pytorch)
- (#544) Add documentation for segmentation model evaluation.
- (#637) Add option to encode in chunks and to load pre-cached dataset in CPU or GPU in the histo pipeline.
- (#465) Adding ability to run segmentation inference module on test data with partial ground truth files. (Also 522.)
- (#502) More flags for fine control of when to run inference.
- (#492) Adding capability for regression tests for test jobs that run in AzureML.
- (#509) Run inference on registered models (single and ensemble) using the parameter model_id.
- (#554) Added a parameter pretraining_dataset_id to NIH_COVID_BYOL to specify the name of the SSL training dataset.
- (#560) Added pre-commit hooks.
- (#619) Add DeepMIL PANDA
- (#559) Adding the accompanying code for the "Active label cleaning: Improving dataset quality under resource constraints" paper.
- (#589) Add LightningContainer.update_azure_config() hook to enable overriding AzureConfig parameters from a container (e.g. experiment_name, cluster, num_nodes).
- (#617) Commandline flag pl_check_val_every_n_epoch to control how often validation is happening
- (#618) Using Azure Pipeline Cache to avoid re-building conda environment repeatedly
- (#603) Add histopathology module
- (#614) Checkpoint downloading falls back to looking into AzureML if no checkpoints on disk
- (#613) Add additional tests for histopathology datasets
- (#616) Add more histopathology configs and tests
- (#621) Add WSI preprocessing functions and enable tiling more generic slide datasets
- (#634) Add WSI heatmaps and thumbnails to standard test outputs
- (#635) Add tile selection and binary label for online evaluation of PANDA SSL
- (#647) Add class-wise accuracy logging and confusion matrix to DeepMIL
- (#653) Add dropout to DeepMIL and fix feature extractor setup.
- (#650) Enable fine-tuning in DeepMIL using PANDA as the classification task.
- (#656) Add subsampling transform and support for MIL mean pooling.
- (#679) Add FP and TN slides/tiles to DeepMIL outputs and extend outputs to multi-class problems.
Changed
- (#677) Update TorchIO version to include the recent bug fix related to patch-based inference.
- (#666) Replace RadIO with TorchIO for patch-based inference.
- (#659) Update cudatoolkit version from 11.1 to 11.3.
- (#588) Replace SciPy with PIL.PngImagePlugin.PngImageFile to load png files.
- (#585) Switching to PyTorch 1.10.0 and torchvision 0.11.1
- (#576) The console output is no longer written to stdout.txt because AzureML handles that better now
- (#531) Updated PL to 1.3.8, torchmetrics and pl-bolts and changed relevant metrics and SSL code API.
- (#555) Make the SSLContainer compatible with new datasets
- (#533) Better defaults for inference on ensemble children.
- (#536) Inference will not run on the validation set by default, this can be turned on via the --inference_on_val_set flag.
- (#548) Many Azure-related functions have been moved out of the toolbox, into the separate hi-ml Python package.
- (#502) Renamed command line option 'perform_training_set_inference' to 'inference_on_train_set'. Replaced command line option 'perform_validation_and_test_set_inference' with the pair of options 'inference_on_val_set' and 'inference_on_test_set'.
- (#496) All plots are now saved as PNG, rather than JPG.
- (#497) Reducing the size of the code snapshot that gets uploaded to AzureML, by skipping all test folders.
- (#509) Parameter extra_downloaded_run_id has been renamed to pretraining_run_checkpoints.
- (#526) Updated Covid config to use a multiclass formulation. Moved functions create_metric_computers and compute_and_log_metrics from ScalarLightning to ScalarModelBase.
- (#554) Updated report in CovidModel. Set parameters in the config to run inference on both the validation and test sets by default.
- (#584) SSL models write the optimizer state for the linear head to the checkpoint now.
- (#594) Pytorch is now non-deterministic by default. Upgrade to AzureML-SDK 1.36
- (#566) Update hi-ml dependency to hi-ml-azure.
- (#591) Upgrade Pytorch Lightning to 1.5.0
- (#572) Updated to new version of hi-ml package
- (#623) Save checkpoints in SSLOnlineEvaluator without DDP wrapper code
- (#617) Provide an easier way for LightningContainers to add callbacks.
- (#596) Add cudatoolkit=11.1 specification to environment.yml.
- (#615) Minor changes to checkpoint download from AzureML.
- (#605) Make build jobs deterministic for regression testing.
- (#633) Model training now only writes one recovery checkpoint, rather than multiple ones. Frequency is controlled by autosave_every_n_val_epochs.
- (#632) Nifti test data is no longer stored in Git LFS
Fixed
- (#699) Fix Sphinx warnings.
- (#682) Ensure the shape of input patches is compatible with model constraints.
- (#681) Pad model outputs if they are smaller than the inputs.
- (#683) Fix missing separator error in docs Makefile.
- (#659) Fix caching and checkpointing for TCGA CRCk dataset.
- (#649) Fix for the _convert_to_tensor_if_necessary method so that PIL.Image as well as np.array get converted to torch.Tensor.
- (#606) Bug fix: registered models do not include the hi-ml submodule
- (#646) Workaround for bug in PL: CombinedLoader cannot be used for training data when using DDP
- (#593) Bug fix for hi-ml 0.1.11 issue (#130): empty mount point is turned into ".", which fails the AML job
- (#587) Bug fix for regression in AzureML's handling of environments: upgrade to hi-ml 0.1.11
- (#625) updates to PandaDeepMIL to enable the use of a SSL pre-trained checkpoint and updated commit to hi-ml
- (#537) Print warning if inference is disabled but comparison requested.
- (#567) fix pillow version.
- (#546) Environment and hello_world_model documentation updated
- (#525) Enable --store_dataset_sample
- (#495) Fix model comparison.
- (#547) The parameter pl_find_unused_parameters was no longer used to initialize the DDP Plugin.
- (#482) Check bool parameter is either true or false.
- (#475) Bug in AML SDK meant that we could not train any large models anymore because data loaders ran out of memory.
- (#472) Correct model path for moving ensemble models.
- (#494) Fix an issue where multi-node jobs for LightningContainer models can get stuck at test set inference.
- (#498) Workaround for the problem that downloading multiple large checkpoints can time out.
- (#515) Workaround for occasional issues with dataset mounting and running matplotlib on some machines. Re-instantiated a disabled test.
- (#509) Fix issue where model checkpoints were not loaded in inference-only runs when using lightning containers.
- (#553) Fix incomplete test data module setup in Lightning inference.
- (#557) Fix issue where learning rate was not set correctly in the SimCLR module
- (#622) Fix issue with multi-GPU jobs on a VM: each process tries to create a folder structure
- (#558) Fix issue with the CovidModel config where model weights from a finetuning run were incompatible with the model architecture created for non-finetuning runs.
- (#604) Fix issue where runs on a VM would download the dataset even when a local dataset is provided.
- (#628) SSL SimCLR using the wrong LR schedule when running on multiple nodes
- (#638) SimCLR cosine LR scheduler was using the wrong length information when used with long linear head datasets
- (#612) SSL online evaluator was not doing distributed training
- (#652) Run pytest build on Windows after Linux agent version upgrade
- (#655) Run pytest on Linux again, but with Ubuntu 20.04
- (#674) Fix DeepMIL metrics bug whereby hard labels were used instead of probabilities.
Removed
- (#692) Replace InnerEye-DataQuality with a link to commit,
- (#577) Removing the monitoring of batch loading time, use the BatchTimeCallback from hi-ml instead
- (#542) Removed Windows test leg from build pipeline.
- (#509) Parameters local_weights_path and weights_url can no longer be used to initialize a training run, only inference runs.
- (#526) Removed get_posthoc_label_transform in class ScalarModelBase. Instead, functions get_loss_function and compute_and_log_metrics in ScalarModelBase can be implemented to compute the loss and metrics in a task-specific manner.
- (#554) Removed cryptography from list of invalid packages in test_invalid_python_packages as it is already present as a dependency in our conda environment.
- (#596) Removed obsolete TrainGlaucomaCV from PR build.
- (#604) Removed all code that downloads datasets, this is now all handled by hi-ml
Deprecated
- (#633) Model fields recovery_checkpoint_save_interval and recovery_checkpoints_save_last_k have been retired. Recovery checkpoint handling is now controlled by autosave_every_n_val_epochs.
0.3 (2021-06-01)
Added
- (#483) Allow cross validation with 'bring your own' Lightning models (without ensemble building).
- (#489) Remove portal query for outliers.
- (#488) Better handling of missing seriesId in segmentation cross validation reports.
- (#454) Checking that labels are mutually exclusive.
- (#447) Added a sanity check to ensure there are no missing channels, nor missing files. If missing channels in the csv file or filenames associated with channels are incorrect, pipeline exits with error report before running training or inference.
- (#446) Guarding save_outlier so that it works when institution id and series id columns are missing.
- (#441) Add script to move models from one AzureML workspace to another: python InnerEye/Scripts/move_model.py
- (#417) Added a generic way of adding PyTorch Lightning models to the toolbox. It is now possible to train almost any Lightning model with the InnerEye toolbox in AzureML, with only minimum code changes required. See the MD documentation for details.
- (#430) Update conversion to 1.0.1 InnerEye-DICOM-RT to add: manufacturer, SoftwareVersions, Interpreter and ROIInterpretedTypes.
- (#385) Add the ability to train a model on multiple nodes in AzureML. Example: Add --num_nodes=2 to the commandline arguments to train on 2 nodes.
- (#366) and (#407) add new parameters to the score.py script of use_dicom and result_zip_dicom_name. If use_dicom==True then the input file should be a zip of a DICOM series. This will be unzipped and converted to Nifti format before processing. The result will then be converted to a DICOM-RT file, zipped and stored as result_zip_dicom_name.
- (#416) Add a github action that checks if CHANGELOG.md has been modified.
- (#412) Dataset files can now have arbitrary names, and are no longer restricted to be called dataset.csv, via the config field dataset_csv. This allows to have a single set of image files in a folder, but multiple datasets derived from it.
- (#391) Support for multilabel classification tasks. Multilabel models can be trained by adding the parameter class_names to the config for classification models. class_names should contain the name of each label class in the dataset, and the order of names should match the order of class label indices in dataset.csv. dataset.csv supports multiple labels (indices corresponding to class_names) per subject in the label column. Multiple labels should be encoded as a string with labels separated by a |, for example "0|2|4". Note that this PR does not add support for multiclass models, where the labels are mutually exclusive.
- (#425) The number of layers in a Unet is no longer fixed at 4, but can be set via the config field num_downsampling_paths. A lower number of layers may be useful for decreasing memory requirements, or for working with smaller images. (The minimum image size in any dimension when using a network of n layers is 2**n.)
- (#426) Flake8, mypy, and testing the HelloWorld model is now happening in a Github action, no longer in Azure Pipelines.
- (#405) Cross-validation runs for classification models now also generate a report notebook summarising the metrics from the individual splits. Also includes minor formatting improvements for standard classification reports.
- (#438) Add links and small docs to InnerEye-Gateway and InnerEye-Inference
- (#439) Enable automatic job recovery from last recovery checkpoint in case of job pre-emption on AML. Give the possibility to the user to keep more than one recovery checkpoint.
- (#442) Enable defining custom scalar losses (ScalarLoss.CustomClassification and CustomRegression), prediction targets (ScalarModelBase.target_names), and reporting (ModelConfigBase.generate_custom_report()) in scalar configs, providing more flexibility for defining model configs with custom behaviour while leveraging the existing InnerEye workflows.
- (#444) Added setup scripts and documentation to work with the FastMRI challenge datasets.
- (#444) Git-related information is now printed to the console for easier diagnostics.
- (#445) Adding test coverage for the HelloContainer model with multiple GPUs
- (#450) Adds the metric "Accuracy at threshold 0.5" to the classification report (classification_crossval_report.ipynb).
- (#451) Write a file model_outputs.csv with columns subject, prediction_target, label, model_output and cross_validation_split_index. This file is not written out for sequence models.
- (#440) Added support for training of self-supervised models (BYOL and SimCLR) based on the bring-your-own-model framework. Providing example configurations for training of SSL models on CIFAR10/100 datasets as well as for chest-x-ray datasets such as NIH Chest-Xray or RSNA Pneumonia Detection Challenge datasets. See SSL doc for more details.
- (#455) All models trained on AzureML are registered. The codepath previously allowed only segmentation models (subclasses of SegmentationModelBase) to be registered. Models are registered after a training run or if the only_register_model flag is set. Models may be legacy InnerEye config-based models or may be defined using the LightningContainer class. Additionally, the TrainHelloWorldAndHelloContainer job in the PR build has been split into two jobs, TrainHelloWorld and TrainHelloContainer. A pytest marker after_training_hello_container has been added to run tests after training is finished in the TrainHelloContainer job.
- (#456) Adding configs to train Covid detection models.
- (#463) Add arguments dirs_recursive and dirs_non_recursive to mypy_runner.py to let users specify a list of directories to run mypy on.
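The multilabel encoding described under (#391), where several class indices are joined with | in the label column, can be illustrated with a small sketch. This is not the toolbox's own parsing code: the helper name parse_multilabel, the example class names, and the choice to treat an empty string as "no class present" are assumptions made for illustration only.

```python
from typing import List


def parse_multilabel(label: str, num_classes: int) -> List[int]:
    """Convert a label string such as "0|2|4" into a multi-hot vector.

    Hypothetical helper for illustration; not part of the InnerEye API.
    """
    multi_hot = [0] * num_classes
    for token in label.split("|"):
        token = token.strip()
        if not token:
            # Assumption: an empty label string means no class is present.
            continue
        index = int(token)
        if not 0 <= index < num_classes:
            raise ValueError(f"Label index {index} is outside [0, {num_classes})")
        multi_hot[index] = 1
    return multi_hot


# Hypothetical class names; their order defines the label indices.
class_names = ["class_a", "class_b", "class_c", "class_d", "class_e"]
print(parse_multilabel("0|2|4", len(class_names)))  # [1, 0, 1, 0, 1]
```

Because the vector position of each class comes from its index, the order of class_names has to match the indices used in the dataset file, as the entry above states.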
Changed
- (#385) Starting an AzureML run now uses the ScriptRunConfig object, rather than the deprecated Estimator object.
- (#385) When registering a model, the name of the Python execution environment is added as a tag. This tag is read when running inference, and the execution environment is re-used.
- (#411) Upgraded to PyTorch 1.8.0, PyTorch-Lightning 1.1.8 and AzureML SDK 1.23.0
- (#432) Upgraded to PyTorch-Lightning 1.2.7. Add end-to-end test for classification cross-validation. WARNING: upgrade PL version causes hanging of multi-node training.
- (#437) Upgrade to PyTorch-Lightning 1.2.8.
- (#439) Recovery checkpoints are now named recovery_epoch=x.ckpt instead of recovery.ckpt or recovery-v0.ckpt.
- (#451) Change the signature for function generate_custom_report in ModelConfigBase to take only the path to the reports folder and a ModelProcessing object.
- (#444) The method before_training_on_rank_zero of the LightningContainer class has been renamed to before_training_on_global_rank_zero. The order in which the hooks are called has been changed.
- (#458) Simplifying and generalizing the way we handle data augmentations for classification models. The pipelining logic is now taken care of by an ImageTransformPipeline class that takes as input a list of transforms to chain together. This pipeline takes care of applying transforms on 3D or 2D images. The user can choose to apply the same transformation for all channels (RGB example) or to apply a different transformation for each channel (if each channel represents a different modality / time point for example). The pipeline can now work directly with out-of-the-box torchvision transforms (as long as they support [..., C, H, W] inputs). This allows us to get rid of nearly all of our custom augmentation functions. The conversion from a pipeline of image transformations to ScalarItemAugmentation is now taken care of under the hood; the user does not need to call this wrapper for each config class. In models derived from ScalarModelConfig, to change which augmentations are applied to the image inputs (resp. segmentation inputs), users can override get_image_transform (resp. get_segmentation_transform). These two functions replace the old get_image_sample_transforms method. See docs/building_models.md for more information on augmentations.
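As an illustration of the recovery checkpoint naming introduced in (#439), the sketch below picks the most recent checkpoint from a list of files by reading the epoch out of the recovery_epoch=x.ckpt file name. The function name latest_recovery_checkpoint and the folder names are hypothetical, and this is not how the toolbox itself locates checkpoints.

```python
import re
from pathlib import Path
from typing import Iterable, Optional

# Matches the naming scheme recovery_epoch=x.ckpt, where x is an integer epoch.
RECOVERY_PATTERN = re.compile(r"^recovery_epoch=(\d+)\.ckpt$")


def latest_recovery_checkpoint(paths: Iterable[Path]) -> Optional[Path]:
    """Return the path with the highest epoch number, or None if none match.

    Hypothetical helper for illustration; not part of the InnerEye API.
    """
    best_epoch = -1
    best_path: Optional[Path] = None
    for path in paths:
        match = RECOVERY_PATTERN.match(path.name)
        if match and int(match.group(1)) > best_epoch:
            best_epoch = int(match.group(1))
            best_path = path
    return best_path


candidates = [
    Path("checkpoints/recovery_epoch=3.ckpt"),
    Path("checkpoints/recovery_epoch=10.ckpt"),
    Path("checkpoints/recovery.ckpt"),  # old naming scheme, ignored here
]
print(latest_recovery_checkpoint(candidates))
```

Comparing the parsed integers rather than the file names avoids the lexicographic pitfall where "recovery_epoch=3" would sort after "recovery_epoch=10".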
Fixed
- (#422) Documentation - clarified setting_up_aml.md datastore creation instructions and fixed small typos in hello_world_model.md
- (#432) Fixed cross-validation for classification models. Fixed multi-gpu metrics aggregation. Add end-to-end test for classification cross-validation. Add fix to bug in ddp setting when running multi-node with 1 gpu per node.
- (#435) If parameter model in AzureConfig is not set, display an error message and terminate the run.
- (#437) Fixed multi-node DDP bug in PL v1.2.8. Re-add end-to-end test for multi-node.
- (#445) Fixed a bug when running inference for container models on machines with >1 GPU
Removed
- (#439) Deprecated start_epoch config argument.
- (#450) Delete unused classification_report.ipynb.
- (#455) Removed the AzureRunner conda environment. The full InnerEye conda environment is needed to submit a training job to AzureML.
- (#458) Getting rid of all the unused code for RandAugment & Co. The user has now instead complete freedom to specify the set of augmentations to use.
- (#468) Removed the KneeSinglecoil example model
Deprecated
0.2 (2021-01-29)
Added
- (#323) There are new model configuration fields (and hence, commandline options), in particular for controlling PyTorch Lightning (PL) training:
  - max_num_gpus controls how many GPUs are used at most for training (default: all GPUs, value -1).
  - pl_num_sanity_val_steps controls the PL trainer flag num_sanity_val_steps
  - pl_deterministic controls the PL trainer flags benchmark and deterministic
  - generate_report controls if a HTML report will be written (default: True)
  - recovery_checkpoint_save_interval determines how often a checkpoint for training recovery is saved.
- (#336) New extensions of SegmentationModelBase: HeadAndNeckBase and ProstateBase. Use these classes to build your own Head&Neck or Prostate models, by just providing a list of foreground classes.
- (#363) Grouped dataset splits and k-fold cross-validation. This allows, for example, training on datasets with multiple images per subject without leaking data from the same subject across train/test/validation sets or cross-validation folds. To use this functionality, simply provide the name of the CSV grouping column (group_column) when creating the DatasetSplits object in your model config's get_model_train_test_dataset_splits() method. See the InnerEye.ML.utils.split_dataset.DatasetSplits class for details.
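The idea behind the grouped splits in (#363) can be shown with a minimal sketch in plain Python. It does not use or reproduce the DatasetSplits class: the function grouped_split, its parameters, and the example rows are hypothetical and only illustrate how splitting by group keeps all rows of one subject on the same side of the split.

```python
import random
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

Row = Dict[str, str]


def grouped_split(rows: Sequence[Row], group_column: str,
                  test_fraction: float, seed: int = 0) -> Tuple[List[Row], List[Row]]:
    """Split rows into (train, test) so that no group appears in both parts.

    Hypothetical helper for illustration; not part of the InnerEye API.
    """
    # Collect all rows that share the same value in the grouping column.
    groups: Dict[str, List[Row]] = defaultdict(list)
    for row in rows:
        groups[row[group_column]].append(row)
    # Shuffle the group identifiers, not the rows, so groups stay intact.
    group_ids = sorted(groups)
    random.Random(seed).shuffle(group_ids)
    num_test = max(1, round(len(group_ids) * test_fraction))
    test_ids = set(group_ids[:num_test])
    train = [row for g in group_ids if g not in test_ids for row in groups[g]]
    test = [row for g in group_ids if g in test_ids for row in groups[g]]
    return train, test


# Example: two images per subject, grouped by a hypothetical "subject" column.
rows = [{"subject": f"s{i // 2}", "image": f"img{i}.nii.gz"} for i in range(8)]
train_rows, test_rows = grouped_split(rows, group_column="subject", test_fraction=0.25)
print(len(train_rows), len(test_rows))
```

Splitting on shuffled row indices instead would allow two images of the same subject to land in different sets, which is the leakage the changelog entry describes.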
Changed
- (#323) The codebase has undergone a massive refactoring, to use PyTorch Lightning as the foundation for all training. As a consequence of that:
  - Training is now using Distributed Data Parallel with synchronized batchnorm. The number of GPUs to use can be controlled by a new commandline argument max_num_gpus.
  - Several classes, like ModelTrainingSteps*, have been removed completely.
  - The final model is now always the one that is written at the end of all training epochs.
  - The old code had options to run full image inference at multiple epochs (i.e., multiple checkpoints). This has been removed, alongside the respective commandline options save_start_epoch, save_step_epochs, epochs_to_test, test_diff_epochs, test_step_epochs, test_start_epoch
  - The commandline option register_model_only_for_epoch is now called only_register_model, and is boolean.
  - All metrics are written to AzureML and Tensorboard in a unified format. A training Dice score for 'bladder' would previously be called Train_Dice/bladder, now it is train/Dice/bladder.
  - Due to a different checkpoint format, it is no longer possible to use checkpoints written by the previous version of the code.
- The arguments of the score.py script changed: data_root -> data_folder, it no longer assumes a fixed data subfolder. project_root -> model_root, test_image_channels -> image_files.
- By default, the visualization of patch sampling for segmentation models will run on only 1 image (down from 5). This is because patch sampling is expensive to compute, taking 1min per large CT scan.
- (#336) Renamed HeadAndNeckBase to HeadAndNeckPaper, and ProstateBase to ProstatePaper.
- (#427) Move dicom loading function from SimpleITK to pydicom. Loading time improved by 30x.
Fixed
- When registering a model, it now has a consistent folder structure, described here. This folder structure is present irrespective of using InnerEye as a submodule or not. In particular, exactly 1 Conda environment will be contained in the model.
Removed
- The commandline options to control which checkpoint is saved, and which is used for inference, have been removed: save_start_epoch, save_step_epochs, epochs_to_test, test_diff_epochs, test_step_epochs, test_start_epoch
- Removed blobxfer completely. When downloading a dataset from Azure, we now use AzureML dataset downloading tools. Please remove the following fields from your settings.yml file: 'datasets_storage_account' and 'datasets_container'.
- Removed ProstatePaperBase.
- Removed ability to perform sub-fold cross validation. The parameters number_of_cross_validation_splits_per_fold and cross_validation_sub_fold_split_index have been removed from ScalarModelBase.
Deprecated
0.1 (2020-11-13)
- This is the baseline release.