* Fix shell lexer name

* Update CHANGELOG

* Fix CHANGELOG

* Fix "html_static_path entry '_static' does not exist"

* Clean up preprocess script

* Fix link to InnerEye-DataQuality

* Use shutil.copy to copy files

* Remove extra info from CHANGELOG

* Fix broken link to LICENSE

* Fix lexer name for YAML

* Remove colons from headers

* Fix InnerEye module not being found
Fernando Pérez-García 2022-03-22 09:46:15 +00:00 committed by GitHub
Parent 95d8b72ae6
Commit 45e7d5ff4d
No key matching this signature was found
GPG key ID: 4AEE18F83AFDEB23
10 changed files: 142 additions and 143 deletions


@ -37,7 +37,7 @@ jobs that run in AzureML.
`NIH_COVID_BYOL` to specify the name of the SSL training dataset.
- ([#560](https://github.com/microsoft/InnerEye-DeepLearning/pull/560)) Added pre-commit hooks.
- ([#619](https://github.com/microsoft/InnerEye-DeepLearning/pull/619)) Add DeepMIL PANDA
- ([#559](https://github.com/microsoft/InnerEye-DeepLearning/pull/559)) Adding the accompanying code for the ["Active label cleaning: Improving dataset quality under resource constraints"](https://arxiv.org/abs/2109.00574) paper. The code can be found in the [InnerEye-DataQuality](InnerEye-DataQuality/README.md) subfolder. It provides tools for training noise robust models, running label cleaning simulation and loading our label cleaning benchmark datasets.
- ([#559](https://github.com/microsoft/InnerEye-DeepLearning/pull/559)) Adding the accompanying code for the ["Active label cleaning: Improving dataset quality under resource constraints"](https://arxiv.org/abs/2109.00574) paper.
- ([#589](https://github.com/microsoft/InnerEye-DeepLearning/pull/589)) Add `LightningContainer.update_azure_config()`
hook to enable overriding `AzureConfig` parameters from a container (e.g. `experiment_name`, `cluster`, `num_nodes`).
- ([#617](https://github.com/microsoft/InnerEye-DeepLearning/pull/617)) Commandline flag `pl_check_val_every_n_epoch` to control how often validation happens
@ -97,6 +97,7 @@ gets uploaded to AzureML, by skipping all test folders.
### Fixed
- ([#699](https://github.com/microsoft/InnerEye-DeepLearning/pull/699)) Fix Sphinx warnings.
- ([#682](https://github.com/microsoft/InnerEye-DeepLearning/pull/682)) Ensure the shape of input patches is compatible with model constraints.
- ([#681](https://github.com/microsoft/InnerEye-DeepLearning/pull/681)) Pad model outputs if they are smaller than the inputs.
- ([#683](https://github.com/microsoft/InnerEye-DeepLearning/pull/683)) Fix missing separator error in docs Makefile.


@ -106,7 +106,7 @@ Further detailed instructions, including setup in Azure, are here:
1. [Model diagnostics](docs/model_diagnostics.md)
1. [Move a model to a different workspace](docs/move_model.md)
1. [Working with FastMRI models](docs/fastmri.md)
1. [Active label cleaning and noise robust learning toolbox](InnerEye-DataQuality/README.md)
1. [Active label cleaning and noise robust learning toolbox](https://github.com/microsoft/InnerEye-DeepLearning/blob/1606729c7a16e1bfeb269694314212b6e2737939/InnerEye-DataQuality/README.md)
## Deployment
@ -133,7 +133,7 @@ Details can be found [here](docs/deploy_on_aml.md).
## Licensing
[MIT License](LICENSE)
[MIT License](/LICENSE)
**You are responsible for the performance, the necessary testing, and if needed any regulatory clearance for
any of the models produced by this toolbox.**
@ -157,7 +157,7 @@ Oktay O., Nanavati J., Schwaighofer A., Carter D., Bristow M., Tanno R., Jena R.
Bannur S., Oktay O., Bernhardt M, Schwaighofer A., Jena R., Nushi B., Wadhwani S., Nori A., Natarajan K., Ashraf S., Alvarez-Valle J., Castro D. C.: Hierarchical Analysis of Visual COVID-19 Features from Chest Radiographs. ICML 2021 Workshop on Interpretable Machine Learning in Healthcare. [https://arxiv.org/abs/2107.06618](https://arxiv.org/abs/2107.06618)
Bernhardt M., Castro D. C., Tanno R., Schwaighofer A., Tezcan K. C., Monteiro M., Bannur S., Lungren M., Nori S., Glocker B., Alvarez-Valle J., Oktay O.: Active label cleaning for improved dataset quality under resource constraints. [https://www.nature.com/articles/s41467-022-28818-3](https://www.nature.com/articles/s41467-022-28818-3). Accompanying code [InnerEye-DataQuality](InnerEye-DataQuality/README.md)
Bernhardt M., Castro D. C., Tanno R., Schwaighofer A., Tezcan K. C., Monteiro M., Bannur S., Lungren M., Nori S., Glocker B., Alvarez-Valle J., Oktay O.: Active label cleaning for improved dataset quality under resource constraints. [https://www.nature.com/articles/s41467-022-28818-3](https://www.nature.com/articles/s41467-022-28818-3). Accompanying code [InnerEye-DataQuality](https://github.com/microsoft/InnerEye-DeepLearning/blob/1606729c7a16e1bfeb269694314212b6e2737939/InnerEye-DataQuality/README.md)
## Contributing
@ -175,5 +175,3 @@ contact [opencode@microsoft.com](mailto:opencode@microsoft.com) with any additio
## This toolbox is maintained by the
[Microsoft Medical Image Analysis team](https://www.microsoft.com/en-us/research/project/medical-image-analysis/).


@ -9,8 +9,8 @@ create a directory `InnerEyeLocal` beside `InnerEye`.
As well as your configurations (dealt with below) you will need these files:
* `settings.yml`: A file similar to `InnerEye\settings.yml` containing all your Azure settings.
The value of `extra_code_directory` should (in our example) be `'InnerEyeLocal'`,
and model_configs_namespace should be `'InnerEyeLocal.ML.configs'`.
* A folder like `InnerEyeLocal` that contains your additional code, and model configurations.
* A file `InnerEyeLocal/ML/runner.py` that invokes the InnerEye training runner, but that points the code to your environment and Azure
settings.
@ -38,7 +38,7 @@ You will find a variety of model configurations [here](/InnerEye/ML/configs/segm
in `Base.py` reference open-sourced data and can be used as they are. Those ending in `Base.py`
are partially specified, and can be used by having other model configurations inherit from them and supply the missing
parameter values: a dataset ID at least, and optionally other values. For example, a `Prostate` model might inherit
very simply from `ProstateBase` by creating `Prostate.py` in the directory `InnerEyeLocal/ML/configs/segmentation`
with the following contents:
```python
from InnerEye.ML.configs.segmentation.ProstateBase import ProstateBase
@ -51,8 +51,8 @@ class Prostate(ProstateBase):
azure_dataset_id="name-of-your-AML-dataset-with-prostate-data")
```
The allowed parameters and their meanings are defined in [`SegmentationModelBase`](/InnerEye/ML/config.py).
The class name must be the same as the basename of the file containing it, so `Prostate.py` must contain `Prostate`.
In `settings.yml`, set `model_configs_namespace` to `InnerEyeLocal.ML.configs` so this config
is found by the runner.
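For reference, a minimal sketch of the two relevant `settings.yml` entries (all other settings omitted):
```yaml
extra_code_directory: 'InnerEyeLocal'
model_configs_namespace: 'InnerEyeLocal.ML.configs'
```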
A `Head and Neck` model might inherit from `HeadAndNeckBase` by creating `HeadAndNeck.py` with the following contents:
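For reference, a minimal sketch of what that file could contain, mirroring the `Prostate` example above (the module path and the exact constructor arguments accepted by `HeadAndNeckBase` are assumptions):
```python
from InnerEye.ML.configs.segmentation.HeadAndNeckBase import HeadAndNeckBase


class HeadAndNeck(HeadAndNeckBase):
    def __init__(self) -> None:
        # The dataset name is a placeholder; supply the ID of your own AzureML dataset.
        super().__init__(
            azure_dataset_id="name-of-your-AML-dataset-with-head-and-neck-data")
```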
@ -70,11 +70,11 @@ class HeadAndNeck(HeadAndNeckBase):
### Training a new model
* Set up your model configuration as above and update `azure_dataset_id` to the name of your Dataset in the AML workspace.
It is enough to put your dataset into blob storage. The dataset should be contained in a folder at the root of the datasets container.
The InnerEye runner will check if there is a dataset in the AzureML workspace already, and if not, generate it directly from blob storage.
* Train a new model, for example `Prostate`:
```shell script
```shell
python InnerEyeLocal/ML/runner.py --azureml --model=Prostate
```
@ -101,12 +101,12 @@ Conversely, for command line options that take a boolean argument, and that are
### Training using multiple machines
To speed up training in AzureML, you can use multiple machines, by specifying the additional
`--num_nodes` argument. For example, to use 2 machines to train, specify:
```shell script
```shell
python InnerEyeLocal/ML/runner.py --azureml --model=Prostate --num_nodes=2
```
On each of the 2 machines, all available GPUs will be used. Model inference will always use only one machine.
For the Prostate model, we observed a 2.8x speedup for model training when using 4 nodes, and a 1.65x speedup
when using 2 nodes.
### AzureML Run Hierarchy
@ -127,8 +127,8 @@ at the same time (provided that the cluster has capacity). This means that a com
takes as long as a single training run.
To start cross validation, you can either modify the `number_of_cross_validation_splits` property of your model,
or supply it on the command line: provide all the usual switches, and add `--number_of_cross_validation_splits=N`,
for some `N` greater than 1; a value of 5 is typical. This will start a
[HyperDrive run](https://docs.microsoft.com/en-us/azure/machine-learning/how-to-tune-hyperparameters): a parent
AzureML job, with `N` child runs that will execute in parallel. You can see the child runs in the AzureML UI in the
"Child Runs" tab.
@ -144,12 +144,12 @@ To train further with an already-created model, give the above command with the
--run_recovery_id=foo_bar:foo_bar_12345_abcd
```
The run recovery ID is of the form "experiment_id:run_id". When you trained your original model, it will have been
queued as a "Run" inside of an "Experiment". The experiment will be given a name derived from the branch name - for
example, branch `foo/bar` will queue a run in experiment `foo_bar`. Inside the "Tags" section of your run, you should
see an element `run_recovery_id`. It will look something like `foo_bar:foo_bar_12345_abcd`.
If you are recovering a HyperDrive run, the value of `--run_recovery_id` should be that of the parent run,
and `--number_of_cross_validation_splits` should have the same value as in the recovered run.
For example:
```
--run_recovery_id=foo_bar:HD_55d4beef-7be9-45d7-89a5-1acf1f99078a --start_epoch=120 --number_of_cross_validation_splits=5
@ -169,38 +169,38 @@ You will need to specify the registered model to run on using the `model_id` arg
version by clicking on `Registered Models` on the Details tab of a run in the AzureML UI.
The model id is of the form "model_name:model_version". Thus your command should look like this:
```shell script
```shell
python Inner/ML/runner.py --azureml --model=Prostate --cluster=my_cluster_name \
--no-train --model_id=Prostate:1
```
#### From local checkpoints:
To evaluate a model using one or more local checkpoints, use the `local_weights_path` argument to specify the path(s) to the
model checkpoint(s) on the local disk.
```shell script
```shell
python Inner/ML/runner.py --model=Prostate --no-train --local_weights_path=path_to_your_checkpoint
```
To run on multiple checkpoints (if you have trained an ensemble model), specify each checkpoint using the argument
`local_weights_path`.
```shell script
```shell
python Inner/ML/runner.py --model=Prostate --no-train --local_weights_path=path_to_first_checkpoint,path_to_second_checkpoint
```
#### From URLs:
To evaluate a model using one or more checkpoints each specified by a URL, use the `weights_url` argument to specify the
url(s) from which the model checkpoint(s) should be downloaded.
```shell script
```shell
python Inner/ML/runner.py --model=Prostate --no-train --weights_url=url_for_your_checkpoint
```
To run on multiple checkpoints (if you have trained an ensemble model), specify each checkpoint using the argument
`weights_url`.
```shell script
```shell
python Inner/ML/runner.py --model=Prostate --no-train --weights_url=url_for_first_checkpoint,url_for_second_checkpoint
```
#### Running a registered AzureML model on a single image on the local disk
To submit an AzureML run to apply a model to a single image on your local disc,
you can use the script `submit_for_inference.py`, with a command of this form:
```shell script
```shell
python InnerEye/Scripts/submit_for_inference.py --image_file ~/somewhere/ct.nii.gz --model_id Prostate:555 \
--settings ../somewhere_else/settings.yml --download_folder ~/my_existing_folder
```
@ -208,8 +208,8 @@ python InnerEye/Scripts/submit_for_inference.py --image_file ~/somewhere/ct.nii.
### Model Ensembles
An ensemble model will be created automatically and registered in the AzureML model registry whenever cross-validation
models are trained. The ensemble model creation is done by the child whose `cross_validation_split_index` is 0;
you can identify this child by looking at the "Child Runs" tab in the parent run page in AzureML.
To find the registered ensemble model, find the Hyperdrive parent run in AzureML. In the "Details" tab, there is an
entry for "Registered models", that links to the ensemble model that was just created. Note that each of the child runs
@ -225,12 +225,12 @@ and the generated posteriors are passed to the usual model testing downstream pi
Once your HyperDrive AzureML runs are completed, you can visualize the results by running the
[`plot_cross_validation.py`](/InnerEye/ML/visualizers/plot_cross_validation.py) script locally:
```shell script
```shell
python InnerEye/ML/visualizers/plot_cross_validation.py --run_recovery_id ... --epoch ...
```
filling in the run recovery ID of the parent run and the epoch number (one of the test epochs, e.g. the last epoch)
for which you want results plotted. The script will also output several `..._outliers.txt` files with all of the outliers
across the splits and a portal query to
find them in the production portal, and run statistical tests to compute the significance of differences between scores
across the splits and with respect to other runs that you specify. This is done for you during
the run itself (see below), but you can use the script post hoc to compare arbitrary runs
@ -241,18 +241,18 @@ and [`mann_whitney_test.py`](/InnerEye/Common/Statistics/mann_whitney_test.py).
## Where are my outputs and models?
* AzureML writes all its results to the storage account you have specified. Inside of that account, you will
find a container named `azureml`. You can access that with
[Azure StorageExplorer](https://azure.microsoft.com/en-us/features/storage-explorer/). The checkpoints and other
files of a run will be in folder `azureml/ExperimentRun/dcid.my_run_id`, where `my_run_id` is the "Run Id" visible in
the "Details" section of the run. If you want to download all the results files or a large subset of them,
we recommend you access them this way.
* The results can also be viewed in the "Outputs and Logs" section of the run. This is likely to be more
convenient for viewing and inspecting single files.
* All files that the model training writes to the `./outputs` folder are automatically uploaded at the end of
the AzureML training job, and are put into `outputs` in Blob Storage and in the run itself.
Similarly, what the model training writes to the `./logs` folder gets uploaded to `logs`.
* You can monitor the file system that is mounted on the compute node, by navigating to your
storage account in Azure. In the blade, click on "Files" and navigate through to `azureml/azureml/my_run_id`. This
will show all files that are mounted as the working directory on the compute VM.
The organization of the `outputs` directory is as follows:
@ -281,25 +281,25 @@ the `metrics.csv` files of the current run and the comparison run(s).
and `test_dataset.csv`, `train_dataset.csv` and `val_dataset.csv` for those subsets of it.
* `BaselineComparisonWilcoxonSignedRankTestResults.txt`, containing the results of comparisons
between the current run and any specified baselines (earlier runs) to compare with. Each paragraph of that file compares two models and
indicates, for each structure, when the Dice scores for the second model are significantly better
or worse than the first. For full details, see the
[source code](../InnerEye/Common/Statistics/wilcoxon_signed_rank_test.py).
* A directory `scatterplots`, containing a `png` file for every pairing of the current model
with one of the baselines. Each one is named `AAA_vs_BBB.png`, where `AAA` and `BBB` are the run IDs
of the two models. Each plot shows the Dice scores on the test set for the models.
* For both segmentation and classification models an IPython Notebook `report.ipynb` will be generated in the
`outputs` directory.
* For segmentation models, this report is based on the full image results of the model checkpoint that performed
the best on the validation set. This report will contain detailed metrics per structure, and outliers to help
model development.
* For classification models, the report is based on the validation and test results from the last epoch. It shows
metrics on the validation and test sets, ROC and PR Curves, and a list of the best and worst performing images
from the test set.
Ensemble models are created by the zero'th child (with `cross_validation_split_index=0`) in each
cross-validation run. Results from inference on the test and validation sets are uploaded to the
parent run, and can be found in `epoch_NNN` directories as above.
In addition, various scores and plots from the ensemble and from individual child
runs are uploaded to the parent run, in the `CrossValResults` directory. This contains:
* Subdirectories named 0, 1, 2, ... for all the child runs including the zero'th one, as well
as `ENSEMBLE`, containing their respective `epoch_NNN` directories.
@ -320,24 +320,24 @@ scatterplots for the ensemble, as described above for single runs.
### Augmentations for classification models.
For classification models, you can define an augmentation pipeline to apply to your image inputs (resp. segmentations) at
training, validation and test time. In order to define such a series of transformations, you will need to overload the
`get_image_transform` (resp. `get_segmention_transform`) method of your config class. This method expects you to return
a `ModelTransformsPerExecutionMode` that maps each execution mode to one transform function. We also provide
`ImageTransformationPipeline`, a class that creates a pipeline of transforms from a list of individual transforms and
ensures the correct conversion of 2D or 3D PIL.Image or tensor inputs to the obtained pipeline.
`ImageTransformationPipeline` takes two arguments for its constructor:
* `transforms`: a list of image transforms, in particular you can feed in standard [torchvision transforms](https://pytorch.org/vision/0.8/transforms.html) or
any other transforms as long as they support an input `[Z, C, H, W]` (where Z is the 3rd dimension (1 for 2D images),
C the number of channels, H and W the height and width of each 2D slice - this is supported for standard torchvision
transforms.). You can also define your own transforms as long as they expect such a `[Z, C, H, W]` input. You can
find some examples of custom transforms class in `InnerEye/ML/augmentation/image_transforms.py`.
* `use_different_transformation_per_channel`: if True, apply a different version of the augmentation pipeline
for each channel. If False, apply the same transformation to each channel. Defaults to False.
Below you can find an example of `get_image_transform` that would resize your input images to 256 x 256, and at
training time only apply random rotation of +/- 10 degrees, and apply some brightness distortion,
using standard pytorch vision transforms.
```python
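# Illustrative sketch only: the exact transform classes and constructor arguments
# below are assumptions; check InnerEye/ML/augmentation for the provided helpers.
from torchvision.transforms import ColorJitter, RandomAffine, Resize

def get_image_transform(self) -> ModelTransformsPerExecutionMode:
    # Resize to 256 x 256; at training time also rotate by +/- 10 degrees and jitter brightness.
    return ModelTransformsPerExecutionMode(
        train=ImageTransformationPipeline(
            transforms=[Resize(256), RandomAffine(degrees=10), ColorJitter(brightness=0.2)]),
        val=ImageTransformationPipeline(transforms=[Resize(256)]),
        test=ImageTransformationPipeline(transforms=[Resize(256)]))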
@ -353,9 +353,9 @@ def get_image_transform(self) -> ModelTransformsPerExecutionMode:
### Segmentation Models and Inference.
By default when building a segmentation model a full image inference will be performed on the validation and test data sets;
and when building an ensemble model, a full image inference will be performed on the test data set only (because the
training and validation sets are first combined before being split into each of the folds).
There are a total of six command line options for controlling this in more detail.
For non-ensemble models use any of the following command line options to enable or disable inference on training, test, or validation data sets:


@ -2,21 +2,21 @@
### Using TensorBoard to monitor AzureML jobs
* **Existing jobs**: execute [`InnerEye/Azure/tensorboard_monitor.py`](/InnerEye/Azure/tensorboard_monitor.py)
with either an experiment id `--experiment_name` or a list of run ids `--run_ids job1,job2,job3` (see the example
after this list). If an experiment id is provided then all of the runs in that experiment will be monitored.
Additionally, you can also filter runs by the run's status, setting the `--filters Running,Completed` parameter to a
subset of `[Running, Completed, Failed, Canceled]`. By default, Failed and Canceled runs are excluded.
To quickly access this script from PyCharm, there is a template PyCharm run configuration
`Template: Tensorboard monitoring` in the repository. Create a copy of that, and modify the commandline
arguments with your jobs to monitor.
* **New jobs**: when queuing a new AzureML job, pass `--tensorboard`, which will automatically start a new TensorBoard
session, monitoring the newly queued job.
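For example, to monitor two existing runs (the run ids are placeholders):
```shell
python InnerEye/Azure/tensorboard_monitor.py --run_ids job1,job2
```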
### Resource Monitor
GPU and CPU usage can be monitored throughout the execution of a run (local and AML) by setting the monitoring interval
for the resource monitor eg: `--monitoring_interval_seconds=5`. This will spawn a separate process at the start of the
run which will log both GPU and CPU utilization and memory consumption. These metrics will be written to AzureML as
well as a separate TensorBoard logs file under `Diagnostics`.
@ -26,12 +26,12 @@ well as a separate TensorBoard logs file under `Diagnostics`.
For full debugging of any non-trivial model, you will need a GPU. Some basic debugging can also be carried out on
standard Linux or Windows machines.
The main entry point into the code is [`InnerEye/ML/runner.py`](/InnerEye/ML/runner.py). The code takes its
configuration elements from commandline arguments and a settings file,
[`InnerEye/settings.yml`](/InnerEye/settings.yml).
A password for the (optional) Azure Service
Principal is read from `InnerEyeTestVariables.txt` in the repository root directory. The file
is expected to contain a line of the form
```
APPLICATION_KEY=<app key for your AML workspace>
@ -48,7 +48,7 @@ create a copy of the template run configuration, and change the arguments to sui
Here are a few hints how you can reduce the complexity of training if you need to debug an issue. In most cases,
you should then be able to rely on a CPU machine.
* Reduce the number of feature channels in your model. If you run a UNet, for example, you can set
`feature_channels = [1]` in your model definition file.
* Train only for a single epoch. You can set `--num_epochs=1` via the commandline or the `more_switches` variable
if you start your training via a build definition. This will only create a model checkpoint at epoch 1, and ignore
@ -63,7 +63,7 @@ With the above settings, you should be able to get a model training run to compl
### Verify your changes using a simplified fast model
If you made any changes to the code that submits experiments (either `azure_runner.py` or `runner.py` or code
imported by those), validate them using a model training run in Azure. You can queue a model training run for the
simplified `BasicModel2Epochs` model.
@ -71,8 +71,8 @@ simplified `BasicModel2Epochs` model.
It is sometimes possible to get a Python debugging (pdb) session on the main process for a model
training run on an AzureML compute cluster, for example if a run produces unexpected output,
or is silent for what seems like an unreasonably long time. For this to work, you will need to
have created the cluster with ssh access enabled; it is not currently possible to add this
after the cluster is created. The steps are as follows.
* From the "Details" tab in the run's page, note the Run ID, then click on the target name under
@ -82,13 +82,13 @@ after the cluster is created. The steps are as follows.
supply the password chosen when the cluster was created.
* Type "bash" for a nicer command shell (optional).
* Identify the main python process with a command such as
```shell script
```shell
ps aux | grep 'python.*runner.py' | egrep -wv 'bash|grep'
```
You may need to vary this if it does not yield exactly one line of output.
* Note the process identifier (the value in the PID column, generally the second one).
* Issue the commands
```shell script
```shell
kill -TRAP nnnn
nc 127.0.0.1 4444
```


@ -17,7 +17,7 @@ AWS into Azure blob storage.
## Registering for the challenge
In order to download the dataset, you need to register [here](https://fastmri.org/dataset/).
You will shortly receive an email with links to the dataset. In that email, there are two sections containing
scripts to download the data, like this:
```
To download Knee MRI files, we recommend using curl with recovery mode turned on:
@ -25,22 +25,22 @@ curl -C "https://....amazonaws.com/knee_singlecoil_train.tar.gz?AWSAccessKeyId=.
...
```
There are two sections of that kind, one for the knee data and one for the brain data. Copy and paste *all* the lines
with `curl` commands into a text file, for example called `curl.txt`. In total, there should be 10 lines with `curl`
commands for the knee data, and 7 for the brain data (including the SHA256 file).
## Download the dataset directly to blob storage via Azure Data Factory
We are providing a script that will bulk download all files in the FastMRI dataset from AWS to Azure blob storage.
To start that script, you need
- The file that contains all the `curl` commands to download the data (see above). The downloading script will
extract all the AWS access tokens from the `curl` commands.
- The connection string to the Azure storage account that stores your dataset.
- To get that, navigate to the [Azure Portal](https://portal.azure.com), and search for the storage account
that you created to hold your datasets (Step 4 in [AzureML setup](setting_up_aml.md)).
- On the left hand navigation, there is a section "Access Keys", select that and copy out the connection string
(sanity check: it should look something like `DefaultEndpointsProtocol=....==;EndpointSuffix=core.windows.net`)
- The Azure location where the Data Factory should be created (for example "westeurope"). The Data Factory should
live in the same Azure location as your AzureML workspace and storage account. To check the location,
find the workspace in the [Azure Portal](https://portal.azure.com), the location is shown on the overview page.
Then run the script to download the dataset as follows, providing the path to the file with the curl commands
@ -57,21 +57,21 @@ you supplied, and uncompress them.
- Run all the pipelines and delete the Data Factory.
This whole process can take a few hours to complete. It will print progress information every 30 seconds to the console.
Alternatively, find the Data Factory "fastmri-copy-data" in your Azure portal, and click on the "Monitor" icon to
drill down into all running pipelines.
Once the script is complete, you will have the following datasets in Azure blob storage:
- `knee_singlecoil`, `knee_multicoil`, and `brain_multicoil` with all files unpacked
- `knee_singlecoil_compressed`, `knee_multicoil_compressed`, and `brain_multicoil_compressed` with the `.tar` and
`.tar.gz` files as downloaded. NOTE: The raw challenge data files all have a `.tar.gz` extension, even though some
of them are plain (uncompressed) `.tar` files. The pipeline corrects these mistakes and puts the files into blob storage
with their corrected extension.
- The DICOM files are stored in the folders `knee_DICOMs` and `brain_DICOMs` (uncompressed) and
`knee_DICOMs_compressed` and `brain_DICOMs_compressed` (as `.tar` files)
### Troubleshooting the data downloading
If you see a runtime error saying "The subscription is not registered to use namespace 'Microsoft.DataFactory'", then
follow the steps described [here](https://stackoverflow.com/a/48419951/5979993), to enable DataFactory for your
subscription.
@ -83,19 +83,19 @@ If set up correctly, this is the Azure storage account that holds all datasets u
Hence, after the downloading completes, you are ready to use the InnerEye toolbox to submit an AzureML job that uses
the FastMRI data.
There are 2 example models already coded up in the InnerEye toolbox, defined in
[fastmri_varnet.py](../InnerEye/ML/configs/other/fastmri_varnet.py): `KneeMulticoil` and
`BrainMulticoil`. As with all InnerEye models, you can start a training run by specifying the name of the class
that defines the model, like this:
```shell script
```shell
python InnerEye/ML/runner.py --model KneeMulticoil --azureml --num_nodes=4
```
This will start an AzureML job with 4 nodes training at the same time. Depending on how you set up your compute
cluster, this will use a different number of GPUs: For example, if your cluster uses ND24 virtual machines, where
each VM has 4 Tesla P40 cards, training will use a total of 16 GPUs.
As common with multiple nodes, training time will not scale linearly with increased number of nodes. The following
table gives a rough overview of time to train 1 epoch of the FastMri model in the InnerEye toolbox
on our cluster (`Standard_ND24s` nodes with 4 Tesla P40 cards):
| Step | 1 node (4 GPUs) | 2 nodes (8 GPUs) | 4 nodes (16 GPUs) | 8 nodes (32 GPUs) |
@ -106,7 +106,7 @@ on our cluster (`Standard_ND24s` nodes with 4 Tesla P40 cards):
| Total time for 1 epoch | 5h 5min | 3h 5min | 1h 58min | 1h 26min |
| Total time for 50 epochs | 9 days | 4.6 days | 2.3 days | 1.2 days|
Note that the download times depend on the type of Azure storage account that your workspace is using. We recommend
using Premium storage accounts for optimal performance.
You can avoid the time to download the dataset, by specifying that the data is always read on-the-fly from the network.
@ -120,36 +120,36 @@ when training on 8 nodes in parallel. For more details around dataset mounting p
Training a FastMri model on the `brain_multicoil` dataset is particularly challenging because the dataset is larger.
Downloading the dataset can - depending on the types of nodes - already make the nodes go out of disk space.
The InnerEye toolbox has a way of working around that problem, by reading the dataset on-the-fly from the network,
rather than downloading it at the start of the job. You can trigger this behaviour by supplying an additional
commandline argument `--use_dataset_mount`, for example:
```shell script
```shell
python InnerEye/ML/runner.py --model BrainMulticoil --azureml --num_nodes=4 --use_dataset_mount
```
With this flag, the InnerEye training script will start immediately, without downloading data beforehand.
However, the fastMRI data module generates a cache file before training, and to build that, it needs to traverse the
full dataset. This will lead to a long (1-2 hours) startup time before starting the first epoch, while it is
creating this cache file. This can be avoided by copying the cache file from a previous run into the dataset folder.
More specifically, you need to follow these steps:
* Start a training job, training for only 1 epoch, like
```shell script
```shell
python InnerEye/ML/runner.py --model BrainMulticoil --azureml --use_dataset_mount --num_epochs=1
```
* Wait until the job has finished creating the cache file - the job will print out a message
"Saving dataset cache to dataset_cache.pkl", visible in the log file `azureml-logs/70_driver_log.txt`, about 1-2 hours
after start. At that point, you can cancel the job.
* In the "Outputs + logs" section of the AzureML job, you will now see a file `outputs/dataset_cache.pkl` that has
been produced by the job. Download that file.
* Upload the file `dataset_cache.pkl` to the storage account that holds the fastMRI datasets, in the `brain_multicoil`
folder that was previously created by the Azure Data Factory. You can do that via the Azure Portal or Azure Storage
Explorer. Via the Azure Portal, you can search for the storage account that holds your data, then select
"Data storage: Containers" in the left hand navigation. You should see a folder named `datasets`, and inside of that
`brain_multicoil`. Once in that folder, press the "Upload" button at the top and select the `dataset_cache.pkl` file.
* Start the training job again, this time you can start multi-node training right away, like this:
```shell script
```shell
python InnerEye/ML/runner.py --model BrainMulticoil --azureml --use_dataset_mount --num_nodes=8
```
This job should pick up the existing cache file, and output a message like "Copying a pre-computed dataset cache
file ..."
The same trick can of course be applied to other models as well (`KneeMulticoil`).
@ -157,20 +157,20 @@ The same trick can of course be applied to other models as well (`KneeMulticoil`
# Running on a GPU machine
You can of course run the InnerEye fastMRI models on a reasonably large machine with a GPU for development and
debugging purposes. Before running, we recommend downloading the datasets using a tool
like [azcopy](http://aka.ms/azcopy) into a folder, for example the `datasets` folder at the repository root.
To use `azcopy`, you will need the access key to the storage account that holds your data - it's the same storage
account that was used when creating the Data Factory that downloaded the data.
- To get that, navigate to the [Azure Portal](https://portal.azure.com), and search for the storage account
that you created to hold your datasets (Step 4 in [AzureML setup](setting_up_aml.md)).
- On the left hand navigation, there is a section "Access Keys". Select that and copy out one of the two keys (_not_
the connection strings). The key is a base64 encoded string, it should not contain any special characters apart from
`+`, `/`, `.` and `=`
Then run this script in the repository root folder:
```shell script
```shell
mkdir datasets
azcopy --source-key <storage_account_key> --source https://<your_storage_acount>.blob.core.windows.net/datasets/brain_multicoil --destination datasets/brain_multicoil --recursive
```
@ -178,7 +178,7 @@ Replace `brain_multicoil` with any of the other datasets names if needed.
If you follow these suggested folder structures, there is no further change necessary to the models. You can then
run, for example, the `BrainMulticoil` model by dropping the `--azureml` flag like this:
```shell script
```shell
python InnerEye/ML/runner.py --model BrainMulticoil
```
The code will recognize that an Azure dataset named `brain_multicoil` is already present in the `datasets` folder,


@ -1,21 +1,21 @@
# Using the InnerEye code as a git submodule of your project
You can use InnerEye as a submodule in your own project.
If you go down that route, here's the list of files you will need in your project (that's the same as those
given in [this document](building_models.md))
* `environment.yml`: Conda environment with python, pip, pytorch
* `settings.yml`: A file similar to `InnerEye\settings.yml` containing all your Azure settings
* A folder like `ML` that contains your additional code, and model configurations.
* A file like `myrunner.py` that invokes the InnerEye training runner, but that points the code to your environment
and Azure settings; see the [Building models](building_models.md) instructions for details. Please see below for what
`myrunner.py` should look like.
You then need to add the InnerEye code as a git submodule, in folder `innereye-deeplearning`:
```shell script
```shell
git submodule add https://github.com/microsoft/InnerEye-DeepLearning innereye-deeplearning
```
Then configure your Python IDE to consume *both* your repository root *and* the `innereye-deeplearning` subfolder as inputs.
In Pycharm, you would do that by going to Settings/Project Structure. Mark your repository root as "Source", and
`innereye-deeplearning` as well.
Example commandline runner that uses the InnerEye runner (called `myrunner.py` above):
@ -24,7 +24,7 @@ import sys
from pathlib import Path
# This file here mimics how the InnerEye code would be used as a git submodule.
# Ensure that this path correctly points to the root folder of your repository.
repository_root = Path(__file__).absolute()
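Since the diff only shows fragments of that file, here is a minimal sketch of a complete `myrunner.py`, assuming that `InnerEye.ML.runner` exposes a `run()` entry point taking the project root and the settings file; check the InnerEye runner for its exact signature:
```python
import sys
from pathlib import Path

# This file mimics how the InnerEye code would be used as a git submodule.
# Ensure that this path correctly points to the root folder of your repository.
repository_root = Path(__file__).absolute().parent
# Make both the repository root and the InnerEye submodule importable.
sys.path.insert(0, str(repository_root / "innereye-deeplearning"))
sys.path.insert(0, str(repository_root))


def main() -> None:
    from InnerEye.ML import runner
    # Assumed signature: point the runner at this project and its settings.yml.
    runner.run(project_root=repository_root,
               yaml_config_file=repository_root / "settings.yml")


if __name__ == '__main__':
    main()
```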
@ -70,11 +70,11 @@ if __name__ == '__main__':
1. Set up a directory outside of InnerEye to hold your configs. In your repository root, you could have a folder
`InnerEyeLocal`, parallel to the InnerEye submodule, alongside `settings.yml` and `myrunner.py`.
The example below creates a new flavour of the Glaucoma model in `InnerEye/ML/configs/classification/GlaucomaPublic`.
All that needs to be done is change the dataset. We will do this by subclassing GlaucomaPublic in a new config
stored in `InnerEyeLocal/configs`
1. Create folder `InnerEyeLocal/configs`
1. Create a config file `InnerEyeLocal/configs/GlaucomaPublicExt.py` which extends the `GlaucomaPublic` class
like this:
```python
from InnerEye.ML.configs.classification.GlaucomaPublic import GlaucomaPublic
@ -83,12 +83,12 @@ class MyGlaucomaModel(GlaucomaPublic):
def __init__(self) -> None:
super().__init__()
self.azure_dataset_id="name_of_your_dataset_on_azure"
```
1. In `settings.yml`, set `model_configs_namespace` to `InnerEyeLocal.configs` so this config
is found by the runner. Set `extra_code_directory` to `InnerEyeLocal`.
#### Start Training
Run the following to start a job on AzureML:
```
python myrunner.py --azureml --model=MyGlaucomaModel
```


@ -200,7 +200,7 @@ Leave all other fields as they are for now.
Summarizing, here is what the file should look like:
```yml
```yaml
variables:
tenant_id: '<Azure tenant ID of your company>'
subscription_id: '<Azure subscription ID that your project is using>'


@ -10,29 +10,28 @@ def replace_in_file(filepath: Path, original_str: str, replace_str: str) -> None
"""
Replace all occurrences of the original_str with replace_str in the file provided.
"""
with filepath.open('r') as file:
text = file.read()
text = filepath.read_text()
text = text.replace(original_str, replace_str)
with filepath.open('w') as file:
file.write(text)
filepath.write_text(text)
if __name__ == '__main__':
sphinx_root = Path(__file__).absolute().parent
repository_root = sphinx_root.parent
md_root = sphinx_root / "source/md"
markdown_root = sphinx_root / "source" / "md"
repository_url = "https://github.com/microsoft/InnerEye-DeepLearning"
# Create directories source/md and source/md/docs where files will be copied to
if md_root.exists():
shutil.rmtree(md_root)
md_root.mkdir()
if markdown_root.exists():
shutil.rmtree(markdown_root)
markdown_root.mkdir()
# copy README.md and doc files
shutil.copyfile(repository_root / "README.md", md_root / "README.md")
shutil.copytree(repository_root / "docs", md_root / "docs")
shutil.copy(repository_root / "README.md", markdown_root)
shutil.copy(repository_root / "CHANGELOG.md", markdown_root)
shutil.copytree(repository_root / "docs", markdown_root / "docs")
# replace links to files in repository with urls
md_file_list = md_root.rglob("*.md")
for filepath in md_file_list:
md_files = markdown_root.rglob("*.md")
for filepath in md_files:
replace_in_file(filepath, "](/", f"]({repository_url}/blob/main/")
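For illustration, the rewrite above turns a repository-relative link such as `[MIT License](/LICENSE)` into an absolute GitHub URL; a minimal standalone check of the same string replacement:
```python
# Standalone sketch of the link rewrite performed by the loop above (not part of the script).
repository_url = "https://github.com/microsoft/InnerEye-DeepLearning"
text = "[MIT License](/LICENSE)"
print(text.replace("](/", f"]({repository_url}/blob/main/"))
# Prints: [MIT License](https://github.com/microsoft/InnerEye-DeepLearning/blob/main/LICENSE)
```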


@ -13,11 +13,12 @@
# If extensions (or modules to document with autodoc) are in another directory,
# add these directories to sys.path here. If the directory is relative to the
# documentation root, use os.path.abspath to make it absolute, like shown here.
# documentation root, make it absolute.
#
# import os
# import sys
# sys.path.insert(0, os.path.abspath('.'))
import sys
from pathlib import Path
repo_dir = Path(__file__).absolute().parents[2]
sys.path.insert(0, str(repo_dir))
# -- Imports -----------------------------------------------------------------
@ -64,7 +65,7 @@ html_theme = 'sphinx_rtd_theme'
# Add any paths that contain custom static files (such as style sheets) here,
# relative to this directory. They are copied after the builtin static files,
# so a file named "default.css" will overwrite the builtin "default.css".
html_static_path = ['_static']
# html_static_path = ['_static']
source_parsers = {
'.md': CommonMarkParser,


@ -8,7 +8,7 @@ InnerEye-DeepLearning Documentation
.. toctree::
:maxdepth: 1
:caption: Contents:
:caption: Contents
md/README.md
md/docs/WSL.md
@ -21,13 +21,13 @@ InnerEye-DeepLearning Documentation
.. toctree::
:maxdepth: 1
:caption: About Model Configs:
:caption: About Model Configs
rst/configs.rst
.. toctree::
:maxdepth: 1
:caption: Further reading for contributors:
:caption: Further reading for contributors
md/docs/pull_requests.md
md/docs/testing.md