
InnerEye-DataQuality

Contents of this sub-repository:

This folder contains all the source code associated with the manuscript "Bernhardt et al.: Active label cleaning: Improving dataset quality under resource constraints".

In particular, this folder provides the tools for:

  1. Label-noise-robust training (e.g. co-teaching, ELR, and self-supervised pretraining and finetuning capabilities).
  2. The label cleaning simulation benchmark proposed in the above-mentioned manuscript.
  3. The model selection benchmark.
  4. All the code related to the proposed benchmark datasets "CIFAR10H" and "NoisyChestXray".

Installation:

Clone the InnerEye-DeepLearning repository to your local disk and move to the InnerEye-DataQuality folder:

git clone https://github.com/microsoft/InnerEye-DeepLearning
cd InnerEye-DeepLearning/InnerEye-DataQuality

Set up the InnerEyeDataQuality Python environment. Note that this repository uses a dedicated conda environment, independent of the InnerEye environment:

python create_environment.py
conda activate InnerEyeDataQuality
pip install -e .
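
To quickly check that the environment and the package are installed correctly, you can try importing the package (the module name is assumed here from the folder layout and setup.py):

python -c "import InnerEyeDataQuality"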

Benchmark datasets:

CIFAR10H

The CIFAR10H dataset consists of the images from the original CIFAR10 test set, each labelled by multiple annotators. We use the CIFAR10 training set as the clean test set to evaluate our trained models.
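
For illustration only, the snippet below shows one way multi-annotator vote counts of the kind CIFAR10H provides could be turned into a per-image label distribution and a sampled noisy label; the array shapes and variable names are assumptions, not the repository's dataset class.

import numpy as np

rng = np.random.default_rng(seed=0)

# Hypothetical vote counts of shape (num_images, num_classes): how many of the
# multiple annotators picked each class for each CIFAR10 test image.
vote_counts = rng.integers(1, 5, size=(10, 10))

# Normalise votes into a per-image label distribution.
label_distribution = vote_counts / vote_counts.sum(axis=1, keepdims=True)

# A noisy label can be drawn from the annotator distribution, while the
# majority vote serves as a less noisy reference label.
sampled_labels = np.array([rng.choice(10, p=p) for p in label_distribution])
majority_labels = vote_counts.argmax(axis=1)
print(sampled_labels, majority_labels)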

Noisy Chest-Xray

The images released as part of the Kaggle RSNA Pneumonia Detection Challenge were originally part of the NIH Chest X-ray dataset. For the competition, 30k images were selected and their labels adjudicated, adding bounding boxes that indicate "pneumonia-like opacities". The NoisyChestXray dataset builds on the Kaggle data: the noisy labels are the original NIH labels prior to adjudication, and the clean labels are the adjudicated Kaggle labels.

The NIH dataset originally has 14 classes; we created a new binary label marking each image as "pneumonia-like" or "non-pneumonia-like" depending on its original label prior to adjudication. The original (binarized) labels, along with their corresponding adjudicated labels, can be created with create_noisy_chestxray_dataset.py (see the pre-requisites section below). The dataset class for this benchmark is defined in noisy_kaggle_cxr.py; it automatically loads the noisy labels from the aforementioned file, provided it has been created beforehand (see the next section).
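
As a minimal, hedged illustration of the binarization step described above (the grouping of NIH findings into "pneumonia-like" is an assumption for this sketch, not necessarily the exact mapping used in create_noisy_chestxray_dataset.py):

# Assumed grouping of NIH findings counted as pneumonia-like opacities.
PNEUMONIA_LIKE = {"Pneumonia", "Consolidation", "Infiltration"}

def binarize_nih_label(finding_labels: str) -> int:
    """Map an NIH 'Finding Labels' string (e.g. 'Infiltration|Effusion')
    to 1 for pneumonia-like and 0 for non-pneumonia-like."""
    findings = set(finding_labels.split("|"))
    return int(bool(findings & PNEUMONIA_LIKE))

print(binarize_nih_label("Infiltration|Effusion"))  # -> 1
print(binarize_nih_label("No Finding"))             # -> 0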

Pre-requisites for using this dataset

  1. The code assumes that the RSNA Pneumonia Detection Challenge dataset is present on your machine. You will need to download it first from the Kaggle page to the dataset_dir of your choice.
  2. You will need to create the noisy dataset CSV file, which contains the noisy labels (derived from the NIH dataset) and their clean counterparts (from the challenge data). To do so, you will first need to download the following files:
  3. Update the dataset_dir field in the corresponding model configs (see the sketch after this list).
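
The sketch below illustrates the dataset_dir update from step 3; the config path and the nesting of the dataset_dir field are assumptions, so adapt them to the config you actually use:

from pathlib import Path
import yaml

# Hypothetical config path; substitute the model config you actually use.
config = yaml.safe_load(Path("configs/models/my_model_config.yaml").read_text())

# The nesting of the dataset_dir field is an assumption; adjust as needed.
dataset_dir = Path(config["dataset"]["dataset_dir"])
assert dataset_dir.is_dir(), f"Download the challenge data to {dataset_dir} first"
print("Found", sum(1 for _ in dataset_dir.iterdir()), "entries in", dataset_dir)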

Chest X-ray datasets for model pre-training

Full Kaggle Pneumonia Detection challenge dataset:

In a subset of experiments, for unsupervised pretraining of chest X-ray models, the code uses the Kaggle training set (stage 1) from the Pneumonia Detection Challenge. The dataset class for this dataset is defined in kaggle_cxr.py; it loads the full set with binary labels derived from the bounding boxes provided for the competition.
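
As a sketch of how a binary image-level label can be derived from the competition's bounding-box annotations (the file name and column names follow the Kaggle CSV but are assumptions here, and this is not the kaggle_cxr.py implementation):

import pandas as pd

# Assumed Kaggle annotation file with one row per bounding box (NaN if none).
boxes = pd.read_csv("stage_1_train_labels.csv")

# An image is labelled positive if it has at least one bounding box.
binary_labels = boxes.groupby("patientId")["x"].apply(lambda col: int(col.notna().any()))
print(binary_labels.head())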

  1. The code assumes that the RSNA Pneumonia Detection Challenge dataset is present on your machine. You will need to download it first from the Kaggle page to the dataset_dir of your choice.
  2. Update the dataset_dir field in the corresponding model configs.

NIH Chest-Xray dataset:

In a subset of experiments, for unsupervised pretraining of chest X-ray models, the code uses the NIH Chest X-ray dataset.

  1. The code assumes that the NIH Chest X-ray dataset is present on your machine. You will need to download the data from its dedicated Kaggle page to the dataset_dir of your choice.
  2. Update the dataset_dir field in the corresponding model configs.

Noise Robust Learning

In this section, we provide details on how to train noise-robust supervised models with this repository. In particular, the code supports Co-Teaching, Early Learning Regularization (ELR), and finetuning of self-supervised (SSL) pretrained models. We also provide off-the-shelf configurations matching the experiments presented in the paper for the CIFAR10H and NoisyChestXray benchmarks.
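
As a reminder of the idea behind co-teaching (shown for illustration only, not the repository's implementation): two networks are trained jointly, and each one is updated only on the samples that its peer currently assigns a small loss, i.e. the samples the peer believes are correctly labelled.

import torch
import torch.nn.functional as F

def coteaching_losses(logits_a, logits_b, targets, forget_rate):
    """Small-loss exchange: each network learns from its peer's 'clean' samples."""
    per_sample_a = F.cross_entropy(logits_a, targets, reduction="none")
    per_sample_b = F.cross_entropy(logits_b, targets, reduction="none")
    num_keep = int((1.0 - forget_rate) * targets.size(0))
    keep_a = per_sample_a.argsort()[:num_keep]  # samples network A trusts
    keep_b = per_sample_b.argsort()[:num_keep]  # samples network B trusts
    loss_a = F.cross_entropy(logits_a[keep_b], targets[keep_b])
    loss_b = F.cross_entropy(logits_b[keep_a], targets[keep_a])
    return loss_a, loss_b

logits_a, logits_b = torch.randn(8, 10), torch.randn(8, 10)
targets = torch.randint(0, 10, (8,))
print(coteaching_losses(logits_a, logits_b, targets, forget_rate=0.25))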

Training noise robust supervised models

The main entry point for training a supervised model is train.py. The code requires a config file specifying the dataset to use, the training specification (batch size, scheduler, etc.), the type of training, and the augmentations to use. To launch a training job, use the following command:

python InnerEyeDataQuality/deep_learning/train.py  --config <path to config>

Please check the Readme file to learn more about how to configure model training experiments.

Label cleaning simulation benchmark

To run the label cleaning simulation, run main_simulation with a list of selector configs in the --config argument and a list of seeds for sampling in the --seeds argument. A selector config specifies which selector to use and which model config to use for inference. All selector configs can be found in the configs/selection folder (more details below).

For example, with the following command the benchmark is run 3 times (with 3 different seeds) for each selector specified by its selector config (here we run it with 2 selectors). The simulation results are plotted on the same graph, aggregated per selector. By default, the resulting graphs can be found in ROOT/logs/main_simulation_benchmark/TIME-STAMP.

python InnerEyeDataQuality/main_simulation.py --config <path/config1> <path/config2> --seeds 1 2 3

This will by default clean the training set associated with your config. If you wish to clean the validation set instead, add the --on-val-set flag to your command. Please check the Readme file to learn more about how to configure the simulation benchmark for label cleaning.
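
Conceptually, each simulation round ranks samples by how suspicious the selector finds their current label and spends a fixed relabelling budget on the most suspicious ones. The toy sketch below illustrates this loop with random scores; in the repository the scores come from the selector models configured above.

import numpy as np

rng = np.random.default_rng(1)
num_samples, budget = 100, 20
true_labels = rng.integers(0, 10, num_samples)
noisy_labels = true_labels.copy()
flipped = rng.choice(num_samples, 30, replace=False)
noisy_labels[flipped] = rng.integers(0, 10, 30)           # inject label noise

# Placeholder suspicion scores; a real selector would score label noisiness
# using a trained model's predictions.
suspicion = rng.random(num_samples)

to_relabel = np.argsort(-suspicion)[:budget]              # most suspicious first
cleaned_labels = noisy_labels.copy()
cleaned_labels[to_relabel] = true_labels[to_relabel]      # simulate re-annotation
print("labels fixed:", int((cleaned_labels == true_labels).sum()
                           - (noisy_labels == true_labels).sum()))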

Model selection benchmark

The repository also provides the capability to run the model selection benchmark described in the paper. In particular, the script first evaluates each model on the original noisy validation set. It then loads the cleaned validation labels (obtained by running one selector on the noisy validation set) and re-runs the evaluation on this cleaned dataset. Finally, the script reports metrics for all models evaluated on both the noisy and the cleaned data.

Running the benchmark:

  • [Pre-requisite] Prior to running the benchmark, there are a few steps to complete first:
    • You will need to first train one (or several) selector models using a model config of your choice, and then run the corresponding cleaning simulation on the validation set to obtain cleaned labels for your validation set.
      • Note: make sure you specify an output_directory in your selector config prior to running the cleaning simulation, so that the model benchmark can retrieve your cleaned labels.
    • You will also need to train the classifier models you wish to compare in your benchmark. Note that this model choice is independent of the model you chose to clean your data.
  • Once you have completed the previous steps, you can run the benchmark with the following command:
python InnerEyeDataQuality/model_selection_benchmark.py --config <path-to-model-config1> <path-to-model-config2> --curated-label-config <path-to-selector-config-used-to-clean-your-data>

Example model and selector configs for this benchmark on the CIFAR10H dataset can be found in configs/models/benchmark3_idn.
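
For illustration, the comparison the benchmark performs boils down to scoring the same model predictions once against the noisy and once against the cleaned validation labels; the arrays and the accuracy metric below are placeholders, not the repository's evaluation code.

import numpy as np

def accuracy(predictions, labels):
    return float((predictions == labels).mean())

noisy_val_labels = np.array([0, 1, 1, 0, 1, 0])
cleaned_val_labels = np.array([0, 1, 0, 0, 1, 1])          # after label cleaning
model_predictions = {
    "model_a": np.array([0, 1, 0, 0, 1, 1]),
    "model_b": np.array([0, 1, 1, 0, 1, 0]),
}

# Model ranking can flip between the noisy and the cleaned validation labels.
for name, preds in model_predictions.items():
    print(name,
          "noisy val acc:", accuracy(preds, noisy_val_labels),
          "cleaned val acc:", accuracy(preds, cleaned_val_labels))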

Self supervised pretraining for noise robust learning

In this subfolder you will find the source code to pre-train models using SimCLR or BYOL self-supervision methods.
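
As a reminder of the objective SimCLR optimises (shown for illustration only; the repository relies on PyTorch Lightning and Lightning Bolts rather than this sketch), here is a minimal NT-Xent loss over two augmented views of the same batch:

import torch
import torch.nn.functional as F

def nt_xent(z1, z2, temperature=0.5):
    """z1, z2: (N, D) projections of two augmented views of the same images."""
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)      # (2N, D)
    similarities = z @ z.t() / temperature
    similarities.fill_diagonal_(float("-inf"))              # exclude self-pairs
    n = z1.size(0)
    positives = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)])
    return F.cross_entropy(similarities, positives)

z1, z2 = torch.randn(4, 16), torch.randn(4, 16)
print(nt_xent(z1, z2))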

General

For the unsupervised training of the models, we rely on PyTorch Lightning and PyTorch Lightning Bolts. The main entry point for model training is InnerEyeDataQuality/deep_learning/self_supervised/main.py. You will also need to provide an SSL model config file specifying, among other things, which dataset to use. All arguments available for the config are listed in ssl_model_config.py.

To launch a training job simply run:

python InnerEyeDataQuality/deep_learning/self_supervised/main.py --config path/to/ssl_config

Please check the Readme file to learn more about how to configure the self-supervised (SSL) training.