ENH: Add code to create montages from WSI (#787)

This commit is contained in:
Anton Schwaighofer 2023-02-09 19:50:07 +00:00 committed by GitHub
Parent 7b9a0d2453
Commit a055dfae9f
No known key found for this signature
GPG key ID: 4AEE18F83AFDEB23
24 changed files with 1318 additions and 2 deletions

.github/workflows/cpath-pr.yml vendored

@@ -303,6 +303,22 @@ jobs:
cd ${{ env.folder }}
make smoke_test_tiles_panda_no_ddp_sampler_aml
smoke_test_montage_creation:
runs-on: ubuntu-20.04
needs: [ cancel-azureml ]
steps:
- uses: actions/checkout@v3
with:
lfs: true
- name: Prepare Conda environment
uses: ./.github/actions/prepare_cpath_environment
- name: smoke test
run: |
cd ${{ env.folder }}
make smoke_test_montage_creation
cpath-publish-pypi-package:
runs-on: ubuntu-20.04
needs: [
@@ -321,6 +337,7 @@ jobs:
smoke_test_slides_panda_loss_analysis,
smoke_test_slides_panda_no_ddp_sampler,
smoke_test_tiles_panda_no_ddp_sampler,
smoke_test_montage_creation,
]
steps:
- uses: actions/checkout@v3


@@ -3,6 +3,7 @@ myst-parser==0.15.2
sphinx==4.1.2
sphinx-autodoc-typehints==1.12.0
sphinx-automodapi==0.13
sphinx-argparse==0.3.1
sphinx-rtd-theme==1.0.0
twine==3.3.0
wheel==0.38.1


@@ -43,6 +43,7 @@ extensions = [
'sphinx_automodapi.automodapi',
'sphinx_autodoc_typehints',
'sphinx.ext.viewcode',
"sphinxarg.ext",
]
numpydoc_show_class_members = False


@@ -0,0 +1,9 @@
Histopathology Scripts
======================

Please see the links below for details on the available arguments for each of the histopathology scripts.

.. toctree::
   :maxdepth: 1

   create_montage


@@ -0,0 +1,8 @@
create_montage
==============

This script can create a high-resolution montage of all images in a folder or a dataset.

.. argparse::
   :ref: health_cpath.utils.montage_config.create_montage_argparser
   :prog: runner.py

Binary data: docs/source/images/montage_from_dataset.png (new file, 19 KiB, binary file not shown)

Binary data: docs/source/images/montage_from_folder.png (new file, 11 KiB, binary file not shown)


@@ -54,6 +54,8 @@ The `hi-ml` toolbox provides
tcga_model.md
ssl_on_tile_dataset.md
dsa.md
montage_creation.md
cpath_scripts
.. toctree::
:maxdepth: 1


@@ -0,0 +1,166 @@
# Creating montages from whole slide images (WSIs)
When working with large amounts of histology data, it is often useful to create montages: collections of images
that are stitched together to form a single image. Montages make it possible to visualize a large amount of data at
once, in a single image that can then also be used for analysis.
The `hi-ml-cpath` toolbox contains scripts that help with the creation of montages from whole slide images (WSIs).
Creating montages can be very time-consuming, so it can be helpful to run the process in the cloud. The montage
creation code provided here can be run in AzureML very easily.
## Types of data for montage creation
1. Montages can be created from a folder of images, by specifying the name of the folder and a glob pattern, like
`**/foo_*.tiff`.
1. Montages can be created by first reading a file called `dataset.csv` located in a folder. `dataset.csv` is
effectively a Pandas DataFrame stored in CSV format, with each row corresponding to a single image.
More details on the format of `dataset.csv` can be found below.
Montage creation works for all WSI image formats that are supported by either of the two possible backends:
- [`openslide`](https://openslide.org/api/python/) supports `.tif(f)`, `.ndpi`, `.scn` and others
- [`cucim`](https://pypi.org/project/cucim/) supports `.svs`, `.tiff` and others
## Setup
- Check out the `hi-ml` repository via `git clone https://github.com/microsoft/hi-ml`
- Run the following commands:
```shell
cd hi-ml-cpath
make env
conda activate HimlHisto
make pip_local
```
All the commands listed below assume that
- you have activated the Conda environment `HimlHisto`
- your current working directory is `<repo root>/hi-ml-cpath`
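To verify the setup, you can print the full list of available commandline options (the same options are also
rendered in the documentation via `sphinx-argparse`):
```shell
python src/health_cpath/scripts/create_montage.py --help
```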
## Creating montages from a folder with files
The following command will create a montage from all files in the folder `/data` that match the pattern
`**/*.tiff`.
```shell
python src/health_cpath/scripts/create_montage.py --dataset /data --image_glob_pattern '**/*.tiff' --level 2 --width 1000 --output_path montage1
```
This will create a montage from all TIFF files in folder `/data`. Each TIFF file is read as a multi-level image, and
level 2 is read for creating the montage.
The `--width` argument determines the width in pixels of the output image. The height of the output image is determined
automatically. Note that the width given here, 1000, is suitable only for montages from a very small number of files
(say, 10). See below for more details on this commandline option.
This will create two images in the folder given by the `--output_path montage1` argument: the files `montage1/montage.jpg` and `montage1/montage.png`.
Here's an example of how this could look for a folder with 6 images, `0.tiff` through `5.tiff`:
![image](images/montage_from_folder.png)
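The same folder-based montage can also be created programmatically. Here is a minimal sketch, assuming the package has
been installed as described in the Setup section, and assuming (hypothetically) that your images live in `/data`:
```python
from pathlib import Path

from health_cpath.utils.montage import MontageCreation

config = MontageCreation()
config.image_glob_pattern = "**/*.tiff"  # same pattern as in the commandline call above
config.level = 2
config.width = 1000
config.output_path = Path("montage1")
# Reads all matching files in /data, then writes montage1/montage.png and montage1/montage.jpg
config.create_montage(input_folder=Path("/data"))
```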
## Creating montages from a `dataset.csv` file
If the montage creation script is pointed only to a folder, without a glob pattern being provided,
it assumes that a file `dataset.csv` is present. A montage will be created from only the images
listed in `dataset.csv`. In addition, the value of an optional `label` column will be added to the text that is
overlaid onto each image.
The dataset file should be a CSV file, with each row corresponding to a single image.
When working with a `dataset.csv` file, the following columns are handled:
| Column name | Contents | Required? |
| ----------- | ---------------------------------------------------------------------------------------------- | --------- |
| `image` | The path of the image that should be loaded | Required |
| `slide_id` | A unique identifier for the slide | Required |
| `label`     | An additional string that will be placed on the montage. This could be `0`, `1`, `tumour`, ...  | Optional  |
| `mask`      | The path of an additional image that will be rendered next to the image given in `image`        | Optional  |
Consider this example dataset file:
```text
image,slide_id,label
2.tiff,ID 2,Label 2
3.tiff,ID 3,Label 3
4.tiff,ID 4,Label 4
5.tiff,ID 5,Label 5
```
Run montage creation with the following command:
```shell
python src/health_cpath/scripts/create_montage.py --dataset /data --level 2 --width 1000 --output_path montage1
```
This would produce (assuming that the images `2.tiff`, `3.tiff`, `4.tiff`, and `5.tiff` are present in the folder
`/data`) a montage similar to this one:
![image](images/montage_from_dataset.png)
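To generate such a `dataset.csv` yourself, a plain Pandas DataFrame with the required columns is sufficient. A minimal
sketch, using the same hypothetical file names and folder as above:
```python
import pandas as pd

# Columns must match the table above: "image" and "slide_id" are required, "label" is optional.
df = pd.DataFrame({
    "image": [f"{i}.tiff" for i in range(2, 6)],
    "slide_id": [f"ID {i}" for i in range(2, 6)],
    "label": [f"Label {i}" for i in range(2, 6)],
})
# The montage script expects the file to be called dataset.csv, at the root of the dataset folder.
df.to_csv("/data/dataset.csv", index=False)
```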
### Using inclusion or exclusion lists
When creating montages from a `dataset.csv` file, it is possible to create montages from only a specific subset
of rows, or from all rows apart from those in a given list.
- Use the `--exclude_by_slide_id exclude.txt` argument to point to a file with a list of slide IDs that should be
excluded from the montage.
- Use the `--include_by_slide_id include.txt` argument to point to a file with a list of slide IDs for which
the montage should be created.
The files `exclude.txt` and `include.txt` should contain one slide ID per line. Note that these files are parsed as CSV, so the first line is treated as a column header, and the slide IDs are read from the first column.
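For example, an `exclude.txt` that excludes slides `ID 2` and `ID 3` from the example dataset above could look like
this (the header name in the first line is arbitrary):
```text
slide_id
ID 2
ID 3
```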
## Other commandline options
- Use `--width=20000` to set the width of the output montage image. The height of the output image is determined
automatically. Note that the default value is 60,000, which is suitable for several hundred input images. If you want to try
out montage creation on a small set of files (say, 10), ensure that you set the width to a reasonably small value,
like `--width=1000`.
- Use `--parallel=2` to specify the number of parallel processes that should be used for creating image thumbnails.
Thumbnails are created in a first step, using multiple processes, and then the thumbnails are stitched into the final
montage in the main process.
- Use `--backend=cucim` to switch the image reader backend to `CuCIM`. The default backend is `openslide`.
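As an illustration, the following call (with hypothetical paths) combines these options to build a montage with 4
worker processes and the `CuCIM` backend:
```shell
python src/health_cpath/scripts/create_montage.py --dataset /data --image_glob_pattern '**/*.tiff' --level 2 --width 1000 --parallel 4 --backend cucim --output_path montage2
```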
## Running in Azure
The `create_montage.py` script can be run in AzureML by adding 3 commandline arguments.
To set up Azure and AzureML:
- Follow the steps in the [AzureML onboarding](azure_setup.md).
- At the end of the onboarding you will download a file `config.json` from your AzureML workspace to your repository
root folder.
- To understand datasets, please read through the [AzureML datasets](datasets.md) documentation. Then create an AzureML
datastore that points to your Azure Blob Storage account.
- Upload your WSIs to a folder in Azure Blob Storage. This can be done most efficiently via
[azcopy](http://aka.ms/azcopy). `azcopy` can also copy directly across cloud providers, for example from AWS to Azure.
The following command will upload all files in the folder `my_test_slides` to a container `datasets` in your Azure Blob
Storage account called `mystorage`, creating a folder `my_test_slides` in the storage account in the process:
```shell
azcopy copy my_test_slides https://mystorage.blob.core.windows.net/datasets/ --recursive
```
The following command will then create a run in AzureML that executes montage creation from that folder:
```shell
python src/health_cpath/scripts/create_montage.py --dataset my_test_slides --level 2 --width 1000 --cluster <clustername> --conda_env environment.yml --datastore <datastorename>
```
In this command, replace the following:
- Replace `my_test_slides` with the name of the folder in blob storage where you uploaded your WSIs.
- `clustername` is the name of a [compute
cluster](https://learn.microsoft.com/en-us/azure/machine-learning/quickstart-create-resources#create-compute-clusters)
where your job will execute.
- `datastorename` is the name of an AzureML datastore, essentially a pointer to your blob storage account plus the
credentials that are necessary to access it. For the above example, the datastore needs to point to storage account
`mystorage` and container `datasets`.
The command above will only run for a minute or less: it mostly creates a snapshot of the code and sends that off to
the cloud for execution. At the end you will see a link printed out that takes you to the AzureML portal, where you can
monitor the progress of the run.
Once the run is completed, you will find the two files `montage.jpg` and `montage.png` in the "Outputs" tab of the run, and
an option to download them to your machine.

hi-ml-cpath/.vscode/launch.json vendored

@@ -85,5 +85,20 @@
"console": "integratedTerminal",
"justMyCode": false,
},
{
"name": "Montage creation in AzureML",
"type": "python",
"request": "launch",
"program": "${workspaceFolder}/src/health_cpath/scripts/create_montage.py",
"args": [
"--dataset=test_montage",
"--width=1000",
"--cluster=lite-testing-ds2",
"--datastore=himldatasets",
"--conda_env=environment.yml"
],
"console": "integratedTerminal",
"justMyCode": true
},
]
}


@@ -301,6 +301,9 @@ smoke_test_tiles_panda_no_ddp_sampler_aml:
{ ${BASE_CPATH_RUNNER_COMMAND} ${DEEPSMILEPANDATILES_ARGS} ${DEFAULT_SMOKE_TEST_ARGS} \
${DEEPSMILEDEFAULT_SMOKE_TEST_ARGS} ${DDP_SAMPLER_ARGS} ${AML_MULTIPLE_DEVICE_ARGS} --tag smoke_test_tiles_panda_no_ddp_sampler_aml;}
smoke_test_montage_creation:
python src/health_cpath/scripts/create_montage.py --dataset test_montage --width 1000 --cluster lite-testing-ds2 --datastore himldatasets --conda_env environment.yml --wait_for_completion
smoke tests local: smoke_test_cucim_slidespandaimagenetmil_local smoke_test_openslide_slidespandaimagenetmil_local smoke_test_tilespandaimagenetmil_local smoke_test_tcgacrcksslmil_local smoke_test_crck_simclr_local smoke_test_crck_flexible_finetuning_local smoke_test_tcgacrckimagenetmil_local smoke_test_crck_loss_analysis_local smoke_test_slides_panda_loss_analysis_local smoke_test_slides_panda_no_ddp_sampler_local smoke_test_tiles_panda_no_ddp_sampler_local
smoke tests AML: smoke_test_cucim_slidespandaimagenetmil_aml smoke_test_openslide_slidespandaimagenetmil_aml smoke_test_tilespandaimagenetmil_aml smoke_test_tcgacrcksslmil_aml smoke_test_crck_simclr_aml smoke_test_crck_flexible_finetuning_aml smoke_test_tcgacrckimagenetmil_aml smoke_test_crck_loss_analysis_aml smoke_test_slides_panda_loss_analysis_aml smoke_test_slides_panda_no_ddp_sampler_aml smoke_test_tiles_panda_no_ddp_sampler_aml


@@ -0,0 +1,62 @@
# -------------------------------------------------------------------------------------------
# Copyright (c) Microsoft Corporation. All rights reserved.
# Licensed under the MIT License (MIT). See LICENSE in the repo root for license information.
# -------------------------------------------------------------------------------------------
"""
This script can be used to create a montage of slides, given a slides dataset or a folder with images.
For full documentation of the parameters, run `python create_montage.py --help`.
"""
import logging
from pathlib import Path
import sys
from typing import Optional
current_file = Path(__file__).absolute()
repository_root = current_file.parent.parent.parent.parent.parent
folders_to_add = [repository_root / "hi-ml" / "src",
repository_root / "hi-ml-azure" / "src",
repository_root / "hi-ml-cpath" / "src"]
for folder in folders_to_add:
assert folder.is_dir()
sys.path.insert(0, str(folder))
from health_azure.himl import submit_to_azure_if_needed, DatasetConfig # noqa
from health_azure.logging import logging_to_stdout # noqa
from health_cpath.utils.montage import create_config_from_args # noqa
def main() -> None:
config = create_config_from_args()
logging_to_stdout()
submit_to_azureml = config.cluster != ""
if config.dataset.strip() == "":
raise ValueError("Please provide a dataset name via --dataset")
elif config.dataset.startswith("/"):
if submit_to_azureml:
raise ValueError("Cannot submit to AzureML if dataset is a local folder")
input_folder: Optional[Path] = Path(config.dataset)
else:
logging.info(f"In AzureML use mounted dataset '{config.dataset}' in datastore {config.datastore}")
input_dataset = DatasetConfig(name=config.dataset, datastore=config.datastore, use_mounting=True)
logging.info(f"Submitting to AzureML, running on cluster {config.cluster}")
run_info = submit_to_azure_if_needed(
entry_script=current_file,
snapshot_root_directory=repository_root,
compute_cluster_name=config.cluster,
conda_environment_file=config.conda_env,
submit_to_azureml=submit_to_azureml,
input_datasets=[input_dataset],
strictly_aml_v1=True,
docker_shm_size="100g",
wait_for_completion=config.wait_for_completion,
)
input_folder = run_info.input_datasets[0]
assert input_folder is not None
config.create_montage(input_folder)
if __name__ == "__main__":
main()


@@ -0,0 +1,521 @@
# ------------------------------------------------------------------------------------------
# Copyright (c) Microsoft Corporation. All rights reserved.
# Licensed under the MIT License (MIT). See LICENSE in the repo root for license information.
# ------------------------------------------------------------------------------------------
import functools
import logging
import multiprocessing
import shutil
import sys
import tempfile
from pathlib import Path
from typing import Any, Dict, List, Optional, Tuple, Union
import torchvision
import torch
import numpy as np
import pandas as pd
from PIL import Image, ImageDraw, ImageFont
from tqdm import tqdm
from monai.data.image_reader import WSIReader
from health_azure.utils import apply_overrides, parse_arguments
from health_cpath.preprocessing.loading import WSIBackend
from health_cpath.utils.montage_config import MontageConfig, create_montage_argparser
from health_cpath.utils.naming import SlideKey
from health_cpath.datasets.base_dataset import SlidesDataset
from health_ml.utils.type_annotations import TupleInt3
MONTAGE_FILE = "montage.png"
DatasetOrDataframe = Union[SlidesDataset, pd.DataFrame]
DatasetRecord = Dict[SlideKey, Any]
logger = logging.getLogger(__name__)
def add_text(image: Image, text: str, y: float = 0.9, color: TupleInt3 = (27, 77, 40), fontsize_step: int = 2) -> None:
"""Add text to a PIL image.
:param image: Image object to which text needs to be added.
:param text: The text that needs to be added.
:param y: Float between 0-1 specifying the vertical position of the text (default=0.9).
:param color: A 3-tuple indicating the fill color of the text (default = (27, 77, 40)).
:param fontsize_step: Steps of font size to reduce if the text size is more than image size (default=2).
"""
# This font is usually found in a path like /usr/share/fonts/truetype/dejavu
font_path = Path('DejaVuSans.ttf')
fontsize = 48
draw = ImageDraw.Draw(image)
image_size_x, image_size_y = image.size
font = ImageFont.truetype(str(font_path), fontsize)
text_size_x, text_size_y = draw.textsize(text, font=font)
while text_size_x >= image_size_x:
fontsize -= fontsize_step
font = ImageFont.truetype(str(font_path), fontsize)
text_size_x, text_size_y = draw.textsize(text, font=font)
start_x = image_size_x // 2 - text_size_x // 2
start_y = image_size_y * y - text_size_y // 2
xy = start_x, start_y
draw.text(xy, text, fill=color, font=font, align='center')
def load_slide_as_pil(reader: WSIReader, slide_file: Path, level: int = 0) -> Image:
"""Load a WSI as a PIL image.
:param reader: The WSI reader for loading the slide.
:param slide_file: The file to read from.
:param level: Resolution downsampling level (default=0).
:return: PIL image object corresponding to the WSI image.
"""
image = reader.read(slide_file)
try:
image_array, _ = reader.get_data(image, level=level)
except ValueError:
logger.warning(f"Level {level} not available for {slide_file}, using level 0 instead.")
image_array, _ = reader.get_data(image, level=0)
array = image_array.numpy().transpose(1, 2, 0)
to_pil = torchvision.transforms.ToPILImage()
array_pil = to_pil(array)
return array_pil
def _make_thumbnail(sample: DatasetRecord, reader: WSIReader, level: int, slide_size: Tuple[int, int], images_dir: Path,
masks_dir: Optional[Path] = None, image_suffix: str = '.png',
default_mask_color: TupleInt3 = (119, 161, 120)) -> None:
"""Make thumbnails of the slides in slides dataset.
:param sample: The slide dataset object dictionary for which thumbnail needs to be created.
:param reader: The WSI reader for loading the slide.
:param slide_size: The tuple of slide size (width, height).
:param images_dir: The path to the `images` directory where WSI thumbnails will be stored.
:param level: Resolution downsampling level.
:param masks_dir: Optional path to `masks` directory where mask thumbnails will be stored.
If `None` (default), masks thumbnails will not be created.
:param image_suffix: Suffix of image thumbnails (default=`.png`).
:param default_mask_color: Color of the masks (default = (119, 161, 120)).
"""
try:
image_pil = load_slide_as_pil(reader, sample[SlideKey.IMAGE], level)
image_pil = image_pil.resize(slide_size)
slide_id = sample.get(SlideKey.SLIDE_ID, "")
# Slide IDs can be purely numeric, in those cases need to convert to str
text_to_add = str(slide_id)
if SlideKey.LABEL in sample:
label = str(sample[SlideKey.LABEL])
text_to_add += ": " + str(label)
if text_to_add:
add_text(image_pil, text_to_add)
if masks_dir is not None and SlideKey.MASK in sample:
masks_dir.mkdir(exist_ok=True)
try:
mask_pil = load_slide_as_pil(reader, sample[SlideKey.MASK], level=level)
mask_pil = mask_pil.resize(slide_size, Image.NEAREST)
mask_pil = Image.fromarray(np.asarray(mask_pil) * 255) # for visualization
except ValueError:
mask_pil = Image.new("RGB", slide_size, default_mask_color)
finally:
mask_path = masks_dir / f"{slide_id}.png"
mask_pil.save(mask_path)
image_path = images_dir / f"{slide_id}{image_suffix}"
image_pil.save(image_path)
except Exception as ex:
slide_id = sample.get(SlideKey.SLIDE_ID, "(no slide ID found)")
logging.warning(f"Unable to process slide with ID '{slide_id}': {ex}")
def make_thumbnails(records: List[DatasetRecord],
slide_size: Tuple[int, int],
images_dir: Path,
level: int,
masks_dir: Optional[Path] = None,
num_parallel: int = 0,
image_suffix: str = '.png',
backend: str = WSIBackend.CUCIM) -> None:
"""Make thumbnails of the slides in slides dataset.
:param records: A list of dataframe records. The records must contain at least the columns `slide_id` and `image`.
:param slide_size: The tuple of slide size (width, height).
:param images_dir: The path to the `images` directory where WSI thumbnails will be stored.
:param level: Resolution downsampling level.
:param masks_dir: Optional path to `masks` directory where mask thumbnails will be stored (default=None).
:param num_parallel: Number of parallel processes for thumbnail creation. Use 0 to disable parallel.
:param image_suffix: Suffix of image thumbnails (default=`.png`).
:param backend: The backend to use for reading the WSI (default=`cucim`).
"""
images_dir.mkdir(exist_ok=True, parents=True)
reader = WSIReader(backend=backend)
func = functools.partial(
_make_thumbnail,
reader=reader,
level=level,
slide_size=slide_size,
images_dir=images_dir,
masks_dir=masks_dir,
image_suffix=image_suffix,
)
if num_parallel > 0:
pool = multiprocessing.Pool(num_parallel)
map_func = pool.imap_unordered # type: ignore
else:
map_func = map # type: ignore
progress = tqdm(map_func(func, records), total=len(records))
list(progress) # type: ignore
if num_parallel > 0:
pool.close()
def make_montage_from_dir(images_dir: Path, num_cols: int, masks_dir: Optional[Path] = None,
image_suffix: str = '.png') -> Image:
"""Create the montage image from the thumbnails.
:param images_dir: The path to the `images` directory where WSI thumbnails will be stored.
:param num_cols: Number of columns in the montage.
:param masks_dir: Optional path to `masks` directory where mask thumbnails will be stored (default=None).
:param image_suffix: Suffix of image thumbnails (default=`.png`).
:return: PIL image of the montage.
"""
image_paths = sorted(images_dir.glob(f'*{image_suffix}'))
if len(image_paths) == 0:
raise ValueError(f"No thumbnail images found in {images_dir}")
images_arrays = []
for image_path in tqdm(image_paths):
image_pil = Image.open(image_path)
images_arrays.append(np.asarray(image_pil))
images_array = np.asarray(images_arrays)
if masks_dir is not None:
mask_paths = sorted(masks_dir.glob('*.png'))
# Don't process masks if there are no files present, even if the mask directory has been passed as an argument.
if len(mask_paths) > 0:
if len(mask_paths) != len(image_paths):
raise ValueError("Number of masks is different from number of images.")
masks_arrays = []
for mask_path in tqdm(mask_paths):
mask_pil = Image.open(mask_path)
masks_arrays.append(np.asarray(mask_pil))
masks_array = np.asarray(masks_arrays)
images_array = np.concatenate((images_array, masks_array), axis=-2)
images_tensor = torch.from_numpy(images_array).permute(0, 3, 1, 2)
grid_tensor = torchvision.utils.make_grid(images_tensor, nrow=num_cols)
grid_pil = torchvision.transforms.ToPILImage()(grid_tensor)
return grid_pil
def dataset_from_folder(root_folder: Path, glob_pattern: str = "**/*") -> pd.DataFrame:
"""Create slides dataset all files in a folder. The function searches for all files in the `root_folder` and its
subfolders, and creates a dataframe with the following columns: `image`: The absolute path of the file,
column `slide_id`: Either the file name only if that is unique, or otherwise the path of the file relative to
the `root_folder`.
:param root_folder: A directory with (image) files.
:param glob_pattern: The glob pattern to match the image files (default=`**/*`, using all files recursively in all
subfolders).
:return: Slides dataset.
"""
if not root_folder.is_dir():
raise ValueError(f"Root folder '{root_folder}' does not exist or is not a directory.")
# Find all image files in the folder, exclude folders in the result
image_paths = list(sorted(f for f in root_folder.glob(glob_pattern) if f.is_file()))
file_names_only = [path.name for path in image_paths]
# Check if file names alone are enough to make the dataset unique.
if len(file_names_only) != len(set(file_names_only)):
# There are duplicates when going only by file names. Hence, use full paths relative to the root folder.
image_ids = [str(path.relative_to(root_folder)) for path in image_paths]
else:
# File names are unique. Hence, use them as slide IDs.
image_ids = file_names_only
# Mounted datasets can show up with '&' appearing escaped as '%26'. Clean up the image IDs if so.
# We expect that the exclusion list shows '&' for the offending slides.
escaped_amp = "%26"
if any(escaped_amp in image_id for image_id in image_ids):
logging.info(f"Some image IDs contain '{escaped_amp}', replacing that with '&'.")
image_ids = [id.replace(escaped_amp, "&") for id in image_ids]
return pd.DataFrame({SlideKey.SLIDE_ID: image_ids, SlideKey.IMAGE: map(str, image_paths)})
def restrict_dataset(dataset: pd.DataFrame,
column: str,
items: List[str],
include: bool) -> pd.DataFrame:
"""Exclude or include slides from a dataset, based on values in a column.
For example, to exclude slides with label `0` from the dataset, use:
restrict_dataset(dataset, column='label', items=['0'], include=False).
The items are matched with the column values using the `isin` operator. The code also handles
the case where the column in question is the dataset index.
If the items in question are not present in the column, the result is an empty dataset (if include=True)
or the original dataset (if include=False).
:param dataset: Slides dataset.
:param column: The name of the column on which the inclusion/exclusion is based.
:param items: The values that the column should match.
:param include: If True, modify the dataset to only include the rows where the column matches a value in the
list `items`. If False, modify the dataset to exclude those rows.
:return: Filtered dataset.
"""
if column in dataset:
matching_rows = dataset[column].isin(items)
if not include:
matching_rows = ~matching_rows
return dataset[matching_rows]
elif dataset.index.name == column:
# Drop or loc operations on an index column when the values do not exist raise an error. Hence, restrict
# to existing values first.
items = list(set(items).intersection(set(dataset.index)))
if include:
return dataset.loc[items]
else:
return dataset.drop(items)
else:
raise ValueError(f"Column {column} not found in dataset.")
def dataset_to_records(dataset: DatasetOrDataframe) -> List[DatasetRecord]:
"""Converts a SlidesDataset or a plain dataframe to a list of dictionaries.
:param dataset: Slides dataset or a plain dataframe.
"""
if isinstance(dataset, pd.DataFrame):
return dataset.to_dict(orient='records')
elif isinstance(dataset, SlidesDataset):
# SlidesDataset overrides __getitem__, use that to convert to records
return [dataset[i] for i in range(len(dataset))]
else:
raise ValueError(f"Can't convert {type(dataset)} to a list of records.")
def make_montage(records: List[DatasetRecord],
out_path: Path,
width: int = 60_000,
level: int = 2,
image_suffix: str = '.png',
masks: bool = True,
temp_dir: Optional[Union[Path, str]] = None,
cleanup: bool = False,
num_parallel: int = 0,
backend: str = "cucim") -> None:
"""Make the montage of WSI thumbnails from a slides dataset.
:param records: A list of dataframe records. The records must contain at least the columns `slide_id` and `image`.
:param out_path: The output path where the montage image will be stored.
:param width: The width of the montage (default=60000).
:param level: Resolution downsampling level at which the WSI will be read (default=2).
:param image_suffix: Suffix of image thumbnails (default=`.png`).
:param masks: Flag to denote if masks need to be included (default=True).
:param temp_dir: Optional path to temporary directory that stores the slide thumbnails.
If `None`(default), a temporary directory will be created in `tmp` folder.
:param cleanup: Flag to determine whether to clean the temporary directory containing thumbnails
after the montage is created (default=False).
:param num_parallel: Number of parallel processes for thumbnail creation. Use 0 or 1 to disable parallel.
:param backend: The backend to use for reading the WSI (default=`cucim`).
"""
# There might be some blanks at the bottom right
# rows * cols <= N
# We are going to stack the slides and their masks side by side, so we need 2 * cols
# 2 * cols / rows = 16 / 9; rows = 2 * cols * 9 / 16
# cols * 2 * cols * 9 / 16 <= N; cols <= (N / 2 * 16 / 9)**(1 / 2)
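# Example: 6 slides without masks gives num_cols = int(sqrt(6 * 16 / 9)) = 3;
# with masks enabled, num_cols = int(sqrt(6 / 2 * 16 / 9)) = 2.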
num_slides = len(records)
multiplier = 2 if masks else 1
num_cols = int(np.sqrt(num_slides / multiplier * 16 / 9))
logging.info(f"Creating montage from {num_slides} slides with {num_cols} columns.")
slide_width = (width // num_cols) // multiplier
slide_size = slide_width, slide_width // 2
temp_dir = tempfile.mkdtemp() if temp_dir is None else temp_dir
temp_dir = Path(temp_dir)
image_thumbnail_dir = temp_dir / "images"
mask_thumbnail_dir = temp_dir / "masks" if masks else None
if image_thumbnail_dir.is_dir():
logging.info(f"Skipping thumbnail creation because folder already exists: {image_thumbnail_dir}")
else:
logging.info(f"Starting thumbnail creation with thumbnail size {slide_size}")
make_thumbnails(records=records, slide_size=slide_size, images_dir=image_thumbnail_dir, level=level,
masks_dir=mask_thumbnail_dir, image_suffix=image_suffix, num_parallel=num_parallel,
backend=backend)
try:
logging.info("Starting montage creation")
montage_pil = make_montage_from_dir(image_thumbnail_dir,
num_cols,
masks_dir=mask_thumbnail_dir,
image_suffix=image_suffix)
except Exception as ex:
raise ValueError(f"Failed to create montage from {image_thumbnail_dir}: {ex}")
logger.info(f"Saving montage to {out_path}")
montage_pil.save(out_path)
if out_path.suffix != '.jpg':
jpeg_out = out_path.with_suffix('.jpg')
montage_pil.save(jpeg_out, format='JPEG', quality=90)
if cleanup:
shutil.rmtree(temp_dir)
class MontageCreation(MontageConfig):
def read_list(self, csv_file_path: Optional[Path]) -> List[str]:
"""Reads a list of slide IDs from a file."""
if csv_file_path:
df = pd.read_csv(csv_file_path)
column_to_read = df.columns[0]
if len(df.columns) > 1:
logger.warning(f"More than one column in file, using first column: {column_to_read}")
return df[column_to_read].tolist()
else:
return []
def read_exclusion_list(self) -> List[str]:
"""Read the list of slide IDs that should be excluded from the montage."""
if self.exclude_by_slide_id:
slides_to_exclude = self.read_list(self.exclude_by_slide_id)
logger.info(f"Excluding {len(slides_to_exclude)} slides from montage. First 3: {slides_to_exclude[:3]}")
logger.info("Exclusion list will be matched against the Slide ID column (for predefined datasets) or the "
"filename.")
return slides_to_exclude
else:
return []
def read_inclusion_list(self) -> List[str]:
"""Read the list of slide IDs that should be included in the montage."""
if self.include_by_slide_id:
slides_to_include = self.read_list(self.include_by_slide_id)
logger.info(f"Restricting montage to {len(slides_to_include)} slides. First 3: {slides_to_include[:3]}")
logger.info("Inclusion list will be matched against the Slide ID column (for predefined datasets) or the "
"filename.")
return slides_to_include
else:
return []
def read_dataset(self, input_folder: Path) -> DatasetOrDataframe:
"""Read the dataset that should be used for creating the montage. If a glob pattern has been provided, then
all the image files specified by that pattern will be used for the montage. Otherwise, a file `dataset.csv`
is expected in the input folder. The `dataset.csv` will be used to create an instance of `SlidesDataset`.
:param input_folder: The folder where the dataset is located.
:return: A SlidesDataset or dataframe object that contains the dataset."""
if self.image_glob_pattern:
logger.info(f"Trying to create a dataset from files that match: {self.image_glob_pattern}")
try:
dataset = dataset_from_folder(input_folder, glob_pattern=self.image_glob_pattern)
except Exception as ex:
raise ValueError(f"Unable to create dataset from files in folder {input_folder}: {ex}")
if len(dataset) == 0:
raise ValueError(f"No images found in folder {input_folder} with pattern {self.image_glob_pattern}")
return dataset
else:
logger.info(f"Trying to load the dataset as a SlidesDataset from folder {input_folder}")
try:
dataset = SlidesDataset(root=input_folder)
except Exception as ex:
logging.error("Unable to load dataset.")
file = input_folder / SlidesDataset.DEFAULT_CSV_FILENAME
# Print the whole directory tree to check where the problem is.
while str(file) != str(file.root):
logging.debug(f"File: {file}, exists: {file.exists()}")
file = file.parent
raise ValueError(f"Unable to load dataset. Check if the file {SlidesDataset.DEFAULT_CSV_FILENAME} "
f"exists, or provide a file name pattern via --image_glob_pattern. Error: {ex}")
return dataset
def create_montage(self, input_folder: Path) -> None:
"""Creates a montage from the dataset in the input folder. The method reads the dataset, creates an output
folder, handles the inclusion and exclusion lists, and then calls the method that creates the montage.
:param input_folder: The folder where the dataset is located.
:raises ValueError: If both an inclusion and exclusion list have been provided.
"""
dataset = self.read_dataset(input_folder)
self.output_path.mkdir(parents=True, exist_ok=True)
if self.include_by_slide_id and self.exclude_by_slide_id:
raise ValueError("You cannot provide both an inclusion and exclusion list.")
if self.include_by_slide_id:
items = self.read_inclusion_list()
exclude_items = False
elif self.exclude_by_slide_id:
items = self.read_exclusion_list()
exclude_items = True
else:
items = []
exclude_items = True
self.montage_from_included_and_excluded_slides(
dataset=dataset,
items=items,
exclude_items=exclude_items,
)
def montage_from_included_and_excluded_slides(self,
dataset: DatasetOrDataframe,
items: Optional[List[str]] = None,
exclude_items: bool = True,
restrict_by_column: str = "") -> Optional[Path]:
"""Creates a montage of included and excluded slides from the dataset.
:param dataset: Slides dataset or a plain dataframe.
:param items: A list of slide ID values that should be included in or excluded from the montage.
:param exclude_items: If True, exclude the list in `items` from the montage. If False, include
only those in the montage.
:param restrict_by_column: The column name that should be used for inclusion/exclusion lists
(default=dataset.SLIDE_ID_COLUMN).
:return: A path to the created montage, or None if no images were available for creating the montage.
"""
if isinstance(dataset, pd.DataFrame):
df_original = dataset
else:
df_original = dataset.dataset_df
logging.info(f"Input dataset contains {len(df_original)} records.")
if restrict_by_column == "":
if isinstance(dataset, pd.DataFrame):
restrict_by_column = SlideKey.SLIDE_ID.value
else:
restrict_by_column = dataset.SLIDE_ID_COLUMN
if items:
if exclude_items:
logging.info(f"Using dataset column '{restrict_by_column}' to exclude slides")
include = False
else:
logging.info(f"Using dataset column '{restrict_by_column}' to restrict the set of slides")
include = True
df_restricted = restrict_dataset(
df_original,
column=restrict_by_column,
items=items,
include=include)
logging.info(f"Updated dataset contains {len(df_restricted)} records")
else:
df_restricted = df_original
montage_result = self.output_path / MONTAGE_FILE
logging.info(f"Creating montage in {montage_result}")
if isinstance(dataset, pd.DataFrame):
records_restricted = dataset_to_records(df_restricted)
else:
dataset.dataset_df = df_restricted
records_restricted = dataset_to_records(dataset)
# We had to modify the dataset in place, hence restore the original dataset to avoid odd side-effects.
dataset.dataset_df = df_original
if len(records_restricted) > 0:
make_montage(
records=records_restricted,
out_path=montage_result,
width=self.width,
level=self.level,
masks=False,
cleanup=True,
num_parallel=self.parallel,
backend=self.backend)
return montage_result
else:
logging.info("No slides to include in montage, skipping.")
return None
def create_config_from_args() -> MontageConfig:
"""Creates a configuration object for montage creation from the commandline arguments.
:return: An object that describes all options for the montage creation.
"""
parser = create_montage_argparser()
config = MontageCreation()
parser_results = parse_arguments(parser, args=sys.argv[1:], fail_on_unknown_args=True)
_ = apply_overrides(config, parser_results.args)
return config


@@ -0,0 +1,75 @@
# ------------------------------------------------------------------------------------------
# Copyright (c) Microsoft Corporation. All rights reserved.
# Licensed under the MIT License (MIT). See LICENSE in the repo root for license information.
# ------------------------------------------------------------------------------------------
from argparse import ArgumentParser
from pathlib import Path
from typing import Optional
import param
from health_azure.utils import create_argparser
class AzureRunConfig(param.Parameterized):
cluster: str = \
param.String(default="", allow_None=False,
doc="The name of the GPU or CPU cluster inside the AzureML workspace"
"that should execute the job. To run on your local machine, omit this argument.")
datastore = \
param.String(default="",
doc="The name of the AzureML datastore where the dataset is defined.")
dataset = \
param.String(default="",
doc="The name of the AzureML dataset to use for creating the montage. The dataset will be "
"mounted automatically. Use an absolute path to a folder on the local machine to bypass "
"mounting.")
conda_env: Optional[Path] = \
param.ClassSelector(class_=Path, default=Path("hi-ml/hi-ml-cpath/environment.yml"), allow_None=True,
doc="The Conda environment file that should be used when submitting the present run to "
"AzureML. If not specified, the hi-ml-cpath environment file will be used.")
wait_for_completion: bool = param.Boolean(default=False,
doc="If True, wait for AML Run to complete before proceeding. "
"If False, submit the run to AML and exit")
class MontageConfig(AzureRunConfig):
level: int = \
param.Integer(default=1,
doc="Resolution downsample level, e.g. if lowest resolution is 40x and the available "
"downsample levels are [1.0, 4.0, 16.0] then level = 1 corresponds to 10x magnification")
exclude_by_slide_id: Optional[Path] = \
param.ClassSelector(class_=Path, default=None, allow_None=True,
doc="Provide a file that contains slide IDs that should be excluded. File format is "
"CSV, the first column is used as the slide ID. If the file is empty, no slides "
"will be excluded.")
include_by_slide_id: Optional[Path] = \
param.ClassSelector(class_=Path, default=None, allow_None=True,
doc="Provide a file that contains slide IDs that should be included. File format is "
"CSV, the first column is used as the slide ID. If the file is empty, no montage "
"will be produced.")
image_glob_pattern: str = \
param.String(default="",
doc="When provided, use this pattern in rglob to find the files that should be included in the "
"montage. Example: '**/*.tiff' to find all TIFF files recursive. You may have to escape "
"the pattern in your shell.")
width: int = \
param.Integer(default=60_000,
doc="The width of the montage in pixels")
output_path: Path = \
param.ClassSelector(class_=Path,
default=Path("outputs"),
doc="The folder where the montage will be saved")
parallel: int = \
param.Integer(default=8,
doc="The number of parallel processes to use when creating the montage.")
backend: str = \
param.String(default="openslide",
doc="The backend to use for reading the slides. Can be 'openslide' or 'cucim'")
def create_montage_argparser() -> ArgumentParser:
return create_argparser(
MontageConfig(),
usage="python create_montage.py --dataset <dataset_folder> --image_glob_pattern '**/*.tiff' --width 1000",
description="Create an overview image with thumbnails of all slides in a dataset.")

Binary data: hi-ml-cpath/testhisto/test_data/montages/montage_excluded.png (new file, 19 KiB, binary file not shown)

Binary data: hi-ml-cpath/testhisto/test_data/montages/montage_from_folder.png (new file, 11 KiB, binary file not shown)

Binary file not shown (5.0 KiB)

Binary data: hi-ml-cpath/testhisto/test_data/montages/montage_included.png (new file, 18 KiB, binary file not shown)

Binary data: hi-ml-cpath/testhisto/test_data/montages/montage_via_args.png (new file, 3.7 KiB, binary file not shown)

Binary data: hi-ml-cpath/testhisto/test_data/montages/montage_with_masks.png (new file, 13 KiB, binary file not shown)

Binary data: hi-ml-cpath/testhisto/test_data/montages/montage_without_masks.png (new file, 15 KiB, binary file not shown)


@@ -53,6 +53,7 @@ class MockPandaSlidesGenerator(MockHistoDataGenerator):
:param n_tiles_list: A list to use different n_tiles per slide for randomly positioned tiles.
:param kwargs: Same params passed to MockHistoDataGenerator.
"""
self.generated_files: List[str] = []
super().__init__(**kwargs)
self.n_levels = n_levels
@@ -120,13 +121,14 @@ class MockPandaSlidesGenerator(MockHistoDataGenerator):
else tiles[i % self.n_tiles].numpy()
)
# fill the square diagonal with tile repeated n_repeat_tile times along X and Y axis.
fill_square = np.tile(tile, (self.n_repeat_tile, self.n_repeat_tile))
fill_square: Union[np.ndarray, float] = np.tile(tile, (self.n_repeat_tile, self.n_repeat_tile))
dump_tiles.append(tile)
elif self.mock_type == MockHistoDataType.FAKE:
if i == 0 or self.n_tiles > 1:
# pick a random fake value to fill in the square diagonal.
fill_square = np.random.uniform(0, self.background_val / (self.n_repeat_diag + 1) * (i + 1))
upper = self.background_val / (self.n_repeat_diag + 1) * (i + 1)
fill_square = np.random.uniform(0, upper)
dump_tiles.append(
np.full(
shape=(self.n_channels, self.tile_size, self.tile_size),
@@ -222,6 +224,7 @@ class MockPandaSlidesGenerator(MockHistoDataGenerator):
slide_tiff_filename = self.dest_data_path / "train_images" / f"_{slide_counter}.tiff"
self._save_mock_wsi_as_tiff_file(slide_tiff_filename, wsi_levels)
self.generated_files.append(str(slide_tiff_filename))
if dump_tiles is not None:
dump_tiles_filename = self.dest_data_path / "dump_tiles" / f"_{slide_counter}.npy"


@@ -0,0 +1,416 @@
import shutil
from pathlib import Path
from unittest import mock
import numpy as np
import pandas as pd
import pytest
from PIL import Image
from pandas.testing import assert_frame_equal
from typing import Generator, List
from health_cpath.utils.montage import (
MONTAGE_FILE,
MontageCreation,
dataset_from_folder,
dataset_to_records,
make_montage,
make_montage_from_dir,
restrict_dataset,
)
from health_cpath.datasets.base_dataset import SlidesDataset
from health_cpath.datasets.panda_dataset import PandaDataset
from health_cpath.scripts.create_montage import main as script_main
from health_cpath.utils.naming import SlideKey
from testhisto.mocks.base_data_generator import MockHistoDataType
from testhisto.mocks.slides_generator import MockPandaSlidesGenerator
from testhisto.utils.utils_testhisto import assert_binary_files_match, full_ml_test_data_path, wait_until_file_exists
# Set this to True to update all stored images in the test_data folder.
UPDATE_STORED_RESULTS = False
NUM_SLIDES = 6
def expected_results_folder() -> Path:
"""Gets the path to the folder where the expected montage results are stored.
:return: The path to the folder where the expected results are stored.
"""
return full_ml_test_data_path("montages")
def _create_slides_images(tmp_path: Path) -> MockPandaSlidesGenerator:
print(f"Result folder: {tmp_path}")
wsi_generator = MockPandaSlidesGenerator(
dest_data_path=tmp_path,
mock_type=MockHistoDataType.FAKE,
n_tiles=4,
n_slides=NUM_SLIDES,
n_channels=3,
n_levels=3,
tile_size=28,
background_val=255,
)
wsi_generator.generate_mock_histo_data()
print(f"Generated images in {tmp_path}")
return wsi_generator
@pytest.fixture(scope="module")
def temp_panda_dataset(tmp_path_factory: pytest.TempPathFactory) -> Generator:
"""A fixture that creates a PandaDataset object with randomly created slides.
"""
tmp_path = tmp_path_factory.mktemp("mock_panda")
_create_slides_images(tmp_path)
usecols = [PandaDataset.SLIDE_ID_COLUMN, PandaDataset.MASK_COLUMN]
yield PandaDataset(root=tmp_path, dataframe_kwargs={"usecols": usecols + list(PandaDataset.METADATA_COLUMNS)})
@pytest.fixture(scope="module")
def temp_slides(tmp_path_factory: pytest.TempPathFactory) -> Generator:
"""A fixture that creates a folder (Path object) with randomly created slides.
"""
tmp_path = tmp_path_factory.mktemp("mock_wsi")
_create_slides_images(tmp_path)
yield tmp_path
@pytest.fixture(scope="module")
def temp_slides_dataset(tmp_path_factory: pytest.TempPathFactory) -> Generator:
"""A fixture that creates a SlidesDataset object with randomly created slides.
"""
tmp_path = tmp_path_factory.mktemp("mock_slides")
wsi_generator = _create_slides_images(tmp_path)
# Create a CSV file with the 3 required columns for montage creation. Mask is optional.
metadata = {
SlideKey.SLIDE_ID: [f"ID {i}" for i in range(NUM_SLIDES)],
SlideKey.IMAGE: wsi_generator.generated_files,
SlideKey.LABEL: [f"Label {i}" for i in range(NUM_SLIDES)],
}
df = pd.DataFrame(data=metadata)
csv_filename = tmp_path / SlidesDataset.DEFAULT_CSV_FILENAME
df.to_csv(csv_filename, index=False)
# Tests fail non-deterministically, saying that the dataset file does not exist (yet). Hence, wait.
wait_until_file_exists(csv_filename)
yield SlidesDataset(root=tmp_path)
def _create_folder_with_images(tmp_path: Path, num_images: int = 4, image_size: int = 20) -> None:
"""Creates a folder with images.
:param tmp_path: The path to the folder where the images should be stored.
:param num_images: The number of images that should be created.
:param image_size: The width and height of each image, in pixels.
"""
np.random.seed(42)
tmp_path.mkdir(parents=True, exist_ok=True)
for i in range(num_images):
image_path = tmp_path / f"image_{i}.png"
image_np = np.random.uniform(0, 255, size=(image_size, image_size, 3)).astype(np.uint8)
image = Image.fromarray(image_np)
image.save(image_path)
def test_montage_from_dir(tmp_path: Path) -> None:
"""Test montage creation from a directory of images."""
print(f"Result folder: {tmp_path}")
np.random.seed(42)
# Create a directory of images
image_dir = tmp_path / "images"
thumb_size = 20
_create_folder_with_images(image_dir, num_images=4, image_size=thumb_size)
# Create a montage from the directory
file_name = "montage_from_random_thumbs.png"
montage_path = tmp_path / file_name
montage_image = make_montage_from_dir(image_dir, num_cols=2)
# We have 2 columns, so the montage should be 2x the size of the thumbnail, plus a 2px border and space in between
pad = 2
expected_size = 2 * thumb_size + 3 * pad
assert montage_image.size == (expected_size, expected_size)
montage_image.save(montage_path)
assert montage_path.is_file()
expected_file = expected_results_folder() / file_name
if UPDATE_STORED_RESULTS:
shutil.copyfile(montage_path, expected_file)
assert_binary_files_match(montage_path, expected_file)
@pytest.mark.parametrize("use_masks", [True, False])
def test_montage_from_dataset(tmp_path: Path, temp_panda_dataset: PandaDataset, use_masks: bool) -> None:
"""Test if a montage can be generated from a slides dataset that uses masks."""
dataset = dataset_to_records(temp_panda_dataset)
montage = tmp_path / "montage.png"
make_montage(dataset, out_path=montage, width=1000, num_parallel=1, masks=use_masks)
assert montage.is_file()
expected_file = expected_results_folder() / ("montage_with_masks.png" if use_masks else "montage_without_masks.png")
if UPDATE_STORED_RESULTS:
shutil.copyfile(montage, expected_file)
assert_binary_files_match(montage, expected_file)
def test_restrict_dataset() -> None:
column = "image_id"
dataset = pd.DataFrame({column: ["a", "b", "c"]})
included = restrict_dataset(dataset, column, ["a"], include=True)
assert len(included) == 1
assert included.iloc[0][column] == "a"
excluded = restrict_dataset(dataset, column, ["a"], include=False)
assert len(excluded) == 2
assert excluded.iloc[0][column] == "b"
assert excluded.iloc[1][column] == "c"
# Check the case when the requested value is not in the dataset
included2 = restrict_dataset(dataset, column, ["nope"], include=True)
assert len(included2) == 0
excluded2 = restrict_dataset(dataset, column, ["nope"], include=False)
assert len(excluded2) == 3
def test_restrict_dataset_with_index() -> None:
column = "image_id"
index_column = "index"
dataset = pd.DataFrame({column: ["a", "b", "c"], index_column: ["0", "1", "2"]})
dataset = dataset.set_index(index_column)
included = restrict_dataset(dataset, index_column, ["1"], include=True)
assert len(included) == 1
assert included.iloc[0][column] == "b"
excluded = restrict_dataset(dataset, index_column, ["1"], include=False)
assert len(excluded) == 2
assert excluded.iloc[0][column] == "a"
assert excluded.iloc[1][column] == "c"
# Check the case when the requested value is not in the dataset
included2 = restrict_dataset(dataset, column, ["nope"], include=True)
assert len(included2) == 0
excluded2 = restrict_dataset(dataset, column, ["nope"], include=False)
assert len(excluded2) == 3
@pytest.mark.parametrize("exclude_items", [True, False])
def test_montage_included_and_excluded1(
tmp_path: Path,
temp_slides_dataset: SlidesDataset,
exclude_items: bool) -> None:
"""Check that a montage with exclusion list is handled correctly."""
config = MontageCreation()
out_path = tmp_path / "montage"
out_path.mkdir(exist_ok=True)
config.output_path = out_path
config.width = 1000
config.parallel = 1
config.montage_from_included_and_excluded_slides(
temp_slides_dataset,
items=["ID 0", "ID 1"],
exclude_items=exclude_items,
)
expected_file = expected_results_folder() / ("montage_excluded.png" if exclude_items else "montage_included.png")
montage_file = out_path / MONTAGE_FILE
assert montage_file.is_file()
if UPDATE_STORED_RESULTS:
shutil.copyfile(montage_file, expected_file)
assert_binary_files_match(montage_file, expected_file)
def test_montage_included_and_excluded2(tmp_path: Path, temp_slides_dataset: SlidesDataset) -> None:
"""Check that a montage with exclusion list supplies the correct set of images."""
out_path = tmp_path / "montage"
out_path.mkdir(exist_ok=True)
config = MontageCreation()
config.output_path = out_path
config.parallel = 1
config.width = 1000
for exclude_items in [True, False]:
with mock.patch("health_cpath.utils.montage.make_montage") as mock_montage:
montage_file = config.montage_from_included_and_excluded_slides(
temp_slides_dataset,
items=["ID 0", "ID 1"],
exclude_items=exclude_items,
)
assert montage_file is not None
assert mock_montage.call_count == 1
records = mock_montage.call_args_list[0][1]["records"]
assert isinstance(records, List)
slide_ids = sorted([d[SlideKey.SLIDE_ID] for d in records])
if exclude_items:
assert slide_ids == ["ID 2", "ID 3", "ID 4", "ID 5"]
else:
assert slide_ids == ["ID 0", "ID 1"]
def test_dataset_from_folder_unique(tmp_path: Path) -> None:
"""Test if a plain dataframe can be created from files in a folder.
This tests the case where the file names alone are unique."""
file_names = ["file1.txt", "file2.txt"]
full_files = [tmp_path / file_name for file_name in file_names]
for f in full_files:
f.touch()
df = dataset_from_folder(tmp_path)
expected_df = pd.DataFrame({SlideKey.SLIDE_ID: file_names, SlideKey.IMAGE: map(str, full_files)})
assert_frame_equal(df, expected_df)
def test_dataset_from_folder_duplicate_files(tmp_path: Path) -> None:
"""Test if a plain dataframe can be created from files in a folder.
This tests the case where the file names alone are not unique."""
# Place files of the same name in different folders. The dataset should still be created, with the full path
# as the slide ID.
file_names = ["folder1/file.txt", "folder2/file.txt"]
full_files = [tmp_path / file_name for file_name in file_names]
for f in full_files:
f.parent.mkdir(exist_ok=True)
f.touch()
df = dataset_from_folder(tmp_path)
expected_df = pd.DataFrame({SlideKey.SLIDE_ID: file_names, SlideKey.IMAGE: map(str, full_files)})
assert_frame_equal(df, expected_df)
def test_dataset_from_folder_fails(tmp_path: Path) -> None:
"""Test if dataframe creation fails if the argument is not a folder."""
with pytest.raises(ValueError, match="does not exist or is not a directory"):
dataset_from_folder(tmp_path / "file.txt")
def test_montage_from_folder(tmp_path: Path, temp_slides: Path) -> None:
"""Test if a montage can be created from files in a folder."""
dataset = dataset_from_folder(temp_slides, glob_pattern="**/*.tiff")
assert len(dataset) == NUM_SLIDES
config = MontageCreation()
config.output_path = tmp_path
config.width = 1000
config.parallel = 2
result_file = config.montage_from_included_and_excluded_slides(dataset)
assert result_file is not None
assert result_file.is_file()
expected_file = expected_results_folder() / "montage_from_folder.png"
if UPDATE_STORED_RESULTS:
shutil.copyfile(result_file, expected_file)
assert_binary_files_match(result_file, expected_file)
def test_montage_from_folder_full(tmp_path: Path, temp_slides: Path) -> None:
"""Test if a montage can be created from files in a folder, using the commandline entrypoint."""
config = MontageCreation()
config.image_glob_pattern = "**/*.tiff"
config.width = 1000
config.parallel = 1
config.output_path = tmp_path / "outputs"
# Cucim is the only backend that supports TIFF files as created in the test images, openslide fails.
config.backend = "cucim"
config.create_montage(input_folder=temp_slides)
assert (config.output_path / "montage.png").is_file()
def test_montage_fails(tmp_path: Path) -> None:
"""Test if montage creation exits gracefully if files can't be read."""
# Create a single invalid TIFF file. The code will fail when trying to read the images (during thumbnail
# creation), but it should still reach the point where it creates the montage. There, it will fail because
# there are no thumbnails present.
image_file = tmp_path / "image.tiff"
image_file.touch()
config = MontageCreation()
config.image_glob_pattern = "**/*.tiff"
config.width = 1000
config.parallel = 1
config.input_folder = tmp_path
with pytest.raises(ValueError, match="Failed to create montage"):
config.create_montage(input_folder=tmp_path)
def test_montage_no_images(tmp_path: Path) -> None:
"""Test if montage creation fails if no files are present"""
config = MontageCreation()
config.input_folder = tmp_path
config.image_glob_pattern = "**/*.tiff"
with pytest.raises(ValueError, match="No images found"):
config.create_montage(input_folder=tmp_path)
def test_exclusion_list(tmp_path: Path) -> None:
"""Test if exclusion lists are read correctly from a CSV file and passed to the montage creation function."""
config = MontageCreation()
assert config.read_exclusion_list() == []
ids = ["id1"]
exclusion_csv = tmp_path / "exclusion.csv"
exclusion_df = pd.DataFrame({"col1": ids, "col2": ["something else"]})
exclusion_df.to_csv(exclusion_csv, index=False)
config.exclude_by_slide_id = exclusion_csv
assert config.read_exclusion_list() == ids
config.image_glob_pattern = "*.png"
(tmp_path / "image.png").touch()
with mock.patch.object(config, "montage_from_included_and_excluded_slides") as mock_mont:
config.create_montage(input_folder=tmp_path)
assert mock_mont.call_count == 1
assert mock_mont.call_args[1]["items"] == ids
assert mock_mont.call_args[1]["exclude_items"]
def test_raises_if_no_glob(tmp_path: Path) -> None:
"""Test for exception if no file pattern specified."""
config = MontageCreation()
with pytest.raises(ValueError, match="Unable to load dataset"):
config.create_montage(input_folder=tmp_path)
def test_raises_if_no_images(tmp_path: Path) -> None:
"""Test for exception if no file pattern specified."""
config = MontageCreation()
config.image_glob_pattern = "*.png"
with pytest.raises(ValueError, match="No images found in folder"):
config.create_montage(input_folder=tmp_path)
def test_read_dataset_if_csv_present(temp_slides_dataset: SlidesDataset) -> None:
"""Test if a SlidesDataset can be read from a folder that contains a dataset.csv file."""
dataset_path = temp_slides_dataset.root_dir
config = MontageCreation()
dataset = config.read_dataset(dataset_path)
assert isinstance(dataset, SlidesDataset)
def test_read_dataset_fails_if_no_csv_present(tmp_path: Path) -> None:
"""Test behaviour for reading SlidesDataset if not dataset.csv file is present."""
config = MontageCreation()
with pytest.raises(ValueError, match="Unable to load dataset"):
config.read_dataset(tmp_path)
def test_montage_from_slides_dataset(tmp_path: Path, temp_slides_dataset: SlidesDataset) -> None:
"""Test if a montage can be created via SlidesDataset, when the folder contains a dataset.csv file."""
dataset_path = temp_slides_dataset.root_dir
config = MontageCreation()
config.width = 200
config.parallel = 1
outputs = tmp_path / "outputs"
config.output_path = outputs
config.create_montage(input_folder=dataset_path)
montage = outputs / MONTAGE_FILE
assert montage.is_file()
def test_montage_via_args(tmp_path: Path, temp_slides: Path) -> None:
"""Test if montage creation can be invoked correctly via commandline args."""
outputs = tmp_path / "outputs"
with mock.patch("sys.argv",
[
"",
"--dataset", str(temp_slides),
"--image_glob_pattern", "**/*.tiff",
"--output_path", str(outputs),
"--width", "200"
]):
script_main()
montage = outputs / MONTAGE_FILE
assert montage.is_file()
expected_file = expected_results_folder() / "montage_via_args.png"
if UPDATE_STORED_RESULTS:
shutil.copyfile(montage, expected_file)
assert_binary_files_match(montage, expected_file)


@@ -2,7 +2,9 @@
# Copyright (c) Microsoft Corporation. All rights reserved.
# Licensed under the MIT License (MIT). See LICENSE in the repo root for license information.
# ------------------------------------------------------------------------------------------
import logging
import os
import time
from pathlib import Path
from typing import Any, Callable, Collection, Mapping, Sequence
@@ -118,3 +120,18 @@ def run_distributed(fn: Callable[..., None], args: Sequence[Any] = (), world_siz
:param world_size: Total number of distributed subprocesses to spawn.
"""
torch.multiprocessing.spawn(_run_distributed_process, args=(world_size, fn, args), nprocs=world_size)
def wait_until_file_exists(filename: Path, timeout_sec: float = 10.0, sleep_sec: float = 0.1) -> None:
"""Wait until the given file exists. If the file does not exist after the given timeout, an exception is raised.
:param filename: The file to wait for.
:param timeout_sec: The maximum time to wait until the file exists.
:param sleep_sec: The time to sleep between repeated checks if the file exists already.
:raises TimeoutError: If the file does not exist after the given timeout."""
current_time = time.time()
while not filename.exists():
logging.info(f"Waiting for file {filename}. Total wait time so far: {time.time() - current_time} seconds.")
time.sleep(sleep_sec)
if time.time() - current_time > timeout_sec:
raise TimeoutError(f"File {filename} still does not exist after waiting for {timeout_sec} seconds")