Add BCSS dataset #559

Merged
merged 20 commits into from
Jul 16, 2024
Changes from 14 commits
117 changes: 117 additions & 0 deletions configs/vision/dino_vit/online/bcss.yaml
@@ -0,0 +1,117 @@
---
trainer:
  class_path: eva.Trainer
  init_args:
    n_runs: &N_RUNS ${oc.env:N_RUNS, 1}
    default_root_dir: &OUTPUT_ROOT ${oc.env:OUTPUT_ROOT, logs/${oc.env:TIMM_MODEL_NAME, vit_small_patch16_224}/bcss}
    max_steps: &MAX_STEPS ${oc.env:MAX_STEPS, 513}
    log_every_n_steps: 6
    callbacks:
      - class_path: eva.callbacks.ConfigurationLogger
      - class_path: eva.vision.callbacks.SemanticSegmentationLogger
        init_args:
          log_every_n_epochs: 1
          mean: &NORMALIZE_MEAN ${oc.env:NORMALIZE_MEAN, [0.485, 0.456, 0.406]}
          std: &NORMALIZE_STD ${oc.env:NORMALIZE_STD, [0.229, 0.224, 0.225]}
      - class_path: lightning.pytorch.callbacks.ModelCheckpoint
        init_args:
          filename: best
          save_last: true
          save_top_k: 1
          monitor: &MONITOR_METRIC ${oc.env:MONITOR_METRIC, val/MulticlassJaccardIndex}
          mode: &MONITOR_METRIC_MODE ${oc.env:MONITOR_METRIC_MODE, max}
      - class_path: lightning.pytorch.callbacks.EarlyStopping
        init_args:
          min_delta: 0
          patience: 100
          monitor: *MONITOR_METRIC
          mode: *MONITOR_METRIC_MODE
    logger:
      - class_path: lightning.pytorch.loggers.TensorBoardLogger
        init_args:
          save_dir: *OUTPUT_ROOT
          name: ""
model:
  class_path: eva.vision.models.modules.SemanticSegmentationModule
  init_args:
    encoder:
      class_path: eva.vision.models.networks.encoders.TimmEncoder
      init_args:
        model_name: ${oc.env:TIMM_MODEL_NAME, vit_small_patch16_224_dino}
        pretrained: ${oc.env:MODEL_PRETRAINED, true}
        out_indices: ${oc.env:TIMM_MODEL_OUT_INDICES, 1}
        checkpoint_path: &CHECKPOINT_PATH ${oc.env:CHECKPOINT_PATH, null}
        model_arguments:
          dynamic_img_size: true
    decoder:
      class_path: eva.vision.models.networks.decoders.segmentation.ConvDecoderMS
      init_args:
        in_features: ${oc.env:DECODER_IN_FEATURES, 384}
        num_classes: &NUM_CLASSES 6
    criterion:
      class_path: torch.nn.CrossEntropyLoss
      init_args:
        ignore_index: 0
    lr_multiplier_encoder: 0.0
    optimizer:
      class_path: torch.optim.AdamW
      init_args:
        lr: 0.001
        weight_decay: 0.01
    lr_scheduler:
      class_path: torch.optim.lr_scheduler.PolynomialLR
      init_args:
        total_iters: *MAX_STEPS
        power: 0.9
    metrics:
      common:
        - class_path: eva.metrics.AverageLoss
      evaluation:
        - class_path: eva.core.metrics.defaults.MulticlassSegmentationMetrics
          init_args:
            num_classes: *NUM_CLASSES
        - class_path: eva.core.metrics.wrappers.ClasswiseWrapper
          init_args:
            metric:
              class_path: torchmetrics.classification.MulticlassF1Score
              init_args:
                num_classes: *NUM_CLASSES
                average: null
data:
  class_path: eva.DataModule
  init_args:
    datasets:
      train:
        class_path: eva.vision.datasets.BCSS
        init_args: &DATASET_ARGS
          root: ${oc.env:DATA_ROOT, ./data}/bcss
          split: train
          target_mpp: 0.5
          sampler:
            class_path: eva.vision.data.wsi.patching.samplers.GridSampler
            init_args:
              max_samples: 1000
          transforms:
            class_path: eva.vision.data.transforms.common.ResizeAndCrop
            init_args:
              size: ${oc.env:RESIZE_DIM, 224}
              mean: *NORMALIZE_MEAN
              std: *NORMALIZE_STD
      val:
        class_path: eva.vision.datasets.BCSS
        init_args:
          <<: *DATASET_ARGS
          split: val
      test:
        class_path: eva.vision.datasets.BCSS
        init_args:
          <<: *DATASET_ARGS
          split: test
    dataloaders:
      train:
        batch_size: &BATCH_SIZE ${oc.env:BATCH_SIZE, 64}
        shuffle: true
      val:
        batch_size: *BATCH_SIZE
      test:
        batch_size: *BATCH_SIZE
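
Reviewer note on the `lr_scheduler` block in this config: the effect of `PolynomialLR` with `lr: 0.001`, `total_iters: 513` and `power: 0.9` can be worked through numerically. The sketch below evaluates the closed-form polynomial decay, lr(t) = base_lr * (1 - t / total_iters) ** power. It assumes the scheduler is stepped once per training step, which `total_iters: *MAX_STEPS` suggests but this diff does not confirm. `polynomial_lr` is a hypothetical helper written for illustration, not part of eva or PyTorch.

```python
def polynomial_lr(
    step: int,
    base_lr: float = 0.001,
    total_iters: int = 513,
    power: float = 0.9,
) -> float:
    """Closed-form polynomial decay (hypothetical helper using the config values)."""
    # Once `total_iters` is reached, the decay factor stays at zero.
    clamped = min(step, total_iters)
    return base_lr * (1.0 - clamped / total_iters) ** power


# Print the learning rate at a few points of the 513-step run.
for step in (0, 128, 256, 384, 513):
    print(f"step {step:3d}: lr = {polynomial_lr(step):.6f}")
```

With `power` just below 1 the decay is close to linear, so the rate starts at 0.001 and reaches zero at the final step.
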
59 changes: 59 additions & 0 deletions docs/datasets/bcss.md
@@ -0,0 +1,59 @@
# BCSS

The BCSS (Breast Cancer Semantic Segmentation) dataset consists of extracts from 151 WSIs from [TCGA](https://www.cancer.gov/ccg/research/genome-sequencing/tcga), containing over 20,000 segmentation annotations covering 21 different tissue types.


## Raw data

### Key stats

| | |
|-----------------------|-----------------------------------------------------------|
| **Modality** | Vision (WSI extracts) |
| **Task** | Segmentation - 22 classes (tissue types)|
| **Data size** | total: ~5GB |
| **Image dimension** | ~1000-3000 x ~1000-3000 x 3 |
| **Magnification (μm/px)** | 40x (0.25) |
| **Files format**      | `.png` images / `.png` segmentation masks                 |
| **Number of images** | 151 |
| **Splits in use** | Train, Val and Test |


### Organization

The data is organized as follows:

```
bcss
├── rgbs_colorNormalized # wsi images
│ ├── TCGA-*.png
├── masks # segmentation masks
│ ├── TCGA-*.png # same filenames as images
```
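
As a rough illustration of this layout, the sketch below pairs each image with its mask by filename. It relies only on the folder names shown above. `pair_images_and_masks` is a hypothetical helper, not part of the `BCSS` dataset class.

```python
from pathlib import Path
from typing import List, Tuple


def pair_images_and_masks(root: str) -> List[Tuple[Path, Path]]:
    """Pairs every image with the mask of the same filename (hypothetical helper)."""
    images_dir = Path(root) / "rgbs_colorNormalized"
    masks_dir = Path(root) / "masks"

    pairs = []
    for image_path in sorted(images_dir.glob("TCGA-*.png")):
        # Masks are expected to use the same filenames as the images.
        mask_path = masks_dir / image_path.name
        if not mask_path.is_file():
            raise FileNotFoundError(f"Missing mask for {image_path.name}")
        pairs.append((image_path, mask_path))
    return pairs
```

Calling `pair_images_and_masks("./data/bcss")` after the manual download is a quick way to check that no mask is missing.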

## Download and preprocessing

The `BCSS` dataset class doesn't download the data at runtime; the data must be downloaded manually from the links provided [here](https://drive.google.com/drive/folders/1zqbdkQF8i5cEmZOGmbdQm-EP8dRYtvss?usp=sharing).

Although the original images have a resolution of 0.25 microns per pixel (mpp), we extract patches at 0.5 mpp for evaluation. This is because using the original resolution with common foundation model patch sizes (e.g. 224x224 pixels) would result in regions that are too small, leading to less expressive segmentation masks and unnecessarily complicating the task.
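
To make the resolution argument concrete, the following sketch computes the physical extent covered by a 224x224 patch at both resolutions. The helper name `patch_extent_um` is an assumption made for this illustration only.

```python
def patch_extent_um(patch_size_px: int, mpp: float) -> float:
    """Physical side length (in microns) covered by a square patch."""
    return patch_size_px * mpp


native = patch_extent_um(224, 0.25)    # original resolution
resampled = patch_extent_um(224, 0.5)  # resolution used for evaluation
print(native, resampled)  # 56.0 112.0

# Doubling the side length quadruples the tissue area seen per patch.
area_ratio = (resampled / native) ** 2
print(area_ratio)  # 4.0
```

At 0.5 mpp each patch therefore covers a 112 µm x 112 µm region instead of 56 µm x 56 µm.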


### Splits

The authors of the dataset propose a train / test split. Additionally, we split the train set into train / val using a random split with a 0.8 / 0.2 ratio.

TODO:
| Splits | Train | Validation | Test |
|----------|-------------|-------------|------------|
| #Samples | 84 (55.6%) | 22 (14.6%) | 45 (29.8%)|
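
As a consistency check on the table, the sketch below reproduces the counts. It assumes the 45 test images are taken as given and that the remaining images are divided with a floor-based 0.8 / 0.2 split; that is an assumption about how the numbers were obtained, not something the dataset authors state.

```python
import math

n_total = 151
n_test = 45
n_trainval = n_total - n_test  # images left for train + val

# Floor-based split: train gets floor(0.8 * n), val gets the remainder.
n_train = math.floor(0.8 * n_trainval)
n_val = n_trainval - n_train

for name, count in (("train", n_train), ("val", n_val), ("test", n_test)):
    print(f"{name}: {count} ({100 * count / n_total:.1f}%)")
```

Under these assumptions the script yields 84 / 22 / 45 samples, matching the table.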


## Relevant links

* [Dataset Repo](https://github.com/PathologyDataScience/BCSS)
* [Breast Cancer Segmentation Grand Challenge](https://bcsegmentation.grand-challenge.org)
* [Google Drive Download Link for 0.25 mpp version](https://drive.google.com/drive/folders/1zqbdkQF8i5cEmZOGmbdQm-EP8dRYtvss?usp=sharing)

## License

The BCSS dataset is held under the [CC0 1.0 UNIVERSAL](https://creativecommons.org/publicdomain/zero/1.0/) license.
1 change: 1 addition & 0 deletions mkdocs.yml
@@ -87,6 +87,7 @@ nav:
         - PatchCamelyon: datasets/patch_camelyon.md
         - MoNuSAC: datasets/monusac.md
         - CoNSeP: datasets/consep.md
+        - BCSS: datasets/bcss.md
       - Slide-level:
         - Camelyon16: datasets/camelyon16.md
         - PANDA: datasets/panda.md
3 changes: 2 additions & 1 deletion src/eva/core/data/splitting/__init__.py
@@ -1,5 +1,6 @@
 """Dataset splitting API."""

+from eva.core.data.splitting.random import random_split
 from eva.core.data.splitting.stratified import stratified_split

-__all__ = ["stratified_split"]
+__all__ = ["random_split", "stratified_split"]
41 changes: 41 additions & 0 deletions src/eva/core/data/splitting/random.py
@@ -0,0 +1,41 @@
"""Functions for random splitting."""

from typing import Any, List, Sequence, Tuple

import numpy as np


def random_split(
    samples: Sequence[Any],
    train_ratio: float,
    val_ratio: float,
    test_ratio: float = 0.0,
    seed: int = 42,
) -> Tuple[List[int], List[int], List[int] | None]:
    """Splits the samples into random train, validation, and test (optional) sets.

    Args:
        samples: The samples to split.
        train_ratio: The ratio of the training set.
        val_ratio: The ratio of the validation set.
        test_ratio: The ratio of the test set (optional).
        seed: The seed for reproducibility.

    Returns:
        The indices of the train, validation, and test sets (test is `None` if `test_ratio` is 0).
    """
    if not np.isclose(train_ratio + val_ratio + (test_ratio or 0), 1.0):  # tolerate rounding
        raise ValueError("The sum of the ratios must be equal to 1.")

    np.random.seed(seed)
    n_samples = len(samples)
    indices = np.random.permutation(n_samples)

    n_train = int(np.floor(train_ratio * n_samples))
    n_val = n_samples - n_train if test_ratio == 0.0 else int(np.floor(val_ratio * n_samples)) or 1

    train_indices = list(indices[:n_train])
    val_indices = list(indices[n_train : n_train + n_val])
    test_indices = list(indices[n_train + n_val :]) if test_ratio > 0.0 else None

    return train_indices, val_indices, test_indices
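
Reviewer note: a standalone sketch of the same splitting logic may help when reasoning about edge cases. The version below uses only the standard library. `random_split_sketch` is a hypothetical stand-in rather than the function added in this PR, and its shuffle will not reproduce NumPy's permutation for the same seed. It also shows why an exact `!= 1` comparison on the ratios is fragile: `0.7 + 0.2 + 0.1` evaluates to `0.9999999999999999` in floating point.

```python
import math
import random
from typing import Any, List, Optional, Sequence, Tuple


def random_split_sketch(
    samples: Sequence[Any],
    train_ratio: float,
    val_ratio: float,
    test_ratio: float = 0.0,
    seed: int = 42,
) -> Tuple[List[int], List[int], Optional[List[int]]]:
    """Stdlib-only stand-in for `random_split` (hypothetical, for illustration)."""
    # Tolerant check, since sums of decimal ratios are rarely exact in floats.
    if not math.isclose(train_ratio + val_ratio + test_ratio, 1.0):
        raise ValueError("The sum of the ratios must be equal to 1.")

    n_samples = len(samples)
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # local RNG, global state untouched

    n_train = math.floor(train_ratio * n_samples)
    if test_ratio == 0.0:
        n_val = n_samples - n_train  # val takes everything that is left
    else:
        n_val = math.floor(val_ratio * n_samples) or 1  # at least one val sample

    train = indices[:n_train]
    val = indices[n_train : n_train + n_val]
    test = indices[n_train + n_val :] if test_ratio > 0.0 else None
    return train, val, test


train, val, test = random_split_sketch(list(range(10)), 0.8, 0.2)
print(len(train), len(val), test)  # 8 2 None
```

The test set is `None` rather than an empty list when `test_ratio` is 0, so callers need to handle both return shapes.
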
2 changes: 2 additions & 0 deletions src/eva/vision/data/datasets/__init__.py
@@ -10,6 +10,7 @@
     WsiClassificationDataset,
 )
 from eva.vision.data.datasets.segmentation import (
+    BCSS,
     CoNSeP,
     EmbeddingsSegmentationDataset,
     ImageSegmentation,
@@ -21,6 +22,7 @@

 __all__ = [
     "BACH",
+    "BCSS",
     "CRC",
     "MHIST",
     "PANDA",
4 changes: 0 additions & 4 deletions src/eva/vision/data/datasets/classification/camelyon16.py
@@ -190,10 +190,6 @@ def validate(self) -> None:
             first_and_last_labels=("normal", "tumor"),
         )

-    @override
-    def filename(self, index: int) -> str:
-        return os.path.basename(self._file_paths[self._get_dataset_idx(index)])
-
     @override
     def __getitem__(self, index: int) -> Tuple[tv_tensors.Image, torch.Tensor, Dict[str, Any]]:
         return base.ImageClassification.__getitem__(self, index)
6 changes: 1 addition & 5 deletions src/eva/vision/data/datasets/classification/panda.py
@@ -116,10 +116,6 @@ def validate(self) -> None:
             first_and_last_labels=("0", "5"),
         )

-    @override
-    def filename(self, index: int) -> str:
-        return os.path.basename(self._file_paths[self._get_dataset_idx(index)])
-
     @override
     def __getitem__(self, index: int) -> Tuple[tv_tensors.Image, torch.Tensor, Dict[str, Any]]:
         return base.ImageClassification.__getitem__(self, index)
@@ -169,7 +165,7 @@ def _load_file_paths(self, split: Literal["train", "val", "test"] | None = None)
             case None:
                 return file_paths
             case _:
-                raise ValueError("Invalid split. Use 'train', 'val' or `None`.")
+                raise ValueError("Invalid split. Use 'train', 'val', 'test' or `None`.")

     def _filter_noisy_labels(self, file_paths: List[str]):
         is_noisy_filter = self.annotations["noise_ratio_10"] == 0
2 changes: 1 addition & 1 deletion src/eva/vision/data/datasets/classification/wsi.py
@@ -78,7 +78,7 @@ def __getitem__(self, index: int) -> Tuple[tv_tensors.Image, torch.Tensor, Dict[
         return base.ImageClassification.__getitem__(self, index)

     @override
-    def load_image(self, index: int) -> np.ndarray:
+    def load_image(self, index: int) -> tv_tensors.Image:
         return wsi.MultiWsiDataset.__getitem__(self, index)

     @override
2 changes: 2 additions & 0 deletions src/eva/vision/data/datasets/segmentation/__init__.py
@@ -1,13 +1,15 @@
 """Segmentation datasets API."""

 from eva.vision.data.datasets.segmentation.base import ImageSegmentation
+from eva.vision.data.datasets.segmentation.bcss import BCSS
 from eva.vision.data.datasets.segmentation.consep import CoNSeP
 from eva.vision.data.datasets.segmentation.embeddings import EmbeddingsSegmentationDataset
 from eva.vision.data.datasets.segmentation.monusac import MoNuSAC
 from eva.vision.data.datasets.segmentation.total_segmentator_2d import TotalSegmentator2D

 __all__ = [
     "ImageSegmentation",
+    "BCSS",
     "CoNSeP",
     "EmbeddingsSegmentationDataset",
     "MoNuSAC",
38 changes: 38 additions & 0 deletions src/eva/vision/data/datasets/segmentation/_utils.py
@@ -0,0 +1,38 @@
from typing import Any, Tuple

import numpy.typing as npt

from eva.vision.data.datasets import wsi


def get_coords_at_index(
    dataset: wsi.MultiWsiDataset, index: int
) -> Tuple[Tuple[int, int], int, int]:
    """Returns the coordinates ((x,y),width,height) of the patch at the given index.

    Args:
        dataset: The WSI dataset instance.
        index: The sample index.
    """
    image_index = dataset._get_dataset_idx(index)
    patch_index = index if image_index == 0 else index - dataset.cumulative_sizes[image_index - 1]
    wsi_dataset = dataset.datasets[image_index]
    if isinstance(wsi_dataset, wsi.WsiDataset):
        coords = wsi_dataset._coords
        return coords.x_y[patch_index], coords.width, coords.height
    else:
        raise Exception(f"Expected WsiDataset, got {type(wsi_dataset)}")


def extract_mask_patch(
    mask: npt.NDArray[Any], dataset: wsi.MultiWsiDataset, index: int
) -> npt.NDArray[Any]:
    """Reads the mask patch at the coordinates corresponding to the dataset index.

    Args:
        mask: The mask array.
        dataset: The WSI dataset instance.
        index: The sample index.
    """
    (x, y), width, height = get_coords_at_index(dataset, index)
    return mask[y : y + height, x : x + width]
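
Reviewer note: the index arithmetic in `get_coords_at_index` can be checked in isolation. The sketch below assumes that `_get_dataset_idx` behaves like the `bisect_right` lookup over `cumulative_sizes` used by `torch.utils.data.ConcatDataset`; this diff does not show its implementation. `locate_patch` and `crop_mask` are hypothetical helpers, and nested lists stand in for the NumPy mask.

```python
import bisect
from itertools import accumulate
from typing import List, Sequence, Tuple


def locate_patch(cumulative_sizes: Sequence[int], index: int) -> Tuple[int, int]:
    """Maps a global sample index to (image index, patch index within that image)."""
    image_index = bisect.bisect_right(cumulative_sizes, index)
    patch_index = index if image_index == 0 else index - cumulative_sizes[image_index - 1]
    return image_index, patch_index


def crop_mask(mask: List[List[int]], x: int, y: int, width: int, height: int) -> List[List[int]]:
    """List-based equivalent of `mask[y : y + height, x : x + width]` (rows first)."""
    return [row[x : x + width] for row in mask[y : y + height]]


# Three images contributing 3, 2 and 4 patches respectively.
patches_per_image = [3, 2, 4]
cumulative = list(accumulate(patches_per_image))  # [3, 5, 9]
print([locate_patch(cumulative, i) for i in (0, 2, 3, 8)])
# [(0, 0), (0, 2), (1, 0), (2, 3)]

# A 3x4 toy mask whose values encode (row, column) as 10 * row + column.
mask = [[10 * r + c for c in range(4)] for r in range(3)]
print(crop_mask(mask, x=1, y=1, width=2, height=2))
# [[11, 12], [21, 22]]
```

Note that the mask is indexed as `[y, x]` (rows, then columns), while the coordinates are stored as `(x, y)`.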