BAPLe Instructions for Dataset Preparation

This document explains how to prepare the datasets for training and testing the models. The datasets used in the BAPLe project are as follows:

COVID, RSNA18, MIMIC, Kather, PanNuke, DigestPath

The general structure of a dataset is as follows:

med-datasets/
    |── dataset-name/
        |── images/
            |── train/
            |── test/
        |── classnames.txt

where dataset-name is the name of the dataset, train and test are the directories containing the training and testing images, respectively, and the classnames.txt text file maps each class folder name to its actual class name. The train and test directories each contain one sub-directory per class, which holds the images for that class. An example structure of the train and test directories is as follows:

|── train/
    |── class_1/
        |── image_1.jpg
        |── image_2.jpg
        |── ...
    |── class_2/
        |── image_1.jpg
        |── image_2.jpg
        |── ...
    .
    .
    .

    |── class_N/
        |── image_1.jpg
        |── image_2.jpg
        |── ...


|── test/
    |── class_1/
        |── image_1.jpg
        |── image_2.jpg
        |── ...
    |── class_2/
        |── image_1.jpg
        |── image_2.jpg
        |── ...
    .
    .
    .

    |── class_N/
        |── image_1.jpg
        |── image_2.jpg
        |── ...
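
Once a dataset has been prepared, the layout above can be sanity-checked by counting the images in each class folder. The following is a minimal sketch, not part of the BAPLe code: the function name `count_images` and the set of accepted file extensions are assumptions made for illustration.

```python
from pathlib import Path

# Extensions treated as images; this set is an assumption, adjust as needed.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".tif", ".tiff"}


def count_images(dataset_root):
    """Return {split: {class_folder: image_count}} for images/train and images/test."""
    counts = {}
    for split in ("train", "test"):
        split_dir = Path(dataset_root) / "images" / split
        if not split_dir.is_dir():
            raise FileNotFoundError(f"missing split directory: {split_dir}")
        counts[split] = {
            class_dir.name: sum(
                1
                for f in class_dir.iterdir()
                if f.is_file() and f.suffix.lower() in IMAGE_EXTENSIONS
            )
            for class_dir in sorted(split_dir.iterdir())
            if class_dir.is_dir()
        }
    return counts
```

For example, `count_images("med-datasets/dataset-name")` returns a nested dictionary keyed by split and class folder, which makes empty or missing class folders easy to spot.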

We used the following datasets in our experiments and provide instructions on how to prepare each of them:

| Dataset | Type | Classes |
|---|---|---|
| COVID | X-ray | 2 |
| RSNA18 | X-ray | 3 |
| MIMIC | X-ray | 5 |
| Kather | Histopathology | 9 |
| PanNuke | Histopathology | 2 |
| DigestPath | Histopathology | 2 |

TO DO

Add information about [Dataset Class Python File, Transformations, Data Loaders]

COVID

Download the dataset from the following Kaggle link:

COVID-19 Image Data Collection

After downloading the dataset, extract the files and move the images to the appropriate directories by running the following commands:

unzip archive.zip
mv COVID-19_Radiography_Dataset covid
cd covid
mkdir images
mkdir images/all-images

mkdir images/all-images/covid
mv COVID/images/* images/all-images/covid
rm -rf COVID

mkdir images/all-images/normal
mv Normal/images/* images/all-images/normal
rm -rf Normal

mkdir images/all-images/lung_opacity
mv Lung_Opacity/images/* images/all-images/lung_opacity
rm -rf Lung_Opacity

mkdir images/all-images/viral_pneumonia
mv Viral\ Pneumonia/images/* images/all-images/viral_pneumonia
rm -rf Viral\ Pneumonia

mkdir images/train
mkdir images/train/covid
mkdir images/train/normal
mkdir images/test
mkdir images/test/covid
mkdir images/test/normal

Download the train_test_split_covid.py file from here and place it in the main covid folder. Run the following command to split the dataset into training and testing sets:
```
python train_test_split_covid.py
```
Download the classnames.txt file from here and place it in the main covid folder.
Move the covid folder to the med-datasets directory.
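
The contents of train_test_split_covid.py are not reproduced in this document. As a rough illustration of what a per-class split from images/all-images into images/train and images/test could look like, here is a minimal sketch. The function name `split_class`, the 20% test fraction, and the fixed seed are assumptions made for illustration and may differ from what the actual script does.

```python
import random
import shutil
from pathlib import Path


def split_class(src_dir, train_dir, test_dir, test_fraction=0.2, seed=0):
    """Copy files from src_dir into train_dir and test_dir; return (n_train, n_test)."""
    files = sorted(p for p in Path(src_dir).iterdir() if p.is_file())
    random.Random(seed).shuffle(files)  # seeded so the split is reproducible
    n_test = int(len(files) * test_fraction)  # hypothetical ratio, not from the source
    for i, path in enumerate(files):
        # The first n_test shuffled files go to the test set, the rest to training.
        dest = Path(test_dir if i < n_test else train_dir)
        dest.mkdir(parents=True, exist_ok=True)
        shutil.copy2(path, dest / path.name)
    return len(files) - n_test, n_test
```

It would be called once per class, for example `split_class("images/all-images/covid", "images/train/covid", "images/test/covid")`. Splitting each class separately keeps the class ratio roughly equal between the two sets.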

RSNA18

Download the dataset from the following Kaggle link:

RSNA18 Challenge Dataset

After downloading the dataset, install the required packages, extract the files, and move the raw data into an unprocessed directory by running the following commands:

pip install pydicom==2.4.4
pip install pandas==2.2.2
pip install scikit-learn==1.5.1

unzip rsna-pneumonia-detection-challenge.zip
mv rsna-pneumonia-detection-challenge rsna18
cd rsna18

mkdir unprocessed
mv ./*.txt unprocessed
mv ./*.csv unprocessed
mv ./stage_2_train_images unprocessed
mv ./stage_2_test_images unprocessed

Download the train_test_split_rsna18.py file from here and place it in the main rsna18 folder. Run the following command to split the dataset into training and testing sets:
```
python train_test_split_rsna18.py
```
Download the classnames.txt file from here and place it in the main rsna18 folder.
Move the rsna18 folder to the med-datasets directory.
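
The split script itself is not shown here. One step it presumably needs is a lookup from each patient to one of the three classes. The sketch below illustrates that step only, under the assumption that the Kaggle archive ships a CSV (such as stage_2_detailed_class_info.csv) with `patientId` and `class` columns. The file name, column names, and function name are assumptions, not details taken from this document.

```python
import csv


def load_patient_classes(csv_path, id_column="patientId", class_column="class"):
    """Map each patient ID to its class label, collapsing duplicate rows."""
    labels = {}
    with open(csv_path, newline="") as handle:
        for row in csv.DictReader(handle):
            patient, label = row[id_column], row[class_column]
            # A patient can appear on several rows (e.g. one per annotation);
            # flag conflicting labels instead of silently overwriting.
            if labels.setdefault(patient, label) != label:
                raise ValueError(f"conflicting labels for {patient}")
    return labels
```

With the layout produced by the commands above, the CSV would sit in the unprocessed directory, so a call might look like `load_patient_classes("unprocessed/stage_2_detailed_class_info.csv")`.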

MIMIC

To be updated soon.

Kather

Download the dataset from the following links:

NCT-CRC-HE-100K.zip

CRC-VAL-HE-7K.zip

After downloading the dataset, extract the files and move the images to the appropriate directories by running the following commands:

unzip NCT-CRC-HE-100K.zip
unzip CRC-VAL-HE-7K.zip
mv NCT-CRC-HE-100K train
mv CRC-VAL-HE-7K test

mkdir kather
mkdir kather/images
mv train kather/images/
mv test kather/images/

pip install multiprocess==0.70.16

Download the process_kather.py file from here and place it in the main kather folder. Then run the following command to process the dataset:
```
python process_kather.py
```
Download the classnames.txt file from here and place it in the main kather folder.
Move the kather folder to the med-datasets directory.
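
Because the Kather train and test sets come from two separate archives, it can be worth confirming that both expose the same class folders before training. The following is a minimal sketch of such a check. The function name `class_folder_mismatch` is hypothetical and the check is not part of process_kather.py.

```python
from pathlib import Path


def class_folder_mismatch(images_dir):
    """Return (only_in_train, only_in_test) as sorted lists of class folder names."""

    def class_names(split):
        return {p.name for p in (Path(images_dir) / split).iterdir() if p.is_dir()}

    train, test = class_names("train"), class_names("test")
    return sorted(train - test), sorted(test - train)
```

For a correctly prepared dataset, `class_folder_mismatch("kather/images")` should return two empty lists.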

PanNuke

Download the dataset (Fold-1, Fold-2, Fold-3) from the following link:

PanNuke Dataset for Nuclei Instance Segmentation and Classification

After downloading the dataset, extract the files and move the images to the appropriate directories by running the following commands:

mkdir pannuke
unzip fold_1.zip -d ./pannuke
unzip fold_2.zip -d ./pannuke
unzip fold_3.zip -d ./pannuke

cd pannuke

mkdir images
mkdir images/train
mkdir images/train/benign
mkdir images/train/malignant
mkdir images/test
mkdir images/test/benign
mkdir images/test/malignant

pip install multiprocess==0.70.16

Download the process_pannuke.py file from here and place it in the main pannuke folder. Then run the following command to process the dataset:
```
python process_pannuke.py
```
Download the train_test_split_pannuke.py file from here and place it in the main pannuke folder. Run the following command to split the dataset into training and testing sets:
```
python train_test_split_pannuke.py
```
Download the classnames.txt file from here and place it in the main pannuke folder.
Move the pannuke folder to the med-datasets directory.

Note: The Python script process_pannuke.py is adapted from the PLIP Validation Dataset source.
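
The sequence of mkdir commands used for PanNuke (and, with the same class names, for DigestPath) creates one folder per split and class under images. The same skeleton can be produced from Python, as in the sketch below. The helper name `make_split_dirs` is hypothetical and is offered only as a convenience, not as part of the BAPLe scripts.

```python
from pathlib import Path


def make_split_dirs(dataset_root, class_names, splits=("train", "test")):
    """Create images/<split>/<class>/ under dataset_root; return the created paths."""
    created = []
    for split in splits:
        for name in class_names:
            path = Path(dataset_root) / "images" / split / name
            path.mkdir(parents=True, exist_ok=True)  # behaves like `mkdir -p`
            created.append(path)
    return created
```

Calling `make_split_dirs("pannuke", ["benign", "malignant"])` yields the same four class folders as the shell commands, and it is safe to re-run because existing folders are left untouched.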

DigestPath

Download the dataset from the following Google Drive link:

DigestPath Dataset - 2019

After downloading the dataset, extract the files and move the images to the appropriate directories by running the following commands:

mkdir digestpath
unzip tissue-train-neg.zip -d ./digestpath
unzip tissue-train-pos-v1.zip -d ./digestpath

cd digestpath

mkdir images
mkdir images/train
mkdir images/train/benign
mkdir images/train/malignant
mkdir images/test
mkdir images/test/benign
mkdir images/test/malignant

pip install multiprocess==0.70.16

Download the process_digestpath.py file from here and place it in the main digestpath folder. Then run the following commands, in order, to process the dataset:
```
python process_digestpath.py --step 1
python process_digestpath.py --step 2
python process_digestpath.py --step 3
```
Download the train_test_split_digestpath.py file from here and place it in the main digestpath folder. Run the following command to split the dataset into training and testing sets:
```
python train_test_split_digestpath.py
```
Download the classnames.txt file from here and place it in the main digestpath folder.
Move the digestpath folder to the med-datasets directory.

Note: The Python script process_digestpath.py is adapted from the PLIP Validation Dataset source.
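
The three processing steps for DigestPath are run one after another, and a later step presumably depends on the output of the earlier ones. A small wrapper can run them in sequence and stop at the first failure. The sketch below assumes only the `--step` flag shown in the commands above. The function names `step_commands` and `run_steps` are hypothetical.

```python
import subprocess
import sys


def step_commands(script="process_digestpath.py", steps=(1, 2, 3)):
    """Build one command per processing step, in order."""
    return [[sys.executable, script, "--step", str(step)] for step in steps]


def run_steps(commands):
    """Run commands sequentially; check=True raises CalledProcessError on failure."""
    for command in commands:
        subprocess.run(command, check=True)
```

From the main digestpath folder, `run_steps(step_commands())` is equivalent to typing the three commands by hand, except that a non-zero exit code from any step aborts the remaining ones.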

Acknowledgement

This file is prepared by BAPLe.
