# BAPLe: Instructions for Dataset Preparation
This document provides instructions on how to prepare the datasets for training and testing the models. The datasets used in the BAPLe project are as follows:
- COVID
- RSNA18
- MIMIC
- Kather
- PanNuke
- DigestPath
The general structure of a dataset is as follows:
```
med-datasets/
└── dataset-name/
    ├── images/
    │   ├── train/
    │   └── test/
    └── classnames.txt
```
where `dataset-name` is the name of the dataset, `train` and `test` are the directories containing the training and testing images, respectively, and `classnames.txt` is a text file listing the class folder names and their corresponding actual class names. The `train` and `test` directories contain one sub-directory per class, which holds the images for that class. An example structure of the `train` and `test` directories is as follows:
```
├── train/
│   ├── class_1/
│   │   ├── image_1.jpg
│   │   ├── image_2.jpg
│   │   └── ...
│   ├── class_2/
│   │   ├── image_1.jpg
│   │   ├── image_2.jpg
│   │   └── ...
│   ├── ...
│   └── class_N/
│       ├── image_1.jpg
│       ├── image_2.jpg
│       └── ...
└── test/
    ├── class_1/
    │   ├── image_1.jpg
    │   ├── image_2.jpg
    │   └── ...
    ├── class_2/
    │   ├── image_1.jpg
    │   ├── image_2.jpg
    │   └── ...
    ├── ...
    └── class_N/
        ├── image_1.jpg
        ├── image_2.jpg
        └── ...
```
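The `classnames.txt` file pairs each class folder name with its human-readable class name. As an illustration only — the exact file format is an assumption here, and the actual per-dataset files are provided below — such a mapping could be loaded like this:

```python
from pathlib import Path

def load_classnames(path):
    """Parse a classnames.txt file where each line holds a folder name
    followed by the actual class name, e.g. 'covid COVID-19'.
    (The one-entry-per-line, space-separated format is assumed.)"""
    mapping = {}
    for line in Path(path).read_text().splitlines():
        if not line.strip():
            continue  # skip blank lines
        folder, _, name = line.partition(" ")
        mapping[folder] = name.strip()
    return mapping
```

Calling `load_classnames("covid/classnames.txt")` would then return a dict such as `{"covid": "COVID-19", "normal": "Normal"}`, which a dataset class can use to turn folder names into display labels.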
We have used the following datasets in our experiments and provide instructions on how to prepare them:
| Dataset | Type | Classes |
|---|---|---|
| COVID | X-ray | 2 |
| RSNA18 | X-ray | 3 |
| MIMIC | X-ray | 5 |
| Kather | Histopathology | 9 |
| PanNuke | Histopathology | 2 |
| DigestPath | Histopathology | 2 |
TODO: Add information about the dataset class Python file, transformations, and data loaders.
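Until that section is written, here is a minimal sketch of how a dataset class might index the directory layout above. It is pure Python with no framework dependency, and it assumes class indices are assigned alphabetically by folder name — an assumption for illustration, not necessarily what the project's loader does:

```python
from pathlib import Path

class ImageFolderIndex:
    """Minimal sketch of a dataset over the train/ or test/ layout above:
    one sub-directory per class, collected as (image_path, class_index) pairs."""

    def __init__(self, root):
        root = Path(root)
        # one sub-directory per class, sorted for a stable index assignment
        self.classes = sorted(d.name for d in root.iterdir() if d.is_dir())
        self.class_to_idx = {c: i for i, c in enumerate(self.classes)}
        self.samples = [
            (p, self.class_to_idx[c])
            for c in self.classes
            for p in sorted((root / c).iterdir()) if p.is_file()
        ]

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, i):
        # a real dataset class would open the image and apply transformations here
        return self.samples[i]
```

A real implementation would additionally decode each image and apply the training or testing transformations in `__getitem__`.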
## COVID

- Download the dataset from the following Kaggle link:

- After downloading the dataset, extract the files and move the images to the appropriate directories by running the following commands:

  ```shell
  unzip archive.zip
  mv COVID-19_Radiography_Dataset covid
  cd covid
  mkdir images
  mkdir images/all-images
  mkdir images/all-images/covid
  mv COVID/images/* images/all-images/covid
  rm -rf COVID
  mkdir images/all-images/normal
  mv Normal/images/* images/all-images/normal
  rm -rf Normal
  mkdir images/all-images/lung_opacity
  mv Lung_Opacity/images/* images/all-images/lung_opacity
  rm -rf Lung_Opacity
  mkdir images/all-images/viral_pneumonia
  mv Viral\ Pneumonia/images/* images/all-images/viral_pneumonia
  rm -rf Viral\ Pneumonia
  mkdir images/train
  mkdir images/train/covid
  mkdir images/train/normal
  mkdir images/test
  mkdir images/test/covid
  mkdir images/test/normal
  ```

- Download the `train_test_split_covid.py` file from here and place it in the main `covid` folder. Run the following command to split the dataset into training and testing sets:

  ```shell
  python train_test_split_covid.py
  ```

- Download the `classnames.txt` file from here and place it in the main `covid` folder.

- Move the `covid` folder to the `med-datasets` directory.
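After the split, it can be worth sanity-checking the result. The helper below is hypothetical (not part of the repository); it simply counts the images per class in each split:

```python
from pathlib import Path

def count_split(images_dir):
    """Count image files per class under <images_dir>/train and <images_dir>/test."""
    counts = {}
    for split in ("train", "test"):
        split_dir = Path(images_dir) / split
        counts[split] = {
            d.name: sum(1 for f in d.iterdir() if f.is_file())
            for d in sorted(split_dir.iterdir()) if d.is_dir()
        }
    return counts
```

For example, `count_split("covid/images")` returns a dict like `{"train": {"covid": ..., "normal": ...}, "test": {...}}`, so empty or missing class folders are easy to spot before training.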
## RSNA18

- Download the dataset from the following Kaggle link:

- After downloading the dataset, extract the files and move the images to the appropriate directories by running the following commands:

  ```shell
  pip install pydicom==2.4.4
  pip install pandas==2.2.2
  pip install scikit-learn==1.5.1
  unzip rsna-pneumonia-detection-challenge.zip
  mv rsna-pneumonia-detection-challenge rsna18
  cd rsna18
  mkdir unprocessed
  mv ./*.txt unprocessed
  mv ./*.csv unprocessed
  mv ./stage_2_train_images unprocessed
  mv ./stage_2_test_images unprocessed
  ```

- Download the `train_test_split_rsna18.py` file from here and place it in the main `rsna18` folder. Run the following command to split the dataset into training and testing sets:

  ```shell
  python train_test_split_rsna18.py
  ```

- Download the `classnames.txt` file from here and place it in the main `rsna18` folder.

- Move the `rsna18` folder to the `med-datasets` directory.
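The repository's split scripts should be used as-is. Purely to illustrate what such a script conceptually does, here is a sketch that shuffles one class folder and moves a fixed fraction of its images into a test directory; the 80/20 ratio, the seed, and the folder names are assumptions for illustration, not the actual scripts' behaviour:

```python
import random
import shutil
from pathlib import Path

def split_class(src_dir, train_dir, test_dir, test_frac=0.2, seed=0):
    """Shuffle one class folder deterministically, then move a test_frac
    share of its files into test_dir and the remainder into train_dir."""
    files = sorted(p for p in Path(src_dir).iterdir() if p.is_file())
    random.Random(seed).shuffle(files)  # fixed seed for a reproducible split
    n_test = int(len(files) * test_frac)
    for i, p in enumerate(files):
        dest = Path(test_dir) if i < n_test else Path(train_dir)
        dest.mkdir(parents=True, exist_ok=True)
        shutil.move(str(p), dest / p.name)
```

Running this once per class folder (e.g. over `images/all-images/*`) yields the `images/train/` and `images/test/` layout described at the top of this document.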
## MIMIC

To be updated soon.
## Kather

- Download the dataset from the following links:

- After downloading the dataset, extract the files and move the images to the appropriate directories by running the following commands:

  ```shell
  unzip NCT-CRC-HE-100K.zip
  unzip CRC-VAL-HE-7K.zip
  mv NCT-CRC-HE-100K train
  mv CRC-VAL-HE-7K test
  mkdir kather
  mkdir kather/images
  mv train kather/images/
  mv test kather/images/
  pip install multiprocess==0.70.16
  ```

- Download the `process_kather.py` file from here and place it in the main `kather` folder. After this, run the following command to process the dataset:

  ```shell
  python process_kather.py
  ```

- Download the `classnames.txt` file from here and place it in the main `kather` folder.

- Move the `kather` folder to the `med-datasets` directory.
## PanNuke

- Download the dataset (Fold-1, Fold-2, Fold-3) from the following link:

  PanNuke Dataset for Nuclei Instance Segmentation and Classification

- After downloading the dataset, extract the files and move the images to the appropriate directories by running the following commands:

  ```shell
  mkdir pannuke
  unzip fold_1.zip -d ./pannuke
  unzip fold_2.zip -d ./pannuke
  unzip fold_3.zip -d ./pannuke
  cd pannuke
  mkdir images
  mkdir images/train
  mkdir images/train/benign
  mkdir images/train/malignant
  mkdir images/test
  mkdir images/test/benign
  mkdir images/test/malignant
  pip install multiprocess==0.70.16
  ```

- Download the `process_pannuke.py` file from here and place it in the main `pannuke` folder. After this, run the following command to process the dataset:

  ```shell
  python process_pannuke.py
  ```

- Download the `train_test_split_pannuke.py` file from here and place it in the main `pannuke` folder. Run the following command to split the dataset into training and testing sets:

  ```shell
  python train_test_split_pannuke.py
  ```

- Download the `classnames.txt` file from here and place it in the main `pannuke` folder.

- Move the `pannuke` folder to the `med-datasets` directory.

Note: The Python script `process_pannuke.py` is adapted from the PLIP Validation Dataset source.
## DigestPath

- Download the dataset from the following Google Drive link:

- After downloading the dataset, extract the files and move the images to the appropriate directories by running the following commands:

  ```shell
  mkdir digestpath
  unzip tissue-train-neg.zip -d ./digestpath
  unzip tissue-train-pos-v1.zip -d ./digestpath
  cd digestpath
  mkdir images
  mkdir images/train
  mkdir images/train/benign
  mkdir images/train/malignant
  mkdir images/test
  mkdir images/test/benign
  mkdir images/test/malignant
  pip install multiprocess==0.70.16
  ```

- Download the `process_digestpath.py` file from here and place it in the main `digestpath` folder. After this, run the following commands to process the dataset:

  ```shell
  python process_digestpath.py --step 1
  python process_digestpath.py --step 2
  python process_digestpath.py --step 3
  ```

- Download the `train_test_split_digestpath.py` file from here and place it in the main `digestpath` folder. Run the following command to split the dataset into training and testing sets:

  ```shell
  python train_test_split_digestpath.py
  ```

- Download the `classnames.txt` file from here and place it in the main `digestpath` folder.

- Move the `digestpath` folder to the `med-datasets` directory.

Note: The Python script `process_digestpath.py` is adapted from the PLIP Validation Dataset source.
This file is prepared by BAPLe.