
BAPLe Instructions for Dataset Preparation

This document describes how to prepare the datasets used for training and testing the models in the BAPLe project. The datasets are:

COVID   RSNA18   MIMIC   Kather   PanNuke   DigestPath

The general structure of a dataset is as follows:

med-datasets/
    ├── dataset-name/
        |── images/
            |── train/
            |── test/
        |── classnames.txt

Here, dataset-name is the name of the dataset, train and test are the directories containing the training and testing images, and classnames.txt is a text file that maps each class folder name to its actual class name. The train and test directories each contain one sub-directory per class, which holds the images for that class. An example structure of the train and test directories is as follows:

|── train/
    |── class_1/
        |── image_1.jpg
        |── image_2.jpg
        |── ...
    |── class_2/
        |── image_1.jpg
        |── image_2.jpg
        |── ...
    .
    .
    .

    |── class_N/
        |── image_1.jpg
        |── image_2.jpg
        |── ...


|── test/
    |── class_1/
        |── image_1.jpg
        |── image_2.jpg
        |── ...
    |── class_2/
        |── image_1.jpg
        |── image_2.jpg
        |── ...
    .
    .
    .

    |── class_N/
        |── image_1.jpg
        |── image_2.jpg
        |── ...
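Once a dataset has been prepared, the layout described above can be checked with a short script. The following is a minimal sketch, not part of the BAPLe code base: the function name `summarize_dataset` and the set of accepted image extensions are assumptions made for illustration.

```python
from pathlib import Path

# Assumption: illustrative list of image extensions; adjust to your data.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".tif", ".tiff"}


def summarize_dataset(root):
    """Return {split: {class_folder: image_count}} for a prepared dataset.

    Raises FileNotFoundError if classnames.txt or a split directory is missing.
    """
    root = Path(root)
    if not (root / "classnames.txt").is_file():
        raise FileNotFoundError(f"missing {root / 'classnames.txt'}")

    summary = {}
    for split in ("train", "test"):
        split_dir = root / "images" / split
        if not split_dir.is_dir():
            raise FileNotFoundError(f"missing {split_dir}")
        counts = {}
        # Every sub-directory of a split is treated as one class folder.
        for class_dir in sorted(p for p in split_dir.iterdir() if p.is_dir()):
            counts[class_dir.name] = sum(
                1
                for f in class_dir.iterdir()
                if f.is_file() and f.suffix.lower() in IMAGE_EXTENSIONS
            )
        summary[split] = counts
    return summary
```

For example, `summarize_dataset("med-datasets/covid")` would return the number of images per class folder for the train and test splits, which makes an empty or missing class folder easy to spot.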


We used the following datasets in our experiments; preparation instructions for each are given below:

| Dataset    | Type           | Classes |
|------------|----------------|---------|
| COVID      | X-ray          | 2       |
| RSNA18     | X-ray          | 3       |
| MIMIC      | X-ray          | 5       |
| Kather     | Histopathology | 9       |
| PanNuke    | Histopathology | 2       |
| DigestPath | Histopathology | 2       |

TO DO

Add information about [Dataset Class Python File, Transformations, Data Loaders]



COVID

  1. Download the dataset from the following Kaggle link:

    COVID-19 Image Data Collection

  2. After downloading the dataset, extract the files and move the images to the appropriate directories by running the following commands:

    unzip archive.zip
    mv COVID-19_Radiography_Dataset covid
    cd covid
    mkdir images
    mkdir images/all-images
    
    mkdir images/all-images/covid
    mv COVID/images/* images/all-images/covid
    rm -rf COVID
    
    mkdir images/all-images/normal
    mv Normal/images/* images/all-images/normal
    rm -rf Normal
    
    mkdir images/all-images/lung_opacity
    mv Lung_Opacity/images/* images/all-images/lung_opacity
    rm -rf Lung_Opacity
    
    mkdir images/all-images/viral_pneumonia
    mv Viral\ Pneumonia/images/* images/all-images/viral_pneumonia
    rm -rf Viral\ Pneumonia
    
    mkdir images/train
    mkdir images/train/covid
    mkdir images/train/normal
    mkdir images/test
    mkdir images/test/covid
    mkdir images/test/normal
  3. Download the train_test_split_covid.py file from here and place it in the main covid folder. Run the following command to split the dataset into training and testing sets:

    python train_test_split_covid.py
  4. Download the classnames.txt file from here and place it in the main covid folder.

  5. Move the covid folder to the med-datasets directory.
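To illustrate what a per-class train/test split over the images/all-images folders can look like, here is a minimal sketch. It is not the project's train_test_split_covid.py: the function name `split_class`, the 20% test fraction, the fixed seed, and the choice to copy rather than move files are all assumptions made for illustration.

```python
import random
import shutil
from pathlib import Path


def split_class(src_dir, train_dir, test_dir, test_fraction=0.2, seed=0):
    """Copy the files of one class from src_dir into train_dir and test_dir.

    test_fraction and seed are assumed values, not those of the project script.
    Returns (n_train, n_test).
    """
    files = sorted(p for p in Path(src_dir).iterdir() if p.is_file())
    random.Random(seed).shuffle(files)  # deterministic shuffle for a fixed seed
    n_test = int(len(files) * test_fraction)
    for i, f in enumerate(files):
        # The first n_test shuffled files go to test, the rest to train.
        dest = Path(test_dir) if i < n_test else Path(train_dir)
        dest.mkdir(parents=True, exist_ok=True)
        shutil.copy2(f, dest / f.name)
    return len(files) - n_test, n_test


# Hypothetical usage from inside the covid folder, for the two classes
# whose train/test folders were created in step 2:
#   for cls in ("covid", "normal"):
#       split_class(f"images/all-images/{cls}",
#                   f"images/train/{cls}", f"images/test/{cls}")
```

Splitting each class separately keeps the class proportions roughly equal between the two sets, and the fixed seed makes the split reproducible.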



RSNA18

  1. Download the dataset from the following Kaggle link:

    RSNA18 Challenge Dataset

  2. After downloading the dataset, extract the files and move the images to the appropriate directories by running the following commands:

    pip install pydicom==2.4.4
    pip install pandas==2.2.2
    pip install scikit-learn==1.5.1
    
    unzip rsna-pneumonia-detection-challenge.zip
    mv rsna-pneumonia-detection-challenge rsna18
    cd rsna18
    
    mkdir unprocessed
    mv ./*.txt unprocessed
    mv ./*.csv unprocessed
    mv ./stage_2_train_images unprocessed
    mv ./stage_2_test_images unprocessed
  3. Download the train_test_split_rsna18.py file from here and place it in the main rsna18 folder. Run the following command to split the dataset into training and testing sets:

    python train_test_split_rsna18.py
  4. Download the classnames.txt file from here and place it in the main rsna18 folder.

  5. Move the rsna18 folder to the med-datasets directory.
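Before splitting, it can be useful to check how many patients fall into each class. The sketch below is not taken from train_test_split_rsna18.py. It assumes that the label file moved into unprocessed is the Kaggle release's stage_2_detailed_class_info.csv with the columns patientId and class; verify both against your download. The function name `count_patients_per_class` is hypothetical.

```python
import csv
from collections import Counter


def count_patients_per_class(csv_path):
    """Count unique patients per class label in a (patientId, class) CSV.

    Assumption: the CSV has the columns 'patientId' and 'class'.
    """
    labels = {}
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            # A patient may appear on several rows; keep one label per patient.
            labels[row["patientId"]] = row["class"]
    return Counter(labels.values())
```

A call such as `count_patients_per_class("unprocessed/stage_2_detailed_class_info.csv")` should report three distinct labels if the file matches these assumptions, in line with the three classes listed for RSNA18.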



MIMIC

To be updated soon.

Kather

  1. Download the dataset from the following links:

    NCT-CRC-HE-100K.zip

    CRC-VAL-HE-7K.zip

  2. After downloading the dataset, extract the files and move the images to the appropriate directories by running the following commands:

    unzip NCT-CRC-HE-100K.zip
    unzip CRC-VAL-HE-7K.zip
    mv NCT-CRC-HE-100K train
    mv CRC-VAL-HE-7K test
    
    mkdir kather
    mkdir kather/images
    mv train kather/images/
    mv test kather/images/
    
    pip install multiprocess==0.70.16
  3. Download the process_kather.py file from here and place it in the main kather folder. Then run the following command to process the dataset:

    python process_kather.py
  4. Download the classnames.txt file from here and place it in the main kather folder.

  5. Move the kather folder to the med-datasets directory.



PanNuke

  1. Download the dataset (Fold-1, Fold-2, Fold-3) from the following link:

    PanNuke Dataset for Nuclei Instance Segmentation and Classification

  2. After downloading the dataset, extract the files and move the images to the appropriate directories by running the following commands:

    mkdir pannuke
    unzip fold_1.zip -d ./pannuke
    unzip fold_2.zip -d ./pannuke
    unzip fold_3.zip -d ./pannuke
    
    cd pannuke
    
    mkdir images
    mkdir images/train
    mkdir images/train/benign
    mkdir images/train/malignant
    mkdir images/test
    mkdir images/test/benign
    mkdir images/test/malignant
    
    pip install multiprocess==0.70.16
  3. Download the process_pannuke.py file from here and place it in the main pannuke folder. Then run the following command to process the dataset:

    python process_pannuke.py
  4. Download the train_test_split_pannuke.py file from here and place it in the main pannuke folder. Run the following command to split the dataset into training and testing sets:

    python train_test_split_pannuke.py
  5. Download the classnames.txt file from here and place it in the main pannuke folder.

  6. Move the pannuke folder to the med-datasets directory.


Note: The Python script process_pannuke.py is adapted from the PLIP Validation Dataset source.



DigestPath

  1. Download the dataset from the following Google Drive link:

    DigestPath Dataset - 2019

  2. After downloading the dataset, extract the files and move the images to the appropriate directories by running the following commands:

    mkdir digestpath
    unzip tissue-train-neg.zip -d ./digestpath
    unzip tissue-train-pos-v1.zip -d ./digestpath
    
    cd digestpath
    
    mkdir images
    mkdir images/train
    mkdir images/train/benign
    mkdir images/train/malignant
    mkdir images/test
    mkdir images/test/benign
    mkdir images/test/malignant
    
    pip install multiprocess==0.70.16
  3. Download the process_digestpath.py file from here and place it in the main digestpath folder. Then run the following commands, in order, to process the dataset:

    python process_digestpath.py --step 1
    python process_digestpath.py --step 2
    python process_digestpath.py --step 3
  4. Download the train_test_split_digestpath.py file from here and place it in the main digestpath folder. Run the following command to split the dataset into training and testing sets:

    python train_test_split_digestpath.py
  5. Download the classnames.txt file from here and place it in the main digestpath folder.

  6. Move the digestpath folder to the med-datasets directory.


Note: The Python script process_digestpath.py is adapted from the PLIP Validation Dataset source.
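The three processing steps have to run one after another, so a small wrapper that stops at the first failing step can be convenient. The following is a minimal sketch that assumes only the `--step` flag shown in the commands above; the helper names `build_step_commands` and `run_steps` are hypothetical and not part of the project.

```python
import subprocess
import sys


def build_step_commands(script="process_digestpath.py", steps=(1, 2, 3)):
    """Return one command (as an argument list) per processing step."""
    return [[sys.executable, script, "--step", str(s)] for s in steps]


def run_steps(commands):
    """Run the commands in order.

    check=True makes subprocess.run raise CalledProcessError as soon as a
    command exits with a non-zero status, so later steps are not started.
    """
    for cmd in commands:
        subprocess.run(cmd, check=True)
```

Calling `run_steps(build_step_commands())` from inside the digestpath folder would then be equivalent to running the three commands by hand, with the difference that a failure in an earlier step prevents the later ones from running on incomplete output.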



Acknowledgement

This file is prepared by BAPLe.