diff --git a/tutorials/classification_pathmnist_notebook/classification_tutorial.ipynb b/tutorials/classification_pathmnist_notebook/classification_tutorial.ipynb new file mode 100644 index 000000000..187eed3a8 --- /dev/null +++ b/tutorials/classification_pathmnist_notebook/classification_tutorial.ipynb @@ -0,0 +1,957 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "1cc4fdb8-18de-4128-8ec1-b0ba70f4a934", + "metadata": {}, + "source": [ + "In this tutorial, we will be using the Generally Nuanced Deep Learning Framework (GaNDLF) to perform training and inference on a VGG model with PathMNIST, a dataset of colon pathology images. This is a multi-class classification task: there are 9 different types of colon tissue displayed in the pathology images, each represented by its own class." + ] + }, + { + "cell_type": "markdown", + "id": "61838af7", + "metadata": {}, + "source": [ + "The VGG model is a CNN architecture best known for its simplicity and effectiveness in image classification, and it has been used in object recognition, face detection, and medical image analysis. \n", + "Its most notable features are its 16-19 weight layers and its small 3x3 filters, which allow for better performance and faster training times.\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "id": "4e77f51e", + "metadata": {}, + "source": [ + "![VGG Model Image](https://d3i71xaburhd42.cloudfront.net/dae981902b1f6d869ef2d047612b90cdbe43fd1e/2-Figure1-1.png)" + ] + }, + { + "cell_type": "markdown", + "id": "b99da8e5", + "metadata": {}, + "source": [ + "Figure 1 - Display of VGG Architecture - Visualizing and Comparing AlexNet and VGG using Deconvolutional Layers (W. Yu)\n", + "\n", + "For more information on the model, refer to the paper Very Deep Convolutional Networks for Large-Scale Image Recognition by Karen Simonyan and Andrew Zisserman.\n" + ] + }, + { + "cell_type": "markdown", + "id": "b0645481", + "metadata": {}, + "source": [ + "This tutorial demonstrates how to use GaNDLF for a simple classification task. Some steps that would ordinarily be part of the workflow (e.g. data CSV and config YAML file construction) have already been performed for simplicity; please refer to the GaNDLF documentation (located at https://cbica.github.io/GaNDLF/) for more information on replicating these steps.\n", + "\n", + "Command-Line Interface Note: Make sure you have Python installed before proceeding! Visit python.org/downloads for instructions on how to do this. Installing version 3.7 is sufficient for this tutorial, as it matches the version used in Google Colab. \n", + "\n", + "Google Colab Note: Before continuing with this tutorial, please ensure that you are connected to the GPU by navigating to Runtime → Change Runtime Type → Hardware Accelerator and verifying that \"GPU\" is listed as the selected option. If not, it is highly recommended that you switch to it now. Also, if available, select the \"High-RAM\" option under Runtime → Change Runtime Type → Runtime Shape. Without this option selected, you may run out of RAM during training on a base notebook. \n", + "\n", + "Error Note: Even with a GPU runtime selected, a message may come up that says “You are connected to the GPU Runtime, but not utilizing the GPU.” If the GPU is not actually being used, the program can run out of RAM and stop while training the model.\n", + "\n", + "\n", + "Let's get started!\n",
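+ "\n", + "If you would like to double-check that a GPU is actually attached before installing anything, nvidia-smi will list it (Colab's GPU runtimes ship with this utility; this is just a quick sanity check):\n", + "\n", + "```\n", + "!nvidia-smi\n", + "```\n", + "\n",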
+ "First, we will clone the GaNDLF repo.\n", + "\n", + "Google Colab Note: If you are following these steps on Google Colab, replace 0.0.16 in the line of code below with 0.0.14, as the default version of Python in Colab (3.7) is not supported by the NumPy version used in the current version of GaNDLF.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "adc51dd9", + "metadata": {}, + "outputs": [], + "source": [ + "!git clone -b 0.0.16 --depth 1 https://github.com/CBICA/GaNDLF.git" + ] + }, + { + "cell_type": "markdown", + "id": "634c8e7b", + "metadata": {}, + "source": [ + "The -b option indicates which branch or tag we want to clone, and --depth 1 specifies the number of commits (or versions of the repository) that we want to retrieve. In this case, we only retrieve one to save space and time; since we are not making any changes to the actual GaNDLF code, this is sufficient." + ] + }, + { + "cell_type": "markdown", + "id": "0c62a49b", + "metadata": {}, + "source": [ + "Let's navigate to the newly created GaNDLF directory using the cd (change directory) command.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "4e4cb90b", + "metadata": {}, + "outputs": [], + "source": [ + "%cd GaNDLF\n" + ] + }, + { + "cell_type": "markdown", + "id": "18faa97d", + "metadata": {}, + "source": [ + "Now, we'll install the appropriate version of PyTorch for use with GaNDLF. PyTorch is a machine learning framework primarily used for deep learning.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "51f0bdf8", + "metadata": {}, + "outputs": [], + "source": [ + "!pip3 install torch==1.8.2 torchvision==0.9.2 torchaudio==0.8.2 --extra-index-url https://download.pytorch.org/whl/lts/1.8/cu111\n" + ] + }, + { + "cell_type": "markdown", + "id": "f60e51b7", + "metadata": {}, + "source": [ + "pip is Python’s package manager, which allows us to easily download and install Python packages from the internet. If you do not have pip installed, visit pip.pypa.io/en/stable/installation/ for instructions.\n" + ] + }, + { + "cell_type": "markdown", + "id": "59aa00d9", + "metadata": {}, + "source": [ + "Google Colab Error Note: You might see an error that says:\n", + "ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts. torchtext 0.14.1 requires torch==1.13.1, but you have torch 1.8.2+cu111 which is incompatible.\n", + "The next few cells may also produce similar errors of the same form mentioning “pip’s dependency resolver” and version incompatibility, but GaNDLF installation will still be successful despite this, as we will see when verifying the installation." + ] + }, + { + "cell_type": "markdown", + "id": "1a79edc1", + "metadata": {}, + "source": [ + "Next, let's install OpenVINO, which is used by GaNDLF to generate optimized models for inference.\n", + "\n", + "OpenVINO is a toolkit developed by Intel that allows for easier optimization of deep learning models in situations where access to computing resources may be limited. The toolkit optimizes computation and memory usage while providing hardware-specific optimizations. OpenVINO currently supports over 68 different image classification models, allowing for flexibility of use with GaNDLF.\n",
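+ "\n", + "The GaNDLF installation step later in this tutorial pulls in the packages GaNDLF requires; if OpenVINO is not among them in your environment, it can also be installed directly (shown here only as an illustration; match the version pin to your GaNDLF version's requirements):\n", + "\n", + "```\n", + "!pip install openvino\n", + "```\n", + "\n",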
+ "\n", + "Visit docs.openvino.ai for more information.\n" + ] + }, + { + "cell_type": "markdown", + "id": "4eee3270", + "metadata": {}, + "source": [ + "We’ll need to install Matplotlib v3.5.0 for use in plotting our results, as this version includes all of the features that we will be using later on. Let's do that now!\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "e63124bb", + "metadata": {}, + "outputs": [], + "source": [ + "!pip install matplotlib==3.5.0\n" + ] + }, + { + "cell_type": "markdown", + "id": "ed63ccf0", + "metadata": {}, + "source": [ + "Matplotlib is a library that allows us to create different types of data visualizations in Python. We will go over some of its specific features when we utilize it later on. Visit matplotlib.org for more information.\n", + "\n", + "Google Colab Note: To be able to use the newly installed version of Matplotlib for plotting, go ahead and click the small gray \"RESTART RUNTIME\" button in the output directly above this code cell, and then continue with this tutorial once the restart process is complete.\n", + "\n", + "\n", + "Now, ensure you are still in the GaNDLF directory before proceeding. Otherwise, run the following command. \n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "5abebbed", + "metadata": {}, + "outputs": [], + "source": [ + "%cd GaNDLF\n" + ] + }, + { + "cell_type": "markdown", + "id": "5aa72043", + "metadata": {}, + "source": [ + "For the last of our GaNDLF-related installations, let's install all required packages for GaNDLF.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "8db908ae", + "metadata": {}, + "outputs": [], + "source": [ + "!pip install -e ." + ] + }, + { + "cell_type": "markdown", + "id": "488d9232", + "metadata": {}, + "source": [ + "-e is short for editable, and . indicates that we are installing from the current directory (GaNDLF). Installing in editable mode lets us modify the package's source code without having to reinstall it after every change. \n", + "\n", + "Now, let's use gandlf_verifyInstall to verify our GaNDLF installation. " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "77bd9e3d", + "metadata": {}, + "outputs": [], + "source": [ + "!python ./gandlf_verifyInstall\n" + ] + }, + { + "cell_type": "markdown", + "id": "3001d597", + "metadata": {}, + "source": [ + "\n", + "If you see the message “GaNDLF is ready” after executing the previous step, then all steps have been followed correctly thus far. Let's move on to collecting our data. First, we will install the MedMNIST package in order to obtain the PathMNIST data that we will be using.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "baf27899", + "metadata": {}, + "outputs": [], + "source": [ + "!pip install medmnist\n" + ] + }, + { + "cell_type": "markdown", + "id": "22031f46", + "metadata": {}, + "source": [ + "The original MNIST (Modified National Institute of Standards and Technology) database is a database containing various images of handwritten digits, and people use this dataset to train computer programs to recognize digits. 
MNIST is often used as a benchmark, and researchers can test different machine learning algorithms on MNIST to evaluate their performance.\n", + "MedMNIST is a database of biomedical images with 18 different datasets. \n", + "\n", + "The 2D Datasets are:\n", + "1. PathMNIST (Colon Pathology images)\n", + "2. ChestMNIST (Chest X-Ray images)\n", + "3. DermaMNIST (Dermatoscope images)\n", + "4. OCTMNIST (Retinal OCT images)\n", + "5. PneumoniaMNIST (Chest X-Ray images)\n", + "6. RetinaMNIST (Fundus Camera images)\n", + "7. BreastMNIST (Breast Ultrasound images)\n", + "8. BloodMNIST (Blood Cell Microscope images)\n", + "9. TissueMNIST (Kidney Cortex Microscope images)\n", + "10.-12. OrganAMNIST, OrganCMNIST, and OrganSMNIST (Abdominal CT images)\n", + "\n", + "The 3D Datasets are:\n", + "1. OrganMNIST3D (Abdominal CT images)\n", + "2. NoduleMNIST3D (Chest CT images)\n", + "3. AdrenalMNIST3D (Shape from Abdominal CT images)\n", + "4. FractureMNIST3D (Chest CT images)\n", + "5. VesselMNIST3D (Shape from Brain MRA images)\n", + "6. SynapseMNIST3D (Electron Microscope images)\n", + "\n", + "Here are some visualizations of the MedMNIST datasets. \n", + "\n" + ] + }, + { + "cell_type": "markdown", + "id": "2eeaf438", + "metadata": {}, + "source": [ + "![MedMNIST visualizations](https://medmnist.com/assets/v2/imgs/overview.jpg)\n" + ] + }, + { + "cell_type": "markdown", + "id": "bc3638aa", + "metadata": {}, + "source": [ + "Figure 2 - MedMNIST datasets - MedMNIST v2 - A large-scale lightweight benchmark for 2D and 3D biomedical image classification (J. Yang)\n", + "\n", + "\n", + "The data is stored in a 28×28 (2D) or 28×28×28 (3D) format, similar to the 28×28 size of the images in the original MNIST dataset. \n", + "\n", + "For more information on MedMNIST and its installation, visit github.com/MedMNIST/MedMNIST/ or medmnist.com.\n", + "\n", + "For the purposes of this tutorial, we will be focusing on PathMNIST, the colon pathology images. The PathMNIST dataset has a total of 107,180 samples, and these images were taken from the study \"Predicting survival from colorectal cancer histology slides using deep learning: A retrospective multicenter study\" by Jakob Nikolas Kather, Johannes Krisam, et al.\n", + "\n", + "We will look more closely at the images in this dataset soon. \n", + "\n", + "Now, let's import MedMNIST and verify the version number before we move on. \n", + "\n", + "Command-Line Interface Note: This is Python code, so to run this, we must enter the Python shell. To do this, type python or python3 in the command prompt and press Enter. After executing these commands, type exit() or quit() to exit the Python shell.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "411eb42d", + "metadata": {}, + "outputs": [], + "source": [ + "import medmnist\n", + "print(medmnist.__version__)" + ] + }, + { + "cell_type": "markdown", + "id": "84912c51", + "metadata": {}, + "source": [ + "Time to load our data! Let's download all MedMNIST datasets to the root directory.\n",
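+ "\n", + "Each MedMNIST dataset also ships with machine-readable metadata. As a quick illustration (assuming the installed medmnist version follows the v2 API, where the package exposes an INFO dictionary), you can look up PathMNIST's class labels and sample counts before downloading anything:\n", + "\n", + "```python\n", + "from medmnist import INFO\n", + "\n", + "info = INFO['pathmnist']\n", + "print(info['label'])      # class index -> tissue type\n", + "print(info['n_samples'])  # sample counts per split\n", + "```\n", + "\n",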
+ "In this tutorial, we will only be using the PathMNIST dataset; however, feel free to use any of the other datasets you see being downloaded below (which are also mentioned above) to try out GaNDLF for yourself!\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "0536b26c", + "metadata": {}, + "outputs": [], + "source": [ + "!python -m medmnist download\n" + ] + }, + { + "cell_type": "markdown", + "id": "6a040293", + "metadata": {}, + "source": [ + "-m stands for module; this command tells Python to run the download function of the medmnist module.\n", + "\n", + "Before we continue, let's navigate back to the base directory.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "a2cb0640", + "metadata": {}, + "outputs": [], + "source": [ + "%cd ..\n" + ] + }, + { + "cell_type": "markdown", + "id": "0509cadf", + "metadata": {}, + "source": [ + "Now, let's save all PathMNIST pathology images within the dataset folder (located inside the medmnist directory) in PNG format for use in training and inference.\n", + "If you've already gone through this tutorial and are looking to try using a different MedMNIST dataset, simply change --flag=pathmnist to any of the other datasets that were downloaded above: it's as simple as that!" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "19db62eb", + "metadata": {}, + "outputs": [], + "source": [ + "!python -m medmnist save --flag=pathmnist --folder=medmnist/dataset/ --postfix=png" + ] + }, + { + "cell_type": "markdown", + "id": "cbcb07ff", + "metadata": {}, + "source": [ + "The --folder option specifies where to save the downloaded information, and the --postfix option specifies that the images should be saved in PNG format.\n", + "\n", + "For this tutorial, we will be using the full PathMNIST dataset for training, validation, and testing. However, to improve efficiency and save time, you may consider using a fraction of this dataset instead with GaNDLF. \n", + "\n", + "To download the full dataset:\n", + "\n", + "Let's retrieve and download the train_path_full data CSV file within the dataset folder, which consists of ~90,000 images.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "4a699380", + "metadata": {}, + "outputs": [], + "source": [ + "!wget -O /content/medmnist/dataset/train_path_full.csv \"https://app.box.com/index.php?rm=box_download_shared_file&shared_name=zzokqk7hjzwmjxvamrxotxu78ihd5az0&file_id=f_972380483494\"" + ] + }, + { + "cell_type": "markdown", + "id": "831324a5", + "metadata": {}, + "source": [ + "The -O option specifies the location where the file should be saved and what the new filename will be.\n", + "\n", + "Let's retrieve and download the val_path_full data CSV file within the dataset folder, which is the full validation dataset consisting of ~10,000 images. \n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "785af619", + "metadata": {}, + "outputs": [], + "source": [ + "!wget -O /content/medmnist/dataset/val_path_full.csv \"https://app.box.com/index.php?rm=box_download_shared_file&shared_name=bjoh6hn27l6ifqqtrs7w66za2hdwlu0a&file_id=f_990373191120\"" + ] + },
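+ { + "cell_type": "markdown", + "id": "7c1d2e3a", + "metadata": {}, + "source": [ + "Before grabbing the last file, it's worth a quick peek at one of the CSVs we just downloaded to confirm it landed where we expect (the column layout is whatever the prepared files ship with, so this is just a sanity check):\n", + "\n", + "```python\n", + "import pandas as pd\n", + "\n", + "print(pd.read_csv('/content/medmnist/dataset/train_path_full.csv').head())\n", + "```\n" + ] + },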
+ { + "cell_type": "markdown", + "id": "37bd1888", + "metadata": {}, + "source": [ + "For the last of the data CSV files, let's retrieve and download the test_path_full data CSV file within the dataset folder. This CSV file contains ~7,200 individual images.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "42d61c84", + "metadata": {}, + "outputs": [], + "source": [ + "!wget -O /content/medmnist/dataset/test_path_full.csv \"https://app.box.com/index.php?rm=box_download_shared_file&shared_name=jjzoifpdly0pmkdaguy0cxbdfkig81eq&file_id=f_990374552591\"" + ] + }, + { + "cell_type": "markdown", + "id": "55be009b", + "metadata": {}, + "source": [ + "To download the “tiny” dataset:\n", + "\n", + "Let's retrieve and download the train_path_tiny data CSV file within the dataset folder, which consists of ~4,000 images. Note that the tiny CSVs are saved under the same *_full.csv filenames, so the commands later in this tutorial work without modification.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "2d63bf0e", + "metadata": {}, + "outputs": [], + "source": [ + "!wget -O /content/medmnist/dataset/train_path_full.csv \"https://app.box.com/index.php?rm=box_download_shared_file&shared_name=um4003lkrvyj55jm4a0jz7zsuokb0r8o&file_id=f_991821586980\"" + ] + }, + { + "cell_type": "markdown", + "id": "e7d39d90", + "metadata": {}, + "source": [ + "Let's retrieve and download the val_path_tiny data CSV file within the dataset folder, which consists of ~1,000 images. \n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "e0054486", + "metadata": {}, + "outputs": [], + "source": [ + "!wget -O /content/medmnist/dataset/val_path_full.csv \"https://app.box.com/index.php?rm=box_download_shared_file&shared_name=rsmff27sm2z34r5xso1jx8xix7nhfspc&file_id=f_991817441206\"" + ] + }, + { + "cell_type": "markdown", + "id": "5a1b249b", + "metadata": {}, + "source": [ + "For the last of the data CSV files, let's retrieve and download the test_path_tiny data CSV file within the dataset folder. This CSV file contains 500 individual images.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "fdc098fe", + "metadata": {}, + "outputs": [], + "source": [ + "!wget -O /content/medmnist/dataset/test_path_full.csv \"https://app.box.com/index.php?rm=box_download_shared_file&shared_name=22lm0qfzk5luap72mtdpzx5l3ocflopa&file_id=f_991819617152\"\n" + ] + }, + { + "cell_type": "markdown", + "id": "cad43170", + "metadata": {}, + "source": [ + "Command-Line Interface Note: If you run into any issues when trying to download these files, you can copy and paste these links into your web browser to download the .csv files. Then, you can manually place them into the desired directory using your file manager.\n", + "\n", + "\n", + "Now, we will retrieve and download the config YAML file to the base directory. This file specifies important information to be used in training and inference (model and training parameters, data preprocessing specifications, etc.).\n", + "\n", + "For the purposes of this tutorial, we have already constructed this file to fit our specific task, but for other tasks and experiments that you may want to run, this file will need to be edited to fit the required specifications of your experiment. However, the overall structure of this file will stay the same regardless of your task, so you should be able to get by simply downloading and editing the config.yaml file we're using below for use in your own experiments.\n", + "\n", + "Either way, we highly encourage you to download and take a look at the structure of this file before proceeding if you intend to use GaNDLF for your own experiments, as it will be the backbone of all tasks you use GaNDLF with in the future. The file contains comments explaining the various parameters and what they mean.\n",
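+ "\n", + "For instance, the model block near the top of the file we download below starts like this (an excerpt; see the full file for the remaining options and their comments):\n", + "\n", + "```yaml\n", + "model: {\n", + "  dimension: 2,      # dimensionality of the model and dataset\n", + "  base_filters: 32,\n", + "  architecture: vgg, # options: unet, resunet, fcn, uinc, vgg, densenet\n", + "  ...\n", + "}\n", + "```\n", + "\n",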
+ "If you plan on trying to use any of the other datasets, specifically the 3D ones, make sure to change the number of dimensions on line 11 of the file to 3. \n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "e71a3bfa", + "metadata": {}, + "outputs": [], + "source": [ + "!wget -O config.yaml \"https://app.box.com/index.php?rm=box_download_shared_file&shared_name=hs0zwezggl4rxtzgrcaq86enu7qwuvqx&file_id=f_974251081617\"" + ] + }, + { + "cell_type": "markdown", + "id": "5be6bb2f", + "metadata": {}, + "source": [ + "Finally, let's retrieve and download an updated copy of the gandlf_collectStats file to the base directory for use in plotting and visualizing our results. While an earlier version of this file is present in the GaNDLF repo, it is not suitable for classification tasks, so it has been modified for this tutorial to produce classification training and validation accuracy and loss plots. This file will be included in the GaNDLF repo in a future update, but for now, we will retrieve the updated file externally for use in this tutorial.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "4309c31e", + "metadata": {}, + "outputs": [], + "source": [ + "!wget -O gandlf_collectStats_final \"https://app.box.com/index.php?rm=box_download_shared_file&shared_name=avq6pvqg3uzsc4uzbklab66mad6eaik5&file_id=f_989875069231\"" + ] + }, + { + "cell_type": "markdown", + "id": "41871543", + "metadata": {}, + "source": [ + "Let's visualize some sample images and their classes from the PathMNIST dataset.\n", + "Image classes for reference:\n", + "Class 0: Adipose\n", + "Class 1: Background\n", + "Class 2: Debris\n", + "Class 3: Lymphocytes\n", + "Class 4: Mucus\n", + "Class 5: Smooth Muscle\n", + "Class 6: Normal Colon Mucosa\n", + "Class 7: Cancer-Associated Stroma\n", + "Class 8: Colorectal Adenocarcinoma Epithelium\n", + "\n", + "Before running the code below, make sure you enter the Python shell again if you are using a command-line interface. \n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "3f322c5e", + "metadata": {}, + "outputs": [], + "source": [ + "#Step 1: import the plotting and data-handling libraries\n", + "import matplotlib.pyplot as plt\n", + "import matplotlib.image as mpimg\n", + "import pandas as pd\n", + "\n", + "#Step 2: read the filename/label listing produced by the save command\n", + "#(the CSV has no header row, so pandas promotes the first data row,\n", + "#'train0_0.png' and '0', to column names)\n", + "df_pathmnist = pd.read_csv('./medmnist/dataset/pathmnist.csv')\n", + "\n", + "#Step 3: indices of the images we want to display\n", + "selected_images = [32, 36, 46, 13, 14, 8, 12, 18, 5, 6, 17, 0, 16, 3, 7, 10, 43, 45, 55, 1, 31, 41, 4, 9, 11]\n", + "\n", + "#Step 4: create a 5x5 grid of subplots for the 25 images\n", + "fig, ax = plt.subplot_mosaic([\n", + "    ['img0', 'img1', 'img2', 'img3', 'img4'],\n", + "    ['img5', 'img6', 'img7', 'img8', 'img9'],\n", + "    ['img10', 'img11', 'img12', 'img13', 'img14'],\n", + "    ['img15', 'img16', 'img17', 'img18', 'img19'],\n", + "    ['img20', 'img21', 'img22', 'img23', 'img24']\n", + "], figsize=(15, 15))\n", + "\n", + "#Step 5: read each selected image and place it in its subplot, titled with its class\n", + "for i in range(len(selected_images)):\n", + "    img = selected_images[i]\n", + "    filename = df_pathmnist.iloc[img]['train0_0.png']\n", + "    img_class = df_pathmnist.iloc[img]['0']\n", + "\n", + "    path_img = mpimg.imread(f'./medmnist/dataset/pathmnist/{filename}')\n", + "\n", + "    ax[f'img{i}'].imshow(path_img)\n", + "    ax[f'img{i}'].axis('off')\n", + "    ax[f'img{i}'].set_title(f'Class {img_class}')\n", + "\n", + "#Step 6: display the plot\n", + "plt.show()\n" + ] + },
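+ { + "cell_type": "markdown", + "id": "9e8d7c6b", + "metadata": {}, + "source": [ + "A side note on Step 2: because pathmnist.csv has no header row, pandas promotes the first row to column names, which is why the code accesses columns with the odd-looking names 'train0_0.png' and '0'. An equivalent, more explicit read (one possible alternative, same file) would be:\n", + "\n", + "```python\n", + "df_pathmnist = pd.read_csv('./medmnist/dataset/pathmnist.csv',\n", + "                           header=None, names=['filename', 'class'])\n", + "```\n", + "\n", + "With this form, the first image is kept as data instead of being consumed as the header, so row indices shift by one relative to the code above.\n" + ] + },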
+ { + "cell_type": "markdown", + "id": "c18e375e", + "metadata": {}, + "source": [ + "Let’s quickly go over what exactly this code is doing.\n", + "1. We import all of the necessary libraries, such as matplotlib (which we discussed earlier) and pandas, a Python library that provides methods for data manipulation. \n", + "2. The file paths for the PathMNIST images are read and stored in the df_pathmnist variable. \n", + "3. We initialize the selected_images variable, which stores an array of integers. These integers are the indices of the images from the PathMNIST dataset that we want to display. You can also try changing some of these values to see different images in the dataset. \n", + "4. We create a 5x5 grid of subplots to display our 25 images.\n", + "5. We iterate through the selected_images array and read the corresponding image files. Then, we display the images in the proper subplots.\n", + "6. Finally, plt.show() displays the plot we have created. \n", + "\n", + "Your resulting plot should have an assortment of images from the PathMNIST set, including images from all of classes 0-8.\n", + "\n", + "Let’s quickly go over what these different classes are by using the results above:\n", + "\n", + "Class 0: Adipose\n", + "Adipose tissue is tissue that stores fat. It is made up of adipocytes, or fat cells, which store energy.\n", + "\n", + "Class 1: Background\n", + "The background class represents the non-tissue areas of the image that aren’t necessarily areas of interest.\n", + "\n", + "Class 2: Debris\n", + "Class 2 contains images of miscellaneous non-living particles or materials found in the colon.\n", + "\n", + "Class 3: Lymphocytes\n", + "Class 3 contains images of lymphocytes. Lymphocytes are a type of white blood cell that play a significant role in the immune system.\n", + "\n", + "Class 4: Mucus\n", + "Class 4 contains images of normal mucus that is found in healthy colons.\n", + "\n", + "Class 5: Smooth Muscle\n", + "Class 5 contains images of smooth muscle tissue that is seen in the walls of the colon.\n", + "\n", + "Class 6: Normal Colon Mucosa\n", + "Class 6 contains images of the colon mucosa, which lines the colon.\n", + "\n", + "Class 7: Cancer-Associated Stroma\n", + "Class 7 contains images of stroma, which is supportive tissue that provides tumors with nutrients and protection to assist in the cancer’s survival.\n", + "\n", + "Class 8: Colorectal Adenocarcinoma Epithelium\n", + "The epithelium is the tissue that lines the colon, and Class 8 contains images of epithelial cells affected by colorectal adenocarcinoma, a type of cancer.\n" + ] + }, + { + "cell_type": "markdown", + "id": "5195c707", + "metadata": {}, + "source": [ + "Now, on to training! Since there is only one GPU, let's set CUDA_VISIBLE_DEVICES to 0 to train on the first (and only) available GPU. 
The way to execute this code will vary based on what you are running your code on.\n", + "\n", + "On Colab, run this Python code: \n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "2afcc3d0", + "metadata": {}, + "outputs": [], + "source": [ + "import os\n", + "os.environ[\"CUDA_VISIBLE_DEVICES\"]=\"0\"" + ] + }, + { + "cell_type": "markdown", + "id": "91a977dc", + "metadata": {}, + "source": [ + "If you are following these instructions on your own command-line interface, instead run the following: \n", + "\n", + "export CUDA_VISIBLE_DEVICES=0" + ] + }, + { + "cell_type": "markdown", + "id": "8a2d6e30", + "metadata": {}, + "source": [ + "Let's run the training script! For this run, we will pass in the config.yaml file that we just downloaded to the -c parameter, the training and validation CSV files to the -i parameter, and the model directory to the -m parameter (the folder will automatically be created if it doesn't exist, which, in our case, it doesn't). We will also specify -t True to indicate that we are training and -d cuda to indicate that we will be training on the GPU. For demonstration purposes, we will only be training for 5 epochs. The number of epochs is the number of times that the model will pass through the entire training dataset during the training process. In our case, the model will pass through the dataset 5 times. \n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "f01f2dab", + "metadata": {}, + "outputs": [], + "source": [ + "!python /content/GaNDLF/gandlf_run -c /content/config.yaml -i /content/medmnist/dataset/train_path_full.csv,/content/medmnist/dataset/val_path_full.csv -m /content/model/ -t True -d cuda\n" + ] + }, + { + "cell_type": "markdown", + "id": "7a508061", + "metadata": {}, + "source": [ + "\n", + "You will likely notice a persistent error saying \"y_pred contains classes not in y_true\"; this is a known issue and will not affect our training performance, so feel free to ignore it.\n", + "\n", + "If your program stops for any reason and you try to run it again, you may see this error:\n", + "ValueError: The parameters are not the same as the ones stored in the previous run, please re-check.\n", + "To resolve this, delete the existing “model” folder and then try executing the command again. \n", + "\n", + "Potential Google Colab Error: If your Google Colab notebook is not correctly using the GPU, the program may run out of RAM and stop during the process of constructing the queue for training data. \n", + "\n", + "Now that training is complete, let's collect and save model statistics to the output_stats folder. Using -c True indicates that we'd like the 4 plots ordinarily generated by this command to be combined into two plots by overlaying training and validation statistics on the same graphs instead of keeping them separate. Feel free to experiment with this command by using -c False instead and viewing the resulting plots.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "aaa90885", + "metadata": {}, + "outputs": [], + "source": [ + "!python /content/gandlf_collectStats_final -m /content/model/ -o /content/output_stats -c True" + ] + },
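+ { + "cell_type": "markdown", + "id": "4f5e6d7b", + "metadata": {}, + "source": [ + "If you're curious what that step produced, you can list the output folder before plotting anything (the exact filenames may differ between GaNDLF versions, so treat this as a quick sanity check):\n", + "\n", + "```\n", + "!ls /content/output_stats/\n", + "```\n" + ] + },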
+ { + "cell_type": "markdown", + "id": "5840544b", + "metadata": {}, + "source": [ + "Now, let's view our generated plots! Make sure to enter the Python shell if you are working in the command-line interface.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "85a53674", + "metadata": {}, + "outputs": [], + "source": [ + "from IPython.display import Image\n", + "Image(\"/content/output_stats/plot.png\", width=1500, height=1000)" + ] + }, + { + "cell_type": "markdown", + "id": "3a823e40", + "metadata": {}, + "source": [ + "Your results should show both an Accuracy Plot and a Loss Plot.\n", + "\n", + "Since we only trained the model for a very small number of epochs, we shouldn't expect very impressive results here. However, from the graphs, we can tell that accuracy is steadily increasing and loss is steadily decreasing, which is a great sign.\n", + "\n", + "Finally, let's run the inference script. This is almost identical to running the training script; however, note that the argument for the -t parameter has been changed from True to False to specify that we are not training, and we are using the test_path_full CSV file to access the testing images.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "2ba2b24a", + "metadata": {}, + "outputs": [], + "source": [ + "!python /content/GaNDLF/gandlf_run -c /content/config.yaml -i /content/medmnist/dataset/test_path_full.csv -m /content/model/ -t False -d cuda" + ] + }, + { + "cell_type": "markdown", + "id": "b3e630c0", + "metadata": {}, + "source": [ + "Now that inference is complete, let's view some sample test images along with their predicted and ground truth classes to get a visual idea of how well our model did on each class. Remember to enter the Python shell if you are using the command-line interface." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "6c2dbdf6", + "metadata": {}, + "outputs": [], + "source": [ + "import matplotlib.pyplot as plt\n", + "import matplotlib.image as mpimg\n", + "import pandas as pd\n", + "\n", + "# final predictions written by the inference run\n", + "df_preds = pd.read_csv('./model/final_preds_and_avg_probs.csv')\n", + "\n", + "# indices of the test images we want to display\n", + "selected_images = [4, 7, 14, 22, 26, 3, 11, 17, 37, 45, 52, 1, 2, 40, 47, 76, 8, 13, 28, 19, 36, 79, 0, 9, 10]\n", + "\n", + "# a 5x5 grid of subplots for the 25 images\n", + "fig, ax = plt.subplot_mosaic([\n", + "    ['img0', 'img1', 'img2', 'img3', 'img4'],\n", + "    ['img5', 'img6', 'img7', 'img8', 'img9'],\n", + "    ['img10', 'img11', 'img12', 'img13', 'img14'],\n", + "    ['img15', 'img16', 'img17', 'img18', 'img19'],\n", + "    ['img20', 'img21', 'img22', 'img23', 'img24']\n", + "], figsize=(13, 13), constrained_layout = True)\n", + "\n", + "for i in range(len(selected_images)):\n", + "    img = selected_images[i]\n", + "    filename = df_preds.iloc[img]['SubjectID']\n", + "    # the ground-truth class is encoded in the filename, e.g. 'test123_4.png' -> 4\n", + "    ground_truth = filename.split('_')[1].split('.')[0]\n", + "    pred_class = df_preds.iloc[img]['PredictedClass']\n", + "\n", + "    path_img = mpimg.imread(f'./medmnist/dataset/pathmnist/{filename}')\n", + "\n", + "    ax[f'img{i}'].imshow(path_img)\n", + "    ax[f'img{i}'].axis('off')\n", + "    ax[f'img{i}'].set_title(f'Predicted Class: {pred_class}\nGround Truth: {ground_truth}')\n", + "\n", + "plt.show()\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "id": "9268f2ca", + "metadata": {}, + "source": [ + "Your produced plot should contain 25 images from the PathMNIST set, and each image should be accompanied by both a Predicted Class and a Ground Truth label. \n",
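+ "\n", + "Before looking at individual mistakes, you can also compute the overall test accuracy directly from the predictions CSV, since the ground-truth class is encoded in each filename (a minimal sketch, reusing df_preds from the cell above):\n", + "\n", + "```python\n", + "correct = sum(\n", + "    int(fn.split('_')[1].split('.')[0]) == int(pc)\n", + "    for fn, pc in zip(df_preds['SubjectID'], df_preds['PredictedClass'])\n", + ")\n", + "print(f'Test accuracy: {correct / len(df_preds):.3f}')\n", + "```\n", + "\n",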
+ "We can see that our model did a decent job of making predictions. However, it has a few common misconceptions. The model mistook smooth muscle for debris three times and cancer-associated stroma for debris once. It also mistook mucus for adipose twice, and it mistook colorectal adenocarcinoma epithelium for normal colon mucosa once.\n", + "\n", + "\n", + "To conclude this tutorial, let's zoom out and take a look at how well our model did as a whole on each class by constructing a confusion matrix from our inference data.\n", + "Note: if you'd like, feel free to change the colormap of the confusion matrix (denoted by \"cmap\" in the cm_display.plot() command) to your liking. Here's a list of some of the most popular colormaps: viridis (default), plasma, inferno, magma, cividis.\n", + "\n", + "Execute the following code in the Python shell: \n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "b827c049", + "metadata": {}, + "outputs": [], + "source": [ + "import matplotlib.pyplot as plt  # re-imported in case you are in a fresh Python shell\n", + "from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay\n", + "\n", + "# this reuses df_preds from the earlier cell; if it is not defined, re-read it:\n", + "# import pandas as pd; df_preds = pd.read_csv('./model/final_preds_and_avg_probs.csv')\n", + "\n", + "gt_list = []\n", + "pred_list = []\n", + "\n", + "for i in range(len(df_preds)):\n", + "    filename = df_preds.iloc[i]['SubjectID']\n", + "\n", + "    # ground truth comes from the filename; the prediction from the CSV column\n", + "    ground_truth = int(filename.split('_')[1].split('.')[0])\n", + "    pred_class = int(df_preds.iloc[i]['PredictedClass'])\n", + "\n", + "    gt_list.append(ground_truth)\n", + "    pred_list.append(pred_class)\n", + "\n", + "cm = confusion_matrix(gt_list, pred_list)\n", + "cm_display = ConfusionMatrixDisplay(confusion_matrix = cm)\n", + "\n", + "fig, ax = plt.subplots(figsize = (12, 12))\n", + "cm_display.plot(cmap = 'viridis', ax = ax)\n", + "plt.show()\n" + ] + }, + { + "cell_type": "markdown", + "id": "607a9dd8", + "metadata": {}, + "source": [ + "Let’s quickly go over this code:\n", + "\n", + "1. We import Matplotlib and Scikit-Learn's confusion-matrix utilities, which we will use to create and plot the confusion matrix. \n", + "\n", + "2. We create empty lists gt_list and pred_list to store our ground truth and predicted labels, respectively.\n", + "\n", + "3. The for loop iterates through df_preds, extracts the ground truth label and the predicted class for each image, and then appends the labels to their corresponding lists.\n", + "\n", + "4. We create a confusion matrix from these two lists and then call the plot() method to visualize it with our desired colormap.\n", + "\n", + "5. We display the confusion matrix using plt.show().\n" + ] + }, + { + "cell_type": "markdown", + "id": "dff24079", + "metadata": {}, + "source": [ + "Your resulting plot should be a 9x9 matrix that displays Predicted label vs. True label.\n", + "\n", + "In our matrix, larger numbers along the diagonal represent correct classifications. Here, we can see that while the model performed well overall, it had difficulties when it came to images of Class 7, incorrectly predicting a majority of them as belonging to Class 2 instead. We can see a similar trend with images of Class 5, with the model incorrectly predicting most of them as belonging to Class 2.\n", + "\n", + "Given the appearance of the accuracy and loss plots, had we trained for more epochs, we would have expected these results to improve. However, given that we only trained for 5 epochs, these are great results. Indeed, the model did very well on other classes, including Classes 0 and 1.\n",
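+ "\n", + "If you want numbers to go with the picture, per-class recall falls out of the matrix directly, as the diagonal divided by the row sums (a small sketch, reusing cm from the cell above):\n", + "\n", + "```python\n", + "per_class_recall = cm.diagonal() / cm.sum(axis=1)\n", + "for cls, recall in enumerate(per_class_recall):\n", + "    print(f'Class {cls}: recall {recall:.2f}')\n", + "```\n", + "\n",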
+ "That concludes this GaNDLF tutorial! Hopefully, this tutorial was helpful to you in understanding how GaNDLF works as well as how to apply it to your own projects. If you need any additional information about GaNDLF's usage and capabilities, please consult the GitHub repo (https://github.com/CBICA/GaNDLF) and the documentation (https://cbica.github.io/GaNDLF/). For more questions and support, please visit the Discussions page on GitHub (https://github.com/CBICA/GaNDLF/discussions).\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "8f91563c", + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.10.9" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/tutorials/classification_pathmnist_notebook/config.yaml b/tutorials/classification_pathmnist_notebook/config.yaml new file mode 100644 index 000000000..f90dea3a1 --- /dev/null +++ b/tutorials/classification_pathmnist_notebook/config.yaml @@ -0,0 +1,200 @@ +# affix version +version: + { + minimum: 0.0.14, + maximum: 0.0.16 # this should NOT be made a variable, but should be tested after every tag is created + } +# Choose the model parameters here + +model: + { + dimension: 2, # the dimension of the model and dataset: defines dimensionality of computations + base_filters: 32, # Set base filters: number of filters present in the initial module of the U-Net convolution; for IncU-Net, keep this divisible by 4 + architecture: vgg, # options: unet, resunet, fcn, uinc, vgg, densenet + batch_norm: True, # this is only used for vgg + norm_type: batch, + final_layer: sigmoid, # can be either sigmoid, softmax or none (none == regression) + amp: False, # Set if you want to use Automatic Mixed Precision for your operations or not - options: True, False + n_channels: 3, # set the input channels - useful when reading RGB or images that have vectored pixel types + class_list: ["0", "1", "2", "3", "4", "5", "6", "7", "8"] # changed for pathmnist + } +# this is to enable or disable lazy loading - setting to True reads all data once during data loading, resulting in improvements +# in I/O at the expense of memory consumption +in_memory: False +# this will save the generated masks for validation and testing data for qualitative analysis +save_masks: False +# Set the modality: rad for radiology, path for histopathology +modality: rad +# Patch size during training - 2D patch for breast images since third dimension is not patched +patch_size: [256, 256] +# uniform: UniformSampler or label: LabelSampler +patch_sampler: uniform +#metrics: ["classification_accuracy"] +#metrics: ["accuracy"] +metrics: + #- cel + - classification_accuracy + - f1: { + average: weighted, + } + - accuracy + - balanced_accuracy + # - precision: { + # average: weighted, + # } + # - recall + # - iou: { + # reduction: sum, + # } +# Number of epochs +num_epochs: 5 +# Set the patience - measured in number of epochs after which, if the performance metric does not improve, exit the training loop - defaults to the number of epochs +patience: 5 +# Set the batch size +batch_size: 16 +# Set the initial learning rate +learning_rate: 0.001 +# Learning rate scheduler - options: triangle, triangle_modified, exp, reduce-on-lr, step, more to come soon - default hyperparameters can be changed through code +scheduler: triangle +# Set which loss function you want to use (lower-case only) +# options: dc (dice only), dc_log (-log of dice), ce (cross-entropy), dcce (sum of dice and ce), mse (mean squared error), cel (cross-entropy loss), ... +# mse is the MSE defined by torch and can take a 'reduction' parameter; see https://pytorch.org/docs/stable/generated/torch.nn.MSELoss.html#torch.nn.MSELoss +# use mse_torch for regression/classification problems and dice for segmentation
+loss_function: cel +# this parameter weights the loss to handle imbalanced losses better +weighted_loss: True +#loss_function: +# { +# 'mse':{ +# 'reduction': 'mean' # see https://pytorch.org/docs/stable/generated/torch.nn.MSELoss.html#torch.nn.MSELoss for all options +# } +# } +# Which optimizer do you want to use - adam/sgd +opt: adam +# this parameter controls the nested training process +# performs randomized k-fold cross-validation +# split is performed using sklearn's KFold method +# for single fold run, use '-' before the fold number +nested_training: + { + testing: 1, # this controls the testing data splits for final model evaluation; use '1' if this is to be disabled + validation: -5 # this controls the validation data splits for model training + } +## pre-processing +# this constructs an order of transformations, which is applied to all images in the data loader +# order: resize --> threshold/clip --> resample --> normalize +# 'threshold': performs intensity thresholding; i.e., if x[i] < min: x[i] = 0; and if x[i] > max: x[i] = 0 +# 'clip': performs intensity clipping; i.e., if x[i] < min: x[i] = min; and if x[i] > max: x[i] = max +# 'threshold'/'clip': if either min/max is not defined, it is taken as the minimum/maximum of the image, respectively +# 'normalize': performs z-score normalization: https://torchio.readthedocs.io/transforms/preprocessing.html?highlight=ToCanonical#torchio.transforms.ZNormalization +# 'normalize_nonZero': performs z-score normalization but with mean and std-dev calculated on only non-zero pixels +# 'normalize_nonZero_masked': performs z-score normalization but with mean and std-dev calculated on only non-zero pixels, with the stats applied on non-zero pixels +# 'crop_external_zero_planes': crops all external zero-valued planes from the input tensor to reduce the image search space +# 'resample: resolution: X,Y,Z': resample the voxel resolution: https://torchio.readthedocs.io/transforms/preprocessing.html?highlight=ToCanonical#torchio.transforms.Resample +# 'resample: resolution: X': resample the voxel resolution in an isotropic manner: https://torchio.readthedocs.io/transforms/preprocessing.html?highlight=ToCanonical#torchio.transforms.Resample +# 'resize': resizes the image(s) and mask (this should be greater than or equal to patch_size); resize is done ONLY when resample is not defined + + +# data_augmentation: +# { +# # 'spatial':{ +# # 'probability': 0.5 +# # }, +# # 'kspace':{ +# # 'probability': 0.5 +# # }, +# 'bias':{ +# 'probability': 0.5 +# }, +# 'blur':{ +# 'probability': 0.5 +# }, +# 'noise':{ +# 'probability': 0.5 +# }, +# 'swap':{ +# 'probability': 0.5 +# } +# } + +data_preprocessing: + { + #'normalize_div_by_255', + # 'threshold':{ + # 'min': 10, + # 'max': 75 + # }, + # 'clip':{ + # 'min': 10, + # 'max': 75 + # }, + #'normalize_imagenet', + # 'resample':{ + # 'resolution': [1,2,3] + # }, + 'resize': [256,256], # this is generally not recommended, as it changes image properties in unexpected ways + #'resize_image': [256,256], #resizes the image and mask BEFORE applying any other operation + #'resize_patch': [256,256] 
#resizes the image and mask AFTER extracting the patch + } +# data_preprocessing: +# { +# # 'normalize', +# # # 'normalize_nonZero', # this performs z-score normalization only on non-zero pixels +# # 'resample':{ +# # 'resolution': [1,2,3] +# # }, + #'resize': [128, 128], # this is generally not recommended, as it changes image properties in unexpected ways +# # 'crop_external_zero_planes', # this will crop all zero-valued planes across all axes +# } +# various data augmentation techniques +# options: affine, elastic, downsample, motion, ghosting, bias, blur, gaussianNoise, swap +# keep/edit as needed +# all transforms: https://torchio.readthedocs.io/transforms/transforms.html?highlight=transforms +# 'kspace': one of motion, ghosting or spiking is picked (randomly) for augmentation +# 'probability' subkey adds the probability of the particular augmentation getting added during training (this is always 1 for normalize and resampling) +# data_augmentation: + # { + # default_probability: 0.5, + # 'affine', + # 'elastic', + # 'kspace':{ + # 'probability': 1 + # }, + # 'bias', + # 'blur': { + # 'std': [0, 1] # default std-dev range, for details, see https://torchio.readthedocs.io/transforms/augmentation.html?highlight=randomblur#torchio.transforms.RandomBlur + # }, + # 'noise': { # for details, see https://torchio.readthedocs.io/transforms/augmentation.html?highlight=randomblur#torchio.transforms.RandomNoise + # 'mean': 0, # default mean + # 'std': [0, 1] # default std-dev range + # }, + # 'anisotropic':{ + # 'axis': [0,1], + # 'downsampling': [2,2.5] + # }, + # } +# parallel training on HPC - here goes the command to prepend to send to a high performance computing +# cluster for parallel computing during multi-fold training +# not used for single fold training +# this gets passed before the training_loop, so ensure enough memory is provided along with other parameters +# that your HPC would expect +# ${outputDir} will be changed to the outputDir you pass in CLI + '/${fold_number}' +# ensure that the correct location of the virtual environment is getting invoked, otherwise it would pick up the system python, which might not have all dependencies + +# UB_commented: parallel_compute_command: 'qsub -b y -l gpu -l h_vmem=32G -cwd -o ${outputDir}/\$JOB_ID.stdout -e ${outputDir}/\$JOB_ID.stderr `pwd`/sge_wrapper _correct_location_of_virtual_environment_/venv/bin/python', +#parallel_compute_command: '' +## queue configuration - https://torchio.readthedocs.io/data/patch_training.html?#queue +# this determines the maximum number of patches that can be stored in the queue. Using a large number means that the queue needs to be filled less often, but more CPU memory is needed to store the patches + +q_max_length: 5 + +# this determines the number of patches to extract from each volume. A small number of patches ensures a large variability in the queue, but training will be slower + +q_samples_per_volume: 1 + +# this determines the number of subprocesses to use for data loading; '0' means the main process is used + +q_num_workers: 0 # scale this according to available CPU resources (was 16) + +# used for debugging +q_verbose: False