SlideSleuth is a tool for analyzing large whole slide image (WSI) datasets of lung adenocarcinoma (LUAD) via feature extraction and unsupervised learning. Specifically, SlideSleuth feeds each slide image into a variational autoencoder (VAE) and then analyzes the clusters the VAE produces. Within those clusters, we aim to identify biomarkers and cancer drivers for LUAD.
The tool includes pipelines that prepare WSI datasets for both a supervised classifier and a variational autoencoder.
The tool is still in active development; at present, only the data pipeline has been built. The development languages are Python, R, and Bash. Pipelining and development are done with the help of TensorFlow, OpenSlide, and the Bioconductor R packages. Containerization is done with Apptainer (formerly Singularity).
Run `./setup.sh`, followed by `source ENV/bin/activate` in the same directory. This will install all necessary dependencies.
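For example, from the repository root (assuming `setup.sh` creates the `ENV` virtual environment it activates):

```bash
./setup.sh                # install dependencies into a virtual environment
source ENV/bin/activate   # activate that environment for the current shell
```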
The project source code is divided into four main sections (see the layout sketch below):

- `features` - code to tile slide images and sort the tiles into class directories
- `data` - code to perform model-specific post-processing on the data
- `models` - code to train and test supervised and unsupervised models on the data
- `visualization` - code to visualize trained model performance
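A sketch of the layout, inferred from the script paths referenced below:

```
src/
├── features/        # tiling and sorting of raw slide images
├── data/            # model-specific post-processing
├── models/          # model training and testing
└── visualization/   # performance visualization
```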
Assuming the tiled images are in a single folder, running the script `src/data/cvae_data_pipeline.sh` with that folder's path as the `DIR_PATH` global variable will reorganize the data into a format readable by TensorFlow's data pipeline APIs. As with the tiling step below, this has already been done by Jackson for the UHN dataset (it is somewhat time consuming).
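A minimal sketch of this step, assuming `DIR_PATH` is read from the environment (it may instead be a variable edited at the top of the script); the tile directory shown is hypothetical:

```bash
# Point DIR_PATH at the folder of tiled images, then run the pipeline script.
DIR_PATH=/path/to/tiled_images ./src/data/cvae_data_pipeline.sh
```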
Assuming the use case of the private UHN dataset that this project was developed with, run the script `src/features/tile_uhn_binary.sh` to make tiles from the raw slide images. For the UHN dataset this has already been done by Jackson, and it may save some time to contact him about transferring the data (assuming you have permission to view it).
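A minimal sketch, assuming any dataset-specific paths are configured inside the script itself:

```bash
# Run from the repository root; produces tiles from the raw UHN slides.
./src/features/tile_uhn_binary.sh
```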
Once the data is processed, the convolutional variational autoencoder can be trained by running `src/models/train_cvae.sh` with the desired `DIR_PATH`, `SAVE_PATH`, and `FIG_PATH` global variables.
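A minimal sketch of a training run, again assuming the three globals are read from the environment rather than edited in the script; all paths are hypothetical:

```bash
# DIR_PATH: processed tiles; SAVE_PATH: model weights; FIG_PATH: output figures.
DIR_PATH=/path/to/processed_tiles \
SAVE_PATH=/path/to/saved_model \
FIG_PATH=/path/to/figures \
./src/models/train_cvae.sh
```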
Once the model is trained, the autoencoder can reconstruct a sample of images by calling `src/visualization/analyze_cvae.sh`.
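For example, run after training; any required paths (such as the location of the saved model) are assumed to be configured the same way as in the training step:

```bash
# Reconstructs a sample of images with the trained autoencoder.
./src/visualization/analyze_cvae.sh
```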
The dataset used for the current iteration of this tool is a private dataset from UHN (University Health Network). Please contact the authors for inquiries regarding data availability. Other test datasets were used during development, mainly the TCGA-BRCA and TCGA-PAAD projects from the GDC (Genomic Data Commons).
Please contact [email protected] for any questions, concerns, bug fixes, or further clarifications.