- Quiros A.C.+, Coudray N.+, Yeaton A., Yang X., Liu B., Le H., Chiriboga L., Karimkhan A., Narula N., Moore D.A., Park C.Y., Pass H., Moreira A.L., Le Quesne J.*, Tsirigos A.*, and Yuan K.* Mapping the landscape of histomorphological cancer phenotypes using self-supervised learning on unlabeled, unannotated pathology slides. Nature Communications 15, 4596 (2024).
Abstract:
Definitive cancer diagnosis and management depend upon the extraction of information from microscopy images by pathologists. These images contain complex information requiring time-consuming expert human interpretation that is prone to human bias. Supervised deep learning approaches have proven powerful for classification tasks, but they are inherently limited by the cost and quality of annotations used for training these models. To address this limitation of supervised methods, we developed Histomorphological Phenotype Learning (HPL), a fully unsupervised methodology that requires no expert labels or annotations and operates via the automatic discovery of discriminatory image features in small image tiles. Tiles are grouped into morphologically similar clusters which constitute a library of histomorphological phenotypes, revealing trajectories from benign to malignant tissue via inflammatory and reactive phenotypes. These clusters have distinct features which can be identified using orthogonal methods, linking histologic, molecular and clinical phenotypes. Applied to lung cancer tissues, we show that they align closely with patient survival, with histopathologically recognised tumor types and growth patterns, and with transcriptomic measures of immunophenotype. We then demonstrate that these properties are maintained in a multi-cancer study. These results show the clusters represent recurrent host responses and modes of tumor growth emerging under natural selection.
@article{QuirosCoudray2024,
author = {Claudio Quiros, Adalberto and Coudray, Nicolas and Yeaton, Anna and Yang, Xinyu and Liu, Bojing and Le, Hortense and Chiriboga, Luis and Karimkhan, Afreen and Narula, Navneet and Moore, David A. and Park, Christopher Y. and Pass, Harvey and Moreira, Andre L. and Le Quesne, John and Tsirigos, Aristotelis and Yuan, Ke},
journal = {Nature Communications},
number = {1},
pages = {4596},
title = {Mapping the landscape of histomorphological cancer phenotypes using self-supervised learning on unlabeled, unannotated pathology slides},
volume = {15},
year = {2024}
}
In this repository you will find the following sections:
- WSI tiling process: Instructions on how to create H5 files from WSI tiles.
- Workspace setup: Details on H5 file content and directory structure.
- HPL instructions: Step-by-step instructions on how to run the complete methodology.
- Self-supervised Barlow Twins training.
- Tile vector representations.
- Combination of all sets into one H5.
- Fold cross validation files.
- Include metadata in H5 file.
- Leiden clustering.
- Removing background tiles.
- HPC configuration selection.
- Logistic regression for lung type WSI classification.
- Cox proportional hazards for survival regression.
- Correlation between annotations and HPCs.
- Get tiles and WSI samples for HPCs.
- HPL Visualizer: Interactive app to visualize UMAP representations, tiles, and HPC membership.
- Frequently Asked Questions.
- TCGA HPL files: HPL output files from our paper results.
- Python Environment: Python version and packages.
- Dockers: Docker environments to run HPL steps.
This step divides whole slide images (WSIs) into 224x224 tiles and stores them in H5 files. At the end of this step, you should have three H5 files, one each for the training, validation, and test sets. The training set is used to train the self-supervised CNN; in our work this corresponded to 60% of TCGA LUAD & LUSC WSIs.
We used the framework provided in Coudray et al., 'Classification and mutation prediction from non–small cell lung cancer histopathology images using deep learning', Nature Medicine, 2018. The steps to run the framework are 0.1, 0.2.a, and 4 (end of readme). In our work we used Reinhard normalization, which can be applied at the same time as the tiling through the '-N' option in step 0.1.
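For orientation only, the sketch below shows the core idea of this step: tiling a WSI into 224x224 RGB patches with OpenSlide. It is not the DeepPATH pipeline referenced above; it assumes level 0 of the slide and omits magnification selection, background filtering, and Reinhard normalization.

```python
# Minimal sketch of 224x224 tiling with OpenSlide (illustration only; the paper
# used the DeepPATH pipeline referenced above).
import numpy as np
import openslide

def iter_tiles(wsi_path, tile_size=224, level=0):
    slide = openslide.OpenSlide(wsi_path)
    width, height = slide.level_dimensions[level]
    for y in range(0, height - tile_size + 1, tile_size):
        for x in range(0, width - tile_size + 1, tile_size):
            # read_region expects level-0 coordinates; with level=0 they coincide.
            region = slide.read_region((x, y), level, (tile_size, tile_size))
            yield x, y, np.asarray(region.convert("RGB"))
```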
This section specifies the requirements for H5 file content and directory structure needed to run the flow.
In the instructions below we use the following variables and names:
- dataset_name: TCGAFFPE_LUADLUSC_5x_60pc
- marker_name: he
- tile_size: 224
If you are not familiar with H5 files, you can find documentation on the Python package here.
This framework assumes that the datasets inside each H5 file follow the naming format 'set_labelname'. In addition, all H5 files are required to have the same number of datasets. Example (a snippet to verify this layout follows the file list below):
- File: hdf5_TCGAFFPE_LUADLUSC_5x_60pc_he_train.h5
  - Dataset names: train_img, train_tiles, train_slides, train_samples
- File: hdf5_TCGAFFPE_LUADLUSC_5x_60pc_he_validation.h5
  - Dataset names: valid_img, valid_tiles, valid_slides, valid_samples
- File: hdf5_TCGAFFPE_LUADLUSC_5x_60pc_he_test.h5
  - Dataset names: test_img, test_tiles, test_slides, test_samples
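A quick way to verify that a file follows this layout is to list its datasets with h5py. This is a minimal sketch using the example training file name above.

```python
# List every dataset in a set H5 file and confirm the 'set_labelname' convention.
import h5py

with h5py.File("hdf5_TCGAFFPE_LUADLUSC_5x_60pc_he_train.h5", "r") as h5_file:
    for name, dataset in h5_file.items():
        print(name, dataset.shape, dataset.dtype)
# Expected dataset names: train_img, train_tiles, train_slides, train_samples
```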
The code makes the following assumptions about where the datasets, model training outputs, and image representations are stored:
- Datasets:
  - Dataset folder.
  - Structure: datasets/dataset_name/marker_name/patches_htile_size_wtile_size
  - E.g.: datasets/TCGAFFPE_LUADLUSC_5x_60pc/he/patches_h224_w224
- Train, validation, and test sets:
  - Each dataset is assumed to contain at least a training set.
  - Naming convention: hdf5_dataset_name_marker_name_set_name.h5
  - E.g.: datasets/TCGAFFPE_LUADLUSC_5x_60pc/he/patches_h224_w224/hdf5_TCGAFFPE_LUADLUSC_5x_60pc_he_train.h5
- Data_model_output:
  - Output folder for self-supervised trained models.
  - Structure: data_model_output/model_name/dataset_name/htile_size_wtile_size_n3_zdimlatent_space_size
  - E.g.: data_model_output/BarlowTwins_3/TCGAFFPE_LUADLUSC_5x_60pc/h224_w224_n3_zdim128
- Results:
  - Output folder for self-supervised representation results.
  - This folder will contain the representations, clustering data, and logistic/Cox regression results.
  - Structure: results/model_name/dataset_name/htile_size_wtile_size_n3_zdimlatent_space_size
  - E.g.: results/BarlowTwins_3/TCGAFFPE_LUADLUSC_5x_60pc/h224_w224_n3_zdim128
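To make these conventions concrete, the snippet below builds the expected paths from the variables defined earlier. The model name and latent dimension are the example values used in this README, not required settings.

```python
# Build the expected dataset, model output, and results paths from the README variables.
dataset_name = "TCGAFFPE_LUADLUSC_5x_60pc"
marker_name  = "he"
tile_size    = 224
model_name   = "BarlowTwins_3"   # example model name used throughout this README
z_dim        = 128               # example latent space size

dataset_dir = f"datasets/{dataset_name}/{marker_name}/patches_h{tile_size}_w{tile_size}"
train_h5    = f"{dataset_dir}/hdf5_{dataset_name}_{marker_name}_train.h5"
model_dir   = f"data_model_output/{model_name}/{dataset_name}/h{tile_size}_w{tile_size}_n3_zdim{z_dim}"
results_dir = f"results/{model_name}/{dataset_name}/h{tile_size}_w{tile_size}_n3_zdim{z_dim}"
```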
The flow consists of the following steps:
- Self-supervised Barlow Twins training.
- Tile vector representations.
- Combination of all sets into one H5.
- Fold cross validation files.
- Include metadata in H5 file.
- Leiden clustering.
- Removing background tiles.
- HPC configuration selection.
- Logistic regression for lung type WSI classification.
- Cox proportional hazards for survival regression.
- Correlation between annotations and HPCs.
- Get tiles and WSI samples for HPCs.
You can find the full details on HPL instructions in this Readme_HPL file.
You can find standalone apps in the following locations. These were built using Marimo.
You can edit the code by running marimo edit tile_visualizer_umap.py. Run the app with marimo run tile_visualizer_umap.py.
You can find TCGA files, results, and commands to reproduce them in this Readme_replication file. For any questions regarding the New York University cohorts, please address reasonable requests to the corresponding authors.
You can find steps on how to assign existing HPCs in this Readme_additional_cohort file. These instructions will guide you through assigning the LUAD and Multi-cancer HPCs reported in the publication to your own cohort.
When I run the Leiden clustering step, I get a 'TypeError: can't pickle weakref objects' error in some folds.
In our experience, this error occurs with incompatible versions of numba, umap-learn, and scanpy. The package versions in the Python environment should work, but the following alternative package combination also works:
scanpy==1.7.1
pynndescent==0.5.0
numba==0.51.2
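For convenience, those pins can be installed in one step (assuming pip inside the active environment):

```bash
python3 -m pip install scanpy==1.7.1 pynndescent==0.5.0 numba==0.51.2
```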
This section contains the following TCGA files produced by HPL:
- TCGA WSI tile image datasets.
- TCGA Self-supervised trained weights.
- TCGA tile projections.
- TCGA HPC configurations.
- TCGA WSI & patient representations.
For the New York University cohorts, please send reasonable requests to the corresponding authors.
You can find the WSI tile images at:
- LUAD & LUSC
- LUAD & LUSC 250K subsample for self-supervised model training.
- Multi-Cancer (BLCA, BRCA, CESC, COAD, LUSC, LUAD, PRAD, SKCM, STAD, UCEC)
- Multi-Cancer (BLCA, BRCA, CESC, COAD, LUSC, LUAD, PRAD, SKCM, STAD, UCEC) 250K subsample for self-supervised model training.
Self-supervised model weights:
You can find tile projections for the TCGA cohorts at the following locations. These are the projections used in the publication results.
- LUAD & LUSC tile vector representations (background and artifact tiles unfiltered)
- LUAD & LUSC tile vector representations
- Multi-Cancer tile vector representations (BLCA, BRCA, CESC, COAD, LUSC, LUAD, PRAD, SKCM, STAD, UCEC)
You can find HPC configurations used in the publication results at:
- Background and artifact removal
- LUAD vs LUSC type classification
- LUAD survival
- Multi-cancer (BLCA, BRCA, CESC, COAD, LUSC, LUAD, PRAD, SKCM, STAD, UCEC)
You can find WSI and patient vector representations used in the publication results at:
- LUAD vs LUSC type classification
- LUAD survival
- Multi-cancer (BLCA, BRCA, CESC, COAD, LUSC, LUAD, PRAD, SKCM, STAD, UCEC)
The code uses Python 3.8; the necessary packages can be found in requirements.txt.
The flow uses TensorFlow 1.15. According to TensorFlow's specs, the closest CUDA and cuDNN versions are cudatoolkit==10.0 and cudnn==7.6.0.
However, depending on your GPU card you might need to use cudatoolkit==11.7 and cudnn==8.0 instead.
Newer cards with the Ampere architecture (Nvidia 30-series or A100s) only work with CUDA 11.x; Nvidia maintains this repo, so you can use TensorFlow 1.15 with the newer versions of CUDA.
These commands should set up the right environment to run HPL:
conda create -n HPL python=3.8
conda activate HPL
python3 -m pip install --user nvidia-pyindex
python3 -m pip install --user nvidia-tensorflow
python3 -m pip install -r requirements.txt
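After installation, a quick sanity check (a minimal sketch using the TensorFlow 1.x API) confirms the version and GPU visibility:

```python
# Confirm the TensorFlow version and that a GPU is visible (TF 1.x API).
import tensorflow as tf

print(tf.__version__)              # expect 1.15.x
print(tf.test.is_gpu_available())  # expect True on a correctly configured GPU
```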
These are the Docker images with the environments to run the steps of HPL. The 'Leiden clustering' step needs to be run with Docker [2]; all other steps can be run with Docker [1]:
- Self-Supervised models training and projections:
- Leiden clustering:
If you want to run the Docker image on your local machine, these commands should get you up and running.
Please take into account that the image aclaudioquiros/tf_package:v16 uses CUDA 10.0; if your GPU card uses the Ampere architecture (Nvidia 30s or A100s), it won't work properly.
In addition, if you want to run Step 6 - Leiden clustering in HPL, you need to change the image name, as shown in the placeholder example after the commands below:
docker run -it --mount src=`pwd`,target=/tmp/Workspace,type=bind aclaudioquiros/tf_package:v16
cd Workspace
# Command you want to run here.
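For the Leiden clustering step, the invocation is the same with the clustering image from the 'Dockers' section swapped in; `<leiden_image>` below is a placeholder, not a real tag:

```bash
docker run -it --mount src=`pwd`,target=/tmp/Workspace,type=bind <leiden_image>
cd Workspace
# Command you want to run here.
```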