To install and use this software, you need:
- An NVIDIA GPU with CUDA 11.7 (other CUDA versions may work, but they are not officially supported),
- conda (or Python 3.10 and pdm), and
- git.
First, clone this repository.
```bash
git clone https://github.com/AllenCell/benchmarking_representations
cd benchmarking_representations
```
Create a virtual environment.
```bash
conda create --name br python=3.10
conda activate br
```
Depending on your GPU set-up, you may need to set the `CUDA_VISIBLE_DEVICES` environment variable. To do this, first get the universally unique identifiers (UUIDs) of your GPUs, then set `CUDA_VISIBLE_DEVICES` to some or all of them as a comma-separated list, as in the following examples.
Example 1: two GPUs selected by index

```bash
export CUDA_VISIBLE_DEVICES=0,1
```

Example 2: one partition of a MIG-partitioned GPU

```bash
export CUDA_VISIBLE_DEVICES=MIG-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
```
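The GPU UUIDs above can be listed with `nvidia-smi -L`. If you prefer, the variable can also be set from inside Python, as long as this happens before any CUDA-aware library (such as torch) is imported; a minimal sketch, with hypothetical device indices:

```python
import os

# Must run before torch (or any other CUDA library) is imported,
# otherwise the setting is ignored for this process.
# The indices here are hypothetical; use the output of `nvidia-smi -L`.
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"
```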
Next, install all required packages.

```bash
pip install -r requirements1.txt
pip install -r requirements2.txt
pip install -e .
```
For `pdm` users, follow these installation steps instead.
Q: When installing dependencies, PyTorch fails to install with the following error message.

```
torch.cuda.DeferredCudaCallError: CUDA call failed lazily at initialization with error: device >= 0 && device < num_gpus
```

A: You may need to configure the `CUDA_VISIBLE_DEVICES` environment variable.
To run the models, you must set the `CYTODL_CONFIG_PATH` environment variable to point to the `br/configs` folder. Check that your current working directory is the `benchmarking_representations` folder, then run the following command (this will last only for the duration of your shell session).

```bash
export CYTODL_CONFIG_PATH=$PWD/configs/
```
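A quick way to verify the variable from Python is a small helper like the following (hypothetical; not part of the repo):

```python
import os

def check_config_path(env=None):
    """Return the configured config path, raising a helpful error if unset or missing."""
    env = os.environ if env is None else env
    path = env.get("CYTODL_CONFIG_PATH")
    if not path:
        raise RuntimeError("CYTODL_CONFIG_PATH is not set")
    if not os.path.isdir(path):
        raise RuntimeError(f"CYTODL_CONFIG_PATH points to a missing folder: {path}")
    return path
```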
Preprocessing the data can take several hours. To skip this step, download the preprocessed data for each dataset into the current directory. This will use around 740 GB.

```bash
aws s3 cp --no-sign-request --recursive s3://allencell/aics/morphology_appropriate_representation_learning/preprocessed_data/ .
```
Training these models can take days. We've published our trained models so you don't have to. Skip to the next section if you'd like to just use our models.
- Create a single cell manifest (e.g. csv, parquet) for each dataset, with a column of final processed file paths and a split column indicating the train/test/validation split.
- Update the final single cell dataset path (`SINGLE_CELL_DATASET_PATH`) and the manifest column for the appropriate input modality (`SDF_COLUMN`/`SEG_COLUMN`/`POINTCLOUD_COLUMN`/`IMAGE_COLUMN`) in each datamodule file. For example, for the PCNA data these yaml files are located here:
```
configs
└── data
    └── pcna
        ├── image.yaml                <- Datamodule for PCNA images
        ├── pc.yaml                   <- Datamodule for PCNA point clouds
        ├── pc_intensity.yaml         <- Datamodule for PCNA point clouds with intensity
        └── pc_intensity_jitter.yaml  <- Datamodule for PCNA point clouds with intensity and jitter
```
- Train models using cyto_dl. Be sure to run the training scripts from the folder where the repo was cloned (and where all the data was downloaded). Experiment configs for point cloud and image models for the cellpack dataset are located here:
```
configs
└── experiment
    └── cellpack
        ├── image_classical.yaml  <- Classical image model experiment
        ├── image_so3.yaml        <- Rotation invariant image model experiment
        ├── pc_classical.yaml     <- Classical point cloud model experiment
        └── pc_so3.yaml           <- Rotation invariant point cloud model experiment
```
Here is an example of training a rotation invariant point cloud model.

```bash
python src/br/models/train.py experiment=cellpack/pc_so3
```
Override parts of the experiment config via the command line or manually in the configs. For example, to train a classical model, run the following.

```bash
python src/br/models/train.py experiment=cellpack/pc_so3 model=pc/classical_earthmovers_sphere ++csv.save_dir=[SAVE_DIR]
```
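For reference, the single cell manifest from the first training step above can be as simple as a csv with a path column and a split column. Here is a minimal sketch using hypothetical paths and column names (match the column names to what your datamodule yaml files expect):

```python
import csv
import random

# Hypothetical processed file paths; replace with your own.
paths = [f"/data/processed/cell_{i:04d}.ply" for i in range(100)]

random.seed(0)
rows = []
for p in paths:
    # Roughly 80/10/10 train/val/test split.
    split = random.choices(["train", "val", "test"], weights=[8, 1, 1])[0]
    rows.append({"pointcloud_path": p, "split": split})

with open("manifest.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["pointcloud_path", "split"])
    writer.writeheader()
    writer.writerows(rows)
```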
To skip model training, download our pre-trained models. For each of the six datasets, there are five `.ckpt` files. The easiest way to get these 30 models in the expected layout is with the AWS CLI.
- Install the AWS CLI.
- Confirm that you are in the `benchmarking_representations` folder.

```bash
$ pwd
/home/myuser/benchmarking_representations/
```
- Download the 30 models into the current directory. This will use almost 4 GB.

```bash
aws s3 cp --no-sign-request --recursive s3://allencell/aics/morphology_appropriate_representation_learning/model_checkpoints/ .
```
Instead of installing the AWS CLI, you can download `.ckpt` files one at a time by browsing the dataset on Quilt.

By default, the checkpoint files are expected in `benchmarking_representations/morphology_appropriate_representation_learning/model_checkpoints/`, organized in six subfolders (one for each dataset). This folder structure is provided as part of this repo. Move the downloaded checkpoint files into the folder corresponding to their dataset.
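After moving the files, a small script can confirm the layout. This is a hypothetical helper, not part of the repo; the dataset subfolder names are assumptions based on the dataset names used elsewhere in this README:

```python
from pathlib import Path

# Assumed dataset subfolder names; check them against the repo's folder structure.
DATASETS = ["cellpack", "pcna", "other_punctate", "npm1", "npm1_perturb", "other_polymorphic"]

def checkpoint_counts(root):
    """Map each dataset subfolder under `root` to its number of .ckpt files."""
    root = Path(root)
    return {name: len(list((root / name).glob("*.ckpt"))) for name in DATASETS}
```

Each count should be 5 once all 30 checkpoints are in place.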
Skip to the next section if you'd like to just use our pre-computed embeddings. Otherwise, to compute embeddings from the trained models, update the data paths in the datamodule files to point to your pre-processed data. Then, run the following commands.
| Dataset | Embedding command |
| --- | --- |
| cellpack | `python src/br/analysis/run_embeddings.py --save_path "./outputs_cellpack/" --sdf False --dataset_name cellpack --batch_size 5 --debug False` |
| npm1_perturb | `python src/br/analysis/run_embeddings.py --save_path "./outputs_npm1_perturb/" --sdf True --dataset_name npm1_perturb --batch_size 5 --debug False` |
| npm1 | `python src/br/analysis/run_embeddings.py --save_path "./outputs_npm1/" --sdf True --dataset_name npm1 --batch_size 5 --debug False` |
| npm1_64_res | `python src/br/analysis/run_embeddings.py --save_path "./outputs_npm1_64_res/" --sdf True --dataset_name npm1_64_res --batch_size 5 --debug False --eval_scaled_img_resolution 64` |
| other_polymorphic | `python src/br/analysis/run_embeddings.py --save_path "./outputs_other_polymorphic/" --sdf True --dataset_name other_polymorphic --batch_size 5 --debug False` |
| other_punctate | `python src/br/analysis/run_embeddings.py --save_path "./outputs_other_punctate/" --sdf False --dataset_name other_punctate --batch_size 5 --debug False` |
| pcna | `python src/br/analysis/run_embeddings.py --save_path "./outputs_pcna/" --sdf False --dataset_name pcna --batch_size 5 --debug False` |
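The commands in the table differ only in the `--sdf` flag and, for npm1_64_res, one extra resolution flag. If you are scripting over all datasets, the pattern can be captured in a small helper (a sketch, not part of the repo):

```python
# Per-dataset settings taken from the embedding table: (sdf flag, extra flags).
EMBEDDING_SETTINGS = {
    "cellpack": (False, ""),
    "npm1_perturb": (True, ""),
    "npm1": (True, ""),
    "npm1_64_res": (True, " --eval_scaled_img_resolution 64"),
    "other_polymorphic": (True, ""),
    "other_punctate": (False, ""),
    "pcna": (False, ""),
}

def embedding_command(dataset):
    """Build the run_embeddings.py invocation for one dataset."""
    sdf, extra = EMBEDDING_SETTINGS[dataset]
    return (
        f'python src/br/analysis/run_embeddings.py --save_path "./outputs_{dataset}/" '
        f"--sdf {sdf} --dataset_name {dataset} --batch_size 5 --debug False{extra}"
    )
```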
Many of the results from the paper can be reproduced just from the embeddings produced by the model. You can download our pre-computed embeddings here.
- cellPACK synthetic dataset
- DNA replication foci dataset
- WTC-11 hiPSC single cell image dataset v1 punctate structures
- WTC-11 hiPSC single cell image dataset v1 nucleolus (NPM1)
- WTC-11 hiPSC single cell image dataset v1 polymorphic structures
- Nucleolar drug perturbation dataset
- To compute benchmarking features from the embeddings and trained models, run the following commands.
| Dataset | Benchmarking features command |
| --- | --- |
| cellpack | `python src/br/analysis/run_features.py --save_path "./outputs_cellpack/" --embeddings_path "./morphology_appropriate_representation_learning/model_embeddings/cellpack" --sdf False --dataset_name "cellpack" --debug False` |
| npm1 | `python src/br/analysis/run_features.py --save_path "./outputs_npm1/" --embeddings_path "./morphology_appropriate_representation_learning/model_embeddings/npm1" --sdf True --dataset_name "npm1" --debug False` |
| npm1_64_res | `python src/br/analysis/run_features.py --save_path "./outputs_npm1_64_res/" --embeddings_path "./morphology_appropriate_representation_learning/model_embeddings/npm1_64_res" --sdf True --dataset_name "npm1_64_res" --debug False --eval_scaled_img_resolution 64` |
| other_polymorphic | `python src/br/analysis/run_features.py --save_path "./outputs_other_polymorphic/" --embeddings_path "./morphology_appropriate_representation_learning/model_embeddings/other_polymorphic" --sdf True --dataset_name "other_polymorphic" --debug False` |
| other_punctate | `python src/br/analysis/run_features.py --save_path "./outputs_other_punctate/" --embeddings_path "./morphology_appropriate_representation_learning/model_embeddings/other_punctate" --sdf False --dataset_name "other_punctate" --debug False` |
| pcna | `python src/br/analysis/run_features.py --save_path "./outputs_pcna/" --embeddings_path "./morphology_appropriate_representation_learning/model_embeddings/pcna" --sdf False --dataset_name "pcna" --debug False` |
To combine features from different runs and compare them, run:

```bash
python src/br/analysis/run_features_combine.py --feature_path_1 './outputs_npm1/' --feature_path_2 './outputs_npm1_64_res/' --save_path "./outputs_npm1_combine/" --dataset_name_1 "npm1" --dataset_name_2 "npm1_64_res"
```
- To run analyses like latent walks and archetype analysis on the embeddings and trained models, run the following commands.
| Dataset | Analysis command |
| --- | --- |
| cellpack | `python src/br/analysis/run_analysis.py --save_path "./outputs_cellpack/" --embeddings_path "./morphology_appropriate_representation_learning/model_embeddings/cellpack" --dataset_name "cellpack" --run_name "Rotation_invariant_pointcloud_jitter" --sdf False --pacmap False` |
| npm1 | `python src/br/analysis/run_analysis.py --save_path "./outputs_npm1/" --embeddings_path "./morphology_appropriate_representation_learning/model_embeddings/npm1" --dataset_name "npm1" --run_name "Rotation_invariant_pointcloud_SDF" --sdf True --pacmap False` |
| other_polymorphic | `python src/br/analysis/run_analysis.py --save_path "./outputs_other_polymorphic/" --embeddings_path "./morphology_appropriate_representation_learning/model_embeddings/other_polymorphic" --dataset_name "other_polymorphic" --run_name "Rotation_invariant_pointcloud_SDF" --sdf True --pacmap True` |
| other_punctate | `python src/br/analysis/run_analysis.py --save_path "./outputs_other_punctate/" --embeddings_path "./morphology_appropriate_representation_learning/model_embeddings/other_punctate" --dataset_name "other_punctate" --run_name "Rotation_invariant_pointcloud_structurenorm" --sdf False --pacmap True` |
| pcna | `python src/br/analysis/run_analysis.py --save_path "./outputs_pcna/" --embeddings_path "./morphology_appropriate_representation_learning/model_embeddings/pcna" --dataset_name "pcna" --run_name "Rotation_invariant_pointcloud_jitter" --sdf False --pacmap False` |
- To run drug perturbation analysis using the pre-computed features, run:

```bash
python src/br/analysis/run_drugdata_analysis.py --save_path "./outputs_npm1_perturb/" --embeddings_path "./morphology_appropriate_representation_learning/model_embeddings/npm1_perturb/" --dataset_name "npm1_perturb"
```
To compute CellProfiler features, open the project file in CellProfiler and point it to the single cell images of nucleoli in the npm1 perturbation dataset. This will generate a csv named `MyExpt_Image.csv` that contains mean, median, and standard deviation statistics per image across the different computed features.