To install and use this software, you need:
- An NVIDIA GPU with CUDA 11.7 (other CUDA versions may work, but they are not officially supported),
- conda (or Python 3.10 and pdm), and
- git.
First, clone this repository.
```bash
git clone https://github.com/AllenCell/benchmarking_representations
cd benchmarking_representations
```
Create a virtual environment.
```bash
conda create --name br python=3.10
conda activate br
```
Depending on your GPU set-up, you may need to set the `CUDA_VISIBLE_DEVICES` environment variable. To do this, first get the universally unique identifiers (UUIDs) of your GPUs, then set `CUDA_VISIBLE_DEVICES` to some or all of them as a comma-separated list, as in the following examples.
Example 1: two GPUs selected by index

```bash
export CUDA_VISIBLE_DEVICES=0,1
```

Example 2: one partition of a MIG-partitioned GPU

```bash
export CUDA_VISIBLE_DEVICES=MIG-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
```
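The GPU UUIDs above can be listed with `nvidia-smi -L`. If you prefer, the variable can also be set from inside Python, as long as this happens before any CUDA-aware library (such as torch) is imported; a minimal sketch, with hypothetical device indices:

```python
import os

# Must run before torch (or any other CUDA library) is imported,
# otherwise the setting is ignored for this process.
# The indices here are hypothetical; use the output of `nvidia-smi -L`.
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"
```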
Next, install all required packages.

```bash
pip install -r requirements1.txt
pip install -r requirements2.txt
pip install -e .
```
For `pdm` users, follow these installation steps instead.
Q: When installing dependencies, PyTorch fails to install with the following error message.

```
torch.cuda.DeferredCudaCallError: CUDA call failed lazily at initialization with error: device >= 0 && device < num_gpus
```

A: You may need to configure the `CUDA_VISIBLE_DEVICES` environment variable.
To run the models, you must set the `CYTODL_CONFIG_PATH` environment variable to point to the `br/configs` folder. Check that your current working directory is the `benchmarking_representations` folder, then run the following command (this will last only for the duration of your shell session).

```bash
export CYTODL_CONFIG_PATH=$PWD/configs/
```
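A quick way to verify the variable from Python is a small helper like the following (hypothetical; not part of the repo):

```python
import os

def check_config_path(env=None):
    """Return the configured config path, raising a helpful error if unset or missing."""
    env = os.environ if env is None else env
    path = env.get("CYTODL_CONFIG_PATH")
    if not path:
        raise RuntimeError("CYTODL_CONFIG_PATH is not set")
    if not os.path.isdir(path):
        raise RuntimeError(f"CYTODL_CONFIG_PATH points to a missing folder: {path}")
    return path
```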
Preprocessing the data can take several hours. To skip this step, download the preprocessed data for each dataset into the current directory. This will use around 740 GB.

```bash
aws s3 cp --no-sign-request --recursive s3://allencell/aics/morphology_appropriate_representation_learning/preprocessed_data/ .
```
Training these models can take days. We've published our trained models so you don't have to. Skip to the next section if you'd like to just use our models.
- Create a single cell manifest (e.g. csv, parquet) for each dataset, with a column of final processed file paths and a split column indicating the train/test/validation split.
- Update the final single cell dataset path (`SINGLE_CELL_DATASET_PATH`) and the manifest column for the appropriate input modality (`SDF_COLUMN`/`SEG_COLUMN`/`POINTCLOUD_COLUMN`/`IMAGE_COLUMN`) in each datamodule file. For example, for the PCNA data these yaml files are located here:
```
configs
└── data
    └── pcna
        ├── image.yaml                <- Datamodule for PCNA images
        ├── pc.yaml                   <- Datamodule for PCNA point clouds
        ├── pc_intensity.yaml         <- Datamodule for PCNA point clouds with intensity
        └── pc_intensity_jitter.yaml  <- Datamodule for PCNA point clouds with intensity and jitter
```
- Train models using cyto_dl. Be sure to run the training scripts from the folder where the repo was cloned (and where all the data was downloaded). Experiment configs for point cloud and image models for the cellpack dataset are located here:
```
configs
└── experiment
    └── cellpack
        ├── image_classical.yaml  <- Classical image model experiment
        ├── image_so3.yaml        <- Rotation invariant image model experiment
        ├── pc_classical.yaml     <- Classical point cloud model experiment
        └── pc_so3.yaml           <- Rotation invariant point cloud model experiment
```
Here is an example of training a rotation invariant point cloud model.

```bash
python src/br/models/train.py experiment=cellpack/pc_so3
```
Override parts of the experiment config via the command line or manually in the configs. For example, to train a classical model, run the following.

```bash
python src/br/models/train.py experiment=cellpack/pc_so3 model=pc/classical_earthmovers_sphere ++csv.save_dir=[SAVE_DIR]
```
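For reference, the single cell manifest from the first training step above can be as simple as a csv with a path column and a split column. Here is a minimal sketch using hypothetical paths and column names (match the column names to what your datamodule yaml files expect):

```python
import csv
import random

# Hypothetical processed file paths; replace with your own.
paths = [f"/data/processed/cell_{i:04d}.ply" for i in range(100)]

random.seed(0)
rows = []
for p in paths:
    # Roughly 80/10/10 train/val/test split.
    split = random.choices(["train", "val", "test"], weights=[8, 1, 1])[0]
    rows.append({"pointcloud_path": p, "split": split})

with open("manifest.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["pointcloud_path", "split"])
    writer.writeheader()
    writer.writerows(rows)
```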
To skip model training, download our pre-trained models. For each of the six datasets, there are five `.ckpt` files. The easiest way to get these 30 models in the expected layout is with the AWS CLI.
- Install the AWS CLI.
- Confirm that you are in the `benchmarking_representations` folder.

```bash
$ pwd
/home/myuser/benchmarking_representations/
```
- Download the 30 models into the current directory. This will use almost 4 GB.

```bash
aws s3 cp --no-sign-request --recursive s3://allencell/aics/morphology_appropriate_representation_learning/model_checkpoints/ .
```
Instead of installing the AWS CLI, you can download `.ckpt` files one at a time by browsing the dataset on Quilt.

By default, the checkpoint files are expected in `benchmarking_representations/morphology_appropriate_representation_learning/model_checkpoints/`, organized in six subfolders (one for each dataset). This folder structure is provided as part of this repo. Move the downloaded checkpoint files into the folder corresponding to their dataset.
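After moving the files, a small script can confirm the layout. This is a hypothetical helper, not part of the repo; the dataset subfolder names are assumptions based on the dataset names used elsewhere in this README:

```python
from pathlib import Path

# Assumed dataset subfolder names; check them against the repo's folder structure.
DATASETS = ["cellpack", "pcna", "other_punctate", "npm1", "npm1_perturb", "other_polymorphic"]

def checkpoint_counts(root):
    """Map each dataset subfolder under `root` to its number of .ckpt files."""
    root = Path(root)
    return {name: len(list((root / name).glob("*.ckpt"))) for name in DATASETS}
```

Each count should be 5 once all 30 checkpoints are in place.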
Skip to the next section if you'd like to just use our pre-computed embeddings. Otherwise, to compute embeddings from the trained models, update the data paths in the datamodule files to point to your pre-processed data. Then, run the following commands.
| Dataset | Embedding command |
| --- | --- |
| cellpack | `python src/br/analysis/run_embeddings.py --save_path "./outputs_cellpack/" --sdf False --dataset_name cellpack --batch_size 5 --debug False` |
| npm1_perturb | `python src/br/analysis/run_embeddings.py --save_path "./outputs_npm1_perturb/" --sdf True --dataset_name npm1_perturb --batch_size 5 --debug False` |
| npm1 | `python src/br/analysis/run_embeddings.py --save_path "./outputs_npm1/" --sdf True --dataset_name npm1 --batch_size 5 --debug False` |
| npm1_64_res | `python src/br/analysis/run_embeddings.py --save_path "./outputs_npm1_64_res/" --sdf True --dataset_name npm1_64_res --batch_size 5 --debug False --eval_scaled_img_resolution 64` |
| other_polymorphic | `python src/br/analysis/run_embeddings.py --save_path "./outputs_other_polymorphic/" --sdf True --dataset_name other_polymorphic --batch_size 5 --debug False` |
| other_punctate | `python src/br/analysis/run_embeddings.py --save_path "./outputs_other_punctate/" --sdf False --dataset_name other_punctate --batch_size 5 --debug False` |
| pcna | `python src/br/analysis/run_embeddings.py --save_path "./outputs_pcna/" --sdf False --dataset_name pcna --batch_size 5 --debug False` |
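The commands in the table differ only in the `--sdf` flag and, for npm1_64_res, one extra resolution flag. If you are scripting over all datasets, the pattern can be captured in a small helper (a sketch, not part of the repo):

```python
# Per-dataset settings taken from the embedding table: (sdf flag, extra flags).
EMBEDDING_SETTINGS = {
    "cellpack": (False, ""),
    "npm1_perturb": (True, ""),
    "npm1": (True, ""),
    "npm1_64_res": (True, " --eval_scaled_img_resolution 64"),
    "other_polymorphic": (True, ""),
    "other_punctate": (False, ""),
    "pcna": (False, ""),
}

def embedding_command(dataset):
    """Build the run_embeddings.py invocation for one dataset."""
    sdf, extra = EMBEDDING_SETTINGS[dataset]
    return (
        f'python src/br/analysis/run_embeddings.py --save_path "./outputs_{dataset}/" '
        f"--sdf {sdf} --dataset_name {dataset} --batch_size 5 --debug False{extra}"
    )
```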
Many of the results from the paper can be reproduced just from the embeddings produced by the model. You can download our pre-computed embeddings here.
- cellPACK synthetic dataset
- DNA replication foci dataset
- WTC-11 hiPSC single cell image dataset v1 punctate structures
- WTC-11 hiPSC single cell image dataset v1 nucleolus (NPM1)
- WTC-11 hiPSC single cell image dataset v1 polymorphic structures
- Nucleolar drug perturbation dataset
- To compute benchmarking features from the embeddings and trained models, run the following commands.
| Dataset | Benchmarking features command |
| --- | --- |
| cellpack | `python src/br/analysis/run_features.py --save_path "./outputs_cellpack/" --embeddings_path "./morphology_appropriate_representation_learning/model_embeddings/cellpack" --sdf False --dataset_name "cellpack" --debug False` |
| npm1 | `python src/br/analysis/run_features.py --save_path "./outputs_npm1/" --embeddings_path "./morphology_appropriate_representation_learning/model_embeddings/npm1" --sdf True --dataset_name "npm1" --debug False` |
| npm1_64_res | `python src/br/analysis/run_features.py --save_path "./outputs_npm1_64_res/" --embeddings_path "./morphology_appropriate_representation_learning/model_embeddings/npm1_64_res" --sdf True --dataset_name "npm1_64_res" --debug False --eval_scaled_img_resolution 64` |
| other_polymorphic | `python src/br/analysis/run_features.py --save_path "./outputs_other_polymorphic/" --embeddings_path "./morphology_appropriate_representation_learning/model_embeddings/other_polymorphic" --sdf True --dataset_name "other_polymorphic" --debug False` |
| other_punctate | `python src/br/analysis/run_features.py --save_path "./outputs_other_punctate/" --embeddings_path "./morphology_appropriate_representation_learning/model_embeddings/other_punctate" --sdf False --dataset_name "other_punctate" --debug False` |
| pcna | `python src/br/analysis/run_features.py --save_path "./outputs_pcna/" --embeddings_path "./morphology_appropriate_representation_learning/model_embeddings/pcna" --sdf False --dataset_name "pcna" --debug False` |
To combine features from different runs and compare them, run:

```bash
python src/br/analysis/run_features_combine.py --feature_path_1 './outputs_npm1/' --feature_path_2 './outputs_npm1_64_res/' --save_path "./outputs_npm1_combine/" --dataset_name_1 "npm1" --dataset_name_2 "npm1_64_res"
```
- To run analyses like latent walks and archetype analysis on the embeddings and trained models, run the following commands.
| Dataset | Analysis command |
| --- | --- |
| cellpack | `python src/br/analysis/run_analysis.py --save_path "./outputs_cellpack/" --embeddings_path "./morphology_appropriate_representation_learning/model_embeddings/cellpack" --dataset_name "cellpack" --run_name "Rotation_invariant_pointcloud_jitter" --sdf False --pacmap False` |
| npm1 | `python src/br/analysis/run_analysis.py --save_path "./outputs_npm1/" --embeddings_path "./morphology_appropriate_representation_learning/model_embeddings/npm1" --dataset_name "npm1" --run_name "Rotation_invariant_pointcloud_SDF" --sdf True --pacmap False` |
| other_polymorphic | `python src/br/analysis/run_analysis.py --save_path "./outputs_other_polymorphic/" --embeddings_path "./morphology_appropriate_representation_learning/model_embeddings/other_polymorphic" --dataset_name "other_polymorphic" --run_name "Rotation_invariant_pointcloud_SDF" --sdf True --pacmap True` |
| other_punctate | `python src/br/analysis/run_analysis.py --save_path "./outputs_other_punctate/" --embeddings_path "./morphology_appropriate_representation_learning/model_embeddings/other_punctate" --dataset_name "other_punctate" --run_name "Rotation_invariant_pointcloud_structurenorm" --sdf False --pacmap True` |
| pcna | `python src/br/analysis/run_analysis.py --save_path "./outputs_pcna/" --embeddings_path "./morphology_appropriate_representation_learning/model_embeddings/pcna" --dataset_name "pcna" --run_name "Rotation_invariant_pointcloud_jitter" --sdf False --pacmap False` |
- To run drug perturbation analysis using the pre-computed features, run:

```bash
python src/br/analysis/run_drugdata_analysis.py --save_path "./outputs_npm1_perturb/" --embeddings_path "./morphology_appropriate_representation_learning/model_embeddings/npm1_perturb/" --dataset_name "npm1_perturb"
```
To compute CellProfiler features, open the project file in CellProfiler and point it to the single cell images of nucleoli in the npm1 perturbation dataset. This will generate a csv named `MyExpt_Image.csv` that contains mean, median, and standard deviation statistics per image across the different computed features.