Installation

To install and use this software, you need:

  • An NVIDIA GPU with CUDA 11.7 (other CUDA versions may work, but they are not officially supported),
  • conda (or Python 3.10 and pdm), and
  • git.

First, clone this repository.

git clone https://github.com/AllenCell/benchmarking_representations
cd benchmarking_representations

Create a virtual environment.

conda create --name br python=3.10
conda activate br

Depending on your GPU set-up, you may need to set the CUDA_VISIBLE_DEVICES environment variable. Set it to a comma-separated list of device indices and/or Universally Unique IDs (UUIDs); UUIDs are needed, for example, to select a single partition of a MIG-partitioned GPU. See the following examples.

Example 1

export CUDA_VISIBLE_DEVICES=0,1

Example 2: Using one partition of a MIG partitioned GPU

export CUDA_VISIBLE_DEVICES=MIG-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
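
To find the device indices and UUIDs available on your machine (including those of any MIG partitions), list the devices with nvidia-smi:

nvidia-smi -L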

Next, install all required packages.

pip install -r requirements1.txt
pip install -r requirements2.txt
pip install -e .

For pdm users, follow these installation steps instead.

Troubleshooting

Q: When installing dependencies, pytorch fails to install with the following error message.

torch.cuda.DeferredCudaCallError: CUDA call failed lazily at initialization with error: device >= 0 && device < num_gpus

A: You may need to configure the CUDA_VISIBLE_DEVICES environment variable.
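
To verify the fix, you can check how many devices PyTorch sees (run inside the br environment; a nonzero count means CUDA initialized successfully):

python -c "import torch; print(torch.cuda.device_count())"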

Set env variables

To run the models, you must set the CYTODL_CONFIG_PATH environment variable to point to the configs folder. Check that your current working directory is the benchmarking_representations folder, then run the following command (the setting lasts only for the duration of your shell session).

export CYTODL_CONFIG_PATH=$PWD/configs/
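
To persist the setting across shell sessions, append the export line to your shell profile instead. The example below assumes bash and a clone located at $HOME/benchmarking_representations; adjust the path for your setup.

echo 'export CYTODL_CONFIG_PATH=$HOME/benchmarking_representations/configs/' >> ~/.bashrc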

1. Model training

Steps to download pre-processed data

Preprocessing the data can take several hours. To skip this step, download the preprocessed data for each dataset (replace [LOCAL_DESTINATION] with the local folder you want to download into). The full download uses around 740 GB.

aws s3 cp --no-sign-request --recursive s3://allencell/aics/morphology_appropriate_representation_learning/preprocessed_data/ [LOCAL_DESTINATION]
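
Because the full download is large, you may prefer to copy one dataset at a time; listing the prefix shows the available subfolders:

aws s3 ls --no-sign-request s3://allencell/aics/morphology_appropriate_representation_learning/preprocessed_data/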

Steps to train models

Training these models can take days. We've published our trained models so you don't have to train them yourself. Skip to the next section if you'd like to just use our models.

  1. Create a single-cell manifest (e.g. CSV, Parquet) for each dataset, with a column of final processed paths and a split column indicating the train/test/validation split (a minimal sketch follows this list).
  2. Update the final single-cell dataset path (SINGLE_CELL_DATASET_PATH) and the manifest column for the appropriate input modality (SDF_COLUMN/SEG_COLUMN/POINTCLOUD_COLUMN/IMAGE_COLUMN) in each datamodule file. For example, for the PCNA data, these YAML files are located here:
configs
└── data
    └── pcna
        ├── image.yaml               <- Datamodule for PCNA images
        ├── pc.yaml                  <- Datamodule for PCNA point clouds
        ├── pc_intensity.yaml        <- Datamodule for PCNA point clouds with intensity
        └── pc_intensity_jitter.yaml <- Datamodule for PCNA point clouds with intensity and jitter
  3. Train models using cyto_dl. Be sure to run the training scripts from the folder where the repo was cloned (and where all the data was downloaded). Experiment configs for point cloud and image models for the cellpack dataset are located here:
configs
└── experiment
    └── cellpack
        ├── image_classical.yaml <- Classical image model experiment
        ├── image_so3.yaml       <- Rotation invariant image model experiment
        ├── pc_classical.yaml    <- Classical point cloud model experiment
        └── pc_so3.yaml          <- Rotation invariant point cloud model experiment
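
As a rough sketch of step 1, the manifest can be built with pandas. Everything specific below is hypothetical (the file pattern, the registered_path column name, and the 75/10/15 split proportions); match the column name to whatever your datamodule files expect.

import glob

import numpy as np
import pandas as pd

# Collect the final processed single-cell files (the pattern below is a placeholder).
paths = sorted(glob.glob("./preprocessed/pcna/*.ply"))

# One row per cell; the path column name must match the datamodule config.
manifest = pd.DataFrame({"registered_path": paths})

# Assign a random train/validation/test split (proportions are arbitrary here).
rng = np.random.default_rng(seed=42)
manifest["split"] = rng.choice(["train", "val", "test"], size=len(manifest), p=[0.75, 0.10, 0.15])

manifest.to_csv("pcna_manifest.csv", index=False)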

Here is an example of training a rotation invariant point cloud model.

python src/br/models/train.py experiment=cellpack/pc_so3

Override parts of the experiment config via the command line, or edit the configs directly. For example, to train a classical model, run the following.

python src/br/models/train.py experiment=cellpack/pc_so3 model=pc/classical_earthmovers_sphere ++csv.save_dir=[SAVE_DIR]

2. Model inference

Steps to download pre-trained models

To skip model training, download our pre-trained models. For each of the six datasets, there are five .ckpt files. The easiest way to get these 30 models in the expected layout is with the AWS CLI.

Option 1: AWS CLI

  1. Install the AWS CLI.
  2. Confirm that you are in the benchmarking_representations folder.
$ pwd
/home/myuser/benchmarking_representations/
  3. Download the 30 models. This will use almost 4 GB.
aws s3 cp --no-sign-request --recursive s3://allencell/aics/morphology_appropriate_representation_learning/model_checkpoints/ ./morphology_appropriate_representation_learning/model_checkpoints/

Option 2: Download individual checkpoints

Instead of installing the AWS CLI, you can download .ckpt files one at a time by browsing the dataset on Quilt.

By default, the checkpoint files are expected in benchmarking_representations/morphology_appropriate_representation_learning/model_checkpoints/, organized in six subfolders (one for each dataset). This folder structure is provided as part of this repo. Move the downloaded checkpoint files into the folder corresponding to their dataset.
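
After moving the files, a quick way to sanity-check the layout is to count the checkpoint files; with all six datasets downloaded, this should print 30:

find morphology_appropriate_representation_learning/model_checkpoints -name "*.ckpt" | wc -l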

Compute embeddings

Skip to the next section if you'd like to just use our pre-computed embeddings. Otherwise, to compute embeddings from the trained models, update the data paths in the datamodule files to point to your pre-processed data. Then, run the following commands.

| Dataset | Embedding command |
| --- | --- |
| cellpack | `python src/br/analysis/run_embeddings.py --save_path "./outputs_cellpack/" --sdf False --dataset_name cellpack --batch_size 5 --debug False` |
| npm1_perturb | `python src/br/analysis/run_embeddings.py --save_path "./outputs_npm1_perturb/" --sdf True --dataset_name npm1_perturb --batch_size 5 --debug False` |
| npm1 | `python src/br/analysis/run_embeddings.py --save_path "./outputs_npm1/" --sdf True --dataset_name npm1 --batch_size 5 --debug False` |
| npm1_64_res | `python src/br/analysis/run_embeddings.py --save_path "./outputs_npm1_64_res/" --sdf True --dataset_name npm1_64_res --batch_size 5 --debug False --eval_scaled_img_resolution 64` |
| other_polymorphic | `python src/br/analysis/run_embeddings.py --save_path "./outputs_other_polymorphic/" --sdf True --dataset_name other_polymorphic --batch_size 5 --debug False` |
| other_punctate | `python src/br/analysis/run_embeddings.py --save_path "./outputs_other_punctate/" --sdf False --dataset_name other_punctate --batch_size 5 --debug False` |
| pcna | `python src/br/analysis/run_embeddings.py --save_path "./outputs_pcna/" --sdf False --dataset_name pcna --batch_size 5 --debug False` |
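
To compute embeddings for several datasets in one go, a small wrapper loop like the following can help. This script is our own sketch, not part of the repo; the flags simply mirror the table above, and npm1_64_res is handled separately because it needs the extra resolution flag.

#!/bin/bash
# Hypothetical convenience wrapper around run_embeddings.py (requires bash 4+ for associative arrays).
declare -A SDF=( [cellpack]=False [npm1_perturb]=True [npm1]=True [other_polymorphic]=True [other_punctate]=False [pcna]=False )
for ds in "${!SDF[@]}"; do
    python src/br/analysis/run_embeddings.py --save_path "./outputs_${ds}/" --sdf "${SDF[$ds]}" --dataset_name "$ds" --batch_size 5 --debug False
done
# npm1_64_res additionally needs --eval_scaled_img_resolution 64 (see the table above).
python src/br/analysis/run_embeddings.py --save_path "./outputs_npm1_64_res/" --sdf True --dataset_name npm1_64_res --batch_size 5 --debug False --eval_scaled_img_resolution 64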

3. Interpretability analysis

Steps to download pre-computed embeddings

Many of the results from the paper can be reproduced just from the embeddings produced by the model. You can download our pre-computed embeddings here.

Steps to run benchmarking analysis

  1. To compute benchmarking features from the embeddings and trained models, run the following commands.
| Dataset | Benchmarking features command |
| --- | --- |
| cellpack | `python src/br/analysis/run_features.py --save_path "./outputs_cellpack/" --embeddings_path "./morphology_appropriate_representation_learning/model_embeddings/cellpack" --sdf False --dataset_name "cellpack" --debug False` |
| npm1 | `python src/br/analysis/run_features.py --save_path "./outputs_npm1/" --embeddings_path "./morphology_appropriate_representation_learning/model_embeddings/npm1" --sdf True --dataset_name "npm1" --debug False` |
| npm1_64_res | `python src/br/analysis/run_features.py --save_path "./outputs_npm1_64_res/" --embeddings_path "./morphology_appropriate_representation_learning/model_embeddings/npm1_64_res" --sdf True --dataset_name "npm1_64_res" --debug False --eval_scaled_img_resolution 64` |
| other_polymorphic | `python src/br/analysis/run_features.py --save_path "./outputs_other_polymorphic/" --embeddings_path "./morphology_appropriate_representation_learning/model_embeddings/other_polymorphic" --sdf True --dataset_name "other_polymorphic" --debug False` |
| other_punctate | `python src/br/analysis/run_features.py --save_path "./outputs_other_punctate/" --embeddings_path "./morphology_appropriate_representation_learning/model_embeddings/other_punctate" --sdf False --dataset_name "other_punctate" --debug False` |
| pcna | `python src/br/analysis/run_features.py --save_path "./outputs_pcna/" --embeddings_path "./morphology_appropriate_representation_learning/model_embeddings/pcna" --sdf False --dataset_name "pcna" --debug False` |

To combine features from different runs and compare, run

python src/br/analysis/run_features_combine.py --feature_path_1 './outputs_npm1/' --feature_path_2 './outputs_npm1_64_res/' --save_path "./outputs_npm1_combine/" --dataset_name_1 "npm1" --dataset_name_2 "npm1_64_res"
  2. To run analysis such as latent walks and archetype analysis on the embeddings and trained models, run the following commands.
| Dataset | Analysis command |
| --- | --- |
| cellpack | `python src/br/analysis/run_analysis.py --save_path "./outputs_cellpack/" --embeddings_path "./morphology_appropriate_representation_learning/model_embeddings/cellpack" --dataset_name "cellpack" --run_name "Rotation_invariant_pointcloud_jitter" --sdf False --pacmap False` |
| npm1 | `python src/br/analysis/run_analysis.py --save_path "./outputs_npm1/" --embeddings_path "./morphology_appropriate_representation_learning/model_embeddings/npm1" --dataset_name "npm1" --run_name "Rotation_invariant_pointcloud_SDF" --sdf True --pacmap False` |
| other_polymorphic | `python src/br/analysis/run_analysis.py --save_path "./outputs_other_polymorphic/" --embeddings_path "./morphology_appropriate_representation_learning/model_embeddings/other_polymorphic" --dataset_name "other_polymorphic" --run_name "Rotation_invariant_pointcloud_SDF" --sdf True --pacmap True` |
| other_punctate | `python src/br/analysis/run_analysis.py --save_path "./outputs_other_punctate/" --embeddings_path "./morphology_appropriate_representation_learning/model_embeddings/other_punctate" --dataset_name "other_punctate" --run_name "Rotation_invariant_pointcloud_structurenorm" --sdf False --pacmap True` |
| pcna | `python src/br/analysis/run_analysis.py --save_path "./outputs_pcna/" --embeddings_path "./morphology_appropriate_representation_learning/model_embeddings/pcna" --dataset_name "pcna" --run_name "Rotation_invariant_pointcloud_jitter" --sdf False --pacmap False` |
  3. To run drug perturbation analysis using the pre-computed features, run
python src/br/analysis/run_drugdata_analysis.py --save_path "./outputs_npm1_perturb/" --embeddings_path "./morphology_appropriate_representation_learning/model_embeddings/npm1_perturb/" --dataset_name "npm1_perturb"

To compute CellProfiler features, open the project file in CellProfiler and point it to the single-cell images of nucleoli in the npm1 perturbation dataset. This will generate a CSV named MyExpt_Image.csv containing mean, median, and standard deviation statistics per image across the different computed features.
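
Once MyExpt_Image.csv exists, a few lines of pandas are enough to pull out those per-image summary statistics. The Mean_/Median_/StDev_ prefixes below follow CellProfiler's usual per-image aggregation naming, but verify them against your own output, since exact column names depend on the pipeline.

import pandas as pd

# Load CellProfiler's per-image output.
df = pd.read_csv("MyExpt_Image.csv")

# Per-image aggregates are usually prefixed Mean_, Median_, or StDev_;
# adjust these prefixes if your pipeline names columns differently.
summary_cols = [c for c in df.columns if c.startswith(("Mean_", "Median_", "StDev_"))]
print(df[summary_cols].describe())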