New evaluation suite and general improvements (#14)
* Minor additions to data directory

* Update notebooks with new visualisations and metrics

* Clear processing script

* Update README

* Update configs

* Add RhoFold tool code

* Add RibonanzaNet tool code

* Update example env paths

* Implement new evaluation suite and metrics

* Minor updates

* Update tutorial notebook
chaitjo authored May 21, 2024
1 parent b6f6c4e commit b5da568
Showing 63 changed files with 17,647 additions and 4,713 deletions.
12 changes: 6 additions & 6 deletions .env.example
@@ -1,12 +1,12 @@
export PROJECT_PATH='/home/ckj24/rna-inverse-folding/'
export PROJECT_PATH='/home/ckj24/geometric-rna-design/'

export DATA_PATH='/home/ckj24/rna-inverse-folding/data/'
export DATA_PATH='/home/ckj24/geometric-rna-design/data/'

export WANDB_PROJECT='gRNAde'
export WANDB_ENTITY='chaitjo'
export WANDB_DIR='/home/ckj24/rna-inverse-folding/'
export WANDB_DIR='/home/ckj24/geometric-rna-design/'

export ETERNAFOLD='/home/ckj24/rna-inverse-folding/tools/EternaFold'
export ETERNAFOLD='/home/ckj24/geometric-rna-design/tools/EternaFold'

export X3DNA='/home/ckj24/rna-inverse-folding/tools/x3dna-v2.4'
export PATH="/home/ckj24/rna-inverse-folding/tools/x3dna-v2.4/bin:$PATH"
export X3DNA='/home/ckj24/geometric-rna-design/tools/x3dna-v2.4'
export PATH="/home/ckj24/geometric-rna-design/tools/x3dna-v2.4/bin:$PATH"
94 changes: 61 additions & 33 deletions README.md
@@ -10,7 +10,7 @@
<img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

✍️ New to 3D RNA modelling? Here's a curated reading + watch list for beginners: [Resources](/tutorial/README.md)
✍️ New to 3D RNA modelling? Here's a curated reading + watch list for beginners: [Resources](https://www.chaitjo.com/post/rna-modelling-and-design/)

📄 For more details on the methodology, see the accompanying paper: ['gRNAde: Geometric Deep Learning for 3D RNA inverse design'](https://arxiv.org/abs/2305.14749)
> Chaitanya K. Joshi, Arian R. Jamasb, Ramon Viñas, Charles Harris, Simon Mathis, Alex Morehead, and Pietro Liò. gRNAde: Geometric Deep Learning for 3D RNA inverse design. *ICML Computational Biology Workshop, 2023.*
@@ -25,7 +25,7 @@ RNA backbones are featurized as geometric graphs and processed via a multi-state
## Installation

In order to get started, set up a python environment by following the installation instructions below.
We have tested gRNAde on Linux with Python 3.10.12 and CUDA 11.8 on an NVIDIA A100 80GB GPU, as well as on MacOS.
We have tested gRNAde on Linux with Python 3.10.12 and CUDA 11.8 on NVIDIA A100 80GB GPUs and Intel XPUs, as well as on MacOS (CPU).
```sh
# Clone gRNAde repository
cd ~ # change this to your preferred download location
@@ -43,51 +43,67 @@ mamba create -n rna python=3.10
mamba activate rna
```

Next, install the dependencies within your new python environment.
Set up your new python environment, starting with PyTorch and PyG:
```sh
# Install Pytorch (ensure appropriate CUDA version for your hardware)
# Install Pytorch on Nvidia GPUs (ensure appropriate CUDA version for your hardware)
mamba install pytorch torchvision torchaudio pytorch-cuda=11.8 -c pytorch -c nvidia

# Install Pytorch Geometric (ensure matching torch + CUDA version)
# Install Pytorch on Intel XPUs (specific to Cambridge's Dawn supercomputer)
module load default-dawn
source /usr/local/dawn/software/external/intel-oneapi/2024.0/setvars.sh
export ZE_FLAT_DEVICE_HIERARCHY=COMPOSITE
python -m pip install torch==2.1.0a0 torchvision==0.16.0a0 torchaudio==2.1.0a0 intel-extension-for-pytorch==2.1.10+xpu --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/

# Install Pytorch Geometric (ensure matching torch + CUDA version to PyTorch)
pip install torch_geometric
pip install pyg_lib torch_scatter torch_sparse torch_cluster torch_spline_conv -f https://data.pyg.org/whl/torch-2.1.0+cu118.html
pip install torch_scatter torch_cluster -f https://data.pyg.org/whl/torch-2.1.2+cu118.html
# To build PyG libraries from scratch, do not use -f (e.g. in case of Intel XPUs)
```

# Install other dependencies
mamba install mdanalysis MDAnalysisTests jupyterlab matplotlib seaborn pandas networkx biopython biotite torchmetrics lovely-tensors -c conda-forge
pip install wandb pyyaml ipdb python-dotenv tqdm lmdb cpdb-protein
Next, install other compulsory dependencies:
```sh
mamba install jupyterlab matplotlib seaborn pandas biopython biotite -c conda-forge
pip install wandb gdown pyyaml ipdb python-dotenv tqdm cpdb-protein torchmetrics einops ml_collections mdanalysis MDAnalysisTests

# Install X3DNA for secondary structure determination
cd ~/geometric-rna-design/tools/
tar -xvzf x3dna-v2.4-linux-64bit.tar.gz
./x3dna-v2.4/bin/x3dna_setup
# Follow the instructions to test your installation

# Install EternaFold for secondary structure prediction
cd ~/rna-inverse-folding/tools/
cd ~/geometric-rna-design/tools/
git clone --depth=1 https://github.com/eternagame/EternaFold.git && cd EternaFold/src
make
# Notes:
# - Multithreaded version of EternaFold did not install for me
# - To install on MacOS, start a shell in Rosetta using `arch -x86_64 zsh`

# (Optional) Install X3DNA for secondary structure determination
cd ~/rna-inverse-folding/tools/
tar -xvzf x3dna-v2.4-linux-64bit.tar.gz
./x3dna-v2.4/bin/x3dna_setup
# Follow the instructions to test your installation

# (Optional) Install draw_rna for secondary structure visualization
cd ~/rna-inverse-folding/tools/
git clone --depth=1 https://github.com/DasLab/draw_rna.git draw_rna_dir && cd draw_rna_dir
python setup.py install
# Download RhoFold checkpoint (~500MB)
cd ~/geometric-rna-design/tools/rhofold/
gdown https://drive.google.com/uc?id=1To2bjbhQLFx1k8hBOW5q1JFq6ut27XEv
```

Optionally, you can also set up the following extra tools and dependencies:
```sh
# (Optional) Install CD-HIT for sequence identity clustering
mamba install cd-hit -c bioconda

# (Optional) Install US-align/qTMclust for structural similarity clustering
cd ~/rna-inverse-folding/tools/
cd ~/geometric-rna-design/tools/
git clone https://github.com/pylelab/USalign.git && cd USalign/ && git checkout 97325d3aad852f8a4407649f25e697bbaa17e186
g++ -static -O3 -ffast-math -lm -o USalign USalign.cpp
g++ -static -O3 -ffast-math -lm -o qTMclust qTMclust.cpp

# (Optional) Install draw_rna for visualisation
cd ~/geometric-rna-design/tools/
git clone --depth=1 https://github.com/DasLab/draw_rna && cd draw_rna
python setup.py install
```

Once your python environment is set up, create your `.env` file with the appropriate environment variables; see the .env.example file included in the codebase for reference.
```sh
cd ~/rna-inverse-folding/
cd ~/geometric-rna-design/
touch .env
```
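
As a rough illustration of what the `.env` file provides, the sketch below parses `export KEY='value'` lines of the kind shown in `.env.example` and reports which expected variables are missing. The helper name `parse_env_lines` and the list of required keys are assumptions made for this example; the codebase itself may load the file differently (for instance via `python-dotenv`, which is among the installed dependencies).

```python
import shlex


def parse_env_lines(text):
    """Parse `export KEY='value'` lines into a dict (hypothetical helper)."""
    env = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if line.startswith("export "):
            line = line[len("export "):]
        key, sep, raw = line.partition("=")
        if not sep:
            continue  # not a KEY=value assignment
        # shlex strips single or double quotes; it does not expand $VARIABLES
        parts = shlex.split(raw)
        env[key.strip()] = parts[0] if parts else ""
    return env


example = """
export PROJECT_PATH='/home/ckj24/geometric-rna-design/'
export DATA_PATH='/home/ckj24/geometric-rna-design/data/'
export WANDB_PROJECT='gRNAde'
"""

env = parse_env_lines(example)

# Assumed set of variables to check; adjust to whatever your setup needs.
required = ("PROJECT_PATH", "DATA_PATH", "WANDB_PROJECT", "ETERNAFOLD")
missing = [name for name in required if name not in env]
print("Missing variables:", missing)
```

A check like this could be run once after editing `.env` to catch a forgotten path before starting a long job.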

@@ -105,7 +121,7 @@ Detailed usage instructions are available in [the tutorial notebook](/tutorial/t
├── LICENSE
|
├── gRNAde.py # gRNAde python module and command line utility
├── main.py # Main script for training models
├── main.py # Main script for training and evaluating models
|
├── .env.example # Example environment file
├── .env # Your environment file
@@ -118,14 +134,18 @@ Detailed usage instructions are available in [the tutorial notebook](/tutorial/t
├── tutorial # Tutorial with example usage
|
├── tools # Directory for external tools
| ├── EternaFold # RNA sequence to secondary structure prediction
| ├── draw_rna # RNA secondary structure visualization
| ├── EternaFold # RNA sequence to secondary structure prediction tool
| ├── RhoFold # RNA sequence to 3D structure prediction tool
| ├── ribonanzanet # RNA sequence to chemical mapping prediction tool
| └── x3dna-v2.4 # RNA secondary structure determination from 3D
|
└── src # Source code directory
├── constants.py # Constant values for data, paths, etc.
├── evaluator.py # Evaluation loop and metrics
├── layers.py # PyTorch modules for building Multi-state GNN models
├── models.py # Multi-state GNN models for gRNAde
├── trainer.py # Training and evaluation loops
├── trainer.py # Training loop
|
└── data # Data-related code
├── clustering_utils.py # Methods for clustering by sequence and structural similarity
@@ -139,26 +159,26 @@ Detailed usage instructions are available in [the tutorial notebook](/tutorial/t

## Downloading Data

gRNAde is trained on all RNA structures from the PDB at ≤4A resolution (12K 3D structures from 4.2K unique RNAs) downloaded via [RNASolo](https://rnasolo.cs.put.poznan.pl) on 31 October 2023.
gRNAde is trained on all RNA structures from the PDB at ≤4A resolution (12K 3D structures from 4.2K unique RNAs) downloaded via [RNASolo](https://rnasolo.cs.put.poznan.pl) with date cutoff: 31 October 2023.
If you would like to train your own models from scratch, download and extract the raw `.pdb` files via the following script into the `data/raw/` directory (or another location indicated by the `DATA_PATH` environment variable in your `.env` file).

> Alternatively to the instructions below, you can download a pre-processed [`.pt`](https://drive.google.com/file/d/1gcUUaRxbGZnGMkLdtVwAILWVerVCbu4Y/view?usp=sharing) file and [`.csv`](https://drive.google.com/file/d/1lbdiE1LfWPReo5VnZy0zblvhVl5QhaF4/view?usp=sharing) metadata, and place them into the `data/` directory.
> ❗️ Alternatively to the instructions below, you can download a pre-processed [`.pt`](https://drive.google.com/file/d/1gcUUaRxbGZnGMkLdtVwAILWVerVCbu4Y/view?usp=sharing) file and [`.csv`](https://drive.google.com/file/d/1lbdiE1LfWPReo5VnZy0zblvhVl5QhaF4/view?usp=sharing) metadata, and place them into the `data/` directory.
```sh
# Download structures in pdb format
mkdir ~/rna-inverse-folding/data/raw
cd ~/rna-inverse-folding/data/raw
mkdir ~/geometric-rna-design/data/raw
cd ~/geometric-rna-design/data/raw
curl -O https://rnasolo.cs.put.poznan.pl/media/files/zipped/bunches/pdb/all_member_pdb_4_0__3_300.zip
unzip all_member_pdb_4_0__3_300.zip
rm all_member_pdb_4_0__3_300.zip
```
Manual download link: https://rnasolo.cs.put.poznan.pl/archive.
Select the following for creating the download: 3D (PDB) + all molecules + all members + res. ≤4.0

Next, process the raw PDB files into our ML-ready format, which will be saved under `data/processed.pt`.
Next, process the raw PDB files into our ML-ready format, which will be saved under `data/processed.pt`.
```sh
# Process raw data into ML-ready format (this may take several hours)
cd ~/rna-inverse-folding/
cd ~/geometric-rna-design/
python scripts/process_data.py
```

@@ -167,7 +187,7 @@ Each RNA will be processed into the following format (most of the metadata is op
{
'sequence' # RNA sequence as a string
'id_list' # list of PDB IDs
'coords_list' # list of 3D coordinates of shape ``(length, 27, 3)``
'coords_list' # list of structures, i.e. 3D coordinates of shape ``(length, 27, 3)``
'sec_struct_list' # list of secondary structure strings in dotbracket notation
'sasa_list' # list of per-nucleotide SASA values
'rfam_list' # list of RFAM family IDs
@@ -179,6 +199,14 @@ Each RNA will be processed into the following format (most of the metadata is op
}
```

We have provided the splits used in our experiments in the `data/` directory:
- Single-state split from [Das et al., 2010](https://www.nature.com/articles/nmeth.1433): `data/das_split.pt` (called the Das split for compatibility with older code)
- Multi-state split of structurally flexible RNAs: `data/structsim_split.pt`

The precise procedure for creating the splits (which can be used to modify and customise them) can be found in the `notebooks/` directory.



## Citation

```
@@ -188,4 +216,4 @@ Each RNA will be processed into the following format (most of the metadata is op
journal={arXiv preprint},
year={2023},
}
```
```
6 changes: 3 additions & 3 deletions checkpoints/README.md
@@ -15,7 +15,7 @@ For general usage and best performance, we recommend using the `all` split.
| Autoregressive | 5 | 5000 | [gRNAde_ARv1_5state_all.h5](gRNAde_ARv1_5state_all.h5) | 0.8197 | 0.6344 |
| | | | |

## Das et al. (2011) single-state design benchmark split: `das`
## Single-state split from [Das et al., 2010](https://www.nature.com/articles/nmeth.1433): `das`

| Model | Max. no. conformers | Max. training RNA length | Link | Sequence recovery | Self-consistency MCC |
| --- | --- | --- | --- | --- | --- |
@@ -25,12 +25,12 @@ For general usage and best performance, we recommend using the `all` split.
| Autoregressive | 5 | 5000 | [gRNAde_ARv1_5state_das.h5](gRNAde_ARv1_5state_das.h5) | 0.5669 | 0.6296 |
| | | | |

## Multi-state design benchmark split: `multi`
## Multi-state split of structurally flexible RNAs: `multi`

| Model | Max. no. conformers | Max. training RNA length | Link | Sequence recovery | Self-consistency MCC |
| --- | --- | --- | --- | --- | --- |
| Autoregressive | 1 | 5000 | [gRNAde_ARv1_1state_multi.h5](gRNAde_ARv1_1state_multi.h5) | 0.4468 | 0.5809 |
| Autoregressive | 3 | 5000 | [gRNAde_ARv1_2state_multi.h5](gRNAde_ARv1_2state_multi.h5) | 0.4471 | 0.5321 |
| Autoregressive | 2 | 5000 | [gRNAde_ARv1_2state_multi.h5](gRNAde_ARv1_2state_multi.h5) | 0.4471 | 0.5321 |
| Autoregressive | 3 | 5000 | [gRNAde_ARv1_3state_multi.h5](gRNAde_ARv1_3state_multi.h5) | 0.4692 | 0.5957 |
| Autoregressive | 5 | 5000 | [gRNAde_ARv1_5state_multi.h5](gRNAde_ARv1_5state_multi.h5) | 0.4844 | 0.5841 |
| | | | |
3 changes: 3 additions & 0 deletions configs/default.yaml
@@ -1,4 +1,7 @@
# Misc configurations
device:
value: 'gpu'
desc: Device to run on (cpu/cuda/xpu)
gpu:
value: 0
desc: GPU ID
13 changes: 9 additions & 4 deletions configs/sweep.yaml
@@ -1,14 +1,16 @@
program: main.py
method: grid
name: gRNAde_sweep
description: "gRNAde W&B sweep template"
description: "gRNAde W&B sweep template for reproducing benchmark results"

entity: chaitjo
project: gRNAde

parameters:

# Misc configurations
device:
value: 'gpu'
gpu:
value: 0
seed:
@@ -41,20 +43,23 @@ parameters:
max_nodes_batch:
value: 3000
max_nodes_sample:
value: 5000
values:
- 500
- 1000
- 2000
- 5000

# Splitting configurations
split:
values:
- 'das'
- 'structsim'
- 'seqid'

# Model configurations
model:
values:
- 'NARv1'
- 'ARv1'
- 'NARv1'
node_in_dim:
value: [15, 4] # (num_bb_atoms x 5, 2 + num_bb_atoms - 1)
node_h_dim:
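
Because `configs/sweep.yaml` uses `method: grid`, W&B launches one run per combination of the listed `values`. The sketch below enumerates such a grid in plain Python to show how quickly the number of runs grows. Only three of the swept parameters are included, and the split values used here are an assumption: the flattened diff does not show which of `'das'`, `'structsim'` and `'seqid'` remain after this commit.

```python
from itertools import product

# Values copied from the sweep template above. The `split` list is an
# assumption (the diff does not mark which entry was removed).
grid = {
    "max_nodes_sample": [500, 1000, 2000, 5000],
    "split": ["das", "structsim"],
    "model": ["ARv1", "NARv1"],
}


def expand_grid(grid):
    """Return one config dict per combination of parameter values."""
    keys = list(grid)
    return [dict(zip(keys, combo)) for combo in product(*(grid[k] for k in keys))]


runs = expand_grid(grid)
print(f"{len(runs)} runs in total")
for run in runs[:3]:
    print(run)
```

Any further parameter that lists several `values` multiplies the run count again, so it is worth estimating the grid size before starting a sweep.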
47 changes: 47 additions & 0 deletions data/README.md
@@ -0,0 +1,47 @@
# Downloading Data for gRNAde

gRNAde is trained on all RNA structures from the PDB at ≤4A resolution (12K 3D structures from 4.2K unique RNAs) downloaded via [RNASolo](https://rnasolo.cs.put.poznan.pl) with date cutoff: 31 October 2023.
If you would like to train your own models from scratch, download and extract the raw `.pdb` files via the following script into the `data/raw/` directory (or another location indicated by the `DATA_PATH` environment variable in your `.env` file).

> ❗️ Alternatively to the instructions below, you can download a pre-processed [`.pt`](https://drive.google.com/file/d/1gcUUaRxbGZnGMkLdtVwAILWVerVCbu4Y/view?usp=sharing) file and [`.csv`](https://drive.google.com/file/d/1lbdiE1LfWPReo5VnZy0zblvhVl5QhaF4/view?usp=sharing) metadata, and place them into the `data/` directory.
```sh
# Download structures in pdb format
mkdir ~/geometric-rna-design/data/raw
cd ~/geometric-rna-design/data/raw
curl -O https://rnasolo.cs.put.poznan.pl/media/files/zipped/bunches/pdb/all_member_pdb_4_0__3_300.zip
unzip all_member_pdb_4_0__3_300.zip
rm all_member_pdb_4_0__3_300.zip
```
Manual download link: https://rnasolo.cs.put.poznan.pl/archive.
Select the following for creating the download: 3D (PDB) + all molecules + all members + res. ≤4.0

Next, process the raw PDB files into our ML-ready format, which will be saved under `data/processed.pt`.
```sh
# Process raw data into ML-ready format (this may take several hours)
cd ~/geometric-rna-design/
python scripts/process_data.py
```

Each RNA will be processed into the following format (most of the metadata is optional for simply using gRNAde):
```
{
'sequence' # RNA sequence as a string
'id_list' # list of PDB IDs
'coords_list' # list of structures, i.e. 3D coordinates of shape ``(length, 27, 3)``
'sec_struct_list' # list of secondary structure strings in dotbracket notation
'sasa_list' # list of per-nucleotide SASA values
'rfam_list' # list of RFAM family IDs
'eq_class_list' # list of non-redundant equivalence class IDs
'type_list' # list of structure types (RNA-only, RNA-protein complex, etc.)
'rmsds_list' # dictionary of pairwise C4' RMSD values between structures
'cluster_seqid0.8' # cluster ID of sequence identity clustering at 80%
'cluster_structsim0.45' # cluster ID of structure similarity clustering at 45%
}
```
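
To make the record layout listed above concrete, here is a minimal sketch that builds a dummy record and checks its coordinate shapes. It uses nested Python lists in place of tensors so that it runs without extra dependencies. The helper `check_record`, the placeholder IDs, and the assumption that `id_list` and `coords_list` hold one entry per structure in the same order are illustrative and not taken from the codebase.

```python
def check_record(record, num_atoms=27):
    """Validate coordinate shapes and return the number of structures."""
    sequence = record["sequence"]
    ids = record["id_list"]
    coords_list = record["coords_list"]
    # Assumption: one coordinate array per PDB ID, in the same order.
    if len(ids) != len(coords_list):
        raise ValueError("id_list and coords_list differ in length")
    for structure in coords_list:
        # Each structure should have shape (length, num_atoms, 3).
        if len(structure) != len(sequence):
            raise ValueError("structure length does not match sequence")
        for nucleotide in structure:
            if len(nucleotide) != num_atoms:
                raise ValueError("unexpected number of atoms per nucleotide")
            if any(len(atom) != 3 for atom in nucleotide):
                raise ValueError("each atom needs x, y, z coordinates")
    return len(coords_list)


def dummy_structure(length, num_atoms=27):
    """All-zero coordinates of shape (length, num_atoms, 3)."""
    return [[[0.0, 0.0, 0.0] for _ in range(num_atoms)] for _ in range(length)]


record = {
    "sequence": "GGAC",
    "id_list": ["structure_a", "structure_b"],  # placeholder IDs
    "coords_list": [dummy_structure(4), dummy_structure(4)],
}

num_states = check_record(record)
print("Number of conformational states:", num_states)
```

A record with more than one entry in `coords_list` corresponds to an RNA with multiple available structures, which is the multi-state setting the splits below refer to.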

We have provided the splits used in our experiments in the `data/` directory:
- Single-state split from [Das et al., 2010](https://www.nature.com/articles/nmeth.1433): `data/das_split.pt` (called the Das split for compatibility with older code)
- Multi-state split of structurally flexible RNAs: `data/structsim_split.pt`

The precise procedure for creating the splits (which can be used to modify and customise them) can be found in the `notebooks/` directory.