add EternaBench dataset
Signed-off-by: Zhiyuan Chen <[email protected]>
ZhiyuanChen committed Oct 8, 2024
1 parent 0a0e9dc commit 9298c7c
Showing 4 changed files with 166 additions and 0 deletions.
9 changes: 9 additions & 0 deletions docs/docs/datasets/eternabench.md
@@ -0,0 +1,9 @@
---
authors:
- Zhiyuan Chen
date: 2024-05-04
---

# EternaBench

--8<-- "multimolecule/datasets/eternabench/README.md:21:"
1 change: 1 addition & 0 deletions docs/mkdocs.yml
@@ -23,6 +23,7 @@ nav:
- bpRNA-spot: datasets/bprna-spot.md
- bpRNA-new: datasets/bprna-new.md
- RYOS: datasets/ryos.md
- EternaBench: datasets/eternabench.md
- module:
- module/index.md
- heads: module/heads.md
96 changes: 96 additions & 0 deletions multimolecule/datasets/eternabench/README.md
@@ -0,0 +1,96 @@
---
language: rna
tags:
- Biology
- RNA
license:
- agpl-3.0
size_categories:
- 1K<n<10K
task_categories:
- text-generation
- fill-mask
task_ids:
- language-modeling
- masked-language-modeling
pretty_name: EternaBench
library_name: multimolecule
---

# EternaBench

![EternaBench](https://eternagame.org/sites/default/files/thumb_eternabench_paper.png)

EternaBench is a database of diverse high-throughput structural data gathered through the crowdsourced RNA design project Eterna. It is used to evaluate the performance of a wide set of RNA secondary structure prediction algorithms.

## Disclaimer

This is an UNOFFICIAL release of [EternaBench](https://github.com/eternagame/EternaBench) by Hannah K. Wayment-Steele et al.

**The team releasing EternaBench did not write a dataset card for this dataset, so this card has been written by the MultiMolecule team.**

## Dataset Description

- **Homepage**: https://multimolecule.danling.org/datasets/eternabench
- **Point of Contact**: [Rhiju Das](https://biochemistry.stanford.edu/people/rhiju-das/)

## Example Entry

| ID | design_name | sequence | structure | reactivity | errors | signal_to_noise |
| --- | --- | --- | --- | --- | --- | --- |
| 769337-1 | d+m plots weaker again | GGAAAAAAAAAAA... | ................ | [0.642,1.4853,0.1629, ...] | [0.3181,0.4221,0.1823, ...] | 3.227 |

## Column Description

Each entry in the EternaBench dataset has the following columns:

- **ID**:
A unique identifier for each RNA sequence entry.

- **design_name**:
The name given to each RNA design by contributors, used for easy reference.

- **sequence**:
The nucleotide sequence of the RNA, using standard bases.

- **structure**:
The predicted secondary structure of the RNA, represented using dot-bracket notation.
The structure helps determine the likely secondary interactions within each RNA molecule.

- **reactivity**:
A list of floating-point values giving the measured chemical mapping reactivity at each nucleotide position.
Higher values generally indicate nucleotides that are more flexible and therefore more likely to be unpaired.

- **errors**:
A list of floating-point values giving the experimental error of each measurement in **reactivity**.
These values quantify the uncertainty in the reactivity measurements.

- **signal_to_noise**:
The signal-to-noise ratio of the measurements for the entry, stored as a single floating-point value.
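
The dot-bracket notation used in the **structure** column can be decoded into base pairs with a simple stack. The following is a minimal sketch, assuming the structure uses only `(`, `)` and `.` (no pseudoknot brackets); the function name `dot_bracket_to_pairs` is illustrative and is not part of MultiMolecule or EternaBench.

```python
from __future__ import annotations


def dot_bracket_to_pairs(structure: str) -> list[tuple[int, int]]:
    """Return 0-based (i, j) index pairs for every matched '(' and ')'."""
    stack: list[int] = []  # positions of '(' that are still unmatched
    pairs: list[tuple[int, int]] = []
    for index, symbol in enumerate(structure):
        if symbol == "(":
            stack.append(index)
        elif symbol == ")":
            if not stack:
                raise ValueError(f"unmatched ')' at position {index}")
            pairs.append((stack.pop(), index))
        elif symbol != ".":
            # Assumption: anything other than '(', ')' and '.' is unsupported here.
            raise ValueError(f"unsupported symbol {symbol!r} at position {index}")
    if stack:
        raise ValueError(f"unmatched '(' at position {stack[-1]}")
    return sorted(pairs)


print(dot_bracket_to_pairs("((..))."))  # [(0, 5), (1, 4)]
```

A structure made only of dots, like the one in the example entry, yields an empty list of pairs.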

## Variations

This dataset is available in two subsets:

- [EternaBench-CM](https://huggingface.co/datasets/multimolecule/eternabench-cm)
- [EternaBench-Switch](https://huggingface.co/datasets/multimolecule/eternabench-switch)
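
As a rough sketch, either subset could be loaded with the Hugging Face `datasets` library. The repository ids come from the links above; the helper names `repo_id` and `load_subset` are hypothetical, and the availability of a `train` split in each subset is an assumption based on the conversion script, which saves `train` and `test` splits.

```python
from __future__ import annotations

# Subset suffixes taken from the repository links in this card.
SUBSETS = ("cm", "switch")


def repo_id(subset: str) -> str:
    """Build the Hugging Face repository id for an EternaBench subset."""
    if subset not in SUBSETS:
        raise ValueError(f"unknown subset {subset!r}, expected one of {SUBSETS}")
    return f"multimolecule/eternabench-{subset}"


def load_subset(subset: str, split: str = "train"):
    """Download one split of a subset (requires the `datasets` package and network access)."""
    # Imported lazily so the helpers above work without `datasets` installed.
    from datasets import load_dataset

    return load_dataset(repo_id(subset), split=split)
```

Calling `load_subset("cm", split="test")` would then download the test split of EternaBench-CM, provided that split exists on the Hub.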

## License

This dataset is licensed under the [AGPL-3.0 License](https://www.gnu.org/licenses/agpl-3.0.html).

```spdx
SPDX-License-Identifier: AGPL-3.0-or-later
```

## Citation

```bibtex
@article{waymentsteele2022rna,
author = {Wayment-Steele, Hannah K and Kladwang, Wipapat and Strom, Alexandra I and Lee, Jeehyung and Treuille, Adrien and Becka, Alex and {Eterna Participants} and Das, Rhiju},
journal = {Nature Methods},
month = oct,
number = 10,
pages = {1234--1242},
publisher = {Springer Science and Business Media LLC},
title = {{RNA} secondary structure packages evaluated and improved by high-throughput experiments},
volume = 19,
year = 2022
}
```
60 changes: 60 additions & 0 deletions multimolecule/datasets/eternabench/eternabench.py
@@ -0,0 +1,60 @@
# MultiMolecule
# Copyright (C) 2024-Present MultiMolecule

# This program is free software: you can redistribute it and/or modify
# it under the terms of the GNU Affero General Public License as published by
# the Free Software Foundation, either version 3 of the License, or
# any later version.

# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU Affero General Public License for more details.

# You should have received a copy of the GNU Affero General Public License
# along with this program. If not, see <http://www.gnu.org/licenses/>.

from __future__ import annotations

import os

import danling as dl
import pandas as pd
import torch

from multimolecule.datasets.conversion_utils import ConvertConfig as ConvertConfig_
from multimolecule.datasets.conversion_utils import save_dataset

torch.manual_seed(1016)

cols = [
    "ID",
    "design_name",
    "sequence",
    "structure",
    "reactivity",
    "errors",
    "signal_to_noise",
]


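def convert_dataset_(df: pd.DataFrame):
    # Copy the column subset so the assignment below does not write to a view of the input.
    df = df[cols].copy()
    # Keep only the text after the last ":" and convert it to a float.
    df["signal_to_noise"] = df["signal_to_noise"].str.split(":").str[-1].astype(float)
    return df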


def convert_dataset(convert_config):
    train = dl.load_pandas(convert_config.train_path)
    test = dl.load_pandas(convert_config.test_path)
    save_dataset(convert_config, {"train": convert_dataset_(train), "test": convert_dataset_(test)})


class ConvertConfig(ConvertConfig_):
    root: str = os.path.dirname(__file__)


if __name__ == "__main__":
    config = ConvertConfig()
    config.parse()  # type: ignore[attr-defined]
    convert_dataset(config)
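
Note on `convert_dataset_`: the `signal_to_noise` conversion keeps whatever follows the last `:` in each raw value and casts it to a float. The sketch below mirrors that step in plain Python for a single value. The raw format `"label:3.227"` is an assumption made for illustration, since the commit does not show the raw data, and `parse_signal_to_noise` is a hypothetical helper name.

```python
def parse_signal_to_noise(raw: str) -> float:
    """Keep the text after the last ':' and convert it to a float."""
    return float(raw.split(":")[-1])


# Hypothetical raw value with a label prefix.
print(parse_signal_to_noise("label:3.227"))  # 3.227
# A value without any ':' is converted unchanged.
print(parse_signal_to_noise("3.227"))  # 3.227
```

Because only the last segment is kept, values with several `:` separators are handled the same way as values with one.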
