-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathREADME.Rmd
69 lines (45 loc) · 4.05 KB
/
README.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
---
output: github_document
---
# AMPBenchmark
AMPBenchmark is a part of our initative for the improvement of benchmarking standards in the field of antimicrobial peptide (AMP) prediction.
## How to use the public data?
1. Download the benchmark sequence data:
- [Dropbox link](https://www.dropbox.com/scl/fi/6hxboi6xy1jm1q1ie6vyg/AMPBenchmark_public.fasta?rlkey=3egb368kyh347fdfamcfd75m0&st=ld02vyiv&dl=0).
- [GitHub link](https://raw.githubusercontent.com/BioGenies/AMPBenchmark/main/data/AMPBenchmark_public.fasta?token=GHSAT0AAAAAABS4SIUMO3EI6JSQJJ2OC62WYUT5E6A).
2. Download the training sequence data for all methods and replications:
- [Dropbox link](https://www.dropbox.com/scl/fo/f8kdfgoa8htsvpc79v0u2/ANOcYXz3fSRyE5kEumDDsVs?rlkey=a0su8jyn5nsjnzs2gkqya5n24&st=xd69dycx&dl=0).
3. Train your model using each of the training data set (class of a sequence is denoted by AMP=1 for AMPs and AMP=0 for negative samples, see [Sequence data](https://github.com/BioGenies/AMPBenchmark#sequence-data) section for details.)
4. Benchmark trained models against our data. Make sure to use a subset of sequences for appropriate replication (replication number is denoted by, e.g. rep=1, see [Sequence data](https://github.com/BioGenies/AMPBenchmark#sequence-data) section for details.)
5. Submit the results in the format described below to the [AMPBenchmark web server](http://biogenies.info/AMPBenchmark/).
### Data submission format
| ID | training_sampling |AMP_probability |
|-------------------------|-------------------|----------------|
| DBAASP_10018_AMP=1_rep1 | dbAMP |0.97 |
| DBAASP_3217_AMP=1_rep1 | dbAMP |0.61 |
| ... | ... |... |
- **ID**: must contain the sequence ID, as provided in the FASTA headers of the input sequences.
- **training_sampling**: has to contain the type of negative sampling method used to train the model. Possible values are: *AMAP*, *AmpGram*, *ampir-mature*, *AMPlify*, *AMPScannerV2*, *CS-AMPPred*, *dbAMP*, *Gabere&Noble*, *iAMP-2L*, *Wang-et-al*, *Witten&Witten*. Remember that a proper benchmark requires you to train your model using every provided sampling method and evaluate it using all sampling methods using appropriate replication.
- **AMP_probability**: has to be in the range between 0 and 1.
Example data for a random classifier can be downloaded from [Dropbox](https://www.dropbox.com/scl/fi/xqeqdsygkxjg5qt2b7ezg/sample_data.csv?rlkey=ql7gtoumuecwbg5tr0frl81bb&st=w7pdevvn&dl=0).
### Sequence data
The input data is hosted on [Dropbox](https://www.dropbox.com/scl/fi/6hxboi6xy1jm1q1ie6vyg/AMPBenchmark_public.fasta?rlkey=3egb368kyh347fdfamcfd75m0&st=wj8wc93f&dl=0) and [GitHub](https://raw.githubusercontent.com/BioGenies/AMPBenchmark/main/data/AMPBenchmark_public.fasta?token=GHSAT0AAAAAABS4SIUMO3EI6JSQJJ2OC62WYUT5E6A). Note that this single file contains data for all replications which should be used separately with appropriate replications of training sets.
The training data sets are hosted on [Dropbox](https://www.dropbox.com/scl/fo/f8kdfgoa8htsvpc79v0u2/ANOcYXz3fSRyE5kEumDDsVs?rlkey=a0su8jyn5nsjnzs2gkqya5n24&st=vpcy0lyc&dl=0) and follow the same naming convention.
There are two types of the input sequences:
- positive sequence (e.g., **DBAASP_10718**\_*AMP=1*\_rep1): **IDinDBAASP**\_*class*\_replicateID.
- negative sequences (e.g., **Seq1896_sampling\_method=Gabere&Noble**\_*AMP=0*\_rep4): **IDandSamplingMethod**\_*class*\_replicateID.
AMP sequences are derived from the [DBAASP database](https://dbaasp.org/).
md5 sum of the **AMPBenchmark_public.fasta**: 58f1424c057aaeb64bc632cad6038cad.
```{r echo = FALSE, results = 'asis'}
source("https://raw.githubusercontent.com/BioGenies/NegativeDatasets/main/docs/rmd_scripts.R")
cat(negative_sampling_citation())
```
```{r echo = FALSE, results = 'asis'}
cat(negative_sampling_links())
```
```{r echo = FALSE, results = 'asis'}
cat(negative_sampling_contact())
```
## Changelog
- 2024/07/29: updated dropbox links.
- 2023/01/11: fixed data processing.