Want to annotate ncRNA in a genome, but having trouble navigating the dozens of different tools and databases out there? Trying to find functions for lncRNAs, but finding almost nothing?
This software may help you solve those problems.
RNA Gatherer provides ready-to-use pipelines for:
- annotate_ncrna.py: Annotation and prediction of ncRNA in genomes, taking into account transcriptome data, covariance models, reference sequences, reference annotations and data from public APIs;
- prophet.py: Computational prediction of lncRNA functions using gene coexpression.
RNA Gatherer requires some databases and software in order to run. It was developed for Linux x64 environments and uses a command line interface.
First of all, you should clone (or download) this repository:
git clone https://github.com/pentalpha/rna_gatherer.git
cd rna_gatherer
File | Is it mandatory? | Download Link |
---|---|---|
Gene Ontology Graph | Yes | http://purl.obolibrary.org/obo/go.obo |
RFAM Covariance Models | Yes | ftp://ftp.ebi.ac.uk/pub/databases/Rfam/CURRENT/Rfam.cm.gz |
Non-Redundant Proteins | Only if you want to remove the mRNA of known proteins from lncRNA data | ftp://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/nr.gz |
ncRNA Database FASTAs | Only if you want to look for known ncRNA through alignment | It can be ANY .fasta file. We suggest using RNA Central's database: ftp://ftp.ebi.ac.uk/pub/databases/RNAcentral/current_release/sequences/rnacentral_active.fasta.gz |
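If you prefer to script the downloads, here is a minimal Python sketch. It is not part of RNA Gatherer: the names `DATABASE_URLS`, `local_name`, `gunzip` and `fetch` are hypothetical helpers, and only the two mandatory URLs from the table are listed. It assumes you want the `.gz` files decompressed next to each other in a single directory.

```python
import gzip
import shutil
import urllib.request
from pathlib import Path
from urllib.parse import urlparse

# The two mandatory files, with URLs copied from the table above.
DATABASE_URLS = {
    "go_obo": "http://purl.obolibrary.org/obo/go.obo",
    "rfam_cm": "ftp://ftp.ebi.ac.uk/pub/databases/Rfam/CURRENT/Rfam.cm.gz",
}


def local_name(url):
    """File name the download ends up with, without a trailing .gz."""
    name = Path(urlparse(url).path).name
    return name[:-3] if name.endswith(".gz") else name


def gunzip(src, dest):
    """Decompress src (.gz) into dest, streaming so large files are not loaded whole."""
    with gzip.open(src, "rb") as fin, open(dest, "wb") as fout:
        shutil.copyfileobj(fin, fout)


def fetch(url, out_dir):
    """Download url into out_dir, decompress it if needed, return the final path."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    final_path = out_dir / local_name(url)
    if final_path.exists():
        return final_path  # already there, skip the download
    raw_path = out_dir / Path(urlparse(url).path).name
    urllib.request.urlretrieve(url, raw_path)
    if raw_path != final_path:
        gunzip(raw_path, final_path)
        raw_path.unlink()  # drop the compressed copy
    return final_path
```

Calling `fetch(url, "databases")` for each entry of `DATABASE_URLS` returns the paths you would then write into config.json. The optional files can be added to the dictionary in the same way.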
After downloading them (and decompressing the .gz files), edit the config.json file to include the full paths. If the file does not already exist, create it:
cp config.dummy.json config.json
Now, open config.json with your favorite text editor. Fill in the empty fields with the path to the downloaded files:
[...]
"rna_dbs": {"DB Name": "<db_path>",
"DB Name 2": "<db_path_2>", ...},
"non_redundant": "path/to/nr.fasta",
"go_obo": "path/to/go.obo",
"rfam_cm": "path/to/Rfam.cm"
}
Non-mandatory fields can be left empty.
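Before running the pipelines, it can help to check that the paths in config.json actually point to files. The sketch below is not part of RNA Gatherer: `check_config` and `check_config_file` are hypothetical helpers, and they only look at the four fields shown above (any other field in your config.json is ignored). Treating `go_obo` and `rfam_cm` as required follows the "Is it mandatory?" column of the table.

```python
import json
import os

REQUIRED_KEYS = ("go_obo", "rfam_cm")   # marked "Yes" in the table
OPTIONAL_KEYS = ("non_redundant",)      # may be left empty


def check_config(config):
    """Return a list of human-readable problems found in a parsed config."""
    problems = []
    for key in REQUIRED_KEYS:
        path = config.get(key, "")
        if not path:
            problems.append(f"{key}: required but empty")
        elif not os.path.isfile(path):
            problems.append(f"{key}: file not found: {path}")
    for key in OPTIONAL_KEYS:
        path = config.get(key, "")
        if path and not os.path.isfile(path):
            problems.append(f"{key}: file not found: {path}")
    # rna_dbs maps a database name to a FASTA path; empty paths are skipped.
    for name, path in (config.get("rna_dbs") or {}).items():
        if path and not os.path.isfile(path):
            problems.append(f"rna_dbs[{name}]: file not found: {path}")
    return problems


def check_config_file(config_path="config.json"):
    """Load config_path and run check_config on its content."""
    with open(config_path) as handle:
        return check_config(json.load(handle))
```

An empty list from `check_config_file()` means every filled-in path exists. It says nothing about whether the files have the right content.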
The required software is listed in the environment.yml file. Using conda, you can create the environment in one command:
conda env create -f environment.yml
Now activate the new environment in order to use the software:
conda activate rna
This is an extensive pipeline for detecting ncRNA in a given genome. Given a genome (and maybe some optional inputs), it will give you a non-redundant .GFF annotation file and a .TSV file with functional annotations, based on RFAM and other databases.
A basic command would be:
python annotate_ncrna.py -g [genome.fasta] \
-tx [taxonomic ID for species] \
-o [output directory]
These are the only required input arguments, but other inputs can be passed in order to make the annotation a lot better!
Passing a transcriptome with -tr enables the annotation of lncRNA transcripts:
python annotate_ncrna.py -g [genome.fasta] \
-tx [taxonomic ID for species] \
-tr [transcriptome.fasta] \
-o [output directory]
Passing -gff includes an ncRNA reference annotation file (.gff format):
python annotate_ncrna.py -g [genome.fasta] \
-tx [taxonomic ID for species] \
-gff [reference.gff] \
-o [output directory]
You can find reference files like these for many species here. Please note that including mRNA in the reference annotation can interfere with the ncRNA annotation.
Many species have reference ncRNA sequences out there with no position in the genome. RNA Gatherer can map them for you:
python annotate_ncrna.py -g [genome.fasta] \
-tx [taxonomic ID for species] \
-ref [reference.fasta] \
-o [output directory]
For a more detailed description of the command line arguments, use --help:
python annotate_ncrna.py --help
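If you call the pipeline from another script, the commands above can be assembled programmatically. The sketch below uses only the flags shown in this README (-g, -tx, -o, -tr, -gff, -ref). `build_annotate_command` is a hypothetical helper, and it assumes the optional inputs can be combined in a single call, which you may want to confirm with --help.

```python
def build_annotate_command(genome, tax_id, out_dir,
                           transcriptome=None,
                           reference_gff=None,
                           reference_fasta=None):
    """Build the argument list for annotate_ncrna.py from the available inputs."""
    # Required arguments, in the order used by the basic command.
    cmd = ["python", "annotate_ncrna.py",
           "-g", str(genome),
           "-tx", str(tax_id),
           "-o", str(out_dir)]
    # Optional inputs are appended only when a value was given.
    optional = (("-tr", transcriptome),
                ("-gff", reference_gff),
                ("-ref", reference_fasta))
    for flag, value in optional:
        if value is not None:
            cmd += [flag, str(value)]
    return cmd
```

The returned list can be passed to `subprocess.run(cmd, check=True)` from inside the repository directory, with the conda environment active. Using a list instead of a single string avoids quoting problems with paths that contain spaces.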
Given a count reads table, a list of lncRNA names and an annotation of coding genes, this tool predicts the functions (Gene Ontology terms) of lncRNAs.
An example command:
python prophet.py -cr test_data/counts/mus_musculus_tpm.tsv \
-reg test_data/lnc_list/mus_musculus_lncRNA.txt \
-ann test_data/annotation/mgi_genes_annotation.tsv \
-o output_directory
-cr: The count reads table is a simple .TSV table where the first row holds the sample names and each following row starts with a gene name, followed by the read counts in each sample. The counts must be normalized, preferably with TPM. It must include counts for both lncRNA and mRNA (example).
-reg: The lncRNA list specifies which of the genes in the count reads table are lncRNA. It's a simple .TXT file where every line is a lncRNA name (example).
-ann: The functional annotation for the coding genes, another .TSV table. Each line contains a gene name, a GO term and the respective ontology - molecular_function, biological_process or cellular_component (example).
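To make the three formats concrete, here is a small Python sketch that parses them and reports names missing from the counts table. It is an illustration only, not code from prophet.py: all function names are hypothetical, and it assumes the annotation file has no header line and that the counts header may or may not carry a label above the gene column.

```python
import csv

ONTOLOGIES = {"molecular_function", "biological_process", "cellular_component"}


def read_counts(handle):
    """Parse the -cr table into (sample_names, {gene: [counts]})."""
    rows = [row for row in csv.reader(handle, delimiter="\t") if row]
    header, body = rows[0], rows[1:]
    counts = {row[0]: [float(value) for value in row[1:]] for row in body}
    n_samples = len(body[0]) - 1 if body else len(header)
    # Keep only the last n_samples header cells, in case the first cell
    # is a label for the gene column rather than a sample name.
    return header[len(header) - n_samples:], counts


def read_lnc_names(handle):
    """Parse the -reg list: one lncRNA name per non-empty line."""
    return {line.strip() for line in handle if line.strip()}


def read_annotation(handle):
    """Parse the -ann table into {gene: {(go_term, ontology), ...}}."""
    annotation = {}
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) < 3:
            continue  # skip blank or incomplete lines
        gene, go_term, ontology = row[:3]
        if ontology not in ONTOLOGIES:
            raise ValueError(f"unknown ontology {ontology!r} for gene {gene}")
        annotation.setdefault(gene, set()).add((go_term, ontology))
    return annotation


def missing_from_counts(counts, names):
    """Names (lncRNA or coding genes) that have no row in the counts table."""
    return sorted(name for name in names if name not in counts)
```

Open each file with `open(path, newline="")` and pass the handle to the matching reader. A non-empty result from `missing_from_counts(counts, lnc_names)` points to lncRNA names that do not appear in the counts table.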
For a more detailed description of the command line arguments, use --help:
python prophet.py --help
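For intuition about function prediction by gene coexpression, here is a toy "guilt by association" sketch: a lncRNA receives the GO terms of the coding genes whose expression profile correlates strongly with its own. This is not prophet.py's algorithm. The use of Pearson correlation, the `min_corr` threshold and the function names are all assumptions made for illustration.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length profiles (0.0 if one is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def predict_terms(counts, lnc_names, annotation, min_corr=0.9):
    """Transfer GO terms from coexpressed coding genes to each lncRNA.

    counts:     {gene: [normalized count per sample]}
    lnc_names:  set of gene names that are lncRNA
    annotation: {coding gene: set of GO terms}
    """
    predictions = {}
    for lnc in lnc_names:
        if lnc not in counts:
            continue  # no expression profile, nothing to correlate
        terms = set()
        for gene, go_terms in annotation.items():
            if gene not in counts or gene in lnc_names:
                continue
            if pearson(counts[lnc], counts[gene]) >= min_corr:
                terms |= go_terms
        predictions[lnc] = terms
    return predictions
```

A real analysis would also need to control for the number of correlations tested and for how informative each GO term is. The sketch leaves both out.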
- Create a utility to download the databases for the user.