Scripting analyses of genomes in Ensembl Plants

This repo contains code examples for interrogating Ensembl Plants from your own scripts and for masking & annotating repeats and calling pangenes in plant genomes.

List of recipes
Dependencies
- FTP
- MySQL
- Perl
- Python
- R
Repeat masking and annotation
Pangenes
Phylogenomics
Species tree
Citation

List of recipes

The code for the recipes in this section can be found in folder recipes. They are grouped by type (API, BioMart, CRAM, FTP, MySQL, REST & VEP) and their dependencies are explained below. To create your own recipes please read the appropriate documentation:

type	URLs
API	http://plants.ensembl.org/info/data/api.html
BioMart	http://plants.ensembl.org/info/data/biomart/index.html
FTP	http://plants.ensembl.org/info/data/ftp
MySQL	http://plants.ensembl.org/info/data/mysql.html
REST	http://plants.ensembl.org/info/data/rest.html
VEP	http://plants.ensembl.org/info/docs/tools/vep/index.html

These are the script recipes, obtained with grep -P "^## \w\d+" recipes/example* :

exampleAPI.pl:## A1) Load the Registry object with details of genomes available
exampleAPI.pl:## A2) Check which analyses are available for a species
exampleAPI.pl:## A3) Get soft masked sequences from Arabidopsis thaliana
exampleAPI.pl:## A4) Get BED file with repeats in chr4
exampleAPI.pl:## A5) Find the DEAR3 gene
exampleAPI.pl:## A6) Get the transcript used in Compara analyses
exampleAPI.pl:## A7) Find all orthologues of a gene
exampleAPI.pl:## A8) Get markers mapped on chr1D of bread wheat
exampleAPI.pl:## A9) Find all syntelogues among rices
exampleAPI.pl:## A10) Print all translations for otherfeatures genes

exampleBiomart.R:## B1) Check plant marts and select dataset
exampleBiomart.R:## B2) Check available filters and attributes
exampleBiomart.R:## B3) Download GO terms associated to genes
exampleBiomart.R:## B4) Get Pfam domains annotated in genes
exampleBiomart.R:## B5) Get SNP consequences from a selected variation source

exampleCRAM.pl:## C1) Find RNA-seq CRAM files for a genome assembly

exampleFTP.sh:## F1) Download peptide sequences in FASTA format
exampleFTP.sh:## F2) Download CDS nucleotide sequences in FASTA format
exampleFTP.sh:## F3) Download transcripts (cDNA) in FASTA format
exampleFTP.sh:## F4) Download soft-masked genomic sequences
exampleFTP.sh:## F5) Upstream/downstream sequences
exampleFTP.sh:## F6) Get mappings to UniProt proteins
exampleFTP.sh:## F7) Get indexed, bgzipped VCF file with variants mapped
exampleFTP.sh:## F8) Get precomputed VEP cache files
exampleFTP.sh:## F9) Download all homologies in a single TSV file, several GBs
exampleFTP.sh:## F10) Download UniProt report of Ensembl Plants, 
exampleFTP.sh:## F11) Retrieve list of new species in current release
exampleFTP.sh:## F12) Get current plant species tree (cladogram)

exampleMySQL.sh:## S1) Check currently supported Ensembl Genomes (EG) core schemas,
exampleMySQL.sh:## S2) Count protein-coding genes of a particular species
exampleMySQL.sh:## S3) Get stable_ids of transcripts used in Compara analyses 
exampleMySQL.sh:## S4) Get variants significantly associated to phenotypes
exampleMySQL.sh:## S5) Get Triticum aestivum homeologous genes across A,B & D subgenomes
exampleMySQL.sh:## S6) Count the number of whole-genome alignments of all genomes 
exampleMySQL.sh:## S7) Extract all the mutations and consequences for a selected wheat line
exampleMySQL.sh:## S8) Get FASTA of repeated sequences from selected species
exampleMySQL.sh:## S9) Get GFF of repeated sequences from selected species

exampleREST:## R1) Create a HTTP client and a helper functions 
exampleREST:## R2) Get metadata for all plant species 
exampleREST:## R3) Find features overlapping genomic region
exampleREST:## R4) Fetch phenotypes overlapping genomic region
exampleREST:## R5) Find homologues of selected gene
exampleREST:## R6) Get annotation of orthologous genes/proteins
exampleREST:## R7) Fetch variant consequences for multiple variant ids
exampleREST:## R8) Check consequences of SNP within CDS sequence
exampleREST:## R9) Retrieve variation sources of a species
exampleREST:## R10) Get soft-masked upstream sequence of gene in otherfeatures track
exampleREST:## R11) Get all species under a given taxonomy clade
exampleREST:## R12) transfer coordinates across genome alignments between species

exampleVEP.sh:## V1) Download, install and update VEP
exampleVEP.sh:## V2) Unpack downloaded cache file & check SIFT support 
exampleVEP.sh:## V3) Predict effect of variants 
exampleVEP.sh:## V4) Predict effect of variants for species not in Ensembl
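
The listing above can also be produced without grep. The following is a minimal Python sketch, assuming it is run from the root of this repository and that recipe headers follow the same `## <letter><number>` pattern used in the grep call; the function names are illustrative and not part of the repository.

```python
import re
from pathlib import Path

# Same pattern as the grep call: "## ", one word character, then one or more digits
RECIPE_HEADER = re.compile(r"^## \w\d+")


def recipe_headers(text):
    """Return the lines of text that look like recipe headers, e.g. '## A1) ...'."""
    return [line for line in text.splitlines() if RECIPE_HEADER.match(line)]


def list_recipes(folder="recipes"):
    """Print 'file:header' for every example* file found in folder."""
    for path in sorted(Path(folder).glob("example*")):
        if not path.is_file():
            continue
        for header in recipe_headers(path.read_text(errors="replace")):
            print(f"{path.name}:{header}")
```

Calling `list_recipes()` prints one line per recipe, which makes it easy to check that a newly added recipe follows the header convention.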

Dependencies

Some of the recipes and scripts depend on additional software packages; see below to learn how to install them. Note that only make install requires sudo, so you might need help from your sysadmin for that task.
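
Before installing anything, it can help to see which command-line tools are already on your PATH. This is a minimal sketch using only the Python standard library; the tool list is an assumption collected from the names mentioned in this README and should be trimmed to the recipes you plan to run.

```python
import shutil

# Command-line tools named in this README (assumed list, edit as needed)
TOOLS = ["wget", "mysql", "perl", "cpanm", "pip3", "Rscript"]


def find_tools(names=TOOLS):
    """Map each tool name to its full path, or None when it is not on PATH."""
    return {name: shutil.which(name) for name in names}


def missing_tools(names=TOOLS):
    """Return the sorted names of the tools that could not be found."""
    return sorted(name for name, path in find_tools(names).items() if path is None)
```

If `missing_tools()` returns an empty list, the corresponding install steps below can probably be skipped.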

FTP

The examples for bulk downloads from the FTP site require the software wget, which is usually installed on most Linux distributions. For macOS it is available on Homebrew. For Windows it ships with MobaXterm. On Debian/Ubuntu systems you can also install it with (requires sudo):

make install

MySQL

The examples for SQL queries to Ensembl Genomes database servers require the MySQL client. Depending on your Linux flavour this package can be named mysql-client or simply mysql. On Debian/Ubuntu systems you can also install it with (requires sudo):

make install
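
Once the client is installed, a query can also be driven from a script. The sketch below wraps the mysql client with Python's subprocess module. The host, port and user are assumptions about the public Ensembl Genomes server and should be checked against the MySQL documentation linked above; the function names are illustrative.

```python
import subprocess

# Assumed connection details of the public Ensembl Genomes MySQL server;
# verify them in the MySQL documentation before use
EG_HOST = "mysql-eg-publicsql.ebi.ac.uk"
EG_PORT = 4157
EG_USER = "anonymous"


def mysql_command(sql, host=EG_HOST, port=EG_PORT, user=EG_USER):
    """Build the argument list for a non-interactive mysql client call."""
    return [
        "mysql",
        f"--host={host}",
        f"--port={port}",
        f"--user={user}",
        "--batch",  # tab-separated output, easier to parse
        f"--execute={sql}",
    ]


def run_query(sql):
    """Run sql with the mysql client and return the output lines (needs network)."""
    result = subprocess.run(
        mysql_command(sql), capture_output=True, text=True, check=True
    )
    return result.stdout.splitlines()
```

For example, `run_query('SHOW DATABASES LIKE "arabidopsis_thaliana_core_%"')` would list the matching core schemas, provided the server details above are correct.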

Perl

As listed in cpanfile, several modules are required for the REST examples: JSON, JSON::XS and HTTP::Tiny. Provided cpanm is available in your system (for instance after make install), these modules can be installed with:

#make install 
make install_REST

Similarly, the dependencies for the Ensembl VEP (DBI, DBD::mysql and Archive::Zip), together with those of the recipes that use the Ensembl Perl API, can be installed with:

#make install
make install_ensembl

Ensembl API installation instructions can be found here, or if you use git here. There is also a debugging guide, which lists some extra dependencies that you might not have, such as modules DBI and DBD::mysql. Note that your local Ensembl API should match the version of the current Ensembl release.

Python

The REST recipes written in Python require the requests library. Provided pip3 is available on your system (for instance after make install), it can be installed with:

#make install
make install_REST
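
To check that requests works, a short call to the REST service is enough. The sketch below is loosely modelled on recipe R2 (metadata for all plant species) and is not the code of that recipe. The server URL, the info/species endpoint with its division parameter, and the "species" key of the response are assumptions to be verified in the REST documentation linked above.

```python
REST_SERVER = "https://rest.ensembl.org"  # assumed public REST server


def build_species_request(division="EnsemblPlants"):
    """Return (url, params, headers) for a species metadata request."""
    url = f"{REST_SERVER}/info/species"
    params = {"division": division}
    headers = {"Content-Type": "application/json"}
    return url, params, headers


def fetch_species(division="EnsemblPlants"):
    """Fetch species metadata as a list of dicts (needs network and requests)."""
    import requests  # imported here so the helper above works without it

    url, params, headers = build_species_request(division)
    response = requests.get(url, params=params, headers=headers, timeout=60)
    response.raise_for_status()
    # The "species" key is an assumption about the shape of the JSON response
    return response.json().get("species", [])
```

Keeping the request construction separate from the network call makes it possible to inspect the URL and parameters before contacting the server.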

R

For the BioMart recipes you will need the Bioconductor package biomaRt (read more here). For the REST recipes two core packages are required: httr and jsonlite. All of these can be installed with:

Rscript install_R_deps.R

Repeat masking and annotation

See examples and documentation in folder repeats.

If you want to annotate repeats you must first run:

#make install # install required bedtools
make install_repeats # requires gcc & g++ compilers

Pangenes

See examples and documentation in folder pangenes. We recommend checking out the Runmodes and HPC configuration docs.

Install it the bioconda way:

conda activate bioconda
conda create -n get_pangenes -c conda-forge -c bioconda get_pangenes
conda activate get_pangenes
# or simply
conda install bioconda::get_pangenes

Install it the compilation way:

#make install # install required bedtools
make install_pangenes # requires gcc & g++ compilers

# optionally you might also want to try:
make install_gsalign
make install_pangenes_quality

Phylogenomics

See examples and documentation in folder phylogenomics.

If you want to run any of those scripts you must first run:

#make install 
make install_REST

Species tree

Fig. 1. Species tree of Ensembl Plants release 47 obtained with recipe F12. Figure generated with iTOL

Citation

Contreras-Moreira B, Naamati G, Rosello M, Allen JE, Hunt SE, Muffato M, Gall A, Flicek P (2022) Scripting Analyses of Genomes in Ensembl Plants. In: Edwards D. (eds) Plant Bioinformatics. Methods in Molecular Biology, vol 2443. Humana, New York, NY. 10.1007/978-1-0716-2067-0_2

pangenes

For the pangene protocols the primary citation is:

Contreras-Moreira B, Saraf S, Naamati G, Casas AM, Amberkar SS, Flicek P, Jones AR & Dyer S (2023) GET_PANGENES: calling pangenes from plant genome alignments confirms presence-absence variation. Genome Biol 24, 223. https://doi.org/10.1186/s13059-023-03071-z

Check all the references you need to cite in each script by running:

perl get_pangenes.pl -v
perl check_evidence.pl -c
perl check_quality.pl -c
perl match_cluster.pl -c

repeats

For the scripts and data in the repeats folder please cite:

Contreras-Moreira B, Filippi CV, Naamati G, García Girón C, Allen JE, Flicek P (2021) Efficient masking of plant genomes by combining kmer counting and curated repeats. Plant Genome. https://doi.org/10.1002/tpg2.20143 (preprint https://www.biorxiv.org/content/10.1101/2021.03.22.436504v1)

Girgis HZ (2015) Red: an intelligent, rapid, accurate tool for detecting repeats de-novo on the genomic scale. BMC Bioinformatics 16:227. https://doi.org/10.1186/s12859-015-0654-5

Li H (2018) Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics 34(18):3094–3100. https://doi.org/10.1093/bioinformatics/bty191
