A simple pipeline to run a macro synteny analysis.
It is under development, so if you wish to use the pipeline for your own research, please contact us (ecoflow.ucl [at] gmail.com) and we can share the most up-to-date details of the methods and figures. A final version will be released when the pipeline is published.
Synteny is the study of chromosome arrangement and gene order. Over evolutionary time, two species diverge from their common ancestor through a variety of structural changes, including indels, inversions, translocations, fusions and fissions. This pipeline aims to produce common synteny plots, as well as tables documenting the types of syntenic change.
The pipeline takes a csv (comma-separated value) file as input, listing the species you wish to compare and their RefSeq IDs. Genomes must be chromosome-level assemblies, with a maximum of 50 chromosomes/scaffolds.
The main pipeline logic is as follows:
- Downloads the genome and gene annotation files [`DOWNLOAD`].
- Extracts gene fasta sequences [`GFFREAD`].
- Finds orthologous genes using LAST [`JCVI`].
- Finds syntenic blocks using MCScanX [`SYNTENY`].
- FUTURE: Plots figures and creates summary output tables [`PLOT_SCORE`], [`PLOT_TREE`] and [`SUMMARISE_PLOTS`].
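For orientation, the sketch below shows roughly the kinds of commands the download, extraction and ortholog-finding steps correspond to for a single species pair. It is illustrative only: it is not the pipeline's actual module code, and the exact tools, file names and flags used by the pipeline may differ.

```bash
# Illustrative sketch only; the pipeline's modules may use different tools/flags.

# Download genome + annotation for one accession (one possible way, via the NCBI datasets CLI)
datasets download genome accession GCF_016746365.2 --include genome,gff3

# Extract spliced CDS sequences for each transcript from the genome + annotation
gffread -x Drosophila_yakuba.nucl.fa -g genome.fna annotation.gff3

# Find orthologous gene pairs with JCVI (which runs LAST under the hood);
# this expects <species>.bed and <species>.cds files in the working directory
python -m jcvi.compara.catalog ortholog Drosophila_yakuba Drosophila_simulans --no_strip_names
```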
We have a short tutorial to help you test and explore the pipeline.
Nextflow pipelines require a few prerequisites. Further documentation on how to install Nextflow is available on the nf-core webpage.
- Docker or Singularity.
- Java (OpenJDK) >= 8 (Please note: Java versions are numbered 1.VERSION, so Java 8 appears as Java 1.8).
- Nextflow >= v23.07.0.
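You can quickly confirm these prerequisites are available on your system with the commands below (the exact version output varies between installations):

```bash
java -version         # should report 8 / 1.8 or newer
nextflow -version     # should report 23.07.0 or newer
docker --version      # or: singularity --version / apptainer --version
```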
To install the pipeline, use one of the following commands, replacing VERSION with a release tag.
wget https://github.com/Eco-Flow/synteny/archive/refs/tags/VERSION.tar.gz -O - | tar -xvf -
or
curl -L https://github.com/Eco-Flow/synteny/archive/refs/tags/VERSION.tar.gz --output - | tar -xvf -
This will create a directory called `synteny-VERSION` in the current directory, which contains the pipeline.
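For example, once a release has been unpacked you can print the pipeline's available parameters with its built-in help (assuming Nextflow is installed):

```bash
cd synteny-VERSION
nextflow run main.nf --help
```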
`--input /path/to/csv/file`
- A single csv file as input, in one of the two formats described below.
This csv can take two forms:
- A 2-field csv where each row is a unique species name followed by a RefSeq genome reference ID (NOT a GenBank reference ID), e.g. `data/Example-accession.csv`. The pipeline will download the corresponding genome fasta file and annotation gff3 (or Augustus gff) file.
- A 3-field csv where each row is a unique species name, followed by an absolute path to a genome fasta file, followed by an absolute path to an annotation gff3 (or Augustus gff) file, e.g. `data/Example-local.csv`. Input files can be gzipped (.gz) or uncompressed.

Please note: genomes must be chromosome-level, not contig-level, assemblies.
2 fields (Name,Refseq_ID):

```
Drosophila_yakuba,GCF_016746365.2
Drosophila_simulans,GCF_016746395.2
Drosophila_santomea,GCF_016746245.2
```

3 fields (Name,genome.fna,annotation.gff):

```
Drosophila_yakuba,data/Drosophila_yakuba/genome.fna.gz,data/Drosophila_yakuba/genomic.gff.gz
Drosophila_simulans,data/Drosophila_simulans/genome.fna.gz,data/Drosophila_simulans/genomic.gff.gz
Drosophila_santomea,data/Drosophila_santomea/genome.fna.gz,data/Drosophila_santomea/genomic.gff.gz
```
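If you use the 3-field (local paths) form, a quick sanity check that every listed genome and annotation file exists can save a failed run. The snippet below is a small illustration only; `my_input.csv` is a placeholder name for your own input file:

```bash
# Check that the genome (field 2) and annotation (field 3) paths in a
# 3-field input csv exist before launching the pipeline.
while IFS=',' read -r species genome gff; do
  for f in "$genome" "$gff"; do
    [ -e "$f" ] || echo "Missing file for $species: $f"
  done
done < my_input.csv
```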
`--outdir /path/to/output/directory`
- A path to the output directory where the results will be written (Default: `Results`).

`--hex /path/to/hex/file`
- A path to a file containing a single, unique hex code on each line, used when painting chromosomes (Default: `data/unique_hex`).

`--go /path/to/directory/containing/species/hash/files`
- A path to a directory containing a hash file for each species in the analysis, e.g. `data/go_input/hash_files`. These hash files can be generated by running the Goatee pipeline.

`--tree /path/to/tree/file`
- A path to a file containing a phylogenetic tree for all species, in Newick format, e.g. `data/score_tree_input/tree.txt`.

`--clean`
- A true or false value that determines whether the work directory is automatically deleted if the pipeline completes successfully. Deleting the work directory saves space, but it will then not be possible to use it for caching future runs (Default: `false`).

`--architecture`
- An `amd` or `arm` value that determines whether containers built for the amd or arm CPU architecture are used (Default: `amd`).

`--help`
- A true value causes the help message to be displayed instead of running the pipeline (Default: `false`).

`--custom_config`
- A path or URL to a custom configuration file.

`--jcvi_ortholog_arguments`
- Additional flags for the JCVI ortholog (LAST) step. Flags you may wish to change:
  - `--cscore=CSCORE` - the score cutoff [default: 0.7]
  - `--dist=DIST` - extent of flanking regions to search [default: 20]
  - `-n N, --min_size=N` - minimum number of anchors in a cluster [default: 4]
  - `--quota=QUOTA` - quota align parameter [default: none] (e.g. 1:1 to remove duplications)
- (Default: `--no_strip_names`).
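As an illustration, several of the optional parameters above can be combined in a single run. The values, paths and quoting below are placeholders chosen for the example; adjust them to your own data, and check the help output if the quoting expected by `--jcvi_ortholog_arguments` differs on your system:

```bash
nextflow run main.nf \
  -profile docker \
  --input data/Example-accession.csv \
  --outdir MyResults \
  --hex data/unique_hex \
  --clean true \
  --jcvi_ortholog_arguments "--cscore=0.9 --no_strip_names"
```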
This pipeline is designed to run in various modes that can be supplied as a comma-separated list, e.g. `-profile profile1,profile2`.
Please select one of the following profiles when running the pipeline.
`docker`
- This profile uses the container software Docker when running the pipeline. Docker requires root permissions, so it is typically used on cloud infrastructure or your local machine (depending on permissions). Please note: you must have Docker installed to use this profile.

`singularity`
- This profile uses the container software Singularity when running the pipeline. Singularity does not require root permissions, so it is typically used on on-premise HPCs or your local machine (depending on permissions). Please note: you must have Singularity installed to use this profile.

`apptainer`
- This profile uses the container software Apptainer when running the pipeline. Apptainer does not require root permissions, so it is typically used on on-premise HPCs or your local machine (depending on permissions). Please note: you must have Apptainer installed to use this profile.

`local`
- This profile is used if you are running the pipeline on your local machine.

`aws_batch`
- This profile is used if you are running the pipeline on AWS using the AWS Batch functionality. Please note: you must use the `docker` profile with AWS Batch.

`test`
- This profile is used if you want to test running the pipeline on your infrastructure. Please note: you do not provide any input parameters when this profile is selected, but you must still provide a container profile.
If you want to run this pipeline on your institute's on-premise HPC or specific cloud infrastructure then please contact us and we will help you build and test a custom config file. This config file will be published to our configs repository.
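As a minimal sketch (not a tested institutional config), a custom configuration file passed via `--custom_config` can simply describe your compute environment. The executor, queue and resource values below are placeholders and your scheduler may require different settings:

```groovy
// custom.config - minimal sketch; executor, queue and resources are placeholders
process {
    executor = 'slurm'
    queue    = 'normal'
    cpus     = 4
    memory   = '16 GB'
}
```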
Please note: the `-resume` flag reuses cached results from previous successful runs of the pipeline.
- Running the pipeline with the Singularity profile and a chosen input csv file:

```bash
nextflow run main.nf -profile singularity -resume --input data/Example-accession.csv
```

- Running the pipeline with the Docker and test profiles:

```bash
nextflow run main.nf -resume -profile docker,test
```

- Running the pipeline with all parameters:

```bash
nextflow run main.nf -profile docker,local -resume --input data/Example-local.csv --go data/go_input/hash_files --tree data/score_tree_input/tree.txt --ribbon Drosophila_yakuba,Drosophila_simulans,Drosophila_santomea --score
```

- Running the pipeline with a custom config file:

```bash
nextflow run main.nf -profile docker,aws_batch -resume --input data/Example-accession.csv --custom_config /path/to/custom/config
```
Once completed, your output directory (called `Results` unless you specified another name) will contain the following subdirectories:
`Figures`
- `Karyotype_plots` - Karyotype plots of each pairwise comparison (.karyotype.pdf), showing a one-to-one chromosome mapping with lines drawn between syntenic chromosomes.
- `Dotplot` - (.pdf) The chromosome synteny shown as a dot plot.
- `Depth_plot` - (.depth.pdf) The percentage of the genome corresponding to non-orthologous (0), 1-to-1 or 1-to-many orthologs detected.
- `Painted_chromosomes` - (.chromo.pdf) Graphic chromosomes showing, in colour, which sections are syntenic between the two species.
`Data`
- `Gffread` - Species gene fasta files (.nucl.fa), plus reformatted gff files (.gff_for_jvci.gff3).
- `Anchors` - (.anchors) Anchor files documenting the MCScanX genes in syntenic blocks, using the lifted function from JCVI.
- `Last` - Filtered LAST results for each pairwise run, filtered using default settings from JCVI.
`Tables`
- `Trans_Inversion_junction_merged.txt` - A summary of the types of syntenic break between sets of anchors.
- `Paired_anchor_change_junction_prediction` - A folder with each pairwise analysis of junction changes between syntenic blocks.
- `My_scores.tsv` - A pairwise table of the number of syntenic gene pairs, as well as the maximum and average syntenic block length (in numbers of genes).
- `Synteny_matrix.tsv` - A matrix of syntenic gene pair totals (pairwise).
- `Trans_location_version.out.txt` - A pairwise table of scores, documenting numbers of scaffolds, syntenic blocks and genes, as well as a variety of scores.
- `Synt_gene_scores` - A folder with pairwise gene scores, based on the distance to the nearest syntenic break, where '1' means a gene is on the edge of a syntenic block.
- `My_sim_cores.tsv` - A matrix containing nucleotide percentage similarities.
- `My_comp_synteny_similarity.tsv` - A matrix containing pairwise nucleotide percentages and the total number of syntenic genes.
All of the pipeline run information can be found inside `pipeline_info`.
This pipeline is not yet published. If you use this pipeline for your research please cite the main tool set we use (JCVI):
"Tang et al. (2008) Synteny and Collinearity in Plant Genomes. Science".
Ensure you record the exact release of the pipeline that you ran, as versions will change over time.
If you need any support do not hesitate to contact us at any of:
ecoflow.ucl [at] gmail.com
c.wyatt [at] ucl.ac.uk