Esprit 2 is a software package to detect split genes in a proteome using a statistical test based on reconstructed gene trees with related reference genomes. The approach is described in detail in https://doi.org/10.1093/bioinformatics/bty772
Clone the git repository from https://github.com/dessimozlab/esprit2
git clone https://github.com/dessimozlab/esprit2
Adjust the load_env file to match your environment, e.g. your Python environment and the paths to the required software.
You need to provide a folder with FASTA files, one per gene family, and an orthoXML file. The simplest way to obtain these is to run OMA Standalone on your dataset. The file HierarchicalGroups.orthoxml and the folder HOGFasta that it produces are the two inputs you need for Esprit 2. Place them in your esprit2 directory.
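Before submitting jobs, it can help to confirm that both inputs are where you expect them. The following is a minimal sketch, not part of Esprit 2: the function name check_inputs is hypothetical, and it assumes the OMA Standalone names HierarchicalGroups.orthoxml and HOGFasta were kept when copying the files into the esprit2 directory.

```python
from pathlib import Path


def check_inputs(workdir):
    """Return a list of problems found with the Esprit 2 inputs in workdir.

    Assumes the OMA Standalone names were kept (hypothetical helper,
    not part of the Esprit 2 code base).
    """
    workdir = Path(workdir)
    orthoxml = workdir / "HierarchicalGroups.orthoxml"
    fasta_dir = workdir / "HOGFasta"

    problems = []
    if not orthoxml.is_file():
        problems.append(f"missing orthoxml file: {orthoxml}")
    if not fasta_dir.is_dir():
        problems.append(f"missing family folder: {fasta_dir}")
    elif not any(p.is_file() for p in fasta_dir.iterdir()):
        problems.append(f"no family files found in: {fasta_dir}")
    return problems


if __name__ == "__main__":
    # Check the current directory and report anything that is missing.
    for problem in check_inputs("."):
        print(problem)
```

An empty list means both inputs were found; the check does not validate the content of the files.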
As of now, Esprit 2 needs a Sun Grid Engine (SGE) scheduler. This will likely change in the future, with support extended to other HPC schedulers.
Run Esprit 2 with:
./pipeline.sh <ID_prefix> <path_to_family_folder> <path_to_orthoxml>
ID_prefix is a unique species- or chromosome-specific substring of a gene ID. For example, in the case of wheat, where all gene IDs follow the format Traes_chromosomeArm_string (e.g., Traes_1BL_6EC2AB17D) or TRAES3Bstring (e.g., TRAES3BF036000300CFD) for the 3B reference assembly, an ID_prefix could be:
- Traes_chromosomeArm, e.g., Traes_1BL - for detecting split genes within a chromosome arm, e.g., the long arm of chromosome 1B
- Traes_chromosome, e.g., Traes_1B - for detecting split genes within a chromosome, e.g., chromosome 1B. This will probably yield candidate pairs where one fragment has been assigned to 1BL (long arm) and the other to 1BS (short arm).
- Traes - for detecting split genes within the whole wheat genome. Please be aware that the set of candidate pairs will contain fragments coming from different chromosomes.
- TRAES3B - for detecting split genes within the 3B reference assembly
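To see which genes a given ID_prefix would select, the choices above can be tried on a small list of IDs. This is only an illustrative sketch: it assumes the prefix is matched case-sensitively at the start of the gene ID, which may differ from how pipeline.sh matches it internally. The function select_genes is hypothetical, and the two IDs containing EXAMPLE are made up.

```python
def select_genes(gene_ids, id_prefix):
    """Return the gene IDs that start with id_prefix (case-sensitive).

    Assumption: prefix matching at the start of the ID; Esprit 2 itself
    may match the substring differently.
    """
    return [g for g in gene_ids if g.startswith(id_prefix)]


genes = [
    "Traes_1BL_6EC2AB17D",      # from the README
    "Traes_1BS_EXAMPLE0001",    # made-up ID for illustration
    "Traes_2AL_EXAMPLE0002",    # made-up ID for illustration
    "TRAES3BF036000300CFD",     # from the README
]

# Compare how many genes each prefix choice would bring into scope.
for prefix in ["Traes_1BL", "Traes_1B", "Traes", "TRAES3B"]:
    print(prefix, len(select_genes(genes, prefix)))
```

A broader prefix brings more genes into scope, which is why the whole-genome prefix also produces candidate pairs spanning different chromosomes.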
Calling the pipeline.sh script with -h prints the additional parameters you can specify, together with their default values.
collapsing_results.txt
- columns: gene1, gene2, sister taxa before collapsing (True/False), sister taxa after collapsing (True/False)
lrt_summaries.tar.gz, lrt_summary.txt, missing_lk.txt
- tar.gz contains a summary per case tested, lrt_summary.txt provides test statistics and p-values for all cases, missing_lk.txt indicates cases where the tree likelihood wasn't computed (please have a look at these computations and investigate what went wrong)
predictions_ambiguous.txt, predictions_unambiguous.txt
- contain gene IDs for predictions
details_predictions.txt
- columns: HOG ID, number of sequences in the HOG, number of sequences from the target species (or chromosome) in the HOG, number of candidate pairs in the HOG, number of predictions in the HOG, gene1, gene2, length of gene1, length of gene2, type of prediction ('A' for ambiguous, 'U' for unambiguous)
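The column list above maps naturally onto a small record type for downstream analysis. The sketch below is not part of Esprit 2; it assumes details_predictions.txt is tab-separated with exactly the ten columns in the documented order, and the field names and the sample row are made up for illustration.

```python
import csv
import io
from dataclasses import dataclass


@dataclass
class Prediction:
    hog_id: str
    hog_size: int        # number of sequences in the HOG
    n_target: int        # sequences from the target species/chromosome
    n_candidates: int    # candidate pairs in the HOG
    n_predictions: int   # predictions in the HOG
    gene1: str
    gene2: str
    len_gene1: int
    len_gene2: int
    kind: str            # 'A' (ambiguous) or 'U' (unambiguous)


def read_predictions(handle):
    """Parse rows in the documented column order; skip lines that do not fit."""
    predictions = []
    for f in csv.reader(handle, delimiter="\t"):  # assumption: tab-separated
        if len(f) != 10:
            continue  # blank or malformed line
        try:
            predictions.append(Prediction(
                f[0], int(f[1]), int(f[2]), int(f[3]), int(f[4]),
                f[5], f[6], int(f[7]), int(f[8]), f[9]))
        except ValueError:
            continue  # e.g. a header line, if the file has one
    return predictions


# Made-up row for illustration only (not real Esprit 2 output).
sample = ("HOG0000001\t12\t3\t2\t1\tTraes_1BL_6EC2AB17D\t"
          "Traes_1BL_EXAMPLE0001\t210\t185\tU\n")
preds = read_predictions(io.StringIO(sample))
unambiguous = [(p.gene1, p.gene2) for p in preds if p.kind == "U"]
print(unambiguous)
```

For a real run, pass an open file handle for details_predictions.txt instead of the in-memory sample.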
updated_gff_file.gff
- updated GFF file with inferred predictions (merged gene features). Only available if input GFF file is specified
alignment_positions.txt
TSV file with the following columns:
- HOG ID
- gene1
- gene2
- start position of gene1 in the MSA
- end position of gene1 in the MSA
- start position of gene2 in the MSA
- end position of gene2 in the MSA
- overlap start position (or -1 if no overlap)
- overlap end position (or -1 if no overlap)
- %overlap of aligned gene1
- %overlap of aligned gene2
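A row of alignment_positions.txt can be read as follows. This is a sketch under stated assumptions rather than Esprit 2 code: it assumes the eleven columns appear in the order listed above, that positions are integers, and that the overlap percentages are written as plain numbers without a percent sign. The dictionary keys and both sample rows are made up.

```python
def parse_alignment_row(line):
    """Turn one TSV line into a dict keyed by (hypothetical) column names."""
    f = line.rstrip("\n").split("\t")
    if len(f) != 11:
        raise ValueError(f"expected 11 columns, got {len(f)}")
    ov_start, ov_end = int(f[7]), int(f[8])
    no_overlap = ov_start == -1 or ov_end == -1
    return {
        "hog_id": f[0],
        "gene1": f[1],
        "gene2": f[2],
        "gene1_span": (int(f[3]), int(f[4])),   # start, end in the MSA
        "gene2_span": (int(f[5]), int(f[6])),   # start, end in the MSA
        # -1 marks "no overlap" in the file; map it to None here
        "overlap": None if no_overlap else (ov_start, ov_end),
        "pct_overlap_gene1": float(f[9]),
        "pct_overlap_gene2": float(f[10]),
    }


# Made-up rows for illustration only (not real Esprit 2 output).
disjoint = parse_alignment_row(
    "HOG0000001\tgeneA\tgeneB\t1\t120\t125\t300\t-1\t-1\t0\t0\n")
overlapping = parse_alignment_row(
    "HOG0000002\tgeneC\tgeneD\t1\t120\t100\t300\t100\t120\t17.5\t10.4\n")
print(disjoint["overlap"], overlapping["overlap"])
```

Mapping -1 to None keeps the "no overlap" case from being mistaken for a real alignment coordinate in later calculations.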
cuts.txt
- columns: HOG ID, gene1, gene2, their cut/middle position in the alignment
hog_size.txt
- columns: HOG ID, number of sequences in the HOG
mapping.txt
- mapping between OMA IDs and IWGSC IDs
sequence_lengths.txt
TSV file with the following columns: HOG ID, gene1, length of gene1, gene2, length of gene2
Also contains pairs with short sequence(s) that did not pass the minimum sequence length criterion
aln_c.tar.gz, aln.tar.gz, phy_c.tar.gz, phy.tar.gz
- contain aligned families in FASTA format (aln_c, aln) and PHYLIP format (phy_c, phy). aln_c and phy_c contain families with n-1 sequences whereas aln and phy contain n sequences
hog_aln.tar.gz
- alignments of HOGs which contain at least 2 wheat genes from the chromosome of interest
bootstrap_aln.tar.gz, bootstrap_s_aln.tar.gz, bootstrap_phy.tar.gz, bootstrap_s_phy.tar.gz
- same as above, but for bootstrap samples. bootstrap_aln.tar.gz and bootstrap_phy.tar.gz contain samples with n-1 sequences whereas bootstrap_s_aln.tar.gz and bootstrap_s_phy.tar.gz contain samples with n sequences
collapsed.tar.gz
- contains trees after collapsing
n_1_res.tar.gz, n_notop_res.tar.gz, n_top_res.tar.gz, n_1_b_res.tar.gz, n_b_notop_res.tar.gz, n_b_top_res.tar.gz
- contain stats output from FastTree
n_1_trees.tar.gz, n_trees_notop.tar.gz, n_1_b_trees.tar.gz
- contain the inferred FastTree trees
n_1_trees_s.tar.gz, n_1_b_trees_s.tar.gz
- contain input topologies for tree reconstructions with input topology