get_phylomarkers

This project hosts the code for the get_phylomarkers pipeline. This file describes its aim and basic usage notes. See INSTALL.md for installation instructions. The code is developed and maintained by Pablo Vinuesa at CCG-UNAM, Mexico and Bruno Contreras-Moreira at EEAD-CSIC, Spain.

Aim

The pipeline selects markers with optimal phylogenetic attributes from the homologous gene clusters produced by get_homologues (available at GitHub), described in the following publications: Contreras-Moreira and Vinuesa, AEM 2013 and Vinuesa and Contreras-Moreira, 2015. The selected homologous gene/protein clusters are optimally suited for genome phylogenies. The pipeline is primarily tailored to select ideal markers to infer DNA-level phylogenies of different species of the same genus or family. It can also select optimal markers for population genetics, when the source genomes belong to the same species.

The pipeline is run by executing the main script run_get_phylomarkers_pipeline.sh. There are two runmodes: -R 1 (for phylogenetics) and -R 2 (for population genetics). The pipeline can be run on DNA or protein sequences (-t DNA|PROT). The latter is intended for the analysis of more divergent genome sequences, above the genus level.
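As a minimal sketch of how the two options combine, the Python helper below assembles a command line from the runmode and sequence type described above. The helper name `pipeline_argv` is hypothetical, and the sketch does not check whether the script accepts every combination of -R and -t.

```python
PIPELINE = "run_get_phylomarkers_pipeline.sh"

RUNMODES = {1: "phylogenetics", 2: "population genetics"}
SEQ_TYPES = ("DNA", "PROT")


def pipeline_argv(runmode, seq_type="DNA"):
    """Build the argument list for one pipeline run (hypothetical helper)."""
    if runmode not in RUNMODES:
        raise ValueError(f"runmode must be one of {sorted(RUNMODES)}")
    if seq_type not in SEQ_TYPES:
        raise ValueError(f"sequence type must be one of {SEQ_TYPES}")
    # Only flags named in this README are used: -R and -t.
    return [PIPELINE, "-R", str(runmode), "-t", seq_type]
```

The resulting list could be passed to `subprocess.run` from within the directory holding the clusters.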

Usage and design notes

  1. Start the run from within the directory holding core genome clusters generated by compare_clusters.pl

NOTE: both faa and fna files are required to generate codon alignments from DNA fasta files. This means that two runs of compare_clusters.pl (from the get_homologues package) are required, one of them using the -n flag.

  2. run_get_phylomarkers_pipeline.sh is intended to run on a collection of genomes from different species.

    NOTES: an absolute minimum of 4 distinct genomes is required. However, the power of the pipeline for selecting optimal genome loci for phylogenomics improves when a larger number of genomes is available for analysis. Reasonable numbers lie in the range of 10 to 100 clearly distinct genomes from multiple species of a genus, family, order or phylum. The pipeline may not perform satisfactorily with overly distant genome sequences, particularly when sequences with significantly distinct nucleotide or amino acid compositions are used. This type of sequence heterogeneity is well known to cause systematic bias in phylogenetic inference. Likewise, very distantly related organisms, such as those from different phyla or even domains, are not properly handled by run_get_phylomarkers_pipeline.sh.
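The two requirements above (paired faa/fna files and at least 4 genomes) can be checked before starting a run. The following is a minimal sketch, not part of the pipeline. It assumes that the faa and fna files of a cluster share the same file stem and that each core cluster holds one sequence per genome; all function names are hypothetical.

```python
from pathlib import PurePath

MIN_GENOMES = 4  # absolute minimum stated in the notes above


def split_paired_clusters(filenames):
    """Return (paired, unpaired) cluster stems.

    Assumption: the protein (.faa) and DNA (.fna) files of one cluster
    share the same file stem.
    """
    faa = {PurePath(f).stem for f in filenames if f.endswith(".faa")}
    fna = {PurePath(f).stem for f in filenames if f.endswith(".fna")}
    return sorted(faa & fna), sorted(faa ^ fna)


def count_fasta_records(fasta_text):
    """Count sequences in FASTA text by counting header lines."""
    return sum(1 for line in fasta_text.splitlines() if line.startswith(">"))


def enough_genomes(fasta_text, minimum=MIN_GENOMES):
    """True if the cluster holds at least `minimum` sequences.

    Assumption: one sequence per genome in each core cluster.
    """
    return count_fasta_records(fasta_text) >= minimum
```

Any stem reported as unpaired points to a cluster for which one of the two compare_clusters.pl runs produced no matching file.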

On the filtering criteria.

run_get_phylomarkers_pipeline.sh uses a hierarchical filtering scheme, as follows:

i) Detection of recombinant loci.

Codon or protein alignments (depending on runmode) are first screened with Phi for the presence of potential recombinant sequences. It is well established that recombinant sequences negatively impact phylogenetic inference when using algorithms that do not account for the effects of this evolutionary force. The p-values are computed with the permutation test, using 1000 permutations, and are considered significant if < 0.05.
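The decision rule of this first filter can be sketched as follows. This only illustrates the 0.05 cutoff applied to p-values that have already been computed. It does not run Phi, and the function name and locus names are hypothetical.

```python
PHI_ALPHA = 0.05  # significance cutoff stated above


def split_by_phi(p_values, alpha=PHI_ALPHA):
    """Split loci into (kept, flagged) given their Phi p-values.

    Loci with p < alpha show significant evidence of recombination
    and are flagged; all others are kept.
    """
    kept = sorted(locus for locus, p in p_values.items() if p >= alpha)
    flagged = sorted(locus for locus, p in p_values.items() if p < alpha)
    return kept, flagged
```

For example, `split_by_phi({"locusA": 0.62, "locusB": 0.003})` keeps `locusA` and flags `locusB`.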

ii) Detection of trees deviating from expectations of the (multispecies) coalescent.

  The next filtering step is provided by the kdetrees test, which checks the distribution of
  topologies, tree lengths and branch lengths. kdetrees is a non-parametric method for 
  estimating distributions of phylogenetic trees, with the goal of identifying trees that 
  are significantly different from the rest of the trees in the sample. Such "outlier" 
  trees may arise for example from horizontal gene transfers or gene duplication 
  (and subsequent neofunctionalization) followed by differential loss of paralogues among
  lineages. Such processes will cause the affected genes to exhibit a history distinct 
  from those of the majority of genes, which are expected to be generated by the 
  (multispecies) coalescent as species or populations diverge. Alignments producing 
  significantly deviating trees in the kdetrees test are discarded.
  
  * Parameter for controlling kdetrees stringency:
  -k <real> kde stringency (0.7-1.6 are reasonable values; less is more stringent)
            [default: 1.5]
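To illustrate why a smaller -k is more stringent, the sketch below applies a Tukey-style lower fence (first quartile minus k times the interquartile range) to per-tree scores. This is an assumption about how the stringency factor acts, not a reimplementation of kdetrees, and the scores in the usage note are made up.

```python
import statistics


def flag_outlier_trees(scores, k=1.5):
    """Return the names of trees whose score falls below Q1 - k * IQR.

    `scores` maps a tree (locus) name to a numeric score where lower
    means "less typical". Assumption: a Tukey-style lower fence.
    """
    q1, _median, q3 = statistics.quantiles(
        scores.values(), n=4, method="inclusive"
    )
    fence = q1 - k * (q3 - q1)
    return sorted(name for name, score in scores.items() if score < fence)
```

With scores 2, 7, 10, 11, 12, 13, 14, 15, 16 and 17, the default k = 1.5 flags only the tree scoring 2, whereas k = 0.7 raises the fence and also flags the tree scoring 7.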

iii) Phylogenetic signal content and congruence.

The alignments passing the two previous filters are subjected to maximum likelihood (ML) tree searches with FastTree to infer the corresponding ML gene trees. The phylogenetic signal of these trees is computed from the Shimodaira-Hasegawa-like likelihood ratio test of branch support values, which range from 0 to 1. The support values of each internal branch or bipartition are parsed to compute the mean support value for each tree. Trees with a mean support value below a cutoff threshold are discarded. In addition, a consensus tree is computed from the collection of trees that passed filters i and ii, and the Robinson-Foulds distance is computed between each gene tree and the consensus tree.

  * Parameters controlling filtering based on mean support values.
  -m <real> min. average support value (0.7-0.8 are reasonable values) 
 		for trees to be selected [default: 0.75]
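The mean-support filter can be sketched in a few lines of Python. The sketch assumes that support values appear as numeric labels directly after the closing parenthesis of each internal node in the Newick string. The function names and the example trees are hypothetical and this is not the pipeline's own parser.

```python
import re

# Matches a numeric label placed right after ")", e.g. ")0.95:0.05".
SUPPORT_LABEL = re.compile(r"\)(\d+(?:\.\d+)?)")


def mean_support(newick):
    """Mean of the internal-branch support values found in a Newick string."""
    values = [float(v) for v in SUPPORT_LABEL.findall(newick)]
    if not values:
        raise ValueError("no internal support labels found")
    return sum(values) / len(values)


def select_trees(trees, min_mean_support=0.75):
    """Names of trees whose mean support reaches the cutoff (cf. -m)."""
    return sorted(
        name for name, newick in trees.items()
        if mean_support(newick) >= min_mean_support
    )
```

A tree such as `((A:0.1,B:0.2)0.95:0.05,(C:0.1,D:0.3)0.85:0.02);` has a mean support of 0.9 and passes the default cutoff of 0.75.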

iv) Evaluating the global molecular clock hypothesis.

run_get_phylomarkers_pipeline.sh calls the auxiliary script run_parallel_molecClock_test_with_paup.sh to evaluate the global molecular clock hypothesis on the top markers, selected according to the criteria explained in the three previous points. The script calls paup* to evaluate the free-rates and clock hypotheses on codon alignments.
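A common way to compare the two hypotheses is a likelihood ratio test, sketched below. Whether the auxiliary script uses exactly this test is an assumption here. The function name is hypothetical and the log-likelihoods would come from the paup* output.

```python
def clock_lrt(lnl_free, lnl_clock, n_taxa):
    """Likelihood ratio test statistic and degrees of freedom.

    lnl_free  : log-likelihood under the unconstrained (free-rates) model
    lnl_clock : log-likelihood under the global clock model
    n_taxa    : number of sequences in the alignment

    An unrooted tree has 2n - 3 free branch lengths, a clock tree has
    n - 1 node heights, so the models differ by n - 2 parameters.
    """
    if n_taxa < 3:
        raise ValueError("at least 3 taxa are needed")
    statistic = 2.0 * (lnl_free - lnl_clock)
    degrees_of_freedom = n_taxa - 2
    return statistic, degrees_of_freedom
```

The statistic is then compared against a chi-square distribution with the returned degrees of freedom, for instance with `scipy.stats.chi2.sf(statistic, degrees_of_freedom)` if SciPy is available.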

v) On tree searching:

run_get_phylomarkers_pipeline.sh performs tree searches using the FastTree ML algorithm. This program strikes an excellent compromise between speed and accuracy, running on both DNA and protein sequence alignments. It computes the above-mentioned Shimodaira-Hasegawa-like likelihood ratio test of branch support values. A limitation, though, is that it implements only very few substitution models. However, for divergent sequences of different species within a bacterial taxonomic genus or family, our experience has shown that the GTR+G model is almost invariably selected by jmodeltest2, particularly when there is base frequency heterogeneity. GTR+G+CAT is the substitution model used by the run_get_phylomarkers_pipeline.sh calls of FastTree on codon alignments. The gene trees are computed with high accuracy by performing a thorough tree search, as hardcoded in the following FastTree call:

 	-nt -gtr -gamma -bionj -slownni -mlacc 3 -spr 8 -sprlength 8 
 	
  For concatenated codon alignments, tree searches may take a considerable time (up to several hours)
  for large datasets (~ 100 taxa and > 300 concatenated genes). The user can therefore choose to run 
  FastTree at different levels of tree-search thoroughness: high|medium|low|lowest 
  
  high:   -nt -gtr -bionj -slownni -gamma -mlacc 3 -spr 4 -sprlength 8
  medium: -nt -gtr -bionj -slownni -gamma -mlacc 2 -spr 4 -sprlength 8 
  low:    -nt -gtr -bionj -slownni -gamma -spr 4 -sprlength 8 
  lowest: -nt -gtr -gamma -mlnni 4
  
  where -s $spr and -l $spr_length can be set by the user. 
  The lines above show their default values.
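The four thoroughness levels for codon alignments can be written down as a small lookup, sketched here in Python. The option strings are copied from the lines above. The function name is hypothetical and this is not how the pipeline itself assembles the call.

```python
def fasttree_dna_args(level="high", spr=4, spr_length=8):
    """FastTree options for a concatenated codon alignment (sketch).

    `spr` and `spr_length` mirror the user-settable values mentioned
    above; their defaults are 4 and 8.
    """
    model = ["-nt", "-gtr"]
    search = ["-bionj", "-slownni", "-gamma"]
    spr_opts = ["-spr", str(spr), "-sprlength", str(spr_length)]
    if level == "high":
        return model + search + ["-mlacc", "3"] + spr_opts
    if level == "medium":
        return model + search + ["-mlacc", "2"] + spr_opts
    if level == "low":
        return model + search + spr_opts
    if level == "lowest":
        # The lowest level uses no SPR moves, so spr/spr_length are ignored.
        return model + ["-gamma", "-mlnni", "4"]
    raise ValueError("level must be high, medium, low or lowest")
```

Joining the returned list with spaces reproduces the corresponding line of the table, for example `" ".join(fasttree_dna_args("medium"))`.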
  
  For protein alignments, the search parameters are the same; only the model changes to LG (-lg):
  
  high: -lg -bionj -slownni -gamma -mlacc 3 -spr 4 -sprlength 8
  ...
  
  Please refer to the FastTree manual for the details.