2023/08/07 PGI2: Parsimov-scoared Genetic Inference Method
This package implements the PGi algorithm for the analysis of the evolution of developmental sequences. Many supporting functions and the basic structure of the vignette are inspired or based on functions/vignettes available in the R APE package.
Please refer to the following article for background on the algorithm. This is also the prefered citation for PGi:
Please contact Luke Harrison if you have any comments/questions or bug reports etc. [email protected]
To install PGi2, ensure an up-to-date version of R is installed.
Installation from the assemebed R source pacakge:
- Download the assemled PGi package "pgi2_2.1-1.tar.gz" in this repository
- Open R, change to the directory containing the package and install the PGi2 package in R:
install.packages("pgi2_2.1-1.tar.gz")
libray(pgi2)
Installation from GitHub
- Open R
library(devtools)
install_github("lukebharrison/PGi2")
libray(pgi2)
Please refer to the R documentation of this package and its functions, each available in the R documentation system after the package is loaded. The key functions are pgi.read.nexus(), pgi(), pgi.supercon().
Please also refer to the R vignette: Running a Small Analysis for a worked example of the execution of PGi2 using Velhagen's (1997) data set.
-
Enter the dataset into a NEXUS file with the tree topology (see vignette for a worked example) Notes: enter the RANKED data as discrete characters. So for example, if the sequence is 1-2-4-3 (= A-B-D-C), then:
character 1(A) = 1
character 2(B) = 2
character 3(C) = 4
character 4(D) = 3
For ranks between 10 and 35 use A,B,C,...,Y, in the nexus file and so on. Enter unscorable/missing data as Z.
It is possible to have 36+ but the the pgi data structure must be manually edited.
Make sure the taxon names have no spaces and aren't too long (it'll copy those too).
Enter a single phylogenetic topology and save it in to the nexus file in the standard parenthetical format (newick). -
Load the data into PGi, using the function
my_tree<-pgi.read.nexus("filename")
more details in ?pgi.read.nexus
Run the PGi Algorithm in Interative Mode
pgi(interactive=T) and follow the instruction prompts
A: The implementation in PGi2 is based on the R parallel/foreach/doMC packages. In PGi2, only the computation of fully independent runs is parallelized. The number of runs is controlled by the nruns variable in the pgi() function: using a vector of two integers nruns=c(nseries,nparallel), where any integer greater than 1 will active parallel processing. The total number of runs is the product of nseries*nparallel. For a clustering-based systems, you would need to execute R/PGi2 on a single computational node only and use multiple cores of that node, not simply submit to the job scheduler.
Q: What parameters of cycles/replicates/retained ancestral sequences should be used for a given data set?
A: Please refer to the PGi article above in Systematic Biology for theoretical details; PGi is a heuristic algorithm, and the solution space of ancestral sequences and heterochronies increases with the number of nodes and sequence length. The objective of the analysis is to sample enough of the solution space for a given tree and set of sequences to arrive at a set of solutions from independent computations that are more or less equally parsimonious in terms of tree length (total number of sequence heterochronies, each run doesn't have to yield necessarily identical lengths, just close). This is to ensure enough sampling of the solution space. This is similar to a bayesian MCMC analysis where convergence of the MCMC chains is desired - although PGi is not a Markov-chain based method). For example, if 4 independent runs with fixed parameters are performed and the tree lengths and consensus solutions are highly divergent, then the analysis is not being run with adequate parameters (in terms of cycles/replicates/retained sequences).
There are several concerns to also be aware of: if there is a high degree of simultaneity in the data sets, there will be many more possible solutions of similar length - this will require more computation and will also mean that you will discover many different possible solutions that, when summarized in a superconensus of the Indvidual runs, will likely not recover very many heterochronies with a high degree of support.
Practically speaking: the analysis should be run with the largest values for cycles/replicates/ret.anc.seqs that are computationally feasible to achieve convergences of multiple independent runs. As an example, the Sanchez-Villagra data set (2002) of 24 events and 10 sequences was analyzed with cycles=150, replicates=150, ret.anc.seqs=150 with good results. The Harington et al. (2013) analysis of 25 elements and 30 sequences yielded acceptable results at 100/100/100. These values do not need to be identical, but in practice, I've typically scaled them together.
Please consult the R documentation and the PDF Vignette available in the R package and on this repository for further details.
Don't hesitate to contact me (Luke Harrison) - [email protected] with any bugs, comments, suggestions or questions.