Unifying Phylogenetics #1
@Ward9250 @jangevaare @cecileane @richardreeve I still think it would be excellent to have a uniform framework for phylogenies in Julia. Currently there are (at least) 4 different packages for phylogenies, with different implementations and different advantages (Phylo has good import facilities and a newick parser, PhyloNetworks allows for reticulated phylogenies and is quite featured, PhyloTrees has plotting recipes and is lightweight, Phylogenies is based on LightGraphs which is a really cutting-edge graph library). Having four different packages is not a good situation for the ecosystem. Of course individual users can just choose which package they like and forget the others, but it is hard for downstream libraries to incorporate phylogenies: two packages choosing different phylogeny libraries would become incompatible, etc. The easiest and clearest solution to this, IMHO, would be to define a PhyloBase package, much like StatsBase, which would define an abstract interface. This is an extension of Julia's concept of informal interfaces (https://docs.julialang.org/en/latest/manual/interfaces/#Interfaces-1) and the model of many of the most successful ecosystems. Phylogenies.jl already has the beginnings of such an abstract interface, which could be moved out. Would you be interested in doing this? I don't think it would limit each of your abilities to develop your packages freely. I wouldn't mind helping with some of the grunt work involved. |
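As a rough illustration of the StatsBase-style idea, here is a minimal sketch of what a shared base package might contain. The names `AbstractPhylogenySketch`, `leafnames`, `nleaves` and `ToyTree` are hypothetical and are not taken from any of the four packages; the point is only that generic code is written once against function stubs, and each package adds methods for its own type.

```julia
# --- would live in the shared base package (hypothetical names) ---
abstract type AbstractPhylogenySketch end

# A generic function with no methods yet; implementing packages add their own.
function leafnames end

# Derived behaviour, written once against the interface.
nleaves(tree::AbstractPhylogenySketch) = length(leafnames(tree))

# --- would live in an implementing package ---
struct ToyTree <: AbstractPhylogenySketch
    tips::Vector{String}
end

# The only method this package has to supply for nleaves to work.
leafnames(tree::ToyTree) = tree.tips

tree = ToyTree(["A", "B", "C", "D"])
println(nleaves(tree))
```

Downstream libraries would then depend only on the base package and accept any tree type that implements the stubs.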
I currently have implemented something approximating such a solution in richardreeve/Phylo.jl - there is a public interface in src/Interface.jl and a private API that has to be implemented by any new subclass of:
in src/API.jl. This was originally a separate package called AbstractPhylo.jl to unify the interface with PhyloTrees, but ultimately Justin wasn't interested. For my tastes it's much too complicated, but at the time it just built on the existing PhyloTrees interface. Anyway, generally I agree some consistency would be great I guess, though I'd happily delete mine given the chance if other packages implemented what I needed! |
I'd still love to see a single minimal, core Bio.jl owned Phylogenetic tree/network package that was extended based on our needs/research into our own maintained packages. The fact that we haven't been able to make progress on such a goal may suggest something like that is too idealistic though... If we all wish to continue doing our own thing (perhaps, in the interest of our own research, and not so much in the interest of success of phylogenetics in Julia in general...), I do think at the very least we should do something as @mkborregaard suggests. An interface to the several current packages, à la Plots.jl would really help. We can continue to do our own things, however we want to do them, but we shield the average users from the pedantry. |
Hi all, I think this is a great idea. I'm already aware of some current attempts to do this in @richardreeve's packages, and I'd be very happy for BioJulia/Phylogenies.jl to be said interface provider. I made noises about this quite a while ago, and I should apologise to you guys for not doing more about it for phylogenetics; my evolution-related programming this year hasn't revolved around trees but around pop-genetics on sequences. But I'm a fan of BioJulia taking an interface-oriented approach to all the packages: on a branch of my BioSequences.jl fork I've been renaming types and refining the API to be much more "trait/interface-ey". |
To avoid the problem highlighted in the xkcd, I think the interface of Phylogenies.jl, like the good interfaces of Base, should concern itself with the behaviour of phylogenies rather than the specific internal representation of the data, moving to a trait-based interface (or Concepts, if you like C++). That should then place it in a good position to allow people freedom in their exact implementations, whilst bringing order to any excess chaos. Much like how the internal representations of the substitution models were different, but the high-level behaviour is consistent. |
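One way a behaviour-based interface of this kind is often written in Julia is with trait functions on types. The sketch below assumes a made-up trait, `rootedness`, and two toy tree representations; none of these names come from Phylogenies.jl or the other packages.

```julia
# Trait values describing behaviour (hypothetical names).
abstract type Rootedness end
struct Rooted <: Rootedness end
struct Unrooted <: Rootedness end

# Two unrelated internal representations; no shared supertype is required.
struct ParentVectorTree
    parent::Vector{Int}               # parent[i] is the parent of node i, 0 for the root
end
struct EdgeListTopology
    edges::Vector{Tuple{Int,Int}}     # undirected node pairs
end

# Each implementation declares which behaviour it has.
rootedness(::Type{ParentVectorTree}) = Rooted()
rootedness(::Type{EdgeListTopology}) = Unrooted()

# Generic code dispatches on the trait, never on the representation.
hasroot(tree) = hasroot(rootedness(typeof(tree)), tree)
hasroot(::Rooted, tree) = true
hasroot(::Unrooted, tree) = false

println(hasroot(ParentVectorTree([0, 1, 1])))
println(hasroot(EdgeListTopology([(1, 2), (2, 3)])))
```

The design choice here is that a package opts in by adding one trait method, without having to change its type hierarchy.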
Yes - I was definitely suggesting something simple (i.e no AbstractNode etc) and implementation-independent. In that sense these interfaces are not standards, it's a common shared behaviour that packages can implement in addition to particulars. One place to start would just be the union (EDIT: I mean intersection) of behaviours (i.e. functions) that are already defined (possibly under different names) in all 4 packages. |
That's fine - I agree, in fact - and the AbstractNode, etc. certainly isn't compulsory, but it's worth pointing out that at the moment following at least parts of that API (addbranch!, addnode!, primarily) will give any phylogenetics package a newick parser (with some minor modifications), and the ability to translate to and from the phylo class in R. My only concern about using the BioJulia/Phylogenies.jl interface for everything is that I'm not sure what that is... @Ward9250 do you just mean all of the functions currently exported by the package? I think realistically there would have to be a serious discussion about what the interface ought to look like. At the moment, what I've done is pretty rubbish, but I would ideally like to be able to use the final interface to copy trees freely between implementations, and certainly to write code that works with anything that implements the interface, so that tools like the parser can be written exactly once. |
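To make the "written exactly once" point concrete, here is a small sketch in which a single `copytree!` function works for two toy implementations because both provide `addnode!` and `addbranch!`. The function names `addnode!` and `addbranch!` come from the comment above, but the signatures used here, the getters `nodenames` and `branches`, and both tree types are assumptions for illustration, not the actual Phylo.jl API.

```julia
# Toy implementation 1: children stored per node.
struct ChildMapTree
    children::Dict{String,Vector{String}}
end
ChildMapTree() = ChildMapTree(Dict{String,Vector{String}}())

addnode!(t::ChildMapTree, name) = (get!(t.children, name, String[]); t)
addbranch!(t::ChildMapTree, from, to) = (push!(t.children[from], to); t)
nodenames(t::ChildMapTree) = sort!(collect(keys(t.children)))
branches(t::ChildMapTree) = [(p, c) for p in nodenames(t) for c in t.children[p]]

# Toy implementation 2: a flat list of (from, to) pairs.
struct EdgePairTree
    nodes::Vector{String}
    edges::Vector{Tuple{String,String}}
end
EdgePairTree() = EdgePairTree(String[], Tuple{String,String}[])

addnode!(t::EdgePairTree, name) = (name in t.nodes || push!(t.nodes, name); t)
addbranch!(t::EdgePairTree, from, to) = (push!(t.edges, (from, to)); t)
nodenames(t::EdgePairTree) = sort(t.nodes)
branches(t::EdgePairTree) = sort(t.edges)

# Written once: copies between any two types that implement the four functions.
function copytree!(dest, src)
    foreach(n -> addnode!(dest, n), nodenames(src))
    foreach(b -> addbranch!(dest, b...), branches(src))
    return dest
end

src = ChildMapTree()
foreach(n -> addnode!(src, n), ["root", "A", "B"])
addbranch!(src, "root", "A")
addbranch!(src, "root", "B")
dest = copytree!(EdgePairTree(), src)
println(branches(dest))
```

A parser could target the same two mutating functions, which is how one parser could serve several packages.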
I very much like the idea of unifying the landscape with a "trait"-based approach: to allow different packages their own internal structure. Different implementations work best for different goals --for example the internal structure in PhyloNetworks was meant for easy transformation of networks, like prune-and-regraft etc. A couple notes about import/export between packages:
julia> using PhyloNetworks
julia> using RCall
julia> net1 = readTopology("(((A:4.0,(B)#H1:1.1::0.9):0.5,(C:0.6,#H1):1.0):3.0,D:5.0);")
PhyloNetworks.HybridNetwork, Rooted Network
9 edges
9 nodes: 4 tips, 1 hybrid nodes, 4 internal tree nodes.
tip labels: A, B, C, D
(((A:4.0,(B)#H1:1.1::0.9):0.5,(C:0.6,#H1):1.0):3.0,D:5.0);
R> $net1
$edge
[,1] [,2]
[1,] 5 6
[2,] 5 1
[3,] 6 8
[4,] 6 7
[5,] 7 2
[6,] 8 4
[7,] 8 9
[8,] 9 3
$Nnode
[1] 5
$edge.length
[1] 3.0 5.0 0.5 1.0 0.6 4.0 1.1 NA
$reticulation
[,1] [,2]
[1,] 7 9
$tip.label
[1] "D" "C" "B" "A"
attr(,"class")
[1] "evonet" "phylo"
R> library(ape)
R> $net1
Evolutionary network with 1 reticulation
--- Base tree ---
Phylogenetic tree with 4 tips and 5 internal nodes.
Tip labels:
[1] "D" "C" "B" "A"
Rooted; includes branch lengths.
To move forward, we would need a list of traits that each package should have, ideally. To get it started: |
A very useful feature would be some standard to unite a phylogeny with data. For statistical models, "fitted" models contain information about both the model and the data, so we can ask questions like Having a structure for phylogeny + data has been a struggle for R folks, with a standard that was developed but not really used. @bomeara would know better than me. The difficulty is that data comes in data frames, with taxa in rows say, but taxa may be ordered differently in a phylogeny. Another difficulty is when we want to attach character states along edges, with possible shifts and multiple states along a single edge. If we could learn from what works in R, and set a standard now before each package has its own implementation, that would be great. |
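The ordering difficulty mentioned above can be sketched in a few lines. This assumes the data arrives as plain vectors (a stand-in for data frame columns) and that the tip labels are available as a vector; the trait values are made up, and the tip order is the one printed for `net1` earlier in the thread.

```julia
tiplabels = ["D", "C", "B", "A"]        # order of tips in the phylogeny
taxa = ["A", "B", "C", "D"]             # order of rows in the data
bodylength = [1.0, 2.0, 3.0, 4.0]       # made-up trait values, one per row

# Map each taxon name to its row, then look up rows in tip order.
rowof = Dict(name => i for (i, name) in enumerate(taxa))
perm = [rowof[t] for t in tiplabels]    # throws a KeyError if a tip has no data row
aligned = bodylength[perm]

println(aligned)
```

Whatever standard is chosen, some step like this has to happen somewhere, so it seems worth doing once in shared code.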
@cecileane After going back and forth a bit with this myself, I take a similar approach and use Dicts for node data with PhyloTrees. I keep it as a separate object from the tree itself (at one time I didn't, however). I do think this is more suitable than a dataframe, but I understand it is not as integrated as what some are looking for. In fact at this time, I have no special code included in PhyloTrees.jl for managing node or branch data (branch length notwithstanding). I've been considering implementing a special Dict for nodes and branches, with an API that was more accessible for this application... |
Thanks @jangevaare. Do you handle missing data as missing keys in Dicts? To provide a wrapper, is there a standard function that takes a data frame, a phylogeny, and perhaps a formula (like y~x to indicate that y is the response, x is a predictor), and returns a dictionary for each necessary trait, with keys restricted to tips in the phylogeny? I could use such a function for what I am doing right now. For our work on trait evolution (in PhyloNetworks), we built on tools from the GLM package, which uses data frames for input data. So a standard for data frames could still be useful. |
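A minimal sketch of the kind of wrapper asked about here, under several assumptions: a NamedTuple of columns stands in for a data frame, the phylogeny is represented only by its tip labels, and trait names are passed directly instead of a formula. The function name `traitdicts` is hypothetical and does not exist in any of the packages discussed.

```julia
# Stand-in for a data frame: a NamedTuple of equal-length columns (made-up values).
table = (taxon = ["C", "A", "E", "B"],
         y = [1.0, 2.0, 3.0, 4.0],
         x = [0.1, 0.2, 0.3, 0.4])

# Hypothetical helper: one Dict per trait, keyed by taxon, restricted to the tips.
function traitdicts(table, tipnames, traits; taxoncol = :taxon)
    keep = Set(tipnames)
    taxa = getproperty(table, taxoncol)
    out = Dict{Symbol,Dict{String,Float64}}()
    for trait in traits
        column = getproperty(table, trait)
        out[trait] = Dict(t => v for (t, v) in zip(taxa, column) if t in keep)
    end
    return out
end

tipnames = ["A", "B", "C", "D"]          # tip labels taken from the phylogeny
d = traitdicts(table, tipnames, (:y, :x))
println(d[:y])
```

Taxon "E" is dropped because it is not a tip, and tip "D" simply has no key because it has no data row.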
@cecileane Yep! For the key I use an Int64 node id, which corresponds to my definition of
I'm not 100% sure that adding an extra key lookup is the best approach, even though it is very fast to do. It is very simple from a developer's and maintainer's POV. This is where I think @richardreeve and I disagreed (@richardreeve went on to define parametric nodes in his package IIRC). To get back on topic: I currently don't have Newick string functionality, so I will look to all of your packages that currently do, and lift and modify for my own use, license depending :) |
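For readers following along, the Dict-based approach described here might look roughly like the sketch below. It assumes node data lives outside the tree, keyed by an Int64 node id, with missing data represented by an absent key; the helper name `lookup` and the values are made up, and this is not code from PhyloTrees.jl.

```julia
# Node data kept outside the tree, keyed by integer node id (made-up values).
bodymass = Dict{Int64,Float64}(1 => 12.5, 2 => 9.1, 4 => 30.0)

# Missing data is simply an absent key; `get` supplies a fallback value.
lookup(data, node) = get(data, node, missing)

tipids = [1, 2, 3, 4]
observed = [lookup(bodymass, n) for n in tipids]

println(observed)
```

The extra key lookup is the cost being weighed in the comment above; the benefit is that the tree type itself needs no knowledge of the data.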
Phylogenies.jl, which uses LightGraphs, could use MetaGraphs.jl for the phylogeny+data, that would be very efficient (but that's an implementation detail, not an interface detail so I guess that wouldn't need coordination among packages??). @Ward9250 given your suggestion to let your abstract types provide a basis for the abstract interface (and if people can get behind that), what would you need me to do to help you get that rolling? @cecileane just out of curiosity, where did you end up getting PhyloNetworks out? I still can't believe the Sys Bio editor rejected it on two utterly positive reviews. |
Actually I think abstract types for nodes and edges might be very good for compatibility, even more flexible and powerful than a newick-string-based method of transforming from one representation to the other, if it is well designed. Say there is a type of tree which is a subtype (<:) of an AbstractTree (or Phylogeny); it has traits of, say, nodetype and branchtype (in the same way collections have |
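A sketch of what `nodetype` and `branchtype` traits could look like, modelled on how Base defines `eltype` on types. The abstract type name `AbstractTreeSketch` and the concrete types are hypothetical, and encoding the node and branch types as type parameters is only one possible design.

```julia
# Hypothetical abstract tree type, parameterised by its node and branch types.
abstract type AbstractTreeSketch{N,B} end

# Trait functions defined on the type, with instance methods forwarding to them.
nodetype(::Type{<:AbstractTreeSketch{N,B}}) where {N,B} = N
branchtype(::Type{<:AbstractTreeSketch{N,B}}) where {N,B} = B
nodetype(tree::AbstractTreeSketch) = nodetype(typeof(tree))
branchtype(tree::AbstractTreeSketch) = branchtype(typeof(tree))

# A toy implementation.
struct LabelNode
    label::String
end
struct LengthBranch
    from::Int
    to::Int
    length::Float64
end
struct TypedTree <: AbstractTreeSketch{LabelNode,LengthBranch}
    nodes::Vector{LabelNode}
    branches::Vector{LengthBranch}
end

typedtree = TypedTree([LabelNode("root"), LabelNode("A")], [LengthBranch(1, 2, 0.5)])
println(nodetype(typedtree), " ", branchtype(TypedTree))
```

Conversion code could then query these traits to decide how to construct nodes and branches in the destination tree.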
thanks for asking, @mkborregaard! PhyloNetworks went to MBE. The SB editor mentioned a preference for standalone packages as opposed to R packages, for instance --what can we do. I completely agree that a newick-string-based approach would be inefficient, and a trait-based approach would extend possibilities to handle many things beyond what we can store in the newick string. |
MBE's a nice outcome too, great. Yeah not wanting a julia package is 100% fair but then it would seem more natural to reject it editorially IMHO. |
Yeah @Ward9250 I've been looking at the exported functions from all 4 packages today, and I agree it's nice and powerful with an AbstractNode/AbstractEdge -based function system as well, but will all implementations support it? I'm not sure PhyloNetworks will? |
PhyloNetworks too should work the first way: like |
Oh cool, I missed that. Then it would appear that AbstractPhylogeny, AbstractNode, and maybe AbstractEdge and AbstractDataPhylogeny could be the types of a common interface for all? That's cool. |
Hi all, I'm afraid I'm in Tanzania at the moment without great internet, but I definitely like the direction this is going. As I've said elsewhere, my main interest isn't phylogenetics, and I'm more than happy to remove my package if it causes trouble and something else covers what I need - I just created it to scratch my own itch because I couldn't get other packages to work for me. In particular, what I most needed was and a random tree generator, handling of at least tip traits, and possibly internal node and edge traits too, and then an R and newick interface turned out to be handy too.
|
On more specific points about the interface:
|
Last one, I promise! As far as the interface itself is concerned, we're talking above about the function signatures, which seems a bit premature. What is the actual functionality we need for interoperability? I think we need to be able to:
Things that we might then be able to provide at a high level (not in individual packages):
On things that perhaps don't belong:
I'm sure there's lots more, but those are my thoughts for now. |
I think the first thing to think about is a shared interface. Interoperability is also cool, but I think separate. Also, things like building a phylogeny on MetaGraphs is implementation (wrt loops in the structure PhyloNetworks has that - but you could enforce that by using the SimpleGraph from LightGraphs). I do agree with getting an overview of shared functionality, and with the list @richardreeve provided above. I've gone through the different packages and tried to compile a list of all the implemented functions, to scope how much functionality is overlapping. I've compiled a draft into a worksheet, here https://docs.google.com/spreadsheets/d/1Y1MvYA5AMs6Fue7xFe8-ghq-oH7sWKXA3k666mmkIAM/edit?usp=sharing Actually for what I wrote above, it turns out that Phylo, PhyloTrees and Phylogenies all have 💯 👍 for having different types of iterators for traversal, to use the Distributions interface for randomness and the StatsBase interface for statistical functions |
@richardreeve when you talk about a wrapper, do you mean something like this?
module PhyloBase
    abstract type AbstractNode end
    # Interface stubs: implementing packages add methods for their own types
    isleaf(node::AbstractNode) = error("isleaf not implemented for $(typeof(node))")
    # root should return an AbstractNode
    root(phy) = error("root not implemented for $(typeof(phy))")
end
module MyPhylogeny
    import PhyloBase
    # Phylogeny and ntips are this package's own tree type and tip count (not shown)
    struct BaseNode <: PhyloBase.AbstractNode
        phy::Phylogeny
        node::Int
    end
    myroot(phy) = 1
    myisleaf(phy, node) = node <= ntips(phy)
    PhyloBase.isleaf(node::BaseNode) = myisleaf(node.phy, node.node)
    PhyloBase.root(phy::Phylogeny) = BaseNode(phy, myroot(phy))
end
which keeps the shared interface and package interface separate (in that PhyloBase functions only accept and return Base and PhyloBase types)
|
regarding newick parsing (to answer your question @richardreeve), PhyloNetworks is also limited to plain newick trees, not nexus files (so taxon names need to be in the trees). We can read multiple trees (or networks) in the same file. Metadata about branch support is ignored. Other metadata could cause issues. So yes, I agree that newick parsing is tricky and is a last resort option for interoperability. About interfaces like |
There are ways of handling node data that don't get invalidated when the topology changes, unlike data stored in a DataFrame where a node corresponds to a row (and thus has to be considered as part of a linear order, as described by Richard above). E.g. a Phylogeny could store data as a |
@cecileane The inheritance from
I don't know how that kind of information is stored in the tree - does it fit in the standard newick format, in which case 1. would presumably be best, or is it held outside somehow, in which case 2. would presumably be a better approach? As far as storing associated data, I absolutely agree that some standardisation would be fantastic. For me, it's the main thing, and in fact the phylogeny is really the added data, with species abundances and locations of samples, etc. as core. There's no reason this can't work either way though. At the moment I fix the leaf order when I create a tree by using an OrderedDict to hold the nodes, and then reread the leaf names as I bring the tree into my Diversity package to check that worked and use that order for my other data (which is in some subtype of |
Nice work on tidytree in R, showing the need for a data structure combining a phylogeny with data along edges & nodes. It also gives examples of getters: |
This continues from BioJulia/Bio.jl issue number 263