-
Notifications
You must be signed in to change notification settings - Fork 63
Unifying Phylogenetics #263
Comments
Just a few scattered thoughts. There might an advantage to try and use a data type that can easily be converted into the "phylo" class in the R package ape: the core for many specialized R packages for phylogenetic comparative methods of all kinds (see this taks view. The "phylo" class in
An often implicit assumption of functions in The data type in PhyloNetworks.jl is completely different: nodes link to their edges and edges link to their nodes to allow for fast tree traversals. It also allows for an easy manipulation of trees when searching for the best topology: to perturb the tree locally (NNI moves, removal/addition of a hybrid edge etc). It looks like the data structure in PhyloTrees.jl has matrices like in ape, but not a single |
I'm a regular user of ape myself - mutation counting code and distance code I've just completed is based very much on its C code :). I'm all for interoperating with ape. In the current Bio.Phylo type, the similar assumptions to ape are made regarding the numbering of the nodes. The new type will do as both PhyloNetworks and PhyloTrees do and have edges and node types, rather than use the LightGraphs. But conversion to the matrix format should definitely be possible! |
A bit of a summary of the PhyloTrees data structures... In PhyloTrees.jl, a tree is a vector of node objects, and a vector of edge objects. The indices of nodes and edges in these vectors serve as the node and edge IDs. Each node knows of it's branches, and each branch knows of it's nodes. Branch length in the branch type is a I'm unsure if the redundancy in this data structure is worthwhile or not. I haven't run any comparisons. It certainly made writing code easier for me though.
|
@cecileane Furthur to your suggestion of ape interoperability, I've just been reading Rcall.jl and it looks like it should be trivial to get the ape phylogenetic tree structures from R as we could just process the Rcall.RObject variable that is returned from the |
Just to add a few cents of mine, I came to julia after developing a phylogeny-using package in R, and found @Ward9250 's Phylo types. After getting used to the graph-like implementation I found it MUCH easier to program with than the ape format. For instance, ape uses several different sorting orders of nodes because some sorting orders are more efficient for some processes. The sorting order has to be guessed from the position of nodes in the edge matrix. Many internal ape functions rely on shifting sorting order (ape-package.ird.fr/misc/FormatTreeR_24Oct2012.pdf). In @Ward9250 's code you just use a different iterator, e.g. a DepthFirst, when that is most appropriate. I have heard many express chagrin over the issues caused in R over having that legacy format. I wrote some code a year ago for plotting phylogenies to compose contexts (see https://drive.google.com/file/d/0BwWjA0Ez4iLhRTVVTEpqTXJiTXc/view?usp=sharing , scroll down for plots), and comparing that relatively clean and legible code to the ape::plot.phylo code is instructive. I would thus advise against basing julia code on this format. That said, it'd be very easy to calculate ape-trees from the Phylo format, and put it into an R session, something like (untested code, as this is based on my old version of @Ward9250 s branch ): using RCall, Bio.Phylo
# calculate numerics
ids = Dict(PhyNode, Int)
tips = 0
nodes = 0
ntips = terminaldescendents(phy)
totalnodes = length(DepthFirst(phy))
Nnode = ntips - totalnodes
# get the tip.label vector
tiplabel = [node.name for node in DepthFirst(phy) if isleaf(node)]
# assign ape ids to nodes and leafs
for clade in DepthFirst(phy)
ids[clade] = isleaf(clade) ? tips += 1 : ntips + nodes += 1
end
# build the edge matrix
edge = Matrix{Int}(totalnodes - 1, 2)
for (i, clade) in enumerate(DepthFirst(phy)[2:end])
edge[i, 1] = ids[parent(clade)]
edge[i, 2] = ids[clade]
end
# edgelengths and such are easy to add
R"""
library(ape)
phy = list(Nnode = $Nnode, edge = $edge, tip.label = $tiplabel)
class(phy) = "phylo"
plot(phy)
""" |
@mkborregaard For the benefit of everyone, since you have an old branch could you please copy-paste the old definition of phylogenies and nodes here as @jangevaare did above with PhyloTrees? |
Sure. Here they are with documentation and constructors - I can remove that for clarity if desired. """
Phylogeny represents a phylogenetic tree.
A tree can have:
- `name`
- `root`
- `rooted`
- `rerootable`
"""
type Phylogeny
name::String
root::PhyNode
rooted::Bool
rerootable::Bool
Phylogeny() = new("", PhyNode(), false, true)
end
# Phylogeny Node Type
# --------------------
"""
PhyNode represents a node in a phylogenetic tree.
A node can have:
- `name`
- `branchlength`
- a reference to its `parent` PhyNode
- reference to one or more `children`
"""
type PhyNode
name::String
branchlength::Nullable{Float64}
confidence::Nullable{Float64}
children::Vector{PhyNode}
parent::PhyNode
"""
Create a PhyNode.
PhyNodes represent nodes in a phylogenetic tree. All arguments are optional
when creating PhyNodes:
**Example:**
one = PhyNode()
two = PhyNode(name = "two", branchlength = 1.0, parent = one)
**Parameters:**
* `name`: The name of the node (optional). Defaults to an empty string, indicating
the node has no name.
* `branchlength`: The branch length of the node from its parent (optional).
Because branch lengths can be not applicable (cladograms), or not known.
The value is Nullable, and is null by default.
* `confidence`: A Nullable Floating point value that can represent confidence for that clade (optional).
Such confidences are usually Maximum Likelihood values or Bootstrap
values.
* `children`: A Vector containing references to the PhyNodes that are children of this node (optional).
Default to an empty vector.
* `parent`: The parent node (optional). Defaults to a self-reference, indicating
the node has no parent.
"""
function PhyNode(name::String = "",
branchlength::Nullable{Float64} = Nullable{Float64}(),
confidence::Nullable{Float64} = Nullable{Float64}(),
children::Vector{PhyNode} = PhyNode[],
parent = nothing)
x = new()
name!(x, name)
branchlength!(x, branchlength)
confidence!(x, confidence)
if parent != nothing
graft!(parent, x)
else
x.parent = x
end
x.children = PhyNode[]
for child in children
graft!(x, child)
end
return x
end
end |
But I am sure the current implementation based on LightGraphs is better - I have just not started using it yet, and the code I suggested for transferring to R is just back-of-the-envelope. |
I understand it's back of envelope, but I thought it was nessecary to help everyone understand how things were before I made the LightGraphs implementation of phylogenies. And whilst I made that implementation for what I felt were some solid reasons, if it's so dissimilar to what people using julia for phylogenetics want to do (such as PhyloNetworks and PhyloTrees) that it's not useful, then it fails in its primary objective, no matter my reasons for forcing the switch to LightGraphs on people. So going back towards a linked data structure - which it originally was, and the other two packages use - may well be better, I've already hit on some limitations to the LightGraphs implementation similar to the ape format with regards to node ordering and so on, and I've been dreading addressing them. |
I opened an issue at RCall that might be relevant to this: JuliaInterop/RCall.jl#138 |
So to summarise the previous iteration of Phylo posted above. The design did not use edges and nodes, but only nodes. So maybe we should talk about design of edges and nodes. Maybe start with edges? An edge looks like: type Branch
source::Int64
target::Int64
length::Nullable{Float64}
end Where source and target refer to the two nodes connecting them. In PhyloNetworks, edges are defined like so: type Edge
number::Int
length::Float64 #default 1.0
hybrid::Bool
y::Float64 # exp(-t), cannot set in constructor for congruence
z::Float64 # 1-y , cannot set in constructor for congruence
gamma::Float64 # set to 1.0 for tree edges, hybrid?gamma:1.0
node::Array{ANode,1} # we can also leave blank: node (see issues.jl)
isChild1::Bool # used for hybrid edges to set the direction (default true)
isMajor::Bool # major edge treated as tree edge for network traversal
# true if gamma>.5, or if it is the original tree edge
inCycle::Int # = Hybrid node number if this edge is part of a cycle created by such hybrid node
# -1 if not part of cycle. used to add new hybrid edge. updated after edge is part of a network
containRoot::Bool # true if this edge can contain a root given the direction of hybrid edges
# used to add new hybrid edge. updated after edge is part of a network
istIdentifiable::Bool # true if the parameter t (length) for this edge is identifiable as part of a network
# updated after part of a network
fromBadDiamondI::Bool # true if the edge came from deleting a bad diamond I hybridization
# inner constructors: ensure congruence among (length, y, z) and (gamma, hybrid, isMajor), and size(node)=2
Edge(number::Int) = new(number,1.0,false,exp(-1.0),1-exp(-1.0),1.,[],true,true,-1,true,true,false)
Edge(number::Int, length::Float64) = new(number,length,false,exp(-length),1-exp(-length),1.,[],true,true,-1,true,true,false)
Edge(number::Int, length::Float64,hybrid::Bool,gamma::Float64)= new(number,length,hybrid,exp(-length),1-exp(-length),hybrid?gamma:1.,[],true,(!hybrid || gamma>0.5)?true:false,-1,!hybrid,true,false)
Edge(number::Int, length::Float64,hybrid::Bool,gamma::Float64,isMajor::Bool)= new(number,length,hybrid,exp(-length),1-exp(-length),hybrid?gamma:1.,[],true,isMajor,-1,!hybrid,true,false)
function Edge(number::Int, length::Float64,hybrid::Bool,gamma::Float64,node::Array{ANode,1})
size(node,1) != 2 ?
error("vector of nodes must have exactly 2 values") :
new(number,length,hybrid,exp(-length),1-exp(-length),hybrid?gamma:1.,node,true,(!hybrid || gamma>0.5)?true:false,-1,!hybrid,true,false)
end
function Edge(number::Int, length::Float64,hybrid::Bool,gamma::Float64,node::Array{ANode,1},isChild1::Bool, inCycle::Int, containRoot::Bool, istIdentifiable::Bool)
size(node,1) != 2 ?
error("vector of nodes must have exactly 2 values") :
new(number,length,hybrid,exp(-length),1-exp(-length),hybrid?gamma:1.,node,isChild1,(!hybrid || gamma>0.5)?true:false,inCycle,containRoot,istIdentifiable,false)
end
end There are a few possibilities here. |
One possibility is we create a Bio.Phylo AbstractBranch, which edge types inherit from, and these subtype edges are expected to have certain methods defined. This then makes them work seamlessly with higher level functions and makes things more cross compatible. The type hierarchy could even distinguish between different kinds of branches; parent-child branches, hybrid branches and so on if desired. A similar hierarchy could be applied to nodes. An alternative thing we can do is have a parametric It could look something like this: abstract EdgeData
type Edge{M<:EdgeData}
nodes::Pair{Node, Node}
length::Nullable{Float64}
data::M
end So To extend the type and include data and methods in PhyloNetworks the following then is possible: type PhyloNetworkData <: EdgeData # or any name.
hybrid::Bool
y::Float64
z::Float64
gamma::Float64
# ... and so on and so on.
end
typealias NetworkBranch Edge{PhyloNetworkData}
# And then methods specific to NetworkBranch.
Which gives you almost the same data-structure as PhyloNetworks' A similar parametric style could be applied to nodes. Let me know what you think. |
Just to understand - is your suggestion to keep the implementation of phylogenies as current in BioJulia, but extend/modify it slightly by specifying an edge type that makes it easy to port functions from e.g. PhyloNetwork? |
I'm suggesting binning the LightGraphs implementation, and the creation of Edge, and Node as the types that build phylogenies. Either through the creation of a type hierarchy and standard interface, or through parametric types in a manner above. |
So essentially basing it on the existing PhyloNetworks package, or are there things that package does not do efficiently that needs to be extended? Since there has already existed several implementations of phylogenetic types, It would be cool to have a little bit of meta-discussion of what would be the properties of the most succesful and efficient implementation overall and then design the types to fit those. Ideally, this would form a uniform framework for any operation using phylogenies in julia (making the work of derived package programmers a lot easier). Personally I would like to have a phylogeny types that allowed for:
I think the interchangeability with R types should be easy to implement for any implementation, really. |
It should be based on the commonalities in both PhyloNetworks and PhyloTrees. The basic edge types of both of these packages have commonality in type structure, and methods. The difference is in PhyloNetworks edge types have many more data fields that are network specific, and so can be moved to the parametric |
So the intention is still to keep data and topology separate? |
As much as is possible yes, the only additional data that would go into the |
The alternative to going parametric could be to create a type hierarchy and come up with an interface we document that should be adhered to. Maybe something such as: abstract AbstractBranch
abstract PhylogenyBranch <: AbstractBranch
abstract NetworkBranch <: AbstractBranch
type PhyloTreesBranch <: PhylogenyBranch
....
end
type PhyloNetworksBranch <: NetworkBranch
....
end Then we say ok everything inheriting from AbstractBranch must implement: length,
length!
to,
from,
to!,
from! and so on... So then higher level code and phylogeny and network types can benefit from generic code in Bio.Phylo, but package authors also have a much greater degree of freedom than with us going down the parametric route, which forces a certain structure on them. |
I am happy to help with this, but it is still not clear exactly what outcome we are heading towards. Is the intention that PhyloTrees and PhyloNetworks will both be subsumed / rewritten into a new common structure? I will keep tracking this issue and see what you come up with. |
Yes, I'm still working on how this will work conceptually. One of the biggest problems with the node and edge design above that I've discovered is that it is not type-safe, and circular type definitions are not possible in julia yet, and likely won't be for a long time. What I mean is the Nodes need to have Edges defined, and Edges need to have Nodes defined. You can get around that using abstract types, but then you have to be sure to use type assertions everywhere a field is accessed and you still get some performance regression. So I'm still figuring out the way I think is best to go for both performance and that works for everyone, I'll put up a PR with a few different designs and ping here when they're ready 👍 and we can go from there. |
Yes, I saw Tim Holy's response on julia-users. Is it imperative that there is a separate edge type? I am still not 100% clear what the problem was with lightgraphs (or did you check the new Graft.jl?) |
I'm veering towards LightGraphs once more, now that I can see the issue with type-uncertainty. And am looking at fusing the idea proposed above, with what was already implemented with LightGraphs. I haven't seen Graft.jl, I will take a look |
Development of Phylo is now moving over to my fork Ward9250/Bio.jl |
Bio.Phylo is moving to BioJulia/Phylogenies.jl, so I'm going to close this and open an issue with the same title over there. |
With the addition of the PhyloTrees.jl and PhyloNetworks.jl packages to the julia ecosystem (#230), it is clear that the types ad methods provided by Phylo need yet another re-work.
Whilst the current design of types in Phylo make the use of graph data structures in LightGraphs.jl, this makes them really very different from types in PhyloTrees and PhyloNetworks, meaning the types and methods in Bio.Phylo do not constitute a useful infrastructure for these other phylogenetics researchers and coders, who have developed their own phylogenetic data types.
The data types in these packages share commonalities in structure, which makes it possible to take the parametric metadata ideas in Bio.Phylo to create a common phylogenetic tree/network type useable by the authors of these packages and others for their packages and work.
Design and code drafts to follow.
Development will occur on
unify_phylo
branch.The text was updated successfully, but these errors were encountered: