Skip to content
This repository was archived by the owner on Aug 26, 2023. It is now read-only.

Unifying Phylogenetics #263

Closed
TransGirlCodes opened this issue Aug 17, 2016 · 25 comments
Closed

Unifying Phylogenetics #263

TransGirlCodes opened this issue Aug 17, 2016 · 25 comments

Comments

@TransGirlCodes
Copy link
Member

With the addition of the PhyloTrees.jl and PhyloNetworks.jl packages to the julia ecosystem (#230), it is clear that the types ad methods provided by Phylo need yet another re-work.

Whilst the current design of types in Phylo make the use of graph data structures in LightGraphs.jl, this makes them really very different from types in PhyloTrees and PhyloNetworks, meaning the types and methods in Bio.Phylo do not constitute a useful infrastructure for these other phylogenetics researchers and coders, who have developed their own phylogenetic data types.

The data types in these packages share commonalities in structure, which makes it possible to take the parametric metadata ideas in Bio.Phylo to create a common phylogenetic tree/network type useable by the authors of these packages and others for their packages and work.

Design and code drafts to follow.
Development will occur on unify_phylo branch.

@cecileane
Copy link

Just a few scattered thoughts.

There might an advantage to try and use a data type that can easily be converted into the "phylo" class in the R package ape: the core for many specialized R packages for phylogenetic comparative methods of all kinds (see this taks view.
RCall.jl makes it very easy to call R within Julia, at least if data types are simple (like matrices). I am really all for native Julia code, but for now users have a lot of phylogenetics tools in R. A data structure that is easy to export to R could facilitate a smooth transition of R users towards Julia.

The "phylo" class in ape is basically a list:

  • edge: matrix of integers, one row per edge, 2 columns: one for the index of the parent node, one for the child node
  • Nnode: integer, number of internal nodes
  • tip.label: vector of strings for the tip labels
  • edge.length (optional): numeric vector
  • nodel.label (optional): label of internal nodes
  • root.edge (optional): edge length

An often implicit assumption of functions in ape and related packages, is that nodes 1 through n are the n tips, and node n+1 is the root.

The data type in PhyloNetworks.jl is completely different: nodes link to their edges and edges link to their nodes to allow for fast tree traversals. It also allows for an easy manipulation of trees when searching for the best topology: to perturb the tree locally (NNI moves, removal/addition of a hybrid edge etc).

It looks like the data structure in PhyloTrees.jl has matrices like in ape, but not a single edge matrix: so there is some redundant information because each node knows the index of its edges, and each edge knows the index of its nodes, which should ease tree traversals.

@TransGirlCodes
Copy link
Member Author

I'm a regular user of ape myself - mutation counting code and distance code I've just completed is based very much on its C code :). I'm all for interoperating with ape. In the current Bio.Phylo type, the similar assumptions to ape are made regarding the numbering of the nodes. The new type will do as both PhyloNetworks and PhyloTrees do and have edges and node types, rather than use the LightGraphs. But conversion to the matrix format should definitely be possible!

@jangevaare
Copy link
Member

jangevaare commented Aug 17, 2016

A bit of a summary of the PhyloTrees data structures...

In PhyloTrees.jl, a tree is a vector of node objects, and a vector of edge objects. The indices of nodes and edges in these vectors serve as the node and edge IDs. Each node knows of it's branches, and each branch knows of it's nodes. Branch length in the branch type is a Nullable{Float64}. Pretty barebones. I can include the code (save the constructors) below easily.

I'm unsure if the redundancy in this data structure is worthwhile or not. I haven't run any comparisons. It certainly made writing code easier for me though.

type Node
  in::Vector{Int64}
  out::Vector{Int64}
end

type Branch
  source::Int64
  target::Int64
  length::Nullable{Float64}
end

type Tree
  nodes::Vector{Node}
  branches::Vector{Branch}
end

@TransGirlCodes
Copy link
Member Author

TransGirlCodes commented Aug 18, 2016

@cecileane Furthur to your suggestion of ape interoperability, I've just been reading Rcall.jl and it looks like it should be trivial to get the ape phylogenetic tree structures from R as we could just process the Rcall.RObject variable that is returned from the rcall / reval functions. Putting trees into R from julia might be a bit more complex, I haven't read so far ahead yet, but if I can't glean from the docs how that might be done I'll ask the authors.

@mkborregaard
Copy link

mkborregaard commented Aug 18, 2016

Just to add a few cents of mine, I came to julia after developing a phylogeny-using package in R, and found @Ward9250 's Phylo types. After getting used to the graph-like implementation I found it MUCH easier to program with than the ape format. For instance, ape uses several different sorting orders of nodes because some sorting orders are more efficient for some processes. The sorting order has to be guessed from the position of nodes in the edge matrix. Many internal ape functions rely on shifting sorting order (ape-package.ird.fr/misc/FormatTreeR_24Oct2012.pdf). In @Ward9250 's code you just use a different iterator, e.g. a DepthFirst, when that is most appropriate.

I have heard many express chagrin over the issues caused in R over having that legacy format. I wrote some code a year ago for plotting phylogenies to compose contexts (see https://drive.google.com/file/d/0BwWjA0Ez4iLhRTVVTEpqTXJiTXc/view?usp=sharing , scroll down for plots), and comparing that relatively clean and legible code to the ape::plot.phylo code is instructive. I would thus advise against basing julia code on this format.

That said, it'd be very easy to calculate ape-trees from the Phylo format, and put it into an R session, something like (untested code, as this is based on my old version of @Ward9250 s branch ):

using RCall, Bio.Phylo

# calculate numerics
ids = Dict(PhyNode, Int)
tips = 0
nodes = 0
ntips = terminaldescendents(phy)
totalnodes = length(DepthFirst(phy))
Nnode = ntips - totalnodes

# get the tip.label vector
tiplabel = [node.name for node in DepthFirst(phy) if isleaf(node)]

# assign ape ids to nodes and leafs
for clade in DepthFirst(phy)
    ids[clade] = isleaf(clade) ? tips += 1 : ntips + nodes += 1
end

# build the edge matrix
edge = Matrix{Int}(totalnodes - 1, 2)
for (i, clade) in enumerate(DepthFirst(phy)[2:end])
    edge[i, 1] = ids[parent(clade)]
    edge[i, 2] = ids[clade]
end

# edgelengths and such are easy to add

R"""
library(ape)
phy = list(Nnode = $Nnode, edge = $edge, tip.label = $tiplabel)
class(phy) = "phylo"
plot(phy)
"""

@TransGirlCodes
Copy link
Member Author

@mkborregaard For the benefit of everyone, since you have an old branch could you please copy-paste the old definition of phylogenies and nodes here as @jangevaare did above with PhyloTrees?

@mkborregaard
Copy link

mkborregaard commented Aug 18, 2016

Sure. Here they are with documentation and constructors - I can remove that for clarity if desired.

"""
Phylogeny represents a phylogenetic tree.

A tree can have:

- `name`
- `root`
- `rooted`
- `rerootable`
"""
type Phylogeny
    name::String
    root::PhyNode
    rooted::Bool
    rerootable::Bool

    Phylogeny() = new("", PhyNode(), false, true)
end


# Phylogeny Node Type
# --------------------

"""
PhyNode represents a node in a phylogenetic tree.

A node can have:

- `name`
- `branchlength`
- a reference to its `parent` PhyNode
- reference to one or more `children`
"""
type PhyNode
    name::String
    branchlength::Nullable{Float64}
    confidence::Nullable{Float64}
    children::Vector{PhyNode}
    parent::PhyNode
    """
    Create a PhyNode.
    PhyNodes represent nodes in a phylogenetic tree. All arguments are optional
    when creating PhyNodes:

    **Example:**
    one = PhyNode()
    two = PhyNode(name = "two", branchlength = 1.0, parent = one)

    **Parameters:**
    * `name`:         The name of the node (optional). Defaults to an empty string, indicating
                      the node has no name.

    * `branchlength`: The branch length of the node from its parent (optional).
                      Because branch lengths can be not applicable (cladograms), or not known.
                      The value is Nullable, and is null by default.

    * `confidence`:   A Nullable Floating point value that can represent confidence for that clade (optional).
                      Such confidences are usually Maximum Likelihood values or Bootstrap
                      values.

    * `children`:     A Vector containing references to the PhyNodes that are children of this node (optional).
                      Default to an empty vector.

    * `parent`:       The parent node (optional). Defaults to a self-reference, indicating
                      the node has no parent.
"""
    function PhyNode(name::String = "",
                     branchlength::Nullable{Float64} = Nullable{Float64}(),
                     confidence::Nullable{Float64} = Nullable{Float64}(),
                     children::Vector{PhyNode} = PhyNode[],
                     parent = nothing)
        x = new()
        name!(x, name)
        branchlength!(x, branchlength)
        confidence!(x, confidence)
        if parent != nothing
            graft!(parent, x)
        else
            x.parent = x
        end
        x.children = PhyNode[]
        for child in children
            graft!(x, child)
        end
        return x
    end
end

@mkborregaard
Copy link

mkborregaard commented Aug 18, 2016

But I am sure the current implementation based on LightGraphs is better - I have just not started using it yet, and the code I suggested for transferring to R is just back-of-the-envelope.

@TransGirlCodes
Copy link
Member Author

TransGirlCodes commented Aug 18, 2016

I understand it's back of envelope, but I thought it was nessecary to help everyone understand how things were before I made the LightGraphs implementation of phylogenies. And whilst I made that implementation for what I felt were some solid reasons, if it's so dissimilar to what people using julia for phylogenetics want to do (such as PhyloNetworks and PhyloTrees) that it's not useful, then it fails in its primary objective, no matter my reasons for forcing the switch to LightGraphs on people. So going back towards a linked data structure - which it originally was, and the other two packages use - may well be better, I've already hit on some limitations to the LightGraphs implementation similar to the ape format with regards to node ordering and so on, and I've been dreading addressing them.

@mkborregaard
Copy link

I opened an issue at RCall that might be relevant to this: JuliaInterop/RCall.jl#138

@TransGirlCodes
Copy link
Member Author

TransGirlCodes commented Aug 23, 2016

So to summarise the previous iteration of Phylo posted above. The design did not use edges and nodes, but only nodes.

So maybe we should talk about design of edges and nodes. Maybe start with edges?
In PhyloTrees

An edge looks like:

type Branch
  source::Int64
  target::Int64
  length::Nullable{Float64}
end

Where source and target refer to the two nodes connecting them.

In PhyloNetworks, edges are defined like so:

type Edge
    number::Int
    length::Float64 #default 1.0
    hybrid::Bool
    y::Float64 # exp(-t), cannot set in constructor for congruence
    z::Float64 # 1-y , cannot set in constructor for congruence
    gamma::Float64 # set to 1.0 for tree edges, hybrid?gamma:1.0
    node::Array{ANode,1} # we can also leave blank: node (see issues.jl)
    isChild1::Bool # used for hybrid edges to set the direction (default true)
    isMajor::Bool  # major edge treated as tree edge for network traversal
                   # true if gamma>.5, or if it is the original tree edge
    inCycle::Int # = Hybrid node number if this edge is part of a cycle created by such hybrid node
    # -1 if not part of cycle. used to add new hybrid edge. updated after edge is part of a network
    containRoot::Bool # true if this edge can contain a root given the direction of hybrid edges
                      # used to add new hybrid edge. updated after edge is part of a network
    istIdentifiable::Bool # true if the parameter t (length) for this edge is identifiable as part of a network
                          # updated after part of a network
    fromBadDiamondI::Bool # true if the edge came from deleting a bad diamond I hybridization
    # inner constructors: ensure congruence among (length, y, z) and (gamma, hybrid, isMajor), and size(node)=2
    Edge(number::Int) = new(number,1.0,false,exp(-1.0),1-exp(-1.0),1.,[],true,true,-1,true,true,false)
    Edge(number::Int, length::Float64) = new(number,length,false,exp(-length),1-exp(-length),1.,[],true,true,-1,true,true,false)
    Edge(number::Int, length::Float64,hybrid::Bool,gamma::Float64)= new(number,length,hybrid,exp(-length),1-exp(-length),hybrid?gamma:1.,[],true,(!hybrid || gamma>0.5)?true:false,-1,!hybrid,true,false)
    Edge(number::Int, length::Float64,hybrid::Bool,gamma::Float64,isMajor::Bool)= new(number,length,hybrid,exp(-length),1-exp(-length),hybrid?gamma:1.,[],true,isMajor,-1,!hybrid,true,false)
    function Edge(number::Int, length::Float64,hybrid::Bool,gamma::Float64,node::Array{ANode,1})
        size(node,1) != 2 ?
        error("vector of nodes must have exactly 2 values") :
        new(number,length,hybrid,exp(-length),1-exp(-length),hybrid?gamma:1.,node,true,(!hybrid || gamma>0.5)?true:false,-1,!hybrid,true,false)
    end
    function Edge(number::Int, length::Float64,hybrid::Bool,gamma::Float64,node::Array{ANode,1},isChild1::Bool, inCycle::Int, containRoot::Bool, istIdentifiable::Bool)
        size(node,1) != 2 ?
        error("vector of nodes must have exactly 2 values") :
        new(number,length,hybrid,exp(-length),1-exp(-length),hybrid?gamma:1.,node,isChild1,(!hybrid || gamma>0.5)?true:false,inCycle,containRoot,istIdentifiable,false)
    end
end

There are a few possibilities here.

@TransGirlCodes
Copy link
Member Author

TransGirlCodes commented Aug 23, 2016

One possibility is we create a Bio.Phylo AbstractBranch, which edge types inherit from, and these subtype edges are expected to have certain methods defined. This then makes them work seamlessly with higher level functions and makes things more cross compatible. The type hierarchy could even distinguish between different kinds of branches; parent-child branches, hybrid branches and so on if desired. A similar hierarchy could be applied to nodes.

An alternative thing we can do is have a parametric Edge{M} type, which can be extended using different M's

It could look something like this:

abstract EdgeData

type Edge{M<:EdgeData}
    nodes::Pair{Node, Node}
    length::Nullable{Float64}
    data::M
end

So Edge{Void} would essentially be like the PhyloTrees' Branch.

To extend the type and include data and methods in PhyloNetworks the following then is possible:

type PhyloNetworkData <: EdgeData # or any name.
   hybrid::Bool
   y::Float64
   z::Float64
   gamma::Float64
# ... and so on and so on.
end


typealias NetworkBranch Edge{PhyloNetworkData}

# And then methods specific to NetworkBranch.

Which gives you almost the same data-structure as PhyloNetworks' Edge

A similar parametric style could be applied to nodes.

Let me know what you think.
@mkborregaard, @cecileane, @jangevaare

@mkborregaard
Copy link

Just to understand - is your suggestion to keep the implementation of phylogenies as current in BioJulia, but extend/modify it slightly by specifying an edge type that makes it easy to port functions from e.g. PhyloNetwork?

@TransGirlCodes
Copy link
Member Author

I'm suggesting binning the LightGraphs implementation, and the creation of Edge, and Node as the types that build phylogenies. Either through the creation of a type hierarchy and standard interface, or through parametric types in a manner above.

@mkborregaard
Copy link

mkborregaard commented Aug 23, 2016

So essentially basing it on the existing PhyloNetworks package, or are there things that package does not do efficiently that needs to be extended?

Since there has already existed several implementations of phylogenetic types, It would be cool to have a little bit of meta-discussion of what would be the properties of the most succesful and efficient implementation overall and then design the types to fit those. Ideally, this would form a uniform framework for any operation using phylogenies in julia (making the work of derived package programmers a lot easier).

Personally I would like to have a phylogeny types that allowed for:

  • efficient iteration in different patterns (e.g. depth first)
  • flexible regrafting and subsetting
  • quick simulation
  • light storage, specifically
    • representation of populations of trees (from e.g. posterior distributions), where data is kept separate from the population of topologies
  • an intuitive way of storing data on
    • edge lengths, which are part of the topology
    • node attributes (such as support), which are sometimes independent of topology
    • tip attributes, which are independent of topology.

I think the interchangeability with R types should be easy to implement for any implementation, really.
It looks like PhyloNetworks supports some of those but not all.

@TransGirlCodes
Copy link
Member Author

It should be based on the commonalities in both PhyloNetworks and PhyloTrees. The basic edge types of both of these packages have commonality in type structure, and methods. The difference is in PhyloNetworks edge types have many more data fields that are network specific, and so can be moved to the parametric data field of Edge{M} (Note that data in the data field is data regarding topology (like whether or not the edge is in a cycle) not stuff like tip attributes). I think that allows for the things you desire in your list.

@mkborregaard
Copy link

So the intention is still to keep data and topology separate?

@TransGirlCodes
Copy link
Member Author

TransGirlCodes commented Aug 23, 2016

As much as is possible yes, the only additional data that would go into the data::M field of the Edge{M} (and it's only a suggested route), is data that is required for topology operations supported by networks and not your normal phylogeny e.g. the fields hybrid::Bool or inCycle::Int. The same would apply for nodes. Other data/info not tied to the topology would be annotated with dicts or something like that, as was done in the PhyNode days of Bio.Phylo.

@TransGirlCodes
Copy link
Member Author

TransGirlCodes commented Aug 23, 2016

The alternative to going parametric could be to create a type hierarchy and come up with an interface we document that should be adhered to.

Maybe something such as:

abstract AbstractBranch
abstract PhylogenyBranch <: AbstractBranch
abstract NetworkBranch <: AbstractBranch

type PhyloTreesBranch <: PhylogenyBranch
    ....
end

type PhyloNetworksBranch <: NetworkBranch
    ....
end

Then we say ok everything inheriting from AbstractBranch must implement:

length,
length!
to,
from,
to!,
from!

and so on... So then higher level code and phylogeny and network types can benefit from generic code in Bio.Phylo, but package authors also have a much greater degree of freedom than with us going down the parametric route, which forces a certain structure on them.

@mkborregaard
Copy link

I am happy to help with this, but it is still not clear exactly what outcome we are heading towards. Is the intention that PhyloTrees and PhyloNetworks will both be subsumed / rewritten into a new common structure? I will keep tracking this issue and see what you come up with.

@TransGirlCodes
Copy link
Member Author

TransGirlCodes commented Aug 29, 2016

Yes, I'm still working on how this will work conceptually. One of the biggest problems with the node and edge design above that I've discovered is that it is not type-safe, and circular type definitions are not possible in julia yet, and likely won't be for a long time. What I mean is the Nodes need to have Edges defined, and Edges need to have Nodes defined. You can get around that using abstract types, but then you have to be sure to use type assertions everywhere a field is accessed and you still get some performance regression. So I'm still figuring out the way I think is best to go for both performance and that works for everyone, I'll put up a PR with a few different designs and ping here when they're ready 👍 and we can go from there.

@mkborregaard
Copy link

Yes, I saw Tim Holy's response on julia-users. Is it imperative that there is a separate edge type? I am still not 100% clear what the problem was with lightgraphs (or did you check the new Graft.jl?)

@TransGirlCodes
Copy link
Member Author

I'm veering towards LightGraphs once more, now that I can see the issue with type-uncertainty. And am looking at fusing the idea proposed above, with what was already implemented with LightGraphs. I haven't seen Graft.jl, I will take a look

@TransGirlCodes
Copy link
Member Author

Development of Phylo is now moving over to my fork Ward9250/Bio.jl

@TransGirlCodes
Copy link
Member Author

Bio.Phylo is moving to BioJulia/Phylogenies.jl, so I'm going to close this and open an issue with the same title over there.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests

4 participants