-
Here are two pictures of what I mean. The first shows the genomes (coloured circles) and the actual events that join them. The second shows the equivalent TS representation, in which only the genomes and the direct links between them are encapsulated in the format.
-
Another way to put it: a tree sequence "recombinant node" is the first genome in the TS (forwards in time) in which the effect of a recombination event is seen. A "common ancestor" node is the first genome in the TS (backwards in time) in which the effect of a coalescent event can be seen. Or something like that.
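To make that concrete, here's a minimal sketch (mine, not from the thread) that picks out recombinant nodes in a tskit tree sequence, on the assumption that a recombinant node is simply a node that inherits from more than one distinct parent:

```python
import collections

import tskit

def recombinant_nodes(ts: tskit.TreeSequence) -> set:
    # A node inheriting from two or more distinct parent nodes is the
    # first genome (forwards in time) in which the effect of a
    # recombination event can be seen.
    parents = collections.defaultdict(set)
    for edge in ts.edges():
        parents[edge.child].add(edge.parent)
    return {child for child, ps in parents.items() if len(ps) > 1}
```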
-
Thanks for this @hyanwong, it's good to think this through properly. I think you're focusing on the wrong thing here ("events"), but the reasons are interesting and I think we're slowly homing in on the right way of thinking about this.
If you define edges just as (parent, child) then you must have different parent nodes, because of the ambiguity (which we've gone over elsewhere). I find thinking about "events" unhelpful, to be honest, and I'm convinced it's just a hangover of language from the stochastic process. These "events" are unobservable, and things are much simpler when we start thinking about genomes and their properties rather than the nebulous concept of an event. In a recombination we have three genomes: the left parent, the right parent and the child. You can observe them at various points in time, and mostly it doesn't matter. You can implicitly represent the existence of the parents as pass-through nodes that exist on the (tree sequence) edges emerging from the child, if you like. I'm convinced that the most economical and elegant way to represent recombination is via this triad of nodes (u, p1, p2), with an edge joining u to p1 to the left of the breakpoint and an edge joining u to p2 to the right.

But, unless there's some specific information that we have about the parents themselves (say, their timing), they're redundant. You can (as you suggest) go the other direction, getting rid of p1 and p2, representing the information as a pair of edges that attach u directly to whatever ancestors the parents had, and imagining that p1 and p2 exist on whatever edges emerge from u. However, this is less expressive, and it has a nasty consequence: we're depending on not squashing adjacent edges for crucial information (most of the time we'd naturally just squash two such adjacent edges into one, and the breakpoint would be lost). So, to summarise: the explicit triad is the more expressive representation, while the two-edge version is more compact but fragile. A sketch of both follows.
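Here's a minimal sketch of the two representations using the tskit tables API (my own illustration, not from the original post; the breakpoint x = 5 and length L = 10 are made up):

```python
import tskit

L, x = 10.0, 5.0  # illustrative genome length and breakpoint

# Triad representation: recombinant child u plus explicit parents p1, p2.
tables = tskit.TableCollection(sequence_length=L)
u = tables.nodes.add_row(flags=tskit.NODE_IS_SAMPLE, time=0.0)
p1 = tables.nodes.add_row(time=1.0)  # left parent
p2 = tables.nodes.add_row(time=1.0)  # right parent
tables.edges.add_row(left=0.0, right=x, parent=p1, child=u)
tables.edges.add_row(left=x, right=L, parent=p2, child=u)
print(tables.tree_sequence().num_trees)  # 2: one tree on each side of x

# Compact representation: drop p1 and p2 and attach u directly to a
# (hypothetical) shared ancestor g. The breakpoint now survives only as
# the boundary between two adjacent edges with the same parent and child;
# squashing them into a single edge (0, L, g, u) would silently erase it.
tables2 = tskit.TableCollection(sequence_length=L)
u2 = tables2.nodes.add_row(flags=tskit.NODE_IS_SAMPLE, time=0.0)
g = tables2.nodes.add_row(time=2.0)
tables2.edges.add_row(left=0.0, right=x, parent=g, child=u2)
tables2.edges.add_row(left=x, right=L, parent=g, child=u2)
```

Both table collections are valid tree sequences; the difference is only whether the parent genomes are explicit nodes.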
-
Ah yes - case in point for de-emphasising events: tsinfer. We don't infer events in tsinfer, we infer genomes. These genomes don't correspond exactly to the recombinants or their parents in the actual events, but they capture most of the effects of those events. We can have an ARG that contains the majority of the relevant (or knowable) information without having events.
-
I've had a good long think about the events vs genomes thing, and I'd like to jot down my thoughts somewhere, as I feel I have a consistent view of this now (which may, of course, be different from other people's, so hopefully this will kick off a discussion).
The first thing to say is that I think we need to be careful to think about the ARG structure independently of whether it is easy or difficult to simulate the structure under one model or another. This is essentially for the same reason that we point out up top in the paper: we don't want to confuse the process with the structure. After all, there are many uses of an ARG format, such as in tsinfer, which are nothing to do with simulation.
The second thing I want to say is that I think the distinction between a node being a genome and a node being an event is indeed an important one, and I feel it is a great insight (thanks Jerome!). One implication is that a genome-based (rather than event-based) format - such as a tree sequence - specifies the edges that link the genomes in the graph, but doesn't necessarily have anything to say about exactly when a recombination or a coalescent event takes place. It's true that, by convention, in an msprime simulation the coalescence event happens exactly at what we call a coalescence node (or, presumably, the node sits infinitesimally after the coalescence in time). But the format doesn't actually require that. Because the nodes are genomes, the format is valid even if the only genomes we happen to have recorded in our tree sequence are, for example, two children (each at time t) and a great-great-grandparent node at time t+10. In this case, the coalescence event could have happened at any point between t and t+10.

Similarly, we now define a "recombinant node" to be the first node in the tree sequence at which we observe (forwards in time) two parental genomes merging into one. It could be that the recombinant node is at time t and the two parents are at times t+20 and t+100. All we can say in this case is that the recombination event happened between time t and t+20. In fact, this appears to be a standard feature of a simplified tree sequence.
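As a concrete illustration (my own sketch, not part of the original comment): msprime's record_full_arg option retains explicit nodes marking recombination events, and simplification throws them away, leaving the event times bounded only by the surviving genome nodes:

```python
import msprime

# Simulate, retaining explicit recombination-event (RE) marker nodes.
ts = msprime.sim_ancestry(
    samples=3, sequence_length=1e4, recombination_rate=1e-8,
    population_size=1e4, record_full_arg=True, random_seed=42,
)
re_nodes = [n.id for n in ts.nodes() if n.flags & msprime.NODE_IS_RE_EVENT]
print("nodes marking exact recombination times:", re_nodes)

# Simplification keeps only the genomes ancestral to the samples: the RE
# markers vanish, and each event's time is then only known to lie between
# the times of a child node and its (possibly much older) parent node.
ts_simple = ts.simplify()
print(any(n.flags & msprime.NODE_IS_RE_EVENT for n in ts_simple.nodes()))  # False
```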
In the (new) ARG format which is being output, e.g., by the `argutils.sim_coalescent` code in this repository, we can think of the 2 parent nodes of a recombination as being a bit like "census nodes": we use them both to mark the exact time of a recombination and, for algorithmic purposes, to be able to save the edges during the course of a backwards simulation, as discussed in detail by @jeromekelleher, for example here. Note, however, that I don't think of these nodes as a requirement of the genome-based ARG format, just a useful device for simulating such an ARG backwards in time, especially if we are using the nifty trick of jumping multiple generations at a time until the next event. To hammer home the point: I believe that the jumping-back-in-time process used in the Hudson algorithm should not be allowed to define the structure of the format; one would hope that the ARG format would be agnostic as to the exact process used to create the ARG, whether by simulation or by, e.g., inference. I think, for instance, that it would be equally valid (once the simulation has completed) to replace the 2 parent nodes with a single node infinitesimally more recent than the recombination event. I believe that the structure would still be a complete and valid representation of the ancestry, right? It's just that the 2-node version helps during the simulation process.

We can generalise this idea and think of a "coalescent node" in a simulation as actually being another form of "census node", because the Hudson algorithm places the node simultaneously with the coalescence event itself. That is, in a tree sequence created by the Hudson algorithm, a coalescent node really does occur (more or less) simultaneously with a coalescent event. This need not be the case, however, in a tree sequence in general.
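Incidentally, the analogy can be made literal with msprime's census machinery (a hedged sketch of mine; the parameter values are made up): a census records a genome on every extant lineage at an exact time, much as the 2 parent nodes pin down the exact time of a recombination:

```python
import msprime

# One-population model with a census recorded at exactly t = 500.
demography = msprime.Demography.isolated_model([10_000])
demography.add_census(time=500)
ts = msprime.sim_ancestry(samples=2, demography=demography, random_seed=1)

# Census nodes are genomes recorded at a known time on every lineage,
# whether or not any "event" happened to them at that moment.
census_nodes = [n.id for n in ts.nodes() if n.time == 500]
print("census nodes at t = 500:", census_nodes)
```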
Another neat thing about this view of the world is that it allows us to "collapse" multiple coalescent events that are close together in time into a single polytomy. For example, we can think of a WF model as censusing the genomes at a specific point in time (e.g. at birth), and so an individual that has hundreds of children could logically generate a lambda coalescent, even though the actual biological process is made up of multiple bifurcating coalescent events, one per germ-line mitosis. From my perspective, thinking of nodes as genomes rather than events is thus a justification for using the lambda coalescent as an approximation of population dynamics in species such as salmon.
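A tiny sketch of that collapse (mine, with arbitrary numbers): three samples whose two near-simultaneous bifurcations are recorded as a single polytomy node, i.e. one censused genome rather than two separate events:

```python
import tskit

L = 1.0
tables = tskit.TableCollection(sequence_length=L)
a = tables.nodes.add_row(flags=tskit.NODE_IS_SAMPLE, time=0)
b = tables.nodes.add_row(flags=tskit.NODE_IS_SAMPLE, time=0)
c = tables.nodes.add_row(flags=tskit.NODE_IS_SAMPLE, time=0)

# One genome node standing in for two coalescent events that happened
# only a germ-line mitosis apart: a polytomy over a, b and c.
p = tables.nodes.add_row(time=1.0)
for child in (a, b, c):
    tables.edges.add_row(left=0, right=L, parent=p, child=child)

tree = tables.tree_sequence().first()
print(tree.num_children(p))  # 3: one node, no claim about event order
```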
As an aside, the distinction between nodes as tangible objects and events as the processes linking them is somewhat reminiscent of the distinction between a Bayesian network and its representation as a "factor graph", where the factors are the "events" that join the nodes. We got a little into this when thinking about the tsdate representation of an ARG, with @awohns.