Nodes as events vs genomes, and timings #86

hyanwong · 2022-02-10T08:50:10Z

hyanwong
Feb 10, 2022
Maintainer

I've had a good long think about the events vs genomes thing, and I'd like to jot down my thoughts somewhere, as I feel I have a consistent view of this now (which may, of course, be different from other people, so hopefully this will kick off a discussion).

The first thing to say is that I think we need to be careful to think about the ARG structure independently of whether it is easy or difficult to simulate the structure under one model or another. This is essentially for the same reason that we point out up top in the paper: we don't want to confuse the process with the structure. After all, there are many uses of an ARG format, such as in tsinfer, which are nothing to do with simulation.

The second thing I want to say is that I think the distinction between a node being a genome and a node being an event is indeed an important one, and I feel is a great insight (thanks Jerome!). One implication is that a genome (rather than event) based format - such as a tree sequence - specifies the edges that link the genomes in the graph, but doesn't necessarily have anything to say about exactly when either a recombination or a coalescent event takes place. It's true that, by convention, in an msprime simulation the coalescence event happens exactly at what we call a coalescence node (or presumably infinitesimally after the coalescence in time). But the format doesn't actually require that. Because the nodes are genomes, the format is valid even if the only genomes which we happen to have recorded in our tree sequence are, for example, two children (each at time t), and a great-great-grandparent node at time t+10. In this case, the coalescence event could have happened at any point between t and t+10. Similarly, we now define a "recombinant node" to be the first node in the tree sequence when we observe two genomes merging together. It could be that the recombinant node is at time t and the two parents are at time t+20 and t+100. All we can say in this case is that the recombination event happened between time t and t+20. In fact, this appears to be a standard feature of a simplified tree sequence.

In the (new) ARG format which is being output e.g. by the argutils.sim_coalescent code in this repository, we can think of the 2 parent nodes of a recombination as being a bit like "census nodes", that we use both to mark the exact time of a recombination, and (for algorithmic purposes), to be able to save the edges during the course of a backwards simulation, as discussed in detail by @jeromekelleher for example here. Note however that I don't think of these nodes as a requirement of the genome-based ARG format, just a useful method to simulate such an ARG backwards in time, especially if we are using the nifty trick of jumping multiple generations at a time until the next event. To hammer home the point, I believe that the jumping-back-in-time process as used in the Hudson algorithm should not be allowed to define the structure of the format: one would hope that the ARG format would be agnostic as to the exact process used to create the ARG, whether by simulation or by e.g. inference. I think, for instance, that it would be equally valid to (once the simulation has completed) replace the 2 parent nodes with a single node infinitesimally more recent than the recombination event. I believe that the structure would still be a complete and valid representation of the ancestry, right? It's just that the 2-node version helps during the simulation process?

We can generalise this idea and think of a "coalescent node" in a simulation as being actually another form of "census" node, because the Hudson algorithm places the node simultaneously with the coalescence event itself. I.e. in a tree sequence created by the Hudson algorithm, a coalescent node really does occur (more or less) simultaneously with a coalescent event. This need not be the case, however, in a tree sequence in general.

Another neat thing about this view of the world is that it allows us to "collapse" multiple coalescent events, close together in time, into a single polytomy. For example, we can think of a WF model as censusing the genomes at a specific point in time (e.g. at birth), and so an individual that has hundreds of children could logically generate a lambda coalescent, even though actual biological process is made up of multiple bifurcating coalescent events, one per germ-line mitosis. From my perspective, thinking of nodes as genomes rather than events is thus a justification for using the lambda coalescent as an approximation of population dynamics in species such as salmon etc.

As an aside, the distinction between nodes as tangible objects and events linking the processes is somewhat reminiscent of the distinction between a Bayesian Network and its representation as a "factor graph", where the factors are the "events" that join the nodes. We got a little into this when thinking about the tsdate representation of an ARG, with @awohns.

hyanwong · 2022-02-10T09:28:07Z

hyanwong
Feb 10, 2022
Maintainer Author

Here are two pictures of what I mean. The first shows the genomes (coloured circles) and the actual events that join them. The second shows the equivalent TS representation, where only genomes and the direct links between them are encapsulated in the format:

1 reply

hyanwong Feb 10, 2022
Maintainer Author

Note that the process of (full) simplification removes multiple recombination events which could have occurred in different points in time (even different generations), and combines them as multiple parents of a single, basal, "recombinant" node. A roughly symmetrical argument, but in the other direction in time, applies to the multiple-merger coalescent. In fact, it could also apply to the polytomies created by tsinfer.

hyanwong · 2022-02-10T09:44:17Z

hyanwong
Feb 10, 2022
Maintainer Author

Another way to put it: a tree sequence "recombinant node" is the first genome in the TS (forwards in time) in which the effect of a recombination event is seen. A "common ancestor" node is the first genome in the TS (backwards in time) in which the effect of a coalescent event can be seen. Or something like that.

1 reply

jeromekelleher Feb 10, 2022
Maintainer

This is the right way to think about it I think. We imagine a continuum of individuals existing along an edge that share the same genome from chromosome coordinate left to right. The nodes that we put into the graph are the genomes that are at either end of that continuum.

jeromekelleher · 2022-02-10T10:58:12Z

jeromekelleher
Feb 10, 2022
Maintainer

Thanks for this @hyanwong, it's good to think this through properly. I think you're focusing on the wrong thing here ("events"), but the reasons are interesting and I think we're slowly homing in on the right way of thinking about this.

I think, for instance, that it would be equally valid to (once the simulation has completed) replace the 2 parent nodes with a single node infinitesimally more recent than the recombination event. I believe that the structure would still be a complete and valid representation of the ancestry, right? It's just that the 2-node version helps during the simulation process?

If you define edges (left, right, parent, child) (as in, they have genome coordinates) then yes, you can do this if you want. I guess you can think about it as the "original" recombinant genome, which is inherited as one unit by the lineage until the next significant event (node).

If you define edges just as (parent, child) then you must have different parent nodes because of the ambiguity (which we've gone over elsewhere).

I find thinking about "events" unhelpful to be honest, and I'm convinced it's just a hangover of language from the stochastic process. These "events" are unobservable, and things are much simpler when we just start thinking about genomes and their properties, rather than the nebulous concept of an event.

In a recombination, we have three genomes: the left parent, the right parent and the child. You can observe them at various points in time, and mostly it doesn't matter. You can implicitly represent the existence of the parents as pass-through nodes that exist on the (tree sequence) edges emerging from the child, if you like.

I'm convinced that the most economical and elegant way to represent recombination is via this triad of nodes (u, p1, p2), and edges
(0, x, p1, u), (x, L, p2, u). Unarguably, the genomes corresponding to u, p1 and p2 exist at some point in time. If we want, we can think of the first recombinant genome v, and add it in like

(0, L, v, u), (0, x, p1, v), (x, L, p2, v)

But, unless there's some specific information that we have about v itself (say, it's timing) then its redundant. You can (as you suggest) go the other direction, by getting rid of p1 and p2, and representing the information as edges, like

(0, x, v, u), (x, L, v, u)

and imagine that p1 and p2 exist on whatever edges emerge from v. However, this is less expressive, and has a nasty consequence in that we're depending on not squashing adjacent edges for crucial information (most of the time we'd naturally just express this as (0, L, v, u).

So, to summarise:

I think the language of "events" is unhelpful and we should stop trying to shoehorn the model to fit with the concepts that are inherited from a highly idealised stochastic process
If we think about nodes as representing unique genomes (ignoring mutations) then there's a natural interpretation of the recombination process, in terms of three nodes and where the precise timings of the nodes aren't important (and, lets face it, these timings will never be precisely knowable)
The representation of important information should be robust to operations like edge squashing (it makes like quite unpleasant when we have to keep track of which interval breaks are significant and which arent).

3 replies

hyanwong Feb 10, 2022
Maintainer Author

Just a quick initial reply to say that I agree that in tree sequence world we don't need to think about "events", and I agree that the are usually unknowable (although that doesn't stop us simulating them, of course). However, thinking about how the concept of "events" fits into the TS viewpoint it is really helpful for other people, coming from the statistical world.

Also, I think "events" do happen in real life, not just in an idealised stochastic process. Mitosis is an "event", not a genome, as is meiosis. But we don't need to incorporated them into the TS world (for the reasons stated above)

jeromekelleher Feb 10, 2022
Maintainer

Sure - once we agree that the basic unit that we're talking about "nodes" correspond to genomes, and "events" are things that happen to those genomes along edges joining genomes that exist at different time points, then it doesn't matter. I just want to avoid the idea that nodes are events.

We simulate events - those events produce genomes (nodes). We don't have to have one node per event (which is why I'm so adamant about this), and it happens that recombinations are most elegantly and economically modelled with two nodes. There's no good reason to try to model them with one node, other than to fulfil the idea that nodes should be events.

hyanwong Feb 10, 2022
Maintainer Author

Yes, 100%. One of my arguments here is that we are better off thinking that nodes are genomes rather than events because then we can condense multiple events into a single relationship between genomes (nodes). I think that's what you are saying too.

jeromekelleher · 2022-02-10T13:37:30Z

jeromekelleher
Feb 10, 2022
Maintainer

Ah yes - case in point for de-emphasising events: tsinfer. We don't infer events in tsinfer, we infer genomes. These genomes don't specifically correspond to the recombinants or their parents from the precise events, but we take into account most of the effects of those events. We can have an ARG that captures the majority of the relevant or knowable information without having events.

1 reply

hyanwong Feb 10, 2022
Maintainer Author

Treating nodes as genomes is also compatible with the philosophy that tree sequences record the outcome of a process, rather than (necessarily) the process itself. I think you said this in one of the papers somewhere @jeromekelleher

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Nodes as events vs genomes, and timings #86

{{title}}

{{editor}}'s edit

{{editor}}'s edit

Replies: 4 comments 6 replies

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{title}}

{{title}}

Select a reply

Nodes as events vs genomes, and timings #86

hyanwong Feb 10, 2022 Maintainer

Replies: 4 comments · 6 replies

hyanwong Feb 10, 2022 Maintainer Author

hyanwong Feb 10, 2022 Maintainer Author

hyanwong Feb 10, 2022 Maintainer Author

jeromekelleher Feb 10, 2022 Maintainer

jeromekelleher Feb 10, 2022 Maintainer

hyanwong Feb 10, 2022 Maintainer Author

jeromekelleher Feb 10, 2022 Maintainer

hyanwong Feb 10, 2022 Maintainer Author

jeromekelleher Feb 10, 2022 Maintainer

hyanwong Feb 10, 2022 Maintainer Author

hyanwong
Feb 10, 2022
Maintainer

Replies: 4 comments 6 replies

hyanwong
Feb 10, 2022
Maintainer Author

hyanwong Feb 10, 2022
Maintainer Author

hyanwong
Feb 10, 2022
Maintainer Author

jeromekelleher Feb 10, 2022
Maintainer

jeromekelleher
Feb 10, 2022
Maintainer

hyanwong Feb 10, 2022
Maintainer Author

jeromekelleher Feb 10, 2022
Maintainer

hyanwong Feb 10, 2022
Maintainer Author

jeromekelleher
Feb 10, 2022
Maintainer

hyanwong Feb 10, 2022
Maintainer Author