Recombination must be encoded with two nodes #41
Replies: 11 comments 34 replies
-
I think the recombinant "node" is C, and the two parents are node 5 and its neighbour (unlabelled, perhaps node 6) in D? Unless I've misunderstood. I'll try drawing it out myself and post here.
-
Maybe another way to think about it - is it possible to always generate a Griffiths ARG from a diploid Wright-Fisher simulation? I don't think so, because the assumption that we don't get recombination and coalescence happening at the same time will be broken.
-
So I think I figured out what I would argue (from the Griffiths graph). The recombination nodes are gametes, and don't belong in any individual. The two parents of a recombination are the two genomes within a single parent individual. Coalescent nodes can be within individuals, but recombination nodes don't need to be. (edit - if you really want to group 2 recombination nodes into a single individual, the individual would be the child formed of 2 recombination nodes, one from each parent)
-
Here's a (crappy) pic, forwards in time, which I always find a bit easier. Big circles / squares are individuals. Sperm have wiggly tails (!)
And here's the version from the same example where you collapse the coalescences within an individual into the 2 recombination nodes above them (rather than treating recombination nodes as gametes). Note that in this picture, all individuals consist of 2 recombination nodes. I've highlighted in orange the path of the dark green segment that's shared between the 2 sample nodes, to show it coalesces in the top left hand genome:
-
I'm not sure I understand here. I think you need rules on node IDs in the 2-node case too: for example, you need to keep the 2 nodes together, so that you can look at the rightmost edge value of one and the leftmost of the other. Additionally, it becomes very difficult, if not impossible, to model multiple breakpoints (you probably need more than 2 nodes as parents, as discussed with @JereKoskela).
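To make the "keep the 2 nodes together" point concrete, here is a minimal sketch in plain Python. It is not msprime's or tskit's actual code: the `Edge` tuple and the `breakpoint_from_pair` helper are made up for illustration, and I'm assuming one node of the pair carries everything to the left of the breakpoint and the other everything to the right.

```python
from collections import namedtuple

# Hypothetical edge record: `child` inherits [left, right) from `parent`.
Edge = namedtuple("Edge", ["left", "right", "parent", "child"])


def breakpoint_from_pair(edges, left_node, right_node):
    """Recover the breakpoint encoded by a pair of recombination nodes.

    Assumes left_node carries the ancestry to the left of a single
    breakpoint and right_node the ancestry to its right.
    """
    # Rightmost coordinate inherited through the left-hand node.
    left_end = max(e.right for e in edges if e.parent == left_node)
    # Leftmost coordinate inherited through the right-hand node.
    right_start = min(e.left for e in edges if e.parent == right_node)
    if left_end != right_start:
        raise ValueError("pair does not describe a single breakpoint")
    return left_end


# Child node 0 recombines at position 30 on a genome of length 100;
# nodes 3 and 4 are the two recombination nodes (all values made up).
edges = [Edge(0, 30, 3, 0), Edge(30, 100, 4, 0)]
print(breakpoint_from_pair(edges, 3, 4))
```

Note that the helper only works if you already know which two nodes form a pair and which one is the "left" one, which is exactly the extra rule on node IDs. With a second breakpoint (say the child goes back to node 3 at position 60) the two coordinates no longer meet and the helper raises.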
-
(I don't really think this is a theorem or a proper proof, I just found it helpful to lay out the thought processes as if it was)
Theorem: the Griffiths graph representation cannot be used to uniquely represent a simulation of the coalescent with recombination (under reasonable assumptions)
Proof:
Consider this ARG and the process of simulating it. Each lineage is a tuple (node, ancestry) where
Time 0: Initialisation
Time t1: Recombination event on lineages[1], breakpoint=x
Time t2: CA event merging lineages[0] and lineages[2] after:
Time t3: CA event merging lineages[0] and lineages[1] after:
Final output:
Now, suppose we had chosen lineages[0] and lineages[1] to coalesce at time t2. This would lead to a different ancestry from the one given above (the left and right trees would be swapped), yet it would produce an identical Griffiths graph output. Therefore, two different ancestral histories produced under the coalescent with recombination have identical outputs.
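The argument can be replayed in a few lines of Python. This is only a sketch under my own assumptions: two samples (nodes 0 and 1), a genome of length `L` with breakpoint `X` (both values made up), and a "Griffiths-style" output that records nothing but child-parent links (the breakpoint stored on the RE node is the same in both runs, so it is left out). The `simulate` function and its `first_partner` argument are hypothetical names.

```python
L = 10  # genome length (arbitrary)
X = 4   # breakpoint position (arbitrary)


def simulate(first_partner):
    """Replay the events at t1, t2 and t3.

    first_partner is the index (1 or 2) of the lineage that merges with
    lineages[0] at time t2. Returns (graph, edges): child-parent links
    without intervals, and the same links annotated with intervals.
    """
    graph = []  # (child, parent)
    edges = []  # (left, right, parent, child)

    def record(lineage, parent):
        node, (left, right) = lineage
        graph.append((node, parent))
        edges.append((left, right, parent, node))

    # Time 0: two samples, each carrying the whole genome.
    lineages = [(0, (0, L)), (1, (0, L))]
    # Time t1: recombination on lineages[1]; a single RE node (2) is
    # the source of both resulting lineages.
    record(lineages[1], 2)
    lineages = [lineages[0], (2, (0, X)), (2, (X, L))]
    # Time t2: common ancestor event creating node 3.
    partner = lineages[first_partner]
    other = lineages[3 - first_partner]
    record(lineages[0], 3)
    record(partner, 3)
    # With two samples the overlapping interval has fully coalesced, so
    # node 3 only carries the interval still held by the other lineage.
    # Time t3: common ancestor event creating node 4.
    record((3, other[1]), 4)
    record(other, 4)
    return sorted(graph), sorted(edges)


graph_a, edges_a = simulate(first_partner=2)  # the history described above
graph_b, edges_b = simulate(first_partner=1)  # the alternative choice at t2
print(graph_a == graph_b)  # do the interval-free graphs agree?
print(edges_a == edges_b)  # do the interval-annotated edges agree?
```

Under these assumptions the interval-free graphs are the same while the interval-annotated edges differ, which is the ambiguity described above. If the graph also labelled which out-edge of the RE node is "left" and which is "right", the two histories could be told apart.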
-
An additional problem if you have 2 recombination nodes: each recombination results in 2 nodes. So if a single individual produces multiple gametes, each gamete must have a pair of nodes and the pairs must be distinct from each other, right? But the nodes actually represent the same genomes in a single individual. So in the case of a diploid male who has 2 successful sperm whose lineages lead to the samples, the single diploid male would be represented by 4 separate nodes (I think?)
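Just to spell out the counting, a toy sketch (the function names and node labels are made up, and it assumes every gamete gets its own distinct pair of recombination nodes):

```python
def two_node_encoding(individual, n_gametes):
    """One distinct pair of recombination nodes per gamete."""
    return [
        (individual, gamete, side)
        for gamete in range(n_gametes)
        for side in ("left", "right")
    ]


def genome_encoding(individual):
    """One node per genome of a diploid individual."""
    return [(individual, genome) for genome in (0, 1)]


male = "father"
print(len(two_node_encoding(male, n_gametes=2)))  # 4
print(len(genome_encoding(male)))                 # 2
```

So under that assumption the node count grows with the number of successful gametes, whereas with nodes as genomes it stays at 2 per diploid individual.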
-
A stupid question: if we have 2 nodes for a recombination, where do we store the node annotation for the breakpoint position? Or can we not use the 2-RE-node representation with a node-annotated ARG? Is it only appropriate for an edge-annotated one?
-
By the way, representations like those used in ARGweaver's That's a perfectly reasonable and tight definition, right? Just, as Jerome says, not great for storing during a simulation (it also gets a bit complex if you allow > 2 parents per node, but that's seen as a weird necessity by most people I would guess, until you explain simplification to them). Perhaps we are somewhat too much in the mindset of having to store edges and nodes as separate entities, whereas most formats simply define a node with associated parents in a single call or single file line (this is how the mathematical graph libraries often do it too).
-
This is a good discussion, so I'm going to summarise what I think are the key points:
The standard single RE-node-plus-breakpoint representation is also limited in what can be modelled. Gene conversion or multiple recombinations between two parents along a chromosome cannot be represented without changing the encoding and adding additional shared logic. For example, you can't represent the result of a standard Wright-Fisher simulation in this form, and it would probably be quite challenging to represent an inferred ancestry from within a densely sampled pedigree (e.g., the 1000 Bulls dataset or something). Explicitly representing the ancestry intervals associated with edges removes these representational limitations.
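As a small illustration of that last point, here is a sketch with interval-annotated edges in plain Python. The `Edge` tuple and both helpers are hypothetical, not the tskit API, and the coordinates and node IDs are made up.

```python
from collections import namedtuple

# Hypothetical edge record: `child` inherits [left, right) from `parent`.
Edge = namedtuple("Edge", ["left", "right", "parent", "child"])


def parents_along_genome(edges, child):
    """Return (left, right, parent) segments for `child`, left to right."""
    return sorted((e.left, e.right, e.parent) for e in edges if e.child == child)


def breakpoints(edges, child):
    """Positions at which the parent of `child` changes."""
    segments = parents_along_genome(edges, child)
    return [
        right
        for (_, right, parent), (_, _, next_parent) in zip(segments, segments[1:])
        if parent != next_parent
    ]


# Node 7 inherits from genome 5, then genome 6, then genome 5 again:
# a double crossover (or a gene conversion tract) between two parents.
edges = [Edge(0, 30, 5, 7), Edge(30, 60, 6, 7), Edge(60, 100, 5, 7)]
print(breakpoints(edges, 7))  # [30, 60]
```

Nothing about the encoding changes as breakpoints are added: each extra breakpoint is just another edge, with no special node type and no single breakpoint annotation to outgrow.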
-
For reference for future readers, the following discussion talks about why the msprime representation requires 2 (rather than one) recombination nodes: tskit-dev/msprime#1942
-
The fatal flaw in the Griffiths ARG representation is that it represents recombination events by one node in the graph. There are two reasons I think this is a flaw:
Consider the following with embedded ancestry:
We have a sample individual A, with two genomes 0 and 1. It has parents B and C (who are full sibs with parents = D and E).
A has inherited all of genome 0 from node 5 in D, via B (node 2 - in retrospect I shouldn't have numbered this node, as it's a pass-through node and not interesting for the ARG). Node 1 is a recombinant of genomes 3 and 4 from C, who inherited these from D (node 5). (I think this is probably unbiological, but you get the idea!)
The point is that C contains two genomes, corresponding to the recombined fragments of sample 1. How do we represent this as an ARG? If we think of nodes as genomes, then it's simple:
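Written out as plain tables, a rough sketch of the pedigree described above might look like this. The genome length `L`, the breakpoint `X`, and which of genomes 3 and 4 supplies the left-hand fragment are all made up for illustration; the helper names are hypothetical too.

```python
# Which individual each genome (node) sits in, following the description.
node_individual = {0: "A", 1: "A", 2: "B", 3: "C", 4: "C", 5: "D"}

L = 100  # genome length (made up)
X = 40   # breakpoint in the gamete from C (made up)

# Edges as (left, right, parent, child).
edges = [
    (0, L, 2, 0),  # genome 0 in A is a copy of genome 2 in B
    (0, L, 5, 2),  # which is itself a copy of genome 5 in D
    (0, X, 3, 1),  # genome 1 in A: left fragment from genome 3 in C
    (X, L, 4, 1),  # genome 1 in A: right fragment from genome 4 in C
    (0, L, 5, 3),  # genomes 3 and 4 both trace back to genome 5 in D,
    (0, L, 5, 4),  # as in the (probably unbiological) description
]


def parent_nodes(child):
    """Genomes that `child` inherits from."""
    return {p for (_, _, p, c) in edges if c == child}


def parent_individuals(child):
    """Individuals containing the genomes that `child` inherits from."""
    return {node_individual[p] for p in parent_nodes(child)}


def is_recombinant(child):
    """A genome is recombinant if it inherits from more than one genome."""
    return len(parent_nodes(child)) > 1


print(parent_individuals(1), is_recombinant(1), is_recombinant(0))
```

In this form the recombination is just genome 1 having two parent genomes inside the same individual C; no extra node has to be invented for the event itself.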
If we record recombinations with one node then it's not clear what the distinction between individuals and genomes is. Which node becomes the recombination? What happens if we have a coalescence and a recombination in the same individual?
(note: this argument seemed stronger when I started writing this, but I've inevitably gotten confused as I tried to write it out and draw some pictures. I think this is the right way to go though: draw a small inbred pedigree like you'd get from a lab drosophila line, and point out some issues with the Griffiths single recombination way of doing things.)