Recombination must be encoded with two nodes #41

jeromekelleher · 2022-01-23T17:02:20Z

jeromekelleher
Jan 23, 2022
Maintainer

The fatal flaw in the Griffiths ARG representation is that it represents recombination events by one node in the graph. There are two reasons I think this is a flaw:

1. It means that there is no bijective mapping from an arbitrary graph to a form in which the ancestry is fully resolved without some additional rules on (e.g.) node IDs. I think that any such rule would be extremely difficult to enforce in, say, simulated output. (This needs to be explored a bit more and thought through; see Arg sim #39 for the starting point.)
2. It is fundamentally unbiological (if we see nodes as corresponding to genomes).

Consider the following with embedded ancestry:

We have a sample individual A, with two genomes 0 and 1. It has parents B and C (who are full sibs with parents = D and E).

A has inherited all of genome 0 from node 5 in D, via B (node 2 - in retrospect I shouldn't have numbered this node, as it's a pass-through node and not interesting for the ARG). Node 1 is a recombinant of genomes 3 and 4 from C, who inherited these from D (node 5). (I think this is probably unbiological, but you get the idea!)

The point is that C contains two genomes, corresponding to the recombined fragments of sample 1. How do we represent this as an ARG? If we think of nodes as genomes, then it's simple:

0 -> 5
1 -> 3 (bp = x)
1 -> 4 (bp = x)
3 -> 5
4 -> 5
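As a minimal sketch of this genome-as-node encoding, the edge list above can be written down and queried in Python. The list does not say which parent carries which side of the breakpoint, so the choice that node 3 carries the material left of `x` and node 4 the material right of it is an assumption made purely for illustration, as are the names `EDGES`, `parent_at` and `path_to_root` and the numeric value of the breakpoint.

```python
X = 0.5  # hypothetical breakpoint on a genome of length 1

# (child, parent, side, breakpoint); side None means the whole genome is inherited.
# ASSUMPTION: node 3 is the "left" parent of node 1 and node 4 the "right" one.
EDGES = [
    (0, 5, None, None),
    (1, 3, "left", X),
    (1, 4, "right", X),
    (3, 5, None, None),
    (4, 5, None, None),
]


def parent_at(child, pos):
    """Return the parent of `child` at genome position `pos`, or None for a root."""
    for c, p, side, bp in EDGES:
        if c != child:
            continue
        if side is None:
            return p
        if side == "left" and pos < bp:
            return p
        if side == "right" and pos >= bp:
            return p
    return None


def path_to_root(node, pos):
    """Follow the ancestry of `node` at position `pos` up to the root."""
    path = [node]
    while (parent := parent_at(path[-1], pos)) is not None:
        path.append(parent)
    return path


print(path_to_root(1, 0.2))  # left of the breakpoint: via node 3
print(path_to_root(1, 0.8))  # right of the breakpoint: via node 4
```

With two parent nodes, the side each one carries is attached to its own edge, so the path at any position can be read off without consulting any rule outside the data.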

If we record recombinations with one node then it's not clear what the distinction between individuals and genomes is. Which node becomes the recombination? What happens if we have a coalescence and a recombination in the same individual?

(note: this argument seemed stronger when I started writing this, but I've inevitably gotten confused as I tried to write it out and draw some pictures. I think this is the right way to go though: draw a small inbred pedigree like you'd get from a lab drosophila line, and point out some issues with the Griffiths single recombination way of doing things.)

hyanwong · 2022-01-23T17:10:31Z

hyanwong
Jan 23, 2022
Maintainer

I think the recombinant "node" is C, and the two parents are node 5 and its neighbour (unlabelled, perhaps node 6) in D? Unless I've misunderstood. I'll try drawing it out myself and post here.

3 replies

jeromekelleher Jan 23, 2022
Maintainer Author

Right, so the "node" is the individual. But the node for the samples within individual A is the genome, so it's inconsistent. (As I said, this argument seemed stronger when I started writing this up.)

hyanwong Jan 23, 2022
Maintainer

So I think we were both wrong. If you want the recombination nodes to be present in an individual, then the RH genome (node 1) of the sample individual is a recombination node, and the 2 parents of that node are the 2 genomes in C.

In this case the Griffiths notation works fine. You associate the breakpoint with node 1, which is a sample recombination node.

hyanwong Jan 23, 2022
Maintainer

I suspect that in the Griffiths notation, every sample node must also be a recombination node, assuming one recombination per gametogenesis. I'm not sure anyone has pointed this out before?

jeromekelleher · 2022-01-23T17:11:32Z

jeromekelleher
Jan 23, 2022
Maintainer Author

Maybe another way to think about it - is it possible to always generate a Griffiths ARG from a diploid Wright-Fisher simulation? I don't think so, because the assumption that we don't get recombination and coalescence happening at the same time will be broken.

0 replies

hyanwong · 2022-01-23T17:29:15Z

hyanwong
Jan 23, 2022
Maintainer

So I think I figured out what I would argue (from the Griffiths graph). The recombination nodes are gametes, and don't belong in any individual. The two parents of a recombination are the two genomes within a single parent individual. Coalescent nodes can be within individuals, but recombination nodes don't need to be.

(edit - if you really want to group 2 recombination nodes into a single individual, the individual would be the child formed of 2 recombination nodes, one from each parent)

3 replies

hyanwong Jan 23, 2022
Maintainer

We know in a WF model, a single individual can produce many gametes, each with a recombination breakpoint. So we know that a single node with a single recombination breakpoint doesn't belong within the individual that generated it. It either is outside of the individual, or it belongs in the child (individual) of that parent.

Basically, a single gamete-producing individual can generate a large number of recombination nodes, so we can't include those in the definition of the parent (which, being diploid) only has 2 nodes.

hyanwong Jan 23, 2022
Maintainer

NB: it's neater (I think) to imagine the recombination nodes as gametes, because then you definitely can't get coalescence and recombination happening at the same time. You could then also argue that removing the recombination nodes from the graph is equivalent to removing explicitly defined gametes from the schematic.

However, it does raise the question of where in time to place the recombination node in a WF simulation. I guess at noninteger generations, e.g. the first recombination nodes are at generation 0.5. Alternatively, if you require all nodes to be at integer generations, you need to consider an individual to consist of a set of coalescences below and a pair of recombination nodes at the "top" (i.e. that create the nodes in the individual). You can perhaps merge these together into a single "node" with both coalescences (below) and a recombination (at the head).

jeromekelleher Jan 24, 2022
Maintainer Author

Great point about the gametes - I think this is the essential difference all right: do we see nodes as an individual's genomes or as free floating gametes?

hyanwong · 2022-01-23T19:11:14Z

hyanwong
Jan 23, 2022
Maintainer

Here's a (crappy) pic, forwards in time, which I always find a bit easier. Big circles / squares are individuals. Sperm have wiggly tails (!)

And here's the version from the same example where you collapse the coalescences within an individual into the 2 recombination nodes above them (rather than treating recombination nodes as gametes). Note that in this picture, all individuals consist of 2 recombination nodes. I've highlighted the path of the dark green segment that's shared between the 2 sample nodes in orange, to show it coalesces in the top left hand genome:

0 replies

hyanwong · 2022-01-23T19:25:16Z

hyanwong
Jan 23, 2022
Maintainer

> It means that there is no bijective mapping from an arbitrary graph to a form in which the ancestry is fully resolved without some additional rules on (e.g.) node IDs. I think that any such rule would be extremely difficult to enforce in, say, simulated output. (This needs to be explored a bit more and thought through, see Arg sim #39 for the starting point).

I'm not sure I understand here. I think you need rules on node IDs in the 2-node case too: for example, you need to keep the 2 nodes together, so that you can look at the rightmost edge value of one and the leftmost of the other.

Additionally, it becomes very difficult, if not impossible, to model multiple breakpoints (you probably need more than 2 nodes as parents, as discussed with @JereKoskela).

2 replies

jeromekelleher Jan 24, 2022
Maintainer Author

That's true, you either need a rule about which parent goes left/right or you associate some information with the edge.

The point is though that it's possible to make a bijection in the two node case, whereas I don't think it's true in the single node case unless you have strict (and difficult to implement) rules about the form of the graph. I'll try to make this more precise elsewhere.

hyanwong Jan 24, 2022
Maintainer

I commented in #39 (comment) that I think it's the edge order that's used in the single node case, but I'd be interested if that's not sufficient for a one-to-one mapping.

jeromekelleher · 2022-01-24T16:30:41Z

jeromekelleher
Jan 24, 2022
Maintainer Author

(I don't really think this is a theorem or a proper proof, I just found it helpful to lay out the thought processes as if it was)

Theorem: the Griffiths graph representation cannot be used to uniquely represent a simulation of the coalescent with recombination (under reasonable assumptions)

Proof:

t3  ┊  4  ┊     ┊
    ┊ ┏┻┓ ┊     ┊
t2  ┊ 3 ┃ ┊  3  ┊
    ┊ ┃ ┃ ┊ ┏┻┓ ┊
t1  ┊ ┃ 2 ┊ ┃ 2 ┊
    ┊ ┃ ┃ ┊ ┃ ┃ ┊
0.00┊ 0 1 ┊ 0 1 ┊
    0     x     L

Consider this ARG and the process of simulating it.

Each lineage is a tuple (node, ancestry) where node is the ID of the node in which this lineage was created, and ancestry
is a list of (left, right, ancestral_to) values. The left and right coordinates define the interval, and the ancestral_to
value counts how many samples that interval is ancestral to.

Time 0: Initialisation

lineages = [
    (0, [(0, L, 1)]),
    (1, [(0, L, 1)]),
]

Time t1: Recombination event on lineages[1], breakpoint=x
record RE node, id=2, breakpoint=x time=t1
output edge: 1 -> 2
after:

lineages = [
    (0, [(0, L, 1)]),
    (2, [(0, x, 1)]),
    (2, [(x, L, 1)]),
]
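The recombination step above can be sketched as a small function that splits a lineage's ancestry list at the breakpoint. This is an illustrative sketch only: the name `split_ancestry` and the numeric values are assumptions, and intervals are taken to be half-open `[left, right)`.

```python
def split_ancestry(ancestry, breakpoint):
    """Split a list of (left, right, ancestral_to) intervals at `breakpoint`.

    Returns (left_part, right_part). Intervals are assumed half-open
    [left, right), sorted and non-overlapping.
    """
    left_part, right_part = [], []
    for lo, hi, count in ancestry:
        if hi <= breakpoint:
            # Entirely to the left of the breakpoint.
            left_part.append((lo, hi, count))
        elif lo >= breakpoint:
            # Entirely to the right of the breakpoint.
            right_part.append((lo, hi, count))
        else:
            # The interval straddles the breakpoint, so cut it in two.
            left_part.append((lo, breakpoint, count))
            right_part.append((breakpoint, hi, count))
    return left_part, right_part


L, x = 10, 4  # hypothetical genome length and breakpoint
print(split_ancestry([(0, L, 1)], x))
```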

Time t2: CA event merging lineages[0] and lineages[2]
record node, id=3, time=t2
output edges: 0 -> 3, 2 -> 3

after:

lineages = [
    (3, [(0, x, 1), (x, L, 2)]), # NOTE: second element would be dropped here as coalesced
    (2, [(0, x, 1)]),
]
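The common-ancestor step, including the dropping of fully coalesced segments mentioned in the NOTE, can be sketched as follows. The function name `merge_ancestry` and the parameter `num_samples` are assumptions for illustration; this is a simple quadratic-time version, not how an efficient simulator would do it.

```python
def merge_ancestry(a, b, num_samples):
    """Merge two ancestry lists of (left, right, ancestral_to) intervals.

    Counts are summed where the inputs overlap. Segments whose count reaches
    `num_samples` have fully coalesced and are dropped.
    """
    segments = a + b
    # Every interval endpoint is a potential change point in the merged list.
    points = sorted({p for lo, hi, _ in segments for p in (lo, hi)})
    merged = []
    for lo, hi in zip(points, points[1:]):
        count = sum(c for (s_lo, s_hi, c) in segments if s_lo <= lo and hi <= s_hi)
        if count == 0 or count == num_samples:
            continue  # no ancestral material, or fully coalesced
        if merged and merged[-1][1] == lo and merged[-1][2] == count:
            # Extend the previous interval rather than emit an adjacent duplicate.
            merged[-1] = (merged[-1][0], hi, count)
        else:
            merged.append((lo, hi, count))
    return merged


L, x = 10, 4  # hypothetical genome length and breakpoint
# Time t2 in the walk-through: lineage (0, [(0, L, 1)]) merges with (2, [(x, L, 1)]).
print(merge_ancestry([(0, L, 1)], [(x, L, 1)], num_samples=2))
```

Applied to the t2 event, only the `(0, x, 1)` segment survives, which is the ancestry that node 3 carries upward to the final event at t3.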

Time t3: CA event merging lineages[0] and lineages[1]
record node, id=4, time=t3
output edges: 2 -> 4, 3 -> 4

after:

lineages = [] # fully coalesced

Final output:

Nodes:
0 time=0
1 time=0
2 time=t1, breakpoint=x
3 time=t2
4 time=t3

Edges:
1 -> 2
0 -> 3
2 -> 3
2 -> 4
3 -> 4

Now, suppose we had chosen lineages[0] and lineages[1] to coalesce at time t2. This would lead to a different ancestry to the one given above (the left and right trees would be swapped), but yet have an identical Griffiths Graph output.

Therefore, two different ancestral histories produced under the coalescent with recombination have identical outputs.
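To make the argument concrete, here is a small sketch that replays the three events under both choices and records the edges twice: once as bare (child, parent) pairs, as in the Griffiths output above, and once with the interval of ancestral material attached. The function name `replay`, the flag `merge_right_first` and the numeric values of `L` and `x` are assumptions for illustration.

```python
L, x = 10.0, 4.0  # hypothetical genome length and breakpoint


def replay(merge_right_first):
    """Replay the events at t1, t2, t3; return (bare_edges, interval_edges)."""
    bare = []       # Griffiths-style (child, parent) pairs
    annotated = []  # (child, parent, left, right)

    # t1: recombination on node 1 creates RE node 2, which carries both halves.
    bare.append((1, 2))
    annotated.append((1, 2, 0.0, L))
    left_half, right_half = (0.0, x), (x, L)

    # t2: node 0 coalesces with ONE of the two halves of node 2, in node 3.
    if merge_right_first:
        first, second = right_half, left_half
    else:
        first, second = left_half, right_half
    bare += [(0, 3), (2, 3)]
    annotated += [(0, 3, 0.0, L), (2, 3, *first)]

    # t3: node 3 coalesces with the remaining half of node 2, in node 4.
    # Only the not-yet-coalesced interval flows up from node 3.
    bare += [(2, 4), (3, 4)]
    annotated += [(2, 4, *second), (3, 4, *second)]
    return sorted(bare), sorted(annotated)


bare_a, annotated_a = replay(merge_right_first=True)   # the history walked through above
bare_b, annotated_b = replay(merge_right_first=False)  # the alternative choice at t2

print(bare_a == bare_b)            # the bare node/edge output is the same
print(annotated_a == annotated_b)  # the interval-annotated output is not
```

The two histories differ only in the intervals attached to the edges leaving node 2, which is exactly the information that the bare output does not record.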

12 replies

jeromekelleher Jan 24, 2022
Maintainer Author

No, I'm looking at this from a more fundamental perspective - what are the drawbacks of the "standard" Griffiths representation?

1. It's ambiguous without extra rules about the ordering of parent edges from a recombination node
2. It is difficult to simulate a Griffiths graph that respects these (or any such?) rules under the CwR
3. Outputting a Griffiths graph from a simulation essentially means throwing away all the information that you had to maintain about ancestral material along intervals, in order to simulate that graph in the first place (otherwise you would have to simulate the Big ARG)
4. Reconstructing the flow of ancestry through the graph in order to generate trees involves reconstructing precisely the information that you threw away

The conclusions are:

a. You can remove the ambiguity from the Griffiths representation by storing two recombination nodes and a breakpoint (but it doesn't generalise to back and forth recombination between two parents)
b. However, you still have to reconstruct the flow of ancestral material through the graph in order to reconstruct the trees. It's much simpler and more efficient if you just store the intervals of ancestral material along with the edges (as we do in tskit).

So I'm imagining we start the reader out somewhere familiar and then take them through the problems, and end up with the conclusion that actually, we should all just be storing ancestry resolved ARGs in the first place.

hyanwong Jan 24, 2022
Maintainer

So I mostly agree @jeromekelleher. Here's how I'd put it: Your number 1. is easy to define (as I said, it's simply the order of edges, which is what I assume everyone uses). However, if this is used, it becomes very difficult to (efficiently) simulate the Griffiths graph, for all the reasons you have worked out and state so eloquently above. Therefore the Griffiths representation is a perfectly reasonable (if somewhat limiting) way to represent an ARG, but a pretty hopeless representation if you want to simulate an ARG.

We need to be a little cautious, I think. Although we are interested in simulating the CwR, there may be lots of people out there who don't care about this at all. They might, for example, simply want a mathematical representation for analytical maths or probability theory. Or they might want to store an inferred Griffiths-type ARG (this is what ARGweaver does, right?). Or they might be simulating an ARG forwards in time. These are a few of the possibilities to which the node-annotated Griffiths representation might be suited. But there is essentially no extra problem in storing the information on the edges instead, apart from the one about calculating likelihoods under the CwR (how do we intend to do this now, by the way?)

jeromekelleher Jan 25, 2022
Maintainer Author

These are fair points @hyanwong, and you're right we need to get into the mindset of people who think this is the right way to do things. Some quick response though:

For mathematical work, I'd be deeply worried about working with a representation that's ambiguous. Surely it's the underlying ancestry that we are trying to reason about? For example, suppose we came up with an expression to count Griffiths ARGs. What use would that be? We'd still have to figure out how many distinct ancestral histories each one corresponds to in order for this to be meaningful. Similarly for things like comparison metrics. Maybe you can work around this with the ordering rules, but it seems very tricky to me.
If you want to work with inferred ARGs, then presumably you also want to interchange the inferences. Something with this level of ambiguity seems like a poor basis for interchange to me. (I haven't thought about your point about the "one line" representation below though)

hyanwong Jan 25, 2022
Maintainer

Thanks @jeromekelleher. I'm not sure the representation is ambiguous as long as the edge (or parent ID) order is specified. It's just that this is a surprisingly tricky thing to specify (especially during simulation, right?). Also edge (or parent) order is not something that is usually important in MultiDiGraphs, so the frameworks may not support this, and the maths probably hasn't been developed, if indeed it is possible to incorporate that mathematically (as you say, it seems very tricky).

So yes, I basically agree, apart from the concept of ambiguity, which I think is resolved by using order, which itself causes all sorts of other problems.

jeromekelleher Jan 25, 2022
Maintainer Author

You're right @hyanwong, it's not ambiguous so long as you know that the edge order is significant. A good representation for computation and mathematics should require minimal extrinsic logic. As Eric S Raymond says, "Smart data structures and dumb code works a lot better than the other way around."

hyanwong · 2022-01-24T18:13:22Z

hyanwong
Jan 24, 2022
Maintainer

An additional problem if you have 2 recombination nodes: each recombination results in 2 nodes. So if a single individual produces multiple gametes, each gamete must have a pair of nodes and the pairs must be distinct from each other, right? But the nodes actually represent the same genomes in a single individual.

So in the case of a diploid male who has 2 successful sperm whose lineages lead to the samples, the single diploid male would be represented by 4 separate nodes (I think?)

2 replies

hyanwong Jan 24, 2022
Maintainer

Alternatively, we could keep 2 nodes (not 4) in this male, but have 4 edges (2 leading up into each RE node). But then we can't (I think) use the node order to determine the ancestry path, because the required order could be reversed in the 2 recombinations. Wow. This is mind bending stuff. I'm not sure there's an obvious way to do it.

jeromekelleher Jan 24, 2022
Maintainer Author

All this mental gymnastics is needed to work around the basic ambiguity of the representation, I think. There's a point where you have to start thinking it's just a bad representation...

hyanwong · 2022-01-24T18:28:47Z

hyanwong
Jan 24, 2022
Maintainer

A stupid question: if we have 2 nodes for a recombination, where do we store the node annotation for the breakpoint position? Or can we not use the 2-RE-node representation with a node-annotated ARG? Is it only appropriate for an edge-annotated one?

3 replies

JereKoskela Jan 24, 2022

Either on the child node of the two recombination nodes, or on both recombination nodes?

hyanwong Jan 24, 2022
Maintainer

If on both recombination nodes, we hit the problem above, that one individual can be responsible for multiple meioses, one for each generated gamete (this is distinct from multiple breakpoints per meiosis or multiple chromosomes). In that case we would have to either have >2 nodes per individual, or store multiple breakpoints per RE node, even though there weren't multiple breakpoints in the biological process. So we would do better to store it on the child. But that needn't be a recombination node (could even be a NOCOAL_CA node, I guess).

jeromekelleher Jan 25, 2022
Maintainer Author

Simplest thing is to store it on just one of the nodes I guess. That's an easy way to identify the left node then.

I don't think there's much point in torturing ourselves trying to figure out how to make node annotations work in general - they don't (or, if they do it's very complicated and requires even more logic extrinsic to the actual representation).

hyanwong · 2022-01-24T21:03:22Z

hyanwong
Jan 24, 2022
Maintainer

By the way, representations like those used in ARGweaver's .arg format don't really bother storing (or really thinking about) the edges and nodes separately. They just store each node with a set of parents (or redundantly, also with a set of children). If a node has 2 parents, it also has a breakpoint. The order of the parents specified tells you which bits of genome go where.

That's a perfectly reasonable and tight definition, right? Just, as Jerome says, not great for storing during a simulation (it also gets a bit complex if you allow > 2 parents per node, but that's seen as a weird necessity by most people I would guess, until you explain simplification to them). Perhaps we are somewhat too much in the mindset of having to store edges and nodes as separate entities, whereas most formats simply define a node with associated parents in a single call or single file line (this is how the mathematical graph libraries often do it too).
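A minimal sketch of this kind of "node with ordered parents" encoding is given below. It is not ARGweaver's actual file format: the dictionary layout, the name `parent_at`, the node IDs and the convention that the first listed parent carries the material left of the breakpoint are all assumptions for illustration.

```python
# node -> (ordered tuple of parents, breakpoint or None)
# ASSUMED convention: for a node with 2 parents, parents[0] carries [0, bp)
# and parents[1] carries [bp, L).
ARG = {
    1: ((3, 4), 0.5),
    3: ((5,), None),
    4: ((5,), None),
}


def parent_at(arg, node, pos):
    """Return the parent of `node` at genome position `pos`."""
    parents, breakpoint = arg[node]
    if len(parents) == 1:
        return parents[0]
    return parents[0] if pos < breakpoint else parents[1]


# The same set of parents listed in the other order describes a different ancestry.
SWAPPED = dict(ARG)
SWAPPED[1] = ((4, 3), 0.5)

print(parent_at(ARG, 1, 0.2), parent_at(SWAPPED, 1, 0.2))
```

The definition is tight only because the order of the parents is treated as data; any tool that reads the parents as an unordered set loses the distinction between `ARG` and `SWAPPED`.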

2 replies

jeromekelleher Jan 25, 2022
Maintainer Author

Good point, but it's the same thing. The parents array stored with the nodes is the same thing as having a table of (child, parent) edges - one is just a denormalised version of the other. Either way you must regard the ordering as significant.
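The equivalence can be sketched as a round trip between the two forms. The function names `to_edges` and `to_parents` and the example node IDs are hypothetical; the point is only that the conversion is lossless provided the order of rows (or of parents) is preserved.

```python
def to_edges(parents_by_node):
    """Normalise: node -> ordered parent list becomes a list of (child, parent) rows."""
    return [
        (child, parent)
        for child, parents in parents_by_node.items()
        for parent in parents
    ]


def to_parents(edges):
    """Denormalise: (child, parent) rows become node -> ordered parent list."""
    parents_by_node = {}
    for child, parent in edges:
        parents_by_node.setdefault(child, []).append(parent)
    return parents_by_node


nodes = {1: [3, 4], 3: [5], 4: [5]}
edges = to_edges(nodes)
print(edges)
print(to_parents(edges) == nodes)  # round trip preserves parent order
```

Sorting the edge rows by parent ID, as a generic graph library might, would silently change which parent is read as the left one.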

hyanwong Jan 25, 2022
Maintainer

Yes, absolutely. The order is still important, and sometimes not appreciated as such. I just think we need to again be a little careful that some people are not thinking of nodes and edges separately.

jeromekelleher · 2022-01-25T13:50:55Z

jeromekelleher
Jan 25, 2022
Maintainer Author

This is a good discussion, so I'm going to summarise what I think are the key points:

1. The standard single RE node Griffiths-like ARG encoding requires us to adopt some extra rules about the direction that ancestry flows through an RE node in order to unambiguously define the embedded ancestry (==set of marginal trees). This is well known.
2. This standard ARG format cannot be simulated simply/efficiently because we cannot know ahead of time what order the parent edges will occur in.
3. This issue is simply resolved by storing two nodes per recombination event, and storing the breakpoint on the node for the (say) left lineage.
4. In order to simulate an ARG efficiently (avoiding the Big ARG), you must keep track of the number of samples each lineage (==ARG edge) is ancestral to. This must take the form of a set of disjoint intervals along the genome because of recombination.
5. In order to recover paths that the ancestry took through the graph in a simulation (to recover the trees, or compute a likelihood, or...), you must compute these sets of intervals. What you need in order to do any practical calculation on the ARG is precisely the information that you threw away when outputting the bare nodes and edges.
6. Therefore, there is no point in storing a simulated ARG without also storing the intervals of ancestry along each edge. You must have them in order to run the simulation in the first place, and you also must have them in order to do any calculations.
7. The same argument must hold for an inferred ancestry. There is no way that an ARG could be inferred without having some representation of the ancestral material that is associated with each edge. (I don't know how to illustrate this, but I'm certain it must be true)
8. You cannot decouple the ancestral material from an ARG. The graph is always defined with respect to the ancestry of a given set of samples (so, it doesn't make sense to simulate an ARG forwards in time, it's only something you can define retrospectively). If ancestral material is not taken into consideration we are left with the Big ARG. The flow of ancestral material is implicitly defined in the standard encoding, and must be reconstituted each time a practical calculation is undertaken.
9. It is much simpler, more efficient, and less redundant to explicitly define the passage of ancestral material through the graph by storing the set of intervals carrying ancestral material on each edge.

The standard single RE-node-plus-breakpoint representation is also limited in what can be modelled. Gene conversion or multiple recombination between two parents along a chromosome cannot be represented without changing the encoding and adding additional shared logic. For example, you can't represent the result of a standard Wright-Fisher simulation in this form, and it would probably be quite challenging to represent an inferred ancestry from within a densely sampled pedigree (e.g., the 1000 Bulls dataset or something).

Explicitly representing the ancestry intervals associated with edges removes these representational limitations.
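As an illustration of how little logic is needed once intervals are stored on the edges, the sketch below recovers the marginal trees for the small two-sample example from the proof earlier in this thread. It uses plain Python tuples rather than any particular library; the name `marginal_tree` and the numeric values of `L` and `x` are assumptions.

```python
L, x = 10.0, 4.0  # hypothetical genome length and breakpoint

# (child, parent, left, right): ancestry flows from child to parent over [left, right).
EDGES = [
    (1, 2, 0.0, L),
    (0, 3, 0.0, L),
    (2, 3, x, L),
    (2, 4, 0.0, x),
    (3, 4, 0.0, x),
]


def marginal_tree(edges, pos):
    """Return the tree at position `pos` as a child -> parent mapping."""
    return {
        child: parent
        for child, parent, left, right in edges
        if left <= pos < right
    }


print(marginal_tree(EDGES, 2.0))  # left of x: 0 and 1 coalesce in node 4
print(marginal_tree(EDGES, 7.0))  # right of x: 0 and 1 coalesce in node 3
```

No rule about edge order or node IDs is needed: the tree at a position is just the set of edges whose interval covers it.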

7 replies

JereKoskela Jan 25, 2022

I think the point about inference is worded a little strongly. I'm sure it is possible to infer all sorts of things in all sorts of ways. But if your inference doesn't identify the flow of ancestral material, then I think it is unlikely to be useful for biology. Computing all the ways in which ancestry could flow through a given ARG is probably still a daunting task, though I haven't thought about it enough to be sure.

I think the point about forward-in-time simulation needs to be made carefully. It is clearly possible to simulate a particle system forwards in time and extract an ARG from it in a backward pass. The appropriately generalised Moran, Wright-Fisher, and lookdown particle systems all do the job, for example. If you go all the way back to the GMRCA (but still ignore recombinations in nonancestral material), then it is possible to define particle systems which generate an ARG forward in time that terminate in n lineages which carry full ancestral material by conditioning. I would expect the resulting dynamics to be pretty intractable though. I guess the observation is something like "If you want to avoid going all the way to the GMRCA, and want a system you can actually simulate, then all known approaches involve a backward-in-time pass to track ancestral material."

jeromekelleher Jan 25, 2022
Maintainer Author

Thanks both, all good points. Here's some quick clarifications:

> Re point 2, I think you mean that the standard CwR model cannot store the ARG during simulation efficiently (we previously argued that an ARG structure doesn't need to be generated by the CwR: other models to create ARGs are available, and anyway, I suspect a per-generation WF simulation would be fine, because we know all the parents are in the previous generation).

I'd bet against that, I think this is true of any retrospective population model that generates a one-RE node ARG. Unless you identify the lineages as "left RE node x" and "right RE node x", and then be careful to cache edges until both RE nodes have been seen as children, and then output the edge with "left RE node x" first.

> Re: point 3 about 2 nodes, this only works for the Griffiths ARG, of course; also, storing the breakpoint on the left node falls down in a WF model (for this reason it might be preferable to store the breakpoint on the child node, which I think is WF compatible).

Yes - I was trying to make the thought process incremental. You can fix this flaw in the single RE event case with this change. But, it doesn't work more generally. Therefore you should just record ancestral material on the edges and be done with it.

> I'm not sure I agree about simulating forwards in time.

I meant directly. Of course we can simulate an ARG by doing the standard tree sequence recording stuff and extracting an ARG out of it. But the point is that the ARG is only defined with respect to a sample. It's an inherently retrospective structure.

> Last point is that there is one thing you lose when storing the edges only: you lose information about where recombinations happen in non-ancestral material.

That's true - if it genuinely turns out to be important though you can store it as metadata or whatever. Hard to see how it's important tbh.

> I'm not entirely sure what this means for the record_arg option.

I'm in favour of leaving things as they are, using two nodes. It's biologically reasonable and gets rid of awkward problems. We can store the breakpoint or not in metadata, if we like, but I see no motivation for breaking the likelihood code now.

> the 2 node version still seems very unintuitive to me, and won't generalise easily to non-Griffiths models, e.g. with multiple chromosomes or GC,

How is that? We have two nodes, one per parental genome. We have (tree sequence) edges everywhere ancestral material flows, so it's completely general. (I'm beyond caring about how general non-edge annotated ARGs of various forms are at this point: it's a broken data structure in my opinion.)

> I think the point about inference is worded a little strongly. I'm sure it is possible to infer all sorts of things in all sorts of ways. But if your inference doesn't identify the flow of ancestral material, then I think it is unlikely to be useful for biology.

My point here was about inferring an ARG, specifically @JereKoskela (not things about them). If you infer an ARG, you must output it (say, as a single RE node standard ARG). In order to have generated the nodes in this ARG, you must have reasoned about the flow of ancestral material through the graph - how else could you be sure that it's a well-formed ARG and actually corresponds to something which has well-defined trees at all positions on the genome? To be able to output an ARG which actually is an ARG, you must have gone through the process of going through these interval calculations.

Put it another way: suppose I generate a random Graph and choose some fraction of the nodes to be RE events. Throw random breakpoints on those. What do you think the chance of it corresponding to a well-formed ancestry is? (Close to zero is my guess)

hyanwong Jan 25, 2022
Maintainer

> > the 2 node version still seems very unintuitive to me, and won't generalise easily to non-Griffiths models, e.g. with multiple chromosomes or GC,

> How is that? We have two nodes, one per parental genome. We have (tree sequence) edges everywhere ancestral material flows, so it's completely general. (I'm beyond caring about how general non-edge annotated ARGs of various forms are at this point: it's a broken data structure in my opinion.)

Well,

(a) other people don't think about or plot it like that, and I think we want to keep the plots as-is in the ARG paper. Of course, we can easily just use my code to convert the output into a one-RE node arg, so that's more of a convenience thing.

(b) I think the 2 node structure breaks when we e.g. have multiple breakpoints, and is therefore only suitable for the non WF Griffiths ARG (therefore something for msprime, not tskit).

(c) I think the concept of having recombination nodes is completely independent from the idea of storing breakpoints on the node (which, as you have argued, is very problematic). I think we want to be careful not to mix the two up. It's reasonable, IMO, to have an edge annotated ARG with single recombination nodes, and this has the strongest biological interpretation. If you have 2 nodes, in a WF model you end up having lots of overlapping recombination nodes, which is very confusing.

Maybe this is something to discuss when we meet up in person, though?

JereKoskela Jan 25, 2022

> My point here was about inferring an ARG, specifically @JereKoskela (not things about them). If you infer an ARG, you must output it (say, as a single RE node standard ARG). In order to have generated the nodes in this ARG, you must have reasoned about the flow of ancestral material through the graph - how else could you be sure that it's a well-formed ARG and actually corresponds to something which has well-defined trees at all positions on the genome? To be able to output an ARG which actually is an ARG, you must have gone through the process of going through these interval calculations.

My point is more pedantic and less substantial than that. I could write down an inference procedure which ignores the data and always returns the same ARG, e.g. a single tree so that there is no question of validity (at least in a model with recurrent mutations). It would be a useless inference method, but illustrates why we need some kind of qualifier to indicate we are talking about sensible methods.

I think (but am not sure without more thought) that inferring a big ARG all the way back to the GMRCA is another, only slightly less dumb example of a similar situation, where you've simulated so much stuff that you can be sure to have a valid ancestry in your graph somewhere, even if you haven't tracked it specifically. That may lie outside of what you mean by "infer an ARG" though, given that it contains a whole bunch of superfluous data as well.

jeromekelleher Jan 25, 2022
Maintainer Author

I think a backwards time WF ARG simulator will be helpful here - I'll put one together ASAP.

hyanwong · 2022-09-16T11:40:36Z

hyanwong
Sep 16, 2022
Maintainer

For reference for future readers, the following discussion talks about why the msprime representation requires 2 (rather than one) recombination node: tskit-dev/msprime#1942

0 replies
