Complete ARGs must keep all ancestors #61

jeromekelleher (Maintainer) · 2022-02-05T18:00:01Z

Consider the following ARG:

    # 3.00┊     5   ┊
    #     ┊  ┏━━┻━┓ ┊
    # 2.00┊  4    ┃ ┊
    #     ┊ ┏┻┓   ┃ ┊
    # 1.00┊ ┃ 2━┳━3 ┊
    #     ┊ ┃   ┃   ┊
    # 0.00┊ 0   1   ┊

You'd assume that this resolves as:

    # 3.00┊  5  ┊  5  ┊
    #     ┊  ┃  ┊ ┏┻┓ ┊
    # 2.00┊  4  ┊ 4 ┃ ┊
    #     ┊ ┏┻┓ ┊ ┃ ┃ ┊
    # 1.00┊ ┃ 2 ┊ ┃ 3 ┊
    #     ┊ ┃ ┃ ┊ ┃ ┃ ┊
    # 0.00┊ 0 1 ┊ 0 1 ┊
    #     0     1     2

However, if we use an approach based on Hudson's algorithm we actually resolve it like this:

    # 3.00┊     ┊  5  ┊
    #     ┊     ┊ ┏┻┓ ┊
    # 2.00┊  4  ┊ 4 ┃ ┊
    #     ┊ ┏┻┓ ┊ ┃ ┃ ┊
    # 1.00┊ ┃ 2 ┊ ┃ 3 ┊
    #     ┊ ┃ ┃ ┊ ┃ ┃ ┊
    # 0.00┊ 0 1 ┊ 0 1 ┊
    #     0     1     2

The ARG clearly shows that 5 is an ancestor of 4 over the interval 0-1, but Hudson's algorithm ignores everything that happens above (further back in time than) the root of a local tree. That's because once a local section of genome has fully coalesced, it's snipped out of the simulation and no longer tracked. Thus, if we want to fully resolve an ARG and retain all information in it, we can't directly use Hudson's algorithm, but must instead use a modification of it that doesn't throw away fully coalesced segments.
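To make the difference concrete, here is a minimal sketch that resolves the toy ARG above into per-site child-to-parent maps, once with Hudson-style snipping and once without. It is not msprime's or tskit's implementation: the names `ARG_LINKS`, `NODE_TIME`, `resolve` and `keep_above_roots` are made up for illustration, and the genome is assumed to consist of two discrete sites (0 and 1) standing in for the intervals 0-1 and 1-2.

```python
# Hypothetical encoding of the toy ARG: child -> [(parent, left, right)],
# meaning the child can pass material to that parent at sites in [left, right).
ARG_LINKS = {
    0: [(4, 0, 2)],
    1: [(2, 0, 1), (3, 1, 2)],  # recombination: site 0 goes via 2, site 1 via 3
    2: [(4, 0, 2)],
    3: [(5, 0, 2)],
    4: [(5, 0, 2)],
}
NODE_TIME = {0: 0.0, 1: 0.0, 2: 1.0, 3: 1.0, 4: 2.0, 5: 3.0}
SAMPLES = [0, 1]


def resolve(links, node_time, samples, num_sites, keep_above_roots):
    """Return one child -> parent dict per site (illustrative sketch only)."""
    n = len(samples)
    # carried[u][x] = number of samples that node u is ancestral to at site x
    carried = {u: {} for u in node_time}
    for s in samples:
        carried[s] = {x: 1 for x in range(num_sites)}
    parent = [dict() for _ in range(num_sites)]
    # Visit nodes from youngest to oldest so children are done before parents.
    for u in sorted(node_time, key=node_time.get):
        for x, count in carried[u].items():
            if count == n and not keep_above_roots:
                # Hudson-style: site x has fully coalesced at u, so snip it out.
                continue
            for p, left, right in links.get(u, []):
                if left <= x < right:
                    parent[x][u] = p
                    carried[p][x] = carried[p].get(x, 0) + count
    return parent


hudson = resolve(ARG_LINKS, NODE_TIME, SAMPLES, 2, keep_above_roots=False)
complete = resolve(ARG_LINKS, NODE_TIME, SAMPLES, 2, keep_above_roots=True)
print(hudson[0])    # no entry for node 4: nothing recorded above the local root
print(complete[0])  # includes the 4 -> 5 edge
```

The only difference between the two modes is the early `continue`: with it, the 4 to 5 relationship at site 0 is lost, which is the discrepancy between the two sets of trees drawn above.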

Also, if we want to simulate a complete ARG we can't use Hudson's algorithm either: even if we retain all of the nodes created before a full coalescence happens, the simulation will be incomplete from the point at which the first full coalescence occurs.

If we want to simulate a complete ARG that contains all the nodes and all relationships between fully coalesced trees and those that have not yet fully coalesced, we must keep the full history for all segments of the genome, and so we must keep simulating all segments until all segments have fully coalesced. Note this is not necessarily back to the GMRCA, so it's not as bad as the Big ARG, but it's certainly much worse than Hudson's algorithm.

This is another reason that a complete ARG is inherently less efficient to simulate than running Hudson's algorithm: not only do we have to store all the nodes at which nothing happens within the local trees, we have to store them above the roots too. I'd imagine many of these nodes have no effect on any other local trees either, so they really are pointless to simulate.

It is potentially useful to know that a particular ancestor that is an internal node in one local tree is also an ancestor of the root of another local tree, so I think the resolve algorithm must be faithful to this information.
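As a rough illustration of the kind of query this enables, the sketch below takes the two fully resolved trees from the diagram, transcribed by hand as child-to-parent dicts, and lists the ancestors that sit above each local root. The helper names `local_root` and `above_root_ancestors` are hypothetical and not part of any library.

```python
# Child -> parent maps transcribed from the fully resolved trees in the
# diagram: first the tree for interval 0-1, then the tree for interval 1-2.
COMPLETE_TREES = [
    {0: 4, 1: 2, 2: 4, 4: 5},
    {0: 4, 1: 3, 3: 5, 4: 5},
]
SAMPLES = [0, 1]


def local_root(tree, samples):
    """Most recent common ancestor of all samples in one local tree."""
    paths = []
    for s in samples:
        path = [s]
        while path[-1] in tree:
            path.append(tree[path[-1]])
        paths.append(path)
    common = set(paths[0]).intersection(*paths[1:])
    # The first shared node on the way up from any sample is the MRCA.
    return next(u for u in paths[0] if u in common)


def above_root_ancestors(tree, samples):
    """Nodes recorded above the local root, from youngest to oldest."""
    u = local_root(tree, samples)
    ancestors = []
    while u in tree:
        u = tree[u]
        ancestors.append(u)
    return ancestors


for index, tree in enumerate(COMPLETE_TREES):
    print(index, local_root(tree, SAMPLES), above_root_ancestors(tree, SAMPLES))
```

In this toy case node 5 is the root of the second tree and also shows up above the root (node 4) of the first tree. That is exactly the information that disappears if segments are dropped as soon as they fully coalesce.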

We should document that the full ARG option in msprime (`record_full_arg`) doesn't actually keep the full ARG: it records nothing above the local roots (and it's too much trouble to change that).

Does this have any bearing on the likelihood calculations @JereKoskela?

Any thoughts?

JereKoskela · 2022-02-05T18:43:21Z

That all looks correct to me. My mental definition of "full ARG" during the msprime implementation was the smaller graph in which we forget about a site once the local MRCA has been reached. Clarifying that in the docs is a good idea - I agree that we shouldn't change anything in code.

There are no implications for either likelihood calculation. Edges above local roots don't affect genetic diversity in the sample, so sampling consistency together with the Markov property of the ARG means that we are effectively averaging them out entirely.
