Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support internal samples in simulation through pedigree #1855

Open
jeromekelleher opened this issue Oct 5, 2021 · 22 comments
Open

Support internal samples in simulation through pedigree #1855

jeromekelleher opened this issue Oct 5, 2021 · 22 comments
Milestone

Comments

@jeromekelleher
Copy link
Member

A simplification made in #1846 was to disallow samples that are internal nodes in the pedigree (i.e., samples cannot have children). It would not be difficult to implement, it just needs some thinking through exactly what the semantics are.

@apragsdale
Copy link
Contributor

Ran into the error pointing me to this issue. I am interested in this as well.

@hyanwong
Copy link
Member

hyanwong commented Nov 9, 2021

I am also interested, as I would like some ground-truth examples of simulations with non-contemporaneous samples and known pedigrees to test tsinfer (see tskit-dev/tsinfer#602).

@apragsdale
Copy link
Contributor

I'm thinking about this a bit more. Based on some recent discussion in the msprime Slack channel, I realized that the DTWF model also doesn't allow ancient samples to be direct ancestors of more recent samples, so it's not possible to sample direct (grand)parent-offspring relationships under DTWF. It would be great to be able to do this within the pedigree (e.g. simulate a bunch of trios within a population).

Would there need to be serious reworking of the backend, or is it mostly a matter of testing to make sure labeling ancient individuals as samples within the pedigree is doing what we expect? From #1846, it was hard to tell what the sticking point is here.

@jeromekelleher
Copy link
Member Author

jeromekelleher commented Nov 24, 2021

There's a note on it here.

In principle it's simple - we just have to replace the existing segment chain for the sample a with a new chain (0, L, a) (so that we have ancestry everywhere for our sample) and insert edges for every segment x in the old chain, like

for (x = existing_segments_for_a_head; x != NULL; x = x->next) {
     if (x.node != a) {
         add_edge(x.left, x.right, a, x.node)
      }
}

(excuse the mixed up pseudocode!)

Hmm, that seems pretty straightforward actually - fancy having a go at it on the Python algoriths.py version @apragsdale ?

@apragsdale
Copy link
Contributor

apragsdale commented Nov 24, 2021 via email

@abureau
Copy link

abureau commented May 12, 2022

This functionality to sample subjects internal to the pedigree would be very useful to research I am conducting on probabilities of rare variant sharing by related individuals. I hope this is implemented soon in msprime.

@jeromekelleher
Copy link
Member Author

Thanks for letting us know you're interested in this feature @abureau. There's no immediate plans to work on it unfortunately, unless anyone wants help out by doing some coding/testing.

@abureau
Copy link

abureau commented May 18, 2022

I would be happy to test the implementation, but I do not master msprime to the point to code a new feature.

@jeromekelleher
Copy link
Member Author

Thanks @abureau - we'll definitely call on your help if/when it gets implemented. Unfortunately I can't dedicate any of my own time here, but I'm happy to help if someone else wants to do a bit of model coding.

@diegovelizo
Copy link

I just ran into the same error message pointing me to this issue. It would be really useful if this feature was implemented.

@jeromekelleher
Copy link
Member Author

I think the status is the same @diegovelizo - nobody has implemented it yet, unfortunately. I think we need someone who is motivated to code this up in the Python prototyper (algorithms.py) and test it a bit. The actual C implementation would be quite easy then.

@abureau
Copy link

abureau commented May 16, 2024

I secured funding from NSERC that I can use to pay a developer of msprime interested in implementing sampling of subjects internal to the pedigree. That person needs to be based in Canada so I can pay him or her from my NSERC funds. Anyone in Canada interested, please contact me.

@petrelharp
Copy link
Contributor

Wow, congratulations! Great news!

@jeromekelleher
Copy link
Member Author

That's great news @abureau! Would you like to have a chat soon to discuss?

@abureau
Copy link

abureau commented May 23, 2024

Thanks for offering to chat about this implementation @jeromekelleher . That will be useful once I have identified the programmer who will do the job. I will contact you then.

@jeromekelleher
Copy link
Member Author

Sounds good to me @abureau, good luck!

@abureau
Copy link

abureau commented Jul 30, 2024

I know it's unusual to contact someone in a Github issue, but I emailed you last week @jeromekelleher about resolving this issue and I want to make sure my emails were not blocked by your spam filter. If you need more time to reply, no problem.

@jeromekelleher
Copy link
Member Author

I just merged #2321 which I think probably solves the issue, but with minimal testing. What we need to do in order to close this issue is improve testing. There's three broad classes of test here, which I'll go through separately.

C Unit tests.

These are to make sure that we don't make an C level mistakes, like accessing uninitialised memory etc. There is basic C testing in #2321, but I'd like to see an additional test_replicates_internal_samples test, and also a larger example with some internal samples. We may want to update verify_pedigree function to actually check some properties about internal samples also (but I think it's mainly invoking msprime's internal consistency checks, which are thorough).

Python Unit tests

These are the most important tests, and should be thorough. Testing style has evolved over time here, but I'm seeing the value of simple examples recently. I'd like to see some worked out examples of simulating through a pedigree with internal samples in a variety of cases, mainly (1) all nodes are samples; (2) we only have internal samples (not sure what'll happen there, actually; should probably add this to the C tests too); one of the nodes in an internal individual is a sample and the other isn't (not sure what happens there either); we have an internal sample in a disconnected bit of the pedigree. That's all I can think of right now, but thinking through all the awkward corner cases here is the real work.

Statistical tests.

These are done in verification.py, and are run manually.

  • Add a class DtwfVsPedigreeInternalSamples(DtwfVsPedigree) which tests a few examples with internal sampling against the DTWF. It's not clear to me that the DTWF will really support "internal samples", but I guess that's something to characterise and document.
  • Add a test to the DtwfVsRecapitatedPedigree class which includes internal samples

Comparisons with DTWF seem to be our main statistical validation here, so hopefully that'll work.


I think @ThierryAM and @PandaGab are keen to get stuck in with doing the coding here, but are new to the codebase (and area). Others interested who might like to help could suggest some example code or ways to test for the the properties we expect of trees with internal pedigree samples.

Ping @apragsdale @gregorgorjanc @sgravel

@jeromekelleher jeromekelleher added this to the 1.3.3 milestone Sep 7, 2024
@gregorgorjanc
Copy link
Member

@jeromekelleher I am keen to start with some Python tests. Can you please point me to a file and ideally line in the code base where I can see current tests for this functionality?

I am scratching my head how to mark nodes as samples in this setting. In #2321 (comment) I marked an individual as a sample when importing pedigree which then makes both of individual’s nodes samples.

@jeromekelleher
Copy link
Member Author

The test_pedigree.py is where most of the tests would go. The most basic think would probably to add some more cases to TestSimulateThroughPedigree with internal samples, so at least we're running the code. In the simple test I did for the algorithms.py prototype here I marked a specific slice of the nodes as samples.

This would be a good start anyway!

@abureau
Copy link

abureau commented Oct 24, 2024

@jeromekelleher I have modified TestSimulateThroughPedigree and the simulate_pedigree function to specify generations of subjects as internal samples. I tested "(1) all nodes are samples" and it worked fine on 2 generations and "(2) we only have internal samples", and there I get Fatal Python error: Aborted with core dump.

I don't see an interest in simulations where probands are not samples, so I would not mind to return an error in that case. Do you agree?

In terms of submitting my changes, I am not sure whether I should open a pull request with my test that crashes msprime. Should I open one with the setting that works (all nodes are samples)? In the meantime, my version of test_pedigree.py that crashes msprime in my setup is here. Different scenarios are obtained by setting internal_sample_gen, see line 562 for an example.

@jeromekelleher
Copy link
Member Author

Can you open a PR with your updates please @abureau? That'll make it easier to see the changes and we can discuss how best to split it up.

Sounds like you found something we need to fix 👍

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

7 participants