Estimation of allele freq with metafounders #148
In |
As discussed today there might be multiple ways to handle this so let's get back to this once we have other engineering done;) |
A note. I recall that AW reported that estimating allele frequency (AF) sometimes made genome-wide imputation worse, which I find surprising. After our look at the code yesterday, I think this observation may be due to the way AF is estimated: we collect information from every pedigree member and find the AF that best explains their observed genotypes (via Newton optimisation). This approach ignores that some of these individuals are related. Imagine we have lots of data from one family and far less from other families: that family will dominate the result, whereas we would ideally weight by relationships between individuals. More importantly, it does not estimate the right AF. It estimates the AF of the whole population spanning all generations, which is not what we need for peeling: for peeling we need the AF in the founders (or metafounder(s)). In some cases the two differ little, but I know for sure that they can differ quite a bit in others, for example long pedigrees with selection! |
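To make the point above concrete, here is a minimal sketch of a pooled Newton estimate of AF. It is not AlphaPeel's actual code: the function name `estimate_allele_freq`, the three-state genotype likelihoods, and the Hardy-Weinberg assumption are all illustrative assumptions. Every individual contributes equally to the log-likelihood, regardless of family or generation, which is exactly why one heavily genotyped family can dominate the estimate.

```python
def estimate_allele_freq(genotype_likelihoods, start=0.5, lower=0.01, upper=0.99,
                         max_iter=50, tol=1e-10):
    """Newton's method for the allele frequency p that maximises
    sum_i log(a0*(1-p)^2 + a1*2p(1-p) + a2*p^2),
    where (a0, a1, a2) are individual i's likelihoods for genotypes 0, 1, 2.
    Hypothetical helper: relationships between individuals are ignored."""
    p = start
    for _ in range(max_iter):
        d1 = 0.0  # first derivative of the log-likelihood
        d2 = 0.0  # second derivative of the log-likelihood
        for a0, a1, a2 in genotype_likelihoods:
            f = a0 * (1 - p) ** 2 + a1 * 2 * p * (1 - p) + a2 * p ** 2
            f1 = -2 * a0 * (1 - p) + 2 * a1 * (1 - 2 * p) + 2 * a2 * p
            f2 = 2 * a0 - 4 * a1 + 2 * a2
            d1 += f1 / f
            d2 += f2 / f - (f1 / f) ** 2
        if d2 >= 0:
            # Not locally concave (or no information): stop rather than step uphill blindly
            break
        new_p = min(max(p - d1 / d2, lower), upper)  # Newton step, kept inside the bounds
        converged = abs(new_p - p) < tol
        p = new_p
        if converged:
            break
    return p


def one_hot(genotype):
    """Likelihoods for a genotype observed without error (0, 1 or 2)."""
    return tuple(1.0 if g == genotype else 0.0 for g in range(3))
```

With error-free genotypes this reduces to simple allele counting, so `estimate_allele_freq([one_hot(g) for g in [0, 0, 0, 1]])` should land on 1/8. An ungenotyped individual with likelihoods `(1.0, 1.0, 1.0)` contributes nothing to either derivative.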
Notes from the meeting on estimating the allele frequencies in AlphaPeel: Current approach: Planned updates to code: To include the alternative method in addition to the current method.
So Apologies, this is a little messy. I will review and update this comment later. Just adding the current ideas. |
@gregorgorjanc @RosCraddock Thanks for the summary. There is a point I would like to stress. |
Thank you for this comment, @XingerTang. I can see how re-estimating the alternative allele frequency after each peeling cycle may not lead to any improvement, particularly in pedigrees which are sparsely genotyped and reliant on the original allele frequency estimate. However, I am not sure how this could introduce noise. Is this because accuracy is lowest in the earlier generations? Or potential conflict after grouping and estimating per metafounders? Part of the reason for this alternative method is to account for the relationships between the genotyped individuals and those in the base population in estimating the alternative allele frequency. This will be particularly important in pedigrees with many generations, selection, and/or high random genetic drift. Re-estimating after each peeling cycle would consider this. Alternatively, we could consider the relationship between the genotyped individual and the base population/metafounder in the original estimate prior to peeling (e.g., with the linear method). However, we thought it would be worth testing the alternative method first to see if that would be sufficient. |
The reasons that the reestimation may introduce noise are:
Moreover, our original Baum-Welch algorithm can already do the information propagation across generations. |
@XingerTang we should think of a generative model. In the base population, genotypes coded as 0, 1, 2 (skipping the distinction between the two hets) come from a binomial distribution with frequency
Hmm. My understanding is that we would use metafounder
Yes.
Yes.
I don't follow this part. We already use I think we are fine - we will only update Let's test this and see how it works. We can always go back if needed. |
@RosCraddock @XingerTang discussion in #142 (comment) is relevant though - if we have metafounders as individuals in the pedigree and we get their genotype probabilities, then we can also get the allele probability easily - if that is the data structure we have, then we don't need to average over the founders ... Sorry, I am lagging behind the code. @RosCraddock maybe you could print out the state of the variables for a simple example when you dig into this and we take it from there. |
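For reference, a small sketch of the two options mentioned above: reading the allele probability straight from one set of genotype probabilities, or averaging over the founders assigned to each metafounder. The four-state ordering (aa, aA, Aa, AA), the function names, and the `MF_1`/`MF_2` labels are assumptions for illustration and may not match the data structures in the code.

```python
def allele_prob_from_genotype_probs(genotype_probs):
    """Alternative allele probability from (P(aa), P(aA), P(Aa), P(AA)).
    The state ordering is an assumption made for this sketch."""
    p_aa, p_aA, p_Aa, p_AA = genotype_probs
    dosage = p_aA + p_Aa + 2 * p_AA  # expected number of alternative alleles
    return dosage / 2


def average_over_founders(founder_probs, founder_group):
    """Average the allele probability of founders within each metafounder group.
    founder_probs: {founder_id: genotype probabilities}
    founder_group: {founder_id: metafounder label}"""
    totals, counts = {}, {}
    for founder, probs in founder_probs.items():
        group = founder_group[founder]
        totals[group] = totals.get(group, 0.0) + allele_prob_from_genotype_probs(probs)
        counts[group] = counts.get(group, 0) + 1
    return {group: totals[group] / counts[group] for group in totals}
```

If the metafounders themselves carry genotype probabilities, only the first function is needed; the second is the fallback when only founder probabilities are available.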
@XingerTang @gregorgorjanc To briefly summarise the steps going forward for metafounders and allele frequency estimation:
|
Noted with thanks! Very good reasoning and a plan forward!!! |
From my perspective, better estimates of genotype probabilities down the pedigree are already achieved by Baum-Welch. The only way we could generate new estimates is by taking the genotype probabilities resulting from Baum-Welch and calculating the alternative allele frequency from them, which would certainly lose information relative to what Baum-Welch itself produced. I don't see how we could use the alternative allele frequency to improve the genotype probabilities.
I just realized the founders' genotype probabilities don't have to be phased, my apologies.
The key is that our current implementation re-estimates the genotype probabilities with the aaf generated by Newton's method before peeling, when the first generation knows nothing about the later generations. So in this case, new information is introduced through the anterior generated by re-estimation with the aaf. The squaring issue affects the alternative method much more because we would use only the founder generation's genotype probabilities to generate the alternative allele frequencies, which are then used to generate new genotype probabilities for updating that same founder generation. The squaring issue also occurs in the original implementation, but the squared portion is much smaller because the genotype probabilities of many generations are involved. |
We are going in circles convincing each other without any results. Let's try and see;) We are doing this step because we want to accommodate multiple metafounders, where we will need a way to give different base allele frequencies.
But, our current estimation of MAF is based on genotypes from across the pedigree, so there is a "loop" of reusing the same information in the current system as well. Let's get some tests done and see;) |
@gregorgorjanc @XingerTang I have started the work on estimating alternative allele frequency with metafounders and have a couple of comments/questions that I wanted to note.
I am still going ahead with the previously mentioned steps, but wanted to note these comments down!
|
@RosCraddock yeah well spotted on the |
@RosCraddock indeed, allele freq will matter for small datasets/pedigrees or when most of the data is away from the founders - the typical prior vs likelihood information setting. |
@RosCraddock I advise that you develop a small example where you have two families, each coming from a different background: say, male 1 has genotype 0 and female 2 has genotype 1 (hence their population allele frequency is (0+1)/(2*2)=0.25), then male 3 has genotype 1 and female 4 has genotype 2 (hence their population allele frequency is (1+2)/(2*2)=0.75). Then create two siblings from each of these pairs of parents (genotype some) and cross the siblings at random to get a 3rd generation (genotype some). You can then play around with estimation of the joint base population allele frequency (which would be (0+1+1+2)/(2*4)=0.5) or separate/metafounder allele frequencies (which would hopefully get closer to 0.25 and 0.75). I say "play" because, depending on who is genotyped (some 2nd and 3rd generation, or also some from the 1st/base generation), you will have more or less information to estimate the base population allele frequency/frequencies. |
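The arithmetic in this example can be checked with a few lines. This is only a sketch of the founder allele counting; the helper name `allele_freq` and the `MF_1`/`MF_2` labels for the two backgrounds are made up for illustration.

```python
def allele_freq(genotypes):
    """Allele frequency from genotypes coded 0, 1, 2 (two alleles per individual)."""
    return sum(genotypes) / (2 * len(genotypes))


# Founder genotypes from the example: individuals 1 and 2 form one background,
# individuals 3 and 4 the other (labels are hypothetical)
founder_genotypes = {1: 0, 2: 1, 3: 1, 4: 2}
backgrounds = {"MF_1": [1, 2], "MF_2": [3, 4]}

joint_freq = allele_freq(list(founder_genotypes.values()))
per_background_freq = {
    label: allele_freq([founder_genotypes[i] for i in members])
    for label, members in backgrounds.items()
}

print(joint_freq)           # 0.5
print(per_background_freq)  # {'MF_1': 0.25, 'MF_2': 0.75}
```

These are the targets that the estimates should approach as more of the base generation is genotyped.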
I have implemented and started testing the updating of the alternative allele frequency after each peeling cycle, both with and without the Newton-Raphson method. I tested this on a small pedigree of 22 individuals (5 generations) with either one, two, or five metafounders under four genotype missing rates (see Table 1 below). I compared all runs to one using the true alternative allele frequency to show the best accuracies possible (correlation of true dosage to inferred; individual accuracy only, as this is single-locus). I have also added the default_alt_allele, which uses an allele frequency of 0.5 without updating for all metafounders. The true alternative allele frequency for all founders is 0.45 (close to the default).
Table 1: Individual inference accuracy (via correlation)
General takeaways so far:
These observations are only from a small pedigree, so I will do some further testing on some different examples (including the one described by @gregorgorjanc above). Then, I will review the run_acc_test.py for metafounders with est_alt_allele. However, I am keen to try on the Kennel Club data soon to see if we observe any difference in cross-validation. |
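For anyone reproducing the accuracy measure used in these tables, the correlation between true and inferred dosages can be sketched as below. The function name is hypothetical and this is not taken from run_acc_test.py; it is a plain Pearson correlation written out so that the zero-variance case is explicit.

```python
import math


def dosage_correlation(true_dosage, inferred_dosage):
    """Pearson correlation between true and inferred allele dosages.
    Returns NaN when either vector has no variance (e.g. all true genotypes equal)."""
    n = len(true_dosage)
    mean_true = sum(true_dosage) / n
    mean_inf = sum(inferred_dosage) / n
    cov = sum((t - mean_true) * (i - mean_inf)
              for t, i in zip(true_dosage, inferred_dosage))
    var_true = sum((t - mean_true) ** 2 for t in true_dosage)
    var_inf = sum((i - mean_inf) ** 2 for i in inferred_dosage)
    if var_true == 0 or var_inf == 0:
        return float("nan")
    return cov / math.sqrt(var_true * var_inf)
```

The NaN case matters for small examples: if every individual has the same true genotype, the correlation is undefined and another measure would be needed.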
@RosCraddock, good job being very systematic! That's the path to learning how these methods work and what is possible in the best setting (your Can I clarify that the |
@gregorgorjanc Thank you! Yes, that is right! |
I repeated the above with the same pedigree, but different true genotypes and thus a different alternative allele frequency. Without grouping of the founders, the true alternative allele frequency was 0.1 (as opposed to 0.45 in the previous test). Here, updating after each peeling cycle without Newton seems to be the optimum. Interestingly, with 11 missing genotypes and five metafounders, a higher accuracy (correlation) was achieved than with the true_alt_allele_prob_file, although I suspect that was by chance.
Table 2: Individual inference accuracy (via correlation) with true alternative allele freq of 0.1.
Afterwards, I developed another small pedigree as described above by @gregorgorjanc. The true alternative allele frequency without any grouping of the founders was 0.5 for Table 3 and 0.25 for Table 4.
Table 3: Individual inference accuracy (via correlation) with true alternative allele freq of 0.5.
Table 4: Individual inference accuracy (via correlation) with true alternative allele freq of 0.25.
Then I tested on a larger pedigree of 1000 individuals, with 500 from one population and 500 from another population. These have around a 50% missing rate in the observed genotypes.
Table 5: Marker accuracy
To summarise: |
I just ran a functional test in which the user inputs the estimated allele frequencies as 0 for all loci, yet all in the pedigree (albeit a small example) have genotype 1 for all loci. I ran this with the Referring to the code for estimating the alternative allele frequencies, the values of maf (i.e., the alternative allele frequency) can range from 0.01 to 0.99 and never reach the extremes of 0 and 1 (unlike the user inputs). So, I redid the above, inputting alternative allele frequencies of 0.01 (instead of 0) with Question: Should we enforce this range (0.01 to 0.99) on the user-inputted alternative allele frequencies, too, or only for estimations ( |
@RosCraddock excellent testing! Yes, when we specify 0 or 1 as allele freq, we get an extreme case that many estimation algorithms will have very hard time to "escape" from - it will depend on the algorithm and data that goes in as you can see from your testing. As you suggest, please do change allele freq to 0.01 if provided lower and to 0.99 if provided higher and issue a warning. Time will tell if this is too restrictive and we could loosen up the limits or make them an argument (with |
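A minimal sketch of the bounding described here, assuming the input is a plain list of per-locus frequencies. The function name and the argument names `lower` and `upper` are hypothetical; the defaults of 0.01 and 0.99 are the limits discussed above.

```python
import warnings


def bound_allele_freq(freqs, lower=0.01, upper=0.99):
    """Move user-provided allele frequencies into [lower, upper] and warn if any changed."""
    bounded = [min(max(f, lower), upper) for f in freqs]
    n_changed = sum(1 for before, after in zip(freqs, bounded) if before != after)
    if n_changed > 0:
        warnings.warn(
            f"{n_changed} allele frequencies were outside [{lower}, {upper}] "
            "and have been set to the nearest limit."
        )
    return bounded
```

Exposing the limits as arguments keeps the option of loosening them later without touching the estimation code.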
@gregorgorjanc @RosCraddock
Originally, while estimating the alternative allele frequency of the whole population, the algorithm looped over the genotypes of all the individuals to calculate the most likely alternative allele frequency.
Now, to estimate the alternative allele frequency of the metafounder, we need to loop over the genotypes of the progeny of the metafounder. Then we need to make a list of progeny for each metafounder.
One way to do that is by going through the pedigree to check the ancestors of each individual, but we might lose some information if we don't do it in the correct order. Is there any better way to do it?
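One possible approach, sketched below under stated assumptions: if the pedigree is sorted so that parents always appear before their offspring, a single pass can carry each individual's set of ancestral metafounders forward, and the progeny lists fall out at the end. The `(id, sire, dam)` tuple layout, the `MF_` prefix for metafounder IDs, and the function name are assumptions for illustration, not the existing data structures.

```python
def progeny_by_metafounder(pedigree, is_metafounder=lambda pid: pid.startswith("MF_")):
    """pedigree: list of (id, sire, dam) with parents listed before offspring.
    Returns {metafounder: [descendant ids in pedigree order]}."""
    origins = {}  # individual -> set of metafounders it descends from
    for ind, sire, dam in pedigree:
        found = set()
        for parent in (sire, dam):
            if is_metafounder(parent):
                found.add(parent)
            else:
                # Raises KeyError if the parent has not been processed yet,
                # i.e. the pedigree is not in parents-before-offspring order
                found |= origins[parent]
        origins[ind] = found

    progeny = {}
    for ind, metafounders in origins.items():
        for mf in metafounders:
            progeny.setdefault(mf, []).append(ind)
    return progeny


example = [
    ("1", "MF_1", "MF_1"), ("2", "MF_1", "MF_1"),
    ("3", "MF_2", "MF_2"), ("4", "MF_2", "MF_2"),
    ("5", "1", "2"), ("6", "3", "4"), ("7", "5", "6"),
]
print(progeny_by_metafounder(example))
```

Individual 7 descends from both metafounders and therefore appears in both lists, so the later estimation step would need to decide how to weight such admixed individuals.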