diploSHIC detects too many soft / linkedSoft regions in the genome #42
hey there-- can you share your complete simulation parameters for soft sweeps? i.e. the command line you are using. thanks.
Sure:

```sh
# simsoft.sh
arrSelPos=( 0.045454545454545456 0.13636363636363635 0.22727272727272727 \
            0.3181818181818182 0.4090909090909091 0.5 0.5909090909090909 \
            0.6818181818181818 0.7727272727272727 0.8636363636363636 \
            0.9545454545454546 )
for ((i = 0; i < ${#arrSelPos[@]}; ++i)); do
```
Thanks for the analysis. I think I might have figured out a potential reason. Our samples appear to have a fairly high kinship coefficient, distributed around 0.14 as estimated by popkin, yet each individual has an inbreeding coefficient near 0. This suggests the samples are related (half sibs?) while their parents are unrelated. I suspect the discoal simulations represent the parental generation of our samples, but not the samples themselves. I am thinking of adding an extra simulation step, drawing blocks from the discoal-simulated haplotypes to form samples such that the distribution of kinship coefficients fits our data, before training with diploSHIC again.
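A rough sketch of that block-drawing step, using toy 0/1 haplotypes in place of real discoal output. The function name, block count, and data here are illustrative assumptions; the loop that would tune block sizes until the popkin kinship distribution matches the real data is left out, so this only shows the resampling mechanics:

```python
import numpy as np

rng = np.random.default_rng(0)

def recombine(parent_a, parent_b, n_blocks=10):
    """Build one haplotype by alternating contiguous blocks from two parents."""
    n_sites = parent_a.shape[0]
    # Random internal breakpoints split the sequence into n_blocks segments.
    bounds = np.sort(rng.choice(np.arange(1, n_sites), n_blocks - 1, replace=False))
    child = parent_a.copy()
    use_b = False
    start = 0
    for end in list(bounds) + [n_sites]:
        if use_b:
            child[start:end] = parent_b[start:end]
        use_b = not use_b
        start = end
    return child

# Toy stand-in for discoal output: 8 haplotypes x 100 biallelic sites.
haps = rng.integers(0, 2, size=(8, 100))
# Offspring of haplotypes 0 and 1; repeating with shared parents yields half sibs.
child = recombine(haps[0], haps[1])
```

Drawing several children from overlapping parent pairs would then mimic the half-sib structure before the feature vectors are recomputed.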
Hi, here is the number of windows classified into each of the 5 categories by diploSHIC for 3 different populations, based on 10 different simulation datasets with 10 replicates per simulation, yielding 100 replicates in total. wilding.rawnbPredictionByClass.w110000.reps.pdf Looking at a specific example, the insecticide resistance gene Gste in Anopheles gambiae shows a similar result. However, I noticed that adding an extra layer of stringency based on the probability of being neutral drastically decreases the amount of soft and linkedSoft calls in the region. Below are the different outputs I obtained after reclassifying as neutral any window whose neutral probability exceeds the threshold. The vertical bar marks the genomic coordinates of Gste. wilding.probaFilter.examplesGste.pdf Based on these observations, I am thinking of calling a selective sweep only in windows with a probability of being neutral < 0.001. But I was wondering whether my approach of using 100 replicates and increasing the stringency of sweep calls is sound.
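The probability filter itself is simple to apply to a prediction table. This sketch uses a toy data frame, and the column names (`predClass`, `prob(neutral)`) are assumptions to be checked against the actual diploSHIC prediction-file header:

```python
import pandas as pd

# Toy stand-in for a diploSHIC prediction file; in practice, read the real
# file with pd.read_csv(..., sep="\t") and check its column names.
preds = pd.DataFrame({
    "predClass":     ["soft", "linkedSoft", "hard", "neutral"],
    "prob(neutral)": [0.0005, 0.20,         0.0001, 0.95],
})

CUTOFF = 0.001  # keep a sweep call only if P(neutral) < 0.001
# where() keeps the original class where the condition holds and
# replaces it with "neutral" everywhere else.
filtered = preds["predClass"].where(preds["prob(neutral)"] < CUTOFF, "neutral")
print(filtered.tolist())  # → ['soft', 'neutral', 'hard', 'neutral']
```

Counting `filtered.value_counts()` before and after different cutoffs reproduces the kind of comparison shown in the attached PDF.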
hi @jdaron-- when i see results like this I again think that the baseline demographic model probably isn't a good fit to your empirical data. have you plotted simulations vs empirical summaries at some point?
Here is the PCA plot of the simulations; contrary to what you show, they don't separate nicely. I also tried to reproduce the same PCA plot with the data from melop but couldn't get the figure you show either. To make sure I am performing the PCA correctly: you use as input the .fvec data in the trainingSets folder, right? Here are the complete simulation parameters I used for soft sweeps.
i was not using those features (though one certainly could)-- instead i was using other summary stats of the simulations. what i was suggesting, however, was to plot the simulations against the real, empirical data. that will tell you whether your simulations are an adequate representation of your data.
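one way to make that comparison concrete: pool the simulated and empirical per-window summary stats, standardize them, and project both into the same PCA space. a minimal numpy sketch with random stand-in matrices (the real input would be the per-window stats from the diploSHIC output files; shapes and values here are placeholders):

```python
import numpy as np

rng = np.random.default_rng(1)

# Stand-ins for per-window summary statistics (rows = windows, cols = stats).
sim_stats = rng.normal(0.0, 1.0, size=(200, 12))
emp_stats = rng.normal(0.5, 1.0, size=(150, 12))

# Pool, standardize, and project onto the top two principal components so
# simulated and empirical windows live in one shared reduced space.
pooled = np.vstack([sim_stats, emp_stats])
z = (pooled - pooled.mean(axis=0)) / pooled.std(axis=0)
_, _, vt = np.linalg.svd(z, full_matrices=False)  # rows of vt are PC axes
coords = z @ vt[:2].T
sim_pc, emp_pc = coords[:200], coords[200:]
# Overlaying sim_pc and emp_pc as scatter plots shows whether the simulations
# cover the region of stat space that the real data occupy.
```

fitting the projection on the pooled matrix (rather than on the sims alone) is deliberate: it keeps both clouds comparable on the same axes.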
Thanks for your quick answer. Here is the plot of the summary stats for the simulated data (divided into the 5 predClass categories) against the empirical data.
this is nice @jdaron -- any chance you could plot this in PCA space to see a dimensionality-reduced version? also i'm a bit concerned about the soft and linked-soft simulations here-- how are those producing nDiplos=0?
Hi @andrewkern, here is the PCA plot of the summary stats plotted above. As you mentioned, the mean value of the nDiplos stat for soft sweeps is 0.08773807 [+/- 0.0004394302], ranging from 0.009433962 to 0.117187500, while the nDiplos stat for hard sweeps ranges from 1 to 20. For the empirical data, nDiplos ranges from 2 to 32.
okay now i'm more confused! the nDiplos stat you are reporting above is the output from diploSHIC in fvecSim mode? the PCA plot above looks okay to me for hard, hard-linked, and neutral, but something strange happened with your soft sweep simulations and/or feature vector calcs, I think. Any chance I could see the fvec files created?
So I've re-performed some analyses and found the issue with the nDiplos stat. As you mentioned, all the stats I am reporting are calculated by diploSHIC fvecSim (makeFeatureVecsForSingleMsDiploid.py), for which I skip the normalization by normalizeFeatureVec in order to get the raw values. After a more careful inspection of all the fvec files in the rawFVFiles folder, I realized the issue with the nDiplos stat was introduced during the creation of the training set by diploSHIC makeTrainingSets. I corrected it and now all the stats look normal. Sorry for the trouble. Here are the boxplots of all the stats and the PCA reduction. Based on the PCA, I have the feeling that the neutral simulations do not fit the upper-right part of the empirical data well enough, which may be causing my problem of over-prediction of soft/linkedSoft.
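A quick per-class range summary of the raw (unnormalized) feature vectors makes this kind of corruption easy to spot: a sweep class whose maximum nDiplos is below 1 cannot be counting distinct diplotypes. The column names here are illustrative assumptions, not diploSHIC's exact header:

```python
import pandas as pd

# Toy stand-in for raw (unnormalized) feature vectors pooled across classes.
fvec = pd.DataFrame({
    "simClass": ["soft", "soft", "hard", "hard", "neutral"],
    "nDiplos":  [0.08,   0.01,   1.0,    20.0,   20.0],
})

# Per-class min/max; nDiplos is a count of distinct diplotypes, so any class
# whose maximum is below 1 signals a broken stat.
ranges = fvec.groupby("simClass")["nDiplos"].agg(["min", "max"])
suspect = ranges[ranges["max"] < 1]
print(suspect.index.tolist())  # → ['soft']
```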
hey this looks great @jdaron! glad we were able to find the bug together here. i'm curious what this means for the number of sweeps in the empirical data-- there are clearly a bunch now!
Hello @andrewkern! I decided to plot simulations against empirical data to check whether they approximate my data well enough, and the resulting PCA plot (attached) looks really bad. I then looked at boxplots (also attached) of all the features to find which ones ruin the picture. It turns out that the H statistics in the simulations and in the real data are very different. I think the H statistics in the simulations look strange, while in the real data they seem okay. Could you please help me understand what could have gone wrong? I am working with haploid individuals. For the PCA, I took the central-window files from outStatsDir for the simulations and statFile for the empirical data. The sample size in the simulations is (accidentally) 2 haplotypes smaller than in the real data. Do I understand correctly that this should not have greatly affected the result? Thank you!
Hi @MarySelifanova! Sorry to be so slow to respond to this; I had missed this message. Looking at it, H and Theta_w both appear elevated in the 'real' data vs your training set. I agree the PCA plot looks way off too. I wouldn't expect the sample size in your sims to greatly affect things here. Generally, it looks like you might need to increase the mutation rate in your sims?
I first estimated rho, the gene conversion rate, and conversion tract lengths using mlrho, following the method in Lynch et al. (2014; Genetics), from my data of 20 unphased diploid individuals (40 chromosomes). The demographic history was estimated with MSMC2 from 10 unphased individuals, with the very recent and very ancient parts of the history trimmed since they are not accurately inferred.
The per-base pi for our species is only around 0.001, which results in very few segregating sites in a 2 kb subwindow, the size used in the Anopheles example. I therefore used a 20 kb subwindow instead.
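To put numbers on "very few segregating sites": under Watterson's formula, E[S] = θ·L·a_n with a_n = Σ_{i=1}^{n-1} 1/i. Treating π ≈ 0.001 as a stand-in for per-site θ (an approximation, since π and θ_w differ under non-equilibrium histories) with n = 40 chromosomes:

```python
# Expected segregating sites per subwindow under Watterson's formula,
# E[S] = theta * L * a_n, using pi ~= 0.001 as an approximate per-site theta.
n_chroms = 40
theta_per_site = 0.001
a_n = sum(1.0 / i for i in range(1, n_chroms))  # harmonic number H_{n-1}

for window in (2_000, 20_000):
    print(window, round(theta_per_site * window * a_n, 1))
# roughly 8.5 expected segregating sites at 2 kb vs 85 at 20 kb
```

So a 2 kb subwindow carries only a handful of SNPs to compute statistics from, which motivates the switch to 20 kb.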
After simulating with discoal and training, the confusion matrix (attached) shows that the false positive rate for calling neutral windows non-neutral (mainly soft and linkedSoft) is about 12%.
Applying the trained model to the real data, out of 29,253 20 kb subwindows, diploSHIC classified 8,674 as soft, 20,217 as linkedSoft, 46 as hard, 234 as linkedHard, and only 82 as neutral.
I think these numbers don't look realistic and may indicate a mismatch between the simulated data and my real dataset.
I would appreciate any pointers.
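One way to gauge how far off these counts are: if the genome were entirely neutral, a 12% neutral-to-non-neutral misclassification rate alone would predict on the order of 3,500 non-neutral calls, an order of magnitude fewer than observed:

```python
# Observed non-neutral calls vs the number a 12% neutral false-positive rate
# would produce on an all-neutral genome (counts taken from the post above).
total = 29_253
observed_non_neutral = 8_674 + 20_217 + 46 + 234  # soft + linkedSoft + hard + linkedHard
expected_if_all_neutral = 0.12 * total

print(observed_non_neutral, round(expected_if_all_neutral))  # 29171 3510
```

The roughly 8-fold excess over what misclassification alone explains is consistent with the sim-vs-data mismatch suspected above rather than genuine genome-wide sweeps.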