fvecVCF not creating fvec windows for the whole chromosome #60

Open
rileycorcoran opened this issue Jan 7, 2025 · 4 comments

Comments

@rileycorcoran

Hello, I've been trying to get diploS/HIC working with my own data for a while. I've fixed many small errors along the way, but I can't figure out what is going wrong here. I'm running fvecVcf on a whole-genome VCF file and specifying a single chromosome for analysis by name and length. The run appears to succeed: the .err file reports the entire chromosome in 5,000 bp windows. However, the resulting .fvec file (and the corresponding .preds file from predict) contains only seven 5,000 bp windows.

I'm running this on an HPC cluster and using this NCBI reference genome (edited to remove excess header information, to match the reference in the Anopheles example)

Current fvecVCF code (sorry for the variables, I decided not to replace them since their names didn't seem useful for debugging):

diploSHIC fvecVcf diploid \
${in_folder}${chr_file}.recode.vcf.gz NC_055975.1 32206363 \
${out_file} --targetPop ${lake_cap} --sampleToPopFileName ${popfile_folder}${lake}_diploshic.txt --winSize 55000 \
--maskFileName ${reference}.changed_header --unmaskedFracCutoff 0 

example from the .err file:

10060001-10065000 num unmasked snps: 20; unmasked frac: 0.989400
10065001-10070000 num unmasked snps: 6; unmasked frac: 0.988600
10070001-10075000 num unmasked snps: 4; unmasked frac: 0.992200
10075001-10080000 num unmasked snps: 2; unmasked frac: 0.989800
10080001-10085000 num unmasked snps: 7; unmasked frac: 0.991600
10085001-10090000 num unmasked snps: 6; unmasked frac: 0.989400
10090001-10095000 num unmasked snps: 6; unmasked frac: 0.988000
10095001-10100000 num unmasked snps: 1; unmasked frac: 0.992600
10100001-10105000 num unmasked snps: 4; unmasked frac: 0.992000
10105001-10110000 num unmasked snps: 3; unmasked frac: 0.986000
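
To see how many windows carry very few SNPs, lines in the .err format above can be tallied. This is a minimal sketch in Python, assuming every window line follows the `start-end num unmasked snps: N; unmasked frac: F` pattern shown here. The names `parse_err_lines` and `count_sparse`, and the `min_snps` threshold, are illustrative and not part of diploS/HIC.

```python
import re

# Matches lines such as:
# "10060001-10065000 num unmasked snps: 20; unmasked frac: 0.989400"
WINDOW_LINE = re.compile(
    r"^(\d+)-(\d+) num unmasked snps: (\d+); unmasked frac: ([\d.]+)$"
)

def parse_err_lines(lines):
    """Return (start, end, n_snps, unmasked_frac) tuples for matching lines."""
    windows = []
    for line in lines:
        match = WINDOW_LINE.match(line.strip())
        if match:
            start, end, n_snps, frac = match.groups()
            windows.append((int(start), int(end), int(n_snps), float(frac)))
    return windows

def count_sparse(windows, min_snps=5):
    """Count windows with fewer than min_snps SNPs (the threshold is arbitrary)."""
    return sum(1 for _, _, n_snps, _ in windows if n_snps < min_snps)

# Lines copied from the .err excerpt above.
example = [
    "10060001-10065000 num unmasked snps: 20; unmasked frac: 0.989400",
    "10065001-10070000 num unmasked snps: 6; unmasked frac: 0.988600",
    "10070001-10075000 num unmasked snps: 4; unmasked frac: 0.992200",
    "10075001-10080000 num unmasked snps: 2; unmasked frac: 0.989800",
]
windows = parse_err_lines(example)
print(len(windows), "windows parsed;", count_sparse(windows), "with fewer than 5 SNPs")
```

Running the same functions over the full .err file would show whether sparse windows are the norm across the chromosome or confined to particular regions.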

The entire corresponding .preds file:

chrom   classifiedWinStart      classifiedWinEnd        bigWinRange     predClass       prob(neutral)   prob(likedSoft) prob(linkedHard)        prob(soft)      prob(hard)
NC_055975.1     10075001        10080000        10050001-10105000       neutral 0.889942        0.052077        0.018843        0.027607        0.011531
NC_055975.1     10080001        10085000        10055001-10110000       neutral 0.972910        0.013924        0.004727        0.006324        0.002114
NC_055975.1     10085001        10090000        10060001-10115000       neutral 0.961374        0.019087        0.007400        0.008936        0.003203
NC_055975.1     10090001        10095000        10065001-10120000       neutral 0.909300        0.039067        0.018986        0.022648        0.010000
NC_055975.1     10095001        10100000        10070001-10125000       neutral 0.959159        0.019877        0.007435        0.009907        0.003621
NC_055975.1     10100001        10105000        10075001-10130000       neutral 0.978151        0.010995        0.003846        0.005324        0.001684
NC_055975.1     10105001        10110000        10080001-10135000       neutral 0.966017        0.017401        0.006379        0.007378        0.002825
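
The bigWinRange column in these rows is consistent with each classified 5 kb window sitting at the center of a larger window, with five sub-windows of flank on either side. This is inferred from the numbers above rather than taken from the diploS/HIC source, and `big_win_range` with its `n_flank` parameter is a hypothetical helper written only to check that reading.

```python
def big_win_range(classified_start, classified_end, n_flank=5):
    """Expand a classified sub-window by n_flank sub-windows on each side."""
    sub_win_size = classified_end - classified_start + 1
    return (classified_start - n_flank * sub_win_size,
            classified_end + n_flank * sub_win_size)

# Rows copied from the .preds output above: (start, end, reported bigWinRange)
rows = [
    (10075001, 10080000, "10050001-10105000"),
    (10105001, 10110000, "10080001-10135000"),
]
for start, end, reported in rows:
    lo, hi = big_win_range(start, end)
    print(f"{start}-{end} -> {lo}-{hi} (reported: {reported})")
```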

Am I trying to run this with too large an input (i.e. a whole chromosome rather than a 1 Mb segment)? Am I running out of memory? Is my reference not providing enough unmasked SNPs? Is this caused by an error earlier in the pipeline (i.e. fvecSim or training)?

I can provide any other code or data if necessary, and any thoughts or help on this would be greatly appreciated. Thanks!

@andrewkern
Member

Hi @rileycorcoran -- my initial guess is that you are running this using a window size that is so small that there are too few SNPs to calculate stats in most windows. What happens if you increase the window size by 5x?
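
The numbers in this thread imply that `--winSize` spans 11 sub-windows (55000 / 5000 = 11), so scaling it by 5x scales each sub-window by 5x as well. Below is a small sketch of that arithmetic, assuming the count of 11 holds for this run; `sub_window_size` and `N_SUB_WINS` are illustrative names, not diploS/HIC functions or options.

```python
N_SUB_WINS = 11  # assumption inferred from 55000 / 5000 in the command and .err output

def sub_window_size(win_size, n_sub_wins=N_SUB_WINS):
    """Return the per-sub-window length for a given --winSize value."""
    if win_size % n_sub_wins != 0:
        raise ValueError("winSize is not evenly divisible by the sub-window count")
    return win_size // n_sub_wins

for win_size in (55000, 5 * 55000):
    print(win_size, "->", sub_window_size(win_size), "bp per sub-window")
```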

@rileycorcoran
Author

Thank you @andrewkern for the quick reply and the help! Increasing --winSize from 55000 to 275000 did give me many more windows (25 kb rather than 5 kb) in the resulting .fvec and .preds files. I have 3 follow-up questions from this:

  1. I've used 50kb as my window size in other analyses. Is that a reasonable number to use here (assuming I would use --winSize 550000 in order to analyze 50kb sub-windows)? Or is that unnecessarily large?
  2. Should I increase the --totalPhysLen parameter from fvecSim to match whatever I set --winSize to be, or does that size not matter?
  3. Without me going into detail about my filtering parameters first, is there anything you would recommend I do in order to retain more SNPs for diploS/HIC to successfully run with a smaller window size? (I understand if this is too vague of a question without more details about my procedure, but I figured I'd ask regardless).

Thank you again!

@andrewkern
Member

> Thank you @andrewkern for the quick reply and the help! Increasing to --winSize 275000 from 55000 did give me many more 25kb rather than 5kb windows in the resulting .fvec and .preds file. I have 3 follow-up questions from this:
>
>   1. I've used 50kb as my window size in other analyses. Is that a reasonable number to use here (assuming I would use --winSize 550000 in order to analyze 50kb sub-windows)? Or is that unnecessarily large?

550 kb doesn't sound too large to me, but this mostly depends on the recombination rate in your organism. For mosquitoes, this is approximately the size we've used in the past; for humans, even bigger.

>   2. Should I increase the --totalPhysLen parameter from fvecSim to match whatever I set --winSize to be, or does that size not matter?

totalPhysLen should be the total size of your simulated chromosome. Generally that's quite a bit bigger than winSize, but it depends on how you set up the simulation.

>   3. Without me going into detail about my filtering parameters first, is there anything you would recommend I do in order to retain more SNPs for diploS/HIC to successfully run with a smaller window size? (I understand if this is too vague of a question without more details about my procedure, but I figured I'd ask regardless).

For filtering, I'd recommend not using a MAF filter if sequencing depth is adequate (say 10-20x?). That will retain low-frequency variants, which will be informative. As for other QC filters, it's a bit hard to say without more detail.

@rileycorcoran
Author

Thank you for the clarification! It's good to know that something around the size of 550kb has been used in other systems. I don't know the recombination rate for my system (non-model), but it seems like 550kb is a fairly versatile size.

After testing out some sizes, it seems like my HPC doesn't have enough memory to go a lot larger on totalPhysLen, but I'll see how much larger than winSize I can comfortably get that parameter. Regardless of totalPhysLen, should I aim to make the simulated .fvec window size match those of my data's .fvec (i.e. if my data is analyzed in 50kb windows, should the simulated data be analyzed in 50kb windows)? Or does it not matter if the simulated data matches my .vcf data in that way? Apologies if this question doesn't really make sense, I haven't created or worked with simulated data before so I'm still trying to wrap my head around the process.

My sequencing depth was ~10x, so it's good to know that I don't need to use a MAF filter. The rest of my QC filters were relatively unrestrictive, so I'm hoping that allowed me to retain enough SNPs for good analyses.

Thank you again for the help!
