Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ERROR: [__FUNCTION__] invalid allele frequencies in 1st iteration of ngsF-HMM #7

Open
wul88 opened this issue Jan 6, 2023 · 5 comments

Comments

@wul88
Copy link

wul88 commented Jan 6, 2023

Hello, fgvieira

I tried to run ngsF-HMM to find IBD of samples from different sub-populations. I used the Beagle GL files computed previously as below:

angsd -bam bam_All118.list -ref ${refSeq} -out ${outDir}/${outMaf} \
    -uniqueOnly 1 -remove_bads 1 -only_proper_pairs 1 -C 50 -baq 1 \
    -minMapQ 30 -minQ 30 -setMinDepth 106 -setMaxDepth 1954 -minmaf 0.05 -doCounts 1 \
    -GL 2 -doMaf 1 -skipTriallelic 1 -doGlf 2 -SNP_pval 1e-6 -doMajorMinor 1 -P 8

I have .beagle.gz and .mafs.gz files. I extracted first two columns from .mafs.gz file and removed the header, then saved as pos file. The nsites was computed by "cat $pos | wc -l " . The command to run ngsF-HMM is as follows:

ngsF-HMM --geno ${glf2} --lkl --pos ${pos} --n_ind $NSAMS --n_sites $NSITES --freq 0.1 --freq_est 2 \
    --indF inF_transitPar_118AveAuto.txt --out ${outDir}/${out} --n_threads 8 --seed ${thisSeed} \
    --min_epsilon 1e-9 --verbose 2 --log 1

WHERE
glf2 is the Beagle format glf from angsd run shown at the beginning, and inF_transitPar_118AveAuto.txt is a two-column file: 1st column is the inbreeding coefficient from running ngsF and the second column is 0.1 as transition parameter. one row per individual. ( I also used 0.1, 0.1 as shown in tutorial in different run, received the same error.)
AND
thisSeed=$RANDOM
glf2="all118_MafGlf2_allChrs.beagle.gz"
#maf="all118_MafGlf2_allChrs.mafs.gz"
pos="all118_MafGlf2_allChrs_noHeader.pos"
NSAMS=118
NSITES=$((cat ${pos} | wc -l))

outDir=ngsFHMM/"rep_"$thisRep
out="all118depthSet_inFHMM_"$thisRep
( thisRep is replicate number. I ran 10 replicates but every one had the same error.)

--- one of the .err files ---->

  • thisRep=2

  • thisSeed=18950

  • glf2=all118_MafGlf2_allChrs.beagle.gz

  • pos=all118_MafGlf2_allChrs_noHeader.pos

  • NSAMS=118

  • outDir=ngsFHMM/rep_2

  • NSITES=2205512

  • out=all118depthSet_inFHMM_2

  • ngsF-HMM --geno all118_MafGlf2_allChrs.beagle.gz --lkl --pos all118_MafGlf2_allChrs_noHeader.pos --n_ind 118 --n_sites 2205512 --freq 0.1 --freq_est 2 --indF inF_transitPar_118AveAuto.txt --out ngsFHMM/rep_2/all118depthSet_inFHMM_2 --n_threads 8 --seed 18950 --min_epsilon 1e-9 --verbose 2 --log 1

> Header found! Skipping line...

=====
ERROR: [__FUNCTION__] invalid allele frequencies
=====

	: Numerical argument out of domain

-----  one of the .out file ----->
==> Input Arguments:
	geno: all118_MafGlf2_allChrs.beagle.gz
	pos: all118_MafGlf2_allChrs_noHeader.pos
	lkl: true
	loglkl: false
	n_ind: 118
	n_sites: 2205512
	call_geno: false
	freq: 0.1
	freq_est: 2
	e_prob: 1
	indF: inF_transitPar_118AveAuto.txt
	indF_fixed: false
	alpha_fixed: false
	out: ngsFHMM/rep_2/all118depthSet_inFHMM_2
	log: 1
	log_bin: false
	min_iters: 10
	max_iters: 100
	min_epsilon: 0.0000000010
	n_threads: 8
	verbose: 2
	seed: 18950
	version: 1.1.0 (Jan  4 2023 @ 20:27:07)

==> GZIP input file (not BINARY)
==> Reading data
> Sites coordinates
> GENO data
==> Reading initial inbreeding values from file "inF_transitPar_118AveAuto.txt".
==> Setting initial frequency values to: 0.1
==> Calculating initial emission probabilities
==> Printing current iteration parameters

Iteration 1:
==> Forward Recursion
==> Backward Recursion
==> Marginal probabilities
==> Update inbreeding and transition parameter
==> Estimating allele frequencies and calculating emission probabilities

Could you please give me some advice as what I did wrong? Thank you very much in advance!

@fgvieira
Copy link
Owner

fgvieira commented Jan 6, 2023

Could you try running with --freq_est 1?

@wul88
Copy link
Author

wul88 commented Jan 10, 2023

Hi, fgvieira,

Thank you for the quick response. The option of "--freq_est 1" did run. It took over 3 days to finish. I am checking the inbreeding coefficient data now. Meanwhile I have some questions on these runs as welll as using ngsF-HMMplot.R.

  1. This time I only ran 3 replicates that used random seeds, but all three replicate runs generated the exact same results (3 output files each run. I used "diff" command to compare the output of different replicate run) Is this a suspicious behavior that indicates something not right?

Here is the command I used:

ngsF-HMM --geno ${glf2} --lkl --pos ${pos} --n_ind $NSAMS --n_sites $NSITES --freq 0.1 --freq_est 1 \
    --indF 0.1,0.1 --out ${outDir}/${out} --n_threads 8 --seed ${thisSeed} \
    --min_epsilon 1e-9 --verbose 2 --log 1

where:

thisSeed=$RANDOM
glf2="all118_MafGlf2_allChrs.beagle.gz"
pos="all118_MafGlf2_allChrs_noHeader.pos"
NSAMS=118
outDir=ngsFHMM/"rep_"$thisRep
thisRep=$rep     # ($rep -  number of replicate)
  1. I noticed that in the ngsF-HMM/examples/test.sh running ngsF-HMMplot.R as follows:
Rscript ../scripts/ngsF-HMMplot.R --in_file testF-HMM.$ID.$TYPE.ibd --n_ind $N_IND --n_sites $N_SITES --geno testF-HMM.SIM.geno.gz --path testF-HMM.SIM.path.gz --pos testF-HMM.SIM.pos.gz --marg_prob --out testF-HMM.$ID.$TYPE.pdf

I don't have a .path.gz file in my output. What is that file? How can I produce one? Thank you in advance for your help!

@fgvieira
Copy link
Owner

fgvieira commented Jun 2, 2023

Hi @wul88 and sorry for the late reply.

  1. the random seed only has an effect with random starting values
  2. the --path option just adds another track to the plots. I used it to show the "truth", but can also be used to plot some regions of interest. In your case, you can omit it.

@wul88
Copy link
Author

wul88 commented Aug 22, 2023

Hi, fgvieira,
Thank you for your help! The plot script worked. I have over 100 samples, loading pdf of 15 pages becomes time consuming.
Any suggestion on break-up result?

As running ngsF-HMM, I have following concerns. Please advise me.

  1. I have ran with --indF "0.1,0.1" or "r" twice. The results were the same. It seemed that initial values of indF does not have much effect on the final results. I checked the results from all replicate runs, they are the same. The random seeds also has no effect on the algorithm. Is this normal? Or I did something wrong? Here is the command I used:
nSite=$((`zcat $maf | tail -n+2 | wc -l`))
ngsF-HMM --geno ${glf2} --lkl --pos ${pos} --n_ind ${nInd} --n_sites ${nSite} --freq e --freq_est 1 \
    --indF r --out ${outDir}/${out} --n_threads 8 --seed ${thisSeed} \
    --min_epsilon 1e-9 --verbose 2 --min_iters 20 --max_iters 400
  1. I noticed that the inbreeding coefficients in the output .indF file, were quite different from those from ngsF.
    In addition, the transition parameters for individuals in .indF file were all "10". What does that parameter mean? Is "10" reasonable?

  2. with --log 1 did not produce .log file.

@fgvieira
Copy link
Owner

fgvieira commented Sep 4, 2023

  1. Can you try using --indF "0.1-0.1"
  2. . The estimates could be a bit different, but it can also be a convergence issue. Have you tried running ngsF-HMM with the estimates from ngsF as starting values? The transition parameter represents how often the state changes between IBD and notIBD; high numbers reflect short IBD fragments. What values of F are you getting from ngsF and ngsF-HMM?
  3. This option can be a bit misleading, but it means that it saves the output every X iterations, but it overwrites the file (it is mostly for debug purposes).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants