Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Trying to run Unfazed #1

Open
mkohailan opened this issue Sep 22, 2020 · 19 comments
Open

Trying to run Unfazed #1

mkohailan opened this issue Sep 22, 2020 · 19 comments

Comments

@mkohailan
Copy link

Hi,

I am new to programming and I am trying to run Unfazed to phase my DNMs, but could't make it

Here is my code:
unfazed -d My_DNM_File.vcf -s Main_VCF_File_With_All_Variants.vcf.gz -p pedigree.ped -b bamFiles/ -t 2 -o vcf --outfile test.vcf -r ref/hs37d5.fa

My_DNM_File.vcf contains the vcf file header and only the DNM variant I am testing (4 44986884)
Main_VCF_File_With_All_Variants.vcf.gz is a multi-sample sorted/bgzipped/indexed vcf file with all variants of the family (i.e. Not only DNMs)
pedigree.ped is a tab delimited file that shows the relationship of the trio I am testing (Format: Family_ID Sample_ID Father Mother Sex Phenotype)
bamFiles/ the directory that contains the three bam files of the trio (named as recommended by the tool: sample_ID.bam)

Output:
No usable informative sites for variant 4:44986883-44986884

However, I am using this DNM for the test because I am sure it can be phased as there is an informative site in the same read near to it:
CHROM POS Child Father Mother
4 44986884 0/1 0/0 0/0 <-- My DNM site
4 44986893 0/1 0/0 1/1 <-- Informative site

Appreciate your help

@jbelyeu
Copy link
Owner

jbelyeu commented Sep 22, 2020

That is confusing. I don't see any mistakes in your command and I don't think you'd see that particular message if bad inputs had caused an issue (although I could be wrong and will keep thinking about it). The most likely explanation is that the particular informative site you mentioned didn't pass quality control filters built into unfazed. The main QC requirements (for a heterozygous variant) are:

  • Read allele balance between 0.2 and 0.8
  • Minimum genotype quality of at least 20
  • Minimum read depth (both alleles, not counting any third allele) at the locus of at least 10

For homozygous calls the same filters apply except the read allele balance must be between 0.0-0.2 or 0.8-1.0 (depending on whether it's a hom-alt or a hom-ref.

Please let me know if this is helpful. I'd be happy to dig more into it with you, especially if the issue is a bug in my code I can fix.

@arq5x
Copy link

arq5x commented Sep 22, 2020

What variant caller was used to create the VCF? Does it have AB, DP and GQ attributes in the genotype columns?

@mkohailan
Copy link
Author

Thanks for your reply

I am using FreeBayes (default run) to call the variants.
After your comments, I ran FreeBayes with --genotype-qualities option to write the GQ in the genotype field of each sample. So, now I have DP and GQ written in the genotype column for each sample, and the AB in the INFO column.

I ran a test and it worked for some of my variants (including the one in the above example). So thanks for your help

For one of the variants (below), I can see that both the DNM site and the informative site can pass these filters, but still not phased. Are there any other criteria used?

#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT MOTHER FATHER CHILD
9 88075348 . A G 174.68 . AB=0.40625;………. GT:AD:AO:DP:GQ:PL:QA:QR:RO 0/0:37,0:0:38:99:0,111,1311:0:1465:37 0/0:41,0:0:41:99:0,123,1303:0:1486:41 0/1:19,13:13:32:99:248,0,554:448:724:19
9 88075353 . A G 1525 . AB=0.462069;………. GT:AD:AO:DP:GQ:PL:QA:QR:RO 0/0:37,0:0:37:99:0,111,1252:0:1393:37 0/1:23,18:18:41:99:455,0,648:691:857:23 0/1:19,14:14:33:99:259,0,565:496:739:19

By the way, is there an option to play with the quality control filters set by the tool?

@jbelyeu
Copy link
Owner

jbelyeu commented Sep 30, 2020

A possible reason a variant like this one might not be phased could be if there are multiple informative sites in the region that give contradictory results, possibly because a recombination event or sequencing error occurred. If you use the --include-ambiguous option on the command line it will report these variants as well, so that might help.

Quality control filters are not currently exposed to the user, although this may change shortly. I'm also considering ways to improve the information given to the user about why certain variants aren't phased, although I want to avoid overwhelmingly large outputs as well.

@mkohailan
Copy link
Author

Great! looking for the new versions then

Thank you

@tycheleturner
Copy link

Hi,

Thanks for making this tool. I tried it out yesterday on some data and the output for all sites is:

No usable informative sites for variant “variant name.”

We know the parent-of-origin for many of the variants based on assessing the read data, so would expect this program to find at least some of them. Based on your preprint, it looks like the number of variants where parent-of-origin can be determined is higher than in previous papers/methods so that is why we hope to use it.

The command is as follows:
/path/to/unfazed -d childname_denovo.bed -s childname.trio.vcf.gz -p childname.ped --bam-pairs childname:/path/to/childname.cram dadname:/path/to/dadname.cram momname:/path/to/momname.cram -r /path/to/GRCh38_full_analysis_set_plus_decoy_hla.fa

The input of the bed file is as follows (only the top line shown for space reasons):
chr1 7326507 7326507 childname SNV

The input of the childname.trio.vcf is a full genome-wide vcf from DeepVariant (the FORMAT field is as follows: GT:DP:AD:GQ:PL:RNC). I have also tried a VCF from GATK and also tried adding the AB in to one of these VCFs. None of these changes worked. I have also tried only giving it the child cram and not the parents … still the same result.

The ped format is standard plink format

Would like to run this tool and any insight is appreciated.

Thanks,

Tychele

@tycheleturner
Copy link

I should note that I have also tried the bed file as follows:

chr1 7326506 7326507 childname SNV

@jbelyeu
Copy link
Owner

jbelyeu commented Feb 5, 2021

Hmm, that's disappointing. It's possible there's an instability I'm not aware of. Could you try running with -t 1? This option requires it to run with just a single thread but allows improved exception handling.

@tycheleturner
Copy link

tycheleturner commented Feb 5, 2021

Okay, added -t 1 to end and now it says

UNFAZED v1.0.0 Traceback (most recent call last): File "/opt/conda/bin/unfazed", line 10, in <module> sys.exit(main()) File "/opt/conda/lib/python3.6/site-packages/unfazed/__main__.py", line 237, in main unfazed(args) File "/opt/conda/lib/python3.6/site-packages/unfazed/unfazed.py", line 648, in unfazed args.split_error_margin, File "/opt/conda/lib/python3.6/site-packages/unfazed/snv_phaser.py", line 419, in phase_snvs split_error_margin, File "/opt/conda/lib/python3.6/site-packages/unfazed/snv_phaser.py", line 262, in run_read_phasing whole_region=False, File "/opt/conda/lib/python3.6/site-packages/unfazed/informative_site_finder.py", line 240, in find dnms, pedigrees, vcf_name, search_dist, threads, build, whole_region File "/opt/conda/lib/python3.6/site-packages/unfazed/informative_site_finder.py", line 668, in find_many whole_region, TypeError: multithread_find_many() missing 1 required positional argument: 'build'

@tycheleturner
Copy link

tycheleturner commented Feb 5, 2021

Also, just added --build 38 as well to see if that solved the build question but it gives the same error. To clarify, the error is the same as the previous one when adding the -t 1.

@jbelyeu
Copy link
Owner

jbelyeu commented Feb 5, 2021

Yes, there is a bug in the code at 668. I have corrected it in the repo and will push a new release with the fix as soon as the tests finish. I will also work on the exception handling and improved tests.

I think the bug I fixed is probably not the same one you ran into before -t 1, unfortunately. We'll see, but I expect it's going to be a different issue.

@tycheleturner
Copy link

Okay, thanks

@jbelyeu
Copy link
Owner

jbelyeu commented Feb 5, 2021

I have created a new release v1.0.1 which should fix the known bug

@tycheleturner
Copy link

Okay, I tried installing with conda but it looks like the new version is not there yet?

Also, I took some time to look through the test files in the GitHub and decided to change my bed from saying SNV or INDEL to saying POINT. Then I ran it on 10 sites and it ran on those sites (with parent-of-origin) for four of them! I will do some more testing but it looks like POINT is the key and not to use SNV or INDEL for the bed. Also, this worked on the VCF did not contain AB if that is helpful.

Thanks,

Tychele

@jbelyeu
Copy link
Owner

jbelyeu commented Feb 5, 2021

Ah, a great catch. I'll work on improving the generality of that term as well, and clarifying the necessary VCF fields (allele balance is calculated from AD in Unfazed). I really appreciate your patience and persistence in working through these new-software issues.

@tycheleturner
Copy link

Hi,

The program is generally working well now (currently using v1.0.1). However, for a subset of the trios in my dataset (n = 48), I keep getting this error:

UNFAZED v1.0.1
Traceback (most recent call last):
  File "/opt/conda/bin/unfazed", line 10, in <module>
    sys.exit(main())
  File "/opt/conda/lib/python3.7/site-packages/unfazed/__main__.py", line 237, in main
    unfazed(args)
  File "/opt/conda/lib/python3.7/site-packages/unfazed/unfazed.py", line 648, in unfazed
    args.split_error_margin,
  File "/opt/conda/lib/python3.7/site-packages/unfazed/snv_phaser.py", line 419, in phase_snvs
    split_error_margin,
  File "/opt/conda/lib/python3.7/site-packages/unfazed/snv_phaser.py", line 262, in run_read_phasing
    whole_region=False,
  File "/opt/conda/lib/python3.7/site-packages/unfazed/informative_site_finder.py", line 240, in find
    dnms, pedigrees, vcf_name, search_dist, threads, build, whole_region
  File "/opt/conda/lib/python3.7/site-packages/unfazed/informative_site_finder.py", line 650, in find_many
    chrom_ranges[chrom],
KeyError: 'chrX'

When I remove the X chromosome for those trios it runs fine. The children are all males in these particular trios so am assuming the de novo would be primarily on the X chromosome from the mother (a possible exception being the pseudoautosomal regions). However, the whole program crashes because of the KeyError and I don't get the results for the rest of the genome without removing the X chromosome and rerunning. Also, I looked over the bed inputs and they look fine.

Tychele

@jbelyeu
Copy link
Owner

jbelyeu commented Feb 9, 2021

For male samples Unfazed is intended to only phase the PAR for chrX and automatically assign the rest of the chrX variants to maternal origin. Clearly that's not working and I'll figure it out. Thanks for pointing this out!

@tycheleturner
Copy link

Thank you! The tool is working quite well. I compared it with some code we have for phasing using the nearest variant (Whatshap phased vcf plus some custom scripts). Unfazed found more variants and when both approaches looked at the same site there was 100% concordance on the parent-of-origin (awesome!).

Also, wanted to mention one other error that I have seen so far and that was this one:

UNFAZED v1.0.1
Traceback (most recent call last):
  File "/opt/conda/bin/unfazed", line 10, in <module>
    sys.exit(main())
  File "/opt/conda/lib/python3.7/site-packages/unfazed/__main__.py", line 237, in main
    unfazed(args)
  File "/opt/conda/lib/python3.7/site-packages/unfazed/unfazed.py", line 648, in unfazed
    args.split_error_margin,
  File "/opt/conda/lib/python3.7/site-packages/unfazed/snv_phaser.py", line 419, in phase_snvs
    split_error_margin,
  File "/opt/conda/lib/python3.7/site-packages/unfazed/snv_phaser.py", line 262, in run_read_phasing
    whole_region=False,
  File "/opt/conda/lib/python3.7/site-packages/unfazed/informative_site_finder.py", line 303, in find
    dad_idx, ref_depths, alt_depths, genotypes, gt_quals
  File "/opt/conda/lib/python3.7/site-packages/unfazed/informative_site_finder.py", line 102, in is_high_quality_site
    if min_ab <= allele_bal <= max_ab:
UnboundLocalError: local variable 'min_ab' referenced before assignment

When I switched from the DeepVariant to GATK vcf for this trio everything went great so am not sure if it is a DeepVariant VCF issue (the VCF is valid) or something else.

@jbelyeu
Copy link
Owner

jbelyeu commented Feb 10, 2021

Very helpful error messages. I'll push a new version with fixes asap

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants