Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Import variants from multi-sample VCF (bgzipped) #330

Open
logust79 opened this issue Apr 8, 2021 · 1 comment
Open

Import variants from multi-sample VCF (bgzipped) #330

logust79 opened this issue Apr 8, 2021 · 1 comment
Assignees

Comments

@logust79
Copy link
Contributor

logust79 commented Apr 8, 2021

The bgzipped VCF files can be found at /SAN/vyplab/UCLex/mainset_May2019/VQSR/both/mainset_May2019_chr*_filtered.vcf.gz
There are around 5500 samples. Sample ids are internal ones (many will overlap with what we already have in the database, in which case they can be ignored).
The format is like this (omit the header starting with ##)

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  09G06954_S4_L001_R1_001.fastq.gz        10G00594_S5_L001_R1_001.fastq.gz        10G06637_S2_L001_R1_001.fastq.gz
1       13302   rs180734498     C       T       579047.33       PASS    AC=1844;AF=0.277;AN=6646;BaseQRankSum=0.480;ClippingRankSum=0.055;DB;DP=284259;ExcessHet=0.0004;FS=5.505;InbreedingCoeff=0.0555;MLEAC=1845;MLEAF=0.278;MQ=20.14;MQ0=0;MQRankSum=-3.805e+00;QD=6.17;ReadPosRankSum=-1.485e+00;SOR=1.127;VQSLOD=0.496;culprit=QD  GT:AD:DP:GQ:PGT:PID:PL  ./.:0,0:0       ./.:0,0:0       ./.:0,0:0
1       13327   rs144762171     G       A,C     6985.43 PASS    AC=24,10;AF=3.209e-03,1.337e-03;AN=7502;BaseQRankSum=-7.270e-01;ClippingRankSum=0.620;DB;DP=255285;ExcessHet=0.0000;FS=4.881;InbreedingCoeff=0.1615;MLEAC=14,6;MLEAF=1.866e-03,7.998e-04;MQ=2.52;MQ0=0;MQRankSum=-7.360e-01;QD=14.64;ReadPosRankSum=0.743;SOR=4.424;VQSLOD=2.57;culprit=InbreedingCoeff GT:AD:DP:GQ:PGT:PID:PL  0/0:24,0,0:24     1/2:1,12,12:25     2/2:1,0,20:21
...

First column: chromosome
2: Position
3: ID (ignore)
4: REF
5: ALT
6: quality (ignore)
7: Filter
8: INFO (ignore)
9: Format (this should be the header for sample genotypes on the same row, delimited by :. However the genotypes don't seem to always follow what the header depicts)
10-: sample genotypes

For each row, it might host multiple variants, in which case we call it multi-allelic site. You can parse the row with something like this:

from collections import Counter
genotype_dict = {
  0: null,
  1: 'het',
  2: 'hom'
}
... parsing a row
for alt_index, alt in enumerate(row['ALT']):
  if '*' in alt:
    # pass on star
    continue
  variant_id = '-'.join([row['#CHROM'], row['POS'], row['REF'], alt])
  for sample in samples:
    genotype = Counter(row[sample].split(':')[0])
    if '.' in genotype:
        genotypes[sample][variant_id] = 'missing'
    else:
        genotypes[sample][variant_id] = genotype_dict[genotype[str(alt_index+1)]]
@pontikos
Copy link
Contributor

@dvarrazzo has partially finished but we need to make we corretly import the quality information also.

@pontikos pontikos added this to the Migration to new db schema milestone May 18, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants