Import variants from multi-sample VCF (bgzipped) #330

logust79 · 2021-04-08T09:07:50Z

The bgzipped VCF files can be found at /SAN/vyplab/UCLex/mainset_May2019/VQSR/both/mainset_May2019_chr*_filtered.vcf.gz
There are around 5500 samples. Sample ids are internal ones (many will overlap with what we already have in the database, in which case they can be ignored).
The format is like this (omit the header starting with ##)

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  09G06954_S4_L001_R1_001.fastq.gz        10G00594_S5_L001_R1_001.fastq.gz        10G06637_S2_L001_R1_001.fastq.gz
1       13302   rs180734498     C       T       579047.33       PASS    AC=1844;AF=0.277;AN=6646;BaseQRankSum=0.480;ClippingRankSum=0.055;DB;DP=284259;ExcessHet=0.0004;FS=5.505;InbreedingCoeff=0.0555;MLEAC=1845;MLEAF=0.278;MQ=20.14;MQ0=0;MQRankSum=-3.805e+00;QD=6.17;ReadPosRankSum=-1.485e+00;SOR=1.127;VQSLOD=0.496;culprit=QD  GT:AD:DP:GQ:PGT:PID:PL  ./.:0,0:0       ./.:0,0:0       ./.:0,0:0
1       13327   rs144762171     G       A,C     6985.43 PASS    AC=24,10;AF=3.209e-03,1.337e-03;AN=7502;BaseQRankSum=-7.270e-01;ClippingRankSum=0.620;DB;DP=255285;ExcessHet=0.0000;FS=4.881;InbreedingCoeff=0.1615;MLEAC=14,6;MLEAF=1.866e-03,7.998e-04;MQ=2.52;MQ0=0;MQRankSum=-7.360e-01;QD=14.64;ReadPosRankSum=0.743;SOR=4.424;VQSLOD=2.57;culprit=InbreedingCoeff GT:AD:DP:GQ:PGT:PID:PL  0/0:24,0,0:24     1/2:1,12,12:25     2/2:1,0,20:21
...

First column: chromosome
2: Position
3: ID (ignore)
4: REF
5: ALT
6: quality (ignore)
7: Filter
8: INFO (ignore)
9: Format (this should be the header for sample genotypes on the same row, delimited by :. However the genotypes don't seem to always follow what the header depicts)
10-: sample genotypes

For each row, it might host multiple variants, in which case we call it multi-allelic site. You can parse the row with something like this:

from collections import Counter
genotype_dict = {
  0: null,
  1: 'het',
  2: 'hom'
}
... parsing a row
for alt_index, alt in enumerate(row['ALT']):
  if '*' in alt:
    # pass on star
    continue
  variant_id = '-'.join([row['#CHROM'], row['POS'], row['REF'], alt])
  for sample in samples:
    genotype = Counter(row[sample].split(':')[0])
    if '.' in genotype:
        genotypes[sample][variant_id] = 'missing'
    else:
        genotypes[sample][variant_id] = genotype_dict[genotype[str(alt_index+1)]]

The text was updated successfully, but these errors were encountered:

pontikos · 2021-05-18T22:27:18Z

@dvarrazzo has partially finished but we need to make we corretly import the quality information also.

logust79 assigned dvarrazzo Apr 8, 2021

pontikos added this to the Migration to new db schema milestone May 18, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Import variants from multi-sample VCF (bgzipped) #330

Import variants from multi-sample VCF (bgzipped) #330

logust79 commented Apr 8, 2021

pontikos commented May 18, 2021

Import variants from multi-sample VCF (bgzipped) #330

Import variants from multi-sample VCF (bgzipped) #330

Comments

logust79 commented Apr 8, 2021

pontikos commented May 18, 2021