Description
The bug
Golang VCF libraries will not read our VCFs.
According to a strict reading of the last few versions of the VCF spec (https://samtools.github.io/hts-specs/VCFv4.2.pdf, https://samtools.github.io/hts-specs/VCFv4.3.pdf), the INFO header line has a prescribed format, which we are departing from in a number of ways:
- Our Source element is not enclosed in double quotes.
- Our Version element is not enclosed in double quotes.
- We have added a non-standard element: FileDate.
To Reproduce
I am using VCF fbe3b136-dc8b-4c8d-bde3-a6390c91b521.vcf from the COLO-829 analysis analysis_fbe3b136-dc8b-4c8d-bde3-a6390c91b521 for testing my code. The following two lines demonstrate the problems listed above: the first line has problems 1 and 3 (unquoted Source, plus FileDate), and the second line has problems 1 and 2 (unquoted Source and Version):
##INFO=<ID=GERM,Number=2,Type=Integer,Description="Counts of donor occurs this mutation, total recorded donor number",Source=/mnt/lustre/reference/genomeinfo/qannotate/icgc_germline_qsnp_PUBLIC.vcf,FileDate=null>
##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP Membership",Source=/mnt/lustre/reference/dbsnp/141/00-All.vcf,Version=141>
If I cut out the first 1000 lines from this VCF and rectify all 3 problems, then the VCF will parse.
Expected behavior
The golang library appears to be applying the VCF spec strictly and not allowing user-defined fields in INFO lines. The spec does not explicitly allow for user-defined fields in INFO lines, so I think we should stop using them.
I'm guessing qannotate may be adding these, but wherever they come from, I'd like to have the quoting fixed. For the FileDate, we could make Source a composite field that also contains the file date, for example:
##INFO=<ID=GERM,Number=2,Type=Integer,Description="Counts of donor occurs this mutation, total recorded donor number",Source="/mnt/lustre/reference/genomeinfo/qannotate/icgc_germline_qsnp_PUBLIC.vcf;2021-06-14">
If we go down the composite field path, I would suggest that we use a semicolon as the separator, because comma is already the separator between subfields within an INFO line.
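As a rough illustration of the composite idea, the following Go sketch builds and splits such a Source value. The helpers joinSource and splitSource are hypothetical names, not part of qannotate or any VCF library, and the sketch assumes the path itself never contains a semicolon.

```go
package main

import (
	"fmt"
	"strings"
)

// joinSource builds a double-quoted Source element. When fileDate is
// non-empty it is appended after a semicolon (assumed separator).
func joinSource(path, fileDate string) string {
	if fileDate == "" {
		return fmt.Sprintf(`Source="%s"`, path)
	}
	return fmt.Sprintf(`Source="%s;%s"`, path, fileDate)
}

// splitSource takes the quoted value (the part after "Source=") and
// returns the path and the file date. The date is empty when the value
// has no semicolon. Assumes the path contains no semicolon.
func splitSource(quoted string) (path, fileDate string) {
	v := strings.Trim(quoted, `"`)
	parts := strings.SplitN(v, ";", 2)
	if len(parts) == 2 {
		return parts[0], parts[1]
	}
	return parts[0], ""
}
```

Because the whole value sits inside double quotes, a quote-aware parser sees it as a single Source element, and only consumers that care about the date need to split on the semicolon.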