Describe the bug
While debugging Issue #5, I noticed that, depending on the VCF files used as raw input, some INFO tags declared in the meta-information header share a name across datasets but not a definition. For example, in the 1000g dataset, the INFO tag AFR_AF (allele frequency in the AFR population) is defined as follows:
##INFO=<ID=AFR_AF,Number=A,Type=Float,Description="Allele frequency in the AFR populations calculated from AC an
While the same tag in the SAHGP dataset is defined as follows:
##INFO=<ID=AFR_AF,Number=1,Type=Float,Description="Allele Frequency for samples from AFR based on AC/AN">
Both datasets define the AFR_AF tag as a float (Type=Float). However, 1000g declares it as holding one value per alternate allele (Number=A), while SAHGP declares it as holding exactly one value (Number=1). When one dataset contains an alternate allele that is absent from the other, for example 1000g carrying an extra allele that SAHGP lacks, the bcftools merge command (ALL_COLLATE process) fails, citing an INFO tag of different lengths which cannot be merged.
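To spot this kind of mismatch before the merge step, something like the following could be run on the header lines of each input. This is only a minimal sketch, not part of the pipeline: the function names `info_definitions` and `conflicting_tags` are hypothetical, the 1000g description text is left elided, and it assumes the headers have already been read into lists of strings.

```python
import re

# Captures the ID, Number and Type fields of a VCF ##INFO header line.
INFO_RE = re.compile(r'^##INFO=<ID=([^,]+),Number=([^,]+),Type=([^,>]+)')


def info_definitions(header_lines):
    """Map each INFO tag ID to its (Number, Type) pair."""
    defs = {}
    for line in header_lines:
        match = INFO_RE.match(line)
        if match:
            defs[match.group(1)] = (match.group(2), match.group(3))
    return defs


def conflicting_tags(header_a, header_b):
    """Return tags present in both headers whose Number or Type differ."""
    a = info_definitions(header_a)
    b = info_definitions(header_b)
    return {tag: (a[tag], b[tag]) for tag in a.keys() & b.keys() if a[tag] != b[tag]}


# Description elided for 1000g; only ID, Number and Type are compared.
kg_header = ['##INFO=<ID=AFR_AF,Number=A,Type=Float,Description="...">']
sahgp_header = [
    '##INFO=<ID=AFR_AF,Number=1,Type=Float,'
    'Description="Allele Frequency for samples from AFR based on AC/AN">'
]

conflicts = conflicting_tags(kg_header, sahgp_header)
print(conflicts)
```

For the two AFR_AF definitions quoted above, `conflicts` holds a single entry for `AFR_AF` pairing `("A", "Float")` with `("1", "Float")`.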
It seems the biggest issue here is the inconsistent use of INFO tags between datasets from different sources. Neither tag definition is incorrect on its own, but the two are incompatible when merging in the scenario described above. This also highlights that INFO tags are unreliable in merge applications. I think the best solution going forward is to remove all INFO tags for this reason.
We do not use INFO tags in the current pipeline so the information contained therein is redundant.
Their content can always be re-generated as needed so removing them will not incur data loss.
By using the vcftools --recode flag, we can rewrite the files and drop the INFO values without having to exclude every tag explicitly by name. Unfortunately, this does not remove the tag definitions from the header, which will cause some issues downstream.
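For illustration, here is roughly what removing both the INFO values and their header definitions amounts to. This is a hedged sketch only: `strip_info` is a hypothetical helper, it works on uncompressed VCF text held in memory, and the example record and sample name `S1` are made up. A real implementation would more likely rely on an existing tool; the bcftools annotate -x option may be worth checking for this.

```python
def strip_info(vcf_lines):
    """Yield VCF lines with ##INFO definitions dropped and the INFO column blanked."""
    for line in vcf_lines:
        line = line.rstrip("\n")
        if line.startswith("##INFO="):
            continue  # drop the tag definition from the header
        if line.startswith("#") or not line:
            yield line  # other header lines and blank lines pass through unchanged
            continue
        fields = line.split("\t")
        if len(fields) >= 8:
            fields[7] = "."  # INFO is the 8th column; "." marks it as empty
        yield "\t".join(fields)


# Made-up two-line header and single record, for demonstration only.
example = [
    "##fileformat=VCFv4.2",
    '##INFO=<ID=AFR_AF,Number=1,Type=Float,'
    'Description="Allele Frequency for samples from AFR based on AC/AN">',
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1",
    "1\t100\t.\tA\tG\t.\tPASS\tAFR_AF=0.5\tGT\t0/1",
]

cleaned = list(strip_info(example))
print("\n".join(cleaned))
```

With both the values and the definitions gone, there is no longer a Number=A versus Number=1 conflict for the merge to trip over.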
This may be a good excuse to implement a full-scale standardization step before the LIFTOVER process. This would have to include: