[BUG] | Clashing META header definitions can cause ALL_COLLATE crash #6

Open
G-kodes opened this issue Jan 9, 2021 · 2 comments
Labels
bug Something isn't working

Comments


G-kodes commented Jan 9, 2021

Describe the bug
While debugging Issue #5, I noticed that, depending on the VCF files used as raw input, some INFO tags in the meta header share a name across files but not a definition. For example, in the 1000g dataset, the INFO tag AFR_AF (allele frequency in the AFR population) is defined as follows:

##INFO=<ID=AFR_AF,Number=A,Type=Float,Description="Allele frequency in the AFR populations calculated from AC an

While the same tag in the SAHGP dataset is defined as follows:

##INFO=<ID=AFR_AF,Number=1,Type=Float,Description="Allele Frequency for samples from AFR based on AC/AN">

Both datasets define AFR_AF as a float (Type=Float). However, 1000g defines it as holding one value per alternate allele (Number=A), while SAHGP defines it as holding exactly one value (Number=1). When one dataset contains an alternate allele that is absent from the other, for example when 1000g has an extra allele that SAHGP lacks, the bcftools merge command (ALL_COLLATE process) fails, citing an INFO tag of differing lengths that cannot be merged.
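
A clash like this could be detected before the merge by comparing the `##INFO` header lines of the input files. The following is only a minimal sketch of that idea, not part of the current pipeline: the function names (`parse_info_header`, `info_definitions`, `find_clashes`) are hypothetical, and the 1000g description is replaced with a placeholder because the original line is truncated above.

```python
import re

# Matches a "##INFO=<...>" meta line and captures everything between < and >.
INFO_RE = re.compile(r'^##INFO=<(.*)>\s*$')


def parse_info_header(line):
    """Return the key/value fields of a ##INFO header line, or None."""
    match = INFO_RE.match(line)
    if not match:
        return None
    fields = {}
    # Values are either double-quoted strings or run up to the next comma.
    for key, value in re.findall(r'(\w+)=("[^"]*"|[^,]*)', match.group(1)):
        fields[key] = value.strip('"')
    return fields


def info_definitions(header_lines):
    """Map each INFO tag ID to its (Number, Type) pair."""
    defs = {}
    for line in header_lines:
        fields = parse_info_header(line)
        if fields and "ID" in fields:
            defs[fields["ID"]] = (fields.get("Number"), fields.get("Type"))
    return defs


def find_clashes(header_a, header_b):
    """Return tags defined in both headers with a different Number or Type."""
    a = info_definitions(header_a)
    b = info_definitions(header_b)
    return {tag: (a[tag], b[tag])
            for tag in sorted(a.keys() & b.keys())
            if a[tag] != b[tag]}


# Placeholder description: the real 1000g text is truncated in this issue.
header_1000g = ['##INFO=<ID=AFR_AF,Number=A,Type=Float,Description="(elided)">']
header_sahgp = ['##INFO=<ID=AFR_AF,Number=1,Type=Float,'
                'Description="Allele Frequency for samples from AFR based on AC/AN">']

clashes = find_clashes(header_1000g, header_sahgp)
```

Here `clashes` would flag AFR_AF because the two `Number` values differ, even though `Type` agrees.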

@G-kodes G-kodes added the bug Something isn't working label Jan 9, 2021
@G-kodes G-kodes self-assigned this Jan 9, 2021

G-kodes commented Jan 9, 2021

The main issue here seems to be that INFO tags are not used consistently across datasets from different sources. Neither tag definition is incorrect on its own, but the two ARE incompatible when merging in the scenario described above. This also highlights that INFO tags are unreliable in merge applications. I think the best solution going forward is to remove all INFO tags. This should be safe because:

  • We do not use INFO tags in the current pipeline so the information contained therein is redundant.
  • Their content can always be re-generated as needed so removing them will not incur data loss.


G-kodes commented Jan 9, 2021

By using the vcftools --recode flag, we can re-format the files and invalidate the INFO tags without having to explicitly exclude every tag by name. Unfortunately, this does not remove the tag definitions from the header, which will cause some issues downstream.

This may be a good excuse to implement a full-scale standardization step before the LIFTOVER process. This would have to include:

  • Stripping all INFO tags (bcftools annotate can do this).
  • Filtering out complex variants we cannot yet analyze (gatk SelectVariants can do this; it is currently located in the LIFTOVER rule).
  • Repairing/validating the VCF header for downstream use (picard FixVcfHeader can do this; it is also currently in the LIFTOVER process).

This approach would also greatly reduce code bloat downstream.
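
To make the intended result of the INFO-stripping step concrete, here is a small pure-Python sketch of what the standardized output should look like: no `##INFO` definitions in the header and an empty (`.`) INFO column on every record. It is only an illustration of the target state, assuming plain-text, tab-delimited VCF lines; `strip_info` is a hypothetical helper, and the pipeline itself would presumably rely on bcftools annotate as proposed above rather than on this code.

```python
def strip_info(vcf_lines):
    """Yield VCF lines with ##INFO definitions dropped and INFO set to '.'."""
    for line in vcf_lines:
        line = line.rstrip("\n")
        # Drop INFO tag definitions from the meta header entirely.
        if line.startswith("##INFO="):
            continue
        # Pass other header lines (and blank lines) through unchanged.
        if line.startswith("#") or not line:
            yield line
            continue
        cols = line.split("\t")
        # INFO is the 8th fixed column of a VCF record (index 7).
        if len(cols) >= 8:
            cols[7] = "."
        yield "\t".join(cols)


example = [
    "##fileformat=VCFv4.2",
    '##INFO=<ID=AFR_AF,Number=1,Type=Float,Description="example">',
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "1\t100\trs1\tA\tG\t.\tPASS\tAFR_AF=0.5",
]
stripped = list(strip_info(example))
```

With both the definitions and the values gone, there is nothing left for the merge to compare, so clashing Number= declarations can no longer cause a failure.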
