jandot edited this page Sep 14, 2010 · 4 revisions

vcf2tsv – algorithm

How does the vcf2tsv function work? In contrast to the Perl implementation, this vcf2tsv preserves all INFO and FORMAT tags.

It first scans the input file to collect the unique set of INFO tags and the unique set of FORMAT tags that occur anywhere in it (let's call these all-info-tags and all-format-tags). The sorted INFO tags become part of the header. The FORMAT tags are combined with each sample name, so that every sample gets one header column per FORMAT tag. Then, to actually process the file, it goes through each line and:

  • creates the bit of the output line that concerns the INFO field
    • creates a map of the INFO field (e.g. "DP=17;GN=BRCA2;CN=INTRONIC" becomes {"DP" "17", "GN" "BRCA2", "CN" "INTRONIC"})
    • goes through all-info-tags and looks up each tag in this map, using an empty string if that tag is not present in the INFO string
  • creates the bit of the output line that concerns the FORMAT and sample fields. For each individual:
    • creates a map by interleaving the split FORMAT field with the sample data
    • goes through all-format-tags and looks up each tag in this map, using an empty string if that tag is not present in the sample data
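
The two passes described above can be sketched roughly as follows. This is a minimal Python illustration and not the actual vcf2tsv code: the function names, the `sample-tag` naming of the per-sample header columns, the fixed leading columns and the handling of value-less INFO flags are all assumptions made for the example, and the input is assumed to be well-formed, tab-separated VCF.

```python
FIXED_COLUMNS = ["CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER"]


def parse_info(info):
    """Turn 'DP=17;GN=BRCA2' into {'DP': '17', 'GN': 'BRCA2'}."""
    result = {}
    for item in info.split(";"):
        if item and item != ".":
            key, _, value = item.partition("=")
            # Assumption: flag-style tags without '=' map to an empty string.
            result[key] = value
    return result


def scan_tags(lines):
    """First pass: collect the sample names plus sorted INFO and FORMAT tags."""
    info_tags, format_tags, samples = set(), set(), []
    for line in lines:
        if line.startswith("##") or not line.strip():
            continue  # meta-information or blank line
        fields = line.rstrip("\n").split("\t")
        if line.startswith("#"):
            samples = fields[9:]  # sample names follow the FORMAT column
            continue
        info_tags.update(parse_info(fields[7]))
        if len(fields) > 8:
            format_tags.update(fields[8].split(":"))
    return sorted(info_tags), sorted(format_tags), samples


def make_header(all_info_tags, all_format_tags, samples):
    """Fixed columns, then INFO tags, then one column per sample/FORMAT tag pair."""
    columns = list(FIXED_COLUMNS) + list(all_info_tags)
    # Assumption: per-sample columns are named '<sample>-<tag>'.
    columns += [f"{s}-{t}" for s in samples for t in all_format_tags]
    return "\t".join(columns)


def convert_line(line, all_info_tags, all_format_tags):
    """Second pass: build one output row, filling absent tags with ''."""
    fields = line.rstrip("\n").split("\t")
    out = fields[:len(FIXED_COLUMNS)]
    info = parse_info(fields[7])
    out += [info.get(tag, "") for tag in all_info_tags]
    format_keys = fields[8].split(":") if len(fields) > 8 else []
    for sample in fields[9:]:
        # Pair the split FORMAT field with this sample's values.
        sample_map = dict(zip(format_keys, sample.split(":")))
        out += [sample_map.get(tag, "") for tag in all_format_tags]
    return "\t".join(out)


def vcf2tsv(lines):
    """Yield the header row followed by one row per variant line."""
    lines = list(lines)  # the input is read twice, so keep it in memory
    all_info, all_format, samples = scan_tags(lines)
    yield make_header(all_info, all_format, samples)
    for line in lines:
        if line.strip() and not line.startswith("#"):
            yield convert_line(line, all_info, all_format)
```

In this sketch, `"\n".join(vcf2tsv(open("input.vcf")))` would produce the full table for a hypothetical file `input.vcf`. Because the tags are collected before any row is written, every row has the same number of columns, even when individual lines use different INFO or FORMAT tags.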