Mike Lin edited this page Sep 12, 2020 · 6 revisions

Most users should start with one of the built-in configuration "presets" for various gVCF variant callers, specified with glnexus_cli --config XXX. See glnexus_cli --help for the list of available presets. GLnexus displays its effective configuration on the console log and in the output pVCF header.

The configuration presets are hard-coded in cli_utils.cc. The remainder of this page documents how to customize the configuration, if needed.

Example configuration YAML

The configuration can be customized by supplying a YAML file, as in glnexus_cli --config YYY.yml, with contents like the following (reflecting the DeepVariantWES preset).

unifier_config:
    min_AQ1: 35
    min_AQ2: 20
    min_GQ: 20
    monoallelic_sites_for_lost_alleles: true
genotyper_config:
    required_dp: 0
    revise_genotypes: true
    liftover_fields:
        - orig_names: [MIN_DP, DP]
          name: DP
          description: '##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)">'
          type: int
          combi_method: min
          number: basic
          count: 1
          ignore_non_variants: true
        - orig_names: [AD]
          name: AD
          description: '##FORMAT=<ID=AD,Number=R,Type=Integer,Description="Allelic depths for the ref and alt alleles in the order listed">'
          type: int
          number: alleles
          combi_method: min
          default_type: zero
          count: 0
        - orig_names: [GQ]
          name: GQ
          description: '##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">'
          type: int
          number: basic
          combi_method: min
          count: 1
          ignore_non_variants: true
        - orig_names: [PL]
          name: PL
          description: '##FORMAT=<ID=PL,Number=G,Type=Integer,Description="Phred-scaled genotype Likelihoods">'
          type: int
          number: genotype
          combi_method: missing
          count: 0
          ignore_non_variants: true

Unifier configuration

The allele unifier configuration controls allele sensitivity and representation in the output pVCF.

  • min_AQ1: threshold for allele inclusion.
    • The Allele Quality (AQ) for an allele A is defined in terms of the genotype likelihoods of a gVCF record as the likelihood ratio max(likelihoods of genotypes including allele A)/max(likelihoods of genotypes not including allele A), Phred-scaled.
    • GLnexus includes an allele in the output pVCF if at least one individual in the cohort shows AQ_i >= min_AQ1.
  • min_AQ2: GLnexus also includes an allele in the output pVCF if at least two individuals show AQ_i >= min_AQ2.
    • Thus we may have a lower quality threshold for alleles observed in multiple individuals, compared to singletons.
    • All else equal, increasing the min_AQ thresholds increases specificity and reduces sensitivity, and also speeds up the genotyper (by processing fewer weak sites).
  • min_allele_copy_number: only include alleles with at least this many (estimated) copies in the cohort.
  • drop_filtered (true/false): if true, exclude alleles that have no observations which PASS all the defined VCF FILTERs, even if they pass the other quality thresholds.
  • min_GQ: Genotype Quality (GQ) score threshold used in estimating cohort allele copy numbers.
    • Increasing this will bias allele frequency estimates downwards (and conversely decreasing it biases upwards).
    • This affects the output genotypes only insofar as allele frequency estimates factor into revising them. In particular, it is not a hard threshold on output GQ.
    • We suggest setting this equal to min_AQ2.
  • monoallelic_sites_for_lost_alleles (true/false): if false, suppress generation of "monoallelic" sites to capture alleles that don't unify cleanly into non-overlapping multi-allelic sites. See Reading GLnexus pVCFs for explanation.
    • Recommend leaving this on, as the monoallelic sites may provide the only representation of certain alleles, and they're easily recognized by the FILTER field.
  • max_alleles_per_site: the maximum number of alleles to include in any one multiallelic site (counts the reference allele).
    • Alleles exceeding this threshold will be "kicked out" into monoallelic sites.
  • preference ("common" or "small"): if set to "small", the unifier prefers to merge small alleles (editing the shortest portion of the reference) into multiallelic sites before longer ones, even if the latter are more common in the cohort. This controls the allelic representation and the proportion of alleles and genotypes involved in monoallelic sites.
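As an illustration of the min_AQ1 and min_AQ2 rules above, the following Python sketch computes AQ from a diploid PL vector and applies the two thresholds. It assumes PL values are Phred-scaled and ordered as in the VCF specification, so the Phred-scaled likelihood ratio reduces to a difference of minimum PL values. The function names are hypothetical and not part of GLnexus, whose actual implementation may differ (for example in how it treats negative values or missing likelihoods). The other unifier filters, such as min_allele_copy_number and drop_filtered, are not modeled.

```python
def diploid_genotypes(n_alleles):
    """Diploid genotypes (j, k) with j <= k, in VCF PL order.

    The VCF ordering places genotype j/k at index k*(k+1)/2 + j.
    """
    return [(j, k) for k in range(n_alleles) for j in range(k + 1)]


def allele_quality(pl, allele, n_alleles):
    """Phred-scaled likelihood ratio for `allele` from one record's PL vector.

    PL is -10*log10 of the (normalized) genotype likelihood, so the genotype
    with the maximum likelihood is the one with the minimum PL.
    """
    genotypes = diploid_genotypes(n_alleles)
    if len(pl) != len(genotypes):
        raise ValueError("PL length does not match the number of genotypes")
    with_allele = [p for p, gt in zip(pl, genotypes) if allele in gt]
    without_allele = [p for p, gt in zip(pl, genotypes) if allele not in gt]
    return min(without_allele) - min(with_allele)


def include_allele(sample_aqs, min_aq1, min_aq2):
    """Apply the two-tier rule: one sample at min_AQ1, or two at min_AQ2."""
    n_strong = sum(aq >= min_aq1 for aq in sample_aqs)
    n_moderate = sum(aq >= min_aq2 for aq in sample_aqs)
    return n_strong >= 1 or n_moderate >= 2


# One sample, ref + one alt allele, PL ordered as 0/0, 0/1, 1/1.
aq = allele_quality([40, 0, 60], allele=1, n_alleles=2)
print(aq, include_allele([aq], min_aq1=35, min_aq2=20))
```

With the DeepVariantWES thresholds shown earlier (min_AQ1 35, min_AQ2 20), a singleton needs AQ of at least 35, while an allele seen in two individuals is kept if both reach 20.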

Genotyper configuration

The genotyper configuration controls genotype revision and many details of calculating the output QC values.

  • revise_genotypes (true/false): enables frequency-based genotype revision.
  • min_assumed_allele_frequency (float, default 0.0001): minimum assumed frequency of any allele to use in the revision calculations.
    • Ensures consistent sensitivity once the cohort is large enough to distinguish common and rare variants.
    • Increasing this tends to make the revision less aggressive.
  • required_dp: any called allele will be revised to non-called if it is supported by fewer than this many reads (per AD or an analogous field).
  • allele_dp_format (default "AD"): the gVCF FORMAT field from which to source the allele-specific read depths.
    • Changing this usually requires special-case code to read some variant caller's unique way of recording this information.
  • ref_dp_format (default "MIN_DP"): the gVCF FORMAT field from which to source read depth in reference bands.
  • allow_partial_data (true/false): if true, present pVCF genotypes even if the gVCF records only partially cover the output pVCF site (these would be non-called by default).
  • squeeze (true/false): if true, suppress usually-unnecessary QC detail from output to reduce its size.
    • In entries indicating zero non-reference reads (AD=*,0), report only GT and DP, rounding DP down to a power of two; leave all other FORMAT fields missing.
    • Speeds up the genotyper since it does not spend time calculating these QC values.
    • Use alone or in a pipeline to spVCF.
  • more_PL (true/false): if true, include PL values from reference bands and other cases omitted by default; also populate uninformative PL entries with 0 or 990 instead of missing values.
    • This extra detail can be useful for downstream tools that require all PL values to be populated.
    • However, it inflates the output and slows down its generation, for a marginal gain in useful information.
  • liftover_fields (list): a list of YAML objects specifying each FORMAT QC field, its entry in the header and how to calculate it from the input gVCF fields.
    • TODO: detailed documentation here
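To make the squeeze option above more concrete, here is a minimal Python sketch of the transformation it describes for a single genotype entry. The entry is modeled as a plain dict of FORMAT fields, and the function names are hypothetical; none of this is GLnexus code. How GLnexus treats DP=0 is not stated on this page, so leaving it as 0 is an assumption.

```python
def round_down_to_power_of_two(dp):
    """Largest power of two that is <= dp (0 is left as 0, by assumption)."""
    if dp < 0:
        raise ValueError("depth must be non-negative")
    if dp == 0:
        return 0
    return 1 << (dp.bit_length() - 1)


def squeeze_entry(entry):
    """Reduce an entry with zero non-reference reads to GT and rounded DP.

    `entry` maps FORMAT field names to values, e.g. AD as a list whose first
    element is the reference depth. Entries with any non-reference reads are
    returned unchanged.
    """
    ad = entry.get("AD")
    if ad is not None and len(ad) > 1 and all(d == 0 for d in ad[1:]):
        return {"GT": entry["GT"], "DP": round_down_to_power_of_two(entry["DP"])}
    return dict(entry)


print(squeeze_entry({"GT": "0/0", "DP": 37, "AD": [37, 0], "GQ": 50}))
print(squeeze_entry({"GT": "0/1", "DP": 37, "AD": [20, 17], "GQ": 50}))
```

Rounding DP this way leaves only a few distinct values across homozygous-reference entries, which is consistent with the stated goal of reducing output size.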