populationgenomics/automated-interpretation-pipeline

Talos

Purpose

A variant prioritisation tool, aiming to achieve maximal sensitivity for variants with diagnostic relevance, while maintaining a high specificity to reduce burden on curators.

Installation

At this time the recommended way to use this tool is via Docker, building using the Dockerfile in this repository.

docker build -t talos:5.1.3 .

Input Data

To run Talos you will need:

  1. A MatrixTable of variants, annotated with VEP

    • Optionally a VEP-annotated VCF, this path is still under evaluation
  2. ClinVar data as generated by ClinvArbitration, both the clinvar_decisions and clinvar_pm5 Hail Tables. This is available from the ClinvArbitration Release Page.

  3. A PED file, describing the pedigree of the participants in the study.

    1. This PED should consist of standard columns 1-6 (Pedigree Reference), with the Individual ID field matching the MT/VCF IDs
    2. The 7th column (if present) must be an alternative ID for the individual, e.g. if the data is processed with an anonymised ID (Col 2), but the final reports would benefit from a de-anonymised ID (Col 7). This can be a repetition of Col 2
    3. An arbitrary number of columns can follow from 8 onwards, each containing an individual HPO term. This can be used to provide granular phenotype data for refining the analysis results
  4. A TOML file containing the configuration for the analysis. example_config.toml is a good starting point, with comments explaining each modifiable parameter. To tailor the analysis to the cohort under test, you may wish to use forced_panels to include additional gene panels in the analysis, remove or extend the require_pheno_match list (which masks noisy variants from the base panel), or use forbidden_genes to remove genes from the analysis completely.

Example PED structure:

Fam1	IndividualID	FatherID	MotherID	1	2	ExtIndividualID	HP:0002104	HP:0008872	HP:0011398
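
The column layout above can be read with a few lines of Python. This is a minimal sketch and not the parser Talos itself uses: the PedRow class and parse_ped_line function are hypothetical names, and falling back to the Individual ID when column 7 is absent is an assumption made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class PedRow:
    family_id: str
    individual_id: str
    father_id: str
    mother_id: str
    sex: str
    affected: str
    ext_id: str
    hpo_terms: list = field(default_factory=list)


def parse_ped_line(line: str) -> PedRow:
    """Split one tab-delimited PED line into core columns, external ID, and HPO terms."""
    cols = line.rstrip('\n').split('\t')
    if len(cols) < 6:
        raise ValueError(f'expected at least 6 columns, found {len(cols)}')
    # Column 7 is optional; reusing column 2 when it is missing is an assumption
    ext_id = cols[6] if len(cols) > 6 and cols[6] else cols[1]
    # Columns 8 onwards each hold a single HPO term
    hpo_terms = [term for term in cols[7:] if term]
    return PedRow(*cols[:6], ext_id=ext_id, hpo_terms=hpo_terms)


example = 'Fam1\tIndividualID\tFatherID\tMotherID\t1\t2\tExtIndividualID\tHP:0002104\tHP:0008872\tHP:0011398'
row = parse_ped_line(example)
print(row.individual_id, row.ext_id, row.hpo_terms)
```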

Usage

Talos consists of the following components:

  • GeneratePED - This is a CPG-specific implementation for generating a PED file
  • GeneratePanelData - Parses the PED file, and generates a per-participant list of panels to be used for this analysis, writing the result as a JSON. This also requires a local copy of the HPO ontology, downloadable from here.
  • QueryPanelapp - Takes the output of GeneratePanelData, Queries PanelApp for the panels selected for the cohort, and writes the result as a JSON
  • RunHailFiltering - Takes the MatrixTable of Variants, the Pedigree file, the panel data from QueryPanelapp, and both ClinVar tables, filters the variants in the MatrixTable, and labels them with categories of interest. This is the most resource-intensive step of the pipeline, but even on 400+GB datasets it has been run successfully on an 8-core, 16GB RAM VM.
  • RunHailFilteringSV - Takes a MatrixTable of Structural Variants, the Pedigree file, the panel data from QueryPanelapp, filters the variants in the MatrixTable, and labels them with categories of interest.
  • ValidateMOI - Takes the result of RunHailFiltering, optionally one or more SV results from RunHailFilteringSV, the Pedigree, and panel data from QueryPanelapp. Checks each categorised variant to determine whether the MOI associated with the relevant gene fits within the family structure where it occurs. Generates a JSON file from all variants which pass the MOI tests.
  • CreateTalosHTML - Generates a report from the results of the ValidateMOI.
  • GenerateSeqrFile - Parses the result of ValidateMOI, generates a file for ingestion by Seqr.

example_usage.sh demonstrates a full execution of Talos.

Strategy

Talos analyses consist of two separate phases:

  1. Filter and categorise variants, identifying which deserve further processing based on consequence annotations.
  2. Check each of those variants against the family structure of the participants in which it was found.

Variants only reach the final results if they are both expected to be damaging and show an inheritance pattern which fits the MOI associated with the gene they are found in.

Categories

The variant labelling stage of Talos implements a number of independent categories. Each category represents a decision tree, using variant annotations to decide if a category label should be assigned. Each of these categories has been designed to represent a block of curation logic - "if these criteria are all fulfilled, this could be relevant to diagnosis"

These categories are each independent, providing a framework for adjustment, configuration, or extension to include more variations in the future.

Category 1

CategoryBoolean1

If variant is rated as Pathogenic/Likely Pathogenic in ClinVar, minimum 1 'gold star' of associated evidence, we want to flag that variant for review.

This category is exceptional in the sense that Cat.1 variants are always processed under a partial-penetrance model - even if the variant isn't a strict fit with the family or phenotype, we would want to be alerted (e.g. to look for a second-hit by another method)
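
As a rough illustration of the Category 1 test, the check reduces to a rating plus a minimum star count. This is a sketch only: the function name and the rating strings in PATHOGENIC_LABELS are assumptions, not the values stored in the ClinvArbitration tables.

```python
from typing import Optional

# Hypothetical label set; the exact strings in the ClinVar tables are an assumption
PATHOGENIC_LABELS = {'pathogenic', 'likely pathogenic', 'pathogenic/likely pathogenic'}


def is_category_1(clinvar_rating: Optional[str], clinvar_stars: int) -> bool:
    """True when ClinVar rates the variant P/LP with at least one gold star."""
    if clinvar_rating is None:
        return False
    return clinvar_rating.lower() in PATHOGENIC_LABELS and clinvar_stars >= 1


print(is_category_1('Pathogenic', 1))
print(is_category_1('Likely pathogenic', 0))
```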

Category 2

CategoryBoolean2

A key reason for recovered diagnoses during reanalysis is the evolution of gene-disease understanding over time. This Category aims to identify these variants by carrying state from the previous analysis (see below) and flagging where a variant of at least moderate impact is newly associated with a disease. High Impact consequence here is based on the VEP definition of HIGH consequence.

Category 3

CategoryBoolean3

This Category leverages the work of LOFTEE, a tool for identifying variants likely to create loss of function with high confidence. When reviewing variants we require that a High Impact consequence is present, combined with either a LOFTEE high-confidence call or a ClinVar P/LP rating (any number of stars).
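
The Category 3 logic can be written as a single boolean expression. The sketch below assumes the three inputs have already been derived from the annotations; the function and argument names are hypothetical.

```python
from itertools import product


def is_category_3(high_impact: bool, loftee_high_confidence: bool, clinvar_pathogenic: bool) -> bool:
    """High Impact consequence, supported by either LOFTEE or a ClinVar P/LP rating."""
    return high_impact and (loftee_high_confidence or clinvar_pathogenic)


# Print the full truth table for the three inputs
for high, loftee, clinvar in product([True, False], repeat=3):
    print(high, loftee, clinvar, '->', is_category_3(high, loftee, clinvar))
```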

Category 4

CategoryBoolean4

Here we accept a milder consequence (any HIGH consequence + Missense), but only when present in the family with evidence of being de novo. This leverages the built-in de novo method in Hail, which is itself an implementation of Kaitlin Samocha's de novo caller. This implementation doesn't naively accept the trio genotypes, but also applies some probability modelling to the genotype likelihoods, and searches for alt-supporting reads at low levels in parents, before treating a de novo variant call as validated.

Category 5

CategoryBoolean5

A simple category: here we use SpliceAI to identify variants with a strong possibility of disrupting/adding splice junctions. This category will come under fire if the underlying SpliceAI tool becomes monetised.

Category 6

CategoryBoolean6

A simple category: here we use AlphaMissense to identify missense variants predicted to have a strong effect on the folded protein. A variant passes this test if the AM-assigned 'class' is likely_pathogenic. We have ambitions to make a second category here with a higher threshold applied to the AM continuous score, instead of taking the ~0.56 threshold AlphaMissense natively uses to determine likely pathogenic.
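
A sketch of the current test, alongside one possible shape for the stricter second category mentioned above. Both function names are hypothetical, and min_score is deliberately left as a required argument because no stricter threshold has been chosen.

```python
from typing import Optional


def is_category_6(am_class: Optional[str]) -> bool:
    """Current behaviour: pass when the AlphaMissense class is likely_pathogenic."""
    return am_class == 'likely_pathogenic'


def is_category_6_strict(am_class: Optional[str], am_score: Optional[float], min_score: float) -> bool:
    """Hypothetical stricter test: also require the continuous score to reach min_score."""
    return is_category_6(am_class) and am_score is not None and am_score >= min_score


print(is_category_6('likely_pathogenic'))
print(is_category_6_strict('likely_pathogenic', 0.6, min_score=0.9))
```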

Category PM5

A little bit of secret sauce here - a piece of work twinned with Talos involved developing ClinvArbitration - a re-summary of ClinVar data using altered heuristics to aggregate multiple submissions for a variant. After creating new ClinVar results, we annotated the pathogenic SNVs with VEP, and then re-indexed the results on Protein & Codon. In line with the ACMG evidence criteria PM5 (this variant is a missense, and another missense at this same codon has a Pathogenic rating in ClinVar), we use these re-indexed results at runtime to apply the PM5 category.

This category is applied in the form categorydetailsPM5=27037::1+27048::1

  • this is a + delimited list of entries (can be null)
  • each entry is ClinVar allele ID :: ClinVar Star rating

This is processed upon variant ingestion, and back filtered to remove any associated ClinVar IDs which are this exact variant.
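
The string format and back-filtering step described above might be handled as follows. This is a minimal sketch, not the Talos implementation: the function names are hypothetical, and passing the variant's own ClinVar allele ID in as a string is an assumption about how that ID is made available.

```python
def parse_pm5_details(raw: str) -> dict:
    """Parse '27037::1+27048::1' into {allele_id: stars}; an empty or null value gives {}."""
    if not raw:
        return {}
    hits = {}
    for entry in raw.split('+'):
        allele_id, stars = entry.split('::')
        hits[allele_id] = int(stars)
    return hits


def drop_self_hits(hits: dict, own_allele_id: str) -> dict:
    """Remove any entry which is this exact variant, leaving only other missenses at the codon."""
    return {allele_id: stars for allele_id, stars in hits.items() if allele_id != own_allele_id}


annotation = 'categorydetailsPM5=27037::1+27048::1'
value = annotation.split('=', 1)[1]
hits = parse_pm5_details(value)
print(hits)
print(drop_self_hits(hits, own_allele_id='27037'))
```

A variant whose only PM5 entry is itself ends up with an empty dictionary after filtering, so the category would not apply.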

Reanalysis

The heart of Talos's utility is in enabling explicit re-analysis, which is done in two key ways:

Gene Panel

  • each time we query for gene panel/ROI data from PanelApp, we record the results
  • if a gene features on a panel where it was previously absent, all variants in the gene will be eligible for Category2 for the current run (if the participant in question has the panel applied)
  • if a panel is applied in this analysis when it was previously not used, all variants in all genes on that panel are eligible for Category2 for the current run (if the participant in question has the panel applied)

As we only search for 'Green' genes in PanelApp (those with sufficient evidence of disease association), a gene being upgraded from Red or Amber-rated to Green will be picked up as a new gene.

Example:

  • In the previous analysis, panel number 42 (phenotype: Boneitis) was applied, containing geneX and geneY
  • In the current analysis, panel 42 newly features geneZ
  • PatientA was phenotype-matched to panel 42, and has a geneZ variant marked as Category2, so this can reach the report
  • PatientB was not phenotype-matched to panel 42, but has a geneZ variant marked as Category2. This category is stripped off, as panel 42 was not applied to this participant.
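
The example above can be sketched as a set difference between two panel snapshots, followed by a per-participant check. The data structures and function names here are assumptions for illustration, and panel 137 is a made-up panel ID standing in for whatever PatientB was matched to.

```python
def new_genes_by_panel(previous: dict, current: dict) -> dict:
    """Map panel ID -> genes that are new since the previous run.

    A panel absent from the previous run contributes all of its genes.
    """
    new = {}
    for panel_id, genes in current.items():
        fresh = set(genes) - set(previous.get(panel_id, set()))
        if fresh:
            new[panel_id] = fresh
    return new


def keeps_category_2(gene: str, participant_panels: set, new_genes: dict) -> bool:
    """Category2 survives only if the gene is new on a panel applied to this participant."""
    return any(gene in new_genes.get(panel_id, set()) for panel_id in participant_panels)


previous = {42: {'geneX', 'geneY'}}
current = {42: {'geneX', 'geneY', 'geneZ'}}
new_genes = new_genes_by_panel(previous, current)

print(keeps_category_2('geneZ', {42}, new_genes))   # PatientA, matched to panel 42
print(keeps_category_2('geneZ', {137}, new_genes))  # PatientB, not matched to panel 42
```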

Variant Results

  • each time Talos runs, a minimised representation of the results is made; for each participant:
    • list the variants that were reported
    • the categories those variants were annotated with
    • and the date the category was first applied
    • the most recent date that the evidence changed (new category labels were applied)
  • before the current report is written to file, the latest history file (if one exists) is checked:
    • if this variant & category was seen before, the first_tagged date is taken from the history file
    • if the variant was Pathogenic in ClinVar, the number of evidence stars is recorded. In future we may want to highlight improved evidence in the report as a review priority
    • if a category is assigned for the first time, the evidence_last_updated date is set to today

Together, these allow us to create reports which are easily filtered for events which occurred for the first time in this latest run, removing all variants which were previously flagged.

n.b. we do not currently hard-filter these results to remove previously-seen variants, we just enable that action by others

Example:

  • If a variant appears as only Cat.1, and was previously a Cat.1, it will have first_tagged and evidence_last_updated set to the date of first appearance
  • If a variant has been seen as a Cat.1 before, and now is both Cat.1 & Cat.2, first_tagged will be set to the date of first appearance, but evidence_last_updated will be set to today. The history file will be updated to show that Cat.2 was applied today
  • If a variant was never seen before and is now a Cat.1, first_tagged and evidence_last_updated will be set to today, and the history file will be updated to show that this variant was seen as a Cat.1, today
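
The three cases above can be expressed as one update function. This sketch covers a single participant's history and uses a hypothetical layout (a dictionary keyed on a variant string, with ISO date strings); the real history file format may differ.

```python
def update_history(history: dict, variant: str, categories: set, today: str) -> dict:
    """Update one participant's history for a variant and return that variant's record."""
    record = history.get(variant)
    if record is None:
        # Never seen before: both dates start today
        record = {'first_tagged': today, 'evidence_last_updated': today, 'categories': {}}
        history[variant] = record
    new_categories = sorted(c for c in categories if c not in record['categories'])
    for category in new_categories:
        # Record the date each category was first applied
        record['categories'][category] = today
    if new_categories:
        record['evidence_last_updated'] = today
    return record


history = {
    '1-100-A-T': {
        'first_tagged': '2023-01-01',
        'evidence_last_updated': '2023-01-01',
        'categories': {'1': '2023-01-01'},
    }
}

# Seen before as Cat.1, still only Cat.1: nothing changes
print(update_history(history, '1-100-A-T', {'1'}, today='2024-06-01'))
# Seen before as Cat.1, now Cat.1 & Cat.2: evidence_last_updated moves to today
print(update_history(history, '1-100-A-T', {'1', '2'}, today='2024-06-01'))
# Never seen before: both dates are today
print(update_history(history, '2-200-G-C', {'1'}, today='2024-06-01'))
```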