A variant prioritisation tool, aiming to achieve maximal sensitivity for variants with diagnostic relevance, while maintaining a high specificity to reduce burden on curators.
At this time the recommended way to use this tool is via Docker, building using the Dockerfile in this repository.
docker build -t talos:5.1.3 .
To run Talos you will need:
- A MatrixTable of variants, annotated with VEP
  - Optionally a VEP-annotated VCF; this path is still under evaluation
- ClinVar data as generated by ClinvArbitration, both the clinvar_decisions and clinvar_pm5 Hail Tables. This is available from the ClinvArbitration Release Page.
- A PED file, describing the pedigree of the participants in the study.
  - This PED should consist of standard columns 1-6 (Pedigree Reference), with the Individual ID field matching the MT/VCF IDs
  - The 7th column (if present) must be an alternative ID for the individual, e.g. if the data is processed with an anonymised ID (Col 2), but the final reports would benefit from a de-anonymised ID (Col 7). This can be a repetition of Col 2
  - An arbitrary number of columns can follow from 8 onwards, each containing an individual HPO term. This can be used to provide granular phenotype data for refining the analysis results
- A TOML file containing the configuration for the analysis. example_config.toml is a good starting point, with comments explaining each modifiable parameter. Changes you may wish to make to tailor the analysis to the cohort under test include using forced_panels to involve additional gene panels in the analysis, removing or extending the require_pheno_match list, which would mask noisy variants from the base panel, and using forbidden_genes to remove genes from the analysis completely.
Example PED structure:
Fam1 IndividualID FatherID MotherID 1 2 ExtIndividualID HP:0002104 HP:0008872 HP:0011398
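As an illustration of the column layout described above, here is a minimal sketch of reading one extended PED line. The PedRow class and parse_ped_line function are hypothetical names used only for this example, not part of Talos, and the sketch assumes whitespace-delimited fields.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class PedRow:
    """One line of the extended PED layout (hypothetical helper, not Talos code)."""

    family_id: str
    individual_id: str
    father_id: str
    mother_id: str
    sex: str
    affected: str
    external_id: Optional[str] = None  # column 7, if present
    hpo_terms: List[str] = field(default_factory=list)  # columns 8 onwards


def parse_ped_line(line: str) -> PedRow:
    """Split a whitespace-delimited PED line into its standard and extended columns."""
    fields = line.split()
    if len(fields) < 6:
        raise ValueError(f"expected at least 6 columns, found {len(fields)}")
    external_id = fields[6] if len(fields) > 6 else None
    return PedRow(*fields[:6], external_id=external_id, hpo_terms=fields[7:])


row = parse_ped_line(
    "Fam1 IndividualID FatherID MotherID 1 2 ExtIndividualID HP:0002104 HP:0008872 HP:0011398"
)
print(row.individual_id, row.external_id, row.hpo_terms)
```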
Talos consists of the following components:
- GeneratePED - This is a CPG-specific implementation for generating a PED file
- GeneratePanelData - Parses the PED file, and generates a per-participant list of panels to be used for this analysis, writing the result as a JSON. This also requires a local copy of the HPO ontology, downloadable from here.
- QueryPanelapp - Takes the output of GeneratePanelData, queries PanelApp for the panels selected for the cohort, and writes the result as a JSON
- RunHailFiltering - Takes the MatrixTable of Variants, the Pedigree file, the panel data from QueryPanelapp, and both ClinVar tables, filters the variants in the MatrixTable, and labels them with categories of interest. This is the most resource-intensive step of the pipeline, but even on 400+GB datasets it has been run successfully on an 8-core, 16GB RAM VM.
- RunHailFilteringSV - Takes a MatrixTable of Structural Variants, the Pedigree file, and the panel data from QueryPanelapp, filters the variants in the MatrixTable, and labels them with categories of interest.
- ValidateMOI - Takes the result of RunHailFiltering, optionally one or more SV results from RunHailFilteringSV, the Pedigree, and panel data from QueryPanelapp. Checks each categorised variant to determine whether the MOI associated with the relevant gene fits within the family structure where it occurs. Generates a JSON file from all variants which pass the MOI tests.
- CreateTalosHTML - Generates a report from the results of ValidateMOI.
- GenerateSeqrFile - Parses the result of ValidateMOI, and generates a file for ingestion by Seqr.
example_usage.sh demonstrates a full execution of Talos.
Talos analyses consist of two separate phases
- Filter and categorise variants, identifying which deserve further processing based on consequence annotations.
- Check each of those variants against the family structure of the participants in which it was found.
Variants only reach the final results if they are expected to be damaging and their inheritance pattern fits the MOI associated with the gene in which they are found.
The variant labelling stage of Talos implements a number of independent categories. Each category represents a decision tree, using variant annotations to decide if a category label should be assigned. Each of these categories has been designed to represent a block of curation logic - "if these criteria are all fulfilled, this could be relevant to diagnosis".
These categories are each independent, providing a framework for adjustment, configuration, or extension to include more variations in the future.
If a variant is rated as Pathogenic/Likely Pathogenic in ClinVar, with a minimum of 1 'gold star' of associated evidence, we want to flag that variant for review.
This category is exceptional in the sense that Cat.1 variants are always processed under a partial-penetrance model - even if the variant isn't a strict fit with the family or phenotype, we would want to be alerted (e.g. to look for a second hit by another method).
A key reason for recovered diagnoses during reanalysis is the evolution of gene-disease understanding over time. This Category aims to identify these variants by carrying state from the previous analysis (see below) and flagging where a variant of at least moderate impact is newly associated with a disease. High Impact consequence here is based on the VEP definition of HIGH consequence.
This Category leverages the work of LOFTEE, a tool for identifying variants likely to create loss of function with high confidence. When reviewing variants we require that a High Impact consequence is present, combined with either LOFTEE or a ClinVar P/LP rating (any number of stars).
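The decision described above can be sketched as a single boolean test. This is an illustrative reading of the prose only: the VariantAnnotation class, its field names, and is_category_3 are hypothetical and do not reflect how Talos stores annotations.

```python
from dataclasses import dataclass


@dataclass
class VariantAnnotation:
    """Hypothetical flattened annotations for one variant (names are assumptions)."""

    vep_impact: str  # VEP impact rating, e.g. "HIGH" or "MODERATE"
    loftee_high_confidence: bool  # LOFTEE supports loss of function
    clinvar_plp: bool  # ClinVar Pathogenic/Likely Pathogenic, any star count


def is_category_3(annotation: VariantAnnotation) -> bool:
    """HIGH impact is required, plus support from either LOFTEE or ClinVar P/LP."""
    has_support = annotation.loftee_high_confidence or annotation.clinvar_plp
    return annotation.vep_impact == "HIGH" and has_support
```

Under these assumptions a HIGH impact variant with neither LOFTEE nor ClinVar support is not labelled, and neither is a MODERATE variant with both.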
Here we accept a milder consequence (any HIGH consequence + Missense), but only when present in the family with evidence of being de novo. This leverages the built-in de novo method in Hail, which is itself an implementation of Kaitlin Samocha's de novo caller. This implementation doesn't naively accept the trio genotypes, but also applies some probability modelling to the genotype likelihoods, and searches for alt-supporting reads at low levels in parents, before treating a de novo variant call as validated.
A simple category: here we use SpliceAI to identify variants with a strong possibility of disrupting or adding splice junctions. This category will come under fire if the underlying SpliceAI tool becomes monetised.
A simple category: here we use AlphaMissense to identify missense variants predicted to have a strong effect on the folded protein. A variant passes this test if the AM-assigned 'class' is likely_pathogenic. We have ambitions to make a second category here with a higher threshold applied to the AM continuous score, instead of taking the ~0.56 threshold AlphaMissense natively uses to determine likely pathogenic.
A little bit of secret sauce here - a piece of work twinned with Talos involved developing ClinvArbitration - a re-summary of ClinVar data using altered heuristics to aggregate multiple submissions for a variant. After creating new ClinVar results, we annotated the pathogenic SNVs with VEP, and then re-indexed the results on Protein & Codon. In line with the ACMG evidence criterion PM5 (this variant is a missense, and another missense at this same codon has a Pathogenic rating in ClinVar), we use these re-indexed results at runtime to apply the PM5 category.
This category is applied in the form categorydetailsPM5=27037::1+27048::1
- this is a + delimited list of entries (can be null)
- each entry is ClinVar allele ID::ClinVar Star rating
This is processed upon variant ingestion, and back-filtered to remove any associated ClinVar IDs which are this exact variant.
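To make the encoding concrete, here is a minimal sketch of unpacking a PM5 value and applying the back filter. The function name parse_pm5 and the own_allele_id parameter are hypothetical, and the sketch assumes a null value arrives as an empty string or None.

```python
from typing import Dict, Optional


def parse_pm5(value: Optional[str], own_allele_id: Optional[str] = None) -> Dict[str, int]:
    """Map each ClinVar allele ID to its star rating.

    Entries are '+' delimited, and each entry is 'alleleID::stars'.
    Any entry matching own_allele_id (this exact variant) is dropped.
    """
    entries: Dict[str, int] = {}
    if not value:  # assumption: null is represented as None or an empty string
        return entries
    for entry in value.split("+"):
        allele_id, stars = entry.split("::")
        if allele_id == own_allele_id:
            continue  # back filter: skip ClinVar records for this same variant
        entries[allele_id] = int(stars)
    return entries


print(parse_pm5("27037::1+27048::1"))
print(parse_pm5("27037::1+27048::1", own_allele_id="27037"))
```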
The heart of Talos's utility is in enabling explicit re-analysis, which is done in two key ways:
- each time we query for gene panel/ROI data from PanelApp, we record the results
  - if a gene features on a panel where it was previously absent, all variants in the gene will be eligible for Category2 for the current run (if the participant in question has the panel applied)
  - if a panel is applied in this analysis when it was previously not used, all variants in all genes on that panel are eligible for Category2 for the current run (if the participant in question has the panel applied)
As we only search for 'Green' genes in PanelApp (those with sufficient evidence of disease association), a gene being upgraded from Red or Amber-rated to Green will be picked up as a new gene.
Example:
- In the previous analysis, panel number 42 (phenotype: Boneitis) was applied, containing geneX and geneY
- In the current analysis, panel 42 newly features geneZ
- PatientA was phenotype-matched to panel 42, and has a geneZ variant marked as Category2, so this can reach the report
- PatientB was not phenotype-matched to panel 42, but has a geneZ variant marked as Category2. This category is stripped off, as panel 42 was not applied to this participant.
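The panel comparison in this example can be sketched with plain sets. This is an illustration under assumed data shapes (a mapping of panel ID to its Green genes); new_genes_per_panel and keep_category_2 are hypothetical names, and panel 137 is an invented stand-in for "some other panel".

```python
from typing import Dict, Set


def new_genes_per_panel(
    previous: Dict[int, Set[str]], current: Dict[int, Set[str]]
) -> Dict[int, Set[str]]:
    """Return the genes on each current panel that were absent from the previous run.

    A panel missing from `previous` was not applied before, so all of its genes count as new.
    """
    new: Dict[int, Set[str]] = {}
    for panel_id, genes in current.items():
        fresh = genes - previous.get(panel_id, set())
        if fresh:
            new[panel_id] = fresh
    return new


def keep_category_2(gene: str, participant_panels: Set[int], new_genes: Dict[int, Set[str]]) -> bool:
    """Category2 survives only if the gene is new on a panel applied to this participant."""
    return any(gene in new_genes.get(panel_id, set()) for panel_id in participant_panels)


previous = {42: {"geneX", "geneY"}}
current = {42: {"geneX", "geneY", "geneZ"}}
new = new_genes_per_panel(previous, current)
print(keep_category_2("geneZ", {42}, new))   # PatientA: panel 42 applied
print(keep_category_2("geneZ", {137}, new))  # PatientB: panel 42 not applied
```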
- each time Talos runs, a minimised representation of the results is made; for each participant:
  - list the variants that were reported
  - the categories those variants were annotated with
  - and the date the category was first applied
  - the most recent date that the evidence changed (new category labels were applied)
- before the current report is written to file, the latest history file (if one exists) is checked:
  - if this variant & category was seen before, the first_tagged date is taken from the history file
  - if the variant was Pathogenic in ClinVar, the number of evidence stars is recorded. In future we may want to highlight improved evidence in the report as a review priority
  - if a category is assigned for the first time, the evidence_last_updated date is set to today
Together, these allow us to create reports which are easily filtered for events which occurred for the first time in this latest run, removing all variants which were previously flagged.
n.b. we do not currently hard-filter these results to remove previously-seen variants; we just enable that action by others
Example:
- If a variant appears as only Cat.1, and was previously a Cat.1, it will have first_tagged and evidence_last_updated set to the date of first appearance
- If a variant has been seen as a Cat.1 before, and now is both Cat.1 & Cat.2, first_tagged will be set to the date of first appearance, but evidence_last_updated will be set to today. The history file will be updated to show that Cat.2 was applied today
- If a variant was never seen before and is now a Cat.1, first_tagged and evidence_last_updated will be set to today, and the history file will be updated to show that this variant was seen as a Cat.1, today
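The three cases above can be reproduced with a small sketch of the date bookkeeping. The history structure used here (variant key mapped to category mapped to the date first applied) and the update_history function are assumptions made for illustration; the real history file format may differ.

```python
from datetime import date
from typing import Dict, List, Tuple

# Assumed shape: variant key -> {category label -> date that label was first applied}
History = Dict[str, Dict[str, date]]


def update_history(
    history: History, variant: str, categories: List[str], today: date
) -> Tuple[date, date]:
    """Record any newly assigned categories and return (first_tagged, evidence_last_updated)."""
    if not categories:
        raise ValueError("a reported variant must carry at least one category")
    seen = history.setdefault(variant, {})
    for category in categories:
        # Dates for previously seen categories are kept; new ones are stamped today
        seen.setdefault(category, today)
    first_tagged = min(seen.values())
    evidence_last_updated = max(seen.values())
    return first_tagged, evidence_last_updated


history: History = {}
first_run, second_run = date(2024, 1, 1), date(2024, 6, 1)
print(update_history(history, "variantA", ["1"], first_run))        # never seen before
print(update_history(history, "variantA", ["1"], second_run))       # same category again
print(update_history(history, "variantA", ["1", "2"], second_run))  # Cat.2 newly applied
```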