All notable changes to this project will be documented in this file.
Suggested headings per release (as appropriate) are:
Added
for new features.Changed
for changes in existing functionality.Deprecated
for soon-to-be removed features.Removed
for now removed features.Fixed
for any bug fixes.Security
in case of vulnerabilities.
The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.
[6.0.0] - 2024-10-21
- MakePhenopackets stage, which generates a Phenopacket file from Metamist
- Renamed
GenerateSeqrFile
toMinimiseOutputForSeqr
- Renamed
vcf_to_mt
toVcfToMt
- The 'extended PED' file format is no longer used - instead we use a normal 6-col PED file, and a separate file with
phenotypic data and external IDs for each participant in a GA4GH-compliant Cohort/Phenopacket format
- An adapter remains in talos/CPG to convert the old format to the new format
- GeneratePED has been deleted
[5.0.0] - 2024-07-08
- Throwing all the source code in a
/src
folder - Principal scripts are renamed, and accessible as command-line entrypoints
- GeneratePED
- GeneratePanelData
- QueryPanelapp
- RunHailFiltering
- RunHailFilteringSV
- ValidateMOI
- CreateTalosHTML
- extensive removing of the pre-commit exclusions, and consequent reformatting to pass new tests
- documentation updated
[4.0.0] - 2024-06-09
- Name AIP -> Talos
- Pre-commit tooling changed again (linting with Ruff instead of Black)
- Removes use of QoB, and relies on 'local' Hail Query operations
- Drops Peddy in favour of Peds for a more minimal install
- Splits metamist querying into a separate script, with all other downstream operations using an extended PED format (6 cols + ext IDs + HPO IDs)
MOI testing
- Uses the altered pedigree representation for MOI
- Each variant is interpreted relative to the immediate Trio (i.e. a quad is interpreted as 2 separate trios, not as a single variant/variants solving all presentations in a family)
- An adapter to take a VEP-annotated VCF and restructure in Hail to match the VEP-from-JSON annotated version we natively expect from the seqr loader pipeline
- Everything to do with VEP... This does not contain a vep installation, and no longer determines how/if VEP is to be run. Except for ClinVar reanalysis - still in the process of dog-fooding that from ClinvArbitration instead of reimplementing here
- Removes even an example of analysis outcome registration from this repository, not even we use that functionality...
[3.2.2] - 2024-04-12
Huge re-working of the pre-commit tooling, resulting in reduced overall line count
[3.0.0] - 2023-12-08
This was a MASSIVE refactor affecting almost all parts of the codebase. Instead of characterising data in various places using type hinting, I've now introducted Pydantic models to formally represent the data used in each part of the application (see reanalysis/models.py). This has the benefit of making the code more readable, and also allows for the use of Pydantic's validation and parsing.
A key note is that all serialised/deserialised data is now done through a Pydantic validation layer, forcing the data to conform to the model. This may require some objects (e.g. state/history files) to be reformatted, as during this work I identified some inconsistencies in the data.
- Data models for every aspect of the analysis
- Pydantic validation layer for all serialised/deserialised data
- Specifically, all variants inherit from VariantCommon, and Structural and Small Variants each have a fundamentally different object representation, with a common interface but some internal methods specific to each variant type. This is extensible (e.g. further inheritance for STRs)
- All minimial-case representations of Variants or Reportable events used in unit tests are now represented as full Pydantic models for the corresponding data type
- So many little things, this was a +3000/-2000 line change
[2.1.0] - 2023-11-20
This change involved the introduction of SV data, and the co-processing of SV and Small Variant data in a unified analysis process. The SV data format expected is based on the GATK-SV pipeline and annotation.
This also includes the addition of the sv1
category, which represents LoF structural variants
- SV data is now processed alongside small variants
sv1
category specific to SV data
- Some of the AbstractVariant loading is modified to allow for SV data (e.g. defaults for depths)
[2.0.0 & 2.0.1] - 2023-10-26
This bump is less substantial than the version number suggests, but in hindsight the previous version should probably have been a major bump due to wholesale changes in how input is provided.
- CategoryBoolean6 (AlphaMissense) is now active, and should be implemented in a backwards compatible way (no failure if annotation is absent)
- VCF export includes
am_class
andam_pathogenicity
(defaulting to empty String) - If these fields are not missing, they should render in the report's variant drop-down drawer
- The Hail Labelling stage also takes a dataset argument (used to generate a dataset-specific path for the temporary files/checkpoints)
- Removed a second config file from the test framework (consolidate all fields into 1 file)
- MyPy and Ruff are implemented as replacement linting tools. Stricter, and a billion times faster
- Flake8 and PyLint are removed as linting tools
[1.2.0] - 2023-10-16
- a lot. Like... a lot
- All the Cohort-specific content in the configuration files has been removed from the main config file. For CPG usage we have some sensitivities around the open publication of the cohorts and details of the analysis carried out. As a precaution (although this config does not contain sample-level data) we have removed the cohort-specific content from this repository and moved it to a private repository.
- Structurally the cohort-specific region of the config file has changed. This is to better facilitate running of Talos
through our main analysis pipeline, requiring less manual intervention to run exomes and genomes separately. The
utils methods
get_cohort_config
andget_cohort_seq_type_conf
have been updated to reflect this.config.cohorts.COHORT
contains details general to a whole cohort - blacklisted genes, additional panels, solved casesconfig.cohorts.COHORT.GENOME
contains details specific to the genome analysis of a cohort - the Seqr project name, mapping between individual and seqr family IDs, and any labelled variants for this groupconfig.cohorts.COHORT.EXOME
is the counterpart to.GENOME
- Hail Labelling/Filtering now takes an output path, to be more easily included in a Hail Batch workflow
- The run preparation and HPO-panel script now use GQL queries, which drops a few methods
- The external_lookup file is no longer used - the same Internal-External ID mapping is noq stored alongside phenotype data in the HPO panel file.
--plink
argument to interpretation_runner is renamed to--pedigree
- Methods to reduce a cohort by a fixed percentage have been deleted - was useful as a potential test/train set, but has not been used recently
[1.1.5] - 2023-08-28
- reanalysis/interpretation_runner.py creates two reports; the first report is all standard content from the run, the second report is variants new within this latest run
- Relevant interfaces with the output file registration and report hunter scripts are altered
- The report hunter script is altered to create two separate index pages; one for the standard reports, one to only look for the latest-variant report
[1.1.4] - 2023-06-28
- Updated to use the latest models in Metamist/sample metadata DB
- Queries updated to use GraphQL
[1.1.3] - 2023-06-22
- Adds a helper script which exports a file formatted specifically for ingestion into Seqr
[1.1.2] - 2023-05-31
- Adds a number of changes to how ClinVar pathogenic variants are processed
- See this PR
[1.1.1] - 2023-05-08
- See this PR
- Adds the PM5 category, and the VEP content to create the relevant annotation data
[1.0.1] - 2023-04-01
- Added this changelog
- Allow for addition of non-Metamist metadata registrars, selectable in config
- Added Variant Tagging in report, adopted from MW@MSFT's fork - #271
- Add family ID into top level variant table
- Separated out metadata registration into a separate file & job
- File registration is set to
always_run
, if indicated in config - De Novo calling (hail method) can have
max_parent_ab
ratio overridden in config - Establishes a method-contract for altering metadata registrars and HTML scripts