Collecting recently failed variants as a list. please add #545

Open
Peter-J-Freeman opened this issue Sep 13, 2023 · 139 comments

Comments

@Peter-J-Freeman
Collaborator

chr5:112839840_112839842delGGCinsTGA b38

@Peter-J-Freeman
Collaborator Author

Peter-J-Freeman commented Sep 13, 2023

19-40397933-ATCT-A b38
11-118505219-TTC-T b38
'5-177248182-G-A b38
chr10:g.102360218C>G b38

All seem to be the same error

Traceback (most recent call last):
  File "/local/miniconda3/envs/vvweb_v2/lib/python3.10/site-packages/VariantValidator/modules/vvMixinCore.py", line 739, in validate
    toskip = self._get_transcript_info(my_variant)
  File "/local/miniconda3/envs/vvweb_v2/lib/python3.10/site-packages/VariantValidator/modules/vvMixinCore.py", line 1611, in _get_transcript_info
    variant.gene_symbol = entry['hgnc_symbol']

NOTE: These are now fixed

@leicray
Contributor

leicray commented Oct 9, 2023

The attached text file contains a long list of variants that have triggered ERROR messages from the interactive validation tool since the start of September this year.

Some of these might now be handled correctly since the recent patches.

variants that trigger error messages.txt

GRCh37 variants fixed

@leicray
Contributor

leicray commented Oct 9, 2023

It looks like a user is trying to validate NM_024496.4:c.369_374del which does validate correctly in the interactive tool.

However, the error message says:

Internal Server Error: /bed/

TypeError at /bed/
create_bed_file() missing 5 required positional arguments: 'chromosome', 'build', 'genomic', 'vcf', and 'version'

That looks like the vcf2hgvs tool is being used. However, that would require the user to place the variant in a text file and then upload that file to the vcf2hgvs tool. Possible, but unlikely.

@Peter-J-Freeman
Collaborator Author

Variant: 1-156138613-C-T

Hello, I'm having a problem validating the synonymous variant in LMNA (ClinVar ID 14500) - NM_170707.4(LMNA):c.1824C>T p.(Gly608=). I tried different ways, including chr1(GRCh38):g.156138613C>T and 1-156138613-C-T. Error message: Unable to validate the submitted variant against the GRCh38 assembly. Thank you in advance.

@Peter-J-Freeman
Collaborator Author

> It looks like a user is trying to validate NM_024496.4:c.369_374del which does validate correctly in the interactive tool.
>
> However, the error message says:
>
> Internal Server Error: /bed/
>
> TypeError at /bed/
> create_bed_file() missing 5 required positional arguments: 'chromosome', 'build', 'genomic', 'vcf', and 'version'
>
> That looks like the vcf2hgvs tool is being used. However, that would require the user to place the variant in a text file and then upload that file to the vcf2hgvs tool. Possible, but unlikely.

This is the code trying to create a UCSC link I believe. Not VCF. Thanks for logging it
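
For reference, the TypeError quoted above is the message Python produces when a function is called without five of its required positional parameters. The sketch below reproduces it with a hypothetical stand-in; the real signature and body of create_bed_file are not shown in this thread, so only the five parameter names from the error message are used.

```python
def create_bed_file(chromosome, build, genomic, vcf, version):
    """Hypothetical stand-in: only the parameter names come from the error message."""
    return f"{chromosome}:{build}:{genomic}:{vcf}:{version}"


try:
    # Calling with no arguments mimics a caller that supplies none of the five values.
    create_bed_file()
except TypeError as exc:
    message = str(exc)
    print(message)
```

Whatever routes the /bed/ request to this function is evidently not passing those five values through, which would fit a link-building code path rather than a file upload.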

@leicray
Contributor

leicray commented Oct 31, 2023

Here is another one that ought not to trip up the system: NM_000179.3:c.4083dup

It generates error messages from the interactive service and submission to the batch tools also fails. The reference sequence is the MANE Select transcript for the MSH6 gene.

The traceback message for failure to validate via the batch tool is:

Traceback (most recent call last):
  File "/local/py3Repos/variantValidator/VariantValidator/modules/vvMixinCore.py", line 752, in validate
    toskip = mappers.transcripts_to_gene(my_variant, self, select_transcripts_dict_plus_version)
  File "/local/py3Repos/variantValidator/VariantValidator/modules/mappers.py", line 643, in transcripts_to_gene
    protein_dict = validator.myc_to_p(hgvs_coding, variant.evm, re_to_p=False, hn=variant.hn)
  File "/local/py3Repos/variantValidator/VariantValidator/modules/vvMixinInit.py", line 535, in myc_to_p
    start_aa = utils.one_to_three(aa_seq[0])
IndexError: string index out of range

In addition, this triggers a further exception:

Traceback (most recent call last):
  File "/local/miniconda3/envs/vvweb_v2/lib/python3.10/site-packages/celery/app/trace.py", line 412, in trace_task
    R = retval = fun(*args, **kwargs)
  File "/local/miniconda3/envs/vvweb_v2/lib/python3.10/site-packages/celery/app/trace.py", line 704, in protected_call
    return self.run(*args, **kwargs)
  File "/local/VVweb/web/tasks.py", line 60, in batch_validate
    output = validator.validate(variant, genome, transcripts)
  File "/local/py3Repos/variantValidator/VariantValidator/modules/vvMixinCore.py", line 1462, in validate
    raise fn.VariantValidatorError('Validation error')
VariantValidator.modules.utils.VariantValidatorError: Validation error
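
The IndexError in the first traceback is what Python raises when an empty string is indexed, so aa_seq was presumably empty when aa_seq[0] was evaluated. A minimal sketch of a guard follows. Both functions are hypothetical stand-ins (one_to_three here is a three-entry lookup, not the real utils.one_to_three), and returning None for an empty sequence is only an assumption; how myc_to_p should actually respond is for the maintainers to decide.

```python
def one_to_three(code):
    """Hypothetical stand-in for utils.one_to_three with a tiny lookup table."""
    return {"M": "Met", "A": "Ala", "*": "Ter"}[code]


def safe_start_aa(aa_seq):
    """Return the three-letter code of the first residue, or None if there is none."""
    if not aa_seq:
        # An empty sequence has no first residue; indexing it would raise IndexError.
        return None
    return one_to_three(aa_seq[0])


# Reproduce the error text from the traceback with a bare empty string.
try:
    ""[0]
except IndexError as exc:
    error_text = str(exc)

print(error_text)
print(safe_start_aa(""))
print(safe_start_aa("MA"))
```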

@Peter-J-Freeman
Collaborator Author

Peter-J-Freeman commented Oct 31, 2023 via email

@leicray
Contributor

leicray commented Oct 31, 2023

What do you mean by "add it"? This is the report.

@Peter-J-Freeman
Collaborator Author

Peter-J-Freeman commented Oct 31, 2023 via email

@leicray
Contributor

leicray commented Nov 1, 2023

Here is another one that trips up the interactive and batch validators:

11:2587692del (GRCh38)

@Peter-J-Freeman
Collaborator Author

Thanks @leicray. Realised it's a git email this time. I'm gonna do a little debugging now. Need time away from grant writing

@leicray
Contributor

leicray commented Nov 2, 2023

And another one:

chr5:g.125887814C>T (GRCh37)

@Peter-J-Freeman
Collaborator Author

Will come back to this one NG_059281.1:g.4962G>C (GRCh38). It's a database issue. Missing records

@Peter-J-Freeman
Collaborator Author

This one too NG_061374.1:g.11229T>C (b38)

@Peter-J-Freeman
Collaborator Author

So, the issue was that RefSeq are not maintaining RefSeqGene lookup tables. I added code to get the data from the API on fails. These variants are now fixed, but will not be fixed live until I do a new database build
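
The "get the data from the API on fails" pattern can be sketched roughly as below. This is not the code that was added to VariantValidator: the function name, the dictionary standing in for the lookup table, the fake remote call, and the placeholder accessions are all made up for illustration.

```python
def lookup_refseqgene(accession, local_table, fetch_remote):
    """Try the local lookup table first, then fall back to a remote fetch.

    Returns a (record, source) tuple, where source is "local", "remote" or "missing".
    """
    record = local_table.get(accession)
    if record is not None:
        return record, "local"
    # Local miss: ask the remote service (passed in so that it can be faked here).
    record = fetch_remote(accession)
    if record is not None:
        return record, "remote"
    return None, "missing"


# Placeholder data: these accessions and gene names are invented.
local_table = {"NG_000001.1": {"gene": "EXAMPLE1"}}


def fake_remote(accession):
    """Stand-in for a real API call."""
    remote_records = {"NG_000002.1": {"gene": "EXAMPLE2"}}
    return remote_records.get(accession)


print(lookup_refseqgene("NG_000001.1", local_table, fake_remote))
print(lookup_refseqgene("NG_000002.1", local_table, fake_remote))
print(lookup_refseqgene("NG_000003.1", local_table, fake_remote))
```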

@Peter-J-Freeman
Collaborator Author

or at least do an interim update on the live servers which may be quicker for now.

@Peter-J-Freeman
Collaborator Author

> Here is another one that trips up the interactive and batch validators:
>
> 11:2587692del (GRCh38)

I don't know if I have the words.

@leicray
Contributor

leicray commented Nov 2, 2023

I did wonder about that one. However, there is a genome build provided, a chromosome, a nucleotide number, and the nature of the change to that nucleotide. In a sense, it's little different from chr17:50198002C>A. What am I missing?

@Peter-J-Freeman
Collaborator Author

It's not that simple, sadly. I will need to figure out where to put a regex to catch it. I'm sure it'll fit, hopefully with the code that allows chr17:50198002C>A. The difference is that chr17:50198002C>A is derived as part of pseudo-VCF re-formatting. The description 11:2587692del is a bit different, because 50198002C>A comes from 50198002:C:A, whereas 11:2587692del should be derived from something like 50198002:CC:C, not "del". Hopefully it's a quick tweak though. Fun times! At least you came up with a reasonable explanation as to where the description came from
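
As a rough illustration of the distinction drawn above, the sketch below separates the two input shapes with regular expressions. It is not VariantValidator's parsing code: the patterns, the function name and the returned labels are assumptions, and the patterns only cover single-base substitutions and bare del/dup on chromosomes 1-22, X, Y and M.

```python
import re

# Optional "chr" prefix followed by a chromosome name.
CHROM = r"(?:chr)?(?P<chrom>[0-9]{1,2}|[XYM])"

# chr17:50198002C>A style: reference and alternate bases are both stated.
_SUBSTITUTION = re.compile(
    CHROM + r":(?P<pos>\d+)(?P<ref>[ACGT])>(?P<alt>[ACGT])$", re.IGNORECASE
)

# 11:2587692del style: only a position and an edit type, no bases.
_POSITION_ONLY = re.compile(
    CHROM + r":(?P<pos>\d+)(?P<edit>del|dup)$", re.IGNORECASE
)


def classify(description):
    """Label a submitted string by which of the two shapes it matches."""
    text = description.strip()
    if _SUBSTITUTION.match(text):
        return "pseudo_vcf_substitution"
    match = _POSITION_ONLY.match(text)
    if match:
        return "position_only_" + match["edit"].lower()
    return "unrecognised"


for example in ("chr17:50198002C>A", "11:2587692del", "chr14:2388944dup"):
    print(example, classify(example))
```

The position-only shape carries no reference base, so it would still need the reference sequence to be read before a VCF-like record could be built from it.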

@Peter-J-Freeman
Collaborator Author

NC_000023.11:r.650_831del

@leicray
Contributor

leicray commented Jan 17, 2024

chr11:g,108121787G>A GRCh37

The anonymous submitter also tried GRCh38 and that failed too, of course.

This should be easy to trap and correct as the comma just needs to be replaced by a full stop.
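
A minimal sketch of such a trap, assuming the typo always sits directly after the type letter that follows the colon. The helper name and the set of type letters are illustrative and are not taken from the VariantValidator code base.

```python
import re

# A single type letter directly after the colon, followed by a comma instead of a full stop.
_TYPE_COMMA = re.compile(r"(?<=:)([gcnmrp]),")


def fix_type_comma(description):
    """Rewrite ':g,' style typos as ':g.' and leave everything else untouched."""
    return _TYPE_COMMA.sub(r"\1.", description, count=1)


print(fix_type_comma("chr11:g,108121787G>A"))
print(fix_type_comma("chr11:g.108121787G>A"))
```

Reporting the correction back to the user as a warning, rather than fixing it silently, would be in keeping with the other automapping warnings in this thread.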

@Peter-J-Freeman
Collaborator Author

Will get this one done asap. Easy one hopefully

@leicray
Contributor

leicray commented Jan 23, 2024

An anonymous user has tried to validate LRG_199p1:p.? and it has failed, generating an error message.

If I rewrite the variant description as LRG_199p1:p.Met1Ala I receive the expected warnings:

- LRG_199p1:p.Met1Ala automapped to equivalent RefSeq record NP_003997.1:p.Met1Ala

- Protein level variant descriptions are not fully supported due to redundancy in the genetic code

- NP_003997.1:p.Met1Ala is HGVS compliant and contains a valid reference amino acid description

Ought to be easy to trap.

@ifokkema
Collaborator

> If I rewrite the variant description as LRG_199p1:p.Met1Ala I receive the expected warnings:

I might be wrong, but are you suggesting that is valid syntax? Because a change to the first codon leads to an unpredictable result. The docs say:

> Do not use descriptions like "p.Met1Thr", this is for sure not the consequence of the effect on protein translation.

(source)

@leicray
Contributor

leicray commented Jan 23, 2024

You are quite correct. I simply wanted to generate a variant description that would not cause the validator to fall over. I have no idea what comes next after Met1 in the DMD protein sequence, so pushed on with that.

Of course, there ought to be an additional warning that p.Met1Ala is not valid and ought to be written as p.(Met1?). Even that might be wrong.

@Peter-J-Freeman
Collaborator Author

This should be triggering the warning and I wonder if it is trying to and failing. Will look into it

@ifokkema
Collaborator

> You are quite correct. I simply wanted to generate a variant description that would not cause the validator to fall over. I have no idea what comes next after Met1 in the DMD protein sequence, so pushed on with that.

Ah, OK, you were just testing the reference sequence 😅 Never mind me!

@Peter-J-Freeman
Collaborator Author

I'm still worried that the Met1 warning wasn't generated. So 2 fixes here. A chance to increase code coverage :P

@Peter-J-Freeman
Collaborator Author

@leicray @ifokkema. OK, here I put a spanner in the works. p.Met1Ala could actually be correct, whereas p.(Met1Ala) would be p.(Met1?)

@ifokkema
Collaborator

Hmm... I don't think that has ever been observed in humans... ClinVar reports this variant, but ClinVar always lies when it comes to protein descriptions 🙄
Are you thinking of ever providing full protein description validation? If not, I would personally ignore the near-zero chance of any substitution in the Met1 codon. While translation has been proven to sometimes start at non-ATG start codons, we're actually talking about the situation where a canonical transcript by default started with ATG but now also tolerates a non-ATG start induced by a variant. CTG being the most common non-ATG start codon, in theory, a p.Met1Leu could occur. Googling around allowed me to find one paper mentioning this, but at the same time, the variant also lowered translation considerably, so even then, p.Met1Leu wouldn't actually be the correct description.

@leicray
Contributor

leicray commented Oct 14, 2024

A user has tried to validate the description NP_003485.1:c.89-2A>G and that triggers an ERROR message to the sysops. The on-screen warning is Unable to validate the submitted variant NP_003485.1:c.89-2A>G against the GRCh38 assembly.

This type of error where there is a mismatch between the reference sequence type (NP_) and the variant type (c.) ought to be immediately trapped and a more informative warning message be displayed on-screen.
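
One way such an early check might look is sketched below. The prefix-to-type table is an illustrative assumption and is not exhaustive (it ignores LRG and Ensembl identifiers, for example); the function name and the wording of the warning are also made up.

```python
import re

# Assumed, non-exhaustive mapping of RefSeq accession prefixes to the
# HGVS type letters that make sense for them.
ALLOWED_TYPES = {
    "NC_": {"g", "m"},
    "NG_": {"g"},
    "NM_": {"c", "r"},
    "NR_": {"n", "r"},
    "NP_": {"p"},
}

# Accession prefix, the rest of the accession up to the colon, then the type letter.
_DESCRIPTION = re.compile(r"^(?P<prefix>[A-Z]{2}_)[^:]+:(?P<type>[a-z])\.")


def type_mismatch_warning(description):
    """Return a warning string if the type letter does not suit the accession, else None."""
    match = _DESCRIPTION.match(description)
    if match is None:
        return None  # Not a shape this sketch understands; leave it to later checks.
    allowed = ALLOWED_TYPES.get(match["prefix"])
    if allowed is None or match["type"] in allowed:
        return None
    expected = ", ".join(f"'{letter}.'" for letter in sorted(allowed))
    return (
        f"{match['prefix']} reference sequences cannot be used with a "
        f"'{match['type']}.' description; expected {expected}"
    )


print(type_mismatch_warning("NP_003485.1:c.89-2A>G"))
print(type_mismatch_warning("NM_000179.3:c.4083dup"))
```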

@ifokkema
Collaborator

> A user has tried to validate the description NP_003485.1:c.89-2A>G and that triggers an ERROR message to the sysops. The on-screen warning is Unable to validate the submitted variant NP_003485.1:c.89-2A>G against the GRCh38 assembly.
>
> This type of error where there is a mismatch between the reference sequence type (NP_) and the variant type (c.) ought to be immediately trapped and a more informative warning message be displayed on-screen.

We could also do better. Our tool is too focused on DNA only, I suppose. It says: The reference sequence could not be recognised. Supported reference sequence IDs are from NCBI Refseq, Ensembl, and LRG.

Screenshot_20241014-092758~2.png

@leicray
Contributor

leicray commented Oct 20, 2024

An anonymous user has tried to validate the description NC_000009.12:g.92474742del and that triggers an ERROR message to the sysops. The on-screen warning is Unable to validate the submitted variant NC_000009.12:g.92474742del against the GRCh38 assembly.

This variant is deeply intronic in the CENPP gene and it may also lie within the ASPN gene on the opposite strand. However, there is only an Ensembl transcript for ASPN that spans the variant site. The tested variant corresponds to NM_001012267.3(CENPP):c.564+94883del which does validate correctly in the interactive validator, albeit with a warning about the redundant gene symbol.

To state the obvious, both the genome-based and transcript-based descriptions should validate.

@leicray
Contributor

leicray commented Nov 17, 2024

An anonymous user has tried to validate the description chr14:2388944dup (GRCh37 & all transcripts) and that triggered an ERROR message to the sysops.

The relevant lines of the ERROR message are:

variant = 'chr14:2388944dup'
select_transcripts = ''
genomebuild = 'GRCh37'
refsource = 'refseq'
pdf_request = 'False'

If I resubmit the same validation request (logged in) the corresponding lines are:

variant = 'chr14:2388944dup'
transcripts = 'transcripts'
genomebuild = 'GRCh37'
pdf_request = 'False'
refsource = 'refseq'

Notice that the line order is slightly different, and that the transcripts parameter has a different name and a different value in each case.

These ERROR notices were generated at Wed 13/11/2024 06:18 and at Sun 17/11/2024 14:51 respectively, just in case that might be relevant because of patches to the system between the two dates and times.

Now, just to make it interesting, I submitted the job again and specified "mane" for the transcripts, even though I have no idea if any MANE transcripts span the duplication for GRCh37. This timed out but did pop up the on-screen message to "...resubmit as a batch process". Just for good measure, it also resulted in an ERROR message to the sysops.

@Peter-J-Freeman
Collaborator Author

Peter-J-Freeman commented Nov 18, 2024

Thanks @leicray. I'm not sure this format will be handled at the moment. It can be added in.

Edit: no, this looks like something weird. The variant is updated to NC_000014.8:g.2388944dup. So now I will look at the transcripts.

OK, this is what is causing the problem. This is a bad region: a huge run of N bases. So in real terms, the variant description is not valid, since dupN in a run of Ns is not something that would make sense. Can't vary unknown data.

NC_000014.8:g.19000000dupN
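
A check for this situation could be sketched as below, using a toy string in place of the real chromosome sequence. The function name is hypothetical, and treating the region as uncertain as soon as any base in it is N is an assumption; requiring every base to be N would be an equally plausible rule.

```python
def touches_uncertain_bases(sequence, start, end):
    """Return True if any base in the 1-based, inclusive interval is N."""
    region = sequence[start - 1:end]
    return "N" in region.upper()


# Toy reference: four known bases, a run of six Ns, four known bases.
toy_reference = "ACGTNNNNNNACGT"

print(touches_uncertain_bases(toy_reference, 6, 6))
print(touches_uncertain_bases(toy_reference, 4, 6))
print(touches_uncertain_bases(toy_reference, 1, 4))
```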

@Peter-J-Freeman
Collaborator Author

@leicray @John-F-Wagstaff

I propose we handle g. variants that map to N bases as follows

{
    "flag": "warning",
    "metadata": {
        "variantvalidator_hgvs_version": "2.2.0",
        "variantvalidator_version": "2.2.1.dev729+g86e62d8.d20241105",
        "vvdb_version": "vvdb_2024_8",
        "vvseqrepo_db": "VV_SR_2024_09/master",
        "vvta_version": "vvta_2024_09"
    },
    "validation_warning_1": {
        "alt_genomic_loci": [],
        "annotations": {},
        "gene_ids": {},
        "gene_symbol": "",
        "genome_context_intronic_sequence": "",
        "hgvs_lrg_transcript_variant": "",
        "hgvs_lrg_variant": "",
        "hgvs_predicted_protein_consequence": {
            "lrg_slr": "",
            "lrg_tlr": "",
            "slr": "",
            "tlr": ""
        },
        "hgvs_refseqgene_variant": "",
        "hgvs_transcript_variant": "",
        "primary_assembly_loci": {},
        "reference_sequence_records": "",
        "refseqgene_context_intronic_sequence": "",
        "rna_variant_descriptions": null,
        "selected_assembly": "GRCh37",
        "submitted_variant": "chr14:2388944dup",
        "transcript_description": "",
        "validation_warnings": [
            "This is not a valid HGVS variant description, because no reference sequence ID has been provided",
            "UncertainReferenceError: The submitted variant description NC_000014.8:g.2388944dupN refers to a genomic reference region with an uncertain base composition (N)"
        ],
        "variant_exonic_positions": null
    }
}

I could use some help wording the error and setting the error flag

@John-F-Wagstaff
Collaborator

John-F-Wagstaff commented Nov 18, 2024

"UncertainReferenceError" seems good as a flag. We could use "UncertainSequenceError" instead, but N bases also interfere with any attempt to validate the position of in/del-type variants. That is exacerbated by the fact that, in the genome, they usually turn up in long stretches or not at all. This biases me towards sticking with the original.

We might want to add something like " and thus neither the sequence nor the position can be accurately validated." to the end of your current error message, just to be more specific about why we did not validate further. It should not be needed for the more alert or clued-in users, but explaining it could reduce user frustration and save us some questions later.

@Peter-J-Freeman
Collaborator Author

> An anonymous user has tried to validate the description NC_000009.12:g.92474742del and that triggers an ERROR message to the sysops. The on-screen warning is Unable to validate the submitted variant NC_000009.12:g.92474742del against the GRCh38 assembly.
>
> This variant is deeply intronic in the CENPP gene and it may also lie within the ASPN gene on the opposite strand. However, there is only an Ensembl transcript for ASPN that spans the variant site. The tested variant corresponds to NM_001012267.3(CENPP):c.564+94883del which does validate correctly in the interactive validator, albeit with a warning about the redundant gene symbol.
>
> To state the obvious, both the genome-based and transcript-based descriptions should validate.

This is not just deep intronic. It is at a gap site. I have got the code working but need to validate the output. Nor is it obvious, because it was missed in the query :)

>>> import json
>>> import VariantValidator
>>> vval = VariantValidator.Validator()
>>> variant = 'NC_000009.12:g.92474742del' # variant 2
>>> genome_build = 'GRCh38'
>>> select_transcripts = 'all'
>>> transcript_set = 'refseq'
>>> validate = vval.validate(variant, genome_build, select_transcripts, transcript_set)
NM_001193335.3:c.152_154+3delATGAGGinsATGAG
NM_001193335.3:c.154+3_155=
NM_017680.6:c.152_154+3delATGAGGinsATGAG
NM_017680.6:c.154+3_155=
>>> validation = validate.format_as_dict(with_meta=True)
>>> print(json.dumps(validation, sort_keys=True, indent=4, separators=(',', ': ')))
{
    "NM_001012267.3:c.564+94883del": {
        "alt_genomic_loci": [
            {
                "grch38": {
                    "hgvs_genomic_description": "NW_025791788.1:g.309472del",
                    "vcf": {
                        "alt": "T",
                        "chr": "HG1012_PATCH",
                        "pos": "309470",
                        "ref": "TC"
                    }
                }
            },
            {
                "hg38": {
                    "hgvs_genomic_description": "NW_025791788.1:g.309472del",
                    "vcf": {
                        "alt": "T",
                        "chr": "NW_025791788.1",
                        "pos": "309470",
                        "ref": "TC"
                    }
                }
            }
        ],
        "annotations": {
            "chromosome": "9",
            "db_xref": {
                "CCDS": "CCDS35063.1",
                "ensemblgene": null,
                "hgnc": "HGNC:32933",
                "ncbigene": "401541",
                "select": "MANE"
            },
            "ensembl_select": false,
            "mane_plus_clinical": false,
            "mane_select": true,
            "map": "9q22.31",
            "note": "centromere protein P",
            "refseq_select": true,
            "variant": "1"
        },
        "gene_ids": {
            "ccds_ids": [
                "CCDS69618",
                "CCDS35063"
            ],
            "ensembl_gene_id": "ENSG00000188312",
            "entrez_gene_id": "401541",
            "hgnc_id": "HGNC:32933",
            "omim_id": [
                "611505"
            ],
            "ucsc_id": "uc004arz.5"
        },
        "gene_symbol": "CENPP",
        "genome_context_intronic_sequence": "NC_000009.12(NM_001012267.3):c.564+94883del",
        "hgvs_lrg_transcript_variant": "",
        "hgvs_lrg_variant": "",
        "hgvs_predicted_protein_consequence": {
            "lrg_slr": "",
            "lrg_tlr": "",
            "slr": "NP_001012267.1:p.?",
            "tlr": "NP_001012267.1:p.?"
        },
        "hgvs_refseqgene_variant": "",
        "hgvs_transcript_variant": "NM_001012267.3:c.564+94883del",
        "primary_assembly_loci": {
            "grch37": {
                "hgvs_genomic_description": "NC_000009.11:g.95237024del",
                "vcf": {
                    "alt": "T",
                    "chr": "9",
                    "pos": "95237022",
                    "ref": "TC"
                }
            },
            "grch38": {
                "hgvs_genomic_description": "NC_000009.12:g.92474742del",
                "vcf": {
                    "alt": "T",
                    "chr": "9",
                    "pos": "92474740",
                    "ref": "TC"
                }
            },
            "hg19": {
                "hgvs_genomic_description": "NC_000009.11:g.95237024del",
                "vcf": {
                    "alt": "T",
                    "chr": "chr9",
                    "pos": "95237022",
                    "ref": "TC"
                }
            },
            "hg38": {
                "hgvs_genomic_description": "NC_000009.12:g.92474742del",
                "vcf": {
                    "alt": "T",
                    "chr": "chr9",
                    "pos": "92474740",
                    "ref": "TC"
                }
            }
        },
        "reference_sequence_records": {
            "protein": "https://www.ncbi.nlm.nih.gov/nuccore/NP_001012267.1",
            "transcript": "https://www.ncbi.nlm.nih.gov/nuccore/NM_001012267.3"
        },
        "refseqgene_context_intronic_sequence": "",
        "rna_variant_descriptions": null,
        "selected_assembly": "GRCh38",
        "submitted_variant": "NC_000009.12:g.92474742del",
        "transcript_description": "Homo sapiens centromere protein P (CENPP), transcript variant 1, mRNA",
        "validation_warnings": [],
        "variant_exonic_positions": {
            "NC_000009.11": {
                "end_exon": "5i",
                "start_exon": "5i"
            },
            "NC_000009.12": {
                "end_exon": "5i",
                "start_exon": "5i"
            }
        }
    },
    "NM_001193335.3:c.153delinsTGA": {
        "alt_genomic_loci": [
            {
                "grch38": {
                    "hgvs_genomic_description": "NW_025791788.1:g.309472delinsTCA",
                    "vcf": {
                        "alt": "TCA",
                        "chr": "HG1012_PATCH",
                        "pos": "309472",
                        "ref": "C"
                    }
                }
            },
            {
                "hg38": {
                    "hgvs_genomic_description": "NW_025791788.1:g.309472delinsTCA",
                    "vcf": {
                        "alt": "TCA",
                        "chr": "NW_025791788.1",
                        "pos": "309472",
                        "ref": "C"
                    }
                }
            }
        ],
        "annotations": {
            "chromosome": "9",
            "db_xref": {
                "CCDS": null,
                "ensemblgene": null,
                "hgnc": "HGNC:14872",
                "ncbigene": "54829",
                "select": false
            },
            "ensembl_select": false,
            "mane_plus_clinical": false,
            "mane_select": false,
            "map": "9q22.31",
            "note": "asporin",
            "refseq_select": false,
            "variant": "2"
        },
        "gene_ids": {
            "ccds_ids": [],
            "ensembl_gene_id": "ENSG00000106819",
            "entrez_gene_id": "54829",
            "hgnc_id": "HGNC:14872",
            "omim_id": [
                "608135"
            ],
            "ucsc_id": "uc004ase.3"
        },
        "gene_symbol": "ASPN",
        "genome_context_intronic_sequence": "",
        "hgvs_lrg_transcript_variant": "",
        "hgvs_lrg_variant": "",
        "hgvs_predicted_protein_consequence": {
            "lrg_slr": "",
            "lrg_tlr": "",
            "slr": "NP_001180264.1:p.(E51Dfs*41)",
            "tlr": "NP_001180264.1:p.(Glu51AspfsTer41)"
        },
        "hgvs_refseqgene_variant": "",
        "hgvs_transcript_variant": "NM_001193335.3:c.153delinsTGA",
        "primary_assembly_loci": {
            "grch37": {
                "hgvs_genomic_description": "NC_000009.11:g.95237024del",
                "vcf": {
                    "alt": "T",
                    "chr": "9",
                    "pos": "95237022",
                    "ref": "TC"
                }
            },
            "grch38": {
                "hgvs_genomic_description": "NC_000009.12:g.92474742del",
                "vcf": {
                    "alt": "T",
                    "chr": "9",
                    "pos": "92474740",
                    "ref": "TC"
                }
            },
            "hg19": {
                "hgvs_genomic_description": "NC_000009.11:g.95237024del",
                "vcf": {
                    "alt": "T",
                    "chr": "chr9",
                    "pos": "95237022",
                    "ref": "TC"
                }
            },
            "hg38": {
                "hgvs_genomic_description": "NC_000009.12:g.92474742del",
                "vcf": {
                    "alt": "T",
                    "chr": "chr9",
                    "pos": "92474740",
                    "ref": "TC"
                }
            }
        },
        "reference_sequence_records": {
            "protein": "https://www.ncbi.nlm.nih.gov/nuccore/NP_001180264.1",
            "transcript": "https://www.ncbi.nlm.nih.gov/nuccore/NM_001193335.3"
        },
        "refseqgene_context_intronic_sequence": "",
        "rna_variant_descriptions": null,
        "selected_assembly": "GRCh38",
        "submitted_variant": "NC_000009.12:g.92474742del",
        "transcript_description": "Homo sapiens asporin (ASPN), transcript variant 2, mRNA",
        "validation_warnings": [
            "Submitted description does not represent a true variant because it is an artefact of aligning NM_017680.6 with NC_000009.12 (genome build GRCh38)",
            "NM_001193335.3 contains 3 fewer bases between c.152_153 than NC_000009.12",
            "NM_001193335.3:c.152_154delinsATGAG automapped to NM_001193335.3:c.153delinsTGA"
        ],
        "variant_exonic_positions": {
            "NC_000009.12": {
                "end_exon": "2",
                "start_exon": "2"
            }
        }
    },
    "NM_001286969.1:c.228+94883del": {
        "alt_genomic_loci": [
            {
                "grch38": {
                    "hgvs_genomic_description": "NW_025791788.1:g.309472del",
                    "vcf": {
                        "alt": "T",
                        "chr": "HG1012_PATCH",
                        "pos": "309470",
                        "ref": "TC"
                    }
                }
            },
            {
                "hg38": {
                    "hgvs_genomic_description": "NW_025791788.1:g.309472del",
                    "vcf": {
                        "alt": "T",
                        "chr": "NW_025791788.1",
                        "pos": "309470",
                        "ref": "TC"
                    }
                }
            }
        ],
        "annotations": {
            "chromosome": "9",
            "db_xref": {
                "CCDS": null,
                "ensemblgene": null,
                "hgnc": "HGNC:32933",
                "ncbigene": "401541",
                "select": false
            },
            "ensembl_select": false,
            "mane_plus_clinical": false,
            "mane_select": false,
            "map": "9q22.31",
            "note": "centromere protein P",
            "refseq_select": false,
            "variant": "2"
        },
        "gene_ids": {
            "ccds_ids": [
                "CCDS69618",
                "CCDS35063"
            ],
            "ensembl_gene_id": "ENSG00000188312",
            "entrez_gene_id": "401541",
            "hgnc_id": "HGNC:32933",
            "omim_id": [
                "611505"
            ],
            "ucsc_id": "uc004arz.5"
        },
        "gene_symbol": "CENPP",
        "genome_context_intronic_sequence": "NC_000009.12(NM_001286969.1):c.228+94883del",
        "hgvs_lrg_transcript_variant": "",
        "hgvs_lrg_variant": "",
        "hgvs_predicted_protein_consequence": {
            "lrg_slr": "",
            "lrg_tlr": "",
            "slr": "NP_001273898.1:p.?",
            "tlr": "NP_001273898.1:p.?"
        },
        "hgvs_refseqgene_variant": "",
        "hgvs_transcript_variant": "NM_001286969.1:c.228+94883del",
        "primary_assembly_loci": {
            "grch37": {
                "hgvs_genomic_description": "NC_000009.11:g.95237024del",
                "vcf": {
                    "alt": "T",
                    "chr": "9",
                    "pos": "95237022",
                    "ref": "TC"
                }
            },
            "grch38": {
                "hgvs_genomic_description": "NC_000009.12:g.92474742del",
                "vcf": {
                    "alt": "T",
                    "chr": "9",
                    "pos": "92474740",
                    "ref": "TC"
                }
            },
            "hg19": {
                "hgvs_genomic_description": "NC_000009.11:g.95237024del",
                "vcf": {
                    "alt": "T",
                    "chr": "chr9",
                    "pos": "95237022",
                    "ref": "TC"
                }
            },
            "hg38": {
                "hgvs_genomic_description": "NC_000009.12:g.92474742del",
                "vcf": {
                    "alt": "T",
                    "chr": "chr9",
                    "pos": "92474740",
                    "ref": "TC"
                }
            }
        },
        "reference_sequence_records": {
            "protein": "https://www.ncbi.nlm.nih.gov/nuccore/NP_001273898.1",
            "transcript": "https://www.ncbi.nlm.nih.gov/nuccore/NM_001286969.1"
        },
        "refseqgene_context_intronic_sequence": "",
        "rna_variant_descriptions": null,
        "selected_assembly": "GRCh38",
        "submitted_variant": "NC_000009.12:g.92474742del",
        "transcript_description": "Homo sapiens centromere protein P (CENPP), transcript variant 2, mRNA",
        "validation_warnings": [],
        "variant_exonic_positions": {
            "NC_000009.11": {
                "end_exon": "4i",
                "start_exon": "4i"
            },
            "NC_000009.12": {
                "end_exon": "4i",
                "start_exon": "4i"
            }
        }
    },
    "NM_017680.6:c.153delinsTGA": {
        "alt_genomic_loci": [
            {
                "grch38": {
                    "hgvs_genomic_description": "NW_025791788.1:g.309472delinsTCA",
                    "vcf": {
                        "alt": "TCA",
                        "chr": "HG1012_PATCH",
                        "pos": "309472",
                        "ref": "C"
                    }
                }
            },
            {
                "hg38": {
                    "hgvs_genomic_description": "NW_025791788.1:g.309472delinsTCA",
                    "vcf": {
                        "alt": "TCA",
                        "chr": "NW_025791788.1",
                        "pos": "309472",
                        "ref": "C"
                    }
                }
            }
        ],
        "annotations": {
            "chromosome": "9",
            "db_xref": {
                "CCDS": null,
                "ensemblgene": null,
                "hgnc": "HGNC:14872",
                "ncbigene": "54829",
                "select": "MANE"
            },
            "ensembl_select": false,
            "mane_plus_clinical": false,
            "mane_select": true,
            "map": "9q22.31",
            "note": "asporin",
            "refseq_select": true,
            "variant": "1"
        },
        "gene_ids": {
            "ccds_ids": [],
            "ensembl_gene_id": "ENSG00000106819",
            "entrez_gene_id": "54829",
            "hgnc_id": "HGNC:14872",
            "omim_id": [
                "608135"
            ],
            "ucsc_id": "uc004ase.3"
        },
        "gene_symbol": "ASPN",
        "genome_context_intronic_sequence": "",
        "hgvs_lrg_transcript_variant": "",
        "hgvs_lrg_variant": "",
        "hgvs_predicted_protein_consequence": {
            "lrg_slr": "",
            "lrg_tlr": "",
            "slr": "NP_060150.4:p.(E51Dfs*41)",
            "tlr": "NP_060150.4:p.(Glu51AspfsTer41)"
        },
        "hgvs_refseqgene_variant": "",
        "hgvs_transcript_variant": "NM_017680.6:c.153delinsTGA",
        "primary_assembly_loci": {
            "grch37": {
                "hgvs_genomic_description": "NC_000009.11:g.95237024del",
                "vcf": {
                    "alt": "T",
                    "chr": "9",
                    "pos": "95237022",
                    "ref": "TC"
                }
            },
            "grch38": {
                "hgvs_genomic_description": "NC_000009.12:g.92474742del",
                "vcf": {
                    "alt": "T",
                    "chr": "9",
                    "pos": "92474740",
                    "ref": "TC"
                }
            },
            "hg19": {
                "hgvs_genomic_description": "NC_000009.11:g.95237024del",
                "vcf": {
                    "alt": "T",
                    "chr": "chr9",
                    "pos": "95237022",
                    "ref": "TC"
                }
            },
            "hg38": {
                "hgvs_genomic_description": "NC_000009.12:g.92474742del",
                "vcf": {
                    "alt": "T",
                    "chr": "chr9",
                    "pos": "92474740",
                    "ref": "TC"
                }
            }
        },
        "reference_sequence_records": {
            "protein": "https://www.ncbi.nlm.nih.gov/nuccore/NP_060150.4",
            "transcript": "https://www.ncbi.nlm.nih.gov/nuccore/NM_017680.6"
        },
        "refseqgene_context_intronic_sequence": "",
        "rna_variant_descriptions": null,
        "selected_assembly": "GRCh38",
        "submitted_variant": "NC_000009.12:g.92474742del",
        "transcript_description": "Homo sapiens asporin (ASPN), transcript variant 1, mRNA",
        "validation_warnings": [
            "Submitted description does not represent a true variant because it is an artefact of aligning NM_017680.6 with NC_000009.12 (genome build GRCh38)",
            "NM_017680.6 contains 3 fewer bases between c.152_153 than NC_000009.12",
            "NM_017680.6:c.152_154delinsATGAG automapped to NM_017680.6:c.153delinsTGA"
        ],
        "variant_exonic_positions": {
            "NC_000009.12": {
                "end_exon": "2",
                "start_exon": "2"
            }
        }
    },
    "flag": "gene_variant",
    "metadata": {
        "variantvalidator_hgvs_version": "2.2.0",
        "variantvalidator_version": "2.2.1.dev729+g86e62d8.d20241105",
        "vvdb_version": "vvdb_2024_8",
        "vvseqrepo_db": "VV_SR_2024_09/master",
        "vvta_version": "vvta_2024_09"
    }
}

@Peter-J-Freeman
Collaborator Author

The gene is in sense orientation with respect to the genome. The delins position in the NM_ is NM_001193335.3:c.153delinsTGA (153). The gap is between c.152_153 and 3 bases are missing from the transcript (i.e. the genome has an extra codon). So we delete 153 and add in TGA, which is the correct sequence with respect to the genome. This is now fixed and will update when we push up.

@John-F-Wagstaff, does this make sense? I think the output above is correct.

@Peter-J-Freeman
Collaborator Author

"UncertainReferenceError" seems good as a flag. We could use "UncertainSequenceError" instead, but N bases also interfere with any attempt to validate the position of in/del type variants. That problem is exacerbated by the fact that, in the genome, they usually turn up in long stretches or not at all, which biases me towards sticking with the original.

I like UncertainSequenceError.

@John-F-Wagstaff
Collaborator

John-F-Wagstaff commented Nov 18, 2024

I like UncertainSequenceError.

If it looks good to you, then either is an accurate description of the issue.

does this make sense.

It does. To validate further I would have to do a deep dive on the gap code, CIGAR in hand as it were, but it looks logical to me.

@leicray
Contributor

leicray commented Nov 26, 2024

A user has submitted the variant description NM_000179.3:r.3646_3646+1insugagauaugcauag which failed validation. The key issue is that 3646+1 is invalid in the context of an RNA. If the description is corrected to NM_000179.3:r.3646_3647insugagauaugcauag then it does validate.

There needs to be improved parsing of r. variant description submissions to ensure that they do not contain intronic coordinates.
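A minimal sketch of such a pre-parse check, assuming a plain regular-expression scan is sufficient. The helper name `find_intronic_offsets` and the pattern are hypothetical and are not part of VariantValidator; the sketch makes no attempt to handle nested descriptions.

```python
import re

# Hypothetical helper, not part of VariantValidator.
# An intronic offset is a base position followed by +N or -N,
# e.g. 3646+1 or 2950-30. The lookbehind stops the match from
# starting in the middle of a number or after a sign.
_INTRONIC_OFFSET = re.compile(r"(?<![\d*+\-])[*-]?\d+[+-]\d+")


def find_intronic_offsets(description):
    """Return the intronic coordinates found in an r. description.

    Descriptions that are not of type r. return an empty list.
    """
    _reference, sep, posedit = description.partition(":")
    if not sep or not posedit.startswith("r."):
        return []
    return _INTRONIC_OFFSET.findall(posedit[2:])


if __name__ == "__main__":
    for text in (
        "NM_000179.3:r.3646_3646+1insugagauaugcauag",
        "NM_000179.3:r.3646_3647insugagauaugcauag",
    ):
        print(text, find_intronic_offsets(text))
```

A non-empty result could be turned into a warning before the description reaches the main validation code.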

@Peter-J-Freeman
Collaborator Author

Very true. I will add this in.

@Peter-J-Freeman
Collaborator Author

How about this?

{
    "flag": "warning",
    "metadata": {
        "variantvalidator_hgvs_version": "2.2.0",
        "variantvalidator_version": "2.2.1.dev729+g86e62d8.d20241105",
        "vvdb_version": "vvdb_2024_8",
        "vvseqrepo_db": "VV_SR_2024_09/master",
        "vvta_version": "vvta_2024_09"
    },
    "validation_warning_1": {
        "alt_genomic_loci": [],
        "annotations": {},
        "gene_ids": {},
        "gene_symbol": "",
        "genome_context_intronic_sequence": "",
        "hgvs_lrg_transcript_variant": "",
        "hgvs_lrg_variant": "",
        "hgvs_predicted_protein_consequence": {
            "lrg_slr": "",
            "lrg_tlr": "",
            "slr": "",
            "tlr": ""
        },
        "hgvs_refseqgene_variant": "",
        "hgvs_transcript_variant": "",
        "primary_assembly_loci": {},
        "reference_sequence_records": "",
        "refseqgene_context_intronic_sequence": "",
        "rna_variant_descriptions": null,
        "selected_assembly": "GRCh38",
        "submitted_variant": "NM_000179.3:r.3646_3646+1insugagauaugcauag",
        "transcript_description": "",
        "validation_warnings": [
            "VariantSyntaxError: RNA (r.) reference sequences do not contain introns. Intronic descriptions are described in the context of a c. description"
        ],
        "variant_exonic_positions": null
    }
}

Also, @leicray, can you comment on this issue so we can mark it as completed and update the text as necessary?

#545 (comment)

@ifokkema
Collaborator

Related information:

  • The HVNC only relatively recently concluded that intronic sequences are not part of the RNA molecule, even though examples on the website still list this, e.g., r.2949_2950ins[2950-30_2950-12;uuag]. We are currently deciding what the correct syntax would be. Likely, this will be r.2949_2950ins[c.2950-30_2950-12;uuag] (note the addition of "c.").
  • Related: Alex recently noticed that the IUPAC nucleotide codes for RNA are actually now ACTG and not acug. Therefore, the HVNC will likely soon vote on changing the recommendations, which will then turn the example above into r.2949_2950ins[c.2950-30_2950-12;TTAG]. Nothing is final yet, but so far, nobody opposes the idea. This will also affect VariantValidator, so I thought to give you a heads-up.

@leicray
Contributor

leicray commented Nov 26, 2024

I would change the second part of the warning to read: Variant descriptions containing intronic locations are only valid in the context of a c. description.

@Peter-J-Freeman
Collaborator Author

Thanks @leicray. Done.

@ifokkema, thanks for the heads up. Removing the lowercase and u handling will be an easy job and will reduce the amount of processing. However, we will need to uppercase the input and convert U > T.

This seems like a strange decision by IUPAC. Please let us know when the HGVS votes on this.

@ifokkema
Collaborator

@ifokkema, thanks for the heads up. Removing the lowercase and u handling will be an easy job and will reduce the amount of processing. However, we will need to uppercase the input and convert U > T.

This seems like a strange decision by IUPAC. Please let us know when the HGVS votes on this.

I honestly don't know when IUPAC made this change, but a quick check online doesn't show me any pages still using the old nomenclature. So it's probably been a while. I'll keep you updated on the vote!

@leicray
Contributor

leicray commented Nov 28, 2024

A user submitted the incorrect variant description NM_000314.8:c.(209+1_254-1)del five times. When the unnecessary brackets are removed, the variant validates correctly. Can the initial parsing step be adapted to sort the rogue brackets and post a warning message?

Peter-J-Freeman added a commit that referenced this issue Dec 2, 2024
…e and intrins in r. descriptions as referred to in #545
@ifokkema
Collaborator

ifokkema commented Dec 2, 2024

A user submitted the incorrect variant description NM_000314.8:c.(209+1_254-1)del five times. When the unnecessary brackets are removed, the variant validates correctly. Can the initial parsing step be adapted to sort the rogue brackets and post a warning message?

I doubt NM_000314.8:c.209+1_254-1del, a simple omission of the parentheses, was meant. More likely, NM_000314.8:c.(209+1_210-1)_(253+1_254-1)del was meant, or in the old notation, NM_000314.8:c.210-?_253+?del. I understand that the best interpretation for validating this on the sequence level is probably just to drop the parentheses, but I don't think VV should return NM_000314.8:c.209+1_254-1del as the "fixed" variant.
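One way to reconcile both points is to detect the pattern and warn, without presenting either reading as the corrected variant. The following is only a sketch: the function name `warn_on_wrapped_range`, the regular expression, and the message wording are all assumptions rather than existing VariantValidator behaviour.

```python
import re

# Hypothetical pre-parse check. It matches descriptions in which a single
# start_end range is wrapped in parentheses and followed directly by
# del, dup or inv, e.g. NM_000314.8:c.(209+1_254-1)del.
_WRAPPED_RANGE = re.compile(
    r"^(?P<ref>[^:]+):(?P<type>[cgnr])\."
    r"\((?P<start>[^()_]+)_(?P<end>[^()_]+)\)"
    r"(?P<edit>del|dup|inv)$"
)


def warn_on_wrapped_range(description):
    """Return a warning string for a parenthesised whole range, else None."""
    match = _WRAPPED_RANGE.match(description)
    if match is None:
        return None
    # Reading 1: the parentheses are simply superfluous.
    stripped = "{ref}:{type}.{start}_{end}{edit}".format(**match.groupdict())
    # Reading 2 (uncertain breakpoints) needs exon boundaries from the
    # transcript, so it is only described in general terms here.
    return (
        "Parentheses around a whole range are ambiguous. "
        f"If the breakpoints are known exactly, submit {stripped}. "
        "If they are uncertain, describe each breakpoint as its own range, "
        "in the form (start1_start2)_(end1_end2)."
    )


if __name__ == "__main__":
    print(warn_on_wrapped_range("NM_000314.8:c.(209+1_254-1)del"))
```

The well-formed uncertain-range description NM_000314.8:c.(209+1_210-1)_(253+1_254-1)del does not match the pattern, so it passes through without a warning.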

@leicray
Contributor

leicray commented Dec 2, 2024

I agree with everything that you say regarding NM_000314.8:c.209+1_254-1del but it's impossible to know what was actually in the mind of the user.

I always try to respond to user "errors" such as this if the user has logged in so that I can figure out their email address from their login ID. I did that in this case too and asked the user to get back to me with more info. I have received no reply and that's what happens in the majority of cases. Most users are rather unthinking (rude) when invited to respond.

@leicray
Contributor

leicray commented Dec 5, 2024

A user has tried unsuccessfully (and three times) to validate the variant description NP_000483.3:c.579+3A>G. Initial parsing of this variant ought to notice the NP_/c. mismatch and generate a warning to the user.

@ifokkema
Collaborator

ifokkema commented Dec 5, 2024

A user has tried unsuccessfully (and three times) to validate the variant description NP_000483.3:c.579+3A>G.

The dev version of the LOVD HGVS syntax checker says: "Protein reference sequences are not supported. Please submit a DNA variant using a DNA reference sequence."

@Peter-J-Freeman
Collaborator Author

I have added the following

import json
import VariantValidator

# Initialise the validator object
vval = VariantValidator.Validator()

# Protein reference sequence paired with a c. description
variant = 'NP_000483.3:c.579+3A>G.'
genome_build = 'GRCh38'
select_transcripts = 'all'

# Validate, then print the result as sorted, indented JSON
validate = vval.validate(variant, genome_build, select_transcripts)
validation = validate.format_as_dict(with_meta=True)
print(json.dumps(validation, sort_keys=True, indent=4, separators=(',', ': ')))
{
    "flag": "warning",
    "metadata": {
        "variantvalidator_hgvs_version": "2.2.0",
        "variantvalidator_version": "2.2.1.dev709+g6340024",
        "vvdb_version": "vvdb_2024_8",
        "vvseqrepo_db": "VV_SR_2024_09/master",
        "vvta_version": "vvta_2024_09"
    },
    "validation_warning_1": {
        "alt_genomic_loci": [],
        "annotations": {},
        "gene_ids": {},
        "gene_symbol": "",
        "genome_context_intronic_sequence": "",
        "hgvs_lrg_transcript_variant": "",
        "hgvs_lrg_variant": "",
        "hgvs_predicted_protein_consequence": {
            "lrg_slr": "",
            "lrg_tlr": "",
            "slr": "",
            "tlr": ""
        },
        "hgvs_refseqgene_variant": "",
        "hgvs_transcript_variant": "",
        "primary_assembly_loci": {},
        "reference_sequence_records": "",
        "refseqgene_context_intronic_sequence": "",
        "rna_variant_descriptions": null,
        "selected_assembly": "GRCh38",
        "submitted_variant": "NP_000483.3:c.579+3A>G.",
        "transcript_description": "",
        "validation_warnings": [
            "Protein reference sequence input as Nucleotide (:c.) variant."
        ],
        "variant_exonic_positions": null
    }
}

This will work for both NP_ and ENSP reference sequences and with variant types g., c., r. and n.
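For illustration only, the check described above could look something like the sketch below. It is not the code that was added to VariantValidator: the function name and structure are hypothetical, and the message format for types other than c. is assumed to follow the c. message shown in the output.

```python
# Hypothetical sketch, not the VariantValidator implementation.
_PROTEIN_PREFIXES = ("NP_", "ENSP")
_NUCLEOTIDE_TYPES = {"g", "c", "r", "n"}


def protein_reference_warning(description):
    """Warn when a protein accession is paired with a nucleotide variant type."""
    reference, sep, posedit = description.partition(":")
    # Expect "<reference>:<type>.<edit>"; anything else is left to later parsing.
    if not sep or len(posedit) < 2 or posedit[1] != ".":
        return None
    variant_type = posedit[0]
    if reference.startswith(_PROTEIN_PREFIXES) and variant_type in _NUCLEOTIDE_TYPES:
        return (
            "Protein reference sequence input as Nucleotide "
            f"(:{variant_type}.) variant."
        )
    return None


if __name__ == "__main__":
    print(protein_reference_warning("NP_000483.3:c.579+3A>G"))
```

A genuine protein description such as NP_060150.4:p.(Glu51AspfsTer41) returns None and continues to normal validation.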

@ifokkema, I am not ignoring your email about requests for the LOVD API. We need to meet up and discuss integration with you.

@leicray
Contributor

leicray commented Dec 9, 2024

This is not a "failed variant" issue but it probably belongs here anyway as it's an input-parsing issue of a sort.

An anonymous user has twice tried to search for transcripts for a gene using the HGNC gene ID. They first submitted 7491 and then tried HGNC:7419. The gene2trans tool is supposed to accept HGNC gene IDs. The 7491 ID corresponds to the mitochondrial gene MT-CO1. Could that be the basis of the problem?

@leicray
Contributor

leicray commented Dec 9, 2024

An anonymous user has twice tried to validate the variant description RAD51C:c.(404+1_405-1)_(571+1_572-1)del which failed on both occasions. To compound the issue, the user specified NM_58216 as the transcript for reporting. The MANE Select transcript for the RAD51C gene is NM_058216.3.

This looks like a failure to properly parse the input.

@leicray
Contributor

leicray commented Dec 19, 2024

An anonymous user has tried to validate the variant description DNA(B)-CALR-gen (LRG_828t1:c.1054_1254). There is so much wrong here that I do not know where to begin.

This looks like a failure to properly parse the input.

@leicray
Contributor

leicray commented Jan 2, 2025

A user has submitted the variant description NM_001613.4:r.3646_3647insuaaaauaugcauaaggaaguaacucaaacag which triggers an ERROR message.

The basic problem is that position 3646_3647 is out of range for transcript NM_001613.4. It ought to be possible to trap errors of this type and provide a helpful warning message to the user.

@leicray
Contributor

leicray commented Jan 8, 2025

An anonymous user tried to validate the variant description NC_000020.10(RTEL1):g.62305406_62305482delN[77]insN[168] which failed, creating an ERROR message.

If corrected to NC_000020.10:g.62305406_62305482delinsN[168], the variant validates.

The gene symbol is redundant, as is the number and "identity" of the deleted nucleotides.
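The two corrections could be sketched as a pre-parse step along the following lines. This is a hedged illustration: the function name, the patterns, and the warning texts are hypothetical, and only the delN[count]ins form on a g. description is handled.

```python
import re

# Hypothetical pre-parse clean-up, not part of VariantValidator.
# A parenthesised gene symbol after a genomic accession, directly before ":g."
_GENE_SYMBOL = re.compile(r"^(?P<acc>N[CGTW]_\d+\.\d+)\([^()]+\)(?=:g\.)")
# A range followed by "delN[count]ins"
_DEL_N_INS = re.compile(r"(?P<start>\d+)_(?P<end>\d+)delN\[(?P<count>\d+)\]ins")


def simplify_genomic_delins(description):
    """Return (simplified_description, warnings)."""
    warnings = []
    simplified, replaced = _GENE_SYMBOL.subn(r"\g<acc>", description)
    if replaced:
        warnings.append("Removed redundant gene symbol from genomic reference")

    match = _DEL_N_INS.search(simplified)
    if match:
        span = int(match["end"]) - int(match["start"]) + 1
        if span != int(match["count"]):
            # The stated count contradicts the range, so do not guess.
            warnings.append(
                f"Deleted length N[{match['count']}] does not match "
                f"range length {span}; description left unchanged"
            )
            return simplified, warnings
        simplified = (
            simplified[: match.start()]
            + f"{match['start']}_{match['end']}delins"
            + simplified[match.end():]
        )
        warnings.append("Removed redundant deleted sequence")
    return simplified, warnings


if __name__ == "__main__":
    print(simplify_genomic_delins(
        "NC_000020.10(RTEL1):g.62305406_62305482delN[77]insN[168]"
    ))
```

For the submitted description the range 62305406_62305482 spans 77 bases, which agrees with N[77], so the sketch rewrites it to NC_000020.10:g.62305406_62305482delinsN[168].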
