Explore kinase mutation annotation in ChEMBL #11

schallerdavid · 2021-08-11T14:36:32Z

We need to check how well ChEMBL annotates mutations. Mutant information can be found in the columns Assay Variant Mutation and Assay Variant Accession.

Example data point:

https://www.ebi.ac.uk/chembl/g/#browse/activities/filter/molecule_chembl_id%3A(%22CHEMBL3354189%22)%20AND%20standard_type%3A(%22Ki%22)

corey-taylor · 2021-09-02T14:39:18Z

With respect to this data point (CHEMBL3354189): under the ChEMBL schema, the wild-type (wt) sequence of the target is stored in component_sequences.sequence. That is, no matter what is recorded in variant_sequences.mutation, component_sequences.sequence still holds the wt sequence.

The mutated sequence is found in variant_sequences.sequence. However, if the assay was run against the wt target, or if the mutation is of any type other than a substitution (e.g. a deletion), this field is left blank.

The field variant_sequences.mutation seems consistent with variant_sequences.sequence: a mutation only seems to be recorded in the former if it is accompanied by a sequence in the latter. When the target is wt, so that only component_sequences.sequence applies, variant_sequences.mutation is generally left blank, just like variant_sequences.sequence.

CAVEATS:

CHEMBL3354189 was tested against a double mutant (L858R,T790M) of EGFR, as recorded in variant_sequences.mutation. However, both mutations were found at different positions in the mutated sequence in variant_sequences.sequence (L865R,T798M), i.e. offsets of 7 and 8 residues respectively, going by the positions quoted here. So whilst variant_sequences.mutation seems reliable for confirming that mutations are present, it may not be a reliable indicator of where those mutations occur in all cases.

With respect to deletion/insertion mutations, for CHEMBL3354189 there is a value of UNDEFINED MUTATION in variant_sequences.mutation, and assays.description contains information confirming that it is a deletion mutation. This is not always the case for other data points, so there appears to be no reliable correspondence between assays.description and variant_sequences.mutation that would fully delineate deletion mutations.
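The position check described above can be sketched in Python. This is a minimal sketch, not a tested pipeline: it assumes the wt and variant sequences have equal length (substitutions only), and the function names `parse_substitutions`, `diff_substitutions` and `annotation_offsets` are hypothetical helpers, not part of ChEMBL or any library. The toy sequences in the usage note are made up for illustration.

```python
import re


def parse_substitutions(label):
    """Parse a label such as 'L858R,T790M' into (wt_res, position, var_res) tuples.

    Labels without residue/position pairs (e.g. 'UNDEFINED MUTATION') give [].
    """
    return [
        (m.group(1), int(m.group(2)), m.group(3))
        for m in re.finditer(r"([A-Z])(\d+)([A-Z])", label)
    ]


def diff_substitutions(wt, variant):
    """List substitutions between two equal-length sequences (1-based positions)."""
    if len(wt) != len(variant):
        raise ValueError("sequences differ in length; not a pure substitution")
    return [
        (a, pos, b)
        for pos, (a, b) in enumerate(zip(wt, variant), start=1)
        if a != b
    ]


def annotation_offsets(label, wt, variant):
    """Map each annotated substitution to (observed position - annotated position).

    The value is None when no unique observed substitution has the same residues.
    """
    observed = diff_substitutions(wt, variant)
    offsets = {}
    for wt_res, pos, var_res in parse_substitutions(label):
        shifts = [p - pos for a, p, b in observed if a == wt_res and b == var_res]
        offsets[f"{wt_res}{pos}{var_res}"] = shifts[0] if len(shifts) == 1 else None
    return offsets
```

For example, `annotation_offsets("L2R,T5M", "AAALAAATAA", "AAARAAAMAA")` reports a shift for each annotated substitution, which makes numbering discrepancies like the one above easy to flag across many records.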

So, to write this out as pseudocode, when retrieving sequence data for model training:

if protein sequence == WT

select component_sequences.sequence

elif protein sequence == SUBSTITUTION_MUTATION

select variant_sequences.sequence

Here, WT can be determined if there is a record in component_sequences.sequence only, and SUBSTITUTION_MUTATION can be determined if there is a record in both component_sequences.sequence and variant_sequences.sequence, as it seems ChEMBL enforces that a mutated sequence can only be entered once a wt sequence has been entered. Other types of mutation could probably only be determined unreliably, e.g. from assays.description.
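As a minimal Python sketch of the selection rule above: the function name `select_training_sequence` and its arguments are hypothetical, and it assumes the three values have already been fetched from component_sequences.sequence, variant_sequences.sequence and variant_sequences.mutation for one assay record. Returning None when a mutation label exists without a variant sequence (e.g. UNDEFINED MUTATION) is an added assumption, made so that such records are not silently treated as wt.

```python
from typing import Optional


def select_training_sequence(
    wt_sequence: Optional[str],
    variant_sequence: Optional[str],
    mutation: Optional[str],
) -> Optional[str]:
    """Pick the sequence to use for one record, or None if it cannot be resolved."""
    wt = (wt_sequence or "").strip()
    variant = (variant_sequence or "").strip()
    label = (mutation or "").strip()

    if not wt:
        # No wt record at all: nothing to select from.
        return None
    if variant:
        # SUBSTITUTION_MUTATION: both wt and variant sequences are present.
        return variant
    if label:
        # Assumption: a mutation label without a sequence (e.g. a deletion
        # recorded as 'UNDEFINED MUTATION') is unresolved, so skip the record.
        return None
    # WT: only the component sequence is present.
    return wt
```

Records for which this returns None would need manual inspection, for instance of assays.description, before being used for training.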
