Explore kinase mutation annotation in ChEMBL #11

Open
schallerdavid opened this issue Aug 11, 2021 · 1 comment

Comments

@schallerdavid
Contributor

We need to check how well ChEMBL annotates mutations. Mutant information can be found in the columns Assay Variant Mutation and Assay Variant Accession.

Example data point:

https://www.ebi.ac.uk/chembl/g/#browse/activities/filter/molecule_chembl_id%3A(%22CHEMBL3354189%22)%20AND%20standard_type%3A(%22Ki%22)

@corey-taylor
Contributor

corey-taylor commented Sep 2, 2021

With respect to this data point (CHEMBL3354189): following the ChEMBL schema, the target's wild-type (wt) sequence is stored in component_sequences.sequence. That is, whatever is stored in variant_sequences.mutation, component_sequences.sequence always holds the wt sequence.

The mutated sequence is found in variant_sequences.sequence. However, this field is left blank if the target record is wt, or if the mutation is of any type other than a substitution (e.g. a deletion).

The field variant_sequences.mutation seems consistent with variant_sequences.sequence: there only seems to be a mutation recorded in the former when it is accompanied by a sequence in the latter. And when the target is wt, with only component_sequences.sequence populated, this field is generally left blank, as is variant_sequences.sequence.

CAVEATS:

CHEMBL3354189 was tested against a double mutant of EGFR (L858R,T790M), as recorded in variant_sequences.mutation. However, both mutations are found at different positions in the mutated sequence in variant_sequences.sequence (L865R,T798M), shifted by 7 and 8 residues respectively, going by the positions quoted here. So whilst this field may be reliable for confirming that there were mutations, it may not be a reliable indicator of where those mutations occurred in all cases.

With respect to deletion/addition mutations: for CHEMBL3354189 there is a value of UNDEFINED MUTATION in variant_sequences.mutation, and assays.description contains information confirming that it is a deletion mutation. This is not always the case for other data points, so there appears to be no reliable correspondence between assays.description and variant_sequences.mutation that would fully delineate deletion mutations.
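One way to check the first caveat systematically is to compare the annotated mutation label with the substitutions actually present in the stored sequences. The following is only a rough sketch: it assumes the label is a comma-separated list like `L858R,T790M`, that the wt and mutated sequences have equal length (pure substitutions), and that numbering is 1-based. The function names are hypothetical and not part of ChEMBL or any library.

```python
import re

# One substitution token, e.g. "L858R": wt residue, 1-based position, new residue.
MUTATION_RE = re.compile(r"^([A-Z])(\d+)([A-Z])$")


def parse_mutations(label):
    """Parse 'L858R,T790M' into [('L', 858, 'R'), ('T', 790, 'M')]."""
    parsed = []
    for token in label.split(","):
        match = MUTATION_RE.match(token.strip())
        if match is None:
            raise ValueError(f"not a substitution: {token!r}")
        wt_res, pos, mut_res = match.groups()
        parsed.append((wt_res, int(pos), mut_res))
    return parsed


def list_substitutions(wt_seq, variant_seq):
    """List (wt_res, position, mut_res) where two equal-length sequences differ."""
    if len(wt_seq) != len(variant_seq):
        raise ValueError("sequences differ in length; not a pure substitution variant")
    return [
        (a, i, b)
        for i, (a, b) in enumerate(zip(wt_seq, variant_seq), start=1)
        if a != b
    ]


def annotation_offsets(label, wt_seq, variant_seq):
    """Map each annotated substitution to (observed position - annotated position).

    The value is None when no unique observed substitution has the same
    residue change, so the annotation could not be located unambiguously.
    """
    observed = list_substitutions(wt_seq, variant_seq)
    report = {}
    for wt_res, pos, mut_res in parse_mutations(label):
        shifts = [
            obs_pos - pos
            for obs_wt, obs_pos, obs_mut in observed
            if (obs_wt, obs_mut) == (wt_res, mut_res)
        ]
        report[f"{wt_res}{pos}{mut_res}"] = shifts[0] if len(shifts) == 1 else None
    return report
```

Called with variant_sequences.mutation, component_sequences.sequence and variant_sequences.sequence for one record, a non-zero offset would flag a numbering mismatch like the one described for the EGFR double mutant.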

So, to pseudocode this out, when retrieving sequence data for model training:

    if protein sequence == WT:
        select component_sequences.sequence
    elif protein sequence == SUBSTITUTION_MUTATION:
        select variant_sequences.sequence

Here, WT can be determined if there is a record in component_sequences.sequence but none in variant_sequences.sequence, and SUBSTITUTION_MUTATION if there is a record in both component_sequences.sequence and variant_sequences.sequence, as it seems ChEMBL enforces that a mutated sequence can only be entered once a wt sequence has been entered. Other types of mutation seem like they could only be determined unreliably from, say, assays.description.
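The selection rule could be sketched in Python roughly as follows. This is a minimal illustration, not a tested pipeline: it assumes the three fields have already been fetched per activity record, that blank fields arrive as `None` or empty strings, and that a record with a mutation label but no mutated sequence (e.g. UNDEFINED MUTATION) should be skipped. The function and category names are hypothetical.

```python
from typing import Optional, Tuple

WT = "WT"
SUBSTITUTION_MUTATION = "SUBSTITUTION_MUTATION"
OTHER = "OTHER"  # deletions, undefined mutations, or incomplete records


def _filled(value: Optional[str]) -> bool:
    """True if a field holds something other than NULL or whitespace."""
    return value is not None and value.strip() != ""


def select_training_sequence(
    component_sequence: Optional[str],
    variant_sequence: Optional[str],
    variant_mutation: Optional[str] = None,
) -> Tuple[str, Optional[str]]:
    """Return (category, sequence to train on); sequence is None if unusable."""
    if not _filled(component_sequence):
        # No wt sequence at all: nothing reliable to select.
        return OTHER, None
    if _filled(variant_sequence):
        # Both sequences present: treat as a substitution mutant.
        return SUBSTITUTION_MUTATION, variant_sequence
    if _filled(variant_mutation):
        # Mutation label without a mutated sequence, e.g. "UNDEFINED MUTATION".
        return OTHER, None
    # Only the wt sequence is present.
    return WT, component_sequence
```

Records falling into `OTHER` would need manual inspection, e.g. of assays.description, before they could be used.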
