Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add miRBase groundings to GroundingMapper #918

Closed
wants to merge 10 commits into from

Conversation

cthoyt
Copy link
Collaborator

@cthoyt cthoyt commented Jul 21, 2019

References #255

I did my best to follow the same procedure as when handling the UniProt grounding.

Also switched some staticmethods over to classmethods to make the references a bit more elegant

I did my best to follow the same procedure as when handling the UniProt grounding. 

Also switched some staticmethods over to classmethods to make the references a bit more elegant
@cthoyt cthoyt marked this pull request as ready for review July 21, 2019 22:56
@cthoyt cthoyt changed the title Add MIRBASE groundings to GroundingMapper Add miRBase groundings to GroundingMapper Jul 21, 2019
@bgyori
Copy link
Member

bgyori commented Jul 22, 2019

Great, thanks! One question is whether we really want to standardize the names of microRNAs to their gene name, e.g., in the test you added, hsa-let-7a-1L is renamed to MIRLET7A1. My preference would be to standardize to the miRBase official name, i.e., hsa-let-7a-1 in case an Agent has both HGNC and MIRBASE grounding. What do you think?

Also, given that we can now reliably standardize the names and groundings of microRNAs, we can take this into account in the Agent.get_grounding method.

Finally, the fact that micro-RNAs will now all have HGNC grounding (which they were never given before from any of our sources) may cause unexpected behaviors in some downstream applications where HGNC grounding is often assumed to be associated with UP (protein) grounding. So we'll need to review and test some of these to see if anything needs to be done.

@cthoyt
Copy link
Collaborator Author

cthoyt commented Jul 22, 2019

I think I may have misunderstood the directionality of the test - I agree that using the miRBase name would be preferred over the HGNC gene name.

However, I think it's still meaningful to get the HGNC grounding, even though this might cause a bit of pain while checking what INDRA code makes assumptions about the existence of an HGNC gene identifier leading to the assumption that it's coding for a protein. The world is much weirder than that :)

I'll take a look into Agent.get_grounding() and make an update there if I can

Now this has higher priority than HGNC or UP in order to avoid possible conflicts with when the agent has both MIRBASE and HGNC identifiers in the db_refs
@bgyori
Copy link
Member

bgyori commented Jul 22, 2019

The name standardization issue is just that we check for HGNC first, and never reach

elif db_ns == 'MIRBASE'

if there is an HGNC grounding that was added during db_refs standardization.

@bgyori
Copy link
Member

bgyori commented Jul 22, 2019

Great, thanks for the changes! So then the only remaining issue is for me to think a little bit about downstream effects.

@johnbachman
Copy link
Member

@bgyori and just discussed this some more. Some conclusions:

  1. if a microRNA has an HGNC identifier, we should use the HGNC name rather than the MIRBASE name. To do otherwise would create a problematic situation where an agent with HGNC grounding would have a name that was not its official HGNC symbol.

  2. Most existing code that filters statements involving HGNC groundings would still be OK, as in most cases the assumption is not that the agent is a protein, but rather that the agent represents a gene product (i.e., not a biological process, chemical, metabolite) and is human.

  3. If in any downstream analysis it is necessary to restrict to proteins only, then the user can filter to include only agents that have a HGNC and Uniprot grounding, or filter out agents that have HGNC and MIRBASE grounding.

  4. We are already getting information about microRNAs from readers where the microRNAs are grounded to HGNC names and not to MIRBASE (e.g., query the DB for MIR21). The readers will have to standardize to adding MIRBASE groundings for the HGNC identifiers where appropriate.

  5. Conversely, statements coming from miRTarBase (via Pathway Commons) will need to have HGNC identifiers added where possible. One complication is that it appears that miRTarBase grounds agents to the mature/processed forms of the microRNA, which exist in a completely different namespace (mirbase.mature). To align information involving both the stem-loop and mature forms, we could either a) re-ground all statements involving mature/processed forms to the (less-specific) stem-loop form, or b) Incorporate a hierarchy for microRNAs in the entity hierarchy that would treat the mature form as a refinement/member of the stem-loop form.

@cthoyt
Copy link
Collaborator Author

cthoyt commented Aug 5, 2019

I think 1-4 are all reasonable. For 5 I think it would be interesting to have hierarchy information about miRNAs - reading from the literature likely suffers from the same problem that proteins did before you introduced FamPlex.

@bgyori
Copy link
Member

bgyori commented Aug 5, 2019

Yes, I agree about 5 in principle, though it's not an immediate priority. In the meantime, I did #926, which I will merge soon. It might be a bit inconvenient to rebase this branch once that is merged due to the refactoring but otherwise it will fit in well in the new structure too.

@cthoyt
Copy link
Collaborator Author

cthoyt commented Aug 5, 2019

Okay so the last commit addressed 1, 2-4 don't need action and 5 might get addressed later. @bgyori either I can wait until you merge #926 then take care of reorganizing all of this, or you can move the code from here over into #926 yourself

@bgyori
Copy link
Member

bgyori commented Aug 5, 2019

Oh, per https://github.com/sorgerlab/indra/blob/master/CONTRIBUTING.md#git-forkpr-workflow we never merge master into branches, rather, rebase. I know it's not convenient in this case - I'm happy to sort out the git issues.

@cthoyt
Copy link
Collaborator Author

cthoyt commented Aug 5, 2019

@bgyori sorry about that. I appreciate the help - I'll check out that link so I don't make the same mistake next time.

Good news is everything is upgraded and the tests are passing locally. Just waiting for travis now.

@bgyori
Copy link
Member

bgyori commented Jul 30, 2020

This feature is now implemented as part of the more general db_refs and name standardization process via the ontology. Note that in case there is a one-to-one human gene correspondence for the given microRNA, its name will be standardized to HGNC, otherwise MIRBASE.

@bgyori bgyori closed this Jul 30, 2020
@cthoyt cthoyt deleted the patch-8 branch July 31, 2020 10:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants