GO-CAM taxon ID's are possibly incorrect for 4 models #4248
Comments
Hi @nmarkari Sorry about the long delay in responding to this. I don't see any Sus scrofa protein IDs in 62183af000000536 or 6205c24300000880; can you check? Likewise for 60ad85f700000058, I only see mouse IDs. For the SARS-CoV-2 pathway, 5e72450500004019, I don't see which two taxon IDs you are referring to. Can you please have another look? It'd be great to understand what is not clear, if only to add documentation about this. Thanks, Pascale |
Hi @pgaudet The gocam itself, not the proteins, is assigned to both Homo sapiens and Sus scrofa. See line 46 in the ttl file: https://github.com/geneontology/noctua-models/blob/2ada32d7bfbc6afe8df0821713b1ade01ab7d41e/models/62183af000000536.ttl
Or, see the following query on noctua which retrieves that model as the 7th result when filtering by "Sus scrofa" for organism http://noctua.geneontology.org/workbench/noctua-landing-page/?offset=0&limit=50&taxon=NCBITaxon:9823&expand&debug The other examples I listed are all of the same nature. |
Looking at example 62183af000000536 (http://noctua.geneontology.org/editor/graph/gomodel:62183af000000536).
This is a hand-created model.
We'll need feedback from @balhoff here. |
@balhoff Casting a wider net:
Sampling these, some are intended (multi-species/gut bacteria); some are internal tests; some are as above; some seem random. |
There is some code added by Ben which looks at the in_taxon links in NEO and inserts the taxon annotations when a model is saved:
Have we ever had bugs in the taxon assignments in NEO? |
The way Minerva works, I think if the wrong version of a gene is ever entered in a model, then saved, its taxon will be added, then even if the gene is corrected, the taxon will never be removed. This could happen without git history evidence. This is just a guess at what could have happened. |
@balhoff Hm. Something that is likely to keep happening then. I guess we need either a periodic one-off cleanup or code that purges, recalculates, and re-adds. I'm not sure what the overhead of something like that would be. |
I think it's recalculating on every save, so maybe it should just purge existing triples each time. |
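For illustration only, the purge-then-recalculate idea discussed above can be sketched in a few lines. This is not Minerva's code: triples are plain Python tuples, `IN_TAXON` is a placeholder predicate name, and `recompute_model_taxa`, the model ID, and the gene ID are all hypothetical names made up for the sketch.

```python
IN_TAXON = "in_taxon"  # placeholder predicate name, not necessarily the IRI Minerva uses


def recompute_model_taxa(triples, model_id, genes_in_model, gene_taxon):
    """Return a new triple set with the model's taxon triples purged and re-derived."""
    # Purge: drop every taxon triple currently attached to the model itself.
    kept = {t for t in triples if not (t[0] == model_id and t[1] == IN_TAXON)}
    # Recalculate: look up the taxon of each gene that is in the model right now.
    taxa = {gene_taxon[g] for g in genes_in_model if g in gene_taxon}
    # Re-add: one taxon triple per taxon that is actually represented.
    return kept | {(model_id, IN_TAXON, taxon) for taxon in taxa}


# Hypothetical model that kept a Sus scrofa taxon after its gene was corrected.
model = "gomodel:example"
stale = {
    (model, IN_TAXON, "NCBITaxon:9823"),
    (model, IN_TAXON, "NCBITaxon:9606"),
    (model, "title", "example human pathway"),
}
gene_taxon = {"GENE:human_example": "NCBITaxon:9606"}
fresh = recompute_model_taxa(stale, model, ["GENE:human_example"], gene_taxon)
```

Because the existing taxon triples are dropped before the new ones are added, a taxon left over from a gene that was later corrected does not survive the next save.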
Ah, yeah, that sounds like the right path then. We'll still need to do a one-time bulk cleanup, but that makes it easier. |
I made a Minerva issue: geneontology/minerva#503 |
Can we close the issue here then? |
@pgaudet We still need to get a fix into production and purge the current model set. |
Sorry, I meant, the issue is now in the Minerva tracker, not here? |
The fix in Minerva is currently being tested and we'll be talking to @vanaukenk soon. Otherwise, we'll still need to have a plan for updating the bad data, which is potentially separate from anything in Minerva. |
@vanaukenk I have a reduced list looking at "production" and iteratively filtering out
I am down to 81 models for examination (from 224). If you'd like to sample them, I can continue filtering. Let me know the URL you'd like links to resolve to and I can create a list for you on another channel. |
@kltm I've looked at the first 25 models on the list of 81 models and the species assignments for those models are all legit, i.e. the taxa listed with the model are all represented in existing annotations. |
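As a rough sketch of the kind of check described above (every taxon listed with a model should be represented by at least one annotated gene product), assuming the declared and observed taxa have already been extracted per model. The function names and model IDs here are hypothetical, and the input dictionary is made-up example data, not the actual list of 81 models.

```python
def unsupported_taxa(declared_taxa, gene_product_taxa):
    """Taxa declared on the model that no gene product in the model belongs to."""
    return set(declared_taxa) - set(gene_product_taxa)


def flag_models(models):
    """Map each model ID to its unsupported taxa, skipping models with none."""
    flagged = {}
    for model_id, (declared, observed) in models.items():
        extra = unsupported_taxa(declared, observed)
        if extra:
            flagged[model_id] = extra
    return flagged


# Made-up input: model ID -> (taxa declared on the model, taxa of its gene products).
models = {
    "gomodel:example_ok": (
        {"NCBITaxon:9606", "NCBITaxon:10090"},
        {"NCBITaxon:9606", "NCBITaxon:10090"},
    ),
    "gomodel:example_stale": (
        {"NCBITaxon:9606", "NCBITaxon:9823"},
        {"NCBITaxon:9606"},
    ),
}
flagged = flag_models(models)
```

Models that legitimately span several species (for example host and virus) pass this check as long as each declared taxon has at least one gene product behind it.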
@vanaukenk close? |
@vanaukenk @pgaudet is this still current? |
Okay, in discussion with @pgaudet, it seems not to be a current issue, so closing for now. Reopen if it comes up again. |
The following are all cases of production models where one gocam has two different taxon IDs assigned to it. Most of the cases are because they represent processes from viral infections, and one taxon ID is for Homo sapiens and the other is for the virus, but 4 cases naively seem to possibly be mistakes.
In two cases (62183af000000536, 6205c24300000880), however, Sus scrofa (wild boar) and Homo sapiens are both listed, but the gocam title itself says it is for human, and all the proteins appear to be human gene products based on the Noctua visualization. I looked in the ttl file and the taxon ID for Sus scrofa is indeed included. I'm not sure if these were errors during curation, or if one of the papers used as evidence utilized Sus scrofa, or if the model really was intended to represent both human and boar versions of the same pathway.
There's also one case where both human and mouse are listed: 60ad85f700000058. I'm not sure about this one.
Lastly, there's an odd case with 5e72450500004019, which represents a COVID pathway. It has two taxon IDs that presumably represent the virus: one is the taxon ID for the virus, and the other is a UniProt ID for one of the viral proteins, listed as a taxon ID for some reason. I'm unsure whether that is a way of specifying a particular variant of the virus. Naively, these 4 cases seem like mistakes, but I just wanted to pass this information along to the curation team to take a look! @vanaukenk @pgaudet