GO-CAM taxon ID's are possibly incorrect for 4 models #4248

Closed
nmarkari opened this issue Jul 28, 2022 · 21 comments

@nmarkari

The following are all cases of production models where a single GO-CAM has two different taxon IDs assigned to it. Most of these represent processes from viral infections, where one taxon ID is for Homo sapiens and the other is for the virus, but 4 cases look like possible mistakes.

In two cases (62183af000000536, 6205c24300000880), however, Sus scrofa (wild boar) and Homo sapiens are both listed, but the GO-CAM title itself says the model is for human, and all the proteins appear to be human gene products based on the Noctua visualization. I looked in the ttl file and the taxon ID for Sus scrofa is indeed included. I'm not sure if these were errors during curation, or if one of the papers used as evidence utilized Sus scrofa, or if the model really was intended to represent both human and boar versions of the same pathway.

There's also one case where both human and mouse are listed: 60ad85f700000058. I'm not sure about this one.

Lastly, there's an odd case with 5e72450500004019, which represents a COVID pathway. It has two taxon IDs that presumably represent the virus: one is the ID for the virus, and the other is a UniProt ID for one of the viral proteins that is listed as a taxon ID for some reason. I'm unsure whether that is a way of specifying a particular variant of the virus. Naively, these 4 cases seem like mistakes, but I just wanted to pass this information along to the curation team to take a look! @vanaukenk @pgaudet

@pgaudet
Contributor

pgaudet commented Oct 11, 2022

Hi @nmarkari

Sorry about the long delay in responding to this. I don't see any Sus scrofa protein IDs in 62183af000000536 or 6205c24300000880. Can you check?

Likewise for 60ad85f700000058 - I only see mouse IDs.

For the SARS-CoV-2 pathway, 5e72450500004019, I also don't see which two taxon IDs you are referring to.

Can you please have another look? It'd be great to understand what is not clear, if only to add documentation about this.

Thanks, Pascale

@nmarkari
Author

Hi @pgaudet

The GO-CAM itself, not the proteins, is assigned to both Homo sapiens and Sus scrofa. See line 46 in the ttl file: https://github.com/geneontology/noctua-models/blob/2ada32d7bfbc6afe8df0821713b1ade01ab7d41e/models/62183af000000536.ttl

<https://w3id.org/biolink/vocab/in_taxon> <http://purl.obolibrary.org/obo/NCBITaxon_9606> , <http://purl.obolibrary.org/obo/NCBITaxon_9823>

Or see the following query on Noctua, which retrieves that model as the 7th result when filtering by "Sus scrofa" for organism: http://noctua.geneontology.org/workbench/noctua-landing-page/?offset=0&limit=50&taxon=NCBITaxon:9823&expand&debug

The other examples I listed are all of the same nature.
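
For readers who want to check a model file themselves, here is a minimal sketch of pulling the model-level taxon assignments out of Turtle text. It assumes the `in_taxon` predicate and its objects sit on one line, as in the line quoted above; the function name `model_taxa` is hypothetical, and a real RDF parser would be more robust than this line-based matching.

```python
import re

# Assumption: the model-level statement has the shape quoted above, with the
# in_taxon predicate and its comma-separated NCBITaxon IRIs on a single line.
IN_TAXON = "<https://w3id.org/biolink/vocab/in_taxon>"
TAXON_IRI = re.compile(r"<http://purl\.obolibrary\.org/obo/NCBITaxon_(\d+)>")


def model_taxa(ttl_text):
    """Return the sorted NCBITaxon CURIEs asserted via in_taxon in ttl_text."""
    taxa = set()
    for line in ttl_text.splitlines():
        # partition() yields the text after the predicate, if it is present
        _, found, objects = line.partition(IN_TAXON)
        if found:
            taxa.update("NCBITaxon:" + number for number in TAXON_IRI.findall(objects))
    return sorted(taxa)


snippet = (
    "<https://w3id.org/biolink/vocab/in_taxon> "
    "<http://purl.obolibrary.org/obo/NCBITaxon_9606> , "
    "<http://purl.obolibrary.org/obo/NCBITaxon_9823>"
)
print(model_taxa(snippet))  # ['NCBITaxon:9606', 'NCBITaxon:9823']
```

A model with more than one entry in this list is a candidate for the kind of check described in this issue.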

@pgaudet
Contributor

pgaudet commented Oct 13, 2022

Thanks for the clarification @nmarkari
This is really strange. I don't see any pig sequences in that model or anywhere else in the .ttl file; the species is just incorrectly listed on the model itself.

@kltm or @balhoff Can you look into where that data may be coming from?

Thanks, Pascale

@kltm
Member

kltm commented Oct 13, 2022

Looking at example 62183af000000536 (http://noctua.geneontology.org/editor/graph/gomodel:62183af000000536).
Marked by two model-level annotations:

https://w3id.org/biolink/vocab/in_taxon: NCBITaxon:9606
https://w3id.org/biolink/vocab/in_taxon: NCBITaxon:9823

This is a hand-created model.
https://github.com/geneontology/noctua-models/blob/master/models/62183af000000536.ttl
This annotation has existed in all versions, so was either added manually by the curator (probably not) or automatically added by minerva (likely). So I guess the questions are:

  1. how did minerva make this mistake
  2. how frequent was this mistake
  3. how to fix the mistake (fix going forward)
  4. how to bulk update to clear the mistake (historical fix)

We'll need feedback from @balhoff here.

@kltm changed the title from "GO CAM taxon ID's are possibly incorrect for 4 models" to "GO-CAM taxon ID's are possibly incorrect for 4 models" Oct 13, 2022
@kltm
Member

kltm commented Oct 13, 2022

@balhoff Casting a wider net:

sjcarbon@moiraine:~/local/src/git/noctua-models/models[master]$:) grep -o "obo/NCBITaxon_[0-9]*" *.ttl | sort | uniq | cut -d ':' -f 1 | uniq -c | grep -v "1 " | wc -l
223

Sampling these, some are intended (multi-species/gut bacteria); some are internal tests; some are as above; some seem random.
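
The same survey can be sketched in Python, which makes the "more than one taxon" test explicit. This is only an approximation of the pipeline above, not a replacement for it: like the `grep`, it counts every `NCBITaxon` mention in a file rather than only model-level `in_taxon` values, and the helper names (`taxa_per_model`, `multi_taxon_models`, `read_models`) are hypothetical.

```python
import re
from pathlib import Path

# Same pattern as the grep above, but requiring at least one digit.
TAXON = re.compile(r"obo/NCBITaxon_[0-9]+")


def taxa_per_model(texts):
    """Map each model file name to the set of distinct taxon fragments it mentions."""
    return {name: set(TAXON.findall(text)) for name, text in texts.items()}


def multi_taxon_models(texts):
    """Return the sorted names of models that mention more than one distinct taxon."""
    return sorted(name for name, taxa in taxa_per_model(texts).items() if len(taxa) > 1)


def read_models(directory):
    """Read every .ttl file in a directory into a name -> text mapping."""
    return {p.name: p.read_text(encoding="utf-8") for p in Path(directory).glob("*.ttl")}
```

Running `len(multi_taxon_models(read_models("models")))` from a checkout of noctua-models would give a comparable count. It may differ slightly from the figure above, because `grep -v "1 "` also drops any count that ends in 1 (such as 11 or 21), whereas the comparison here only drops models with at most one taxon.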

@balhoff
Member

balhoff commented Oct 14, 2022

The way Minerva works, I think that if the wrong version of a gene is ever entered in a model and the model is then saved, its taxon will be added; even if the gene is later corrected, the taxon will never be removed. This could happen without leaving any evidence in the git history. This is just a guess at what could have happened.

@kltm
Member

kltm commented Oct 14, 2022

@balhoff Hm. That is something that is likely to keep happening, then. I guess we need either a periodic one-off cleanup or code that purges, recalculates, and re-adds. I'm not sure what the overhead of something like that would be.
As far as I know, we've never had issues with bad taxon assignments.

@balhoff
Member

balhoff commented Oct 14, 2022

I think it's recalculating on every save, so maybe it should just purge existing triples each time.
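
To make the difference concrete, here is a toy sketch of the two save behaviours discussed in this thread. It is not Minerva's code: the gene identifiers, the `GENE_TAXON` lookup, and both function names are invented for illustration, and the "accumulating" variant only models the guess made above about what happened.

```python
# Hypothetical gene-to-taxon lookup; the gene identifiers are placeholders.
GENE_TAXON = {
    "gene:human1": "NCBITaxon:9606",
    "gene:pig1": "NCBITaxon:9823",
}


def save_accumulating(model_taxa, genes):
    """Add the taxa of the current genes but never remove older ones."""
    return model_taxa | {GENE_TAXON[g] for g in genes}


def save_recalculating(model_taxa, genes):
    """Discard the existing taxa and derive them again from the current genes."""
    return {GENE_TAXON[g] for g in genes}


# A curator enters the pig gene by mistake, saves, then corrects it and saves again.
stale = save_accumulating(set(), ["gene:pig1"])
stale = save_accumulating(stale, ["gene:human1"])

fresh = save_recalculating(set(), ["gene:pig1"])
fresh = save_recalculating(fresh, ["gene:human1"])

print(sorted(stale))  # ['NCBITaxon:9606', 'NCBITaxon:9823']
print(sorted(fresh))  # ['NCBITaxon:9606']
```

In the accumulating variant the pig taxon survives the correction even though no pig gene remains in the model, which matches the symptom reported here; purging before recalculating removes it on the next save.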

@kltm
Member

kltm commented Oct 14, 2022

Ah, yeah, that sounds like the right path then. We'll still need to do a one-time bulk cleanup, but that makes it easier.

@balhoff
Member

balhoff commented Oct 14, 2022

I made a Minerva issue: geneontology/minerva#503

@pgaudet
Contributor

pgaudet commented Oct 17, 2022

Can we close the issue here, then?

@kltm
Member

kltm commented Oct 17, 2022

@pgaudet We still need to get a fix into production and purge the current model set.

@pgaudet
Contributor

pgaudet commented Oct 17, 2022

Sorry, I meant, the issue is now in the Minerva tracker, not here?

@kltm
Member

kltm commented Oct 17, 2022

The fix in Minerva is currently being tested and we'll be talking to @vanaukenk soon. Otherwise, we'll still need to have a plan for updating the bad data, which is potentially separate from anything in Minerva.

@kltm
Member

kltm commented Oct 18, 2022

@vanaukenk I have a reduced list, looking at "production" and iteratively filtering out:

NCBITaxon:10299
NCBITaxon:2697049 
NCBITaxon:471871
NCBITaxon:301447
NCBITaxon:10254
NCBITaxon:265872
NCBITaxon:83334
NCBITaxon:83333
NCBITaxon:243365

I am down to 81 models for examination (from 224). If you'd like to sample them, I can continue filtering. Let me know the URL you'd like links to resolve to and I can create a list for you on another channel.
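
One way to read the filtering step is sketched below. It assumes "filtering out" means setting aside every multi-taxon model that mentions at least one taxon from the list above; that reading, the `needs_review` name, and the sample model names are assumptions, not a description of how the list was actually produced.

```python
# Taxa from the list above, treated here as expected in multi-species models.
EXCLUDED = {
    "NCBITaxon:10299",
    "NCBITaxon:2697049",
    "NCBITaxon:471871",
    "NCBITaxon:301447",
    "NCBITaxon:10254",
    "NCBITaxon:265872",
    "NCBITaxon:83334",
    "NCBITaxon:83333",
    "NCBITaxon:243365",
}


def needs_review(model_taxa):
    """Return sorted names of multi-taxon models that mention no excluded taxon."""
    return sorted(
        name
        for name, taxa in model_taxa.items()
        if len(taxa) > 1 and not (taxa & EXCLUDED)
    )


# Placeholder model names with made-up taxon sets, for illustration only.
sample = {
    "model_a": {"NCBITaxon:9606", "NCBITaxon:2697049"},  # host plus excluded taxon
    "model_b": {"NCBITaxon:9606", "NCBITaxon:9823"},     # unexplained second taxon
    "model_c": {"NCBITaxon:9606"},                       # single taxon
}
print(needs_review(sample))  # ['model_b']
```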

@vanaukenk
Contributor

@kltm I've looked at the first 25 models on the list of 81 models and the species assignments for those models are all legit, i.e. the taxa listed with the model are all represented in existing annotations.

@ValWood
Contributor

ValWood commented Aug 29, 2023

@vanaukenk close?

@ValWood
Contributor

ValWood commented Aug 31, 2023

@vanaukenk @pgaudet is this still current?

@kltm
Member

kltm commented Sep 7, 2023

Okay, in discussion with @pgaudet, it seems not to be a current issue; closing for now. Reopen if it comes up again.

@kltm kltm closed this as completed Sep 7, 2023
6 participants