mondo-base.obo: MONDO IDs do not match between $xref and $id #6873

Closed
bschilder opened this issue Nov 15, 2023 · 14 comments

@bschilder

bschilder commented Nov 15, 2023

I'm using the "mondo-base.obo" file, read into R as an ontologyIndex object.

In working with this object, I've noticed some unexpected features. Namely, the MONDO IDs in the $xref slot do not seem to match up with those listed in the $id slot. I'd expect that all MONDO IDs in any of the slots should at least be present in the $id slot.

Mondo term (ID Label)

xref_ids.csv

Bug/Typo/Error description

remotes::install_github("neurogenomics/HPOExplorer")
mondo <- HPOExplorer::get_mondo()

# Only 18 MONDO IDs are missing names, which is ok
> sum(is.na(mondo$name))
[1] 18
# 10273 MONDO IDs are missing definitions; less ok but still manageable
> sum(is.na(mondo$def))
[1] 10273

# Get MONDO IDs listed in `$xref`
> xref_ids <- unique(names(unlist(mondo$xref)))
# Get total number of xref IDs
> length(xref_ids)
[1] 101231
# Almost all of these xref MONDO IDs (~100k) are missing from the `$id` slot! In theory, all IDs should be in the `$id` slot
> sum(!xref_ids %in% names(mondo$id))
[1] 98468
# Same missing rate in the name slot
> sum(!xref_ids %in% names(mondo$name))
[1] 98468
# Same missing rate in the def slot
> sum(!xref_ids %in% names(mondo$def))
[1] 98468

Your nano-attribution (ORCID)
https://orcid.org/0000-0001-5949-2191

Thanks in advance for your help.

@bschilder
Author

Hey, I think this may have gotten buried amongst the other Issues, but wanted to check in and see if anyone has looked into this yet. @twhetzel

Thanks for your help!

-Brian

@twhetzel
Collaborator

twhetzel commented Dec 5, 2023

Hi Brian - thanks for the ping on this issue. I've been looking into this and have a few questions and comments.

What is the source of the Mondo ontology file used? I've looked at the number of terms without labels or definitions in the latest release of Mondo and do not get the same count as mentioned in the initial post.

In the "xref_ids.csv" file, the Mondo IDs are 13, 14, or 15 characters long, whereas they should all be 13 characters long (MONDO: followed by a 7-digit number). Among the IDs that are 13 characters long, there are at least a few obsolete terms in the file, e.g. MONDO:8000034. Deprecated terms in MONDO have the annotation owl:deprecated with value true [1], and of the Mondo terms with xrefs in the latest release, 3,210 are obsolete.

For other valid IDs in the "xref_ids.csv" file, e.g. MONDO:0000005, what are the values you get for $xref and $id?

[1] https://mondo.readthedocs.io/en/latest/editors-guide/merging-and-obsoleting/#obsolete-a-class-manually
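
One thing that may be worth checking about the ID lengths noted above (an assumption, not something confirmed in this thread): in base R, `unlist()` on a named list appends a position suffix to the name of every element that holds more than one value. So `names(unlist(mondo$xref))` can yield strings longer than 13 characters that look like MONDO IDs but are not. A minimal sketch on a toy list follows; `MONDO:9999999` and the `EX:` xrefs are hypothetical placeholders, not real Mondo data.

```r
# Toy stand-in for mondo$xref: a named list keyed by term ID.
# "MONDO:9999999" and the "EX:" values are hypothetical placeholders.
xref <- list(
  "MONDO:0000005" = "OMIMPS:203655",
  "MONDO:9999999" = c("EX:1", "EX:2", "EX:3")
)

# unlist() keeps the name of a length-1 element as-is, but appends
# 1, 2, 3, ... to the name of an element with several values.
flat_names <- names(unlist(xref))
print(flat_names)
print(nchar(flat_names))

# Taking the names from the list itself leaves the term IDs intact.
term_ids <- names(xref)
print(term_ids)

# Check the expected format: "MONDO:" followed by exactly 7 digits.
print(grepl("^MONDO:[0-9]{7}$", flat_names))
print(grepl("^MONDO:[0-9]{7}$", term_ids))
```

If this is what happened here, terms with a single xref would keep a valid 13-character ID, while terms with several xrefs would show up as 14- or 15-character strings.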

@twhetzel
Collaborator

twhetzel commented Dec 8, 2023

@matentzn have you taken over this ticket since you created a meeting with Brian?

@matentzn
Member

matentzn commented Dec 8, 2023

I never saw this, and I didn't realise @bschilder was interested in Mondo at all! Yep, I can discuss it with him when I meet him.

@twhetzel
Collaborator

twhetzel commented Dec 8, 2023

Oh, I thought your ping to him here (neurogenomics/RareDiseasePrioritisation#33 (comment)) was related to Mondo as well

@matentzn
Member

matentzn commented Dec 8, 2023

I guess now it is, but originally no; it was only referring to uPheno!

@twhetzel assigned matentzn and unassigned twhetzel Dec 8, 2023

@matentzn
Member

matentzn commented Dec 8, 2023

We determined this is not a Mondo-related problem, but one related to the R toolkit!

@matentzn closed this as not planned Dec 8, 2023

@bschilder
Author

bschilder commented Dec 8, 2023

Indeed, it seems something strange is going on within ontologyIndex::get_OBO which HPOExplorer uses to import the latest MONDO ontology from GitHub releases into R. Specifically, this OBO file:
https://github.com/monarch-initiative/mondo/releases/download/v2023-09-12/mondo-base.obo

For example, the Mondo IDs within $xref seem to be a mix of obsolete and completely made-up MONDO IDs (according to @matentzn, who scanned through the CSV of missing MONDO IDs I attached above).

While I'm still trying to sort out the exact reason for this issue, I think ontologyIndex::get_OBO is a pretty good lead. I'll reach out to those authors and let you know what the outcome is.

@twhetzel
Collaborator

twhetzel commented Dec 8, 2023

Thanks both for the update. Please see other issues with the IDs as mentioned earlier #6873 (comment)

@matentzn
Member

matentzn commented Dec 8, 2023

Yeah, I confirmed that with Brian in the meeting: many IDs don't even exist. He will look into it!

@bschilder
Author

bschilder commented Dec 8, 2023

For other valid IDs in the "xref_ids.csv" file, e.g. MONDO:0000005, what are the values you get for $xref and $id?

@twhetzel here's a reprex you can quickly run in R:

> if(!require("ontologyIndex")) install.packages("ontologyIndex")
> mondo <- ontologyIndex::get_OBO("https://github.com/monarch-initiative/mondo/releases/download/v2023-09-12/mondo-base.obo", extract_tags = "everything")

> mondo$xref["MONDO:0000005"]
$`MONDO:0000005`
[1] "OMIMPS:203655"

> mondo$id["MONDO:0000005"]
  MONDO:0000005 
"MONDO:0000005" 
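
Going by the output above, `$xref` appears to be a named list keyed by term ID and `$id` a named character vector. Under that assumption, a sketch of comparing the two without flattening the list first, and of building a long table of ID/xref pairs, might look like this. The object `onto` is a toy stand-in, and `MONDO:9999999` and the `EX:` values are hypothetical placeholders.

```r
# Toy object shaped like the reprex output (not real Mondo data):
# $id is a named character vector, $xref is a named list keyed by term ID.
onto <- list(
  id = c("MONDO:0000005" = "MONDO:0000005",
         "MONDO:9999999" = "MONDO:9999999"),
  xref = list(
    "MONDO:0000005" = "OMIMPS:203655",
    "MONDO:9999999" = c("EX:1", "EX:2")
  )
)

# Term IDs that carry at least one xref, taken from the list names directly.
ids_with_xref <- names(onto$xref)[lengths(onto$xref) > 0]

# Term IDs present in $xref but absent from $id.
missing_ids <- setdiff(ids_with_xref, names(onto$id))
print(length(missing_ids))

# Long-format table: one row per (term ID, xref) pair.
# rep() repeats each term ID once per xref it holds, and use.names = FALSE
# stops unlist() from generating suffixed names.
xref_table <- data.frame(
  id = rep(names(onto$xref), lengths(onto$xref)),
  xref = unlist(onto$xref, use.names = FALSE),
  stringsAsFactors = FALSE
)
print(xref_table)
```

Repeating the names with `rep()` keeps every row tied to an unmodified term ID, which is the property the comparison against `$id` depends on.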

@bschilder
Author

Tagging @daniel-jg, who I think may be the author of ontologyIndex.

@twhetzel
Collaborator

@bschilder that looks correct for MONDO:0000005. Since Nico mentioned that it was determined the issue is with the R toolkit, can this ticket remain closed since it is not an issue with Mondo?

Screenshot 2023-12-10 at 6 26 25 PM

@bschilder
Author

@bschilder that looks correct for MONDO:0000005. Since Nico mentioned that it was determined the issue is with the R toolkit, can this ticket remain closed since it is not an issue with Mondo?

Sure thing. I've already started working on alternative methods for converting IDs across ontologies using some of Monarch's resources. I'll keep your team posted on how things progress:
https://github.com/neurogenomics/KGExplorer/blob/29eccbbd33fd18d9ce85b0ae72b47d485d97faee/R/map_mondo.R
