Bug in match-mondo-sources-all-lexical.py (#394)
```diff
@@ -1,17 +1,19 @@
-# Basic matching pipeline that takes in
+"""Match mondo sources, all lexical
+Basic matching pipeline that takes in
+
-# Input:
-# 1. MERGED_ONTOLOGY = tmp/merged.owl
-# 2. SSSOM_CONFIG = metadata/mondo.sssom.config.yml
-# 3. OUTPUT_SSSOM = mapping/mondo-sources-all-lexical.sssom.tsv
+Input:
+1. MERGED_ONTOLOGY = tmp/merged.owl
+2. SSSOM_CONFIG = metadata/mondo.sssom.config.yml
+3. OUTPUT_SSSOM = mapping/mondo-sources-all-lexical.sssom.tsv
+
-# I would try some basic things first:
-
-# Use synonymiser
-# Use oak.mapping() pipeline
+I would try some basic things first:
+
+Use synonymiser
+Use oak.mapping() pipeline
+"""
 import logging
 from pathlib import Path
+from curies import Converter
 from oaklib.resource import OntologyResource
 from oaklib.implementations.sqldb.sql_implementation import SqlImplementation
 from oaklib.utilities.lexical.lexical_indexer import (
```
```diff
@@ -25,11 +27,11 @@
 import yaml
 import pandas as pd

-from sssom.constants import SUBJECT_ID, OBJECT_ID, PREDICATE_MODIFIER
+from sssom.constants import SUBJECT_ID, OBJECT_ID
 from sssom.util import filter_prefixes, is_curie, is_iri
 from sssom.parsers import parse_sssom_table
 from sssom.writers import write_table
-from sssom.io import get_metadata_and_prefix_map, filter_file
+from sssom.io import filter_file
 from bioregistry import curie_from_iri

 SRC = Path(__file__).resolve().parents[1]
```
```diff
@@ -49,6 +51,7 @@
 )

+
 # todo: duplicated code fragment w/ lexmatch-sssom-compare: solution, move to a lexmatch_utils.py and import to both
 @click.group()
 @click.option("-v", "--verbose", count=True)
 @click.option("-q", "--quiet")
```
```diff
@@ -83,11 +86,11 @@ def main(verbose: int, quiet: bool):
 )
 @output_option
 def run(input: str, config: str, rules: str, rejects: str, output: str):
-    # Implemented `meta` param in `lexical_index_to_sssom`
-
-    meta = get_metadata_and_prefix_map(config)
+    """Run the script"""
+
+    # Get metadata config
+    with open(config, "r") as f:
+        yml = yaml.safe_load(f)
+    converter = Converter.from_extended_prefix_map(yml.pop('extended_prefix_map', {}))

     # Get mondo.sssom.tsv
     mapping_msdf = parse_sssom_table(SSSOM_MAP_FILE)
```
```diff
@@ -108,8 +111,6 @@ def run(input: str, config: str, rules: str, rejects: str, output: str):
     # .reset_index(drop=True)
     # )

-    prefix_of_interest = yml["subject_prefixes"]
-
     resource = OntologyResource(slug=f"sqlite:///{Path(input).absolute()}")
     oi = SqlImplementation(resource=resource)
     ruleset = load_mapping_rules(rules)
```
```diff
@@ -118,9 +119,9 @@ def run(input: str, config: str, rules: str, rejects: str, output: str):
     save_lexical_index(lexical_index, OUT_INDEX_DB)

     if rules:
-        msdf = lexical_index_to_sssom(oi, lexical_index, ruleset=ruleset, meta=meta)
+        msdf = lexical_index_to_sssom(oi, lexical_index, ruleset=ruleset, prefix_map=converter)
     else:
-        msdf = lexical_index_to_sssom(oi, lexical_index, meta=meta)
+        msdf = lexical_index_to_sssom(oi, lexical_index, prefix_map=converter)

     # msdf.prefix_map = sssom_yaml['curie_map']
     # msdf.metadata = sssom_yaml['global_metadata']
```
```diff
@@ -131,8 +132,9 @@ def run(input: str, config: str, rules: str, rejects: str, output: str):
     # msdf.df[OBJECT_ID] = msdf.df[OBJECT_ID].apply(
     #     lambda x: iri_to_curie(x) if x.startswith("<http") else x
     # )
+    prefixes_of_interest = yml["subject_prefixes"]
     msdf.df = filter_prefixes(
-        df=msdf.df, filter_prefixes=prefix_of_interest, features=[SUBJECT_ID, OBJECT_ID]
+        df=msdf.df, filter_prefixes=prefixes_of_interest, features=[SUBJECT_ID, OBJECT_ID]
     )
     msdf.remove_mappings(mapping_msdf)
```
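The `filter_prefixes` call in the hunk above keeps only rows whose subject and object CURIEs use one of the configured `subject_prefixes`. A minimal pure-Python stand-in for that behavior (the rows and prefix list are hypothetical; the real function operates on the `MappingSetDataFrame`'s pandas DataFrame):

```python
# Hypothetical mapping rows standing in for msdf.df
rows = [
    {"subject_id": "MONDO:0005148", "object_id": "DOID:9352"},
    {"subject_id": "MONDO:0005148", "object_id": "NCIT:C26747"},
]
prefixes_of_interest = ["MONDO", "DOID"]  # e.g. yml["subject_prefixes"]

def keep(row, prefixes, features=("subject_id", "object_id")):
    # A row survives only if every listed feature's CURIE prefix is of interest
    return all(row[f].split(":", 1)[0] in prefixes for f in features)

filtered = [r for r in rows if keep(r, prefixes_of_interest)]
# Only the MONDO-to-DOID row survives; NCIT is not a prefix of interest
```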
**Bugfix~: `oio` removed from `prefixes.csv`**

@matentzn It looks like this has unintended consequences; may want to revert.

**Overview**

The advantage of removing this is that lexmatch succeeds without needing the changes that Charlie and I introduced to `clean_prefix_map()`, namely `converter.get_prefixes(include_synonyms=True)`. The negative is that now OAK is not checking for lexical matches on synonyms identified by the `oio:hasExactSynonym` field.

You can see a sample of the differences in the outputs by looking at the "b4" and "after" spreadsheets in the outputs GSheet. Comparing the before/after for the first of the outputs, called "oak output prefiltered", the 'after' version has 8 fewer entries out of the 198 entries in the 'before' set, a reduction of 4%.

**Choices**

a. Leave `oio` out of `prefixes.csv`. Forfeit matching on `oio:hasExactSynonym`.
b. Keep `oio` in `prefixes.csv`, in tandem with the changes that Charlie and I introduced, which fix the perceived missing `oio` prefix (it is now a prefix synonym of `oboInOwl`).
c. Change OAK so that even if we remove `oio` from `prefixes.csv`, there is no lossiness (more below).

I think we should choose 'b', and possibly do 'c' separately, unless matching on `oio:hasExactSynonym` is considered unimportant, in which case we can just go with 'a'.

**Additional discussion**

I want to expand on 'c', possible OAK changes. I think we want OAK to somehow treat `oio` and `oboInOwl` as the same, but this is tricky and I can only begin to think about what might be done here. I can only guess and can't speak to the full nature of why, when `oio` is left out of `prefixes.csv`, lexmatches on `oio:hasExactSynonym` go away.

I wonder if it might have something to do with this, at the top of OAK's `lexical_indexer.py`:

If you guys think OAK or SemanticSQL needs any sort of update, maybe you should tag Chris here.
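One possible mechanism for the loss described above (purely a guess, in keeping with the comment's stated uncertainty, and a toy illustration rather than OAK's actual code): if the lexical indexer only considers annotation properties whose CURIE prefix resolves against the supplied prefix map, then dropping `oio` silently drops `oio:hasExactSynonym` from indexing.

```python
# Toy prefix maps; URIs are illustrative
with_oio = {
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
    "oio": "http://www.geneontology.org/formats/oboInOwl#",
}
without_oio = {"rdfs": "http://www.w3.org/2000/01/rdf-schema#"}

properties = ["rdfs:label", "oio:hasExactSynonym"]

def indexable(props, prefix_map):
    # Keep only properties whose prefix the map can resolve
    return [p for p in props if p.split(":", 1)[0] in prefix_map]

# With "oio" present, synonym properties are indexed; without it, they vanish
indexed_with = indexable(properties, with_oio)
indexed_without = indexable(properties, without_oio)
```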
This makes no sense. The `oio:hasExactSynonym` should not exist if the input prefix map does not contain it.

I will look into it...
@matentzn If I understand you correctly, that's what I'm saying. If `prefixes.csv` contains `oio`, then when I run the pipeline, it does matches not just on `rdfs:label` but also on `oio:hasExactSynonym`, and you can see that reflected in my subsample outputs. And if I remove `oio` from `prefixes.csv`, I no longer see those in the outputs.

So if we want to keep matching on `oio:hasExactSynonym`, it seems that we should keep `oio` in `prefixes.csv`, even though that causes it to be non-bijective. It's only passed to SemanticSQL anyway. An alternative is to ask OAK to rename `oio` to `oboInOwl`.

Also, if you change your mind and want to keep `oio` in `prefixes.csv`, then I think we need to merge mapping-commons/sssom-py#485 first, otherwise an error will be thrown. So maybe it's best to merge this PR now, and then if we want to put `oio` back, we can do it in a subsequent PR.
back, we can do it in a subsequent PR.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As of today, here's where we are:

- Nico has looked into it further and opened: INCATools/ontology-access-kit#698
- Joe looked to see the other places in which `merged.db` is used, to identify possible side effects of changing `oio` to `oboInOwl` in `prefixes.csv` when constructing `merged.db`. These other usages were found, but we do not know whether unintended side effects could occur:
  - `lexmatch-sssom-compare.py`
  - `mappings.ipynb`
  - `oak-lexmatch-sssom-compare.ipynb`
  - `oak-lexmatch-sssom-compare-checkpoint.ipynb`
  - `remove-mappings-test.ipynb`
  - `synchronizeMappingsWithMondo-checkpoint.ipynb`
- Nico said he would put in some time to solve this, so Joe is pausing activity on this.