Add ontology support #1
FYI the link above to the bitbucket issue is offline.
I would like to reopen this discussion, because I think supporting ontologies is going to be critical for the scalability of NWB. I understand that raw strings were used as placeholders, but if we don't change that before the 2.0 release we could end up with a big mess. First things first: Should we enforce specific ontologies? I think we should, but I can also see the trade-offs. Maybe someone comes along and says that e.g. the Allen Atlas is insufficient to describe their anatomical labeling, and they require some other ontology. Should we give people the ability to use their own ontologies, like the current framework for extensions? On the other hand, then we open the floodgates for people to use any old garbage. I'm on the side of enforcing ontologies and providing text fields in case our choice is insufficient. I think that's an important design decision though that ought to be discussed, and I'd be interested in your thoughts. I would like to establish an ontology for:
my criteria for ontologies would be:
For mouse brain regions I think the Allen Mouse Atlas makes the most sense. Are there any other candidates? Human brain labeling is an enormous can of worms. Allen's human atlas seems good to me but there are other contenders; let's worry about that later. For species ontology I think the scientific community has pretty much reached consensus, but maybe there are ontological debates I don't know about. For cell type I don't really know about the available ontologies. I hear Allen is working on this too?
@neuromusic my web scraping skills aren't what they used to be. Would it be easy to get us the atlas label tree from the mouse brain atlas link above as a json?
best approach here IMO would be to defer to the experts at NIF like @tgbugs: https://bioportal.bioontology.org/ontologies/NIFSTD
**Short version.** Let me know how I can help. Ontology integration is not entirely straightforward, so I am more than happy to help get the relevant parties pointed in the right direction. You've caught me right as I'm starting to write up what we've done in a higher-level way, so all I have at the moment is the flood of information that follows. The import closure of NIFSTD is quite large, so we tend to provide the whole ontology via SciGraph webservices. I have a python client for accessing them. If it would be helpful I can create a mini file that can be used to import the subset of the ontology that is relevant for NWB. Our main repo is SciCrunch/NIF-Ontology, with supporting utilities at tgbugs/pyontutils.

**Thoughts on enforcing ontologies.** I suggest that a reasonable approach would be to include a set of good defaults and also allow organizations or individuals to provide and/or enforce their own if it fits their use case. If you provide defaults, very few folks are likely to go out of their way to find a bad ontology. The format probably should not enforce (in the strict sense), since the number of use cases is quite large and terminology changes over time in ways that the format should not be responsible for maintaining. That said, someone validating NWB files should be able to specify which set of ontology identifiers they want to be used for tagging certain things, so that checking can be automated and the user doesn't have to wait to submit a file for feedback but can get it immediately via whatever interface they are interacting with when they create the file. As an additional note, the number of terms that could potentially need to be used in an NWB file is quite a bit larger than is practical to embed in an application. The most frequently used could be, but the tail is quite long, to the point where I think it is not unreasonable to imagine the format hitting web services for terms (a sketch of what such a lookup could look like follows at the end of this comment) or having a process by which the terms can be imported and updated alongside an installation.

**Specific domains.**
All five of these areas are currently under active development in collaboration with groups at HBP, BBP, and now with Allen as well, so I am happy to help in any way.
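To give a flavor of the web-service idea mentioned above, here is a minimal sketch of querying a SciGraph vocabulary endpoint with plain `requests`; the base URL is a placeholder, and the endpoint path and response fields follow SciGraph's REST conventions but should be verified against a given deployment:

```python
import requests

# hypothetical SciGraph deployment; swap in a real base URL
SCIGRAPH = 'https://scigraph.example.org/scigraph'

def find_term(term, limit=10):
    """Look up ontology concepts whose labels or synonyms match `term`."""
    resp = requests.get(f'{SCIGRAPH}/vocabulary/search/{term}',
                        params={'limit': limit})
    resp.raise_for_status()
    # each hit is a concept record; field names assumed from SciGraph responses
    return [(hit.get('curie'), hit.get('labels')) for hit in resp.json()]

# e.g. find_term('thalamus') might return [('UBERON:0001897', ['thalamus']), ...]
```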
@tgbugs wow, this is some great info, thanks! Regarding your thoughts on enforcing ontologies, I think your suggestion to offer good defaults and allow extensions for other ontologies makes a lot of sense. What do you imagine as the workflow for someone writing an NWB file with ontology integration? Referencing an ontology would presumably require users to enter which ontology they are referencing as well as the exact name or id for the category. Maybe something like the sketch below.
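A hypothetical sketch of what that entry might look like; none of these names exist yet in pynwb:

```python
from collections import namedtuple

# hypothetical container for an ontology reference, not an existing pynwb class
BrainRegionReference = namedtuple('BrainRegionReference', ['ontology', 'name', 'id'])

location = BrainRegionReference(ontology='Allen Mouse Brain Atlas',
                                name='Thalamus',
                                id=549)  # 549 is assumed to be TH in the Allen structure graph
```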
We could create a validation step where the ontology and specifier are checked first against a small local database and then, if it's not there, against a remote database via an API. A small set of properties can then be written to the file.

What would be the workflow for adding a custom ontology? Maybe:

```python
class NewCellOntology(CellOntology):
    def lookup(self, name=None, id=None, abbr=None):
        ...
        return CellOntologicalReference(name=name, id=id, abbr=abbr)

    def lookup_remote(self, name=None, id=None, abbr=None):
        ...
        return CellOntologicalReference(name=name, id=id, abbr=abbr)
```

I'm hoping we can build this in such a way that users are not bogged down in ontological details, so I'm glad to see there is already an issue looking at incorporating colloquial names.
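Returning to the custom-ontology sketch above, usage might then look like this (assuming, hypothetically, that `lookup` returns `None` on a local miss):

```python
onto = NewCellOntology()
ref = onto.lookup(name='pyramidal cell')             # check the local table first
if ref is None:
    ref = onto.lookup_remote(name='pyramidal cell')  # fall back to the remote API
```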
This ended up being quite a bit more elaborate than I expected. PS: I have not run any of the code, so there are loads of bugs.

**Questions.**

**Context.** I imagine that there are three ways that ontology identifiers could be stored in an NWB file.

**API.** With that context, and the caveat that I am not familiar with the API conventions for pynwb, here are my thoughts. There are four major parts that need to be provided by the API. I will explicate them in order, following a brief justification for the layout. I will end with some concerns about this approach.
**Separation of concerns.** I think that terms and query need to be distinct, because ontologies can change and it is important to have the exact identifier that the investigator used recorded in the python source and reproducible regardless of any changes to what a query would return.

**Services API.** After having gone through this a few times, the right approach seems to be to create ontology services that wrap different backends and provide a normalized query interface. Note that the implementation examples use the local and remote backends that I am most familiar with, but it is possible to use other backends.

```python
class OntService:
    """ Base class for ontology wrappers that define setup, dispatch, query,
        add ontology, and list ontologies methods for a given type of endpoint. """
    def __init__(self):
        self._onts = []
        self.setup()

    def add(self, iri):  # TODO implement with setter/appender?
        self._onts.append(iri)
        raise NotImplementedError()

    @property
    def onts(self):
        yield from self._onts

    def setup(self):
        raise NotImplementedError()

    def dispatch(self, prefix=None, category=None):  # return True if the filters pass
        raise NotImplementedError()

    def query(self, *args, **kwargs):  # needs to conform to the OntQuery __call__ signature
        raise NotImplementedError()
```
```python
class SciGraphRemote(OntService):  # incomplete and not configurable yet
    def add(self, iri):
        raise TypeError('Cannot add ontology to Remote')

    def setup(self):
        self.sgv = scigraph_client.Vocabulary()
        self.sgg = scigraph_client.Graph()
        self.sgc = scigraph_client.Cypher()
        self.curies = self.sgc.getCuries()  # TODO can be used to provide curies...
        self.categories = self.sgv.getCategories()
        self._onts = self.sgg.getEdges(relationType='owl:Ontology')  # TODO incomplete and not sure if this works...

    def dispatch(self, prefix=None, category=None):  # return True if the filters pass
        # FIXME? alternately all must be true instead of any being true?
        if prefix is not None and prefix in self.curies:
            return True
        if category is not None and category in self.categories:
            return True
        return False

    def query(self, *args, **kwargs):  # needs to conform to the OntQuery __call__ signature
        # TODO
        pass
```
```python
class InterLexRemote(OntService):  # note to self
    pass

class rdflibLocal(OntService):  # recommended for the local default implementation
    graph = rdflib.Graph()
    # if loading of the default set of ontologies is too slow, it is possible to
    # dump loaded graphs to a pickle gzip and distribute that with a release...

    def add(self, iri, format):
        self.graph.parse(iri, format=format)

    def setup(self):
        pass  # graph added at class level

    def dispatch(self, prefix=None, category=None):  # return True if the filters pass
        # TODO
        raise NotImplementedError()

    def query(self, *args, **kwargs):  # needs to conform to the OntQuery __call__ signature
        # TODO
        pass
```

Users using the defaults would never have to deal with this.
**Query API.** I think the keywords can be used to enable a wide range of query options (for simplicity's sake I'm basically matching the functionality that SciGraph already provides).

```python
class OntQuery:
    def __init__(self, *services, prefix=None, category=None):  # services from OntServices
        # check to make sure that the prefix is valid for the ontologies
        # more config
        self.services = services

    def __iter__(self):  # make it easier to init filtered queries
        yield from self.services

    def __call__(self,
                 term=None,      # put this first so that the happy path query('brain') can be used, matches synonyms
                 prefix=None,    # limit search within this prefix
                 category=None,  # like prefix but works on predefined categories of things like 'anatomical entity' or 'species'
                 label=None,     # exact matches only
                 abbrev=None,    # alternately `abbr` as you have
                 search=None,    # hits a lucene index, not very high quality
                 id=None,        # alternately `local_id` to clarify that
                 curie=None,     # if you already have a curie you can probably just use OntTerm directly and it will error when it tries to look up
                 limit=10):
        kwargs = dict(term=term,
                      prefix=prefix,
                      category=category,
                      label=label,
                      abbrev=abbrev,
                      search=search,
                      id=id,
                      curie=curie)
        # TODO? this is one place we could normalize queries as well instead of having
        # to do it for every single OntService
        out = []
        for service in self.services:
            if service.dispatch(prefix=prefix, category=category):
                # TODO query keyword precedence if there is more than one
                for result in service.query(**kwargs):
                    out.append(OntTerm(query=service.query, **result))
        if len(out) > 1:
            for term in out:
                print(term)
            raise ValueError('More than one result')
        else:
            return out[0]
```

Examples:

```python
query = OntQuery(localonts, remoteonts1, remoteonts2)  # provide by default maybe as ontquery?
query('brain')
query(prefix='UBERON', id='0000955')  # it is easy to build an uberon(id='0000955') query class out of this
query(search='thalamus')  # will probably fail with many results to choose from
query(prefix='MBA', abbr='TH')
uberon = OntQuery(*query, prefix='UBERON')
uberon('brain')  # -> OntTerm('UBERON:0000955', label='brain')
species = OntQuery(*query, category='species')
species('mouse')  # -> OntTerm('NCBITaxon:10090', label='mouse')
```

If OntQuery is implemented in this way, one thing that it must do is fail loudly when it gets more than one result, so that the user can select which term they want. That failure will need to provide them with the options to choose from. It might be possible to use a ranking of preferred prefixes based on some additional criteria for users that didn't want to specify one; a possible shape for that ranking is sketched below.
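A possible shape for that prefix ranking; this is a sketch, not part of the proposed API, and the preference list and helper names are illustrative:

```python
# illustrative preference order; lower index is preferred
PREFERRED = ['UBERON', 'MBA', 'NCBITaxon']

def rank(term):
    """Rank an OntTerm result by its curie prefix; unknown prefixes sort last."""
    return PREFERRED.index(term.prefix) if term.prefix in PREFERRED else len(PREFERRED)

def pick_preferred(results):
    """Return the single best-ranked result, still failing loudly on ties."""
    ranked = sorted(results, key=rank)
    if len(ranked) > 1 and rank(ranked[0]) == rank(ranked[1]):
        raise ValueError(f'Cannot rank between: {ranked[0]} and {ranked[1]}')
    return ranked[0]
```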
**Term API.** This is basically just a lightweight wrapper around an `OntID`.

```python
class OntTerm(OntID):
    # TODO need a nice way to pass in the ontology query interface to the class at run time
    # to enable dynamic repr if all information did not come back at the same time
    def __init__(self, query=None, **kwargs):  # curie=None, prefix=None, id=None
        self.kwargs = kwargs
        if query is not None:
            self.query = query
        super().__init__(**kwargs)

    # use properties to query for various things to repr
    @property
    def subClassOf(self):
        return self.query(self.curie, 'subClassOf')  # TODO

    def __repr__(self):  # TODO fun times here
        pass
```

Example:

```python
brain = OntTerm(curie='UBERON:0000955')
brain.subClassOf  # -> OntTerm('UBERON:0000062', label='organ')
```

**Ontology ID API.**
```python
class OntID(rdflib.URIRef):  # superclass is a suggestion
    def __init__(self, curie_or_iri=None, prefix=None, id=None, curie=None, iri=None, **kwargs):
        # logic to construct the iri, or expand from curie to iri, or just be an iri
        super().__init__(iri)
```

This allows construction via curie or iri without the user having to fight the API if they need to interact directly.

**Alternatives.** Using an rdflib namespace it is possible to enter an uberon identifier directly; a sketch follows at the end of this comment.

**Concerns.** My primary concern with this approach is how to communicate to the user what ontologies are available by default. Discoverability is not easy in this context. rdflib can be quite slow to load large graphs on a stock cpython interpreter, but if you can use pypy3 it is quite a bit faster. Not sure if this is relevant, but thought I would mention it.
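The rdflib-namespace alternative mentioned above, as a sketch using rdflib's standard `Namespace` class:

```python
import rdflib

# a Namespace expands local ids against a base iri
UBERON = rdflib.Namespace('http://purl.obolibrary.org/obo/UBERON_')
electrode_location = UBERON['0000955']  # the full iri for brain
```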
**Footnotes.** Curies can also be created against a full url, so for example a user that is tired of typing `UBERON:0000955` could write

```python
electrode_location = ['brain:']
```

A more compact way to define user-specific curies would be `defLocalName('brain', 'UBERON:0000955')` or even `defLocalName('brain', OntQuery(prefix='UBERON', label='brain'))`. To take this approach to API building to, shall we say, non-pythonic ends, the full extension of this is to take full url curies and turn them into python identifiers, like I do in phenotype namespaces for neuron lang (implemented here). This would let a user write

```python
electrode_location = [ontNames.brain]
```

or

```python
with myOntologyNames:
    electrode_location = [brain]
```

or

```python
setLocalNames(myOntologyNames)
electrode_location = [brain]
```

This introduces more complexity than it is worth, because it is basically just a check on whether the identifier has been defined, and on average it only saves 3 chars. The cognitive overhead of working in this way may not be worth it, since the implementation to pull that off does nasty things like inspecting the stack to set and unset/restore globals, which can cause confusing bugs.
First, to answer your questions:
It seems to me we can break this conversation into two interrelated but separable questions:
Your system for 1 seems really well thought out. I like the idea of having a service framework capable of searching across ontology databases with a single query, and the keyword argument parameters would allow a user to easily home in on the relevant terms. This API system would be best as a stand-alone project: it would require a decent amount of maintenance, with API calls to multiple databases, and other projects would clearly benefit from an API like this, so it doesn't make sense to make them import pynwb to use it.

So for 2, if I understand, we need to store the following attributes for OntTerms:
Is that it? I think the ideal implementation for NWB would be to be interoperable with the querying package but not strictly depend on it, since we want to keep strict dependencies pretty light. To accomplish that, we could make a `pynwb.OntTerm` that accepts the query package's terms:

```python
this_term = pynwb.OntTerm(NeuroOnt.OntTerm('UBERON:0000955', label='brain'))
```

This would require the user to have the querying package installed. Another option for the user would be to input the info manually:

```python
this_term = pynwb.OntTerm(database='Uberon', id='0000955', label='brain', curie_func=curie_func)
```

which they could technically do without the querying package (a possible shape for `curie_func` is sketched below). Maybe it would make sense to store a few of the most common terms like "mouse" and "CA1" locally, so a user could simply write:

```python
this_term = pynwb.OntTerm('mouse')
```
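And the `curie_func` mentioned above might look like this hypothetical sketch:

```python
# hypothetical: maps a database name and local id to a full iri
def curie_func(database, id):
    prefixes = {'Uberon': 'http://purl.obolibrary.org/obo/UBERON_'}
    return prefixes[database] + id
```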
@tgbugs @bendichter thanks for the interesting pointers and discussion. Let me add @lydiang to this thread. We have been discussing the topic of supporting ontologies on several occasions and it is certainly part of our future road map for NWB. I agree that this is an important topic, but due to the complexity and effort required, I believe that we will not be able to resolve this issue in time for NWB 2.0; this will need to be a new feature for 2.x. I would suggest that we discuss this topic with the TAB and others at the upcoming Allen Hackathon.
Please let me know if my thinking and explanations here make sense. Apologies if you already know about some of the things I discuss below.

**Summary.** @oruebel This definitely introduces complexity, but maybe not as much as it seems (having been thoroughly nerd sniped by @neuromusic (notch one up), I may just implement this this weekend). However, I suspect that in light of the issues in my summary it might force the issue of whether/how the pynwb core will depend on non-conda-core libraries. Not knowing your internal processes or timelines, I imagine that could delay things quite a bit even if all the code were already written.

**Questions.**
**Thoughts.** I think that there is a third issue: how to validate ontology identifiers entered into an NWB file.

**Validation.** Validation could be part of querying, but not if querying remotes is completely removed from the NWB core. Basically, if you want validation for your ontology terms, pynwb.OntTerm has to have a way to do that validation itself. Unfortunately this means that a call like

```python
this_term = pynwb.OntTerm(NeuroOnt.OntTerm('UBERON:0000955', label='brain'))
```

is not a viable solution if pynwb does not incorporate the machinery to validate the incoming term. Validation for ontology terms usually amounts to picking a set of ontologies that an iri must exist in. I propose a solution below in implementation details, which is to flag terms that have not been validated so that they can be caught later (a rough sketch of that behavior follows the implementation details).

**Querying.** Querying is not as complex as I made it out to be. Determining which ontologies will be supported and providing infrastructure for hosting the remote is a separate concern. Choosing the set of sane defaults can then be dealt with as two questions. This does not mean that some facet of NWB would have to actually run one of those remotes (unless it wanted to). To this point, I run SciGraph as a 'remote' service for the NIF ontology on all my machines.

**Decoupling?** I agree that a standalone query plugin would be useful to many projects. However, despite my initial thought that complete decoupling was possible, having thought more about it, I no longer think it is. If no local functionality is desired, I think the solution is equivalent to what is needed to validate terms in the first place. As mentioned above, external queries cannot return values into pynwb.OntTerm; pynwb.OntTerm must do the lookup itself. There is also a coordination issue if pynwb.OntTerm is maintained without depending on an external package. If ontology functionality is desired in the decoupled state, then pynwb will need to replicate much of that machinery. I see two options in light of this.
Other solutions either require the user to rewrite their code when they want additional functionality, or worse. Therefore I think a standalone repo that has no dependencies might make the most sense, so that something like the following is possible:

```python
class Graph:
    """ I can be pickled! And I can be loaded from a pickle dumped from a graph loaded via rdflib. """
    def __init__(self, triples=tuple()):
        self.store = triples

    def add(self, triple):
        self.store += (triple,)

    def subjects(self, predicate, object):  # this method by itself is sufficient to build a keyword based query interface via query(predicate='object')
        for s, p, o in self.store:
            if (predicate is None or predicate == p) and (object is None or object == o):
                yield s

    def predicate_objects(self, subject):  # this is sufficient to let OntTerm work as desired
        for s, p, o in self.store:
            if subject is None or subject == s:
                yield p, o
```
```python
class BasicService(OntService):
    """ A very simple service for local use only """
    graph = Graph()
    predicate_mapping = {'label': 'http://www.w3.org/2000/01/rdf-schema#label'}  # more... from OntQuery.__call__ and can have more than one...

    def add(self, triples):
        for triple in triples:
            self.graph.add(triple)

    def setup(self):  # inherit this as `class BasicLocalOntService(ontquery.BasicOntService): pass` and load the default graph during setup
        pass

    def query(self, iri=None, **kwargs):  # right now we only support exact matches to labels
        if iri is not None:
            yield from self.graph.predicate_objects(iri)
        else:
            for keyword, object in kwargs.items():
                predicate = self.predicate_mapping[keyword]
                yield from self.graph.subjects(predicate, object)

# Dispatching as described previously is dispatch on type, where the type is the set of query
# features supported by a given OntService. The dispatch method can be dropped from OntQuery
# and managed with python TypeErrors on kwarg mismatches to the service `query` method,
# like the one implemented here.
```

**Stored representation.** My summary is to store:
**Implementation details.** First, a clarification about how I imagine this working. If the iri cannot be found via the available services, then the values the user entered will be preserved. Given the issues discussed above with validation,

```python
this_term = pynwb.OntTerm(NeuroOnt.OntTerm('UBERON:0000955', label='brain'))
```

can be reduced to

```python
pynwb.OntCurie('UBERON', 'http://purl.obolibrary.org/obo/UBERON_')  # could be loaded by default
# ...
this_term = pynwb.OntTerm('UBERON:0000955')
```

or just

```python
this_term = pynwb.OntTerm('http://purl.obolibrary.org/obo/UBERON_0000955')
# or in the minimalist case
this_term = pynwb.OntID('http://purl.obolibrary.org/obo/UBERON_0000955')
```

If the query plugin is loaded, or if uberon's brain is included in the minimal local set, then the lookup would succeed.
If a query plugin is NOT loaded and there is no local store, the term would be kept but flagged as unvalidated. This would also happen if there were a small local store and the user set the OntTerm acceptance level to 'open' (open world).
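A rough sketch of how that acceptance-level and flagging behavior might work; every name here is hypothetical, not an existing pynwb API:

```python
# hypothetical sketch of validation flagging, not an existing pynwb API
class OntTerm:
    acceptance = 'closed'  # 'closed': term must resolve in a known ontology; 'open': anything goes

    def __init__(self, curie_or_iri, label=None, services=()):
        self.curie_or_iri = curie_or_iri
        self.label = label
        # validated means some service resolved the identifier
        self.validated = any(s.resolves(curie_or_iri) for s in services)
        if not self.validated and self.acceptance == 'closed':
            raise ValueError(f'{curie_or_iri} not found in any known ontology')
```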
@tgbugs I have not had a chance yet to carefully read all of your last comment, but let me quickly answer your questions.
Currently I think the answer is yes, although not necessarily using h5py. E.g., some folks are using HDFView to explore the low-level structure of the format in HDF5. As folks get more familiar with the format and are using it for analysis (rather than trying to develop converters) I think this will change, but my guess is that for debugging purposes folks will sometimes use low-level tools (e.g., h5py).
Yes. The core concepts of NWB like the specification language are programming-language agnostic, and HDF5 is available for most popular programming languages as well (C, C++, Fortran, Matlab, R, Python, ...). In terms of APIs, e.g., there is also the MatNWB Matlab API that is in active development.
There are a couple of ways you could do this: a) as a plain string, b) as a YAML or other document string, or c) as a series of attributes/datasets that would describe the id, type, etc. Generally, options b or c are best because they are more self-describing. The advantage of c is that you can describe it easily in the NWB schema without introducing a custom "sub-format", but depending on use I think b can be a good option too. I'll have a closer look at this thread later this week. @KrisBouchard @ajtritt @lydiang could you please also have a look at this issue.
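To make option c concrete, a term stored as attributes might look like the following h5py sketch; the group path and attribute names are illustrative, not the NWB schema:

```python
import h5py

with h5py.File('example.nwb', 'a') as f:
    grp = f.require_group('general/subject')
    # illustrative attribute names, not part of the NWB schema
    grp.attrs['species'] = 'Mus musculus'
    grp.attrs['species_ontology'] = 'NCBITaxon'
    grp.attrs['species_ontology_id'] = 'NCBITaxon:10090'
```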
I have a basic implementation up and working at https://github.com/tgbugs/ontquery, though at the moment to get any real functionality it still depends on https://github.com/tgbugs/pyontutils. I have been dogfooding different ways of using it, some of which can be seen here: https://github.com/tgbugs/pyontutils/blob/master/pyontutils/methods.py#L61. The actual workflow of looking up terms using OntTerm and replacing the code with a version that has an identifier can't be seen in the file as it is now, because the errors would prevent it from running. The way I incorporate the functionality into pyontutils can be seen here: https://github.com/tgbugs/pyontutils/blob/master/pyontutils/core.py#L744-L750.
@tgbugs This is great, Tom! I'll test this out and let's work together to make this integrate smoothly with NWB in the future. I see you have used

```python
def __setitem__(self, key, value):
    raise ValueError('Cannot set results of a query.')
```

to enforce the validation against a database. That could work, but we have to be extra careful that the package works as intended, because we've taken away some of the ability of a user to go in and fix it. I suppose a truly stubborn user might get around this by cloning the repo and patching it out. Let's move discussion over to the ontquery issues page.
Here is the json file describing the brain structure graph used for the mouse Allen atlas: http://api.brain-map.org/api/v2/structure_graph_download/1.json There is more information on the Allen brain atlas ontologies here: http://help.brain-map.org/display/api/Atlas+Drawings+and+Ontologies#AtlasDrawingsandOntologies-StructuresAndOntologies
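As a sketch of turning that JSON into a label tree (per @neuromusic's earlier request), the structure graph nests regions under a `children` key; the field names here follow the Allen API response format but are worth verifying against the docs:

```python
import json
import urllib.request

URL = 'http://api.brain-map.org/api/v2/structure_graph_download/1.json'

def flatten(node, out):
    """Walk the nested structure graph, collecting id, acronym, and name."""
    out[node['id']] = (node.get('acronym'), node.get('name'))
    for child in node.get('children', []):
        flatten(child, out)

with urllib.request.urlopen(URL) as resp:
    graph = json.load(resp)

regions = {}
for root in graph['msg']:  # 'msg' wraps the root structure(s) in Allen API responses
    flatten(root, regions)
# regions[549] -> ('TH', 'Thalamus'), assuming standard Allen CCF structure ids
```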
I pull the Allen atlas into the NIF Ontology here. I am working on a new version of the ingest that automatically does them all in one shot.
@bendichter - i'm sure we can use anything that has been developed. no reason to reinvent wheels ;) i do think there needs to be an interface that bridges between the experiment and the nwb files, since in many cases the metadata will be consistent across nwb files from a given experiment. so any effort should focus on file-specific metadata vs. across-file metadata.
The NWB schema and HDMF have added support for ontologies (here called "external resources") via HERD, the HDMF External Resources Data structure. There are lingering questions about whether the HERD data is embedded within the NWB file or kept outside the NWB file before upload to DANDI; otherwise, HERD is ready for use. See https://hdmf.readthedocs.io/en/stable/tutorials/plot_external_resources.html. We are working on a tutorial with pynwb in NeurodataWithoutBorders/pynwb#1781. Closing this. Feel free to open a new issue to discuss features or issues with HERD.
Originally reported by: Andrew Tritt (Bitbucket: ajtritt, GitHub: ajtritt)