
20171106 Ontology Improvement Call


Date: 06 Nov 2017

Attendees: Christian Hauschke, Marijane White, Anna Kasprzik, Graham Triggs, Mike Conlon, Muhammad Javed, Juliane Schneider

Agenda:

Note from Marijane: I had a hard time keeping up in spots this week; there is some stuff I wasn't able to capture.

Mike: I think there are two big questions, as there have been for some time: how we should do our work, and what we should work on. I would like to tackle the how we work first. I would like to suggest that we work in GitHub and that we commit changes to files in a way that can be tracked there. I don't know if everyone is comfortable using GitHub; we can provide some instruction if necessary. But I wanted to ask Graham whether it would make sense to have a separate branch. We're considering changing the contents of the files in the filegraph, for example, trying to create an ontology file. We've noticed concerns with the files we have: they mix ontology and application configuration, so we'd like to clean some of that up. To do that it seems we should be working in GitHub with a branch from vivo-project, but I'd like to hear Graham's thoughts on that.

Graham: Yeah, that sounds right. Normally you have branches to separate out bits of work, such as maintenance branches; the Jena 3 upgrade happened in a separate branch and was eventually merged back into develop. We have some experimental changes we're looking at that should be done in a branch.

Mike: Yes, the changes we want to make need to be tested. Do you have a naming convention to suggest?

Graham:

Mike: I was thinking of a fork in the actual VIVO project, and the reason why is that we're nowhere near ready to work in the OpenRIF project, because our ontological context is scattered, so it seems like the first step is to gather this in the files that are part of the VIVO project.

Graham: Then I would suggest creating a fork, do the work in a branch, and then

Christian: Will the development take place in the same branch or would there be separate branches for each feature?

Graham: That's a fairly theoretical question to answer. I think the focus right now is reorganizing the files.

Mike: and just to be very clear about this, should we be forking 1.9.3, or the develop branch, or what should we be forking?

Graham: Hang on a minute, we're talking about forking the VIVO code and not OpenRIF.

Mike: Correct. We have to reorganize the files in the VIVO project.

Graham: That would be a branch from develop.

Mike: That makes sense, there's been a lot of downstream work and we want to be in line with the current release.

Graham: if we were going to do, say, a 1.9.4, it should be with the ontologies we have, not with a reorganized set of ontologies.

Mike: Yes, the ontology work should be for a future release.

Graham: Not an existing release.

Mike: Ok, so we create a branch off develop and give it a name that has something to do with ontologies, so that everyone knows it's the ontology work branch, and that we're not changing code, we're changing the files we've talked about. Is everyone clear?
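
As an illustration of the branching workflow being discussed here, the steps might look roughly like this; the fork URL and the branch name are placeholders, since the actual branch name is still to be agreed with Graham:

```sh
# Clone your fork of vivo-project/VIVO and branch off develop.
# "ontology-reorg" is only an example name.
git clone https://github.com/<your-username>/VIVO.git
cd VIVO
git checkout develop
git checkout -b ontology-reorg
# ...edit the RDF/ontology files, then commit and push...
git add .
git commit -m "Reorganize ontology files"
git push origin ontology-reorg
```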

Juliane: We are still struggling with building eagle-i, we think because of a file structure reorg?

Marijane:

Juliane:

Mike: All the changes we're talking about will happen in the VIVO project code.

Marijane: I see a long term goal of figuring out how to extract the reorganized file from the OpenRIF repository, but that is a long way off.

Mike: We're trying to straighten out the RDF information that is distributed with VIVO.

Provide files that are clearly data, clearly ontology, and clearly application configuration, naming them appropriately and putting them in appropriate places. There's a certain amount of housekeeping required there. That work also immediately allows us to start improving the ontology assertions in particular, such as the label issues that Javed has been working on. Javed, can you tell us a little bit about what you did?

Javed: I looked into the files in those 2-3 folders, and I found two things. One was InitialTboxAnnotations.n3: that file has rdfs:labels for a number of entities that do not exist in the main ontology files, which I had to extract from there to put into the main ontology file. Second is individual instances, such as the VivoDocumentStatus instances, which exist in an abox file as well as in our merged filegraph.owl file; also the DateTimePrecision instances. I believe these instances should not exist in filegraph.owl. So I wrote some code to extract all the labels from the annotations files. Then I discovered that some of the entities do not exist; I believe the entities were removed but the annotations were not. Last week I sent an email with the file. It's in RDF, and I believe if we want it in OWL I may need to convert my Jena RDF model to OWL.
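
As a rough sketch of the kind of extraction Javed describes, using Apache Jena (the file names, the orphaned-label reporting, and the choice of RDF/XML output are assumptions for illustration, not his actual code):

```java
import org.apache.jena.rdf.model.Model;
import org.apache.jena.rdf.model.ModelFactory;
import org.apache.jena.rdf.model.RDFNode;
import org.apache.jena.rdf.model.Statement;
import org.apache.jena.rdf.model.StmtIterator;
import org.apache.jena.riot.RDFDataMgr;
import org.apache.jena.riot.RDFFormat;
import org.apache.jena.vocabulary.RDFS;

public class ExtractLabels {
    public static void main(String[] args) {
        // Paths are placeholders; point them at the files in the VIVO RDF directories.
        Model annotations = RDFDataMgr.loadModel("InitialTboxAnnotations.n3");
        Model ontology = RDFDataMgr.loadModel("filegraph.owl");

        Model labels = ModelFactory.createDefaultModel();
        StmtIterator it = annotations.listStatements(null, RDFS.label, (RDFNode) null);
        while (it.hasNext()) {
            Statement s = it.next();
            // Keep labels only for entities that still exist in the merged ontology;
            // labels whose subjects are gone point at entities that were removed.
            if (ontology.containsResource(s.getSubject())) {
                labels.add(s);
            } else {
                System.err.println("Orphaned label: " + s);
            }
        }
        // Write the kept labels as RDF/XML so they can be folded into the OWL file.
        RDFDataMgr.write(System.out, labels, RDFFormat.RDFXML);
    }
}
```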

Mike: Yeah, I think that might be preferred. I think as we move to GitHub, and do this work in GitHub, we're going to have an issue for every commit. So, if we have a set of files and we've checked them out and made changes to be committed back in, there will have to be an issue describing the commit, so the commit can refer back to it, and we know what work was being done in that commit.

So, working in GitHub with commits and issues generates an obvious next question: how do we test our changes? How do we know that we did not affect the running app, or that we affected it in the positive way we intended? An example: we have an open issue where several entities in VIVO, such as Award and AwardReceipt, have skos:Concept as a parent. These are information entities of some kind; they are not skos:Concepts. But because the ontology says they are skos:Concepts, you get a list of awards when you search for a list of concepts, which makes a mess of the search results. And the happy thought is that when we fix this and fire up VIVO, you'll no longer find these things in the list of concepts. I'm pretty sure this will work out well, but I'm not 100% sure it will, so it needs to be tested.
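
One quick check would be to list which classes are asserted under skos:Concept before and after the fix. This is only a sketch, assuming Apache Jena and a merged TBox file named filegraph.owl:

```java
import org.apache.jena.query.QueryExecution;
import org.apache.jena.query.QueryExecutionFactory;
import org.apache.jena.query.ResultSetFormatter;
import org.apache.jena.rdf.model.Model;
import org.apache.jena.riot.RDFDataMgr;

public class ListSkosSubclasses {
    public static void main(String[] args) {
        // The file name is a placeholder for the merged TBox being tested.
        Model tbox = RDFDataMgr.loadModel("filegraph.owl");
        String query =
            "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#> " +
            "PREFIX skos: <http://www.w3.org/2004/02/skos/core#> " +
            "SELECT ?cls WHERE { ?cls rdfs:subClassOf skos:Concept }";
        QueryExecution qe = QueryExecutionFactory.create(query, tbox);
        try {
            // Lists classes directly asserted as subclasses of skos:Concept;
            // after the fix, classes like Award should no longer appear.
            ResultSetFormatter.out(qe.execSelect());
        } finally {
            qe.close();
        }
    }
}
```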

Anna: On the other hand, if people have been complaining about this for years, and the change breaks something, you'll get immediate feedback.

Mike:

Anna: Would there be any systematic way to find out?

Mike: Well, we can test it, we can load the files in a running vivo and see what happens.

Anna: without side-effects.

Mike: That's my problem, I am not sure what thorough testing would look like.

Graham: We could run the acceptance tests, which may be written to test the current behavior rather than the desired behavior, but they would generate errors that indicate either things that need to be fixed in the test or in the code.

Mike: Yes, you would have to examine the test to make sure it's still valid.

Graham:

Mike: So you have to be careful to execute your smoke testing properly. If you rebuild VIVO you're going to get a new index, right?

Graham: No, it comes down to whether you clear out the data or not. We clear out the triplestore pretty often, without touching the Solr index. So things show up in the UI until Solr reindexes.

Mike: This is all an argument for better sample data. I have some, with an award and concepts in it, and you can see the problem in the sample data. The potential for unintended consequences is there. It is a complex app, and the code makes assumptions about the state of the ontology.

Graham: The kind of changes we are talking about, the kind of issues we're talking about, would be if the code assumes a certain structure and the hierarchy has changed.

Mike: And this particular change is not very complex.

Graham: Note that for these kinds of changes, we should be documenting them early and, to whatever extent possible, making it clear that you should not be relying on the incorrect superclass.

Mike: Yes, it's hard to explain that in general, especially to the community, but in this particular case it's not expected that Awards and Concepts are supposed to be together. So we can suggest this change, I think that will be roundly applauded, but I don't think we're going to know that it's safe to do.

Graham: You can't be sure that people aren't going to make adjustments they need, but I don't think it's problematic to tell them ahead of time.

Mike: I think our process says we will do that.

Graham:

Javed: but for the first phase, we're not making any ontology code changes.

Mike: Yes, we're just reorganizing the files. And this could have unintended consequences, but I'm pretty sure it will be ok. This gets at whether the code expects triples to be in specific graphs, and I don't think it does that.

Javed: We should look at which entities and labels are accessed from which folder of the RDF.

Graham: I think the biggest distinctions are: is something in the content, and is it in the abox or the tbox. There do appear to be some things that are read in from specific files and placed into graphs unrelated to those file names, so we might have to be careful about how those things are being loaded in during startup; otherwise I think the file and graph names are fairly arbitrary.

Mike: and not actually used by the application, right?

Graham: only to compare the graph on disk to the graph in the triplestore. It might need to reinference.

Mike: but it doesn't affect the function of the application.

Graham: Yes.

Mike: But again, some of this will have to be tested.

This is sort of off-topic, but I'm going to ask anyway. There are a lot of configuration files in a lot of directories, and it's not really related to ontology, but is there any rationale or purpose for that?

Graham: Well, I can't speak historically, but my understanding is that the config directory is separate from the RDF directory and holds some specific information, like where to find the triplestores. The RDF directory is organized a certain way, and there is a split between things that are only ever loaded once, the first time, and things that are loaded every time the application starts. As for the actual files, to the extent that some of them exist to separate things by ontology, you'd want to keep FOAF information separate from BIBO, and so on.

Mike: Eventually, but I'm asking specifically about the config information, which is spread across a dozen directories and the files within them; I'm not sure why that is.

Graham: Can you reference a specific example?

Mike: I have notes somewhere but I can't find them at the moment; I'll recreate the file. The question came up because we were trying to find ontology assertions, and the issues we ran into led to the question of why the config files are so distributed. But the goal here is to reorganize the ontology files.

Graham: I think here it makes sense to have some organization so you can find the assertions you're looking for.

Mike: yes, they should be functionally grouped, but it seems to be more than that.

OK, so, we're trying to create an ontology file and get those labels in there. The notes in GitHub from a couple meetings back have a good description of what we think will have to be done.

So, my understanding is that Javed is a committer here, is that right?

Javed: Yes.

Mike: so maybe you could follow up with Graham about what this branch will be called, and let us know about it so we can do some work in it.

Javed: Yes, sure.

Mike: I think this is what we need to make changes, work certain issues, and eventually commit them.

Javed: We'll start with the tbox annotation file and move from there.

Mike: and as the work progresses we can discuss whether we even need certain files. Right now we are heading for a single TBox file, which might be called vivo.owl to start with, and that would mean we don't need these other files, except where they have configuration assertion info, in which case that should come out of the tbox and go somewhere else. As we go through this we are going to find that some of the contents will need new homes.

Javed: So you're saying that in the tbox filegraph folder, we need to examine the files one by one to see what's in them?

Mike: Well, when we merged them with ROBOT, we found Vitro assertions that have nothing to do with the ontology, they indicate things about how information is displayed in the UI.
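
For reference, a merge like the one Mike describes can be reproduced with ROBOT roughly as follows; the input file names are placeholders standing in for the actual files in the tbox filegraph folder:

```sh
# Merge the individual TBox filegraph files into a single OWL file.
robot merge --input fileA.owl --input fileB.owl --output filegraph.owl
```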

Javed: Did you remove those when you made the filegraph?

Mike: I believe so, but they are still in the files in the tbox.

Javed: I'm talking about, you built the filegraph.owl file, and then we realized it has these annotations in it. So either they exist or they don't. If they still exist, we need to take them out.

Mike: That is exactly what I'm saying.

Javed: Similarly, I have extracted the rdfs:labels, which is fine, and they already exist elsewhere, so we don't have to find them a home. So next I will look into filegraph.owl, make sure there are no more Vitro assertions, and if they do exist, we need to figure out where they go.
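
One way to check for remaining Vitro assertions would be along these lines; this is a sketch assuming Apache Jena, and the Vitro annotations namespace shown is an assumption that should be verified against the actual files:

```java
import org.apache.jena.rdf.model.Model;
import org.apache.jena.rdf.model.Statement;
import org.apache.jena.rdf.model.StmtIterator;
import org.apache.jena.riot.RDFDataMgr;

public class FindVitroAssertions {
    public static void main(String[] args) {
        // Assumed namespace for Vitro application annotations; verify before relying on it.
        String vitroNs = "http://vitro.mannlib.cornell.edu/ns/vitro/0.7#";
        Model tbox = RDFDataMgr.loadModel("filegraph.owl");  // placeholder file name
        StmtIterator it = tbox.listStatements();
        while (it.hasNext()) {
            Statement s = it.next();
            // Flag any statement whose predicate is in the Vitro namespace; these are
            // application configuration, not ontology, and need a different home.
            if (vitroNs.equals(s.getPredicate().getNameSpace())) {
                System.out.println(s);
            }
        }
    }
}
```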

Mike: I think when we're done, there should be only three owl files. Vitro, Vitro-Public, and the VIVO owl file.

Javed:

Mike: These are other ontologies that the app uses.

Graham: I think there should be more files, like the geopolitical ontology should be in its own file.

Mike: Yes and I believe we discussed that a couple meetings ago.

Javed: So if I look in the filegraph folder, you built a merge of these files, so where did the Vitro and Vitro-public come from?

Mike: I did not include them, they are outside the domain.

Javed: They are defined under the Vitro namespace. Yeah, you are right, they will be defined under something Vitro, but if there are Vitro-specific annotations, they should still exist in the tbox annotations folder.

OK, I will look into this. I think the main question for me is whether filegraph.owl has Vitro annotations. If it doesn't, where did they go? If it does, we need to pull them out.

Mike: And I can't use it in a running app yet because these required application configuration assertions don't have a proper home yet.

Marijane: So basically, the work done so far focused on the domain ontology, and now we need to go back and clean up the application ontology.

Mike: That's basically it in a nutshell.

Graham pointed out that there are situations, which we talked about in the Oct 9 meeting, where something we would like to do requires both data and ontology assertions. One question is how much of it is an ontology question, how much you want to bring into the VIVO ontology; and then there is another question about how you want the files organized. We know we have FOAF assertions in the ontology; there is a separate question about whether we want them in a foaf.owl file, and whether we want the files organized by namespace. There is a somewhat different conversation about whether the geopolitical assertions are part of the ontology or part of the dataset. When we last discussed this we decided we wanted a really clean separation between ontology and data. It's not an obvious thing, it could go either way; some things like DateTimePrecision don't feel like data, they feel like ontology, but they're defined as individuals.

Graham: I think this is an issue of risk assessment. If we are, for example, bringing in the FOAF ontology, the idea is that these things are separate and don't change, and then there are other things we do change, so we don't risk making changes we don't want.

Javed: So there are two things here: one is how we organize the files in the software, and the second is what we call the file. When people want the VIVO ontology, we want to give them one single file. But we have the option to keep them as individual files in the source code.

Graham:

Marijane:

Mike: I think there are two reasons. One is cultural: when people ask for the ontology file, we give them one file. The other is the URI where these things resolve.

Javed: and one more reason is that the ontology is our model of the world, built from different ontologies. If they are provided separately, it's not a model, it's a bunch of entities living in different files.

Anna: also, then we can check the consistency. I was just looking at this; do you know the Ontology Pitfall Scanner? I attended ISWC in Vienna, where I came across this Ontology Pitfall Scanner and loaded the Java file into it. Let me give you the link. I don't know if this is any good, but you can check yourself: http://oops.linkeddata.es/ I just pasted the file in there, and admittedly some of the pitfalls it reports are not justified, but it could be useful to go through the things it thinks it's found.

Graham: I just wanted to add that there's no application benefit to how we structure it.

Mike: I take that to mean we're free to go ahead, and we should do it in the way that works best for ontologists.

Graham: Something we don't do at the moment: we could be more intelligent about how we reload ontology files and re-inference based on changes. For example, there is a typo in the current development branch. If we change that and reload it, it's going to reinference everything. A typo fix isn't going to have any inferencing impact, but the reload forces a reinference of every individual in the system, so that is something we might think about loading more intelligently.

Mike: I think that is a fine thing you should be thinking about.

Graham: So we could optimize for changing only the affected individuals, but that could be more complicated if we only have one file.

Mike: Well, I think we're only going to have one file for now, and we know that changes may trigger reinferencing, and I know that can be painful, but this isn't going to be a rapid-fire activity. The development should be done against test data, not something like 50 million triples. We should work in test harnesses where the changes can be validated. But if you come up with suggestions that would make things easier, that would be fine. I know that when we do a release, if we change the ontology, a certain amount of reinferencing is unavoidable; it would be good to know which of the changes we are talking about have an impact because they require reinferencing, but I was under the impression that everything required it.

Graham: Yes, at the moment, if anything changes at all, everything has to be reinferenced.

Mike: so until there is an alternative we have to go in with our eyes open. It's an interesting point; I do want to update our description of impact, because even our lowest-impact changes are going to require reinferencing.

Marijane: Does this seem like a good place to stop this week?

Mike: Yes, we have new tools, and a new process to try out.
