20180123 Ontology Change Improvement Call

marijane white edited this page Feb 6, 2018 · 1 revision

Date: 2018.01.23

Attendees: Mike Conlon, Andrew Woods, Marijane White, Javed Muhammed

Agenda: (From Mike's email to openrif-dev) We have some catching up to do, and perhaps Javed can fill us in on his efforts with respect to creating a single file for us to work from. I can share updates on VIVO development.

I’d like each of us to think of the simplest change we might make to the ontology — you will recall that a goal for this group is to demonstrate an ontology change process — that means taking changes all the way to the delivered code. I would like us to consider creating a pull request that can become part of the next release. We might add annotations, or fix a misspelling, or other simple improvement. It is important that we are able to create changes that make their way into the delivered release. Please think of what change you can propose we make.

Minutes: Mike started off by recapping the last year for Andrew, because this is his first Ontology Improvement call.

Mike: This group was created almost a year ago to create a process by which the ontology can be modified. It hasn't been modified since 2013 because of a catastrophic event where the whole data model changed and broke every query, members quit, it was really brutal. Some members never upgraded, or waited multiple years and it was very painful. UFL estimated the cost of their upgrade was $200,000. Lots of custom queries and tools, it was a lot of work.

Andrew: [missed his question]

Mike: What we've been studying here is which changes are low impact, which are moderate impact, and which are like the dramatic, difficult changes we made in 2013. And if you're going to make changes, what kind of lead time, what kind of support, what kind of heads up you need before that. There is also an underlying debate/conversation about why these changes should be made. There are changes that are simple additions, adding a new kind of book for example, which has zero impact on the community. A change like adding assertions for licensing would again be a simple "pure add" that would have little to no impact on the community. Then there are changes that are obscure, such as a change in some ontological model that not many people are using. I can't think of an example at the moment, but there are backwaters in the ontology, we have a lot of ways of saying different things, and if we cleaned those up it might affect 2 or 3 sites, versus changing the Authorship model that would affect every site deeply. So the impact on the community depends on who is using the assertions. Then there are changes that we think have little impact, but we're not sure. In terms of usage we do have actual data from Dave Eichmann pulled from every site, we have spreadsheets tallying them.

Marijane: Dave's data also tells us about local extensions people have made.

Mike: It's great that people can add things so easily, but that makes the data incomparable.

Marijane: Some of the local extensions could also be candidates to add to the main ontology.

Mike: Yeah. A lot of what people have added are local identifiers, so they can tie their VIVO to their other local systems. There are things in the ontology that ontologists would look at and consider inelegant or incorrect. The people who are not ontologists don't care, they just want something that works, but the ontologists are embarrassed to show this work to other ontologists.

Andrew: I don't want to waste Marijane and Javed's time recapping things they are familiar with.

Mike: I just want to plant a seed here, we're trying to figure out how ontological changes might make their way into the software.

Andrew: so maybe agenda item 3 is how can we make low-impact changes.

Mike: yes, and I asked in my email for people to nominate such a change, something very low impact like adding a label.

Javed: Our final goal, agenda item #3, is that we have a mechanism to apply ontology changes. We started this work, but when we started planning for this demonstration, we found that the ontology we have in the software is not the same as the ontology available on the web. And when we looked for the version on the web, the core version, if you look in the software there are around 40 files

Mike: (laughing) for no good reason, apparently.

Javed: so we were looking for how to create one single file that can both live on the web and be used in the application. When we looked into that, we realized there is something called annotations, which are defined all over, some in the T-box, some in other files, and we decided we needed to bring these Vitro annotations together somewhere, so we did that, and then we decided we should have the schema separate from the instance-level T-box data, because some A-box data was in the T-box. And then the labels, they were not in the ontology files, they were in a different annotation file. So I extracted those labels out of the annotation files and moved them into the ontology files. We have this branch, vivo-ontology-lab.

Mike: It's a GitHub organization, and I'm not sure why we need it.

Javed: Hang on, let me find the email from Graham about it.

Mike: This is like a separate project.

Marijane: If you make a change in one of these forks, you could make a pull request to the main VIVO project.

Mike: but if development is moving, and it's moving from the main project, you don't have all the most recent changes.

Javed: well, my changes are in this new org/project. If you go to the link I shared in the chat, I can tell you what I did:

  • Removed instance data from filegraph file.
  • Application config entities are defined in application-config.owl
  • I also cleaned up filegraph.owl a bit, OBO stuff.
  • InitialTboxAnnotations.n3 file had all the labels, so I wrote a program to get them out of this file and into filegraph.owl. So now every resource is in filegraph.owl, and the annotations are in the Vitro file.

I believe this file is ready to test.
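As a rough illustration of the label move Javed describes, here is a minimal sketch. It is not his actual program: it assumes the triples have already been parsed into plain (subject, predicate, object) tuples, and the `example.org` IRIs, the `displayRank` property, and the `move_labels` helper are all hypothetical names chosen for the example.

```python
# Minimal sketch: move rdfs:label triples out of an annotation set and into
# the main ontology set. Triples are plain (subject, predicate, object) tuples.
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
OWL_CLASS = "http://www.w3.org/2002/07/owl#Class"
EX = "http://example.org/onto#"  # hypothetical namespace, for illustration only


def move_labels(annotation_triples, ontology_triples):
    """Return (remaining_annotations, merged_ontology) without mutating inputs."""
    labels = {t for t in annotation_triples if t[1] == RDFS_LABEL}
    remaining = set(annotation_triples) - labels
    merged = set(ontology_triples) | labels
    return remaining, merged


# Stand-in for the annotation file: one label plus one other annotation.
annotations = {
    (EX + "Book", RDFS_LABEL, "Book"),
    (EX + "Book", EX + "displayRank", "10"),  # hypothetical annotation property
}
# Stand-in for the ontology file: the class declaration only.
ontology = {(EX + "Book", RDF_TYPE, OWL_CLASS)}

remaining, merged = move_labels(annotations, ontology)
```

After the move, no triple is lost: the label sits with the class declaration, and the other annotation stays where it was. This matches Mike's summary that everything is still there, just in a different place.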

Mike: So everything is there, just in a different place.

Javed: Yes, I brought together all the annotations into one file, where they used to be in multiple files. I believe the changes I made should work.

Mike: So we have to build this out of the vivo-ontology-lab. Download and build and run. I have one immediate question, which is I believe that you're gonna end up with the assertions in different named graphs in VIVO because the filenames changed, so the graph names will be different in VIVO.

Javed: What I remember from our previous discussion is that each file had its own graph, and now we will have one graph.

Mike: Right, we have one single graph, and that is much better. When you open the Manage Models UI you will see many fewer graphs.

Javed: When you're curating the data it doesn't matter what graph things are in.

Mike: because the application uses the union graph in its SPARQL queries.
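The point about graph names and the union graph can be sketched as follows. This is a simplified model, not VIVO code: each named graph is a Python set of triples, the graph names and `ex:` terms are made up, and `union_graph` is a hypothetical helper.

```python
# Minimal sketch: consolidating several file graphs into one changes the
# graph names but not the union of all triples.
def union_graph(named_graphs):
    """Merge every named graph into one set of triples, ignoring graph names."""
    merged = set()
    for triples in named_graphs.values():
        merged |= set(triples)
    return merged


# Before: one graph per ontology file (hypothetical names and triples).
before = {
    "graph:tbox/file-a.owl": {("ex:Book", "rdf:type", "owl:Class")},
    "graph:tbox/file-b.owl": {("ex:Book", "rdfs:label", "Book")},
}

# After: the same triples, loaded from a single consolidated file.
after = {
    "graph:tbox/filegraph.owl": {
        ("ex:Book", "rdf:type", "owl:Class"),
        ("ex:Book", "rdfs:label", "Book"),
    },
}

same_union = union_graph(before) == union_graph(after)
```

Under these assumptions `same_union` is true, so anything that reads only the union graph sees the same data. Anything that refers to a specific graph by name would not be covered by this argument and would need to be checked separately.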

Javed: [missed what he said]

Mike: You can put in as an action item that I will test this. But there's another question here, which is, let's say it does boot up, we're not sure that everything works, just that it booted up. I would do my normal smoke tests, put in sample data, click on profiles, etc, but we don't know what isn't working because we don't have a test suite. My smoke testing is not enough for our community, they are going to do more than what I would do in it.

Andrew: Maybe there are some other institutions that can do some testing? We can throw the net out wider?

Javed: I think it would be a good idea to make a list of what we're testing. Going to the organization/department pages, do they load, can we curate data and enter things properly.

Mike: The issue of course is that there are 411 classes.

Javed: I'm not worried about the classes and properties, I am more worried about the Vitro annotations, whether the software has hardcoded links to the files.

Mike: I'm worried about graph names, I'm worried about lots of things. We have impressions but we don't know. Our impression is that this should work. So we'll give it a try. And then I will take -- I don't know much about how GitHub does cross-organization merges and such.

Marijane: I think it's pretty straightforward, the names of things being merged are just longer than they would be if they were in the main repository.

Andrew: this seems like this is just a fork of an existing repository, it might be a good idea to merge it back in and make it a branch off the main project.

Marijane: Maybe he intended it as a sandbox, maybe he was scared of accidental changes in the main project, I'm just speculating.

Mike: he was definitely afraid of us!

Javed: My changes are only in the Tbox folders.

Andrew: So these are only structural changes?

Javed: Yes.

Mike: I had merged some things together and made some errors. Did you discover any errors in the original distribution?

Javed: you mean in the separate files? No.

Mike: there were inconsistencies, I had already resolved those in my filegraph file.

Mike, Marijane: the DateTime was the wrong property type.

Javed: I recall we were also using something incorrectly from VCard but I don't want to go into that here.

Mike: There were some things that were not quite right.

Javed: There are some things we are using that have been removed from the external ontologies they came from. But for this version we're just merging.

Mike: Yes, good, just merge these files, we don't need to make any substantial changes. There are some substantial changes we'd eventually like to make.

Andrew: The developer interest call is after this meeting, those people seem very active, do we want to ask them to test these changes? Or are their local changes too different for this to work?

Mike: oh, I think this would have to be tested from the fork, rather than on any existing systems.

Javed: I think Mike should test it first and see if it boots up. =)

Andrew: I still have the question, assume things boot and the smoke tests work, is it reasonable to ask people on the Development IG call, how much customization do they do, if they swapped in this consolidated version of the ontology, what impact it should have? Yeah, it should just work.

Mike: It should just work.

Andrew: would they have to wash out their existing files before loading this?

Mike: yes, we would have to come up with a procedure. And I need to think about the impact on the database. VIVO has this first time thing that just drives me wild. There is a set of files called "first time" and the only way to change them is to empty your database, which is very hard, especially for a production site.

Marijane: we don't want people to try this on production systems, though, hopefully people have staging servers they could test on, that they don't mind blowing away.

Mike: Oh no, of course not. But this is why we need a procedure to tell people how to go about testing this. I don't have one off the top of my head, but I can think about how to do it. I will have a clean VIVO, I could load my sample data into it. But we also need a test with an existing database where we swap out the ontology, which should also work. So there are two tests that need to be done.

Andrew: And it's the second test our users are interested in.

Mike: and is much harder, but I will think about it. And we have people in the community who are pretty good at this stuff.

Andrew: have people maybe thought about this? It doesn't necessarily seem revolutionary to want to reload your ontology.

Mike: It's revolutionary, remember we haven't done anything like this since 2013. But we do have pretty clear thinkers about the RDF and such, like Benjamin Gross at Clarivate, he might be able to poke holes in the procedure. Also a lot of sites are production sites, they don't often build a VIVO from scratch, a lot of sites are not familiar with this process. That makes it all riskier. And maybe we could talk about a test framework on a development call.

Andrew: Yes.

Mike: I'd sleep a lot better if I knew we could just push a button and run 100 tests and know things are working.

I'm not sure what else we have to talk about today. I've written some JavaScript to display the ontology in a different way. We know we have a bunch of dangling classes not tied to BFO, and when you draw a set of nodes representing the classes and the links to the subclass assertions, you see the dangling classes right away. So having a complete IsA hierarchy seems like a good thing to have, and this tool/script makes it very easy to see. It reads the filegraph.owl and shows classes as nodes, and subclass links as links, and it colors them by namespace. It's a pretty good inspection mechanism.
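The dangling-class check that Mike's tool makes visible can also be sketched as a small script. This is not his JavaScript: it is a minimal Python sketch that assumes the subclass assertions have already been extracted as (child, parent) pairs. The `example.org` class names and the `dangling_classes` helper are hypothetical; the BFO IRI prefix is the standard OBO one.

```python
# Minimal sketch: find classes with no subclass path up to any BFO class.
from collections import defaultdict

BFO_PREFIX = "http://purl.obolibrary.org/obo/BFO_"
EX = "http://example.org/onto#"  # hypothetical namespace, for illustration only


def dangling_classes(classes, subclass_of, root_prefix=BFO_PREFIX):
    """Return, sorted, the classes that cannot reach a class under root_prefix."""
    parents = defaultdict(set)
    for child, parent in subclass_of:
        parents[child].add(parent)

    def reaches_root(cls, seen):
        if cls.startswith(root_prefix):
            return True
        if cls in seen:  # already visited; also guards against cycles
            return False
        seen.add(cls)
        return any(reaches_root(p, seen) for p in parents[cls])

    return sorted(c for c in classes if not reaches_root(c, set()))


classes = [
    BFO_PREFIX + "0000001",
    EX + "Person",
    EX + "Student",
    EX + "Orphan",
    EX + "OrphanChild",
]
subclass_of = [
    (EX + "Person", BFO_PREFIX + "0000001"),
    (EX + "Student", EX + "Person"),
    (EX + "OrphanChild", EX + "Orphan"),  # Orphan itself has no parent
]

dangling = dangling_classes(classes, subclass_of)
```

Here `dangling` holds the `Orphan` and `OrphanChild` classes, since neither has a chain of subclass links ending in BFO. A check like this could complement the visual inspection by listing the classes to fix.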

Javed: Can you see a specific part?

Mike: No, it's just the whole file. We could think about what kind of filters we might want to have.

Marijane: Filter by namespace, perhaps.

Mike: Yeah, though you wouldn't see the ontological scaffolding.

Andrew: So I'm looking at the GitHub organizations, seeing the openrif/vivo-isf-ontology, trying to draw a line between that and what you did, Javed. Did you pull from that?

Javed: No, the one single ontology file available on openrif is different from the merged files in the software. They are different.

Andrew: Is the idea that there's this ISF and a mirror in the codebase?

Marijane: I want to note, Andrew, the ISF is the merger of the eagle-i and VIVO ontologies. I am not clear on the history of why they were merged, I gather that Harvard wanted to link to eagle-i from Harvard Profiles. Sometimes I question the wisdom of the merge.

Juliane: As do we all! =)

Mike: Well, there were two NIH grants, and there was a winning argument that the ontologies should be reconciled, and Melissa said they could do that.

Andrew: [missed comment/question]

Marijane: One of the longer term goals here is to be able to extract the VIVO ontology from the ISF, as the eagle-i ontology is extracted from it, at least when it's working, the idea that both projects will have their local extracted ontologies that are maintained in the ISF. It's been broken for over a year, after I worked with Tennille Johnson to add new things to it. I am working with Shahim Essaid, the tools developer, to understand the build process so we can debug and loop in the Harvard team on the process. Once we do that, we will be in a good place to work on extracting VIVO from the ISF as well. But there is a lot of technical debt and a lot of yaks to shave before we can get there.

Javed: I put a link in the chat to the 1.6 file, it should be the mirror of what we have in the software. But they are not the same, so we decided to go with what's in the software, so then Mike merged them, and we eventually ended up with what's in the vivo-ontology-lab repository.

Mike: and if the file in the vivo-ontology-lab works, we will replace the file on the web, and when we do that, we will be able to make the changes.

Andrew: Which are the files in vivo-ontology-lab replacing?

Javed: you can see it in the GitHub commit history, a number of files are removed and two are added.

Andrew: Is getting rid of vivo-ontology-lab a goal?

Javed: I think we just want to have a separate branch.

Andrew: Just as a mitigating strategy if there's debate around the possibility of getting rid of it.

Javed: I think the idea is that this work is still research. So that is maybe why it's separate.

Andrew: That makes sense, I'm just looking further

Marijane: Can we ask Graham why he made it a separate org? Or is he dead to us now? (joking) I hadn't even realized he left until I saw Mike's email.

Mike: Yes, at the end of the year.

Andrew: and it seems like coordinating with the ISF work is separate. And the longer we wait the bigger it will be.

Marijane: There's a lot of big stuff with the ISF, I'm not sure waiting will make it much worse.

Mike: And we have some people from that project on this call, so we're in it together.

Andrew: so, Mike, you're going to do some greenfield development and testing to swap out an existing database.

Mike: I do have several VIVOs I could test that on.

Andrew: I'm also looking at the wiki page, we've mostly covered the agenda items. I see a list of completed items and action items.

Mike: Those are there to remind people what we're trying to do. We are making good progress, this single file is key to the whole enterprise.

Javed: OK, Mike, you are going to test it, and I am going to go to the ontology file and find some no-impact changes.

Mike: Low impact changes, that would be good.
