First revision #2

Draft: wants to merge 1 commit into base: main
12 changes: 9 additions & 3 deletions journey-to-ontologize-metadata.md
@@ -1,5 +1,11 @@
# Journey to ontologize metadata

### Abstract

TBD.

---

### Authors

- Nukorn Plainpan (Empa)
@@ -20,7 +26,7 @@ Ontology is a formal representation of knowledge within a domain, including enti

## Ontologizing metadata: why the hassle?

In scientific research, the concept of ontologizing metadata has emerged as a pivotal strategy to enhance data interoperability across various research groups [6,7]. The essence of ontologizing metadata lies in structuring the data in such a manner that it becomes both machine-readable and actionable. This process not only accelerates data sharing and integration for applications such as machine learning, but also significantly boosts automation and workflow efficiency. Furthermore, ontologizing metadata upholds the principles of FAIR [8] data, making them **F**indable, **A**ccessible, **I**nteroperable, and **R**eusable, specifically focusing on the aspect of Interoperability. One of the most compelling use cases for ontologized metadata is its compatibility with SPARQL [9] queries, which allows for sophisticated data retrieval and analysis. SPARQL, a powerful query language for RDF data, allows for precise searches across structured, semantic data formats. This enables researchers to perform complex queries over diverse datasets, improving data interoperability and fostering advanced analysis. Such capability is crucial for identifying specific data points, patterns, and relationships across vast and interdisciplinary datasets, thereby accelerating discovery and innovation in scientific research. An example of how to use SPARQL with ontologies, is provided within the BattINFO [10–12] ontology for the case of "Zinc powder from a supplier" [13]. The example touches on several concepts: utilizing ontology terms and JSON-LD for resource description, transforming JSON-LD into triples by machines, understanding the roles of subject, predicate, and object in identifiers, executing basic SPARQL queries, leveraging ontology for enhanced data retrieval from various sources.
In scientific research, the concept of ontologizing metadata has emerged as a pivotal strategy to enhance data interoperability across various research groups [6,7]. The essence of ontologizing metadata lies in structuring the data in such a manner that it becomes both machine-readable and actionable. This process not only accelerates data sharing and integration for applications such as machine learning, but also significantly boosts automation and workflow efficiency. Furthermore, ontologizing metadata upholds the principles of FAIR [8] data, making them **F**indable, **A**ccessible, **I**nteroperable, and **R**eusable, specifically focusing on the aspect of Interoperability. One of the most compelling use cases for ontologized metadata is its compatibility with SPARQL [9] queries, which allows for sophisticated data retrieval and analysis. SPARQL, a powerful query language for RDF data, allows for precise searches across structured, semantic data formats. This enables researchers to perform complex queries over diverse datasets, improving data interoperability and fostering advanced analysis. Such capability is crucial for identifying specific data points, patterns, and relationships across vast and interdisciplinary datasets, thereby accelerating discovery and innovation in scientific research. An example of how to use SPARQL with ontologies is provided within the BattINFO [10–12] ontology for the case of "Zinc powder from a supplier" [13]. The example touches on several concepts: utilizing ontology terms and JSON-LD for resource description, transforming JSON-LD into triples by machines, understanding the roles of subject, predicate, and object in identifiers, executing basic SPARQL queries, leveraging ontology for enhanced data retrieval from various sources.
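
The subject, predicate, and object roles mentioned above can be illustrated with a small sketch. It is not the BattINFO example itself: the record, the `example.org` identifiers, and the helper names (`expand`, `to_triples`, `match`) are hypothetical, and the expansion logic is a deliberately simplified stand-in for a conformant JSON-LD processor. The `match` function mimics a single SPARQL triple pattern in plain Python rather than running a real SPARQL engine.

```python
# Simplified illustration only: real projects should use a JSON-LD/RDF library.
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

# Hypothetical context mapping short terms to ontology IRIs (schema.org here).
CONTEXT = {
    "schema": "https://schema.org/",
    "name": "https://schema.org/name",
    "manufacturer": "https://schema.org/manufacturer",
}

# Hypothetical resource description, loosely modelled on a supplier record.
DOC = {
    "@id": "https://example.org/zinc-powder-001",
    "@type": "schema:Product",
    "name": "Zinc powder",
    "manufacturer": {"@id": "https://example.org/supplier-a", "name": "Supplier A"},
}


def expand(term):
    """Expand a term or a prefixed name (prefix:local) to a full IRI."""
    if term in CONTEXT:
        return CONTEXT[term]
    prefix, sep, local = term.partition(":")
    if sep and prefix in CONTEXT:
        return CONTEXT[prefix] + local
    return term


def to_triples(node):
    """Flatten a nested node into (subject, predicate, object) tuples."""
    subject = node["@id"]
    triples = []
    for key, value in node.items():
        if key == "@id":
            continue
        if key == "@type":
            triples.append((subject, RDF_TYPE, expand(value)))
        elif isinstance(value, dict):
            # A nested node becomes the object, then contributes its own triples.
            triples.append((subject, expand(key), value["@id"]))
            triples.extend(to_triples(value))
        else:
            triples.append((subject, expand(key), value))
    return triples


def match(triples, s=None, p=None, o=None):
    """Return triples matching a pattern; None acts like a SPARQL variable."""
    return [
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


triples = to_triples(DOC)

# Roughly: SELECT ?name WHERE { ?x a schema:Product ; schema:name ?name }
products = [s for s, _, _ in match(triples, p=RDF_TYPE, o="https://schema.org/Product")]
product_names = [o for x in products for _, _, o in match(triples, s=x, p=CONTEXT["name"])]
print(product_names)
```

Because both the record and the query refer to the same ontology IRIs, the same pattern would also match records from other sources that use those terms, which is the interoperability benefit described above. Note that this sketch does not distinguish IRIs from literal values, which a real RDF model does.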

One of the key advantages of ontologies is the ability to support interoperability and integration across different systems and datasets. They enable different systems to exchange data and perform reasoning tasks, even if they were developed independently. Specifically, the exchange of research data may be facilitated via machine-actionable, self-ontologized containers of research data accompanied by ontologized metadata. A receiving platform can in turn unpack the container by following its specification and metadata. Examples of container specification formats include [RO-Crate](https://www.researchobject.org/ro-crate/), [BagIt](https://datatracker.ietf.org/doc/html/rfc8493), [DataCite](https://datacite.org/), [Dataverse](https://dataverse.org/), and more. These formats provide guidelines and standards for packaging research data in a FAIR way to facilitate exchange and interoperability. The PREMISE project is actively exploring these and other formats, as it continues to develop a set of guidelines for interoperability of FAIR research data in materials science. The RO-Crate specification, built on the JSON-LD serialization of RDF, has been chosen as a working example for the second MADICES workshop aimed at exploring inter-platform [interoperability](https://www.cecam.org/workshop-details/machine-actionable-data-interoperability-for-the-chemical-sciences-madices-2-1321).
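
As a rough sketch of what such a container's metadata can look like, the snippet below assembles a minimal `ro-crate-metadata.json` following the general layout of RO-Crate 1.1 (a metadata descriptor, a root dataset, and data entities in one `@graph`). The function name `build_crate_metadata`, the file path, and all descriptive values are hypothetical placeholders, not taken from PREMISE or MADICES material.

```python
import json


def build_crate_metadata(name, description, date_published, license_url, files):
    """Return the JSON-LD text of a minimal RO-Crate 1.1 metadata file.

    `files` maps a relative path inside the crate to a human-readable label.
    """
    # The metadata descriptor identifies this file and points at the root dataset.
    descriptor = {
        "@id": "ro-crate-metadata.json",
        "@type": "CreativeWork",
        "conformsTo": {"@id": "https://w3id.org/ro/crate/1.1"},
        "about": {"@id": "./"},
    }
    # The root dataset describes the crate as a whole and lists its parts.
    root = {
        "@id": "./",
        "@type": "Dataset",
        "name": name,
        "description": description,
        "datePublished": date_published,
        "license": {"@id": license_url},
        "hasPart": [{"@id": path} for path in files],
    }
    # One data entity per packaged file.
    data_entities = [
        {"@id": path, "@type": "File", "name": label}
        for path, label in files.items()
    ]
    crate = {
        "@context": "https://w3id.org/ro/crate/1.1/context",
        "@graph": [descriptor, root, *data_entities],
    }
    return json.dumps(crate, indent=2)


# Hypothetical example values, for illustration only.
metadata_text = build_crate_metadata(
    name="Coin cell cycling data",
    description="Example crate with one cycling measurement.",
    date_published="2024-01-01",
    license_url="https://creativecommons.org/licenses/by/4.0/",
    files={"data/coin_cell_cycling.csv": "Cycling measurement"},
)
print(metadata_text)
```

A receiving platform would read this file first, follow `about` to the root dataset, and then use `hasPart` to locate the packaged files.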

@@ -36,7 +42,7 @@ Before we could proceed with assigning ontologies to metadata, we had to identif

In the battery illustrative case, we selected 90 distinct metadata items for the process of coin cell battery production. This is an initial set, meant to be updated over time, created from brainstorming and feedback from the researchers active in this field with whom we are collaborating within PREMISE. In practice, each group that decides to adopt ontologies has to define its own requirements. The first step is to identify which metadata sets are already available in the literature for the specific research field. For the case of Scanning Probe Microscopy (SPM), we identified a metadata file available online named Nexus [14]. However, the main feedback we received from the researchers collaborating with us is that this scheme misses several fields. Therefore, we created our own set of metadata following the information obtained from our onsite experts and from the Nanonis software [15]. We identified more than 300 distinct metadata items by combining metadata as specified in the Nanonis file format with metadata needed to describe experimental setups, experimental procedures, and inventories for molecular precursors and crystals, together with all processes needed to reproduce a typical experimental project containing simulation results and microscopy results.

To maintain simplicity and clarity, we opted to organize both metadata sets (battery and SPM) in a tabulated format. This approach not only aids in visualization, but also in the subsequent steps of defining the ontology. Examples of the tabulated metadata items are provided as Excel files in the repositories of deliverables [D2.1](https://github.com/ord-premise/metadata-spectroscopy) and [D3.1](https://github.com/ord-premise/metadata-batteries).
To maintain simplicity and clarity, we opted to organize both metadata sets (battery and SPM) in a tabulated format. This approach not only aids in visualization, but also in the subsequent steps of defining the ontology. Examples of the tabulated metadata items are provided as Excel files in the repositories of PREMISE deliverables [D2.1](https://github.com/ord-premise/metadata-spectroscopy) and [D3.1](https://github.com/ord-premise/metadata-batteries).

### 2. Choosing ontology concepts

@@ -56,7 +62,7 @@ With an ontology concept in hand, the subsequent step involves adding the ontolo

![Figure 2.](./Figures/fig2.svg)

_Figure 2. Example of JSON-LD file. Top left: context of the JSON-LD file. The context contains the parameters (e.g. name) needed for defining the metadata. For example "abstract" is identified by the entity https://schema.org/abstract within the ontology schema.org . The definition of "abstract" within the ontology is obtained (bottom left panel) following the link. Abstract is a "string" as pointed out by "@type" and a "string" is defined within "xsd". The right panel is the graph of the JSON-LD, it contains metadata. For example, we have here three objects: annealing, sputtering and instrument. We see the metadata of these objects (for example "name" or "current") and the relation between these objects. Annealing contains sputtering, which contains instrument. Here "hasPart" means that for an annealing we need to do sputtering before. And to do sputtering we need to have an instrument._
_Figure 2. Example of JSON-LD file. Top left: context of the JSON-LD file. The context contains the parameters (e.g. name) needed for defining the metadata. For example "abstract" is identified by the entity "https://schema.org/abstract" within the ontology schema.org . The definition of "abstract" within the ontology is obtained (bottom left panel) by following the link. Abstract is a "string" as pointed out by "@type" and a "string" is defined within "xsd". The right panel is the graph of the JSON-LD, it contains metadata. For example, we have here three objects: annealing, sputtering and instrument. We see the metadata of these objects (for example "name" or "current") and the relation between these objects. Annealing contains sputtering, which contains instrument. Here "hasPart" means that for an annealing we need to do sputtering before. And to do sputtering we need to have an instrument._
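
The structure described in the caption can be approximated in a few lines. The document below is a hypothetical reconstruction, not the content of the figure: the `#annealing`, `#sputtering`, and `#instrument` identifiers and the helper names `resolve_term` and `part_chain` are assumptions made for illustration. The sketch shows how a context entry ties a short term to an ontology IRI and a datatype, and how nested `hasPart` relations can be followed.

```python
# Hypothetical JSON-LD document mirroring the layout described for Figure 2.
document = {
    "@context": {
        "xsd": "http://www.w3.org/2001/XMLSchema#",
        "abstract": {"@id": "https://schema.org/abstract", "@type": "xsd:string"},
        "name": {"@id": "https://schema.org/name", "@type": "xsd:string"},
        "hasPart": {"@id": "https://schema.org/hasPart"},
    },
    "@graph": [
        {
            "@id": "#annealing",
            "name": "Annealing",
            "hasPart": {
                "@id": "#sputtering",
                "name": "Sputtering",
                "hasPart": {"@id": "#instrument", "name": "Instrument"},
            },
        }
    ],
}


def resolve_term(context, term):
    """Return (property IRI, datatype IRI or None) for a term in the context."""
    definition = context[term]
    if isinstance(definition, str):
        return definition, None
    datatype = definition.get("@type")
    if datatype is not None:
        # Expand a prefixed datatype such as "xsd:string" using the context.
        prefix, sep, local = datatype.partition(":")
        if sep and isinstance(context.get(prefix), str):
            datatype = context[prefix] + local
    return definition["@id"], datatype


def part_chain(node):
    """Follow nested hasPart relations and collect the object names in order."""
    names = []
    while node is not None:
        names.append(node["name"])
        node = node.get("hasPart")
    return names


print(resolve_term(document["@context"], "abstract"))
print(part_chain(document["@graph"][0]))
```

This simplified traversal assumes each object has at most one part; a real graph may list several parts per object and would need a tree walk instead.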

### 4. Choosing the file format
