Improve metadata handling #89

jlaehne · 2023-02-16T10:15:17Z

Describe the functionality you would like to see.

As brought up by @francisco-dlp in LumiSpy/lumispy#53 (comment), it would be desirable to have more universal metadata handling. Currently, metadata is mapped from original_metadata in every file_reader independently, following the HyperSpy conventions. If other packages want to build on RosettaSciIO, this is not the most convenient approach. It also involves a lot of redundant code. Instead, we could, for example, use something like YAML files to define the mapping; each folder could then include a hyperspy.yaml, and potentially other mapping files for other applications.

Of course, metadata mapping is not always 1:1 (a node from one tree mapped directly to a position in the other metadata tree), a case that can be handled with a basic dictionary. The mapping definition would need to cover several additional situations:

- if/elif/else-like statements, where a certain field in original_metadata decides which other field is mapped or which string/value is set in a certain node of metadata
- processing the content of a field with Python (e.g. one-line code segments), such as unit conversion or calculating an overall exposure time from multiple acquisitions (number of frames x time per frame)
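
The three situations above (direct 1:1 mapping, if/elif/else-like selection, and a computed value) could be sketched roughly as follows. This is only an illustration, not an existing RosettaSciIO API: the `MAPPING` structure, the rule keys (`source`, `cases`, `compute`), the helper functions and all field names in the example header are assumptions. A plain Python dict stands in for what a YAML mapping file might load into, and a lambda stands in for a "one-line code segment".

```python
from typing import Any


def get_path(tree: dict, path: str, default: Any = None) -> Any:
    """Return the value at a dot-separated path in a nested dict."""
    node: Any = tree
    for key in path.split("."):
        if not isinstance(node, dict) or key not in node:
            return default
        node = node[key]
    return node


def set_path(tree: dict, path: str, value: Any) -> None:
    """Set a value at a dot-separated path, creating intermediate nodes."""
    *parents, leaf = path.split(".")
    node = tree
    for key in parents:
        node = node.setdefault(key, {})
    node[leaf] = value


# Hypothetical rules: target node in metadata -> how to obtain its value.
MAPPING: dict = {
    # Direct 1:1 mapping.
    "General.title": {"source": "Header.SampleName"},
    # if/elif/else-like: the value of one field selects the string to set.
    "Signal.signal_type": {
        "source": "Header.Mode",
        "cases": {"CL": "CL", "PL": "Luminescence"},
        "default": "",
    },
    # Computed: total exposure = number of frames x time per frame (ms -> s).
    "Acquisition_instrument.Detector.exposure_time": {
        "sources": ["Header.Frames", "Header.TimePerFrame_ms"],
        "compute": lambda frames, ms: frames * ms / 1000.0,
    },
}


def apply_mapping(original_metadata: dict, mapping: dict) -> dict:
    """Build a metadata tree from original_metadata using the given rules."""
    metadata: dict = {}
    for target, rule in mapping.items():
        if "compute" in rule:
            args = [get_path(original_metadata, s) for s in rule["sources"]]
            if any(a is None for a in args):
                continue  # skip targets whose inputs are missing
            value = rule["compute"](*args)
        else:
            value = get_path(original_metadata, rule["source"])
            if value is None:
                continue
            if "cases" in rule:
                value = rule["cases"].get(value, rule.get("default"))
        set_path(metadata, target, value)
    return metadata


# Made-up header, purely for illustration.
original = {
    "Header": {
        "SampleName": "GaN nanowire",
        "Mode": "CL",
        "Frames": 10,
        "TimePerFrame_ms": 50,
    }
}
metadata = apply_mapping(original, MAPPING)
```

Keeping the rules as data rather than code is what would allow several mapping files (one per target application) to sit next to the same reader.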

The developers of the https://github.com/nomad-coe/nomad repository/ELN have implemented a similar functionality based on what they call "schemas". Maybe, we can team up with them @markus1978, @haltugyildirim to implement such a mapping in RosettaSciIO, as the possibility to read in a number of (partly binary) data formats provided by RosettaSciIO should in turn be valuable to Nomad in order to support a broader range of experiments and to integrate processing via e.g. HyperSpy.

Additional information

Should not hold back an initial release, but should be on the roadmap.


francisco-dlp · 2023-02-16T15:31:28Z

Thanks @jlaehne for bringing back this important topic.

Indeed RosettaSciIO does map all metadata to HyperSpy's metadata specification. This comes with the advantage that it can translate all mapped metadata across formats (hence the link with the Rosetta Stone), but it is an overhead when this is not required. Therefore, it should be an optional feature (task 1).

As you rightly point out, the mapping to HyperSpy's metadata specification is not done very smartly. Ideally, one should be able to specify the mapping using an easy to maintain mapping specification file, e.g. in yaml (task 2). The task is far from trivial, and it is of interest beyond RosettaSciIO, so ideally it should be performed by an independent tool. Nomad's schemas seem like a good candidate.

Finally, HyperSpy's metadata specification is defined in the User Guide. It would be better to define the metadata using e.g. NeXus' specifications, or simply to switch to NeXus' EM format (task 3).

ericpre · 2023-02-17T14:14:23Z

Now that there is a NeXus definition for electron microscopy, it would be great to use it and provide feedback on its usability.

jat255 · 2023-04-13T16:27:31Z

I wanted to share a few links to maybe push this discussion along (I think this is a great idea and would be interested in helping work on it, as interoperability is a critical part of a mature data ecosystem):

- EM Glossary: https://codebase.helmholtz.cloud/em_glossary/em_glossary
  - The EM glossary group is working on a community standardized controlled vocabulary for electron microscopy terms. The NexusEM implementation is related to this effort (and the terms defined in the glossary I believe are informing what goes into NexusEM)
  - In some initial conversations with members of that group, they're very interested to see software-level implementations of the glossary to see how it works in "the real world", and would be happy to have HyperSpy/RScIO involved
- Scythe EM metadata schema: https://github.com/materials-data-facility/scythe/blob/master/scythe/schemas/electron_microscopy.json
  - The Scythe project's goal is to provide a shared resource of metadata 'extractors' (not EM specific) that are each controlled by a schema
  - I worked briefly on implementing a prototype EM JSONSchema (linked above) and extractor, but the project as a whole has stalled a bit due to lack of person-power; regardless of technology (i.e. JSONSchema or something else), I think formally specifying our metadata schema (or using another's) would be very powerful as it allows for real-time and automated validation of metadata structures
  - My example makes heavy use of HyperSpy for the mechanics of reading metadata, but the values to map in/out are manually written at this point and very simplistic. I like the idea of some sort of standardized .yaml file or something else to define mappings.
- MaRDA metadata extraction working group: https://github.com/marda-alliance/metadata_extractors
  - This is not an actual implementation of anything, but is a recently launched effort from the Materials Research Data Alliance to attempt to coordinate efforts on metadata extraction (just wanted to make people aware of it)

CSSFrancis · 2023-04-13T19:24:22Z

@jat255 These are all great resources. It does seem like there is a fair bit of duplication of effort occurring in the community, and it would be good to get ahead of that. Is there any way we can bring more people into the fold or integrate packages?

Developer time in the microscopy community seems to be very limited so anything we can do to reduce duplication is very valuable!

Maybe a meeting with all interested parties would help to get the ball rolling.

CSSFrancis · 2023-04-17T17:45:13Z

@jat255 It seems like it might also be worthwhile to send someone to a MaRDA meeting. I can attend, but I don't know if I am the most qualified person to represent rosettasciio.

CSSFrancis · 2024-10-23T13:30:46Z

Can we potentially have a meeting about this?

I think that this has come up in a couple of places and it would be good to formalize it.

@jat255
@francisco-dlp
@sk1p
@ericpre
@jlaehne

(anyone else who might be interested?)

jlaehne · 2024-10-23T13:54:45Z

@mkuehbach @markus1978 who could represent FAIRmat in such a discussion? As you have started using RSciIO for some import functionalities, I guess a streamlining of metadata handling should be interesting for you ... on the other hand the nexus definitions for electron microscopy come from FAIRmat and could be the future way to go for a metadata convention also for HyperSpy/RosettaSciIO.

mkuehbach · 2024-10-23T14:20:03Z

@jlaehne I will represent FAIRmat on this topic.
Since I report to markus1978, we will of course keep him in the loop on all topics that go substantially beyond the scope of EM and touch on NOMAD.

Has there been a concrete time slot defined @jlaehne @francisco-dlp @CSSFrancis ? Please put me in the loop, thank you.

This issue is specifically about metadata, so one of the first items on the agenda of such a meeting should be to define what each party is interested in and willing to contribute, the desired time frame, and how deep the work should go and how strong the semantics should be.

I am aware that much of the metadata in the hyperspy ecosystem, and for file formats supported by rosettasciio, touches on fields other than EM. If we make EM the case used to prototype how to move the discussion forward, sure, let's do it.

For the specific case of electron microscopy (EM) in FAIRmat, rosettasciio is one of the I/O library components, and we very much appreciate the efforts of the hspy team on it. However, the topic of this thread goes significantly beyond reading files.

There is indeed already a substantial number of use cases where the mappings suggested above are defined and used, and these would profit from becoming more professionalized, which is a topic for us in Q1 and Q2 2025 for EM and APM. In the German National Research Data Infrastructure there are already specific suggestions for how to deal with such mappings, and as far as I can tell these are what will be considered for electron microscopy and atom probe.

@francisco-dlp Prof. Christoph Koch told me that you discussed briefly with him at this year's EMC in Copenhagen how to work more closely together here. We are interested in this.

mkuehbach · 2024-10-23T14:42:28Z

@jat255 I have worked through the data schema underlying the NexusLIMS model. There are many good connections one could draw to NeXus here, and for most terms the mapping is straightforward. Several of the microscopes handled in NexusLIMS write technology-partner-specific files, and a mapping has been implemented for some of the tech-partner-specific concepts.

One more thought, @CSSFrancis @francisco-dlp: I think it would be good in that meeting to have everybody first express their aims. The comments on this issue touch on a number of projects whose aims can be expected to differ, so in my personal opinion it is essential to understand each project's concerns.

I really like that this issue already shows there is interest in going beyond reading from one serialization and writing out to another. Enough effort has been spent on that; unfortunately, the topic is often seen mainly as a format serialization question, but it is more than that.

sk1p · 2024-10-23T15:38:51Z

> Can we potentially have a meeting about this?
>
> I think that this has come up in a couple of places and it would be good to formalize it.

Thanks for the ping; I'm interested in joining a call on this topic.

@mkuehbach wrote:

> it would be good in that meeting also to have everybody first express their aims as all comments on this issue here touch on a number of projects for which it is expected that there are differences in aims and thus it is essential in my personal opinion to understand each project's concerns.

Agreed; I posted some of my own motivation/aims for working on this topic in this discussion: hyperspy/hyperspy#3431

In addition, in that thread @uellue also showed a possible technical solution for declaring and serializing metadata, which is similar to the json schema idea posted upthread, with a focus on including pint units for all values.

ericpre · 2024-10-24T17:50:56Z

This all sounds good.
Regarding arranging a meeting, the easiest may be to do a doodle. Normal working hours in CET and Boulder time (for @jat255) only give us a narrow window: possibly 4-5pm CET!

@CSSFrancis, I think you are very well placed to lead and coordinate this, are you happy to do that? 🙂

@jat255, since you are involved in MaRDA, would you be able to give an overview of the aspects of the MaRDA initiative relevant to the discussion here?

CSSFrancis · 2024-10-24T19:41:52Z

> @CSSFrancis, I think you are very well placed to lead and coordinate this, are you happy to do that? 🙂

Sure. If people want to share their availability here, I can choose a time and we can set up a meeting either next week or the week after. I'll send out an agenda with the meeting invitation once we have decided on a time. If people have things that they want me to put on the agenda, please let me know.

jlaehne · 2024-10-25T07:55:08Z

For the future, the German research landscape has several privacy-friendly alternatives to Doodle. One is:
https://terminplaner6.dfn.de/
(but for this time let's stay with Doodle)

magnunor · 2024-10-25T08:53:35Z

@emichr (Emil Christiansen) and I are interested in this. I entered my availability in the Doodle, and Emil will join if he is available.

TomaSusi · 2024-10-25T13:43:24Z

Hi! This is a very important topic for the entire community. I've been following FAIRmat for a while, and they seem to be the most mature project in this space. Personally, I might help abTEM become compliant with whatever is settled upon.

I'm in Japan currently so the timezones are a bit tricky, but I put in any possible availabilities to the Doodle.

ercius · 2024-10-25T16:59:48Z

I'm very interested in this as well and happy to support it. I'll put my schedule in Doodle.

mkuehbach · 2024-10-25T18:08:05Z

> I wanted to share a few links to maybe push this discussion along (I think this is a great idea and would be interested in helping work on it, as interoperability is a critical part of a mature data ecosystem):
>
> * EM Glossary: https://codebase.helmholtz.cloud/em_glossary/em_glossary
>
>   * The EM glossary group is working on a community standardized controlled vocabulary for electron microscopy terms. The NexusEM implementation is related to this effort (and the terms defined in the glossary _I believe_ are informing what goes into NexusEM)
>   * In some initial conversations with members of that group, they're very interested to see software-level implementations of the glossary to see how it works in "the real world", and would be happy to have HyperSpy/RScIO involved
>
> * Scythe EM metadata schema: https://github.com/materials-data-facility/scythe/blob/master/scythe/schemas/electron_microscopy.json
>
>   * The Scythe project's goal is to provide a shared resource of metadata 'extractors' (not EM specific) that are each controlled by a schema
>   * I worked briefly on implementing a prototype EM JSONSchema (linked above) and [extractor](https://github.com/materials-data-facility/scythe/blob/master/scythe/electron_microscopy.py), but the project as a whole has stalled a bit due to lack of person-power; regardless of technology (i.e. JSONSchema or something else), I think formally specifying our metadata schema (or using another's) would be very powerful as it allows for real-time and automated validation of metadata structures
>   * My example makes heavy use of HyperSpy for the mechanics of reading metadata, but the values to map in/out are manually written at this point and very simplistic. I like the idea of some sort of standardized .yaml file or something else to define mappings.
>
> * MaRDA metadata extraction working group: https://github.com/marda-alliance/metadata_extractors
>
>   * This is not an actual implementation of anything, but is a recently launched effort from the Materials Research Data Alliance to attempt to coordinate efforts on metadata extraction (just wanted to make people aware of it)

@jat255 NXem already uses the EM Glossary, currently via references where concepts match exactly. That will be professionalized at some point, though it is unclear when, and it will also be affected by what is decided here.

CSSFrancis · 2024-10-29T19:14:48Z

The key point of today's discussion was that we must have strongly typed metadata to best support interoperability. This includes definitions such as those in the EM Glossary (which we should co-opt when available), as well as information about the data type, units, and alternative names. I will also add that dynamic linkages between metadata fields are something to be considered (e.g. beam energy <--> wavelength in transmission electron microscopy).
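
As a rough sketch of what a strongly typed field with a unit, alternative names, and a dynamic linkage could look like, the snippet below uses only the Python standard library. The `FieldDefinition` class, its attributes, and the alias names are assumptions made for illustration and are not taken from the EM Glossary, NeXus, or any existing package. The linkage uses the standard relativistic relation between the kinetic energy of an electron and its de Broglie wavelength.

```python
import math
from dataclasses import dataclass

# Physical constants in SI units (CODATA 2018).
PLANCK_H = 6.62607015e-34            # J s
ELEMENTARY_CHARGE = 1.602176634e-19  # C
SPEED_OF_LIGHT = 299792458.0         # m / s
ELECTRON_MASS = 9.1093837015e-31     # kg


@dataclass(frozen=True)
class FieldDefinition:
    """Hypothetical typed metadata field: name, data type, unit, aliases."""

    name: str
    dtype: type
    unit: str
    aliases: tuple = ()
    definition: str = ""

    def validate(self, value):
        """Return the value unchanged, or raise if it has the wrong type."""
        if not isinstance(value, self.dtype):
            raise TypeError(
                f"{self.name} expects {self.dtype.__name__}, "
                f"got {type(value).__name__}"
            )
        return value


# Illustrative definition; the aliases are assumptions, not a vocabulary.
BEAM_ENERGY = FieldDefinition(
    name="beam_energy",
    dtype=float,
    unit="keV",
    aliases=("acceleration_voltage", "high_tension"),
    definition="Kinetic energy of the primary electrons.",
)


def wavelength_from_energy(energy_kev: float) -> float:
    """Relativistic electron wavelength in pm for a kinetic energy in keV."""
    kinetic = energy_kev * 1e3 * ELEMENTARY_CHARGE  # J
    rest = ELECTRON_MASS * SPEED_OF_LIGHT**2        # J
    momentum = math.sqrt(kinetic * (kinetic + 2.0 * rest)) / SPEED_OF_LIGHT
    return PLANCK_H / momentum * 1e12


def energy_from_wavelength(wavelength_pm: float) -> float:
    """Inverse linkage: kinetic energy in keV for a wavelength in pm."""
    momentum_c = PLANCK_H * SPEED_OF_LIGHT / (wavelength_pm * 1e-12)  # J
    rest = ELECTRON_MASS * SPEED_OF_LIGHT**2
    kinetic = math.sqrt(momentum_c**2 + rest**2) - rest
    return kinetic / ELEMENTARY_CHARGE / 1e3


energy = BEAM_ENERGY.validate(200.0)
wavelength = wavelength_from_energy(energy)  # roughly 2.51 pm at 200 keV
```

In a real schema the unit handling would more likely be delegated to a units library such as pint, as mentioned elsewhere in this thread; the plain string unit here only keeps the sketch dependency-free.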

From a more pragmatic point of view the most important thing to do is define a schema and then help implement it across a wide breadth of different scientific packages within the community. While the schema can be fluid between different packages/ domains, there should be firm expectations that all metadata is well-defined and documented.

Comments from Discussion:
There is potentially some confusion about the NeXus file format, especially regarding where the file format is defined:

http://www.nexusformat.org/

With specific definitions here:

https://github.com/FAIRmat-NFDI/nexus_definitions

@mkuehbach can you link to the HTML-rendered version of the EM nexus metadata definitions as a starting point?

The NeXus file format is based on XML. The consensus seems to be that the question is largely how we can take the metadata defined by the NeXus file format and convert it to a form that is compatible with existing HDF5/zarr file formats.
From the point of view of loading binary files, a better mapping from the current metadata to a well-defined metadata schema seems like a very important task.

Where to go from here:

@jat255 will help define a schema. I can help to some degree, and I think @sk1p should be able to devote some time to helping as well. @uellue and @sk1p have already started defining a schema, and Josh suggested LinkML as a potentially easier way to define a schema.

Ideally, the metadata schema would be something defined individually by each extension package, as HyperSpy should remain domain-agnostic. My hesitation with this approach is that with the added freedom and flexibility comes the potential for two domains, or even two packages within the same domain, to call the same thing by different names or define it differently. I'd suggest a strong preference for metadata parameter reuse between techniques where appropriate, and a central repository for defining metadata parameters for each technique.

Secondly, many metadata standardization efforts have limited success, partially due to a lack of support for the community of project maintainers. It would be nice to help implement support in py4dstem, hyperspy et al., abtem, LiberTEM, etc. One standard, lightweight, metadata-focused package might be one option. The Dictionary Tree Browser in HyperSpy might be useful as a standalone package.

mkuehbach · 2024-10-29T19:57:31Z

There are two relevant repositories for NeXus:

- https://github.com/nexusformat/definitions: its main branch contains an earlier version of NXem. An update has been discussed, and NXem will become an official NIAC-approved standard as soon as "Fairmat 2024: proposal on electron microscopy (EM)" (nexusformat/definitions#1423) has been merged. That will happen once all remaining dependent standardization activity branches (see the open PRs) have been reviewed and merged.
- https://github.com/FAIRmat-NFDI/nexus_definitions: its fairmat branch contains the latest developments. Here, updates to the latest NIAC-accepted standardized version are kept and discussed, to be batch-proposed for future updates of the NeXus standard.

An automated process renders the fairmat branch as human-readable documentation, accessible at: https://fairmat-nfdi.github.io/nexus_definitions/

@CSSFrancis @jat255 @ericpre @sk1p @magnunor @emichr @mkuehbach attended today's meeting.

mkuehbach · 2024-10-29T20:06:14Z

> My hesitation with this approach is that with the added freedom and flexibility comes the potential for two domains or even two packages within the same domain to call the same thing different names/define differently.

Use a mapping approach with strong semantics like
`namespace:concept namespace:predicate namespace:concept`

E.g. `hyperspy:metadata_axis namespace:isEquivalentTo othernamespace:name_of_the_concept_that_metadata_axis_is_equivalent_to`
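
One way to picture such subject/predicate/object statements in code is the minimal sketch below. The namespaces, concept names, and the `map:isEquivalentTo` predicate are placeholders invented for this illustration and do not refer to any published vocabulary; a real implementation would more likely build on an RDF library and persistent identifiers.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    """One statement: subject, predicate, object, each as 'namespace:name'."""

    subject: str
    predicate: str
    obj: str


EQUIVALENT = "map:isEquivalentTo"  # placeholder predicate name

# Placeholder statements; none of these terms are real identifiers.
TRIPLES = [
    Triple("hyperspy:beam_energy", EQUIVALENT, "othernamespace:acceleration_voltage"),
    Triple("hyperspy:exposure_time", EQUIVALENT, "othernamespace:dwell_time"),
]


def parse_term(term: str) -> tuple:
    """Split 'namespace:name' into its two parts, rejecting malformed terms."""
    namespace, sep, name = term.partition(":")
    if not sep or not namespace or not name:
        raise ValueError(f"expected 'namespace:name', got {term!r}")
    return namespace, name


def equivalents(concept: str, triples: list) -> set:
    """Concepts directly declared equivalent to `concept`, in either direction."""
    found = set()
    for triple in triples:
        if triple.predicate != EQUIVALENT:
            continue
        if triple.subject == concept:
            found.add(triple.obj)
        elif triple.obj == concept:
            found.add(triple.subject)
    return found


matches = equivalents("hyperspy:beam_energy", TRIPLES)
```

The lookup is symmetric but deliberately not transitive; whether equivalence should chain across several vocabularies is a design decision this sketch leaves open.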

The use of the phrase "easy"

Just stop using it, as it is suggestive and vague: there is no consensus among all hyperspy users and developers on what they feel or consider to be "easy". We all have different backgrounds, skill sets, and learning curves.

The use of the term "schema"

That term is used too vaguely here. In the meeting we discussed, as examples, two types of schema whose development is related but distinct:

- The data schema underlying e.g. a file format or data model for HyperSpy (along the lines of LiberTEM's pydantic + pint idea)
- Schemas that specify how versioned concepts relate to other versioned concepts, to work towards a productive coexistence of standards if we cannot agree on using fewer standards (although OMERO also states that using fewer standards is exactly what we should work towards: https://www.openmicroscopy.org/2019/06/25/formats.html)

With respect to parsing to and from a large number of examples

This is what the www.github.com/FAIRmat-NFDI/pynxtools tool and its www.github.com/FAIRmat-NFDI/pynxtools-em plugin are actively pursuing for the EM examples discussed here, powered by rosettasciio, hyperspy, and other software tools outside the hyperspy ecosystem.

mkuehbach · 2024-10-29T20:22:19Z

@Pepe-Marquez FYI, in case this is relevant for you

mkuehbach · 2024-10-29T20:25:11Z

One more point, and it is low-hanging fruit: rigorous and transparent versioning of the individual tools at the commit and revision level rather than only the release version; use of persistent identifiers for the concepts used in one's documentation; and a transparent listing of which format versions are supported by specific versions of the hyperspy tools, especially rosettasciio.

jlaehne added the type: proposal label Feb 16, 2023

jlaehne mentioned this issue Feb 17, 2023

Update metadata convention hyperspy/hyperspy#3093

Open

7 tasks

francisco-dlp mentioned this issue Oct 6, 2023

Release 1.7.x and 2.0.0 hyperspy/hyperspy#2996

Closed

57 tasks

jlaehne mentioned this issue Dec 21, 2023

Connection with and contribution to the MaRDA metadata extractors registry #207

Open

PeterKraus mentioned this issue Jun 10, 2024

original_metadata: Consistent namespacing of metadata dgbowl/yadg#164

Closed
