Improve metadata handling #89
Comments
Thanks @jlaehne for bringing back this important topic. Indeed RosettaSciIO does map all metadata to HyperSpy's metadata specification. This comes with the advantage that it can translate all mapped metadata across formats (hence the link with the Rosetta Stone), but it is an overhead when this is not required. Therefore, it should be an optional feature (task 1). As you rightly point out, the mapping to HyperSpy's metadata specification is not done very smartly. Ideally, one should be able to specify the mapping using an easy-to-maintain mapping specification file, e.g. in yaml (task 2). The task is far from trivial, and it is of interest beyond RosettaSciIO, so ideally it should be performed by an independent tool. Nomad's schemas seem like a good candidate. Finally, HyperSpy's metadata specification is defined in the User Guide. It would be better to define the metadata using e.g. NeXus' specifications or simply switch to NeXus' EM microscopy format (task 3). |
Now that there is a NeXus definition for electron microscopy, it would be great to use it and provide feedback on its usability. |
I wanted to share a few links to maybe push this discussion along (I think this is a great idea and would be interested in helping work on it, as interoperability is a critical part of a mature data ecosystem):
|
@jat255 These are all great resources. It does seem like there is a fair bit of duplication of effort occurring in the community and it would be good to get ahead of that. Is there any way we can bring more people into the fold or integrate packages? Developer time in the microscopy community seems to be very limited, so anything we can do to reduce duplication is very valuable! Maybe a meeting with all interested parties would help to get the ball rolling. |
@jat255 It seems like it might also be worthwhile to send someone to a MaRDA meeting. I can attend, but don't know if I am the most qualified person to represent rosettasciio. |
Can we potentially have a meeting about this? I think that this has come up in a couple of places and it would be good to formalize. @jat255 (anyone else who might be interested?) |
@mkuehbach @markus1978 who could represent FAIRmat in such a discussion? As you have started using RSciIO for some import functionalities, I guess a streamlining of metadata handling should be interesting for you ... on the other hand the nexus definitions for electron microscopy come from FAIRmat and could be the future way to go for a metadata convention also for HyperSpy/RosettaSciIO. |
@jlaehne I will represent FAIRmat on this topic. Has a concrete time slot been defined, @jlaehne @francisco-dlp @CSSFrancis? Please keep me in the loop, thank you. This issue here is specifically about metadata, one of the first parts on the agenda of such a meeting should be to I am aware that several of the metadata in the hyperspy ecosystem and for file formats supported by rosettasciio touch on fields other than EM. Making EM a case to prototype how to move the discussion further: sure, let's do it. For the specific case of electron microscopy (EM), rosettasciio is one of the I/O library components in FAIRmat, and we very much appreciate the efforts of the hspy team on this one. However, the topic of this thread goes significantly beyond reading of files. There is indeed already a substantial number of use cases where the above-suggested mappings are defined and used, and these would profit from becoming more professionalized, which is a topic for us in Q1 and Q2 2025 for EM and APM. In the German National Research Data Infrastructure there are already specific suggestions for how to deal with such mappings, which is what will be considered in electron microscopy and atom probe as far as I can tell. @francisco-dlp Prof. Christoph Koch told me that you discussed with him briefly at this year's Copenhagen EMC how to work closer together here. We are interested in this. |
@jat255 I have worked through the data schema underlying the NexusLIMS model. There are many good connections one could draw to NeXus here, and for most terms the mapping is straightforward. Several of the microscopes handled in NexusLIMS write technology-partner-specific files, and for some of the tech-partner-specific concepts a mapping has been implemented. One more thought, @CSSFrancis @francisco-dlp: I think it would be good in that meeting to have everybody first express their aims. The comments on this issue touch on a number of projects whose aims can be expected to differ, so in my personal opinion it is essential to understand each project's concerns. I really do like that this issue already substantiates that there is interest in going beyond reading from one serialization and writing out to another. Enough effort has been spent on that; unfortunately, the topic is often seen mainly as a format serialization question, but it is more. |
Thanks for the ping; I'm interested in joining a call on this topic. @mkuehbach wrote:
Agreed; I posted some of my own motivation/aims for working on this topic in this discussion: hyperspy/hyperspy#3431 In addition, in that thread @uellue also showed a possible technical solution for declaring and serializing metadata, which is similar to the json schema idea posted upthread, with a focus on including pint units for all values. |
This all sounds good. @CSSFrancis, I think you are very well placed to lead and coordinate this, are you happy to do that? 🙂 @jat255, since you are involved in MaRDA, would you be able to give an overview of the aspects of the MaRDA initiative relevant to the discussion here? |
Sure, if people want to share their availability here I can choose a time and we can set up a meeting either next week or the week after. I'll send out an agenda with the meeting invite once we have a time decided, although if people have things that they want me to put on the agenda please let me know. |
For the future, German research landscape has several privacy friendly alternatives to doodle, one is: |
@emichr (Emil Christiansen) and I are interested in this. I entered my availability in the Doodle, and Emil will join if he is available. |
Hi! This is a very important topic for the entire community. I've been following FAIRmat for a while, and they seem to be the most mature project in this space. Personally I might help abTEM be compliant with whatever is settled upon. I'm in Japan currently so the timezones are a bit tricky, but I put all my possible availabilities into the Doodle. |
I'm very interested in this as well and happy to support it. I'll put my schedule in the Doodle. |
@jat255 NXem already uses the EM Glossary, currently via references, where concepts match exactly. That will be professionalized at some point, though it is unclear when, and it is also not unaffected by what is decided here. |
The key point of today's discussion was that we must have strongly typed metadata to best support interoperability. This includes definitions such as those in the EM Glossary (which we should co-opt when available), as well as information about the data type, units, and alternative names. I will also add that dynamic linkages between metadata fields are something to be considered (e.g., beam energy <--> wavelength in transmission electron microscopy).
From a more pragmatic point of view, the most important thing to do is define a schema and then help implement it across a wide breadth of different scientific packages within the community. While the schema can be fluid between different packages/domains, there should be firm expectations that all metadata is well-defined and documented.
Comments from Discussion:
With specific definitions here: https://github.com/FAIRmat-NFDI/nexus_definitions
@mkuehbach can you link to the HTML-rendered version of the EM nexus metadata definitions as a starting point?
The Nexus File format is based on XML. The consensus seems to largely be about how we can take the metadata defined by the NeXus file format and convert it to a form that works with existing HDF5/zarr file formats in a compatible way.
Where to go from here:
@jat255 will help define a schema. I can help to some degree and I think @sk1p should be able to devote some time to helping as well. @uellue and @sk1p have already started defining a schema, and Josh suggested linkml as a potentially easier way to define a schema.
Ideally, the metadata schema would be something defined individually by each extension package, as HyperSpy should remain domain-agnostic. My hesitation with this approach is that with the added freedom and flexibility comes the potential for two domains, or even two packages within the same domain, to call the same thing by different names or define it differently.
I’d suggest a strong preference for metadata parameter reuse between techniques where appropriate and a central repository for defining metadata parameters for each technique. Secondly, many metadata standardization efforts have limited success partially due to a lack of support for the community of project maintainers. It would be nice to help implement support in py4dstem, hyperspy et al., abtem, LiberTEM etc. One standard, lightweight metadata-focused package might be one option. The Dictionary Tree Browser in HyperSpy might be useful as a standalone package. |
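To make the idea of strongly typed metadata with linked fields more concrete, here is a minimal sketch in Python. It is not an existing API of HyperSpy, RosettaSciIO or NeXus: the class `MetadataField`, the helper `resolve`, the field names, and the aliases are all hypothetical and only illustrate a definition that carries a data type, unit, description and alternative names, plus one derived quantity (the relativistic electron wavelength computed from the beam energy).

```python
import math
from dataclasses import dataclass

# Physical constants in SI units (CODATA 2018).
PLANCK = 6.62607015e-34             # J s
ELECTRON_MASS = 9.1093837015e-31    # kg
ELEMENTARY_CHARGE = 1.602176634e-19 # C
SPEED_OF_LIGHT = 299792458.0        # m / s


@dataclass(frozen=True)
class MetadataField:
    """Hypothetical typed metadata definition (illustrative, not a real API)."""

    name: str
    dtype: type
    unit: str
    definition: str
    aliases: tuple = ()

    def validate(self, value):
        # Coerce to the declared type; raises ValueError/TypeError if impossible.
        return self.dtype(value)


# Field name and aliases are assumptions made for this example only.
BEAM_ENERGY = MetadataField(
    name="beam_energy",
    dtype=float,
    unit="keV",
    definition="Kinetic energy of the primary electrons.",
    aliases=("high_tension", "primary_energy"),
)
FIELDS = (BEAM_ENERGY,)


def resolve(name, fields=FIELDS):
    """Return the field whose canonical name or alias matches `name`."""
    for f in fields:
        if name == f.name or name in f.aliases:
            return f
    raise KeyError(name)


def electron_wavelength_pm(beam_energy_kev):
    """Relativistic de Broglie wavelength in picometres for an energy in keV."""
    energy = beam_energy_kev * 1e3 * ELEMENTARY_CHARGE  # kinetic energy in J
    rest_energy = ELECTRON_MASS * SPEED_OF_LIGHT**2     # rest energy in J
    momentum = math.sqrt(
        2 * ELECTRON_MASS * energy * (1 + energy / (2 * rest_energy))
    )
    return PLANCK / momentum * 1e12
```

With such a definition, a reader that finds a vendor key like `high_tension` could look up the canonical field via `resolve("high_tension")`, coerce the value with `validate`, and derive the wavelength on demand rather than storing it as a second, possibly inconsistent, entry.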
There are two relevant repositories for NeXus:
An automation creates a version of the fairmat branch for human consumption as documentation.
@CSSFrancis @jat255 @ericpre @sk1p @magnunor @emichr @mkuehbach attended today's meeting. |
Use a mapping approach with strong semantics, e.g. `hyperspy:metadata_axis namespace:isEquivalentTo othernamespace:name_of_the_concept_that_metadata is equivalent to`
Just stop using "easy", as it is suggestive and vague: there is no consensus among all hyperspy users and developers on what they feel/consider is "easy" or not. We all have different backgrounds, skillsets, and learning curves.
That is used too vaguely here; instead, we talked, by way of example, about two types of schema whose development is related but also different:
This is what the www.github.com/FAIRmat-NFDI/pynxtools tool and its www.github.com/FAIRmat-NFDI/pynxtools-em plugin are actively pursuing for the EM examples discussed here, powered by rosettasciio, hyperspy and other software tools outside the hyperspy ecosystem. |
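The mapping approach with strong semantics suggested in this comment can be illustrated with a small sketch. It assumes mappings are stored as (subject, predicate, object) triples and that `isEquivalentTo` is symmetric and transitive; the namespaces and term names below are made up for illustration and do not refer to actual HyperSpy or NXem concept identifiers.

```python
from collections import defaultdict

# Hypothetical triples; every namespace and term name here is illustrative.
TRIPLES = [
    ("hyperspy:beam_energy", "isEquivalentTo", "otherns:primary_energy"),
    ("otherns:primary_energy", "isEquivalentTo", "vendor:HT"),
    ("hyperspy:beam_energy", "isRelatedTo", "hyperspy:wavelength"),
]


def equivalents(term, triples):
    """Return all terms reachable from `term` via `isEquivalentTo` links."""
    graph = defaultdict(set)
    for subject, predicate, obj in triples:
        if predicate == "isEquivalentTo":
            # Treat equivalence as symmetric.
            graph[subject].add(obj)
            graph[obj].add(subject)
    seen = {term}
    stack = [term]
    while stack:
        current = stack.pop()
        for neighbour in graph.get(current, ()):
            if neighbour not in seen:
                seen.add(neighbour)
                stack.append(neighbour)
    seen.discard(term)
    return seen
```

Keeping the predicate explicit means weaker relations (here the made-up `isRelatedTo`) can live in the same table without being mistaken for exact matches.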
@Pepe-Marquez eventually FYI |
One more point, and a low-hanging one: rigorous and transparent versioning of the individual tools at the commit and revision level rather than the version level, use of persistent identifiers for concepts used in one's documentation, and a transparent listing of which format versions are supported by specific versions of the hyperspy tools, especially rosettasciio. |
Describe the functionality you would like to see.
As brought up by @francisco-dlp in LumiSpy/lumispy#53 (comment), it would be desirable to have a more universal metadata handling. Currently, `metadata` is mapped from `original_metadata` in every `file_reader` independently, following the HyperSpy conventions. If other packages want to build on RosettaSciIO, this is not the most convenient. It also involves a lot of redundant code. Instead, we could for example use something like `yaml` files to define the mapping, and then each folder could include a `hyperspy.yaml`, but potentially also other mapping files for other applications.
Of course, metadata mapping is not always 1:1 (a node from one tree mapped directly to a position in the other metadata tree), which is the case that can be handled with a basic dictionary. The mapping definition would need to cover several extra situations:
- an `if/elif/else`-like statement, where a certain field in `original_metadata` can decide which other field is mapped or what string/value is set in a certain node of `metadata`
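As a rough sketch of what such a declarative mapping could look like, the example below combines direct 1:1 mappings with an `if/elif/else`-like conditional rule. The specification is written as a plain Python dictionary standing in for the parsed content of a per-reader `hyperspy.yaml`; the layout of the specification, the `original_metadata` paths, and the function names are assumptions for illustration, not existing RosettaSciIO functionality.

```python
# Hypothetical mapping specification, e.g. the parsed content of a
# `hyperspy.yaml`. The source paths on the left are invented for this example.
MAPPING_SPEC = {
    "direct": {
        # original_metadata path -> metadata path
        "Microscope.Voltage": "Acquisition_instrument.TEM.beam_energy",
        "Microscope.Magnification": "Acquisition_instrument.TEM.magnification",
    },
    "conditional": [
        {
            # The value found at `source` selects what is written to `target`.
            "source": "Microscope.Mode",
            "target": "Signal.signal_type",
            "cases": {"EELS": "EELS", "EDS": "EDS_TEM"},
            "default": "",
        }
    ],
}


def get_path(tree, path, default=None):
    """Read a dotted path from a nested dictionary."""
    node = tree
    for key in path.split("."):
        if not isinstance(node, dict) or key not in node:
            return default
        node = node[key]
    return node


def set_path(tree, path, value):
    """Write a value at a dotted path, creating intermediate nodes."""
    *parents, leaf = path.split(".")
    node = tree
    for key in parents:
        node = node.setdefault(key, {})
    node[leaf] = value


def map_metadata(original_metadata, spec):
    """Build a metadata dictionary from original_metadata using `spec`."""
    metadata = {}
    for source, target in spec.get("direct", {}).items():
        value = get_path(original_metadata, source)
        if value is not None:
            set_path(metadata, target, value)
    for rule in spec.get("conditional", []):
        value = get_path(original_metadata, rule["source"])
        if value is None:
            continue  # source field absent: leave the target unset
        set_path(metadata, rule["target"], rule["cases"].get(value, rule["default"]))
    return metadata
```

For example, `map_metadata({"Microscope": {"Voltage": 200.0, "Mode": "EDS"}}, MAPPING_SPEC)` would fill `Acquisition_instrument.TEM.beam_energy` and set `Signal.signal_type` to `"EDS_TEM"`. Other applications could ship their own specification next to the reader without touching the reader code.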
The developers of the https://github.com/nomad-coe/nomad repository/ELN have implemented a similar functionality based on what they call "schemas". Maybe, we can team up with them @markus1978, @haltugyildirim to implement such a mapping in RosettaSciIO, as the possibility to read in a number of (partly binary) data formats provided by RosettaSciIO should in turn be valuable to Nomad in order to support a broader range of experiments and to integrate processing via e.g. HyperSpy.
Additional information
Should not hold back an initial release, but should be on the roadmap.