
Verification #133

Closed · wants to merge 81 commits

Conversation

@domna (Collaborator) commented Jun 22, 2023

This is a simple verification. I just load everything from a provided file and write it into a dict. There are still several issues:

  • The checker should run through and show all errors instead of stopping at the first invalid field (maybe even a good feature for writing a file?)
  • Large datasets shouldn't actually be loaded; they should rather be checked for type, shape and dimension as provided by h5py's dtype, shape and ndim (see the sketch after this list)
  • Units are not checked, since pynxtools does not check them either
  • Check NXtransformation units
  • Types are not checked (?)
  • Tests
  • Timestamp validity (with timezone!)
  • Follow the logging routine in the validator instead of throwing errors
  • Internal representation for reader errors (wrapper class for logging?)
  • Follow links and check if the type matches
  • NXdata validation
  • Loaded fields should be properly namefitted
  • Report on namefitting, i.e., report if a field name is not properly fitted or is generally invalid (e.g., it contains whitespace or capitalized words)

@domna (Collaborator, Author) commented Jun 22, 2023

This nyaml entry, SOLUTE(NXsample_component), still gives an error. The template verification demands the path

ValueError: The data entry corresponding to /ENTRY[entry]/SAMPLE[sample]/SOLUTE/solvent 
is required and hasn't been supplied by the reader.

while the script here adds this key:

"/ENTRY[entry]/SAMPLE[sample]/SAMPLE_COMPONENT[solute2]/solvent": "solvent1"

Which one of these should it be: SOLUTE[solute2], SAMPLE_COMPONENT[solute2] or SOLUTE?

@domna (Collaborator, Author) commented Jun 22, 2023

@sherjeelshabih @sanbrock What do you think of this approach? It is roughly working already (except for the error in the previous comment, but I'm actually not sure if this is a general bug in the template generation, @sherjeelshabih?), and IMO it would be helpful to have a verification tool for NeXus files. If you think this is a good path, I would continue adding more verification features in other parts of the code (i.e., the points listed above) as well as tests.

@domna marked this pull request as draft June 22, 2023 15:33
@sherjeelshabih (Collaborator) commented:

> This nyaml entry, SOLUTE(NXsample_component), still gives an error. The template verification demands the path
>
> ValueError: The data entry corresponding to /ENTRY[entry]/SAMPLE[sample]/SOLUTE/solvent 
> is required and hasn't been supplied by the reader.
>
> while the script here adds this key:
>
> "/ENTRY[entry]/SAMPLE[sample]/SAMPLE_COMPONENT[solute2]/solvent": "solvent1"
>
> Which one of these should it be: SOLUTE[solute2], SAMPLE_COMPONENT[solute2] or SOLUTE?

I think SOLUTE[solute2] might work. The parts outside the brackets are used to look for the NXDL node, so the function get_node_at_nxdl_path receives a concatenation of what is outside the brackets. You can check if that function finds this NXDL path correctly: /ENTRY/SAMPLE/SOLUTE/solvent. I keep forgetting the exact caps/no-caps format that goes in there. But as far as the converter sees this, the part outside the brackets is used to get the NXDL node for the NeXus info it needs. Then it just writes whatever you have in the [brackets] to the HDF5 file.
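
To illustrate the convention described above (a sketch only; `template_key_to_nxdl_path` is a hypothetical helper, while the actual lookup in pynxtools happens via get_node_at_nxdl_path):

```python
import re

def template_key_to_nxdl_path(template_key: str) -> str:
    """Drop the concrete instance names in [brackets], keeping only the
    parts used to look up the NXDL node.

    "/ENTRY[entry]/SAMPLE[sample]/SOLUTE[solute2]/solvent"
    -> "/ENTRY/SAMPLE/SOLUTE/solvent"
    """
    return re.sub(r"\[[^\]]*\]", "", template_key)
```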

@sherjeelshabih (Collaborator) commented:

> @sherjeelshabih @sanbrock What do you think of this approach? It is roughly working already (except for the error in the previous comment, but I'm actually not sure if this is a general bug in the template generation, @sherjeelshabih?), and IMO it would be helpful to have a verification tool for NeXus files. If you think this is a good path, I would continue adding more verification features in other parts of the code (i.e., the points listed above) as well as tests.

I really like this approach. I already see that you are using the same validation routine used while converting. This will then help improve verification there too. That's awesome.

I think not loading large datasets should get a nice wrapper that also allows readers, from the other side, to tell this verification routine not to load large sets until it actually writes the HDF5. Then for writing we can have a streamIO object or whatever one calls it. For the verification of HDF5 files this will be easier, as we can rely on h5py's lazy-loading functionality. Maybe for this PR we can focus on the simple HDF5 case and just keep this in mind.

I think shape and dim aren't checked, but types were being checked. Or do you mean the types don't deal with what comes out of h5py.dtype?
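
For reference, a minimal sketch of what matching an h5py/numpy dtype against an NXDL type name could look like (the `nxdl_type_matches` helper and the exact mapping are assumptions for illustration, not the check pynxtools performs):

```python
import numpy as np

def nxdl_type_matches(dtype: np.dtype, nxdl_type: str) -> bool:
    """Very rough dtype check against a few common NXDL type names."""
    if nxdl_type == "NX_INT":
        return np.issubdtype(dtype, np.integer)
    if nxdl_type == "NX_FLOAT":
        return np.issubdtype(dtype, np.floating)
    if nxdl_type == "NX_NUMBER":
        return np.issubdtype(dtype, np.number)
    if nxdl_type == "NX_BOOLEAN":
        return np.issubdtype(dtype, np.bool_)
    if nxdl_type == "NX_CHAR":
        # h5py returns strings as bytes ("S"), unicode ("U") or object dtype
        return dtype.kind in ("S", "U", "O")
    return True  # types not covered by this sketch are not rejected
```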

@domna (Collaborator, Author) commented Jun 23, 2023

> I think SOLUTE[solute2] might work. The parts outside the brackets are used to look for the NXDL node, so the function get_node_at_nxdl_path receives a concatenation of what is outside the brackets. You can check if that function finds this NXDL path correctly: /ENTRY/SAMPLE/SOLUTE/solvent. I keep forgetting the exact caps/no-caps format that goes in there. But as far as the converter sees this, the part outside the brackets is used to get the NXDL node for the NeXus info it needs. Then it just writes whatever you have in the [brackets] to the HDF5 file.

OK, I will just check what works. SOLUTE[solute2] will be a bit trickier to implement in the reverse direction, as I'm currently just reading the NX_class attribute for this; I would need to get the actual group name from the NXDL to resolve this from the file. Maybe it would be helpful to allow SAMPLE_COMPONENT[solute2] for a group named SOLUTE in the validation step and just check whether the actual naming is correct for the upper/lower-case combinations (e.g., something like SOLUTE_fixedpart should just allow names like test1_fixedpart, test2_fixedpart, etc.).
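
A rough sketch of such a namefitting check (`name_fits` below is hypothetical and not the routine pynxtools actually uses; it only assumes that uppercase segments of the NXDL name are renameable while lowercase segments are fixed):

```python
import re

def name_fits(nxdl_name: str, actual_name: str) -> bool:
    """Check whether a concrete group/field name fits an NXDL name.

    Uppercase runs in the NXDL name are treated as placeholders that may be
    replaced by any lowercase token; lowercase parts must match literally.

    name_fits("SOLUTE_fixedpart", "test1_fixedpart")  -> True
    name_fits("SOLUTE_fixedpart", "Test_fixedpart")   -> False (uppercase letter)
    """
    pattern = re.sub(r"[A-Z]+", "[a-z0-9]+", re.escape(nxdl_name))
    return re.fullmatch(pattern, actual_name) is not None
```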

@domna (Collaborator, Author) commented Jun 23, 2023

> @sherjeelshabih @sanbrock What do you think of this approach? It is roughly working already (except for the error in the previous comment, but I'm actually not sure if this is a general bug in the template generation, @sherjeelshabih?), and IMO it would be helpful to have a verification tool for NeXus files. If you think this is a good path, I would continue adding more verification features in other parts of the code (i.e., the points listed above) as well as tests.

> I really like this approach. I already see that you are using the same validation routine used while converting. This will then help improve verification there too. That's awesome.

Yes, this is why I started this: I felt most of the parts needed for validation are already there, I just did not have an easy way to use them from the command line.

> I think not loading large datasets should get a nice wrapper that also allows readers, from the other side, to tell this verification routine not to load large sets until it actually writes the HDF5. Then for writing we can have a streamIO object or whatever one calls it. For the verification of HDF5 files this will be easier, as we can rely on h5py's lazy-loading functionality. Maybe for this PR we can focus on the simple HDF5 case and just keep this in mind.

Yes, sounds good 👍️ This would also help with building this for verification, as we could just pass a small class instead of the actual data and let the verification routine check its contents against the NXDL.
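
As a sketch of that idea (the `DatasetStub` class is hypothetical, just illustrating the kind of object a reader could put into the template instead of the actual array):

```python
import h5py

class DatasetStub:
    """Stand-in for a large dataset: verification can check dtype/shape/ndim
    against the NXDL, and the values are only read when the HDF5 output is
    actually written."""

    def __init__(self, file_path: str, h5_path: str):
        self.file_path = file_path
        self.h5_path = h5_path
        with h5py.File(file_path, "r") as h5file:
            dataset = h5file[h5_path]
            self.dtype = dataset.dtype
            self.shape = dataset.shape
            self.ndim = dataset.ndim

    def read(self):
        """Deferred load of the actual values (used only at write time)."""
        with h5py.File(self.file_path, "r") as h5file:
            return h5file[self.h5_path][()]
```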

> I think shape and dim aren't checked, but types were being checked. Or do you mean the types don't deal with what comes out of h5py.dtype?

I did not actually check in detail what is already being checked by the code and what is not (the points above are also a somewhat chaotic collection of things I did not want to forget to address here 😅).

@coveralls commented Jul 4, 2023

Pull Request Test Coverage Report for Build 8845625971

Details

  • 210 of 239 (87.87%) changed or added relevant lines in 7 files are covered.
  • 5 unchanged lines in 4 files lost coverage.
  • Overall coverage increased (+0.2%) to 78.534%

Changes Missing Coverage:

| File | Covered Lines | Changed/Added Lines | % |
| --- | --- | --- | --- |
| pynxtools/nexus/nexus.py | 0 | 1 | 0.0% |
| tests/dataconverter/test_helpers.py | 32 | 34 | 94.12% |
| pynxtools/dataconverter/helpers.py | 166 | 192 | 86.46% |

Files with Coverage Reduction:

| File | New Missed Lines | % |
| --- | --- | --- |
| pynxtools/definitions/dev_tools/utils/nxdl_utils.py | 1 | 74.23% |
| pynxtools/dataconverter/convert.py | 1 | 75.69% |
| pynxtools/dataconverter/helpers.py | 1 | 88.85% |
| pynxtools/dataconverter/template.py | 2 | 86.05% |

Totals Coverage Status:

  • Change from base Build 8833244405: 0.2%
  • Covered Lines: 2817
  • Relevant Lines: 3587

💛 - Coveralls

@domna (Collaborator, Author) commented Jun 10, 2024

This is superseded by #333, which includes everything from this PR but uses the new validation algorithm.

@domna closed this Jun 10, 2024