Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

String parsing and expression evaluation in json_map reader #252

Open
wants to merge 4 commits into
base: master
Choose a base branch
from

Conversation

mmpsi
Copy link

@mmpsi mmpsi commented Feb 21, 2024

I propose two extensions of the json_map reader to handle custom date-time formats and expression evaluation. These changes serve the following use cases:

  1. Nexus requires date and time specifications in ISO format. Older data files may use custom date formats such as mm/dd/yyyy or numeric time stamps. The proposed change allows to parse entries in the original data file using the Python datetime or dateutil libraries and convert them on the fly to ISO format.
  2. Nexus requires coordinate axes to be given point-by-point. Some data files may specify only the end points, where the full array can be constructed using a numpy function, e.g. np.linspace. To be more general, the proposed code accepts any expression that can be evaluated using the eval built-in and the numpy library.

The proposed code introduces additional syntax to the json mapping file. It is fully backward compatible with existing mappings. Documentation in the code and README is included.

@sherjeelshabih sherjeelshabih self-requested a review February 21, 2024 14:07
@coveralls
Copy link

coveralls commented Feb 21, 2024

Pull Request Test Coverage Report for Build 7990178235

Details

  • 20 of 63 (31.75%) changed or added relevant lines in 1 file are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage decreased (-0.03%) to 43.49%

Changes Missing Coverage Covered Lines Changed/Added Lines %
pynxtools/dataconverter/readers/json_map/reader.py 20 63 31.75%
Totals Coverage Status
Change from base Build 7989122601: -0.03%
Covered Lines: 4733
Relevant Lines: 10883

💛 - Coveralls

@sherjeelshabih
Copy link
Collaborator

Hey Matthias! Thank you for your pull request. I roughly went over your changes. They look very nicely done.

I'll check the details of the date parsing function later. It seems quite nice and clear already.

I just wanted to ask you the motivation behind using eval for expressions. We don't use eval on purpose in most of the code. It can, in many cases, be a security threat. Especially on one of the use cases of this tool within web apps such as Nomad. If you would like to add some programmability in the mapping file, we will need to strongly sanitize the expressions before we pass them to eval. And almost no sanitation is good enough.

I will instead suggest forming our own syntax around this like with the date parsing and only calling safe numpy functions from the code, like np.linspace. Allowing dynamic code with eval is very dangerous.

@mmpsi
Copy link
Author

mmpsi commented Feb 21, 2024

Hey Sherjeel

I understand that there may be security concerns. For my purposes, I just need linspace. I wrote the code more general so that it could accommodate more use cases and keep the syntax simple. It is up to you as the maintainer to decide how general or specific the syntax should be. I'm happy to change the code according to your suggestions.

@domna
Copy link
Collaborator

domna commented Feb 21, 2024

Hello @mmpsi, thank you for your PR. I think these are all very nice suggestions to incorporate into the reader. I agree with @sherjeelshabih that we should avoid using eval but I think we can integrate parsing without using eval (I already implemented formula parsing from within a nexus file for one of our readers).

Actually, we are currently redesigning our generic readers, which includes the json map reader, into one which supports and aligns our use cases. The idea is also that we have a config file with special notations which allows conveniently reading data from another source (see https://github.com/FAIRmat-NFDI/pynxtools-mpes/blob/main/tests/data/config_file.json for an example). The change is handled in #250. I will also add formula parsing and your string parsing routines there.

@domna domna mentioned this pull request Feb 21, 2024
2 tasks
Copy link
Collaborator

@sherjeelshabih sherjeelshabih left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've reviewed the date part of the code.

I would say let us get the generic reader a bit more formalized before you refactor the linspace functionality. It should be quite straightforward though. I just don't want you to refactor it twice when we already are shaping out the generic reader.

Comment on lines +197 to +202
try:
value = parse_opts["parse_string"]
if is_path(value):
value = get_val_nested_keystring_from_dict(value[1:], data)
except (KeyError, TypeError):
continue
Copy link
Collaborator

@sherjeelshabih sherjeelshabih Feb 22, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Most of the cases here will result in an exception and hit continue. It will be better to replace this try except block with an if statement. It should perform better in this case.

Suggested change
try:
value = parse_opts["parse_string"]
if is_path(value):
value = get_val_nested_keystring_from_dict(value[1:], data)
except (KeyError, TypeError):
continue
if "parse_string" not in parse_opts:
continue
value = parse_opts["parse_string"]
if is_path(value):
value = get_val_nested_keystring_from_dict(value[1:], data)


if "regexp" in parse_opts:
match = re.match(parse_opts["regexp"], value)
groups = match.groups('')
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just for consistency you can do this:

Suggested change
groups = match.groups('')
groups = match.groups("")

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants