Yaml provenance #11

Open
LuiggiTenorioK opened this issue Jul 11, 2024 · 5 comments · May be fixed by #34

Comments

@LuiggiTenorioK
Member

In GitLab by @mandresm on Jul 11, 2024, 17:24

Summary

As mentioned in previous meetings, I want to propose that conf/metadata/experiment_data.yml contain information about the provenance of each value, in the form of a comment. I am opening this issue to discuss the implementation strategy, timeline, responsibilities, and possible improvements to the feature.

An equivalent feature exists in ESM-Tools, an experiment configuration tool and workflow manager we develop at AWI. I propose we copy/paste from there and then modify what we need. Here is an example of what I have in mind, taken from the equivalent yaml file in ESM-Tools:

fesom:
  model: fesom  # <SOME_ABSOLUTE_PATH>/esm_tools/configs/components/fesom/fesom-2.0.yaml,line:4,col:8
  branch: 2.0.2 # <SOME_ABSOLUTE_PATH>/esm_tools/configs/components/fesom/fesom-2.0.yaml,line:17,col:13
  version: 2 # <SOME_ABSOLUTE_PATH>/esm_tools/configs/setups/awicm3/awicm3.yaml,line:399,col:18
  type: ocean # <SOME_ABSOLUTE_PATH>/esm_tools/configs/components/fesom/fesom-2.0.yaml,line:7,col:7
  comp_command: mkdir -p build; cd build; cmake -DOIFS_COUPLED=ON -DFESOM_COUPLED=ON -DCMAKE_INSTALL_PREFIX=../ ..;   make install -j `nproc --all` # <SOME_ABSOLUTE_PATH>/esm_tools/configs/setups/awicm3/awicm3.yaml,line:414,col:31
  clean_command: rm -rf build CMakeCache.txt # <SOME_ABSOLUTE_PATH>/esm_tools/configs/components/fesom/fesom-2.0.yaml,line:10,col:16
  required_plugins:
  - git+https://github.com/esm-tools-plugins/tar_binary_restarts  # <SOME_ABSOLUTE_PATH>/esm_tools/configs/components/fesom/fesom-2.0.yaml,line:13,col:3
  install_bins: bin/fesom.x  # <SOME_ABSOLUTE_PATH>/esm_tools/configs/components/fesom/fesom-2.0.yaml,line:22,col:19
  git-repository:
  - https://github.com/FESOM/fesom2.git  # <SOME_ABSOLUTE_PATH>/esm_tools/configs/components/fesom/fesom-2.0.yaml,line:20,col:7
  - https://gitlab.dkrz.de/FESOM/fesom2.git # <SOME_ABSOLUTE_PATH>/esm_tools/configs/components/fesom/fesom-2.0.yaml,line:21,col:7

Who am I suggesting that implements this feature?

Either me or @Hussam-Turjman over the next one to two months, with support from someone from Autosubmit, for example @dbeltrankyl or @kinow. But if someone at BSC wants a head start, help yourself :)

What does this feature support?

  • Reading yamls into Python collections and, for dictionaries and lists, storing the line, column and file path as a provenance attribute of each value
  • Each value read from a yaml file is an instance of a dynamically defined subclass of the original value's type, with methods to handle the provenance history
  • Dictionary and list subclasses ensuring that methods such as update, __setitem__, etc. keep the value's provenance history, plus methods to recursively retrieve and set the provenance values. There is also a clean_provenance method that recursively returns the original values with their original types.
  • Writing yamls where each value that has a provenance gets a comment next to it indicating the line, column and path of the file defining that value

All of this is useful not only for the comments in conf/metadata/experiment_data.yml, but also for querying the provenance of any value at any point in Autosubmit, simply through the provenance attribute of that value: my_dict["my_key"].provenance for a dict and my_list[my_index].provenance for a list. It could also come in handy for improving error messages.
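To make the idea concrete, here is a minimal sketch of a value that carries its provenance and of a dictionary that keeps the history on assignment. It is not the ESM-Tools code: `with_provenance` and `HistoryDict` are hypothetical names, the provenance entries are plain dicts with made-up file names and positions, and only types that can be subclassed freely (str, int, float) are handled.

```python
def with_provenance(value, provenance):
    """Wrap `value` in a dynamically created subclass of its own type.

    Hypothetical helper: the result compares equal to `value` and also
    carries a `provenance` list (oldest entry first).
    """
    base = type(value)
    cls = type(f"{base.__name__.capitalize()}WithProvenance", (base,), {})
    wrapped = cls(value)
    wrapped.provenance = [provenance]
    return wrapped


class HistoryDict(dict):
    """Hypothetical dict that extends the provenance history on overwrite."""

    def __setitem__(self, key, new):
        old = self.get(key)
        if hasattr(old, "provenance") and hasattr(new, "provenance"):
            # Put the previous history first, so the last entry always
            # describes the current value.
            new.provenance = old.provenance + new.provenance
        super().__setitem__(key, new)

    def update(self, other):
        # dict.update bypasses __setitem__, so route through it explicitly.
        for key, value in dict(other).items():
            self[key] = value


# Made-up file names and positions, for illustration only.
config = HistoryDict()
config["version"] = with_provenance(
    2, {"yaml_file": "defaults.yaml", "line": 5, "col": 12}
)
config.update(
    {"version": with_provenance(3, {"yaml_file": "setup.yaml", "line": 40, "col": 18})}
)

current = config["version"]
latest_provenance = current.provenance[-1]
previous_provenance = current.provenance[-2]
```

The wrapped value still behaves like a plain int or str, so existing code that reads the configuration would not need to change; only code that wants the origin of a value looks at `.provenance`.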

Can we reuse (copy/paste) the code from ESM-Tools?

Yes, our license is GPL-2: https://github.com/esm-tools/esm_tools?tab=GPL-2.0-1-ov-file#readme

Relevant files in ESM-Tools

How can it be implemented?

  1. While parsing the yaml, one needs to extract the line and column information and store it in a collection that has the same structure as the collection loaded from the yaml. We do that with the EsmToolsLoader, a subclass of ruamel.yaml.YAML:
    https://github.com/esm-tools/esm_tools/blob/6cf5ea8664267a80031b2d54ec6e863cf7da9645/src/esm_parser/yaml_to_dict.py#L693-L770
    Note that EsmToolsLoader has some deprecated methods related to dumping. The most important method there is load.

    That uses this constructor class, subclassed from ruamel.yaml.RoundTripConstructor:
    https://github.com/esm-tools/esm_tools/blob/6cf5ea8664267a80031b2d54ec6e863cf7da9645/src/esm_parser/yaml_to_dict.py#L638-L673

    Note that there we subclass from EnvironmentConstructor, whose parent class is ruamel.yaml.RoundTripConstructor. For the implementation here we could subclass directly from ruamel.yaml.RoundTripConstructor.

    Once the code is implemented one can simply do:

    esm_tools_loader = EsmToolsLoader()
    esm_tools_loader.set_filename(yaml_file)
    yaml_load, provenance = esm_tools_loader.load(yaml_file)

    as in these lines: https://github.com/esm-tools/esm_tools/blob/6cf5ea8664267a80031b2d54ec6e863cf7da9645/src/esm_parser/yaml_to_dict.py#L188-L198

    After this, yaml_load holds the standard collection as read by ruamel.yaml, and provenance holds another collection with the same structure as yaml_load in terms of keys, but whose values are provenance objects.

  2. Join the two worlds in a single collection. For a dictionary, for example, use the class DictWithProvenance:

    dictionary_with_provenance = DictWithProvenance(yaml_load, provenance)

    This dict now has all the provenance information attached to its values and you can use it as you wish. If your collection is a list, use ListWithProvenance instead of DictWithProvenance. Tuples, sets and other collection types are not supported.

    For all the methods related to provenance see provenance.py itself. It's almost more docstrings than code: https://github.com/esm-tools/esm_tools/blob/release/src/esm_parser/provenance.py

  3. You can now operate on the lists and dictionaries as you usually would. As long as you use __setitem__ (or update, for dictionaries), the provenance history is kept in the provenance attribute of the value. The last entry in that history is the provenance of the current value:

    my_list_with_prov[2] = my_var_with_prov
    previous_provenance = my_list_with_prov[2].provenance[-2]
    latest_provenance = my_list_with_prov[2].provenance[-1]

  4. Time to dump the Frankenstein dictionary we've been putting together from pieces of other yamls, using the function yaml_dump: https://github.com/esm-tools/esm_tools/blob/6cf5ea8664267a80031b2d54ec6e863cf7da9645/src/esm_parser/dict_to_yaml.py#L11-L130

    yaml_dump(dict_or_list_with_prov, "/path/to/the/commented.yaml")

    It's not a very elegant or efficient function, but it does the job, I guess...
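The core of steps 2 and 4 can be sketched without any YAML library, which may help when discussing the design. The sketch below assumes the loader has already produced two parallel trees (values and provenance); `attach`, `format_provenance` and `dump_lines` are hypothetical names, not the ESM-Tools functions, and the file names and positions are made up. It only handles dicts and lists of scalars and does no YAML quoting, so a real implementation would still go through ruamel.yaml.

```python
def attach(values, provenance):
    """Recursively pair each leaf of `values` with its provenance entry."""
    if isinstance(values, dict):
        return {key: attach(val, provenance[key]) for key, val in values.items()}
    if isinstance(values, list):
        return [attach(val, prov) for val, prov in zip(values, provenance)]
    return (values, provenance)


def format_provenance(prov):
    """Render one provenance entry in the comment style shown in the summary."""
    return f"{prov['yaml_file']},line:{prov['line']},col:{prov['col']}"


def dump_lines(tree, indent=0):
    """Yield YAML-like lines with a trailing provenance comment per leaf."""
    pad = "  " * indent
    if isinstance(tree, dict):
        for key, item in tree.items():
            if isinstance(item, (dict, list)):
                yield f"{pad}{key}:"
                yield from dump_lines(item, indent + 1)
            else:
                value, prov = item
                yield f"{pad}{key}: {value}  # {format_provenance(prov)}"
    elif isinstance(tree, list):
        # Sketch limitation: list items are assumed to be scalars.
        for value, prov in tree:
            yield f"{pad}- {value}  # {format_provenance(prov)}"


# Made-up input standing in for the (yaml_load, provenance) pair of step 1.
yaml_load = {
    "fesom": {
        "model": "fesom",
        "git-repository": ["https://example.org/fesom2.git"],
    }
}
provenance = {
    "fesom": {
        "model": {"yaml_file": "a.yaml", "line": 4, "col": 8},
        "git-repository": [{"yaml_file": "a.yaml", "line": 20, "col": 7}],
    }
}

lines = list(dump_lines(attach(yaml_load, provenance)))
print("\n".join(lines))
```

Keeping the merge (`attach`) separate from the writer (`dump_lines`) mirrors the split between DictWithProvenance and yaml_dump described above, so either half could be swapped out independently.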

@LuiggiTenorioK
Member Author

In GitLab by @dbeltrankyl on Jul 16, 2024, 10:00

Hello @mandresm ,

Thanks for explaining the proposal and for offering to implement it! Very interesting.

I'm the one who wrote the Autosubmit Frankenstein dict, and I'll be on holiday from 22/07 to 05/08. If you or @Hussam-Turjman have any questions, I can answer them this week or after my holidays. @kinow reviewed it a long time ago, so you can also ask him.

Thanks

@LuiggiTenorioK
Member Author

In GitLab by @kinow on Sep 6, 2024, 08:08

mentioned in merge request digital-twins/de_340-2/workflow!294

@LuiggiTenorioK
Member Author

In GitLab by @mandresm on Sep 25, 2024, 09:30

mentioned in issue digital-twins/de_340-2/workflow#591

@LuiggiTenorioK
Member Author

In GitLab by @kinow on Oct 9, 2024, 14:41

mentioned in commit 0b99076

@LuiggiTenorioK
Member Author

unassigned @mandresm
