Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Account for conversion -> combination change in Provenance attribute #810

Closed
b-reyes opened this issue Sep 16, 2022 · 10 comments
Closed

Account for conversion -> combination change in Provenance attribute #810

b-reyes opened this issue Sep 16, 2022 · 10 comments
Assignees
Milestone

Comments

@b-reyes
Copy link
Contributor

b-reyes commented Sep 16, 2022

In PR #806 conversion was changed to combination in combine.py. This causes issues when displaying a combined file because we specifically obtain the version from the Provenance attributes:

@property
def version_info(self) -> Tuple[int]:
if self["Provenance"].attrs.get("conversion_software_name", None) == "echopype":
version_str = self["Provenance"].attrs.get("conversion_software_version", None)

I don't believe this will causes issues with other portions of the codebase.

@b-reyes b-reyes added this to the 0.6.3 milestone Sep 16, 2022
@b-reyes b-reyes self-assigned this Sep 16, 2022
@emiliom
Copy link
Collaborator

emiliom commented Sep 16, 2022

Based on discussion in #811, I'm changing the Milestone tag to 0.6.4

@leewujung
Copy link
Member

This seems also an issue that we are in a position to address once and for all.

@leewujung leewujung removed this from the 0.6.4 milestone Nov 28, 2022
@leewujung leewujung assigned emiliom and unassigned b-reyes Apr 7, 2023
@leewujung leewujung added this to the 0.7.2 milestone May 17, 2023
@leewujung
Copy link
Member

@emiliom and @leewujung discussed this: we will not remove these provenance attributes and just use a long History attribute to capture all the functions applied to the data. Some more thinking required to figure out what to do with the outcome of combine_echodata. We might add a set of equivalent attributes with name combinatino_*.

@emiliom
Copy link
Collaborator

emiliom commented May 25, 2023

The conversion_* provenance attributes are actually specified by the convention and are "mandatory if applicable or available" (MA). If I tried to change them earlier (#806), that was a mistake! So, we will definitely not remove these attributes.

Adding equivalent combination_* attributes when combining data (echodata objects) may be a good way to go. Then, the version_info method mentioned above can be modified so that it first checks if a conversion_software_version attribute exists; if it does, it'll use it to read the echodata (=echopype) version info, otherwise it falls back to the conversion_software_version.

But before settling on that scheme, I'll ask the SONAR-netCDF4 community about the interpretation of these "conversion" provenance attributes. Specifically, should a combine-data processing step be interpreted as another type of conversion step? In that case, we should just reuse the conversion_* attributes. The SONAR-netCDF4 documents provide too little information to make that determination; they just say that these attributes refer to the "software used to do the conversion". Is combination a type of conversion?

I'll hold off on working on this until PR #1042 is merged. But I'll plan to ask the SONAR-netCDF4 community soon.

@leewujung
Copy link
Member

I think reusing the “conversion” set of attributes to store the “combination” information is potentially problematic, because we will lose the info pertaining to each originally converted files. It will also make it much harder to keep track of what’s converted from the individual raw file and what’s combined afterwards (and therefore does not 1-1 corresponding raw files). I suggest that we just keep “conversion” and “combination” separate.

Also, for distinguishing if a file is previously combined or just converted: in discussing with @lsetiawan on #1042, I thought we can add a global attribute to the Too-level group combine = True or False. This way we can easily work with the files without having to load the metadata from specific groups, and just use the very light weight Top-level group for this information. I have not actually reviewed the PR to see it’s implemented, but even if not, it’s a relatively small change to make.

@lsetiawan
Copy link
Member

lsetiawan commented May 25, 2023

I thought we can add a global attribute to the Too-level group combine = True or False

This is stored in Provenance rather than the Top-level after discussion with @emiliom.

Update from @emiliom, 7/25/23: The name of that new, "boolean" attribute is is_combined.

@leewujung leewujung modified the milestones: 0.8.1, 0.8.0 Jul 15, 2023
@emiliom
Copy link
Collaborator

emiliom commented Jul 27, 2023

The conversion_* attributes are defined in the convention as having single entries. eg: conversion_time : 2023-07-27T18:47:02Z. combine_data reads multiple converted datasets, each with its own values for these attributes. Currently it looks like combine_echodata doesn't try to create a list of the attribute values from each input dataset, but it overwrites conversion_time.

I believe that in other cases of global attributes, lists are created from the collection of inputs and variables in the Provenance group are created to hold those lists? Should we do that here? Then the question would be, what values should remain in the single set of Provenance.conversion_* attributes; the most recent one?

@leewujung
Copy link
Member

Currently it looks like combine_echodata doesn't try to create a list of the attribute values from each input dataset, but it overwrites conversion_time.

I believe that in other cases of global attributes, lists are created from the collection of inputs and variables in the Provenance group are created to hold those lists? Should we do that here? Then the question would be, what values should remain in the single set of Provenance.conversion_* attributes; the most recent one?

I think combine_echodata currently:

  • overwrites the global attributes of each group using the first value for a given attributes: I think this makes more sense because if we treat a set of files to be 1 entity, "the first" is the value of the first files. So currently the conversion_time after combine_echodata will be conversion_time of the first file. This is in _merge_attribute:
    def _merge_attributes(attributes: List[Dict[str, str]]) -> Dict[str, str]:
    . There is a NOTE in there about you and I discussing this.
  • put together a
  • does create data variables in Provenance to store attributes from the original (pre-combined) files into data variables in the Provenance group, with the source echodata group stored as an attribute. This is in _capture_prov_attrs:
    def _capture_prov_attrs(

I think previously we talked about potentially having a combination_time that is separate from conversion_time. What do you think about that?

@emiliom
Copy link
Collaborator

emiliom commented Jul 29, 2023

I think previously we talked about potentially having a combination_time that is separate from conversion_time. What do you think about that?

Yes, that's still the plan.

@emiliom
Copy link
Collaborator

emiliom commented Aug 4, 2023

Addressed in #1113. Closing the issue

@emiliom emiliom closed this as completed Aug 4, 2023
@github-project-automation github-project-automation bot moved this from Todo to Done in Echopype Aug 4, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: Done
Development

No branches or pull requests

4 participants