
Finalize GitHub workflow logic for running recipes in feedstocks. #67

Open · Tracked by #33
sharkinsspatial opened this issue Aug 2, 2021 · 4 comments
Labels: CI (Related to CI and GitHub workflows)

Comments

@sharkinsspatial (Contributor)

The proposed workflow logic for running recipes in feedstock repos is described in https://github.com/pangeo-forge/roadmap/blob/master/doc/adr/0001-github-workflows.md, but we should review it for a consensus decision and address some open questions around versioning and updating existing zarr archives.

The initial naive GitHub Action will register and run the full recipe (replacing an existing zarr store if it exists) for each recipe in the repository when a new release tag is created. While functional, this approach has several issues which should be addressed:

1. The action should really work in a similar fashion to the existing staged-recipes PR testing workflow, which only proceeds when the PR includes changes to a meta.yaml file. I'm unsure how diff should be used with a release tag to determine the list of meta.yaml files that have been updated in a release (see the sketch after this list). There is also the possibility that the underlying recipe code changes with no corresponding change to the meta.yaml, so I would suggest adding a root-level version property to the meta.yaml spec to facilitate flagging with diff.

2. The current approach will naively replace the entire existing zarr archive when the recipe is re-run. (Additionally, I would like clarification on how pangeo-forge-recipes will handle the case of writing to object storage when the target archive already exists.) There has been previous discussion about incrementally updating existing archives, but this poses a few questions in relation to workflows. How should underlying changes to source data (notably from reprocessing efforts) be handled? How should invalidation of intermediate cache files be handled when source data changes?

3. Should our zarr target structure patterns include some sort of versioning scheme tied to our release tagging? This would simplify many aspects of the workflow process but may be confusing for end users when they encounter multiple versions of the archive.

4. Another open question around feedstock workflows concerns how we should manage recipes for different chunking strategies for the same data. I believe it makes sense to maintain a single feedstock per source dataset and to treat different chunking strategies as unique recipes within the feedstock. With this approach, users could submit PRs for new chunking strategies, and using the diff method discussed above we could run only the recipes which were changed in the release.
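
As a hedged sketch of the diff step in item 1 (the tag names, repo layout, and helper function are illustrative assumptions, not an agreed convention), the workflow could ask git which meta.yaml files changed between the previous release tag and the new one:

```python
# Hypothetical sketch: list meta.yaml files touched between two release tags.
import subprocess

def changed_meta_files(prev_tag: str, new_tag: str) -> list[str]:
    """Return paths of meta.yaml files that changed between two tags."""
    out = subprocess.run(
        ["git", "diff", "--name-only", f"{prev_tag}..{new_tag}"],
        capture_output=True, text=True, check=True,
    ).stdout
    return [path for path in out.splitlines() if path.endswith("meta.yaml")]

# e.g. changed_meta_files("v1.0.0", "v1.1.0")
# -> ["recipes/noaa_oisst/meta.yaml"]
```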

As discussed in the coordination meeting today, I'll push forward with the naive approach initially so that we can close #58. Then @rabernat can schedule a sprint where we can address these open questions and update the workflows.

@rabernat (Contributor) commented Aug 3, 2021

> 1. There is also the possibility that the underlying recipe code changes with no corresponding change to the meta.yaml, so I would suggest adding a root-level version property to the meta.yaml spec to facilitate flagging with diff.

Conda-forge's recipe.yaml (example) distinguishes between "version" (the actual software version, the same as on PyPI) and "build". To rebuild a recipe, you increment "build" without touching "version". We could consider doing the same.
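
A minimal sketch of how that could look for us, assuming hypothetical root-level version and build fields in meta.yaml (neither is in the spec today):

```python
# Hypothetical "version"/"build" fields at the root of meta.yaml, mirroring
# conda-forge: bumping "build" forces a re-run without claiming the upstream
# data version changed.
old = {"version": "2.1", "build": 0}  # parsed from meta.yaml at the previous tag
new = {"version": "2.1", "build": 1}  # parsed from meta.yaml at the new tag

needs_rerun = (old["version"], old["build"]) != (new["version"], new["build"])
assert needs_rerun  # the build bump alone is enough to flag the recipe
```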

> 2. The current approach will naively replace the entire existing zarr archive when the recipe is re-run. (Additionally, I would like clarification on how pangeo-forge-recipes will handle the case of writing to object storage when the target archive already exists.)

For the XarrayZarrRecipe, the original dataset will not be deleted. It will follow this code path, trying to open the target dataset. Then it will overwrite all the data variables but not change the metadata. This is probably not the ideal behavior in all cases; it was done this way to facilitate "append" workflows.
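
To illustrate what this looks like at the store level (a minimal sketch using plain zarr rather than pangeo-forge-recipes internals; the store path and variable name are made up):

```python
# Minimal illustration of overwrite-in-place: the existing group is opened
# rather than recreated, variable values are replaced, and the group-level
# metadata (.zattrs etc.) is left untouched.
import numpy as np
import zarr

group = zarr.open_group("target.zarr", mode="a")  # hypothetical existing target
sst = group["sst"]                                # hypothetical data variable
sst[:] = np.zeros(sst.shape, dtype=sst.dtype)     # overwrite values only
```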

> 3. Should our zarr target structure patterns include some sort of versioning scheme tied to our release tagging? This would simplify many aspects of the workflow process but may be confusing for end users when they encounter multiple versions of the archive.

I think this would make a lot of sense. It would require an update to ADR 3. Keep in mind that there is a layer of abstraction between the raw object store and the user: the catalog.
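
For illustration only, here is one hypothetical way a release tag could be folded into the target path; the actual convention would be decided in the ADR 3 update:

```python
# Hypothetical versioned target layout; ADR 3 would codify the real convention.
def target_path(root: str, recipe_name: str, release_tag: str) -> str:
    """Embed the feedstock release tag in the zarr target path."""
    return f"{root}/{recipe_name}/{release_tag}.zarr"

target_path("s3://pangeo-forge", "noaa_oisst", "v1.2.0")
# -> "s3://pangeo-forge/noaa_oisst/v1.2.0.zarr"
```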

> 4. Another open question around feedstock workflows concerns how we should manage recipes for different chunking strategies for the same data.

For now, I think we can handle this simply as distinct recipes. The recipe names would include a reference to the chunking scheme, e.g. noaa_oisst_space_chunks vs. noaa_oisst_time_chunks. This is not super elegant, but it would work and allow us to move forward. We can revisit this if it becomes a very common pattern.
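
A hedged sketch of what that could look like in a feedstock's recipe module (the source URLs and chunk sizes are made up; this assumes pangeo-forge-recipes' XarrayZarrRecipe and pattern_from_file_sequence):

```python
# Two recipes in one feedstock that differ only in target chunking.
from pangeo_forge_recipes.patterns import pattern_from_file_sequence
from pangeo_forge_recipes.recipes import XarrayZarrRecipe

# Hypothetical source files, concatenated along "time".
pattern = pattern_from_file_sequence(
    ["https://example.com/oisst/sst-1981.nc",
     "https://example.com/oisst/sst-1982.nc"],
    concat_dim="time",
)

noaa_oisst_time_chunks = XarrayZarrRecipe(pattern, target_chunks={"time": 500})
noaa_oisst_space_chunks = XarrayZarrRecipe(pattern, target_chunks={"lat": 180, "lon": 360})
```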

I am eager to organize this sprint, but not sure when would be the best time. My week is once again a bit scattered, without a lot of consistent work time. I could possibly do Wednesday (tomorrow) 6-7 pm CET. Otherwise next week would probably be better.

@cisaacstern (Member)

> > Should our zarr target structure patterns include some sort of versioning scheme tied to our release tagging? This would simplify many aspects of the workflow process but may be confusing for end users when they encounter multiple versions of the archive.
>
> I think this would make a lot of sense. It would require an update to ADR 3. Keep in mind that there is a layer of abstraction between the raw object store and the user: the catalog.

👍 and seems like this would also require an update to ADR 3:

> The recipe names would include a reference to the chunking scheme, e.g. noaa_oisst_space_chunks vs. noaa_oisst_time_chunks

Seems like if we're ever going to want it there, then it's easiest to always include chunking in the target path name.

> I could possibly do Wednesday (tomorrow) 6-7 pm CET. Otherwise next week would probably be better.

Available Wednesday (tomorrow) 6-7 pm CET, and also flexible next week.

@rabernat (Contributor) commented Aug 3, 2021

> then it's easiest to always include chunking in the target path name.

Not so sure about this. First of all, the chunking information is trivially accessible from the zarr metadata, so it is easy to discover when building the catalog. Second, I'm not sure how you would "encode" chunking into the path name. But we could try to work this out in a whiteboard session.
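
To illustrate the first point (a sketch with a made-up store name), a catalog builder can read chunking straight out of the store:

```python
# Chunking is self-describing in the zarr metadata, so a catalog builder
# can discover it without any encoding in the path name.
import zarr

group = zarr.open_group("noaa_oisst.zarr", mode="r")  # hypothetical store
chunking = {name: array.chunks for name, array in group.arrays()}
# e.g. {"sst": (500, 720, 1440), "time": (500,), ...}
```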

@rabernat (Contributor) commented Aug 4, 2021

I am not sure if this hack is happening, but I'll join https://whereby.com/pangeo now.
