
Append-only production runs #5

Open
cisaacstern opened this issue Apr 26, 2022 · 2 comments

Comments
@cisaacstern
Member
cisaacstern commented Apr 26, 2022

User Profile

As a recipe maintainer

User Action

I want to re-run recipes in my feedstock (either manually or on a schedule) to append newly released data to my dataset

User Goal

So that I can keep the dataset built by my feedstock up-to-date with the latest releases from the data provider without needing to re-run the entire recipe

Acceptance Criteria

The ability to trigger append-only production runs (manually or on a schedule) from a feedstock. This might be inferred from the recipe itself, or perhaps specified by a new property in meta.yaml.
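If this were specified in meta.yaml, it could look something like the sketch below. The property names (`append`, `enabled`, `schedule`) are purely hypothetical and only meant to illustrate the shape such a configuration might take:

```yaml
# Hypothetical meta.yaml addition; all property names here are illustrative.
recipes:
  - id: noaa-oisst
    append:
      enabled: true
      schedule: "0 6 * * *"  # cron expression for scheduled append runs
```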

Linked Issues

@cisaacstern
Member Author

I believe there are only two remaining prerequisites before the actual work of this feature can begin in pangeo-forge-recipes:

  1. Merge Persist execution context in storage target pangeo-forge-recipes#359. (I believe this is ready to go, but I haven't looked at it in a week or so, so one last review is probably worthwhile.)
  2. We'll probably want to revisit the way the recipe hash is generated. As it stands, the file pattern hash is included in the recipe hash calculation. If we do not isolate the recipe and pattern hashes, we have no way to confirm that a recipe used to append to a particular existing dataset shares the same non-file-pattern attributes (e.g. target chunks) as the recipe used to create that dataset. Isolating the two hashes is not entirely straightforward, however, because the recipe classes currently contain certain attributes that relate to execution concerns, which could vary without affecting the "append compatibility" of two recipe instances.

Once these two items are addressed, work can begin on an appending feature in pangeo-forge-recipes.
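To make item 2 concrete, here is a minimal sketch of what isolating the two hashes could look like. All field names (`file_pattern`, `storage_config`, `executor`, `target_chunks`) and the dict-based representation are assumptions for illustration, not the actual pangeo-forge-recipes API:

```python
import hashlib
import json

# Illustrative only: attributes that relate purely to execution and should
# not affect whether two recipe instances are "append compatible".
EXECUTION_ONLY_FIELDS = {"storage_config", "executor"}


def _sha256(obj: dict) -> str:
    """Stable hash of a JSON-serializable dict."""
    return hashlib.sha256(json.dumps(obj, sort_keys=True).encode()).hexdigest()


def pattern_hash(pattern_spec: dict) -> str:
    """Hash of the file pattern alone."""
    return _sha256(pattern_spec)


def recipe_hash(recipe_attrs: dict) -> str:
    """Hash of the recipe, excluding the file pattern and execution-only fields."""
    relevant = {
        k: v
        for k, v in recipe_attrs.items()
        if k != "file_pattern" and k not in EXECUTION_ONLY_FIELDS
    }
    return _sha256(relevant)


def append_compatible(existing: dict, new: dict) -> bool:
    """Two recipes may append to the same dataset if their isolated
    recipe hashes match, regardless of file pattern or executor."""
    return recipe_hash(existing) == recipe_hash(new)
```

With this separation, a recipe with a new file pattern (new data to append) but identical target chunks would compare as compatible, while a change to target chunks would not.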

@sharkinsspatial

@rabernat Per our discussion in the call yesterday, I'm including some more details on our AWS-ASDI-specific use cases here rather than in a new ticket on pangeo-forge-recipes. For many of the reference indexes we are generating for data in the AWS PDS buckets (pangeo-forge/staged-recipes#208), we'll need to periodically update the index as new data becomes available. In almost all of our cases, this will involve expanding the index's time dimension.

I think our use case is a bit atypical in that most of the buckets where these datasets live have event notifications configured for new keys, which allows us to monitor data as it is added. Originally I had envisioned queuing these event notifications and periodically sending a block of new files to pangeo-forge for appending to the target archive. This would work well for our use case, but I don't think it generalizes for most users. Instead, I think we'll likely need a process that:

  • Allows users to configure an update-checking cron increment for their recipe.
  • Uses the last time step in the Zarr or Kerchunk archive and the new date to construct a new FilePattern for the data to append.

This assumes that the recipe's concat dim is temporal, and we'd likely need to restrict the append-only cron configuration to recipes where this is true.
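The second step above can be sketched roughly as follows. This assumes a daily frequency and a hypothetical URL template (`make_url` and the bucket path are placeholders, not a real provider layout); a real implementation would feed these dates into a pangeo-forge-recipes FilePattern along a temporal concat dim:

```python
from datetime import date, timedelta

def dates_to_append(last_date_in_archive: date, today: date) -> list[date]:
    """All dates after the last time step already in the target archive,
    up to and including today. Assumes a daily frequency."""
    n_days = (today - last_date_in_archive).days
    return [last_date_in_archive + timedelta(days=i) for i in range(1, n_days + 1)]

def make_url(d: date) -> str:
    # Placeholder URL template; the actual bucket layout would come from
    # the data provider.
    return f"s3://provider-bucket/data/{d:%Y/%m/%d}.nc"

def urls_for_append(last_date_in_archive: date, today: date) -> list[str]:
    """Source URLs for the new FilePattern used by the append run."""
    return [make_url(d) for d in dates_to_append(last_date_in_archive, today)]
```

A scheduled (cron-triggered) append run would read `last_date_in_archive` from the existing Zarr or Kerchunk store, build this pattern, and process only the new inputs.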

@cisaacstern has linked most of the related issues above, but I'll include the more recent Beam-specific issue here for tracking as well: pangeo-forge/pangeo-forge-recipes#447
