
Skip re-computing metadata cache. #243

Merged 6 commits into pangeo-forge:master on Jan 31, 2022

Conversation

@alxmrs (Contributor) commented Nov 19, 2021

A fix for #241.
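
For context, a minimal sketch of the idea behind the fix, assuming a mapping-like metadata cache. The `MetadataCache` class and `compute_metadata` stub below are illustrative, not the project's actual API; the point is just that the expensive metadata computation is skipped whenever the cache already contains an entry for the input key.

```python
import json
import os


class MetadataCache:
    """Toy file-backed metadata cache keyed by input name (illustrative only)."""

    def __init__(self, root: str):
        self.root = root
        os.makedirs(root, exist_ok=True)

    def _path(self, key: str) -> str:
        return os.path.join(self.root, f"{key}.json")

    def __contains__(self, key: str) -> bool:
        # The method the skip logic relies on.
        return os.path.exists(self._path(key))

    def __setitem__(self, key: str, value: dict) -> None:
        with open(self._path(key), "w") as f:
            json.dump(value, f)


def compute_metadata(key: str) -> dict:
    # Stand-in for the expensive step (e.g. opening a remote input file).
    return {"key": key, "n_items": 744}


def cache_input_metadata(key: str, cache: MetadataCache) -> None:
    if key in cache:  # the early exit this PR introduces
        return
    cache[key] = compute_metadata(key)
```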

@alxmrs changed the title from "Simple implementation of skipping re-computing metadata cache." to "Skip re-computing metadata cache." on Nov 19, 2021
@cisaacstern (Member)

xref #224

@rabernat (Contributor)

@alxmrs - I think this change makes sense.

To move forward, we would need two tests:

@rabernat (Contributor) commented Dec 2, 2021

Alex, if we can just get a test for `__contains__` I am fine with merging this as is.
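
A minimal pytest-style test along those lines might look like the following, reusing the toy `MetadataCache` from the sketch above; the names are illustrative, not the project's actual test suite.

```python
import tempfile


def test_metadata_cache_contains():
    # Assumes the toy MetadataCache sketched earlier is importable.
    with tempfile.TemporaryDirectory() as root:
        cache = MetadataCache(root)
        assert "input-0" not in cache
        cache["input-0"] = {"n_items": 744}
        assert "input-0" in cache
```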

@alxmrs (Contributor, Author) commented Dec 2, 2021

I'm happy to write unit tests for this; I'll push a patch tomorrow.

I'm investigating whether an additional flag is needed for this feature: should the user be able to set a `recompute_metadata` flag to force overwriting the cache? I may have encountered a need for this in the project I'm currently working on; my workaround so far is to change the path of the metadata cache.
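
For illustration, a hypothetical `recompute_metadata` flag (not part of this PR) would simply make the early exit from the earlier sketch conditional:

```python
def cache_input_metadata(key, cache, recompute_metadata=False):
    if not recompute_metadata and key in cache:
        return  # reuse the cached entry
    cache[key] = compute_metadata(key)  # overwrite when the flag is set
```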

@cisaacstern (Member)

> should the user be able to set a `recompute_metadata` flag to force overwriting the cache

@jbusecke and I are working on some derived datasets, where the inputs are already in Zarr format on the cloud.

In this case, we would not be caching inputs, and therefore need some way to call `cache_input_metadata` as a standalone stage of the recipe. Making this its own stage might supersede the need for a specific `recompute_metadata` flag on the `cache_inputs` stage.

xref #224
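
To illustrate the direction (hypothetical stage names with stub bodies; see #224 for the actual plan), a decoupled recipe might expose metadata caching as its own stage so a derived-dataset recipe can compose only the stages it needs:

```python
def cache_input(key): ...            # skipped when inputs are already cloud Zarr
def cache_input_metadata(key): ...   # still needed, so it runs standalone
def prepare_target(): ...
def store_chunk(key): ...
def finalize_target(): ...

# A derived-dataset recipe drops cache_input but keeps metadata caching:
stages = [cache_input_metadata, prepare_target, store_chunk, finalize_target]
```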

@rabernat (Contributor) commented Dec 3, 2021

My plan once this is merged is to start refactoring this recipe to have several [optional] standalone stages (#224), rather than just one big `cache_inputs` stage.

Here we have a choice: add another option to the recipe config for `recompute_metadata`, or instead make clearing the metadata cache the user's responsibility. The former adds more complexity to maintain. The latter requires intervention from the bakery operator, who presumably has direct access to the metadata storage and can call rm on the recipe metadata cache directory. That's certainly what I would do if I were executing a recipe locally: just clean out the metadata cache dir.
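
In user land, "clean out the metadata cache dir" could be as simple as the following sketch; the path is hypothetical, and fsspec makes it the same call whether the bakery's storage is local or cloud.

```python
import fsspec

fs = fsspec.filesystem("file")  # or "gs"/"s3" for a cloud bakery
metadata_cache_dir = "/tmp/pangeo-forge/metadata"  # hypothetical location
if fs.exists(metadata_cache_dir):
    fs.rm(metadata_cache_dir, recursive=True)
```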

This discussion reminds me a lot of pangeo-forge/roadmap#29: where in our workflow do we allocate and manage temporary storage? In pangeo-forge/roadmap#29 we discussed this for the target storage and resolved it with pangeo-forge/roadmap#34, but we have not really had that discussion for cache storage or metadata storage. Maybe this is something the recipe orchestrator could manage.

Alex, can I ask why you need to recompute the metadata? This might help figure out the best way to support that need.

@alxmrs (Contributor, Author) commented Dec 3, 2021

> Alex, can I ask why you need to recompute the metadata? This might help figure out the best way to support that need.

I hit an error unrelated to the metadata cache (I'll probably bring this up in our Monday meeting anyway; see below). I noticed in the error trace that some metadata might be off. I had iterated a lot on the Recipe parameters as well as the structure of the input data, and since I had cached the metadata previously, I wanted to rule out the possibility that the pipeline was failing due to stale metadata.

When I recomputed the metadata (by changing the path for temp storage), I noticed that the error trace changed. To be more specific: in the first run, it reported inconsistent sequence lengths across merge dimensions (the data is stored by month and chunked by hour; across every merge dimension, i.e. each group of variables, the number of hours per month should be the same). In the second run, while the sequence lengths across concat dimensions differed (a different number of total hours per month), they were all the same across merge dimensions.
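
To make that invariant concrete, here is a small self-contained check (the data values are made up): for each month (the concat dimension), every variable group (a merge dimension) should report the same number of hourly steps.

```python
# (month, variable_group) -> number of hourly steps reported in the metadata
sequence_lens = {
    ("2021-01", "winds"): 744,
    ("2021-01", "temps"): 744,  # must match "winds" within the same month
    ("2021-02", "winds"): 672,  # may differ from January (concat dim)
    ("2021-02", "temps"): 672,
}

for month in sorted({m for m, _ in sequence_lens}):
    lens = {v: n for (m, v), n in sequence_lens.items() if m == month}
    assert len(set(lens.values())) == 1, f"inconsistent lengths in {month}: {lens}"
```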

Edit: I should mention that I'm working with a large GRIB2 dataset.

@alxmrs marked this pull request as ready for review on December 3, 2021 at 23:26
@rabernat merged commit 3b20ff6 into pangeo-forge:master on Jan 31, 2022