
Skip re-computing metadata cache. #243

Merged 6 commits into pangeo-forge:master on Jan 31, 2022

Conversation

@alxmrs (Contributor) commented Nov 19, 2021

A fix for #241.
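
For context, a minimal sketch of the idea behind the fix, assuming a mapping-like metadata cache. The `MetadataCache` class and `compute_metadata` stub below are illustrative, not the project's actual API; the point is just that the expensive metadata computation is skipped whenever the cache already contains an entry for the input key.

```python
import json
import os


class MetadataCache:
    """Toy file-backed metadata cache keyed by input name (illustrative only)."""

    def __init__(self, root: str):
        self.root = root
        os.makedirs(root, exist_ok=True)

    def _path(self, key: str) -> str:
        return os.path.join(self.root, f"{key}.json")

    def __contains__(self, key: str) -> bool:
        # The method the skip logic relies on.
        return os.path.exists(self._path(key))

    def __setitem__(self, key: str, value: dict) -> None:
        with open(self._path(key), "w") as f:
            json.dump(value, f)


def compute_metadata(key: str) -> dict:
    # Stand-in for the expensive step (e.g. opening a remote input file).
    return {"key": key, "n_items": 744}


def cache_input_metadata(key: str, cache: MetadataCache) -> None:
    if key in cache:  # the early exit this PR introduces
        return
    cache[key] = compute_metadata(key)
```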

@alxmrs changed the title from "Simple implementation of skipping re-computing metadata cache." to "Skip re-computing metadata cache." on Nov 19, 2021
@cisaacstern (Member)

xref #224

@rabernat (Contributor)

@alxmrs - I think this change makes sense.

To move forward, we would need two tests:

@rabernat (Contributor) commented Dec 2, 2021

Alex, if we can just get a test for `__contains__` I am fine with merging this as is.
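
A minimal pytest-style test along those lines might look like the following, reusing the toy `MetadataCache` from the sketch above; the names are illustrative, not the project's actual test suite.

```python
import tempfile


def test_metadata_cache_contains():
    # Assumes the toy MetadataCache sketched earlier is importable.
    with tempfile.TemporaryDirectory() as root:
        cache = MetadataCache(root)
        assert "input-0" not in cache
        cache["input-0"] = {"n_items": 744}
        assert "input-0" in cache
```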

@alxmrs (Contributor, Author) commented Dec 2, 2021

I'm happy to write unit tests for this; I'll push a patch tomorrow.

I'm investigating whether an additional flag is needed for this feature: should the user be able to set a `recompute_metadata` flag to force overwriting the cache? I may have encountered a need for this in the project I'm currently working on; my workaround so far is to change the path of the metadata cache.
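
For illustration, a hypothetical `recompute_metadata` flag (not part of this PR) would simply make the early exit from the earlier sketch conditional:

```python
def cache_input_metadata(key, cache, recompute_metadata=False):
    if not recompute_metadata and key in cache:
        return  # reuse the cached entry
    cache[key] = compute_metadata(key)  # overwrite when the flag is set
```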

@cisaacstern (Member)

> should the user be able to set a `recompute_metadata` flag to force overwriting the cache

@jbusecke and I are working on some derived datasets, where the inputs are already in Zarr format on the cloud.

In this case, we would not be caching inputs, and therefore need some way to call `cache_input_metadata` as a standalone stage of the recipe. Making this its own stage might supersede the need for a specific `recompute_metadata` flag on the `cache_inputs` stage.

xref #224
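
To illustrate the direction (hypothetical stage names with stub bodies; see #224 for the actual plan), a decoupled recipe might expose metadata caching as its own stage so a derived-dataset recipe can compose only the stages it needs:

```python
def cache_input(key): ...            # skipped when inputs are already cloud Zarr
def cache_input_metadata(key): ...   # still needed, so it runs standalone
def prepare_target(): ...
def store_chunk(key): ...
def finalize_target(): ...

# A derived-dataset recipe drops cache_input but keeps metadata caching:
stages = [cache_input_metadata, prepare_target, store_chunk, finalize_target]
```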

@rabernat (Contributor) commented Dec 3, 2021

My plan once this is merged is to start refactoring this recipe to have several [optional] standalone stages (#224), rather than just one big `cache_inputs` stage.

Here we have a choice: add another option to the recipe config for `recompute_metadata`, or instead make clearing the metadata cache the user's responsibility. The former adds more complexity to maintain. The latter requires intervention from the bakery operator, who presumably has direct access to the metadata storage and can call rm on the recipe metadata cache directory. That's certainly what I would do if I were executing a recipe locally: just clean out the metadata cache dir.
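
In user land, "clean out the metadata cache dir" could be as simple as the following sketch; the path is hypothetical, and fsspec makes it the same call whether the bakery's storage is local or cloud.

```python
import fsspec

fs = fsspec.filesystem("file")  # or "gs"/"s3" for a cloud bakery
metadata_cache_dir = "/tmp/pangeo-forge/metadata"  # hypothetical location
if fs.exists(metadata_cache_dir):
    fs.rm(metadata_cache_dir, recursive=True)
```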

This discussion reminds me a lot of pangeo-forge/roadmap#29: where in our workflow do we allocate and manage temporary storage? In pangeo-forge/roadmap#29 we discussed this for the target storage and resolved it with pangeo-forge/roadmap#34, but we have not really had that discussion for cache storage or metadata storage. Maybe this is something the recipe orchestrator could manage.

Alex, can I ask why you need to recompute the metadata? This might help figure out the best way to support that need.

@alxmrs (Contributor, Author) commented Dec 3, 2021

> Alex, can I ask why you need to recompute the metadata? This might help figure out the best way to support that need.

I hit an error unrelated to the metadata cache (I'll probably bring this up in our Monday meeting anyway; see below). I noticed in the error trace that some metadata might be off. I had iterated a lot on the Recipe parameters as well as the structure of the input data, and since I had cached the metadata previously, I wanted to rule out the possibility that the pipeline was failing due to stale metadata.

When I recomputed the metadata (by changing the path for temp storage), I noticed that the error trace changed. To be more specific: in the first run, it reported inconsistent sequence lengths across merge dimensions (the data is stored by month and chunked by hour; across every merge dimension, i.e. each group of variables, the number of hours per month should be the same). In the second run, while the sequence lengths across concat dimensions differed (a different number of total hours per month), they were all the same across merge dimensions.
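
To make that invariant concrete, here is a small self-contained check (the data values are made up): for each month (the concat dimension), every variable group (a merge dimension) should report the same number of hourly steps.

```python
# (month, variable_group) -> number of hourly steps reported in the metadata
sequence_lens = {
    ("2021-01", "winds"): 744,
    ("2021-01", "temps"): 744,  # must match "winds" within the same month
    ("2021-02", "winds"): 672,  # may differ from January (concat dim)
    ("2021-02", "temps"): 672,
}

for month in sorted({m for m, _ in sequence_lens}):
    lens = {v: n for (m, v), n in sequence_lens.items() if m == month}
    assert len(set(lens.values())) == 1, f"inconsistent lengths in {month}: {lens}"
```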

Edit: I should mention that I'm working with a large GRIB2 dataset.

@alxmrs marked this pull request as ready for review on December 3, 2021 at 23:26
@rabernat merged commit 3b20ff6 into pangeo-forge:master on Jan 31, 2022