Opener refactor #245

Closed
wants to merge 14 commits into from

Conversation

rabernat
Contributor

Eventually fixes #242.

Goes on top of #238.

@rabernat
Contributor Author

rabernat commented Dec 17, 2021

As I mentioned at our latest meeting, I'm having some challenges around this refactor because our current "opener" mixes together file opening and caching. Caching is essentially a side effect which obscures the flow of data.

I am starting to wonder if we should continue to refactor things so that data only flows in one direction.

Imagine if we could create Recipes like this, in extreme pseudocode:

source = Source(file_pattern, opener, **options)
destination = ZarrDestination(storage_target, **options)

recipe = Recipe(source, destination)

This would create a one-way flow of data from source ➡️ destination.

Now, to add caching, we could do the following:

cache = FileCacheDestination(cache_target)

recipe = Recipe(source, cache, destination)

This would create a one-way flow of data from source ➡️ cache ➡️ destination.
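
One way to read the pseudocode above, as a minimal sketch: every stage exposes a `process` method that consumes the items from the previous stage and yields items for the next one. Everything here is a hypothetical stand-in (the `process` method, the `run` method, and the in-memory dicts used in place of real files or a Zarr store); it is not the pangeo-forge-recipes API.

```python
from typing import Any, Iterable, Iterator, Tuple

Item = Tuple[str, Any]


class Source:
    """Hypothetical source: yields one (key, payload) pair per input."""

    def __init__(self, file_pattern: Iterable[str]):
        self.file_pattern = list(file_pattern)

    def process(self, upstream: Iterable[Item] = ()) -> Iterator[Item]:
        # A source has no upstream; it only produces items.
        for name in self.file_pattern:
            yield name, f"contents of {name}"  # stand-in for opening a file


class FileCacheDestination:
    """Hypothetical cache: stores each item, then passes it downstream."""

    def __init__(self, cache_target: dict):
        self.cache_target = cache_target

    def process(self, upstream: Iterable[Item]) -> Iterator[Item]:
        for key, payload in upstream:
            self.cache_target[key] = payload
            yield key, payload


class ZarrDestination:
    """Hypothetical final stage: writes everything, yields a single item."""

    def __init__(self, storage_target: dict):
        self.storage_target = storage_target

    def process(self, upstream: Iterable[Item]) -> Iterator[Item]:
        for key, payload in upstream:
            self.storage_target[key] = payload
        yield "store", self.storage_target


class Recipe:
    """Chains stages so data only flows from left to right."""

    def __init__(self, *stages):
        self.stages = stages

    def run(self) -> list:
        stream: Iterable[Item] = ()
        for stage in self.stages:
            stream = stage.process(stream)
        return list(stream)
```

In this sketch, caching is an ordinary stage in the chain rather than a side effect of opening, so removing `cache` from the `Recipe(...)` call changes nothing else.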

This implies that any valid "destination" would also be a valid "source" for the next stage. In theory, we could keep chaining steps together to build more complicated pipelines. For example:

combine = CombineBlocks(time=10)

recipe = Recipe(source, cache, combine, destination)
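
As a rough illustration of what an intermediate stage like this might do, the hypothetical `CombineBlocks` below groups consecutive upstream items into fixed-size blocks. Reading `time=10` as "ten consecutive items per block" is an assumption, as is the `process` method name; neither is defined in this thread.

```python
from itertools import islice
from typing import Any, Iterable, Iterator, List


class CombineBlocks:
    """Hypothetical stage: batches upstream items into lists of length `time`."""

    def __init__(self, time: int):
        if time < 1:
            raise ValueError("time must be a positive integer")
        self.time = time

    def process(self, upstream: Iterable[Any]) -> Iterator[List[Any]]:
        iterator = iter(upstream)
        while True:
            # Take up to `time` items; the last block may be shorter.
            block = list(islice(iterator, self.time))
            if not block:
                return
            yield block
```

For instance, feeding 25 items through `CombineBlocks(time=10)` would produce three blocks, so the number of steps seen by the next stage differs from the number of inputs.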

I'm just starting to think through the implications of this model. A key question is: what is the basic interface for one of the stages in a recipe? What methods and attributes are implemented by all of source, cache, combine, and destination?

Possible answers:

  • Each has to have something equivalent to a FilePattern which allows the next stage to effectively iterate through a known number of steps. For the final step, it would just be a single item, the Zarr store itself.
  • Each stage has to know what type of thing it will produce: a file / url, an Xarray dataset, etc.
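
The two possible answers above could be sketched as a structural interface. This is only an illustration under stated assumptions: the `Stage` protocol, its `keys` / `get` methods, the `output_type` attribute, and the two example classes are all hypothetical names, not existing pangeo-forge-recipes code.

```python
from typing import Any, Hashable, Iterable, Protocol, runtime_checkable


@runtime_checkable
class Stage(Protocol):
    """Hypothetical common interface for every stage in a recipe."""

    # What kind of thing this stage produces (a URL string, a dataset, ...).
    output_type: type

    def keys(self) -> Iterable[Hashable]:
        """A known, finite set of steps, playing the role of a FilePattern."""
        ...

    def get(self, key: Hashable) -> Any:
        """Produce the item for one step."""
        ...


class UrlSource:
    """Hypothetical first stage: one step per URL."""

    output_type = str

    def __init__(self, urls: Iterable[str]):
        self.urls = list(urls)

    def keys(self) -> Iterable[Hashable]:
        return range(len(self.urls))

    def get(self, key: Hashable) -> Any:
        return self.urls[key]


class SingleStore:
    """Hypothetical final stage: exactly one step, the store itself."""

    output_type = dict

    def __init__(self, upstream: Stage):
        # Only the immediately preceding stage is referenced.
        self.upstream = upstream

    def keys(self) -> Iterable[Hashable]:
        return ["store"]

    def get(self, key: Hashable) -> Any:
        return {k: self.upstream.get(k) for k in self.upstream.keys()}
```

Here the final stage reads only from its immediate upstream through `keys` and `get`, which is one way a stage might ignore everything further back in the chain.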

Can each stage effectively ignore everything other than the most proximate previous stage?

@TomNicholas

Having spoken to @cisaacstern, I think that this refactor might be needed in order for @RobertPincus to use datatree to open all the groups in his NASA data in one function call.

@cisaacstern
Member

This has been superseded by the beam refactor, so closing.

Cool to see how this work informed the opener transforms there!

Development

Successfully merging this pull request may close these issues.

Make "inputs" concept more general in order to accommodate data access from APIs
3 participants