Zarr reader #271


Merged (170 commits) on Apr 24, 2025

Conversation

@norlandrhagen (Collaborator) commented Oct 24, 2024

WIP PR to add a Zarr reader. Thanks to @TomNicholas for the how to write a reader guide.

  • Closes Add Zarr Reader(s) #262
  • Tests added
  • Tests passing
  • Full type hint coverage
  • Changes are documented in docs/releases.rst
  • Optimizations (e.g. using the async interface to list the lengths of chunks for each variable concurrently)
  • New functionality has documentation
  • Read v3 Zarr

Future PR(s):

  • Read v2 Zarr
  • sharded v3 data

To Do:

  • Open the store using zarr-python v3 (behind a protected import). This should handle both v2 and v3 stores for us.
  • Use zarr-python to list the variables in the store, and check that all loadable_variables are present

For each virtual variable:
  • Use zarr-python to get the attributes and the dimension names, and coordinate names (which come from the .zmetadata or zarr.json)
  • Use zarr-python to also get the dtype and chunk grid info + everything else needed to create the virtualizarr.zarr.ZArray object (eventually we can skip this step and use a zarr-python array metadata class directly instead of virtualizarr.zarr.ZArray, see
    Replace VirtualiZarr.ZArray with zarr ArrayMetadata #175)
  • Use the knowledge of the store location, variable name, and the zarr format to deduce which directory / S3 prefix the chunks must live in.
  • List all the chunks in that directory using fsspec.ls(detail=True), as that should also return the nbytes of each chunk. Remember that chunks are allowed to be missing.
  • The offset of each chunk is just 0 (ignoring sharding for now), and the length is the file size fsspec returned. The paths are just all the paths fsspec listed.
  • Parse the path and length information returned by fsspec into the structure that we can pass to ChunkManifest.init
  • Create a ManifestArray from our ChunkManifest and ZArray
  • Wrap that ManifestArray in an xarray.Variable, using the dims and attrs we read before
  • Get the loadable_variables by just using xr.open_zarr on the same store (should use drop_variables to avoid handling the virtual variables that we already have).
  • Use separate_coords to set the correct variables as coordinate variables (and avoid building indexes whilst doing it)
  • Merge all the variables into one xr.Dataset and return it.
  • All the above should be wrapped in a virtualizarr.readers.zarr.open_virtual_dataset function, which then should be called as a method from a ZarrVirtualBackend(VirtualBackend) subclass.
  • Finally add that ZarrVirtualBackend to the list of readers in virtualizarr.backend.py
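The chunk-listing steps above (list the chunk prefix with fsspec, take offset 0 and the returned nbytes) can be sketched roughly like this. The store path, variable name, and chunk contents are made up for illustration, using fsspec's in-memory filesystem; the real reader would run against the actual store location and feed the resulting dict to ChunkManifest:

```python
import fsspec

# Hypothetical in-memory layout standing in for a real store location:
# one variable "air" with two Zarr v3-style chunks under the "c/" prefix.
fs = fsspec.filesystem("memory")
fs.pipe("/store/air/c/0/0", b"\x00" * 128)
fs.pipe("/store/air/c/0/1", b"\x00" * 64)  # chunk sizes vary; chunks may also be missing

# List every chunk under the variable's chunk prefix; detail=True gives nbytes.
manifest = {}
for path, info in fs.find("/store/air/c", detail=True).items():
    # Chunk key like "0.0", built from the path components after the "c/" prefix.
    key = ".".join(path.split("/c/")[-1].split("/"))
    # Offset is always 0 (ignoring sharding); length is the size fsspec returned.
    manifest[key] = {"path": path, "offset": 0, "length": info["size"]}

print(manifest)  # ready to hand to something like ChunkManifest(entries=manifest)
```

Missing chunks simply never appear in the listing, so they are naturally absent from the manifest, which is what the manifest format expects.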

@norlandrhagen (Collaborator, Author):

#273

@norlandrhagen (Collaborator, Author):

Bit of an update: with help from @sharkinsspatial, @abarciauskas-bgse, and @maxrjones, I got a Zarr store loaded as a virtual dataset.

<xarray.Dataset> Size: 3kB
Dimensions:  (time: 10, lat: 9, lon: 18)
Coordinates:
    lat      (lat) float32 36B ManifestArray<shape=(9,), dtype=float32, chunk...
    lon      (lon) float32 72B ManifestArray<shape=(18,), dtype=float32, chun...
  * time     (time) datetime64[ns] 80B 2013-01-01 ... 2013-01-03T06:00:00
Data variables:
    air      (time, lat, lon) int16 3kB ManifestArray<shape=(10, 9, 18), dtyp...
Attributes:
    Conventions:  COARDS
    title:        4x daily NMC reanalysis (1948)
    description:  Data is from NMC initialized reanalysis\n(4x/day).  These a...
    platform:     Model
    references:   http://www.esrl.noaa.gov/psd/data/gridded/data.ncep.reanaly...

Next up is how to deal with fill_values.

When I try to write it to Kerchunk JSON, I’m running into some fill_value dtype issues in the Zarray.

ZArray(shape=(10,), chunks=(10,), dtype='<f4', fill_value=np.float32(nan), order='C', compressor=None, filters=None, zarr_format=2)

Where fill_value=np.float32(nan). When I try to write these to JSON via ds.virtualize.to_kerchunk(format="dict"), I get TypeError: np.float32(nan) is not JSON serializable.

Wondering how fill_values like np.float32(nan) should be handled.

There seems to be some conversion logic in @sharkinsspatial's HDF reader for converting fill_values. It also looks like there is some fill_value handling in zarr.py.
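One possible way to handle this (a sketch, not necessarily the PR's actual fix): follow the Zarr v2 metadata convention of encoding non-finite float fill values as the strings "NaN", "Infinity", and "-Infinity", and cast numpy scalars to plain Python types before json.dumps. The encode_fill_value helper name is made up here:

```python
import json
import math

import numpy as np

def encode_fill_value(fv):
    # Hypothetical helper: convert a numpy fill_value into something
    # json.dumps accepts. Zarr v2 .zarray metadata conventionally stores
    # non-finite floats as the strings "NaN", "Infinity", "-Infinity".
    if fv is None:
        return None
    if isinstance(fv, (float, np.floating)):
        if math.isnan(fv):
            return "NaN"
        if math.isinf(fv):
            return "Infinity" if fv > 0 else "-Infinity"
        return float(fv)
    if isinstance(fv, np.integer):
        return int(fv)
    return fv

print(json.dumps({"fill_value": encode_fill_value(np.float32("nan"))}))
# -> {"fill_value": "NaN"}
```

Kerchunk readers already understand the string form, since it is what the v2 spec prescribes for NaN fill values.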

@TomNicholas (Member):

> I got a Zarr loaded as a virtual dataset.

Amazing!

> When I try to write it to Kerchunk JSON, I’m running into some fill_value dtype issues in the Zarray.
>
> ZArray(shape=(10,), chunks=(10,), dtype='<f4', fill_value=np.float32(nan), order='C', compressor=None, filters=None, zarr_format=2)
>
> Where fill_value=np.float32(nan). When I try to write these to JSON via ds.virtualize.to_kerchunk(format="dict"), I get TypeError: np.float32(nan) is not JSON serializable.
>
> Wondering how fill_values like np.float32(nan) should be handled.

This seems like an issue that should actually be orthogonal to this PR (if it weren't for the ever-present difficulty of testing). Either the problem is in the ZArray class and what types it allows, or it's in the Kerchunk writer not knowing how to serialize a valid ZArray. Either way, if np.float32(nan) is a valid fill_value for a zarr array, then it's not the fault of the new zarr reader.

@TomNicholas (Member) left a review comment:

This is a great start! I think the main thing here is that we don't actually need kerchunk in order to test this reader.

@TomNicholas (Member):

> get chunk size with zarr-python (zarr-developers/zarr-python#2426) instead of fsspec

I think we should just do this in this PR. We can point to Tom's PR for now in the CI, but I expect that will get merged before this does anyway. If you look at Tom's implementation it's basically what we're doing here.

@TomNicholas (Member):

Store.getsize was just merged upstream (zarr-developers/zarr-python#2426), so we can possibly just use the same upstream zarr-python env we are already using.

* Use ManifestStore in Zarr reader

* Update virtualizarr/readers/zarr.py

Co-authored-by: Raphael Hagen <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Co-authored-by: Raphael Hagen <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
@TomNicholas (Member) left a review:

Thanks for your patience @norlandrhagen , and for your help @maxrjones .

I believe I spotted one small bug, but otherwise this looks great.

The implementation is very neat now!

@@ -77,7 +77,7 @@ vds.virtualize.to_icechunk(icechunkstore)

 ### I already have some data in Zarr, do I have to resave it?

-No! VirtualiZarr can (well, [soon will be able to](https://github.com/zarr-developers/VirtualiZarr/issues/262)) create virtual references pointing to existing Zarr stores in the same way as for other file formats.
+No! VirtualiZarr can create virtual references pointing to existing Zarr stores in the same way as for other file formats. Note: Currently only reading Zarr V3 is supported.
Review comment (Member):

Do we have an issue to track learning to read zarr v2?

@norlandrhagen (Collaborator, Author):

Co-authored-by: Tom Nicholas <[email protected]>
@norlandrhagen norlandrhagen merged commit ff1ddb4 into develop Apr 24, 2025
12 checks passed
@norlandrhagen norlandrhagen deleted the zarr_reader branch April 24, 2025 19:12
Labels
readers zarr-python Relevant to zarr-python upstream