Nested HDF5 Data / HEC-RAS #490

Open
thwllms opened this issue Aug 8, 2024 · 8 comments

@thwllms commented Aug 8, 2024

I'm working on development of the rashdf library for reading HEC-RAS HDF5 data. A big part of the motivation for the library is stochastic hydrologic/hydraulic modeling.

We want to be able to generate Zarr metadata for stochastic HEC-RAS outputs, so that, e.g., results for many different stochastic flood simulations from a given RAS model can be opened as a single xarray Dataset. For example, results for 100 different simulations could be concatenated along a new simulation dimension, with coordinates being the index number of each simulation. It took me a little while to figure out how to make that happen because RAS HDF5 data is highly nested and doesn't conform to typical conventions.
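
For illustration, the kind of aggregation we're after might look roughly like this (the per-simulation reference file names and the count of 100 are hypothetical; each JSON is assumed to be a Kerchunk reference file for one simulation):

import xarray as xr

# Hypothetical per-simulation Kerchunk reference files.
ref_files = [f"sim_{i:03d}.json" for i in range(100)]

datasets = [
    xr.open_dataset(
        "reference://",
        engine="zarr",
        backend_kwargs={"consolidated": False, "storage_options": {"fo": ref}},
    )
    for ref in ref_files
]

# Concatenate along a new "simulation" dimension, using the index number of
# each simulation as the coordinate.
combined = xr.concat(datasets, dim="simulation")
combined = combined.assign_coords(simulation=range(len(ref_files)))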

The way I implemented it is hacky (a rough sketch follows the list):

  1. Given an xr.Dataset pulled from the HDF file and the path of each child xr.DataArray within the HDF file,
  2. Get the filters for each DataArray: filters = SingleHdf5ToZarr._decode_filters(None, hdf_ds)
  3. Get the storage info for each DataArray: storage_info = SingleHdf5ToZarr._storage_info(None, hdf_ds)
  4. Build out metadata for chunks using storage_info
  5. "Write" the xr.Dataset to a zarr.MemoryStore with compute=False, to generate the framework of what's needed for the Zarr metadata
  6. Read the objects generated by writing to zarr.MemoryStore and decode
  7. Assemble the zarr.MemoryStore objects, filters, and storage_info into a dictionary and finally return
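
A rough, self-contained sketch of those steps (not the exact rashdf code; the function name, the chunk-key assembly, and the metadata handling are simplified assumptions, and merging the decoded filters into each variable's .zarray is omitted):

import zarr
from kerchunk.hdf import SingleHdf5ToZarr

def build_refs(ds, h5f, hdf_paths, hdf_url):
    """Assemble Kerchunk-style references for an xr.Dataset whose variables
    come from the HDF5 datasets named in hdf_paths (variable name -> HDF5 path)."""
    refs = {}

    # Steps 2-4: per-variable filters and chunk byte ranges via the (private)
    # kerchunk helpers, called as unbound methods.
    for var, path in hdf_paths.items():
        hdf_ds = h5f[path]
        filters = SingleHdf5ToZarr._decode_filters(None, hdf_ds)  # not merged here
        storage_info = SingleHdf5ToZarr._storage_info(None, hdf_ds)
        for chunk_idx, info in storage_info.items():
            key = f"{var}/" + ".".join(str(i) for i in chunk_idx)
            refs[key] = [hdf_url, info["offset"], info["size"]]

    # Steps 5-6: write only the metadata skeleton (.zgroup/.zarray/.zattrs) to
    # an in-memory store; in practice the encoding (chunk shape, compressor,
    # filters) must be set to match the HDF5 layout.
    store = zarr.MemoryStore()
    ds.to_zarr(store, compute=False)
    for key, value in store.items():
        if key.rsplit("/", 1)[-1].startswith("."):
            refs[key] = value.decode()

    # Step 7: wrap into a version-1 Kerchunk reference set.
    return {"version": 1, "refs": refs}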

I suppose my questions are:

  • Is there a better way to approach highly nested or otherwise idiosyncratic HDF5 data with Kerchunk?
  • Could Kerchunk's SingleHdf5ToZarr._decode_filters and _storage_info methods be made public?
@martindurant (Member)

Does kerchunk.hdf not already cope with your "idiosyncratic" HDF5 files? A few recent changes were made to target nested trees in HDF, but I'd be interested to know in what manner it fails. Although there are some netCDF-influenced choices, JSON reference output should be fairly immune to those, I think.

The main shortcoming in kerchunk for your files would be combining many of the reference sets into a logical aggregate dataset. It doesn't sound like you are doing that yet.

Could Kerchunk's SingleHdf5ToZarr._decode_filters and _storage_info methods be made public?

This is python, so it doesn't really matter. I certainly didn't anticipate that anyone would want to call them from outside the class.

@thwllms (Author) commented Aug 9, 2024

Does kerchunk.hdf not already cope with your "idiosyncratic" HDF5 files? A few recent changes were made to target nested trees in HDF, but I'd be interested to know in what manner it fails. Although there are some netCDF-influenced choices, JSON reference output should be fairly immune to those, I think.

SingleHdf5ToZarr.translate works, but due to the way the HDF file is structured it's difficult to manage. (Or at least I haven't totally figured it out...) HEC-RAS HDF5 Datasets within the same Group have mismatched and/or unlabeled dimensions. We end up with errors and phony_dims. For example, with this JSON reference:

>>> import h5py
>>> import xarray as xr
>>> from kerchunk.hdf import SingleHdf5ToZarr
>>> ras_h5 = h5py.File("/mnt/c/temp/ElkMiddle.p01.hdf", "r")
>>> zmeta = SingleHdf5ToZarr(ras_h5, "file:///mnt/c/temp/ElkMiddle.p01.hdf").translate()
>>> import json
>>> with open("ElkMiddle.p01.hdf.json", "w") as z:
...     z.write(json.dumps(zmeta, indent=2))
...
722087
>>> ds = xr.open_dataset("reference://", engine="zarr", backend_kwargs={"consolidated": False, "storage_options": {"fo": "ElkMiddle.p01.hdf.json"}})
>>> ds
<xarray.Dataset> Size: 0B
Dimensions:  ()
Data variables:
    *empty*
Attributes:
    File Type:     HEC-RAS Results
    File Version:  HEC-RAS 6.3.1 September 2022
    Projection:    PROJCS["USA_Contiguous_Albers_Equal_Area_Conic_USGS_versio...
    Units System:  US Customary
>>> ds = xr.open_dataset("reference://", engine="zarr", backend_kwargs={"consolidated": False, "storage_options": {"fo": "ElkMiddle.p01.hdf.json"}}, group="/Results/Unsteady/Output/Output Blocks/Base Output/Unsteady Time Series/2D Flow Areas/ElkMid
dle")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/thwllms/dev/scratch/test-kerchunk-ras/venv-test-kerchunk-ras/lib/python3.10/site-packages/xarray/backends/api.py", line 588, in open_dataset
    backend_ds = backend.open_dataset(
  File "/home/thwllms/dev/scratch/test-kerchunk-ras/venv-test-kerchunk-ras/lib/python3.10/site-packages/xarray/backends/zarr.py", line 1188, in open_dataset
    ds = store_entrypoint.open_dataset(
  File "/home/thwllms/dev/scratch/test-kerchunk-ras/venv-test-kerchunk-ras/lib/python3.10/site-packages/xarray/backends/store.py", line 58, in open_dataset
    ds = Dataset(vars, attrs=attrs)
  File "/home/thwllms/dev/scratch/test-kerchunk-ras/venv-test-kerchunk-ras/lib/python3.10/site-packages/xarray/core/dataset.py", line 713, in __init__
    variables, coord_names, dims, indexes, _ = merge_data_and_coords(
  File "/home/thwllms/dev/scratch/test-kerchunk-ras/venv-test-kerchunk-ras/lib/python3.10/site-packages/xarray/core/dataset.py", line 427, in merge_data_and_coords
    return merge_core(
  File "/home/thwllms/dev/scratch/test-kerchunk-ras/venv-test-kerchunk-ras/lib/python3.10/site-packages/xarray/core/merge.py", line 705, in merge_core
    dims = calculate_dimensions(variables)
  File "/home/thwllms/dev/scratch/test-kerchunk-ras/venv-test-kerchunk-ras/lib/python3.10/site-packages/xarray/core/variable.py", line 3009, in calculate_dimensions
    raise ValueError(
ValueError: conflicting sizes for dimension 'phony_dim_1': length 33101 on 'Face Velocity' and length 14606 on {'phony_dim_0': 'Cell Cumulative Precipitation Depth', 'phony_dim_1': 'Cell Cumulative Precipitation Depth'}

The structure is complex. Any developer who has worked with RAS data could rant about it, but ultimately the point is that a helping hand is needed to extract data from RAS HDF5 files into nice xarray objects, hence the rashdf project.

We considered using xarray-datatree for the project but ultimately decided not to use it at this time since a) DataTree seems to be making its way into xarray proper, b) I couldn't figure out how to get Kerchunk and datatree working together, and c) our driving use case is pretty focused on specific data within the files and it seemed simpler to start by tackling those specific needs. Here's a visualized datatree of a typical RAS HDF file.

The main shortcoming in kerchunk for your files would be combining many of the reference sets into a logical aggregate dataset. It doesn't sound like you are doing that yet.

Combining reference sets is not in the rashdf library itself and I'm not sure we plan to add it there, but a logical aggregate is definitely the idea.

Could Kerchunk's SingleHdf5ToZarr._decode_filters and _storage_info methods be made public?

This is python, so it doesn't really matter. I certainly didn't anticipate that anyone would want to call them from outside the class.

Seems like a reasonable assumption. Calling methods with leading underscores feels naughty to me, but if those methods are unlikely to change then maybe we're alright.

thwllms closed this as completed Aug 9, 2024
@thwllms (Author) commented Aug 9, 2024

Sorry, didn't mean to close this.

thwllms reopened this Aug 9, 2024
@martindurant (Member)

That truly is a gnarly data hierarchy. I wonder whether where kerchunk gets confused is with groups that have both child group(s) and array(s). The zarr model does support that, but I wouldn't be too surprised if we have extra hidden assumptions.

Writing specialised versions of the file scanner for a well-understood use case is of course fine, and I can try to help however I can.

Is there a better way to approach highly nested or otherwise idiosyncratic HDF5 data with Kerchunk?

It might be worthwhile figuring out what exactly is going wrong with pure kerchunk, but this doesn't touch most current users, as there is a lot of netCDF-compliant data, which is far simpler in structure. If you make interesting reference sets with your code, I'd be happy to see them and even better would be a blog post about your process :)

@TomNicholas commented Oct 2, 2024

The main shortcoming in kerchunk for your files would be combining many of the reference sets into a logical aggregate dataset.

Using VirtualiZarr should make this part simpler to handle :)
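
A minimal sketch of that route, assuming one HEC-RAS HDF5 output per simulation (the file names and the concat keyword choices are illustrative):

import virtualizarr
import xarray as xr

# Open each simulation's output as a "virtual" dataset of chunk references.
vds_list = [virtualizarr.open_virtual_dataset(f"sim_{i:03d}.p01.hdf") for i in range(100)]

# Combine along a new "simulation" dimension; only references are manipulated,
# no chunk data is read.
combined = xr.concat(vds_list, dim="simulation", coords="minimal", compat="override")

# Persist the combined references as Kerchunk JSON for later opening.
combined.virtualize.to_kerchunk("combined_refs.json", format="json")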

We considered using xarray-datatree for the project but ultimately decided not to use it at this time since a) DataTree seems to be making its way into xarray proper, b) I couldn't figure out how to get Kerchunk and datatree working together, and c) our driving use case is pretty focused on specific data within the files and it seemed simpler to start by tackling those specific needs.

That all makes sense. FYI, soon we should have DataTree released in xarray main, and then making DataTree work with virtualizarr should be very straightforward (it should actually work out of the box, it's just not super useful until you have tree-aware reference readers and writers).

Given an xr.Dataset pulled from the HDF file and the path of each child xr.DataArray within the HDF file

That's nasty 😆 But if you already have a (hacky) way of generating kerchunk references on a per-variable basis from an xarray dataset, it should be pretty straightforward to convert that into a virtualizarr "virtual" xarray dataset. You're basically just calling the internal function virtualizarr.readers.kerchunk.variable_from_kerchunk_refs on every variable in your dataset.
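
For example, something along these lines (a sketch only; the exact signature of the internal helper is an assumption and may differ between virtualizarr versions, e.g. it may also require the virtual array class):

import xarray as xr
from virtualizarr.readers.kerchunk import variable_from_kerchunk_refs

def virtual_dataset_from_refs(refs, var_names):
    # Build one "virtual" xr.Variable per variable from the Kerchunk reference
    # dict, then wrap them into a Dataset.
    virtual_vars = {name: variable_from_kerchunk_refs(refs, name) for name in var_names}
    return xr.Dataset(virtual_vars)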

@martindurant (Member)

@TomNicholas, do you think the time has come to use Vzarr to do all that MultiZarrToZarr can do? Basically, the existing code in MZZ for finding the coordinates (the "mappers") for each input dataset is still useful, but then building the coordinates and outputting the references need not be duplicated.

For cases where there is uneven chunking in the concat dimension(s), we can come up with the best way to represent that output, whichever manifest works.

kerchunk could simply depend on VZarr.

@TomNicholas

do you think the time has come to use Vzarr to do all that MultiZarrToZarr can do?

IMO it's very close.

Basically, the existing code in MZZ for finding the coordinates (the "mappers") for each input dataset is still useful

If you're willing to inline the indexed coordinates, then I think this is possible with VZ right now too, see zarr-developers/VirtualiZarr#18 (comment).
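
For reference, a minimal sketch of inlining an indexed coordinate when opening with VZ (the coordinate name "time" is hypothetical):

import virtualizarr

# loadable_variables asks virtualizarr to load these variables into memory
# (so they become real, indexable coordinates) instead of keeping them as
# chunk references.
vds = virtualizarr.open_virtual_dataset("ElkMiddle.p01.hdf", loadable_variables=["time"])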

kerchunk could simply depend on VZarr.

Currently we have the opposite: VirtualiZarr depends on kerchunk (for reading, and for writing to kerchunk json/parquet). I plan to make this dependency optional though - the codebase is already factored in such a way that this is effectively optional, it's only the tests that still mix concerns.

need not be duplicated.

It really depends what you want to do with MZZ. In my head, MZZ is effectively already deprecated in favour of using vz and xarray to achieve the same combining operations. You could follow that and literally deprecate MZZ, leaving the kerchunk readers for use in other packages, especially to be called from virtualizarr.readers. Or you might decide that MZZ is too widely used to be deprecated. You could try to rewrite it in terms of VZ, but users would then end up with very weird workflows that jump back and forth between representations.

For cases where there is uneven chunking in the concat dimension(s), we can come up with the best way to represent that output, whichever manifest works.

This is an orthogonal issue. Once uneven chunking is supported in the zarr spec, both generalizing VZ's ManifestArray and generalizing kerchunk's in-memory references dict are possible.

@martindurant (Member)

If you're willing to inline the indexed coordinates, then I think this is possible with VZ right now too, see zarr-developers/VirtualiZarr#18 (comment).

We'd be figuring out the coordinates and putting them somewhere, wherever is most convenient. In many cases, it would be a single value per input data set.

As I've said before, I'm indifferent to where the functionality ends up being implemented, so long as it exists! I think that being able to persist kerchunk references (or other manifests) is critical, though, so that a user needs only to open a thing without worrying about further options - I think everything is already in place for this to work.
