Using the Zarr library to read HDF5 #1645
Replies: 21 comments
-
I like the sound of the …
-
This recent blog post by @RPrudden gives some very pertinent suggestions: …
-
Hi @rsignell-usgs, thanks a lot for posting, nice proof of concept.
-
Thanks @alimanfoo! As @RPrudden pointed out to me, probably the neatest thing here is a mechanism for allowing existing community formats to be used efficiently via the Zarr library. I've got a draft Medium post, in case folks are interested in commenting.
-
Moving a question from Gitter here: what would it take to do this type of operation for HDF5 files without generating the external `.zmetadata` file?
-
@ajelenak can provide a better answer, but my understanding is that it would be pretty complicated to make the HDF5 library have the functionality we demonstrated here. That's what made using the Zarr library very convenient! I know there are several annoying steps in the current workflow that could be improved upon. We could imagine computing the augmented `.zmetadata` … Do folks have other ideas about how this could be made more user friendly?
-
When the binary file is first created? Or when the binary data is being read? I'm just hoping I can read GOES-R ABI NetCDF4 data from Amazon/Google without having to convince NOAA to add the augmented `.zmetadata` …
-
The HDF5 library has the S3 virtual file driver (released in v1.10.6, I think) that enables access to HDF5 files in S3. You can also use h5py without this virtual file driver by handing it a Python file-like object; I have an example notebook showing how to do that. Neither of these methods is optimized: both make frequent requests for small amounts of file content just to figure out where the chunks are located in the file.

To answer @rsignell-usgs: the HDF5 library would need some form of the very similar information to what is in the augmented `.zmetadata`. One solution to this problem now could be that someone sets up a Lambda function that will generate the …
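For anyone wanting to try the file-like-object route, here is a minimal sketch (not the linked notebook itself): it opens a NetCDF4/HDF5 object in S3 through `fsspec` and hands the file-like object to `h5py`. The bucket/key below is a hypothetical example reconstructed from the GOES file name mentioned later in this thread.

```python
import fsspec
import h5py

# Hypothetical public GOES-16 object; treat the key as an example only.
url = "s3://noaa-goes16/ABI-L2-SSTF/2020/041/21/OR_ABI-L2-SSTF-M6_G16_s20200412100059_e20200412159366_c20200412206173.nc"

# Anonymous S3 access via s3fs; every h5py read turns into one or more
# byte-range requests against the object.
with fsspec.open(url, mode="rb", anon=True) as f:
    with h5py.File(f, mode="r") as h5:
        sst = h5["SST"]
        print(sst.shape, sst.dtype, sst.chunks)
```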
-
@djhoese, you don't have to convince NOAA to create the augmented `.zmetadata` files …
-
I thought I needed the entire input file to properly generate the `.zmetadata`.
-
@djhoese, you just need to extract the metadata from the existing GOES NetCDF4 files and stick it in another file. Then you reference both files when you read, as in our example notebook.
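To make that "extract the metadata" step concrete, here is a hedged sketch that walks a NetCDF4/HDF5 file with h5py's chunk-query API (available with HDF5 >= 1.10.5 and recent h5py) and records each chunk's byte offset and size. The function name and output layout are illustrative, not the actual code from the linked notebook.

```python
import h5py

def extract_chunk_index(h5_path: str, dataset_name: str) -> dict:
    """Return {'<dataset>/<i>.<j>': (byte_offset, nbytes)} for every stored chunk."""
    index = {}
    with h5py.File(h5_path, "r") as f:
        dset = f[dataset_name]
        for i in range(dset.id.get_num_chunks()):
            info = dset.id.get_chunk_info(i)
            # Build a Zarr-style chunk key: logical chunk coordinates joined with '.'
            key = ".".join(str(c // s) for c, s in zip(info.chunk_offset, dset.chunks))
            index[f"{dataset_name}/{key}"] = (info.byte_offset, info.size)
    return index

# Usage (hypothetical local copy of a GOES file):
# index = extract_chunk_index("OR_ABI-L2-SSTF-....nc", "SST")
```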
-
I just read the notebook @ajelenak linked to. That makes it clearer. When the Python file-like object from fsspec is passed to h5py, …
-
You are correct.
-
@djhoese, I'm afraid that if you are after efficient reading of the GOES data, we have some more work to do. I downloaded a sample file and the chunks are tiny:

```python
import xarray as xr

file = '/users/rsignell/downloads/OR_ABI-L2-SSTF-M6_G16_s20200412100059_e20200412159366_c20200412206173.nc'
ds = xr.open_dataset(file)
print(ds['SST'].encoding)
```

The reported encoding shows that each chunk is only 216 × 216 Int16 values, so 216 × 216 × 2 / 1e6 ≈ 0.1 MB, and that's before compression! @ajelenak had some ideas that when we encounter tiny chunks we might create "meta chunks" to make the S3 byte-range requests and Dask jobs bigger. But that would require more thought and effort...
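For reference, the back-of-the-envelope chunk-size arithmetic as a snippet (chunk shape and dtype taken from the encoding above):

```python
import numpy as np

# Uncompressed size of one SST chunk: 216 x 216 int16 values.
chunk_shape = (216, 216)                      # from ds['SST'].encoding['chunksizes']
itemsize = np.dtype("int16").itemsize         # 2 bytes per value
chunk_mb = np.prod(chunk_shape) * itemsize / 1e6
print(f"{chunk_mb:.2f} MB per chunk before compression")   # ~0.09 MB
```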
-
For those interested, the Medium blog post on this work.
-
+1 I think this is awesome and it will be particularly beneficial for the Australian Ocean Data Network, which has a vast quantity of data stored in NetCDF format in AWS S3. How is it looking with regard to incorporating a `FileChunkStore` and a convenience function to generate the chunk metadata into the main development branch?
-
Just a quick update here: the h5py library now includes support for the ros3 driver, so potentially one could use that to create the `.zmetadata` …
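A minimal sketch of what that could look like, assuming h5py was built against an HDF5 library with the ros3 (read-only S3) driver enabled; the URL below is a hypothetical public object, and private buckets would additionally need the driver's credential keywords.

```python
import h5py

# Hypothetical public GOES object accessed over HTTPS via the ros3 driver.
url = "https://noaa-goes16.s3.amazonaws.com/ABI-L2-SSTF/2020/041/21/some_file.nc"

with h5py.File(url, mode="r", driver="ros3") as f:
    sst = f["SST"]
    print(sst.shape, sst.dtype, sst.chunks)
```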
-
I now have some 200K smallish HDF5-type files to deal with (e.g. https://geoh5py.readthedocs.io/en/stable/), so an 'automatic' metadata creator sounds handy. @rsignell-usgs @rabernat, any idea where you would start?
-
There is also this now, which could help: https://github.com/fsspec/kerchunk
-
Thanks @satra! Yes, https://github.com/fsspec/kerchunk should help -- version 0.0.6 now allows not only merging files along a dimension, but also merging variables together! Give it a shot, @RichardScottOZ, and raise an issue on https://github.com/fsspec/kerchunk/issues; we'll try to help out and perhaps improve the docs!
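A rough sketch of that kerchunk workflow, assuming recent kerchunk/fsspec/xarray versions; the bucket and key below are hypothetical.

```python
import fsspec
import xarray as xr
from kerchunk.hdf import SingleHdf5ToZarr

url = "s3://some-bucket/some_goes_file.nc"     # hypothetical NetCDF4/HDF5 object
so = dict(mode="rb", anon=True)

# 1. Scan the HDF5 file once and build a reference set: Zarr metadata plus
#    byte offsets/lengths for every chunk in the original file.
with fsspec.open(url, **so) as f:
    refs = SingleHdf5ToZarr(f, url).translate()

# 2. Read it back through Zarr/xarray via fsspec's "reference" filesystem;
#    the chunk bytes still come straight from the original object in S3.
mapper = fsspec.get_mapper(
    "reference://", fo=refs, remote_protocol="s3", remote_options={"anon": True}
)
ds = xr.open_dataset(mapper, engine="zarr", backend_kwargs={"consolidated": False})
print(ds)

# kerchunk.combine.MultiZarrToZarr can then merge many such reference sets
# along a dimension (e.g. time) into a single virtual dataset.
```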
-
Thanks Rich, will do, having a look now.
-
The USGS contracted The HDF Group to do a test:

> Could we make the HDF5 format as performant on the Cloud as the Zarr format by writing the HDF5 chunk locations into `.zmetadata` and then having the Zarr library read from those chunks instead of Zarr format chunks?

From our first test the answer appears to be YES: https://gist.github.com/rsignell-usgs/3cbe15670bc2be05980dec7c5947b540

We modified both the `zarr` and `xarray` libraries to make that notebook possible, adding the `FileChunkStore` concept. The modified libraries are listed here: https://github.com/rsignell-usgs/hurricane-ike-water-levels/blob/zarr-hdf5/binder/environment.yml#L20-L21

Feel free to try running the notebook yourself. (If you run into a `stream is closed` error while computing the max of the Zarr data, just run the cell again; I'm trying to figure out why that error occurs sometimes.)
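To make the `FileChunkStore` idea concrete, here is a minimal, illustrative sketch of a read-only Zarr (v2) store that serves metadata keys from an augmented `.zmetadata`-style dict and resolves chunk keys to byte ranges in the original HDF5 file. It is not the implementation from the fork linked above, and the metadata/index layout is assumed for illustration.

```python
from collections.abc import MutableMapping

class ByteRangeChunkStore(MutableMapping):
    """Map Zarr chunk keys like 'SST/0.0' to (offset, length) ranges in one data file."""

    def __init__(self, metadata: dict, chunk_index: dict, fileobj):
        self._meta = metadata        # e.g. .zgroup/.zarray/.zattrs entries (bytes values)
        self._index = chunk_index    # {'SST/0.0': (byte_offset, nbytes), ...}
        self._f = fileobj            # open binary file-like object (local or fsspec/S3)

    def __getitem__(self, key):
        if key in self._meta:        # metadata keys come from the augmented .zmetadata
            return self._meta[key]
        offset, length = self._index[key]
        self._f.seek(offset)
        return self._f.read(length)  # raw (possibly compressed) HDF5 chunk bytes

    def __setitem__(self, key, value):
        raise NotImplementedError("read-only store")

    def __delitem__(self, key):
        raise NotImplementedError("read-only store")

    def __iter__(self):
        yield from self._meta
        yield from self._index

    def __len__(self):
        return len(self._meta) + len(self._index)
```

With a store like this, `zarr.open_group(store)` would read the raw HDF5 chunk bytes directly, and Zarr's codec pipeline (as declared in the copied `.zarray` metadata) would handle decompression.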