
Support HDF5 compression filter plugins #351

Closed
florianziemen opened this issue Aug 16, 2023 · 11 comments

Comments

@florianziemen
Contributor

HDF5 has a zoo of compression filters. Some of them can be mapped to numcodecs codecs and simply need an entry in the JSON; others will need further effort.
https://portal.hdfgroup.org/display/support/Registered+Filter+Plugins

I've addressed blosc and zstd in #350 (still at an early stage, but I figured it would be good to announce this to avoid duplication of effort).
lz4 (id 32004) and bitshuffle (id 32008) have so far resisted my efforts, and I have not tackled combinations of filters; that's why they currently yield an error message in the MR draft.

Maybe it would be good to use the implementations from hdf5plugin and register them with numcodecs, as done in gribscan. @d70-t - any thoughts?
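For the "simple mapping" cases, registering a codec with numcodecs is pretty lightweight. A rough sketch (not the code from #350; zlib stands in for whatever library backs the HDF5 filter, and the codec_id is made up for illustration) of what gribscan-style registration looks like:

```python
import zlib

import numcodecs
from numcodecs.abc import Codec
from numcodecs.compat import ensure_bytes, ndarray_copy


class MyHDF5Filter(Codec):
    """Illustrative wrapper turning a plain decompress() function into a codec."""

    codec_id = "my_hdf5_filter"  # this id is what the reference JSON points at

    def decode(self, buf, out=None):
        decompressed = zlib.decompress(ensure_bytes(buf))
        return ndarray_copy(decompressed, out)

    def encode(self, buf):
        # kerchunk only needs the read path, so encoding can stay unimplemented
        raise NotImplementedError


numcodecs.register_codec(MyHDF5Filter)
```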

@martindurant
Member

We try to cover the most frequently used HDF5 filters, but given the pluggable nature and big ecosystem of HDF, we will never succeed! See zarr-developers/numcodecs#422 for a discussion of SZip and zarr-developers/numcodecs#412 for the fletcher32 checksum. Some, like SZip, are implemented in imagecodecs or elsewhere.

I don't immediately see how you can get numcodecs classes from hdf5plugin, but it would be good if that worked. Ideally, though, reading HDF5 data via zarr and kerchunk should not depend on HDF5 itself.

I have not tackled combinations of filters

We can get this to work!

@d70-t

d70-t commented Aug 17, 2023

lz4:
lz4 (HDF5 filter id 32004) seems to do blocked compression (see spec and code), whereas numcodecs' lz4 compresses the buffer as a whole.

Unfortunately, the blocking scheme used by HDF5 is also different from the one used by blosc, so we can't use that as a fallback.

On the other hand, the DEFAULT_CHUNK_SIZE is 1 GB, so we can hope (or check during a kerchunk run) that each chunk contains only a single such block; in that case we can update the offsets so that they point to the raw lz4 payload and then use the usual numcodecs codec.

bitshuffle:
The bitshuffle filter (HDF5 id 32008) shuffles bits and additionally handles zstd and lz4 compression. In numcodecs, I believe there is only a standalone (byte-) Shuffle filter, which does something different. However, the bitshuffle library (the one behind the HDF5 filter) provides Python bindings to the actual filter, so it should be possible to register that filter with numcodecs if needed.
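Very roughly (untested, and only a sketch: it assumes bitshuffle.bitshuffle / bitshuffle.bitunshuffle with the usual (array, block_size) signature, and the codec_id and element-size handling are made up), that registration could look like:

```python
import bitshuffle  # Python bindings of the library behind the HDF5 filter
import numpy as np

import numcodecs
from numcodecs.abc import Codec
from numcodecs.compat import ndarray_copy


class Bitshuffle(Codec):
    """Hypothetical numcodecs wrapper around bitshuffle's Python bindings."""

    codec_id = "hdf5_bitshuffle"  # illustrative name only

    def __init__(self, elementsize=4, block_size=0):
        self.elementsize = elementsize  # bytes per element of the original data
        self.block_size = block_size    # 0 lets bitshuffle pick its default

    def decode(self, buf, out=None):
        arr = np.frombuffer(buf, dtype=f"u{self.elementsize}")
        unshuffled = bitshuffle.bitunshuffle(arr, self.block_size)
        return ndarray_copy(unshuffled, out)

    def encode(self, buf):
        arr = np.frombuffer(buf, dtype=f"u{self.elementsize}")
        return bitshuffle.bitshuffle(arr, self.block_size).tobytes()


numcodecs.register_codec(Bitshuffle)
```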

hdf5plugin:
The hdf5plugin Python package seems to me like a tool to describe the parameters and choices of an HDF5 filter chain, but it (surprisingly) doesn't seem to provide any means of calling those plugins from Python. Likely that's enough for h5py, as the plugins aren't called from the Python side anyway, but only inside the wrapped HDF5 library. To make it work with numcodecs, we'd probably have to re-implement the HDF5 plugin mechanism (possible, but maybe neither worth it nor desirable if other methods do the trick).
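For reference, this is the typical hdf5plugin usage as I understand it: importing it registers the bundled compiled plugins with the HDF5 library, and its filter classes just expand to dataset-creation kwargs, so all (de)compression happens inside the HDF5 C library rather than in Python (file name and data here are made up):

```python
import h5py
import hdf5plugin  # importing registers the bundled compiled plugins with HDF5
import numpy as np

with h5py.File("example.h5", "w") as f:
    f.create_dataset(
        "data",
        data=np.arange(100, dtype="float32"),
        **hdf5plugin.Zstd(clevel=3),  # expands to compression/compression_opts
    )

with h5py.File("example.h5") as f:
    print(f["data"][:5])  # decompression happens inside libhdf5, not in Python
```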

@martindurant
Member

cramjam has both blocked and block-free lz4 (as compress/decompress functions, easy to wrap).
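Roughly like this (function names as I remember the cramjam API, worth double-checking):

```python
import cramjam

data = b"some example payload " * 100

# "block-free" lz4, i.e. the LZ4 frame format:
framed = cramjam.lz4.compress(data)
assert bytes(cramjam.lz4.decompress(framed)) == data

# the blocked flavour lives next to it (exact signatures worth double-checking):
# cramjam.lz4.compress_block(...) / cramjam.lz4.decompress_block(...)
```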

the bitshuffle library (the one behind the HDF5 filter) provides Python bindings to the actual filter, so it should be possible to register that filter with numcodecs if needed.

It would be a shame to have to call HDF :(

By the way, blosc has a bitshuffle, but I don't know if it's the same implementation as HDF and whether you can call it in isolation.

@d70-t

d70-t commented Aug 17, 2023

It would be a shame to have to call HDF :(

I agree. (Although here it would "only" be the plugin code, it doesn't seem to be straightforward to get the bitshuffling part on its own, without indirectly depending on HDF5.)

By the way, blosc has a bitshuffle, but I don't know if it's the same implementation as HDF and whether you can call it in isolation.

I don't know for sure, but from the Python API it doesn't look like it's possible to call it in isolation.

@d70-t

d70-t commented Aug 17, 2023

cramjam has both blocked and block-free lz4 (as compress/decompress functions, easy to wrap).

That's nice, but I fear that cramjam's blocked lz4 follows the lz4 block format, which is different from the HDF5-lz4 block format.
It shouldn't be too hard, though, to write a Python (meta-) compressor which adheres to the HDF5 blocking specification for lz4 (it's just a few offset numbers and one or more calls to plain lz4), but that format is probably not used anywhere except in HDF5.

But as I don't believe there are many datasets out there which use chunk sizes larger than 1 GB, the offset trick mentioned above could be more elegant and easier to implement.
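If we ever do need the general case, a rough, untested sketch of such a meta-decompressor could look like this. It follows the layout as I read the linked spec (8-byte big-endian total size, 4-byte big-endian block size, then per block a 4-byte big-endian compressed size followed by the lz4 data), uses python-lz4's block API as the "plain lz4" backend, and assumes a block whose compressed size equals its uncompressed size is stored as-is:

```python
import struct

import lz4.block


def decode_hdf5_lz4(buf: bytes) -> bytes:
    # 8 bytes total uncompressed size, 4 bytes uncompressed size per block
    total_size, block_size = struct.unpack(">QI", buf[:12])
    out = bytearray()
    pos = 12
    while len(out) < total_size:
        (comp_size,) = struct.unpack(">I", buf[pos:pos + 4])
        pos += 4
        chunk = buf[pos:pos + comp_size]
        pos += comp_size
        remaining = min(block_size, total_size - len(out))
        if comp_size == remaining:
            out += chunk  # block stored uncompressed (assumed)
        else:
            out += lz4.block.decompress(chunk, uncompressed_size=remaining)
    return bytes(out)
```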

@martindurant
Member

the lz4 block format, which is different from the HDF5-lz4 block format.

Of course it is - why ever would they be the same?? :)

So yes, the question becomes: what is the minimal amount of work we need to do to support 95% of cases? You are probably right that offsetting is the way to go. I can't immediately see a spec - is it just 8 bytes for the block size?

@d70-t

d70-t commented Aug 21, 2023

I can't immediately see a spec - is it just 8 bytes for the block size?

Sorry, it probably got a bit buried in the links. I believe this should be it. So it should be a 16-byte offset. The 16 bytes before the payload are (big-endian integers):

  • 8 bytes orig_size: total uncompressed size
  • 4 bytes block_size: uncompressed size per block
  • 4 bytes lz4_size_0: compressed size of the first block

So if orig_size == block_size, we should be fine doing the offset trick.
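The check could be as simple as this sketch (a hypothetical helper, not kerchunk code): parse the 16 bytes and, if the whole chunk fits in one block, move the reference past the header so only the plain lz4 payload is left.

```python
import struct


def lz4_payload_reference(offset: int, header: bytes):
    """Return (new_offset, new_length) if the chunk holds exactly one lz4 block."""
    orig_size, block_size, lz4_size_0 = struct.unpack(">QII", header[:16])
    if orig_size > block_size:
        raise ValueError("chunk contains more than one lz4 block")
    # single block: the payload starts right after the 16-byte header
    return offset + 16, lz4_size_0
```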

@martindurant
Member

Sounds good! So all we need is a small test file for CI, and we can go ahead.

@d70-t

d70-t commented Aug 21, 2023

@florianziemen do you have one at hand?

@florianziemen
Contributor Author

In principle, yes. I was on holiday last week and our HPC is on holiday today. I'll look into it tomorrow.

@martindurant
Member

This is probably fixed by #350.
