Support Zlib compression and shuffle filter for local storage #119
Conversation
This change adds support for compressed and filtered data for local storage. Data in S3 will be addressed separately. The compression and filters arguments passed to reduce_chunk are actually numcodecs.abc.Codec instances, so we can use them as a black box to decode the compression or filter. Currently we are testing Zlib compression algorithm as well as the HDF5 byte shuffle filter. It's possible that other compression algorithms and filters will "just work" due to using the numcodecs.abc.Codec interface to decode the data, but they have not been tested. Closes: #118
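For context, here is a minimal sketch (not the project's exact code; decode_chunk and the codec choices are illustrative) of how numcodecs Codec instances can be used as a black box to reverse the HDF5 filter pipeline on a raw chunk:

    import numpy as np
    from numcodecs import Shuffle, Zlib

    def decode_chunk(chunk_bytes, compression, filters, dtype, shape):
        """Reverse the HDF5 write pipeline: decompress, then undo filters in reverse order."""
        data = chunk_bytes
        if compression is not None:
            data = compression.decode(data)
        for codec in reversed(filters or []):
            data = codec.decode(data)
        return np.frombuffer(data, dtype=dtype).reshape(shape)

    # Round-trip demonstration: shuffle then compress, as HDF5 would do on write ...
    compression = Zlib(level=1)
    filters = [Shuffle(elementsize=8)]
    original = np.arange(32, dtype="<f8")
    chunk_bytes = compression.encode(filters[0].encode(original))
    # ... then decode the raw chunk back into an array.
    decoded = decode_chunk(chunk_bytes, compression, filters, "<f8", (32,))
    assert np.array_equal(original, decoded)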
Codecov Report

@@            Coverage Diff             @@
##             main     #119      +/-   ##
==========================================
+ Coverage   85.52%   85.66%   +0.13%
==========================================
  Files           8        8
  Lines         539      544       +5
==========================================
+ Hits          461      466       +5
  Misses         78       78

☔ View full report in Codecov by Sentry.
very cool! Many thanks @markgoddard 🍺 I'd not merge this just yet though, for two reasons: I'd be happy if @bnlawrence had a quick look at it too, and I reckon it's a good idea to test with a couple of actual compressed netCDF4 files from, say, a CMIP5/6 archive, not just synthetic data produced on the fly. I can plop those in here 👍
@@ -8,6 +8,8 @@ def reduce_chunk(rfile, offset, size, compression, filters, missing, dtype, shap

rfile - the actual file with the data
offset, size - where and what we want ...
compression - optional `numcodecs.abc.Codec` compression codec
Do we really want to pass an actual codec instance? I'd rather have a complete separation of concerns and pass an identifier for a particular codec, just as is done in the NetCDF API ... after all, this is our API layer?
It's a fair point. I was quite surprised to see an actual codec instance being passed through, although it did make the implementation trivial.
Would it make sense to align the S3 active storage API parameters and the storage driver parameters used here?
Currently I'm using a lower-case string to describe the (de)compression algorithm. I need to check that no parameters are required for decompression using other algorithms. Filters may need parameters. Shuffle needs to know the data type size, although we do already have that info elsewhere.
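For example, on the server side a hypothetical identifier-based scheme could rebuild the shuffle filter from a lower-case string plus the element size we already know from the dtype (the function name here is illustrative, not the current API):

    import numpy as np
    from numcodecs import Shuffle, Zlib

    def codec_from_identifier(name, dtype):
        # Hypothetical mapping from a lower-case identifier to a numcodecs codec.
        if name == "zlib":
            return Zlib()
        if name == "shuffle":
            # Byte shuffle needs the element size of the data type being decoded.
            return Shuffle(elementsize=np.dtype(dtype).itemsize)
        raise ValueError(f"unsupported codec: {name}")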
It'd be good to know how other communities are handling this. How does Zarr do it? We should find out.
In terms of the Zarr Python API, it looks like numcodecs is the interface: https://zarr.readthedocs.io/en/stable/api/codecs.html
I believe it was originally part of zarr, then got extracted into a separate library.
OK, that's not so helpful I guess as they have the entire block in memory client-side so they can put it all in the stack. Our problem of course is that the client and server are not on the same stack, so we have to go through an API interface. Where we put that is of course a moot point.
I'd say that the main API to get right is that of the active storage server. It's JSON based, so can't accept a Python object, but in my current implementation I'm just grabbing the codec_id
from the numcodecs Codec object, which happens to match my API.
If you want this reduce_chunk interface to form a driver API for this library then it would also be important to keep it simple and stable, although currently all "drivers" are in-tree in this repo.
Is this something we should iterate on from here, with an issue in the backlog?
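For illustration (the helper name is made up), the translation from a Codec instance to the string that a JSON-based API expects can lean on the codec_id attribute that every numcodecs codec carries:

    def codec_to_identifier(codec):
        # numcodecs codecs expose a class-level codec_id string, e.g. "zlib" or "shuffle".
        return None if codec is None else codec.codec_id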
To follow up with a bit more info on Zarr, the compression algorithm, filters, missing data, etc. end up in the JSON file:
"b/.zarray": "{\"chunks\":[32],\"compressor\":{\"id\":\"zlib\",\"level\":1},\"dtype\":\"<f8\",\"fill_value\":9.969209968386869e+36,\"filters\":null,\"order\":\"C\",\"shape\":[32],\"zarr_format\":2}",
Each compression algorithm has an ID and a dict of parameters that are used to instantiate one of the Codec objects, with a registry interface that looks like it would accept the compressor dict: https://github.com/zarr-developers/numcodecs/blob/350cbd1eb6751cb32e66aa5a8bc4002077083a83/numcodecs/registry.py
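For instance, numcodecs can rebuild a codec directly from that compressor dict via its registry:

    import numcodecs

    # The "compressor" entry from a .zarray document, e.g. zlib at level 1.
    compressor_config = {"id": "zlib", "level": 1}
    codec = numcodecs.get_codec(compressor_config)  # returns a Zlib(level=1) instance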
…ain in previous commit
…a nonfunctional CMIP6 data test
@markgoddard this is great, mate! I merged
done - test runs fine 🥳 - this is ready to merge 🍺
Currently this check fails - data is not compressed.
""" | ||
test_file = str(Path(__file__).resolve().parent / 'test_data' / 'CMIP6_IPSL-CM6A-LR_tas.nc') | ||
|
||
check_dataset_filters(test_file, "tas", "zlib", False) |
This check suggests this data is not compressed.
Output from filters:
{'zlib': False, 'szip': False, 'zstd': False, 'bzip2': False, 'blosc': False, 'shuffle': False, 'complevel': 0, 'fletcher32': False}
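For reference, that dict is what netCDF4's Variable.filters() returns, so the check can be reproduced directly (the path is assumed relative to the test directory; check_dataset_filters is the test helper wrapping this):

    import netCDF4

    ds = netCDF4.Dataset("test_data/CMIP6_IPSL-CM6A-LR_tas.nc")
    # For the uncompressed file this printed:
    # {'zlib': False, 'szip': False, 'zstd': False, 'bzip2': False, 'blosc': False,
    #  'shuffle': False, 'complevel': 0, 'fletcher32': False}
    print(ds["tas"].filters())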
Same for obs4mips
agh! what a palaver - iris saves cubes with no compression by default - good spotting it, man! Let me plug in zlib at save point 👍
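Roughly along these lines (a sketch; the input path is hypothetical, and zlib/complevel/shuffle are options of iris's netCDF saver):

    import iris

    cubes = iris.load("CMIP6_IPSL-CM6A-LR_tas_uncompressed.nc")  # hypothetical source file
    # Re-save with zlib compression (and byte shuffle) so the test data is actually compressed.
    iris.save(cubes, "CMIP6_IPSL-CM6A-LR_tas.nc", zlib=True, complevel=1, shuffle=True)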
Never mind! Both the GA tests and an install from scratch with an env rebuild on my home machine show all green, phew! Must be my pytest cache or something else on my work machine that caused the local test failures - all fine 👍 🍺
I'll be a monkey's uncle - all works perfectly fine! Great work @markgoddard - we are now compressed 🍺 Merging it now, but please have a look at #119 (comment) and see if it's worth elaborating on in an issue 👍
Thanks for adding the real world datasets, glad to see this merged. I created #125 for the reduce_chunk API change.
thank you, Mark! 🍺 This is great to have in 👍