Support Zlib compression and shuffle filter for local storage #119

Merged
merged 20 commits into main on Jul 20, 2023

Conversation

markgoddard

This change adds support for compressed and filtered data for local
storage. Data in S3 will be addressed separately.

The compression and filters arguments passed to reduce_chunk are
actually numcodecs.abc.Codec instances, so we can use them as a black
box to decode the compression or filter.

Currently we are testing Zlib compression algorithm as well as the HDF5
byte shuffle filter. It's possible that other compression algorithms and
filters will "just work" due to using the numcodecs.abc.Codec interface
to decode the data, but they have not been tested.

Closes: #118
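
A minimal sketch of that black-box decoding approach (illustrative only, not the library's actual reduce_chunk implementation; assumes numcodecs and numpy are available):

import numpy as np
import numcodecs

# Treat compression and filters as opaque numcodecs.abc.Codec instances and
# simply call decode() on each, undoing the filters in reverse order.
def decode_chunk(chunk_bytes, compression=None, filters=None):
    if compression is not None:
        chunk_bytes = compression.decode(chunk_bytes)
    for codec in reversed(filters or []):
        chunk_bytes = codec.decode(chunk_bytes)
    return chunk_bytes

# Round trip: byte shuffle then Zlib on write, the reverse on read.
original = np.arange(32, dtype="<f8").tobytes()
shuffle = numcodecs.Shuffle(elementsize=8)
zlib = numcodecs.Zlib(level=1)
encoded = zlib.encode(shuffle.encode(original))
decoded = decode_chunk(encoded, compression=zlib, filters=[shuffle])
assert np.array_equal(np.frombuffer(decoded, dtype="<f8"), np.arange(32, dtype="<f8"))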

@markgoddard markgoddard self-assigned this Jul 6, 2023
@markgoddard markgoddard requested a review from valeriupredoi July 6, 2023 16:02
@codecov

codecov bot commented Jul 6, 2023

Codecov Report

Patch coverage: 100.00% and project coverage change: +0.13% 🎉

Comparison is base (bce41b5) 85.52% compared to head (61c7516) 85.66%.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #119      +/-   ##
==========================================
+ Coverage   85.52%   85.66%   +0.13%     
==========================================
  Files           8        8              
  Lines         539      544       +5     
==========================================
+ Hits          461      466       +5     
  Misses         78       78              
Impacted Files Coverage Δ
activestorage/dummy_data.py 88.65% <100.00%> (+0.23%) ⬆️
activestorage/storage.py 100.00% <100.00%> (ø)

☔ View full report in Codecov by Sentry.

Collaborator

@valeriupredoi valeriupredoi left a comment


very cool! Many thanks @markgoddard 🍺 I'd not merge this just yet though, for two reasons: I'd be happy if @bnlawrence had a quick look at it too, and I reckon it's a good idea to test with a couple of actual compressed netCDF4 files from, say, a CMIP5/6 archive, not just synthetic data produced on the fly. I can plop that in here 👍

@valeriupredoi valeriupredoi added the enhancement New feature or request label Jul 7, 2023
@valeriupredoi valeriupredoi requested a review from bnlawrence July 7, 2023 11:37
@@ -8,6 +8,8 @@ def reduce_chunk(rfile, offset, size, compression, filters, missing, dtype, shap

rfile - the actual file with the data
offset, size - where and what we want ...
compression - optional `numcodecs.abc.Codec` compression codec
Collaborator


Do we really want to pass an actual codec instance? I'd rather have a complete separation of concerns and pass an identifier for a particular codec, just as is done in the NetCDF API ... after all, this is our API layer?

Author


It's a fair point. I was quite surprised to see an actual codec instance being passed through, although it did make the implementation trivial.

Would it make sense to align the S3 active storage API parameters and the storage driver parameters used here?

Currently I'm using a lower-case string to describe the (de)compression algorithm. I need to check that no parameters are required for decompression using other algorithms. Filters may need parameters. Shuffle needs to know the data type size, although we already have that info elsewhere.

Collaborator


It'd be good to know how other communities are handling this. How does Zarr do it? We should find out.

Author


In terms of the Zarr Python API, it looks like numcodecs is the interface: https://zarr.readthedocs.io/en/stable/api/codecs.html

I believe it was originally part of zarr, then got extracted into a separate library.

Collaborator


OK, that's not so helpful I guess as they have the entire block in memory client-side so they can put it all in the stack. Our problem of course is that the client and server are not on the same stack, so we have to go through an API interface. Where we put that is of course a moot point.

Author


I'd say that the main API to get right is that of the active storage server. It's JSON based, so it can't accept a Python object, but in my current implementation I'm just grabbing the codec_id from the numcodecs Codec object, which happens to match my API.

If you want this reduce_chunk interface to form a driver API for this library then it would also be important to keep it simple and stable, although currently all "drivers" are in-tree in this repo.

Is this something we should iterate on from here, with an issue in the backlog?
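
For illustration, a hedged sketch of the codec_id mapping described above (the request field names are hypothetical, not the actual S3 active storage API):

import json
import numcodecs

# codec_id is a class attribute on every numcodecs codec, e.g. "zlib" for Zlib,
# so it gives a JSON-friendly identifier to send to a remote reduction service.
codec = numcodecs.Zlib(level=1)
request_body = {"compression": {"id": codec.codec_id}}  # hypothetical schema
print(json.dumps(request_body))  # {"compression": {"id": "zlib"}}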

Author


To follow up with a bit more info on Zarr, the compression algorithm, filters, missing data, etc. end up in the JSON file:

"b/.zarray": "{\"chunks\":[32],\"compressor\":{\"id\":\"zlib\",\"level\":1},\"dtype\":\"<f8\",\"fill_value\":9.969209968386869e+36,\"filters\":null,\"order\":\"C\",\"shape\":[32],\"zarr_format\":2}",

Each compression algorithm has an ID and a dict of parameters that are used to instantiate one of the Codec objects, with a registry interface that looks like it would accept the compressor dict: https://github.com/zarr-developers/numcodecs/blob/350cbd1eb6751cb32e66aa5a8bc4002077083a83/numcodecs/registry.py
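
For reference, a minimal sketch of that registry interface (assuming numcodecs is installed; get_codec() builds a Codec from a Zarr-style compressor dict):

import numcodecs

# Instantiate a Codec from the compressor dict found in .zarray metadata.
compressor_config = {"id": "zlib", "level": 1}
codec = numcodecs.get_codec(compressor_config)
print(codec)  # Zlib(level=1)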

@valeriupredoi
Collaborator

@markgoddard this is great, mate! I merged main into this, "fixed" the conflicts (only to notice I did that badly and re-fixed the bits after), and plopped a real-world CMIP6 file in (less than 1MB in size so GH doesn't complain) -> works very well 🥳 One more thing I did was to change the test a bit so that it works with the way we do things now: we close all netCDF4.Dataset members and reopen them when we need them, i.e. not leave any Dataset hanging open from one module to another, since we've seen that causing issues over the network. Have a quick look at the changes pls, when you have a bit of time. I also want to plop in an obs4MIPS sample file and that's it from me; will do that now 🍺

@valeriupredoi
Collaborator

done - test runs fine 🥳 - this is ready to merge 🍺

"""
test_file = str(Path(__file__).resolve().parent / 'test_data' / 'CMIP6_IPSL-CM6A-LR_tas.nc')

check_dataset_filters(test_file, "tas", "zlib", False)
Author


This check suggests this data is not compressed.

Output from filters:

{'zlib': False, 'szip': False, 'zstd': False, 'bzip2': False, 'blosc': False, 'shuffle': False, 'complevel': 0, 'fletcher32': False}
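
For context, a hypothetical sketch of what such a check might do via the netCDF4 API (the real check_dataset_filters lives in the test suite and may differ):

import netCDF4

# Inspect the per-variable filters dict reported by netCDF4, i.e. the
# {'zlib': False, 'shuffle': False, ...} output quoted above.
def check_dataset_filters(path, ncvar, compression, shuffle):
    with netCDF4.Dataset(path) as ds:
        filters = ds.variables[ncvar].filters()
        assert filters[compression], f"{ncvar} is not {compression}-compressed"
        assert filters["shuffle"] == shuffle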

Author


Same for obs4mips

Collaborator


agh! what a palaver - iris saves cubes with no compression by default - good spotting it, man! Let me plug in zlib at save point 👍
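
For reference, enabling compression when saving with iris looks roughly like this (a sketch only; the zlib/complevel/shuffle keywords belong to iris's netCDF saver, and the file paths are placeholders, not the actual test fixtures):

import iris

# Load a cube and re-save it with Zlib compression and the byte shuffle
# filter enabled (iris writes uncompressed netCDF data by default).
cube = iris.load_cube("input.nc")  # placeholder path
iris.save(cube, "output_compressed.nc", zlib=True, complevel=1, shuffle=True)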

@valeriupredoi
Collaborator

valeriupredoi commented Jul 20, 2023

bugger! There's an issue with the Active result now that indeed we have compression inside the netCDF4 file for both CMIP6 and obs4MIPS - the numpy min is fine, the active min is bazooka! Will have a look tomorrow - any ideas, @markgoddard?

Nevermind! Both the GA tests and an install from scratch with env rebuild on my home machine show all green, phew! Must be my pytest cache or something else on my work machine that caused local test fails - all fine 👍 🍺

@valeriupredoi
Collaborator

I'll be a monkey's uncle - all works perfectly fine! Great work @markgoddard - we are now compressed 🍺 Merging it now, but pls have a look at #119 (comment) and see if it's worth elaborating on in an issue 👍

@markgoddard
Author

Thanks for adding the real world datasets, glad to see this merged. I created #125 for the reduce_chunk API change.

@valeriupredoi
Collaborator

thank you, Mark! 🍺 This is great to have in 👍

Labels
enhancement New feature or request
Development

Successfully merging this pull request may close these issues.

Add support for compressed and shuffled datasets on local storage
3 participants