Support Zlib compression and shuffle filter for local storage #119

Merged
merged 20 commits into main on Jul 20, 2023

Conversation

markgoddard

This change adds support for compressed and filtered data for local
storage. Data in S3 will be addressed separately.

The compression and filters arguments passed to reduce_chunk are
actually numcodecs.abc.Codec instances, so we can use them as a black
box to decode the compression or filter.

Currently we are testing Zlib compression algorithm as well as the HDF5
byte shuffle filter. It's possible that other compression algorithms and
filters will "just work" due to using the numcodecs.abc.Codec interface
to decode the data, but they have not been tested.

Closes: #118
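
A minimal sketch of that black-box decoding approach (illustrative only, not the library's actual reduce_chunk implementation; assumes numcodecs and numpy are available):

import numpy as np
import numcodecs

# Treat compression and filters as opaque numcodecs.abc.Codec instances and
# simply call decode() on each, undoing the filters in reverse order.
def decode_chunk(chunk_bytes, compression=None, filters=None):
    if compression is not None:
        chunk_bytes = compression.decode(chunk_bytes)
    for codec in reversed(filters or []):
        chunk_bytes = codec.decode(chunk_bytes)
    return chunk_bytes

# Round trip: byte shuffle then Zlib on write, the reverse on read.
original = np.arange(32, dtype="<f8").tobytes()
shuffle = numcodecs.Shuffle(elementsize=8)
zlib = numcodecs.Zlib(level=1)
encoded = zlib.encode(shuffle.encode(original))
decoded = decode_chunk(encoded, compression=zlib, filters=[shuffle])
assert np.array_equal(np.frombuffer(decoded, dtype="<f8"), np.arange(32, dtype="<f8"))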

@markgoddard markgoddard self-assigned this Jul 6, 2023
@markgoddard markgoddard requested a review from valeriupredoi July 6, 2023 16:02
@codecov

codecov bot commented Jul 6, 2023

Codecov Report

Patch coverage: 100.00% and project coverage change: +0.13% 🎉

Comparison is base (bce41b5) 85.52% compared to head (61c7516) 85.66%.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #119      +/-   ##
==========================================
+ Coverage   85.52%   85.66%   +0.13%     
==========================================
  Files           8        8              
  Lines         539      544       +5     
==========================================
+ Hits          461      466       +5     
  Misses         78       78              
Impacted Files Coverage Δ
activestorage/dummy_data.py 88.65% <100.00%> (+0.23%) ⬆️
activestorage/storage.py 100.00% <100.00%> (ø)

☔ View full report in Codecov by Sentry.

Collaborator

@valeriupredoi valeriupredoi left a comment


very cool! Many thanks @markgoddard 🍺 I'd not merge this just yet though, for two reasons: I'd be happy if @bnlawrence had a quick look at it too, and I reckon it's a good idea to test with a couple of actual compressed netCDF4 files from, say, a CMIP5/6 archive, not just synthetic data produced on the fly. I can plop that in here 👍

@valeriupredoi valeriupredoi added the enhancement New feature or request label Jul 7, 2023
@valeriupredoi valeriupredoi requested a review from bnlawrence July 7, 2023 11:37
@@ -8,6 +8,8 @@ def reduce_chunk(rfile, offset, size, compression, filters, missing, dtype, shap

rfile - the actual file with the data
offset, size - where and what we want ...
compression - optional `numcodecs.abc.Codec` compression codec
Collaborator


Do we really want to pass an actual codec instance? I'd rather have a complete separation of concerns and pass an identifier for a particular codec, just as is done in the NetCDF API ... after all, this is our API layer?

Author


It's a fair point. I was quite surprised to see an actual codec instance being passed through, although it did make the implementation trivial.

Would it make sense to align the S3 active storage API parameters and the storage driver parameters used here?

Currently I'm using a lower-case string to describe the (de)compression algorithm. I need to check that no parameters are required for decompression using other algorithms. Filters may need parameters. Shuffle needs to know the data type size, although we already have that info elsewhere.

Collaborator


It'd be good to know how other communities are handling this. How does Zarr do it? We should find out.

Author


In terms of the Zarr Python API, it looks like numcodecs is the interface: https://zarr.readthedocs.io/en/stable/api/codecs.html

I believe it was originally part of zarr, then got extracted into a separate library.

Collaborator


OK, that's not so helpful I guess as they have the entire block in memory client-side so they can put it all in the stack. Our problem of course is that the client and server are not on the same stack, so we have to go through an API interface. Where we put that is of course a moot point.

Author


I'd say that the main API to get right is that of the active storage server. It's JSON based, so it can't accept a Python object, but in my current implementation I'm just grabbing the codec_id from the numcodecs Codec object, which happens to match my API.

If you want this reduce_chunk interface to form a driver API for this library then it would also be important to keep it simple and stable, although currently all "drivers" are in-tree in this repo.

Is this something we should iterate on from here, with an issue in the backlog?
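
For illustration, a hedged sketch of the codec_id mapping described above (the request field names are hypothetical, not the actual S3 active storage API):

import json
import numcodecs

# codec_id is a class attribute on every numcodecs codec, e.g. "zlib" for Zlib,
# so it gives a JSON-friendly identifier to send to a remote reduction service.
codec = numcodecs.Zlib(level=1)
request_body = {"compression": {"id": codec.codec_id}}  # hypothetical schema
print(json.dumps(request_body))  # {"compression": {"id": "zlib"}}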

Author


To follow up with a bit more info on Zarr, the compression algorithm, filters, missing data, etc. end up in the JSON file:

"b/.zarray": "{\"chunks\":[32],\"compressor\":{\"id\":\"zlib\",\"level\":1},\"dtype\":\"<f8\",\"fill_value\":9.969209968386869e+36,\"filters\":null,\"order\":\"C\",\"shape\":[32],\"zarr_format\":2}",

Each compression algorithm has an ID and a dict of parameters that are used to instantiate one of the Codec objects, with a registry interface that looks like it would accept the compressor dict: https://github.com/zarr-developers/numcodecs/blob/350cbd1eb6751cb32e66aa5a8bc4002077083a83/numcodecs/registry.py
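
For reference, a minimal sketch of that registry interface (assuming numcodecs is installed; get_codec() builds a Codec from a Zarr-style compressor dict):

import numcodecs

# Instantiate a Codec from the compressor dict found in .zarray metadata.
compressor_config = {"id": "zlib", "level": 1}
codec = numcodecs.get_codec(compressor_config)
print(codec)  # Zlib(level=1)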

@valeriupredoi
Collaborator

@markgoddard this is great, mate! I merged main into this, "fixed" the conflicts (only to notice I did that badly and re-fixed the bits after), and plopped a real-world CMIP6 file in (less than 1MB in size so GH doesn't complain) -> works very well 🥳 One more thing I did was to change the test a bit so that it works with the way we do things now: we close all netCDF4.Dataset members and reopen them when we need them, i.e. not leave any Dataset hanging open from one module to another, since we've seen that causing issues over the network. Have a quick look at the changes pls, when you have a bit of time. I also want to plop in an obs4MIPS sample file and that's it from me; will do that now 🍺

@valeriupredoi
Collaborator

done - test runs fine 🥳 - this is ready to merge 🍺

"""
test_file = str(Path(__file__).resolve().parent / 'test_data' / 'CMIP6_IPSL-CM6A-LR_tas.nc')

check_dataset_filters(test_file, "tas", "zlib", False)
Author


This check suggests this data is not compressed.

Output from filters:

{'zlib': False, 'szip': False, 'zstd': False, 'bzip2': False, 'blosc': False, 'shuffle': False, 'complevel': 0, 'fletcher32': False}
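
For context, a hypothetical sketch of what such a check might do via the netCDF4 API (the real check_dataset_filters lives in the test suite and may differ):

import netCDF4

# Inspect the per-variable filters dict reported by netCDF4, i.e. the
# {'zlib': False, 'shuffle': False, ...} output quoted above.
def check_dataset_filters(path, ncvar, compression, shuffle):
    with netCDF4.Dataset(path) as ds:
        filters = ds.variables[ncvar].filters()
        assert filters[compression], f"{ncvar} is not {compression}-compressed"
        assert filters["shuffle"] == shuffle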

Author


Same for obs4mips

Collaborator


agh! what a palaver - iris saves cubes with no compression by default - good spotting it, man! Let me plug in zlib at save point 👍
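
For reference, enabling compression when saving with iris looks roughly like this (a sketch only; the zlib/complevel/shuffle keywords belong to iris's netCDF saver, and the file paths are placeholders, not the actual test fixtures):

import iris

# Load a cube and re-save it with Zlib compression and the byte shuffle
# filter enabled (iris writes uncompressed netCDF data by default).
cube = iris.load_cube("input.nc")  # placeholder path
iris.save(cube, "output_compressed.nc", zlib=True, complevel=1, shuffle=True)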

@valeriupredoi
Collaborator

valeriupredoi commented Jul 20, 2023

bugger! There's an issue with the Active result now that indeed we have compression inside the netCDF4 file for both CMIP6 and obs4MIPS - the numpy min is fine, the active min is bazooka! Will have a look tomorrow - any ideas, @markgoddard?

Nevermind! Both the GA tests and an install from scratch with env rebuild on my home machine show all green, phew! Must be my pytest cache or something else on my work machine that caused local test fails - all fine 👍 🍺

@valeriupredoi
Collaborator

I'll be a monkey's uncle - all works perfectly fine! Great work @markgoddard - we are now compressed 🍺 Merging it now, but pls have a look at #119 (comment) and see if it's worth elaborating on in an issue 👍

@markgoddard
Author

Thanks for adding the real world datasets, glad to see this merged. I created #125 for the reduce_chunk API change.

@valeriupredoi
Collaborator

thank you, Mark! 🍺 This is great to have in 👍

Labels
enhancement New feature or request
Development

Successfully merging this pull request may close these issues.

Add support for compressed and shuffled datasets on local storage
3 participants