Support Zlib compression and shuffle filter for local storage #119
@@ -0,0 +1,98 @@

```python
import os
import numpy as np
import pytest

from netCDF4 import Dataset
from pathlib import Path

from activestorage.active import Active
from activestorage.config import *
from activestorage.dummy_data import make_compressed_ncdata

import utils


def check_dataset_filters(temp_file: str, ncvar: str, compression: str, shuffle: bool):
    # Sanity check that test data is compressed and filtered as expected.
    with Dataset(temp_file) as test_data:
        test_data_filters = test_data.variables[ncvar].filters()
        print(test_data_filters)
        assert test_data_filters[compression]
        assert test_data_filters['shuffle'] == shuffle


def create_compressed_dataset(tmp_path: str, compression: str, shuffle: bool):
    """
    Make a vanilla test dataset which is compressed and optionally shuffled.
    """
    temp_file = str(tmp_path / "test_compression.nc")
    test_data = make_compressed_ncdata(filename=temp_file, compression=compression, shuffle=shuffle)

    check_dataset_filters(temp_file, "data", compression, shuffle)

    test_file = utils.write_to_storage(temp_file)
    if USE_S3:
        os.remove(temp_file)
    return test_file


@pytest.mark.skipif(USE_S3, reason="Compression and filtering not supported in S3 yet")
@pytest.mark.parametrize('compression', ['zlib'])
@pytest.mark.parametrize('shuffle', [False, True])
def test_compression_and_filters(tmp_path: str, compression: str, shuffle: bool):
    """
    Test use of datasets with compression and filters applied.
    """
    test_file = create_compressed_dataset(tmp_path, compression, shuffle)

    active = Active(test_file, 'data', utils.get_storage_type())
    active._version = 1
    active._method = "min"
    result = active[0:2,4:6,7:9]
    assert result == 740.0


@pytest.mark.skipif(USE_S3, reason="Compression and filtering not supported in S3 yet")
def test_compression_and_filters_cmip6_data():
    """
    Test use of datasets with compression and filters applied for a real
    CMIP6 dataset (CMIP6_IPSL-CM6A-LR_tas).
    """
    test_file = str(Path(__file__).resolve().parent / 'test_data' / 'CMIP6_IPSL-CM6A-LR_tas.nc')

    check_dataset_filters(test_file, "tas", "zlib", False)
```
**Review comment** (on the `check_dataset_filters` call above): This check suggests this data is not compressed. Output from filters:

**Review comment:** Same for obs4mips.

**Reply:** agh! what a palaver - iris saves cubes with no compression by default - good spotting it, man! Let me plug in zlib at save point 👍
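A minimal sketch of "plugging zlib in at save point" with iris, assuming iris's NetCDF saver keyword arguments; the input filename is a placeholder, not a file from this PR:

```python
import iris

# Hypothetical re-save of the test fixture with zlib compression enabled;
# by default iris writes NetCDF output uncompressed.
cube = iris.load_cube("uncompressed_source.nc")  # placeholder input file
iris.save(cube, "CMIP6_IPSL-CM6A-LR_tas.nc", zlib=True, complevel=1, shuffle=False)
```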
```python
    with Dataset(test_file) as nc_data:
        nc_min = np.min(nc_data["tas"][0:2,4:6,7:9])
        print(f"Numpy min from compressed file {nc_min}")

    active = Active(test_file, 'tas', utils.get_storage_type())
    active._version = 1
    active._method = "min"
    result = active[0:2,4:6,7:9]
    assert nc_min == result
    assert result == 239.25946044921875


@pytest.mark.skipif(USE_S3, reason="Compression and filtering not supported in S3 yet")
def test_compression_and_filters_obs4mips_data():
    """
    Test use of datasets with compression and filters applied for a real
    obs4mips dataset (obs4MIPS_CERES-EBAF_L3B_Ed2-8_rlut.nc) at CMIP5 MIP standard
    but with CMIP6-standard file packaging.
    """
    test_file = str(Path(__file__).resolve().parent / 'test_data' / 'obs4MIPS_CERES-EBAF_L3B_Ed2-8_rlut.nc')

    check_dataset_filters(test_file, "rlut", "zlib", False)

    with Dataset(test_file) as nc_data:
        nc_min = np.min(nc_data["rlut"][0:2,4:6,7:9])
        print(f"Numpy min from compressed file {nc_min}")

    active = Active(test_file, 'rlut', utils.get_storage_type())
    active._version = 1
    active._method = "min"
    result = active[0:2,4:6,7:9]
    print(nc_min)
    assert nc_min == result
    assert result == 124.0
```
Do we really want to pass an actual codec instance? I'd rather have a complete separation of concerns and pass an identifier for a particular codec, just as is done in the NetCDF API ... after all, this is our API layer???
It's a fair point. I was quite surprised to see an actual codec instance being passed through, although it did make the implementation trivial.
Would it make sense to align the S3 active storage API parameters and the storage driver parameters used here?
Currently I'm using a lower-case string to describe the (de)compression algorithm. I need to check whether any parameters are required for decompression with other algorithms. Filters may need parameters. Shuffle needs to know the data type size, although we do already have that info elsewhere.
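For concreteness, a minimal sketch of what an identifier-based driver call could look like, assuming numcodecs does the actual decoding; `decode_chunk`, its parameter names, and the `{"id": ...}` dict are illustrative, not the PR's `reduce_chunk` API:

```python
from typing import Optional

import numcodecs


def decode_chunk(chunk: bytes, compression: Optional[str], shuffle: bool, itemsize: int):
    """Reverse compression and the shuffle filter on one raw storage chunk."""
    if compression is not None:
        # e.g. compression == "zlib"; any codec parameters could travel in the same dict
        chunk = numcodecs.get_codec({"id": compression}).decode(chunk)
    if shuffle:
        # the shuffle filter needs the element size of the on-disk dtype
        chunk = numcodecs.Shuffle(elementsize=itemsize).decode(chunk)
    return chunk
```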
It'd be good to know how other communities are handling this. How does Zarr do it? We should find out.
In terms of the Zarr Python API, it looks like numcodecs is the interface: https://zarr.readthedocs.io/en/stable/api/codecs.html
I believe it was originally part of zarr, then got extracted into a separate library.
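For reference, the numcodecs interface boils down to Codec objects with `encode`/`decode` methods and a registered `codec_id`; a small illustration (not code from this PR):

```python
import numpy as np
import numcodecs

codec = numcodecs.Zlib(level=1)
data = np.arange(100, dtype="float32")

compressed = codec.encode(data)                # zlib-compressed bytes
restored = np.frombuffer(codec.decode(compressed), dtype="float32")

assert codec.codec_id == "zlib"
assert np.array_equal(data, restored)
```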
OK, that's not so helpful I guess as they have the entire block in memory client-side so they can put it all in the stack. Our problem of course is that the client and server are not on the same stack, so we have to go through an API interface. Where we put that is of course a moot point.
I'd say that the main API to get right is that of the active storage server. It's JSON based, so it can't accept a Python object; in my current implementation I'm just grabbing the `codec_id` from the numcodecs Codec object, which happens to match my API. If you want this `reduce_chunk` interface to form a driver API for this library, then it would also be important to keep it simple and stable, although currently all "drivers" are in-tree in this repo.
Is this something we should iterate on from here, with an issue in the backlog?
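A sketch of the "grab the codec_id" mapping described above; the request payload below is purely illustrative, not the actual S3 active storage API schema:

```python
import numcodecs

codec = numcodecs.Zlib(level=4)

# Only the string identifier crosses the JSON API boundary, not the Codec object.
request_body = {
    "operation": "min",             # hypothetical field name
    "compression": codec.codec_id,  # -> "zlib"
}
```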
To follow up with a bit more info on Zarr: the compression algorithm, filters, missing data, etc. all end up in the JSON metadata file.
Each compression algorithm has an ID and a dict of parameters that are used to instantiate one of the Codec objects, with a registry interface that looks like it would accept the compressor dict: https://github.com/zarr-developers/numcodecs/blob/350cbd1eb6751cb32e66aa5a8bc4002077083a83/numcodecs/registry.py
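So a Zarr-style compressor dict can be turned back into a Codec via that registry; a small example (the dict is illustrative, not taken from a real `.zarray` file):

```python
from numcodecs.registry import get_codec

compressor_config = {"id": "zlib", "level": 1}   # codec ID plus its parameters
codec = get_codec(compressor_config)             # -> numcodecs.Zlib(level=1)
```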
Follow-up issue: #125