Support Zlib compression and shuffle filter for local storage #119
Merged
Commits (20)
6cec850 Support Zlib compression and shuffle filter for local storage (markgoddard)
80c2c87 Merge branch 'main' into compression (valeriupredoi)
ea40a48 bring back unwanted changes from conflicts resultution from merging m… (valeriupredoi)
4f83534 correctly handly test data via netCDF4.Dataset and close(); also add … (valeriupredoi)
09e18e2 add CMIP6 sample data file (valeriupredoi)
d7c03d7 working version of CMIP6 data test with compression huzzah (valeriupredoi)
4483023 add obs4MIPS test case (valeriupredoi)
a5889c2 add test file for obs4mips (valeriupredoi)
7cdb391 add unwanted jsons for new test data files (valeriupredoi)
26eef6d syntax (valeriupredoi)
5265e01 dummy data: drop ds from return value (markgoddard)
9157d2a compression tests: Sanity check filters on CMIP6 and obs4mips data (markgoddard)
208bc00 zlib-ed netCDF file for CMIP6 (valeriupredoi)
adab7f4 sanity check with numpy (valeriupredoi)
ffe6b52 compressed obs4mips data (valeriupredoi)
14e68aa add a couple numpy safenets (valeriupredoi)
2441b62 run GA tests for all Pythons (valeriupredoi)
00b8f36 unrun GA (valeriupredoi)
d7e1ffa run on feature branch for all Pythons (valeriupredoi)
61c7516 unrun GA, all works spiffy (valeriupredoi)
New test file (+42 lines):

import os
import pytest

from activestorage.active import Active
from activestorage.config import *
from activestorage.dummy_data import make_compressed_ncdata

import utils


def create_compressed_dataset(tmp_path: str, compression: str, shuffle: bool):
    """
    Make a vanilla test dataset which is compressed and optionally shuffled.
    """
    temp_file = str(tmp_path / "test_compression.nc")
    test_data = make_compressed_ncdata(filename=temp_file, compression=compression, shuffle=shuffle)

    # Sanity check that test data is compressed and filtered as expected.
    test_data_filters = test_data[0].variables['data'].filters()
    assert test_data_filters[compression]
    assert test_data_filters['shuffle'] == shuffle

    test_file = utils.write_to_storage(temp_file)
    if USE_S3:
        os.remove(temp_file)
    return test_file


@pytest.mark.skipif(USE_S3, reason="Compression and filtering not supported in S3 yet")
@pytest.mark.parametrize('compression', ['zlib'])
@pytest.mark.parametrize('shuffle', [False, True])
def test_compression_and_filters(tmp_path: str, compression: str, shuffle: bool):
    """
    Test use of datasets with compression and filters applied.
    """
    test_file = create_compressed_dataset(tmp_path, compression, shuffle)

    active = Active(test_file, 'data', utils.get_storage_type())
    active._version = 1
    active._method = "min"
    result = active[0:2,4:6,7:9]
    assert result == 740.0
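The make_compressed_ncdata helper itself is not shown in this diff. For orientation, here is a minimal, purely illustrative sketch of how netCDF4 writes a zlib-compressed (and optionally shuffled) variable and how its filters() metadata can be checked; the function name, variable name, shape and values below are assumptions, not the actual helper.

```python
# Illustrative sketch only -- not the actual make_compressed_ncdata helper.
import numpy as np
from netCDF4 import Dataset

def make_compressed_file(filename, compression="zlib", shuffle=False):
    ds = Dataset(filename, "w", format="NETCDF4")
    for dim in ("x", "y", "z"):
        ds.createDimension(dim, 10)
    # zlib=True enables deflate compression; shuffle reorders bytes before
    # compression, which usually improves the compression ratio.
    var = ds.createVariable("data", "f8", ("x", "y", "z"),
                            zlib=(compression == "zlib"), shuffle=shuffle)
    var[:] = np.arange(1000, dtype="f8").reshape(10, 10, 10)
    # filters() reports the compression/filter settings actually applied,
    # which is what the sanity check in create_compressed_dataset asserts on.
    print(var.filters())  # e.g. {'zlib': True, 'shuffle': False, 'complevel': 4, ...}
    ds.close()
```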
Do we really want to pass an actual codec instance? I'd rather have a complete separation of concerns and pass an identifier for a particular codec, just as is done in the NetCDF API ... after all, this is our API layer?
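To make the two styles concrete, here is a hedged sketch; the reduce_chunk-style signatures are hypothetical stand-ins for the driver interface discussed below, not the real code.

```python
import numcodecs

# (a) Identifier-based, as in the NetCDF API (e.g. createVariable(..., zlib=True)):
# the caller names the codec and the storage layer resolves it itself.
def reduce_chunk_by_name(buffer, compression="zlib"):  # hypothetical signature
    codec = numcodecs.get_codec({"id": compression})
    return codec.decode(buffer)

# (b) Instance-based: the caller hands over a ready-made codec object, which
# makes the storage-layer implementation trivial.
def reduce_chunk_by_instance(buffer, codec):  # hypothetical signature
    return codec.decode(buffer)

# e.g. reduce_chunk_by_instance(raw_bytes, numcodecs.Zlib())
```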
It's a fair point. I was quite surprised to see an actual codec instance being passed through, although it did make the implementation trivial.
Would it make sense to align the S3 active storage API parameters and the storage driver parameters used here?
Currently I'm using a lower-case string to describe the (de)compression algorithm. I need to check that no parameters are required for decompression with other algorithms. Filters may need parameters. Shuffle needs to know the data type size, although we do already have that info elsewhere.
It'd be good to know how other communities are handling this. How does Zarr do it? We should find out.
In terms of the Zarr Python API, it looks like numcodecs is the interface: https://zarr.readthedocs.io/en/stable/api/codecs.html
I believe it was originally part of zarr, then got extracted into a separate library.
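For reference, the numcodecs interface looks roughly like this (a standalone example for illustration, not code from this PR); note that, as mentioned above, Shuffle needs the element size of the data type:

```python
import numpy as np
import numcodecs

data = np.arange(100, dtype="f8")

# Shuffle reorders bytes by element (8 bytes for float64), then Zlib deflates.
shuffle = numcodecs.Shuffle(elementsize=data.dtype.itemsize)
zlib = numcodecs.Zlib(level=1)
compressed = zlib.encode(shuffle.encode(data))

# Decoding reverses the pipeline.
restored = shuffle.decode(zlib.decode(compressed))
np.testing.assert_array_equal(np.frombuffer(restored, dtype="f8"), data)
```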
OK, that's not so helpful, I guess, as they have the entire block in memory client-side, so they can put it all in the stack. Our problem, of course, is that the client and server are not on the same stack, so we have to go through an API interface. Where we put that is of course a moot point.
I'd say that the main API to get right is that of the active storage server. It's JSON-based, so it can't accept a Python object, but in my current implementation I'm just grabbing the codec_id from the numcodecs Codec object, which happens to match my API. If you want this reduce_chunk interface to form a driver API for this library, then it would also be important to keep it simple and stable, although currently all "drivers" are in-tree in this repo.
Is this something we should iterate on from here, with an issue in the backlog?
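In numcodecs terms, grabbing that identifier looks like the following (illustrative, not the server code itself):

```python
import numcodecs

codec = numcodecs.Zlib(level=1)
print(codec.codec_id)      # "zlib" -- the string handed to the JSON-based server API
print(codec.get_config())  # {"id": "zlib", "level": 1}
```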
To follow up with a bit more info on Zarr: the compression algorithm, filters, missing data, etc. end up in the JSON metadata file.
Each compression algorithm has an ID and a dict of parameters that are used to instantiate one of the Codec objects, with a registry interface that looks like it would accept the compressor dict: https://github.com/zarr-developers/numcodecs/blob/350cbd1eb6751cb32e66aa5a8bc4002077083a83/numcodecs/registry.py
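A sketch of that registry interface in use; the compressor dict below is a typical Zarr v2 .zarray entry, assumed for illustration rather than taken from this PR's data:

```python
from numcodecs import get_codec

# Compressor entry as it would appear in a Zarr .zarray JSON document.
compressor_config = {"id": "zlib", "level": 1}

# The registry instantiates the matching Codec object from the dict.
codec = get_codec(compressor_config)
print(codec)  # Zlib(level=1)
```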
#125