Zarr group JSON file issues #134

Closed
markgoddard opened this issue Aug 15, 2023 · 2 comments · Fixed by #135

markgoddard commented Aug 15, 2023

PyActiveStorage writes out a Zarr group JSON metadata file when processing a variable in a dataset. The filename is <netcdf file basename>_<variable name>.json. If using local storage it is written to the same directory as the netCDF file. If using S3 storage it is written to the current directory. If a file of the same name exists, it is not updated. The files are never removed.

This leads to various issues when reusing netCDF filenames for different datasets (e.g. during testing), since the Zarr group metadata may be describing a previous incarnation of the dataset.
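The failure mode can be reproduced in isolation. The following is a minimal sketch, not the PyActiveStorage code: `write_group_json_if_missing` is a hypothetical stand-in for the write-only-if-absent behaviour described above, the variable name `data` is assumed, and the metadata dictionaries are invented placeholders rather than real Zarr group metadata.

```python
import json
import os
import tempfile


def write_group_json_if_missing(path, metadata):
    """Hypothetical stand-in: write the JSON only if no file exists yet."""
    if not os.path.exists(path):
        with open(path, "w") as f:
            json.dump(metadata, f)
    # Whatever is on disk is what gets used, whether fresh or stale.
    with open(path) as f:
        return json.load(f)


def demo_stale_metadata():
    """Process two 'datasets' that share a netCDF filename and variable."""
    with tempfile.TemporaryDirectory() as workdir:
        json_path = os.path.join(workdir, "test_compression_data.json")
        # First dataset: placeholder metadata with one chunk shape.
        first = write_group_json_if_missing(json_path, {"chunks": [10, 10]})
        # Second dataset, same filename, different chunk shape.
        second = write_group_json_if_missing(json_path, {"chunks": [5, 5]})
        return first, second
```

In this sketch the second call returns the metadata written by the first, which is the stale-metadata situation described above.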

It also leads to an undesirable buildup of JSON files.

Zarr dataset metadata is cached on an Active instance, so there isn't much to be gained from leaving the JSON files around to be reused.

I propose we use a temporary file to write the Zarr group metadata, and immediately remove it once it has been used.
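A rough sketch of what this could look like, using only the Python standard library. The function name `load_via_temp_json` and the `loader` callback are hypothetical and stand in for whatever code consumes the JSON file; this is not the change that was eventually merged.

```python
import json
import os
import tempfile


def load_via_temp_json(metadata, loader):
    """Write metadata to a temporary JSON file, let `loader` read it,
    then remove the file whether or not `loader` succeeds."""
    # mkstemp creates a uniquely named file, so reused netCDF filenames
    # can never collide with a JSON file left over from an earlier run.
    fd, path = tempfile.mkstemp(suffix=".json")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(metadata, f)
        return loader(path)
    finally:
        os.remove(path)
```

Because the file is removed in the `finally` clause, nothing accumulates on disk, and the result returned by `loader` can still be cached by the caller.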

@markgoddard markgoddard self-assigned this Aug 15, 2023
@markgoddard markgoddard added the bug Something isn't working label Aug 15, 2023
@markgoddard
Author

To add some specifics, this led to a difficult-to-diagnose issue when trying to implement #120. Because the same file name (test_compression.nc) was used for multiple test cases (with and without byte-shuffled data), the Zarr metadata file was not updated. This led to incorrect chunk sizes and offsets, and ultimately to decompression failure.

It's not clear why this did not also affect the test when using local storage.

@markgoddard
Author

> It's not clear why this did not also affect the test when using local storage.

Worked it out (and updated the description): for local files we generate the JSON file in the same directory as the netCDF file, which in the test case is in a temporary directory. That directory is not persistent, so the local case does not suffer from the issue.
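To illustrate the difference, here is a sketch of the path selection as described in this issue. `group_json_path` and its `storage_type` argument are hypothetical names, and stripping the `.nc` extension from the basename is an assumption; the actual code may differ.

```python
import os


def group_json_path(nc_uri, variable, storage_type=None):
    """Hypothetical helper mirroring the described behaviour:
    <netcdf file basename>_<variable name>.json, placed next to the
    netCDF file for local storage, or in the current directory for S3."""
    # Assumption: the basename has its extension stripped.
    base = os.path.splitext(os.path.basename(nc_uri))[0]
    filename = f"{base}_{variable}.json"
    if storage_type == "s3":
        # Persistent location shared by every test run from this directory.
        return os.path.join(os.getcwd(), filename)
    # Local storage: same directory as the netCDF file (a per-test
    # temporary directory in the test suite).
    return os.path.join(os.path.dirname(nc_uri), filename)
```

With a per-test temporary directory the local path differs on every run, while the S3 path is identical across runs, so only the S3 case can pick up a stale file.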

markgoddard added a commit that referenced this issue Aug 15, 2023
When using multiple netCDF files with the same names, the Zarr group
JSON file would previously not be overwritten after it was first
written. This would lead to subsequent uses potentially using an invalid
Zarr group metadata file.

This change switches to use a temporary file to store the Zarr group
metadata.  This should not be a problem because the Zarr datasource is
cached in the Active object as the _zds member between operations.

Closes #134
valeriupredoi referenced this issue Sep 14, 2023
Use a temporary file for Zarr group JSON