Zarr group JSON file issues #134

markgoddard · 2023-08-15T14:15:35Z

PyActiveStorage writes out a Zarr group JSON metadata file when processing a variable in a dataset. The filename is <netcdf file basename>_<variable name>.json. If using local storage it is written to the same directory as the netCDF file. If using S3 storage it is written to the current directory. If a file of the same name exists, it is not updated. The files are never removed.

This leads to various issues when reusing netCDF filenames for different datasets (e.g. during testing), since the Zarr group metadata may be describing a previous incarnation of the dataset.

It also leads to an undesirable build-up of JSON files.

Zarr dataset metadata is cached on an Active instance, so there isn't much to be gained from leaving the JSON files around to be reused.

I propose we use a temporary file to write the Zarr group metadata, and immediately remove it once it has been used.
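A minimal sketch of the proposal, using only the Python standard library. The helper name `with_temp_metadata`, the callback `read_chunks`, and the metadata contents are hypothetical illustrations, not PyActiveStorage code; the real change would pass the temporary path to whatever consumes the Zarr group JSON.

```python
import json
import os
import tempfile


def with_temp_metadata(metadata, use):
    """Write `metadata` to a temporary JSON file, call `use(path)`, then delete the file."""
    # delete=False so the file can be closed and reopened by path on any platform;
    # removal is handled explicitly in the finally block below.
    with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as f:
        json.dump(metadata, f)
        path = f.name
    try:
        return use(path)
    finally:
        # Remove the file as soon as it has been used, even if `use` raised.
        os.remove(path)


def read_chunks(path):
    # Hypothetical consumer: stands in for code that loads the Zarr group JSON.
    with open(path) as f:
        return json.load(f)["chunks"]


result = with_temp_metadata({"chunks": [10, 10]}, read_chunks)
```

Because the file name is unique per call and the file is removed straight away, nothing stale can be picked up by a later dataset that happens to share a netCDF filename.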

markgoddard · 2023-08-15T14:20:06Z

To add some specifics, this led to a difficult-to-diagnose issue when trying to implement #120. Because the same filename (test_compression.nc) was used for multiple test cases (with and without byte-shuffled data), the Zarr metadata file was not updated after the first test case wrote it. This led to incorrect chunk sizes and offsets, and ultimately to decompression failure.
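The failure mode can be reproduced in isolation with a small sketch. The function `write_if_missing`, the JSON filename, and the metadata keys are assumptions made up for illustration; the sketch only mimics the "never update an existing file" behaviour described in this issue.

```python
import json
import os
import tempfile


def write_if_missing(path, metadata):
    """Mimic the described behaviour: write the JSON only if no file exists, then read it back."""
    if not os.path.exists(path):
        with open(path, "w") as f:
            json.dump(metadata, f)
    with open(path) as f:
        return json.load(f)


with tempfile.TemporaryDirectory() as workdir:
    # Illustrative filename; both test cases map to the same JSON path.
    path = os.path.join(workdir, "test_compression_data.json")
    # First test case: data without byte shuffle.
    first = write_if_missing(path, {"shuffle": False, "chunks": [3, 3]})
    # Second test case: a different dataset reusing the same netCDF filename.
    second = write_if_missing(path, {"shuffle": True, "chunks": [2, 5]})
```

Here `second` still holds the metadata written for the first dataset, which is the kind of mismatch that would produce wrong chunk sizes and offsets downstream.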

It's not clear why this did not also affect the test when using local storage.

markgoddard · 2023-08-15T14:30:39Z

> It's not clear why this did not also affect the test when using local storage.

Worked it out (and updated the description) - for local files we generate the JSON file in the same directory as the netCDF file, which in the test case is in a temporary directory. This is not persistent, so does not suffer from the issue.

When using multiple netCDF files with the same names, the Zarr group JSON file would previously not be overwritten after it was first written. Subsequent uses could therefore pick up an invalid Zarr group metadata file. This change switches to using a temporary file to store the Zarr group metadata. This should not be a problem because the Zarr datasource is cached in the Active object as the _zds member between operations. Closes #134

Use a temporary file for Zarr group JSON

markgoddard self-assigned this Aug 15, 2023

markgoddard added the bug Something isn't working label Aug 15, 2023

markgoddard mentioned this issue Aug 15, 2023

Use a temporary file for Zarr group JSON #135

Merged

valeriupredoi closed this as completed in #135 Sep 14, 2023

valeriupredoi referenced this issue Sep 14, 2023

Merge pull request #135 from valeriupredoi/issues/134

883383f

Use a temporary file for Zarr group JSON
