Writing and reopening introduces bad values #5739


Closed
dougrichardson opened this issue Aug 26, 2021 · 2 comments

Comments

@dougrichardson

What happened: When I open two particular netCDF files in xarray, concatenate them, write the resulting Dataset to disk and reopen it, bad/unexpected values are introduced. This happens very rarely (at least that I have noticed) and I have not spotted a pattern that would indicate when to expect this behaviour.

What you expected to happen: The values should not change through the writing and reading process.

Minimal Complete Verifiable Example:

Download 2t_era5_moda_sfc_20190201-20190228.nc and 2t_era5_moda_sfc_20190501-20190531.nc from https://github.com/dougrichardson/issues/tree/main/xarray_write (each file is ~1.3MB).

import numpy as np
import xarray as xr

feb = xr.open_dataset('./2t_era5_moda_sfc_20190201-20190228.nc')
feb = feb.sel(latitude=slice(21,19), longitude=slice(79,80))

may = xr.open_dataset('./2t_era5_moda_sfc_20190501-20190531.nc')
may = may.sel(latitude=slice(21,19), longitude=slice(79,80))

ds = xr.concat([feb, may], dim='time')

# The bad values are introduced for may. This is what the file should look like:
ds.t2m.sel(time='2019-05-01').plot()

# Write to file, reopen and plot again
ds.to_netcdf('./test.nc', mode='w')
ds.close()

ds2 = xr.open_dataset('./test.nc')
ds2.t2m.sel(time='2019-05-01').plot()
ds2.close()

# We can also compare values using numpy.isclose:
np.isclose(may.t2m.values, ds2.t2m.sel(time='2019-05').values)
array([[[False, False, False, False, False],
        [False, False, False, False, False],
        [False, False, False, False,  True],
        [False, False, False, False, False],
        [False, False, False, False, False],
        [False, False, False, False, False],
        [False, False, False, False, False],
        [ True,  True, False, False, False],
        [False,  True,  True,  True, False]]])

Anything else we need to know?: Bad data is generated only in one time slice of ds, i.e. ds.sel(time='2019-05-01'). However, when I replace feb with various other netCDF files, the problem does not occur, so the issue seems to be specific to these two files. I can provide a third netCDF file to demonstrate the absence of the problem there, if that would be useful.

This appears to be related to the encoding: if I specify the data type when writing to file, the problem is fixed. However, as pointed out in #4826, this can introduce other problems. The netCDF files are climate data with add_offset and scale_factor attributes.
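
For reference, a sketch of that workaround, using the ds from the example above (the output path is hypothetical). Forcing a floating-point on-disk dtype sidesteps the limits of the packed int16 encoding:

# Workaround sketch: write the variable as float32 instead of packed int16.
# This avoids the bad values, though as noted above it can introduce the
# problems discussed in #4826.
ds.to_netcdf('./test_float32.nc', mode='w',
             encoding={'t2m': {'dtype': 'float32'}})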

Environment:

Output of xr.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.9.4 | packaged by conda-forge | (default, May 10 2021, 22:13:33)
[GCC 9.3.0]
python-bits: 64
OS: Linux
OS-release: 4.18.0-305.7.1.el8.nci.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: None
LOCALE: ('en_US', 'UTF-8')
libhdf5: 1.10.6
libnetcdf: 4.7.4

xarray: 0.19.0
pandas: 1.2.4
numpy: 1.20.3
scipy: 1.6.3
netCDF4: 1.5.6
pydap: None
h5netcdf: None
h5py: None
Nio: None
zarr: 2.8.3
cftime: 1.5.0
nc_time_axis: 1.2.0
PseudoNetCDF: None
rasterio: 1.2.4
cfgrib: None
iris: None
bottleneck: 1.3.2
dask: 2021.05.1
distributed: 2021.05.1
matplotlib: 3.4.2
cartopy: 0.19.0.post1
seaborn: None
numbagg: None
pint: None
setuptools: 49.6.0.post20210108
pip: 21.1.2
conda: 4.10.1
pytest: None
IPython: 7.24.0
sphinx: None

@kmuehlbauer
Contributor

@dougrichardson Sorry for the delay. If you are still interested in the source of this issue here is what I found:

The root cause is different scale_factor and add_offset in the source files.

When concatenating, only the .encoding of the first dataset survives. This leads to a wrongly encoded file for the May dates. But why is that?

The issue is with the packed dtype ("int16") and the particular values of scale_factor/add_offset.
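
A quick way to see this, assuming the feb, may and ds objects from the example above (inspection only, nothing is written):

# The packing parameters live in the per-variable .encoding dict.
# After xr.concat, ds.t2m.encoding carries feb's scale_factor/add_offset,
# which are then reused when the May values are packed on write.
for name, da in [('feb', feb.t2m), ('may', may.t2m), ('concat', ds.t2m)]:
    enc = da.encoding
    print(name, enc.get('dtype'), enc.get('scale_factor'), enc.get('add_offset'))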

For feb the dynamic range is (228.96394336525748, 309.9690856933594) K, whereas for may it is (205.7644192729947, 311.7797088623047) K.

Since the concatenated dataset is packed with feb's parameters, all values above 309.969 K overflow the packed int16 range and are folded back to the lower end (just above 229 K).
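
The folding can be reproduced with plain numpy. This is only a sketch of CF-style packing (packed = round((x - add_offset) / scale_factor) stored as int16); the scale_factor/add_offset below are illustrative values chosen to match feb's roughly 229-310 K range, not the ones read from the file:

import numpy as np

scale_factor = (309.969 - 228.964) / (2**16 - 2)   # spans feb's range over int16
add_offset = (309.969 + 228.964) / 2

def roundtrip(x):
    # Pack the CF way, letting the cast to int16 wrap silently past 32767,
    # then unpack again.
    packed = np.round((np.asarray(x) - add_offset) / scale_factor).astype(np.int64)
    packed = packed.astype(np.int16)
    return packed * scale_factor + add_offset

print(roundtrip(300.0))   # inside feb's range  -> round-trips to ~300 K
print(roundtrip(311.78))  # above feb's range   -> wraps, comes back near 231 K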

To circumvent that you have at least two options (see the sketch after this list):

  • change the scale_factor and add_offset values in the variable's .encoding before writing, so that they cover your whole dynamic range
  • drop scale_factor/add_offset (and other CF-related attributes) from .encoding to write floating-point values
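
A rough sketch of both options, assuming the ds and variable name t2m from the example above; the output paths are hypothetical and the new packing values are placeholders you would compute from your own data (in practice leave headroom for a fill value if one is used):

# Option 1: widen the packing so it covers the combined Feb + May range.
enc = ds.t2m.encoding
vmin = float(ds.t2m.min())
vmax = float(ds.t2m.max())
enc['scale_factor'] = (vmax - vmin) / (2**16 - 2)
enc['add_offset'] = (vmax + vmin) / 2
ds.to_netcdf('./test_repacked.nc', mode='w')

# Option 2: drop the CF packing attributes and write plain floating point.
for key in ('scale_factor', 'add_offset', 'dtype'):
    ds.t2m.encoding.pop(key, None)
ds.to_netcdf('./test_float.nc', mode='w')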

It might be nice to have checks for this in the encoding step, to prevent writing erroneous values. So this is not really a bug, but it would be less impactful if encoding were dropped on operations (see the discussion in #6323).

@kmuehlbauer
Contributor

I think the root cause and solutions were described in the above comment, so I'm closing this. Please reopen, if needed.
