Description
What happened: When I open two particular netCDF files in xarray, concatenate them, write the resulting Dataset to disk, and reopen it, bad/unexpected values are introduced. This happens very rarely (as far as I have noticed), and I have not spotted a pattern that would indicate when to expect this behaviour.
What you expected to happen: The values should not change through the writing and reading process.
Minimal Complete Verifiable Example:
Download 2t_era5_moda_sfc_20190201-20190228.nc and 2t_era5_moda_sfc_20190501-20190531.nc from https://github.com/dougrichardson/issues/tree/main/xarray_write (each file is ~1.3MB).
import numpy as np
import xarray as xr

# Open each monthly file and select the same small region
feb = xr.open_dataset('./2t_era5_moda_sfc_20190201-20190228.nc')
feb = feb.sel(latitude=slice(21,19), longitude=slice(79,80))
may = xr.open_dataset('./2t_era5_moda_sfc_20190501-20190531.nc')
may = may.sel(latitude=slice(21,19), longitude=slice(79,80))
ds = xr.concat([feb, may], dim='time')
# The bad values are introduced for may. This is what the file should look like:
ds.t2m.sel(time='2019-05-01').plot()
# Write to file, reopen and plot again
ds.to_netcdf('./test.nc', mode='w')
ds.close()
ds2 = xr.open_dataset('./test.nc')
ds2.t2m.sel(time='2019-05-01').plot()
ds2.close()
# We can also compare values using numpy.isclose:
np.isclose(may.t2m.values, ds2.t2m.sel(time='2019-05').values)
array([[[False, False, False, False, False],
        [False, False, False, False, False],
        [False, False, False, False,  True],
        [False, False, False, False, False],
        [False, False, False, False, False],
        [False, False, False, False, False],
        [False, False, False, False, False],
        [ True,  True, False, False, False],
        [False,  True,  True,  True, False]]])
Anything else we need to know?: Bad data is generated only in one time slice of ds, i.e. ds.sel(time='2019-05-01'). When I replace feb with any of a number of other netCDF files, the problem does not occur, so the issue seems to be specific to these two files. I can provide a third netCDF file to highlight the lack of a problem there, if that would be useful.
This appears to be related to the encoding: if I specify the datatype when writing to file, the problem goes away. However, as pointed out in #4826, this can introduce other problems. The netCDF files are climate data with add_offset and scale_factor attributes.
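To illustrate why add_offset and scale_factor could matter here, the following is a minimal NumPy-only sketch of packing floats into 16-bit integers and unpacking them again. It is not a diagnosis of these particular files. The helper names pack and unpack, the scale_factor of 0.001, and the add_offset of 290.0 are all hypothetical values chosen for illustration, and the wrap-around cast is an assumption about how an out-of-range value might end up stored.

```python
import numpy as np

# Hypothetical packing parameters (NOT taken from the ERA5 files in this issue).
SCALE_FACTOR = 0.001
ADD_OFFSET = 290.0


def pack(values, scale_factor, add_offset, dtype=np.int16):
    """Pack floats as integers: packed = round((value - add_offset) / scale_factor)."""
    scaled = np.round((np.asarray(values, dtype=np.float64) - add_offset) / scale_factor)
    # Casting int64 -> int16 silently wraps values outside the int16 range.
    return scaled.astype(np.int64).astype(dtype)


def unpack(packed, scale_factor, add_offset):
    """Unpack integers to floats: value = packed * scale_factor + add_offset."""
    return packed.astype(np.float64) * scale_factor + add_offset


# With these parameters int16 can represent roughly 257.2 K to 322.8 K.
# 300.0 lies inside that range; 325.0 lies outside it.
values = np.array([300.0, 325.0])
packed = pack(values, SCALE_FACTOR, ADD_OFFSET)
roundtrip = unpack(packed, SCALE_FACTOR, ADD_OFFSET)

# Compare the original values with the round-tripped ones.
ok = np.isclose(values, roundtrip)
print(packed, roundtrip, ok)
```

In this sketch the in-range value survives the round trip while the out-of-range value does not. If the concatenated Dataset were written with packing parameters that suit one month but not the other, a similar effect could in principle produce isolated bad values; whether that is what happens with these files is not established here.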
Environment:
Output of xr.show_versions()
INSTALLED VERSIONS
commit: None
python: 3.9.4 | packaged by conda-forge | (default, May 10 2021, 22:13:33)
[GCC 9.3.0]
python-bits: 64
OS: Linux
OS-release: 4.18.0-305.7.1.el8.nci.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: None
LOCALE: ('en_US', 'UTF-8')
libhdf5: 1.10.6
libnetcdf: 4.7.4
xarray: 0.19.0
pandas: 1.2.4
numpy: 1.20.3
scipy: 1.6.3
netCDF4: 1.5.6
pydap: None
h5netcdf: None
h5py: None
Nio: None
zarr: 2.8.3
cftime: 1.5.0
nc_time_axis: 1.2.0
PseudoNetCDF: None
rasterio: 1.2.4
cfgrib: None
iris: None
bottleneck: 1.3.2
dask: 2021.05.1
distributed: 2021.05.1
matplotlib: 3.4.2
cartopy: 0.19.0.post1
seaborn: None
numbagg: None
pint: None
setuptools: 49.6.0.post20210108
pip: 21.1.2
conda: 4.10.1
pytest: None
IPython: 7.24.0
sphinx: None