What happened: When I open two particular netCDF files in xarray, concatenate them, write the resulting Dataset to disk and reopen it, bad/unexpected values are introduced. This happens very rarely (at least that I have noticed) and I have not spotted a pattern that would indicate when to expect this behaviour.
What you expected to happen: The values should not change through the writing and reading process.
Minimal Complete Verifiable Example: Download 2t_era5_moda_sfc_20190201-20190228.nc and 2t_era5_moda_sfc_20190501-20190531.nc from https://github.com/dougrichardson/issues/tree/main/xarray_write (each file is ~1.3MB).
import numpy as np
import xarray as xr

feb = xr.open_dataset('./2t_era5_moda_sfc_20190201-20190228.nc')
feb = feb.sel(latitude=slice(21, 19), longitude=slice(79, 80))
may = xr.open_dataset('./2t_era5_moda_sfc_20190501-20190531.nc')
may = may.sel(latitude=slice(21, 19), longitude=slice(79, 80))
ds = xr.concat([feb, may], dim='time')

# The bad values are introduced for may. This is what the data should look like:
ds.t2m.sel(time='2019-05-01').plot()

# Write to file, reopen and plot again
ds.to_netcdf('./test.nc', mode='w')
ds.close()
ds2 = xr.open_dataset('./test.nc')
ds2.t2m.sel(time='2019-05-01').plot()
ds2.close()

# We can also compare values using numpy.isclose:
np.isclose(may.t2m.values, ds2.t2m.sel(time='2019-05').values)
array([[[False, False, False, False, False],
[False, False, False, False, False],
[False, False, False, False, True],
[False, False, False, False, False],
[False, False, False, False, False],
[False, False, False, False, False],
[False, False, False, False, False],
[ True, True, False, False, False],
[False, True, True, True, False]]])
Anything else we need to know?: Bad data is generated only in one time slice of ds, i.e. ds.sel(time='2019-05-01'). However, when I replace feb with any of a number of other netCDF files, there is no problem, so the issue seems to be specific to these two files. I can provide a third netCDF file to illustrate the absence of the problem there, if that would be useful.
This appears to be related to the encoding - if I specify the datatype when writing to file, the problem is fixed. However, as pointed out in #4826, this can introduce other problems. The netcdf files are climate data with add_offset and scale_factor attributes.
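For reference, a minimal sketch of one way to specify the datatype on write, using the encoding argument of to_netcdf with the t2m variable from the example above:

# Force a floating-point dtype for t2m on disk instead of the inherited int16 packing
ds.to_netcdf('./test.nc', mode='w', encoding={'t2m': {'dtype': 'float32'}})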
Environment:
Output of xr.show_versions()
INSTALLED VERSIONS
commit: None
python: 3.9.4 | packaged by conda-forge | (default, May 10 2021, 22:13:33)
[GCC 9.3.0]
python-bits: 64
OS: Linux
OS-release: 4.18.0-305.7.1.el8.nci.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: None
LOCALE: ('en_US', 'UTF-8')
libhdf5: 1.10.6
libnetcdf: 4.7.4
xarray: 0.19.0
pandas: 1.2.4
numpy: 1.20.3
scipy: 1.6.3
netCDF4: 1.5.6
pydap: None
h5netcdf: None
h5py: None
Nio: None
zarr: 2.8.3
cftime: 1.5.0
nc_time_axis: 1.2.0
PseudoNetCDF: None
rasterio: 1.2.4
cfgrib: None
iris: None
bottleneck: 1.3.2
dask: 2021.05.1
distributed: 2021.05.1
matplotlib: 3.4.2
cartopy: 0.19.0.post1
seaborn: None
numbagg: None
pint: None
setuptools: 49.6.0.post20210108
pip: 21.1.2
conda: 4.10.1
pytest: None
IPython: 7.24.0
sphinx: None
@dougrichardson Sorry for the delay. If you are still interested in the source of this issue, here is what I found:
The root cause is the different scale_factor and add_offset values in the source files.
When merging, only the .encoding of the first dataset survives, which leads to a wrongly encoded file for the May dates.
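This can be verified by inspecting the variables' .encoding; a minimal sketch, assuming the feb, may and ds objects from the example above:

# The two source files carry different packing parameters ...
print(feb.t2m.encoding['scale_factor'], feb.t2m.encoding['add_offset'])
print(may.t2m.encoding['scale_factor'], may.t2m.encoding['add_offset'])
# ... but the concatenated dataset keeps only the first file's values,
# so the May slice is packed with February's scale/offset on write.
print(ds.t2m.encoding['scale_factor'], ds.t2m.encoding['add_offset'])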
But why is this so? The issue is with the packed dtype ("int16") and the particular values of scale_factor/add_offset.
For feb the dynamic range is (228.96394336525748, 309.9690856933594) K whereas for may it is
(205.7644192729947, 311.7797088623047) K.
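Roughly, these ranges follow from the int16 limits together with each file's scale_factor and add_offset; a minimal sketch, assuming the feb object from the example above (the exact endpoints also depend on how the _FillValue is reserved):

import numpy as np

enc = feb.t2m.encoding  # holds 'dtype' ('int16'), 'scale_factor' and 'add_offset'
ints = np.iinfo(np.int16)
lo = enc['add_offset'] + enc['scale_factor'] * ints.min
hi = enc['add_offset'] + enc['scale_factor'] * ints.max
print(lo, hi)  # approximately (228.96, 309.97) K for the February file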
Now we can clearly see that all values above 309.969 K overflow the int16 packing and are folded back to the lower end (> 229 K).
To circumvent that you have at least two options (both sketched below):
1. change the scale_factor and add_offset values in the variable's .encoding before writing, to values which cover your whole dynamic range, or
2. drop scale_factor/add_offset (and other CF-related attributes) from .encoding to write floating point values.
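A minimal sketch of both options, assuming the ds object from the example above; compute_scale_and_offset is a hypothetical helper, not part of xarray:

# Option 1: recompute the packing so it covers the full dynamic range of ds
def compute_scale_and_offset(da, nbits=16):
    vmin, vmax = float(da.min()), float(da.max())
    scale_factor = (vmax - vmin) / (2 ** nbits - 1)
    add_offset = vmin + scale_factor * 2 ** (nbits - 1)
    return scale_factor, add_offset

scale_factor, add_offset = compute_scale_and_offset(ds.t2m)
ds.t2m.encoding.update({'dtype': 'int16',
                        'scale_factor': scale_factor,
                        'add_offset': add_offset})
ds.to_netcdf('./test_repacked.nc', mode='w')

# Option 2: drop the packing-related keys and write floating point values instead
for key in ('scale_factor', 'add_offset', 'dtype', '_FillValue', 'missing_value'):
    ds.t2m.encoding.pop(key, None)
ds.to_netcdf('./test_float.nc', mode='w')

With either variant the reopened values should match ds.t2m to within the packing precision.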
It might be nice to have checks for that in the encoding step, to prevent writing erroneous values. So this is not really a bug, but it might be less impactful if encoding were dropped on operations (see discussion in #6323).
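Such a check could look roughly like the following; this is only an illustrative sketch (safe_to_pack is a hypothetical helper, not an existing xarray function), assuming an integer packing as above:

import numpy as np

def safe_to_pack(da):
    """Return True if da's values fit into the scale/offset packing in da.encoding."""
    enc = da.encoding
    if 'scale_factor' not in enc and 'add_offset' not in enc:
        return True  # no packing requested, nothing to check
    ints = np.iinfo(enc.get('dtype', 'int16'))
    lo = enc.get('add_offset', 0.0) + enc.get('scale_factor', 1.0) * ints.min
    hi = enc.get('add_offset', 0.0) + enc.get('scale_factor', 1.0) * ints.max
    return float(da.min()) >= lo and float(da.max()) <= hi

if not safe_to_pack(ds.t2m):
    raise ValueError("t2m contains values outside the range representable by its encoding")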