fix to_netcdf append bug (GH1215) #1609
Conversation
@fmaussion - you bring up a good point. There are two scenarios here.
I'm wondering if we should disallow 1, in favor of being more explicit. So:

```python
list_of_vars_to_append = ['var1', 'var2']
ds[list_of_vars_to_append].to_netcdf(filename, mode='a')
# if either var1 or var2 are in filename, raise an error? as opposed to the
# current behavior, which would silently skip all vars already in the file
```
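The "raise instead of silently skipping" option could be sketched roughly like this (`check_append_overlap` is a hypothetical helper for illustration, not xarray API):

```python
def check_append_overlap(new_vars, existing_vars):
    """Raise if any variable slated for append already exists in the file."""
    overlap = set(new_vars) & set(existing_vars)
    if overlap:
        raise ValueError(
            'variables %r already exist in the target file; '
            'refusing to silently skip them' % sorted(overlap))

# disjoint sets of names: no error is raised
check_append_overlap(['var1', 'var9'], ['var2', 'var3'])
```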
Netcdf4 would overwrite in this situation, and I am also in favor of overwriting, as this could be quite a useful use case:

```python
from netCDF4 import Dataset

with Dataset('test.nc', 'w', format='NETCDF4') as nc:
    nc.createDimension('lat', 1)
    nc.createVariable('lat', 'f4', ('lat',))
    nc['lat'][:] = 1

with Dataset('test.nc', 'a') as nc:
    nc['lat'][:] = 2
```
@fmaussion - I've updated the append logic slightly. I'm wondering what you think? This version more aggressively overwrites existing variables (data_vars and coords).
Thanks! I like it: simple and in accordance with netcdf4's silent overwriting. It would be cool to describe this behavior in the documentation somewhere, and maybe add a test that the data is correctly overwritten?
Question for those who are familiar with the
@jhamman I think this is something to do with string/bytes. The data gets written as Unicode strings but then read back in as bytes, which is obviously not ideal. If you want to ignore this, like the current tests, you can use
doc/io.rst
Outdated

```diff
@@ -176,6 +176,10 @@ for dealing with datasets too big to fit into memory. Instead, xarray integrates
 with dask.array (see :ref:`dask`), which provides a fully featured engine for
 streaming computation.
 
+It is possible to append or overwrite netCDF variables using the ``mode='a'``
+argument. When using this option, all variables in the dataset will be written
+to the netCDF file, regardless if they exist in the original dataset.
```
for clarity: in the original file
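The documented `mode='a'` behavior could be paired with a short usage example (a sketch; the file name `example.nc` and the variable names are illustrative):

```python
import numpy as np
import xarray as xr

# write an initial dataset, then append a second variable to the same file
ds = xr.Dataset({'temperature': ('x', np.arange(3.0))})
ds.to_netcdf('example.nc', mode='w')

xr.Dataset({'pressure': ('x', np.ones(3))}).to_netcdf('example.nc', mode='a')

# both variables should now be present in the file
with xr.open_dataset('example.nc') as combined:
    print(sorted(combined.data_vars))
```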
xarray/core/dataset.py
Outdated

```diff
@@ -963,7 +963,8 @@ def to_netcdf(self, path=None, mode='w', format=None, group=None,
             default format becomes NETCDF3_64BIT).
         mode : {'w', 'a'}, optional
             Write ('w') or append ('a') mode. If mode='w', any existing file at
-            this location will be overwritten.
+            this location will be overwritten. If mode='a', exisitng variables
+            will be over written.
```
overwritten in one word?
xarray/tests/test_backends.py
Outdated

```python
data['var9'] = data['var2'] * 3
data[['var2', 'var9']].to_netcdf(tmp_file, mode='a',
                                 engine='scipy')
actual = open_dataset(tmp_file, engine='scipy')
```
This is triggering a test failure on Windows, where you apparently can't open the same file twice.
xarray/tests/test_backends.py
Outdated

```python
data['var9'] = data['var2'] * 3
data[['var2', 'var9']].to_netcdf(tmp_file, mode='a',
                                 engine='netcdf4')
actual = open_dataset(tmp_file, autoclose=self.autoclose,
```
Rather than allowing a clean-up failure, please close this file if possible (use a context manager). File descriptors are still a limited resource for our test suite.
xarray/tests/test_backends.py
Outdated

```diff
@@ -823,6 +823,37 @@ def roundtrip(self, data, save_kwargs={}, open_kwargs={},
                 autoclose=self.autoclose, **open_kwargs) as ds:
             yield ds
 
+    @contextlib.contextmanager
+    def roundtrip_append(self, data, save_kwargs={}, open_kwargs={},
```
Can we put most of these tests in one of the base classes, so we don't separately repeat them for scipy, netcdf4 and h5netcdf?
xarray/backends/common.py
Outdated

```python
target, source = self.prepare_variable(
    name, v, check, unlimited_dims=unlimited_dims)
if (vn not in self.variables or
        (getattr(self, '_mode', False) != 'a')):
```
I don't think we need the explicit check for the mode here. With mode='w', the file should already be starting from scratch.
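The point can be sketched with a toy predicate (`should_create` is an illustrative name, not xarray code): with mode='w' the file starts empty, so the membership test alone already decides whether to create or overwrite, and the extra mode check is redundant.

```python
def should_create(name, existing_vars):
    # Membership alone decides: under mode='w' the file starts empty, so this
    # is always True; under mode='a', presence means overwrite in place.
    return name not in existing_vars

assert should_create('lat', {})                 # mode='w': fresh file, create
assert not should_create('lat', {'lat': None})  # mode='a': overwrite existing
```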
xarray/tests/test_backends.py
Outdated

```python
mode = 'a' if i > 0 else 'w'
data[[key]].to_netcdf(tmp_file, mode=mode, **save_kwargs)
with open_dataset(tmp_file,
                  autoclose=self.autoclose, **open_kwargs) as ds:
```
@shoyer - I've put these tests in a base class now. I've never quite understood how these tests work though. Are the inherited classes passing open/save_kwargs that specify the engine? How does this work?
Yes, that's exactly right.
One day it would be nice to split apart the gigantic test_backends.py into separate files for each backend...
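The pattern being confirmed here can be sketched as follows (class and method names are illustrative, not the real test suite): the base class holds the shared logic and reads `engine` off `self`, so each subclass only declares a class attribute.

```python
class BackendTestBase:
    engine = None  # subclasses override this

    def save_kwargs(self):
        # shared test logic picks up whichever engine the subclass declared
        return {'engine': self.engine}

class NetCDF4Case(BackendTestBase):
    engine = 'netcdf4'

class ScipyCase(BackendTestBase):
    engine = 'scipy'

assert NetCDF4Case().save_kwargs() == {'engine': 'netcdf4'}
assert ScipyCase().save_kwargs() == {'engine': 'scipy'}
```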
xarray/tests/test_backends.py
Outdated

```python
data['var9'] = data['var2'] * 3
data[['var2', 'var9']].to_netcdf(tmp_file, mode='a')
with open_dataset(tmp_file, autoclose=self.autoclose) as actual:
    assert_identical(data, actual)
```
Regarding the issue with Windows: what's the best way to do this then? Is it possible to explicitly close a netCDF file written with to_netcdf?
to_netcdf() should write and close the file. You only need to use the context manager to ensure things get closed with open_dataset().
@shoyer - ready for final review.
xarray/core/dataset.py
Outdated

```diff
@@ -974,7 +974,8 @@ def to_netcdf(self, path=None, mode='w', format=None, group=None,
             default format becomes NETCDF3_64BIT).
         mode : {'w', 'a'}, optional
             Write ('w') or append ('a') mode. If mode='w', any existing file at
-            this location will be overwritten.
+            this location will be overwritten. If mode='a', exisitng variables
```
exisitng -> existing
xarray/tests/test_backends.py
Outdated

```python
        allow_cleanup_failure=allow_cleanup_failure) as tmp_file:
    for i, key in enumerate(data.variables):
        mode = 'a' if i > 0 else 'w'
        data[[key]].to_netcdf(tmp_file, mode=mode, **save_kwargs)
```
This is currently always using the default backend. You need to explicitly set engine here somewhere, though much of the logic can potentially live on the base class or a helper function of some sort. There's no magic that sets this automatically for subclasses.
Take a look at how roundtrip() looks. I think I led you astray by suggesting that you move everything to the base class.
xarray/tests/test_backends.py
Outdated

```diff
@@ -592,6 +627,8 @@ def create_tmp_files(nfiles, suffix='.nc', allow_cleanup_failure=False):
 
 @requires_netCDF4
+class BaseNetCDF4Test(CFEncodedDataTest):
+    engine = 'netcdf4'
```
@shoyer, what do you think about this solution?
```diff
@@ -1024,6 +1030,8 @@ class ScipyFilePathTestAutocloseTrue(ScipyFilePathTest):
 
 @requires_netCDF4
 class NetCDF3ViaNetCDF4DataTest(CFEncodedDataTest, Only32BitTypes, TestCase):
+    engine = 'netcdf4'
```
This isn't quite enough -- you also need to set the format in this case. Take a look at roundtrip() below.
Probably the cleanest fix would be to refactor roundtrip() into three methods:

```python
# on the base class
@contextlib.contextmanager
def roundtrip(self, data, save_kwargs={}, open_kwargs={},
              allow_cleanup_failure=False):
    with create_tmp_file(
            allow_cleanup_failure=allow_cleanup_failure) as path:
        self.save(data, path, **save_kwargs)
        with self.open(path, **open_kwargs) as ds:
            yield ds

# on subclasses, e.g., for NetCDF3ViaNetCDF4DataTest
def save(self, dataset, path, **kwargs):
    dataset.to_netcdf(path, format='NETCDF3_CLASSIC',
                      engine='netcdf4', **kwargs)

@contextlib.contextmanager
def open(self, path, **kwargs):
    with open_dataset(path, engine='netcdf4',
                      autoclose=self.autoclose, **kwargs) as ds:
        yield ds
```

Then you could write roundtrip_append() in terms of save and open.
@shoyer - take another look. I have basically merged our two ideas and refactored the roundtrip tests. Tests are still failing, but not for py2.7, on appveyor, or for py3.6 locally.
@jhamman This looks great, thank you!
- Passes `git diff upstream/master | flake8 --diff`
- `whats-new.rst` for all changes and `api.rst` for new API

TODO: Additional tests needed to verify this works for all the write backends and that the correct errors are raised when dims are different between writes.