open_mfdataset too many files #463
Just a little follow up... I tried to work around the file limit by serializing the processing of the files and creating xray datasets with fewer files in them. However, I still eventually hit this error, suggesting that the files are never being closed.

For example, I would like to do

    ds = xray.open_mfdataset(ddir + '*.nc', engine='scipy')
    EKE = (ds.variables['u']**2 + ds.variables['v']**2).mean(dim='time').load()

This tries to open 8031 files and produces the error. So then I try to create a new dataset for each year:

    EKE = []
    for yr in xrange(1993, 2015):
        print yr
        # this opens about 365 files
        ds = xray.open_mfdataset(ddir + '/dt_global_allsat_msla_uv_%04d*.nc' % yr, engine='scipy')
        EKE.append((ds.variables['u']**2 + ds.variables['v']**2).mean(dim='time').load())

This works okay for the first two years. However, by the third year, I still get the error. Using xray version 0.5.1 via conda module. |
Yes, this is a known issue, and I agree that it is annoying. We could work around this by opening up (and closing) netCDF files inside the |
I am using the scipy backend because the netcdf4 backend doesn't work for me at all. It core dumps with the error
Are you suggesting I work on the scipy backend? |
Sure, you could do this on the scipy backend -- the logic will be essentially the same on both backends. I believe your issue with the netCDF4 backend is the same as this one: #444. This will be fixed in the next release. |
Ok, I will have a look at this. I would be happy to contribute to this awesome project. By the way, by monitoring /proc, I was able to see that the scipy backend actually opens each file TWICE, exacerbating the problem. |
I came up with a solution for this, but it is so slow that it is useless. |
Hmm. How big are each of your netCDF files? |
8 MB. This is daily satellite data, with one file per time point. (Most satellite data is distributed this way.) There are many other workarounds to this problem. You can try to increase your ulimits. Or you can join these small netCDF files together into a big one. I had daily data files, and I used NCO to concatenate them into monthly files. That basically solved my problem. But of course that involves going outside of xray. |
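If raising the ulimit is the chosen workaround, the soft limit can also be inspected and raised from inside Python. This is a minimal, Unix-only sketch; on macOS the kernel may still cap the value below the reported hard limit:

```python
import resource

# Current soft and hard limits on open file descriptors for this process.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print('soft limit: %d, hard limit: %d' % (soft, hard))

# Raise the soft limit as far as the hard limit allows; anything beyond
# that needs root or a change to the system configuration.
resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))
```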
I've run into the same problem and have been looking at the netCDF backend. A solution does not seem to be as easy as opening and closing the file in the ... Short of decorating all the functions of the netCDF4 package, I cannot think of a workable solution to this. But maybe I'm overlooking something fundamental. |
I think we can actually read all the variable metadata (shape and dtype) in when we open the file -- we already do that for reading in attributes. Something like this prototype, which would also be useful for reading compressed netCDF4 files with multiprocessing: dask/dask#457 (comment) |
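As I read it, the gist of that prototype is something like the following sketch: record each variable's shape and dtype once at open time, keep the file closed in between, and reopen it only for the duration of an actual read. The class and attribute names here are made up for illustration, not xray's actual backend code:

```python
import numpy as np
import netCDF4


class LazyNetCDFArray(object):
    """Record shape/dtype up front so the file can stay closed between reads."""

    def __init__(self, filename, varname):
        self.filename = filename
        self.varname = varname
        nc = netCDF4.Dataset(filename)
        try:
            var = nc.variables[varname]
            self.shape = var.shape
            self.dtype = np.dtype(var.dtype)
        finally:
            nc.close()

    def __getitem__(self, key):
        # Reopen the file only for the duration of the actual read.
        nc = netCDF4.Dataset(self.filename)
        try:
            return nc.variables[self.varname][key]
        finally:
            nc.close()
```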
I've pushed a few commits trying this out to https://github.com/cpaulik/xray/tree/closing_netcdf_backend . I can open a WIP PR if this would be easier to discuss there. There are, however, a few tests that keep failing and I cannot figure out why. E.g.: if I set a breakpoint at line 941 of dataset.py and just continue, the test fails; if I instead evaluate the variable at the breakpoint, the test passes. The error I get when running the test without interference is:

test_backends.py::NetCDF4ViaDaskDataTest::test_compression_encoding FAILED
====================================================== FAILURES =======================================================
__________________________________ NetCDF4ViaDaskDataTest.test_compression_encoding ___________________________________
self = <xray.test.test_backends.NetCDF4ViaDaskDataTest testMethod=test_compression_encoding>
def test_compression_encoding(self):
data = create_test_data()
data['var2'].encoding.update({'zlib': True,
'chunksizes': (5, 5),
'fletcher32': True})
> with self.roundtrip(data) as actual:
test_backends.py:502:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
/usr/lib/python2.7/contextlib.py:17: in __enter__
return self.gen.next()
test_backends.py:596: in roundtrip
yield ds.chunk()
../core/dataset.py:942: in chunk
for k, v in self.variables.items()])
../core/dataset.py:935: in maybe_chunk
token2 = tokenize(name, token if token else var._data)
/home/cpa/.virtualenvs/xray/local/lib/python2.7/site-packages/dask/base.py:152: in tokenize
return md5(str(tuple(map(normalize_token, args))).encode()).hexdigest()
../core/indexing.py:301: in __repr__
(type(self).__name__, self.array, self.key))
../core/utils.py:377: in __repr__
return '%s(array=%r)' % (type(self).__name__, self.array)
../core/indexing.py:301: in __repr__
(type(self).__name__, self.array, self.key))
../core/utils.py:377: in __repr__
return '%s(array=%r)' % (type(self).__name__, self.array)
netCDF4/_netCDF4.pyx:2931: in netCDF4._netCDF4.Variable.__repr__ (netCDF4/_netCDF4.c:25068)
???
netCDF4/_netCDF4.pyx:2938: in netCDF4._netCDF4.Variable.__unicode__ (netCDF4/_netCDF4.c:25243)
???
netCDF4/_netCDF4.pyx:3059: in netCDF4._netCDF4.Variable.dimensions.__get__ (netCDF4/_netCDF4.c:27486)
???
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
> ???
E RuntimeError: NetCDF: Not a valid ID
netCDF4/_netCDF4.pyx:2994: RuntimeError
============================================== 1 failed in 0.50 seconds =============================================== |
@cpaulik I wonder if the issue is this section in your __getitem__:

    data = getitem(self.array, key)
    try:
        self.store.ensure_open()
        data = getitem(self.array, key)
    except RuntimeError as e:
        raise e
        pass
    if self.ndim == 0:
        # work around for netCDF4-python's broken handling of 0-d
        # arrays (slicing them always returns a 1-dimensional array):
        # https://github.com/Unidata/netcdf4-python/pull/220
        data = np.asscalar(data)
    self.store.close()
    return data

I would put ... Actually, you probably want to put this in a context manager that automatically closes the file, something like:

    with self.store.opened():
        data = getitem(self.array, key) |
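For readers following along, such an opened() context manager might look roughly like this. It is only a sketch: ensure_open, close, and opened are the names used in this discussion, not an existing xray API:

```python
import contextlib

import netCDF4


class ClosingNetCDF4Store(object):
    """Sketch of a store that keeps its file closed except while reading."""

    def __init__(self, filename, mode='r'):
        self._filename = filename
        self._mode = mode
        self._isopen = False
        self.ds = None  # the underlying netCDF4.Dataset

    def ensure_open(self):
        # Reopen the underlying netCDF4.Dataset if it is not currently open.
        if not self._isopen:
            self.ds = netCDF4.Dataset(self._filename, mode=self._mode)
            self._isopen = True

    def close(self):
        if self._isopen:
            self.ds.close()
            self._isopen = False

    @contextlib.contextmanager
    def opened(self):
        # Hold the file open only for the duration of a single access.
        self.ensure_open()
        try:
            yield self.ds
        finally:
            self.close()
```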
I've only put the try/except there to conditionally set the breakpoint. How does it make a difference whether self.store.close is called? If it is not called, then the dataset remains open, which should not cause the weird behaviour reported above? Nevertheless, I have updated my branch to use a contextmanager because it is a better solution, but I still have this strange behaviour where merely printing the variable alters the test outcome. |
OK, so the problem is that |
OK, I'll try. Thanks. But I originally tested whether netCDF4 can work with a closed/reopened variable like this:

In [1]: import netCDF4
In [2]: a = netCDF4.Dataset("temp.nc", mode="w")
In [3]: a.createDimension("lon")
Out[3]: <class 'netCDF4._netCDF4.Dimension'> (unlimited): name = 'lon', size = 0
In [4]: a.createVariable("lon", "f8", dimensions=("lon"))
Out[4]:
<class 'netCDF4._netCDF4.Variable'>
float64 lon(lon)
unlimited dimensions: lon
current shape = (0,)
filling on, default _FillValue of 9.969209968386869e+36 used
In [5]: v = a.variables['lon']
In [6]: v
Out[6]:
<class 'netCDF4._netCDF4.Variable'>
float64 lon(lon)
unlimited dimensions: lon
current shape = (0,)
filling on, default _FillValue of 9.969209968386869e+36 used
In [7]: a.close()
In [8]: v
Out[8]: ---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
/home/cp/.pyenv/versions/miniconda3-3.16.0/envs/xray-3.5.0/lib/python3.5/site-packages/IPython/core/formatters.py in __call__(self, obj)
695 type_pprinters=self.type_printers,
696 deferred_pprinters=self.deferred_printers)
--> 697 printer.pretty(obj)
698 printer.flush()
699 return stream.getvalue()
/home/cp/.pyenv/versions/miniconda3-3.16.0/envs/xray-3.5.0/lib/python3.5/site-packages/IPython/lib/pretty.py in pretty(self, obj)
381 if callable(meth):
382 return meth(obj, self, cycle)
--> 383 return _default_pprint(obj, self, cycle)
384 finally:
385 self.end_group()
/home/cp/.pyenv/versions/miniconda3-3.16.0/envs/xray-3.5.0/lib/python3.5/site-packages/IPython/lib/pretty.py in _default_pprint(obj, p, cycle)
501 if _safe_getattr(klass, '__repr__', None) not in _baseclass_reprs:
502 # A user-provided repr. Find newlines and replace them with p.break_()
--> 503 _repr_pprint(obj, p, cycle)
504 return
505 p.begin_group(1, '<')
/home/cp/.pyenv/versions/miniconda3-3.16.0/envs/xray-3.5.0/lib/python3.5/site-packages/IPython/lib/pretty.py in _repr_pprint(obj, p, cycle)
683 """A pprint that just redirects to the normal repr function."""
684 # Find newlines and replace them with p.break_()
--> 685 output = repr(obj)
686 for idx,output_line in enumerate(output.splitlines()):
687 if idx:
netCDF4/_netCDF4.pyx in netCDF4._netCDF4.Variable.__repr__ (netCDF4/_netCDF4.c:25045)()
netCDF4/_netCDF4.pyx in netCDF4._netCDF4.Variable.__unicode__ (netCDF4/_netCDF4.c:25243)()
netCDF4/_netCDF4.pyx in netCDF4._netCDF4.Variable.dimensions.__get__ (netCDF4/_netCDF4.c:27486)()
netCDF4/_netCDF4.pyx in netCDF4._netCDF4.Variable._getdims (netCDF4/_netCDF4.c:26297)()
RuntimeError: NetCDF: Not a valid ID
In [9]: a = netCDF4.Dataset("temp.nc")
In [10]: v
Out[10]:
<class 'netCDF4._netCDF4.Variable'>
float64 lon(lon)
unlimited dimensions: lon
current shape = (0,)
filling on, default _FillValue of 9.969209968386869e+36 used |
OK, I think you could also just add an ensure_open() call to the __repr__() method. Right now that class is inheriting it from NDArrayMixin. |
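Roughly, the override being suggested would look like the sketch below, assuming a wrapper class that inherits NDArrayMixin from xray.core.utils and holds a reference to both the array and its store (the class and attribute names here are hypothetical):

```python
from xray.core.utils import NDArrayMixin


class NetCDF4ArrayWrapper(NDArrayMixin):
    """Hypothetical array wrapper that knows about its backing store."""

    def __init__(self, array, store):
        self.array = array  # the underlying netCDF4.Variable
        self.store = store  # the data store that owns the file handle

    def __repr__(self):
        # Reopen the underlying file before delegating to the inherited repr,
        # so netCDF4 can read dimension metadata instead of raising
        # "RuntimeError: NetCDF: Not a valid ID" on a closed file.
        self.store.ensure_open()
        return super(NetCDF4ArrayWrapper, self).__repr__()
```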
I'm also running into this error - but strangely it only happens when using the IPython interactive backend. I have some tests which work fine, but doing the same in IPython fails. I'm opening a few hundred files (about 10 MB each, one per month across a few variables). I'm using the default netCDF backend. |
I suspect you hit this in IPython after rerunning cells, because file
|
It seems to happen even with a freshly restarted notebook, but I'll try a
|
I still hit this issue after wrapping my open_mfdataset in a with statement. I suspect it is an OSX problem: macOS has a very low default max-open-files limit for applications started from the shell (like 256). It's not yet clear to me whether my datasets are being correctly closed, investigating... |
So on investigation, even though my dataset creation is wrapped in a |
@mangecoeur I can take a look. Can you share an example of how you use the |
@shoyer thanks - here's how I'm using mfdataset - not using any options. I'm going to try using the ...

    def weather_dataset(root_path: Path, *, start_date: datetime = None, end_date: datetime = None):
        flat_files_paths = get_dset_file_paths(root_path, start_date=start_date, end_date=end_date)
        # Convert Paths to list of strings for xarray
        dataset = xr.open_mfdataset([str(f) for f in flat_files_paths])
        return dataset

    def cfsr_weather_loader(db, site_lookup_fn=None, dset_start=None, dset_end=None, site_conf=None):
        # Pull values out of the
        dt_conf = site_conf if site_conf else WEATHER_CFSR
        dset_start = dset_start if dset_start else dt_conf['start_dt']
        dset_end = dset_end if dset_end else dt_conf['end_dt']

        if site_lookup_fn is None:
            site_lookup_fn = site_lookup_postcode_district

        def weather_loader(site_id, start_date, end_date, resample=None):
            # using the tuple because always getting mixed up with lon/lat
            geo_lookup = site_lookup_fn(site_id, db)
            # With statement should ensure dset is closed after loading.
            with weather_dataset(WEATHER_CFSR['path'],
                                 start_date=dset_start,
                                 end_date=dset_end) as weather:
                data = weighted_regional_timeseries(weather, start_date, end_date,
                                                    lon=geo_lookup.lon,
                                                    lat=geo_lookup.lat,
                                                    weights=geo_lookup.weights)
            # RENAME from CFSR standard
            data = data.rename(columns=WEATHER_RENAME)
            if resample is not None:
                data = data.resample(resample).mean()
            data.irradiance /= 1000.0  # convert irradiance to kW
            return data

        return weather_loader
|
So using a cleaner minimal example it does appear that the files are closed after the dataset is closed. However, they are all open during dataset loading - this is what blows past the OSX default max open file limit. I think this could be a real issue when using Xarray to handle too-big-for-ram datasets - you could easily be trying to access 1000s of files (especially with weather data), so Xarray should limit the number it holds open at any one time during data load. Not being familiar with the internals I'm not sure if this is an issue in Xarray itself or in the Dask backend. |
@mangecoeur, although it's not an xarray-based solution, I've found that by far the best solution to this problem is to transform your dataset from the "timeslice" format (which is convenient for models to write out - all the data at a given point in time, often in separate files for each time step) to "timeseries" format - a continuous format, where you have all the data for a single variable in a single (or much smaller collection of) files. NCAR published a great utility for converting batches of NetCDF output from timeslice to timeseries format here; it's significantly faster than any shell-script/CDO/NCO solution I've ever encountered, and it parallelizes extremely easily. Adding a simple post-processing step to convert my simulation output to timeseries format dramatically reduced my overall work time. Before, I had a separate handler which re-implemented open_mfdataset(), performed an intermediate reduction (usually extracting a variable), and then concatenated within xarray. This could get around the open file limit, but it wasn't fast. My pre-processed data is often still big - barely fitting within memory - but it's far easier to handle, and you can throw dask at it no problem to get huge speedups in analysis. |
We (+ @milenaveneziani and @xylar) are running into this issue again. Ideally, this should be resolved; after following up with everyone on strategy, I may have another look at this issue if it sounds straightforward to fix. @shoyer and @mrocklin, if I understand correctly, incorporation of the LRU cache could help with this problem assuming time series were sliced into small chunks for access, correct? We would still run into problems, however, if there were say 10^6 files and we wanted to get a time-series spanning these files, right? If so, we may need a more robust solution than just the LRU cache. In the short term, PyReshaper may provide a temporary solution for us. cc @kmpaul to provide some perspective here too regarding use of https://github.com/NCAR/PyReshaper. |
The LRU cache solution proposed in #798 would work in either case. It just would have poor performance when accessing a small piece of each of 10^6 files, both to build the graph (because xarray needs to open each file to read the metadata) and to do the actual computation (again, because of the need to open so many files). If you only need a small amount of data from many files, you probably want to reshape your data to minimize the amount of necessary file access no matter what, whether you do that reshaping with PyReshaper or xarray/dask.array/dask-distributed. |
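For context, the heart of the LRU-cache approach in #798 is something along these lines: cap the number of simultaneously open handles and transparently reopen files that have been evicted. This is a simplified sketch, not the actual implementation:

```python
import collections

import netCDF4


class FileCache(object):
    """Keep at most `maxsize` netCDF files open, reopening on demand."""

    def __init__(self, maxsize=128):
        self.maxsize = maxsize
        self._cache = collections.OrderedDict()  # filename -> open Dataset

    def acquire(self, filename):
        if filename in self._cache:
            # Already open: move it to the most-recently-used end.
            ds = self._cache.pop(filename)
        else:
            # Evict least-recently-used files before opening a new one.
            while len(self._cache) >= self.maxsize:
                _, old = self._cache.popitem(last=False)
                old.close()
            ds = netCDF4.Dataset(filename, mode='r')
        self._cache[filename] = ds
        return ds


# Usage sketch: every read goes through the cache instead of holding files open.
# cache = FileCache(maxsize=128)
# data = cache.acquire(path).variables['u'][0, :, :]
```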
@shoyer is it ever feasible to read the first NetCDF file in a sequence and assume that they are all the same except to increment a datetime dimension by increasing days? |
Sorry for the delay... I saw the reference and then needed to find some time to read back over the issues to get some context. You are correct. The PyReshaper was designed to address this type of problem, though not exactly the issue with xarray and dask. It's a pretty common problem, and it's the reason that the CESM developers are moving to long-term archival of time-series files ONLY. (In other words, PyReshaper is being incorporated into the automated CESM run-processes.) ...Of course, one could argue that this step shouldn't be necessary with some clever I/O in the models themselves to write time-series directly. The PyReshaper opens and closes each time-slice file explicitly before and after each read, respectively. And, if fully scaled (i.e., 1 MPI process per output file), you only ever have 2 files open at a time per process. In this particular operation, the overhead associated with open/close on the input files is negligible compared to the total R/W times. So, anyway, the PyReshaper (https://github.com/NCAR/PyReshaper) can definitely help... though I consider it a stop-gap for the moment. I'm happy to help people figure out how to get it to work for your problems, if that's a path you want to consider. |
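The open-read-close pattern described above is also easy to emulate directly. Below is a rough single-process sketch of slice-to-series conversion with plain netCDF4; the file layout, dimension sizes, and variable names are invented for illustration:

```python
import glob

import netCDF4

slice_files = sorted(glob.glob('output/slice_*.nc'))  # hypothetical daily slice files

# Only the current input file and the single output file are ever open at once.
out = netCDF4.Dataset('u_timeseries.nc', 'w')
out.createDimension('time', None)   # unlimited record dimension
out.createDimension('x', 100)       # assumed spatial size
u_out = out.createVariable('u', 'f8', ('time', 'x'))

for i, path in enumerate(slice_files):
    nc = netCDF4.Dataset(path)           # open one time-slice file...
    u_out[i, :] = nc.variables['u'][0, :]
    nc.close()                           # ...and close it before moving on

out.close()
```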
Sure. This should probably be a different wrapper function than ...

@kmpaul thanks for sharing! This is useful background. There is at least one other option worth considering. Instead of using the open file LRU cache, a simpler option could be to add an optional argument to xarray backends (building on |
@shoyer, you probably have the very best feel for what the most efficacious solution is to this problem in terms of fixing the issue, performance, longer utility, etc. Is there any clear winner from the following potentially non-exhaustive options?
My current analysis: I could see our team using PyReshaper because our data output format already has inertia but this adds complexity to a workflow that intuitively should be handled inside xarray. However, I think we want to get around the file number limitation eventually because it is an issue that multiple groups keep bringing up. This is perhaps the simplest solution but it is specific to our uses and not necessarily general. Towards a general solution, we would intuitively have a fixed cost performance penalty for the |
@pwolfram NcML is just an XML specification for how variables in a set of NetCDF files can be combined into a single virtual NetCDF file. This would be useful because it would allow building a version of I suspect that even the LRU cache approach would build on |
I just realized I didn't say thank you to @shoyer et al for the advice and help. Please forgive my rudeness. |
Yes, exactly. I plan to merge that PR very shortly, after a few fixes for the failing tests on Windows (less than an hour of work). |
Not sure this is good feedback at all but I just wanted to provide an additional problematic case, from my end, that is returning this "too many files" problem: NOTE: I have the latest xarray package. |
@ajoros, can you try something like ...? @shoyer should correct me if I'm wrong, but we are almost ready to merge the code in this PR, and this would be a great "in the field" check if you could try it out soon. |
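For anyone who wants to try it, the suggestion amounts to something like the following, assuming a version of xarray that includes the autoclose option from that PR:

```python
import xarray as xr

# With autoclose=True, each underlying file is closed after it is read and
# transparently reopened on the next access, so only a handful of files are
# ever open at the same time.
ds = xr.open_mfdataset('data/*.nc', autoclose=True)
```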
OK, I'm closing this issue as "Fixed" by #1198. Feel free to open a new issue for any follow-up concerns. |
Thanks @pwolfram ... shot you a follow up email at your Gmail... |
@ajoros should correct me if I'm wrong but it sounds like everything is working for his use case. |
Yessir @pwolfram, we are in business! |
@shoyer I just ran into this issue again (with 8000 files, each 50 kB). I'm using xarray 0.9.6 and working on some performance tests. Is there any upper limit on the number of files?
|
Ok, I found my problem. I had to increase ulimit -n. |
Using autoclose=True should also fix this. |
Thanks, I'll test it! |
I am very excited to try xray.
On my first attempt, I tried to use open_mfdataset on a set of ~8000 netcdf files. I hit a "RuntimeError: Too many open files". The ulimit on my system is 1024, so clearly that is the source of the error.
I am curious whether this is the desired behavior for open_mfdataset. Does xray have to keep all the files open? If so, I will work with my sysadmin to increase the ulimit.
It seems like the whole point of this function is to work with large collections of files, so this could be a significant limitation.