Consolidating all tasks that write to a file on a single worker #2163
Actors are experimental and may be removed at any time. I don't recommend that XArray depend on them. They're also advanced technology, and probably bring along problems that are hard to foresee. That being said, yes, they would be a possible solution here to manage otherwise uncomfortable state. I wonder how much of this problem could be removed by consolidating data beforehand into a single task or into a chain of dependent tasks? Are files going to be much larger than an individual task? Would creating artificial dependencies between tasks help in some way? Do you have more information about what is wrong with distributed locking? This approach seems simplest to me if it is cheap.
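For instance, a minimal sketch of the "chain of dependent tasks" idea; the file name, variable names, and the assumption that `results.nc` already exists with those variables defined are illustrative, not from this thread:

```python
import dask
import netCDF4
import numpy as np


@dask.delayed
def write_variable(filename, varname, data, previous=None):
    # `previous` carries no data; it exists only to create an artificial
    # dependency so the scheduler runs these writes strictly one at a time.
    with netCDF4.Dataset(filename, "a") as nc:  # file and variables assumed to exist
        nc.variables[varname][:] = data
    return filename


# Chain the writes: each task depends on the previous one, so no two tasks
# ever have the file open at the same time and no lock is needed.
arrays = {"temperature": np.zeros((10, 10)), "pressure": np.ones((10, 10))}
task = None
for name, data in arrays.items():
    task = write_variable("results.nc", name, data, previous=task)

task.compute()  # or client.compute(task) on a cluster
```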
> Are files going to be much larger than an individual task?

Often, yes. It's pretty common to encounter netCDF files consisting of 1-20 arrays with total size in the 200MB-10GB range. This is solidly in the "medium data" range where streaming computation is valuable. We could encourage changing best practices to write smaller files, but users will be surprised/disappointed if switching to dask-distributed suddenly means they can't write netCDF files that don't fit in memory on a single node. This probably does make sense when using netCDF backends like scipy that don't (yet?) support writes without loading the entire file into memory.
> Would creating artificial dependencies between tasks help in some way?

Yes, I think this could also work nicely, at least to resolve any need for locking. The downside is that we would need a priori knowledge of the proper task ordering to handle streaming computation use cases.
> Do you have more information about what is wrong with distributed locking? This approach seems simplest to me if it is cheap.

I'm pretty sure that with more futzing/refactoring I could get locks and reopening files for every operation working. The overhead could be minimized with appropriate (automatic?) rechunking. Maybe this is the better way to go.
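For concreteness, a rough sketch of that lock-and-reopen pattern on a `dask.distributed` cluster; the file layout is illustrative and the target file and variable are assumed to already exist:

```python
from dask.distributed import Client, Lock
import netCDF4
import numpy as np

client = Client()  # connect to (or start) a cluster


def write_region(filename, varname, start, stop, data):
    # One named lock per file: across all workers, only one task may have
    # the file open for writing at any moment.
    with Lock(filename):
        with netCDF4.Dataset(filename, "a") as nc:
            nc.variables[varname][start:stop] = data
    return start, stop


# The target file and a "temperature" variable of length 1000 are assumed
# to exist already; each chunk becomes its own task that reopens the file.
chunks = [(start, np.random.rand(100)) for start in range(0, 1000, 100)]
futures = [
    client.submit(write_region, "results.nc", "temperature", start, start + 100, data)
    for start, data in chunks
]
client.gather(futures)
```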
We could also write to something else that was more concurrency-friendly and then have a final task that copied/merged things over to a single NetCDF file. This would double/triple our I/O costs but would remove dask complications.
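A sketch of that write-then-merge approach, here writing one temporary netCDF file per slice and merging with xarray at the end; the synthetic dataset and file names are purely illustrative:

```python
import numpy as np
import xarray as xr

# Synthetic stand-in for a larger dataset, chunked along time.
ds = xr.Dataset({"air": ("time", np.random.rand(2000))}).chunk({"time": 500})

# Write each slice to its own file; these writes are fully independent, so
# they need no shared file handles and no locks.
slices = [slice(i, i + 500) for i in range(0, 2000, 500)]
pieces = [ds.isel(time=s) for s in slices]
paths = [f"piece-{i}.nc" for i in range(len(pieces))]
xr.save_mfdataset(pieces, paths)

# A final task copies/merges everything into the single target file,
# roughly doubling the I/O as noted above.
xr.open_mfdataset(paths, combine="nested", concat_dim="time").to_netcdf("final.nc")
```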
If you do go the copying direction, this may be helpful.
Other things to consider: I had been playing with the idea of sending the graph over to a worker to run (dask/dask#3275). Maybe something with…
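A sketch of that idea: submit one function to the cluster that performs the entire write with a local scheduler, so every task that touches the file runs inside a single worker. The dataset built inside the function is a stand-in for whatever would really be written:

```python
import dask
import numpy as np
import xarray as xr
from dask.distributed import Client

client = Client()


def write_everything(path):
    # Runs entirely on one worker: build (or receive) the dataset and write
    # it with the local threaded scheduler, so no open file handles or locks
    # ever cross process boundaries.
    ds = xr.Dataset({"air": ("time", np.random.rand(2000))}).chunk({"time": 500})
    with dask.config.set(scheduler="threads"):
        ds.to_netcdf(path)
    return path


future = client.submit(write_everything, "results.nc")
future.result()
```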
A common pattern when using xarray with dask is to have a large number of tasks writing to a smaller number of files, e.g., an `xarray.Dataset` consisting of a handful of dask arrays gets stored into a single netCDF file. This works pretty well with the non-distributed version of dask, but doing it with dask-distributed presents challenges.
It would be nice if we could simply consolidate all tasks that involve writing to a single file onto a single worker. This would avoid the need to reopen files, pass open files between processes, or worry about distributed locks.
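One way to approximate this with existing machinery might be the `workers=` restriction on `Client.compute`, sketched below; note that this pins the whole delayed graph, not just the write tasks, to the chosen worker (file name and dataset are illustrative):

```python
from dask.distributed import Client
import numpy as np
import xarray as xr

client = Client()

ds = xr.Dataset({"air": ("time", np.random.rand(2000))}).chunk({"time": 500})

# Build the write as a delayed object instead of executing it immediately...
delayed_write = ds.to_netcdf("results.nc", compute=False)

# ...then restrict it to a single worker, so every task that touches the file
# runs in one process. Caveat: upstream compute tasks get pinned there too.
worker_addr = sorted(client.scheduler_info()["workers"])[0]
future = client.compute(delayed_write, workers=[worker_addr])
future.result()
```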
Does dask-distributed have any sort of existing machinery that would facilitate this? In particular, I wonder if this could be a good use-case for actors (#2133).
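For reference, a minimal sketch of what an actor-based writer could look like, with a hypothetical `NetCDFWriter` class holding the open file handle on one worker; actors are experimental, as noted above, and the file is assumed to already exist:

```python
from dask.distributed import Client
import netCDF4

client = Client()


class NetCDFWriter:
    """Hypothetical actor: owns the open file, so every write goes through
    one object that lives on a single worker."""

    def __init__(self, path):
        self.nc = netCDF4.Dataset(path, "a")  # file assumed to already exist

    def write(self, varname, start, stop, data):
        self.nc.variables[varname][start:stop] = data
        self.nc.sync()
        return start, stop

    def close(self):
        self.nc.close()


# actor=True pins the object (and its open file handle) to one worker;
# other tasks interact with it through proxy method calls.
writer = client.submit(NetCDFWriter, "results.nc", actor=True).result()
writer.write("temperature", 0, 100, [0.0] * 100).result()
writer.close().result()
```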