-
Notifications
You must be signed in to change notification settings - Fork 15
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add callback functionality for dask-delayed dataset saving #168
Conversation
Added a test for functionality of a callback to be executed upon every completed delayed writing.
Added functionality for a callback function that is called as soon as the computation is completed.
Fix documentation for call_on_done and add example to the example playlist.
With thanks to @martindurant for the StackOverflow answer at https://stackoverflow.com/a/74354842/974555 pointing out how to do this :) |
Codecov Report
@@ Coverage Diff @@
## main #168 +/- ##
==========================================
+ Coverage 95.56% 95.73% +0.16%
==========================================
Files 13 13
Lines 2821 2954 +133
==========================================
+ Hits 2696 2828 +132
- Misses 125 126 +1
Flags with carried forward coverage won't be shown. Click here to find out more.
|
Putting on draft as I'm trying to figure out a good way for the callback function to get the information it may need. |
For the call-on-done functionality, allow multiple callbacks to be called each in turn.
When having mulitp.le call on dones, don't call save_dataset multiple times
Add stronger test to confirm writing was successful. Turns out it wasn't.
Works wonderfully in the unit tests, but in the real world, the files to be written are empty when the callbacks are called, at least for GeoTIFF. And the trollflow2 launcher complains about missing files. |
It works for some writers but not others. Example: import os.path
import xarray as xr
import dask.array as da
from dask import delayed
from pyresample import create_area_def
from satpy.tests.utils import make_fake_scene
fake_area = create_area_def("sargasso", 4326, resolution=1, width=5, height=5, center=(0, 0))
fake_scene = make_fake_scene(
{"dragon_top_height": (dat := xr.DataArray(
dims=("y", "x"),
data=da.arange(25).reshape((5, 5)))),
"penguin_bottom_height": dat,
"kraken_depth": dat},
daskify=True,
area=fake_area)
def on_done(result, fn):
print("Size for", fn, "is", os.path.getsize(fn))
return result
objs = []
for i in range(5):
fn = f"/tmp/test{i:d}.tif"
obj = fake_scene.save_dataset("dragon_top_height", filename=fn,
compute=False)
objs.append(delayed(on_done)(obj, fn))
print(objs)
da.compute(objs) For
For
For
|
The difference is that for the NetCDF and simple_image writers, the result of |
I also tried to replace
with
but the problem remains, when |
When I add targ.close() it results in
which corresponds to only the header |
When you run this code that you tried:
Can you add a print to |
The value of
Right. Example without any pytroll components: from dask import delayed
import dask.array as da
import os.path
import os
import h5py
ds = da.arange(100).reshape(10, 10)
out = "/tmp/myfile.h5"
os.unlink(out)
f = h5py.File(out, mode='a')
dset = f.create_dataset("data", ds.shape, dtype='f8')
def on_done(result, fn):
print("Size for", fn, "is now", os.path.getsize(fn))
return result
x = da.store(ds, dset, compute=False)
da.compute(delayed(on_done)(x, out))
print("Size after computing for", out, "is", os.path.getsize(out))
f.close()
print("Final size for", out, "is", os.path.getsize(out)) which results in
So the issue here appears to be that computing does not result in the file being flushed/closed. I guess for the other writers, the file gets closed before computation is finished. We might need to put a |
Add a callback that closes the file. This should be used before logging or moving.
I've added a callback that does a close (there's no flush exposed) and it seems to work for GeoTIFF now, at least the unit tests pass. Will do a live test later today. |
Passes unit tests, but fails the field test with
inspection reveals that
meaning that rather than target being a single |
A quick test shows that for this case, everything seems to work as expected if I loop through |
I mentioned the reasoning for the multi-element target on slack (tiff tags using dask arrays) and that we should maybe add a wrapper object in XRImage. For the |
Make callback_close more flexible, such that it can handle the case where there are dask-dependent attributes, that must also be closed.
Fix documentation for save_datasets: the callbacks pass four, not five, parameters.
I guess it makes sense that for large area the RAM usage is lower when the fully used data can be released and cleaned. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just couple of comments, otherwise LGTM.
@gerritholl I made an attempt at fixing the merge conflicts |
I fixed the two logger names that were not present in my logging update already merged in |
Add a test for a callback when the writer returns a delayed-object, and fix a bug exposed by this test. Avoid duplicate documentation, just refer from one place to the other.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This PR looks sound in the logic. However it adds quite some complexity in the code, which I think calls for being extra careful on code cleanliness. Herein, some comments that could help with that.
Rename the private functions for applying to either single or multiple sources and targets.
Change the logic for the conditional context maneger, by using a nullcontext if we are not renaming files.
I fixed the couple merge conflicts. At least I hope I did. |
Fixing merge conflicts.
Use None rather than object() as a sentinel value.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thanks for adding this feature!
Add functionality for a callback function that is called as soon as a file has been written. This could be used to ship out written files as soon as they are done, rather than waiting for all files to be complete.
flake8 trollflow2