-
-
Notifications
You must be signed in to change notification settings - Fork 1.1k
Automatic Dask-Zarr chunk alignment #10336
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Automatic Dask-Zarr chunk alignment #10336
Conversation
for more information, see https://pre-commit.ci
… the logic used on the set_variables method, adding the docs of the align_chunks parameter
…t' into feature/automatic-chunk-alignment # Conflicts: # xarray/backends/common.py # xarray/tests/test_backends.py
for more information, see https://pre-commit.ci
…t' into feature/automatic-chunk-alignment
for more information, see https://pre-commit.ci
for more information, see https://pre-commit.ci
…plicated test added by mistake on test_backends.py and also make mypy happy again
for more information, see https://pre-commit.ci
I think this is ready for a first review @max-sixty , when you have the time, please take a look. I did multiple modifications to try to clean the code a little. If you don't like the terminology used on the functions and the names of the parameters, I'm more than open to changing them to whatever adapts better to the Xarray standard. |
for more information, see https://pre-commit.ci
…chunks to align_nd_chunks to keep it consistent with the terminology used, make mypy happy changing the new_shape to a tuple, added some simple tests for the other functions inside the chunks module
…t' into feature/automatic-chunk-alignment
Hi @max-sixty, I will have some free time tomorrow in case that there is something that you want that I change on this PR, so if you have the time to review it I would be grateful. |
Hi @josephnowak ! sorry again for being somewhat absent. thank you again for your commitment here! I haven't thought of a good enough design where we either always align chunks or never align chunks. I've been keen to restrict the cardinality of the API, which has become quite large. but I'm not sure there's a way, so adding additional the additional arg here seems reasonable. one thing that I think would be helpful for future development, if you do have any time, is to put two really clear cases in the tests: one where aligning chunks is important, and another where not aligning chunks is important thank you again! I marked as merge. if anyone else has thoughts please do say, otherwise let's go ahead! |
Hi, don't worry for it. Yeah, I agree that the cardinality of the API has become quite large, probably some of the functionalities could be moved to Zarr as utilities functions to reduce the amount of code on Xarray. Related to the logic of always aligning or not, I think we can take the opportunity of having both parameters to take a decision based on the users of Xarray, at least from my perspective the best would be to always align to avoid any possible data corruption, but I'm probably biased because I used Xarray only for the Finance which is not the most common use case I think. For test that you are mentioning, I'm not sure on what scenario not aligning chunks is important, the only cases that I have in mind would be the following:
If you want I can add a test for the cases that I mentioned or if you have any other idea in mind for that test please tell me and I will try to implement it. |
thanks, and then I'm trying to ensure this is consistent with:
can we be explicit on what we're aligning to? if we're aligning the dask chunks to the zarr chunks, then this seems safe for zarr...? |
Do you think that this definition would be more explicit? """ If True, rechunks the Dask array to align with Zarr chunks before writing. This ensures each Dask chunk maps to one or more contiguous Zarr chunks (but not vice-versa), which avoids race conditions. Internally, the process sets safe_chunks=False and tries to preserve the original Dask chunking as much as possible. Note: While this alignment avoids write conflicts stemming from chunk boundary misalignment, it does not protect against race conditions if multiple uncoordinated processes write to the same Zarr array concurrently. |
I think that's great!! very clear (maybe cut the |
I have updated the doc of the align_chunks parameter, but now the tests are failing but they looks unrelated to this PR, apparently is some kind of limit in the amount of requests |
great, thank you very much @josephnowak hopefully at some point we can find a way of simplifying this; possibly always aligning. if you see an opportunity for a code change like this in the future, very open to this... |
chunks
file within the backends module, consolidating all chunk-related functionalities. This should benefit all backends that handle chunked data.align_chunks
parameter to theto_zarr
methods, which, if set to True, will rechunk the data to align it with the Zarr chunks.test_backends_chunks
file to prevent excessive test accumulation in a single file—Xarray was causing my IDE (PyCharm) to hang.I’d appreciate any feedback or suggestions regarding these concerns!
whats-new.rst
api.rst