Skip to content

Automatic Dask-Zarr chunk alignment #10336

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 21 commits into from
Jun 5, 2025

Conversation

josephnowak
Copy link
Contributor

@josephnowak josephnowak commented May 19, 2025

  1. Introduced a dedicated chunks file within the backends module, consolidating all chunk-related functionalities. This should benefit all backends that handle chunked data.
  2. Moved chunk alignment validation outside the function responsible for generating Zarr array encoding, improving code clarity.
  3. Added the align_chunks parameter to the to_zarr methods, which, if set to True, will rechunk the data to align it with the Zarr chunks.
  4. Created a separate test_backends_chunks file to prevent excessive test accumulation in a single file—Xarray was causing my IDE (PyCharm) to hang.
  5. Improved the error message for the safe_chunks, now it raise information about what chunk is unaligned.
  6. I fixed a mypy issue related to the use of a list instead of a tuple when using the resize method of Zarr inside the set_variables method, it also should be faster (micro optimization) when the number of dimensions is smaller than 100, which should always be the case.

I’d appreciate any feedback or suggestions regarding these concerns!

@josephnowak josephnowak marked this pull request as draft May 19, 2025 17:57
@josephnowak josephnowak changed the title Automatic chunk alignment Automatic Dask-Zarr chunk alignment May 19, 2025
josephnowak and others added 10 commits May 21, 2025 11:33
… the logic used on the set_variables method, adding the docs of the align_chunks parameter
…t' into feature/automatic-chunk-alignment

# Conflicts:
#	xarray/backends/common.py
#	xarray/tests/test_backends.py
…plicated test added by mistake on test_backends.py and also make mypy happy again
@josephnowak josephnowak marked this pull request as ready for review May 21, 2025 13:17
@josephnowak
Copy link
Contributor Author

josephnowak commented May 21, 2025

I think this is ready for a first review @max-sixty , when you have the time, please take a look. I did multiple modifications to try to clean the code a little. If you don't like the terminology used on the functions and the names of the parameters, I'm more than open to changing them to whatever adapts better to the Xarray standard.

@josephnowak
Copy link
Contributor Author

Hi @max-sixty, I will have some free time tomorrow in case that there is something that you want that I change on this PR, so if you have the time to review it I would be grateful.

@max-sixty max-sixty added the plan to merge Final call for comments label May 31, 2025
@max-sixty
Copy link
Collaborator

Hi @josephnowak ! sorry again for being somewhat absent. thank you again for your commitment here!

I haven't thought of a good enough design where we either always align chunks or never align chunks. I've been keen to restrict the cardinality of the API, which has become quite large. but I'm not sure there's a way, so adding additional the additional arg here seems reasonable.

one thing that I think would be helpful for future development, if you do have any time, is to put two really clear cases in the tests: one where aligning chunks is important, and another where not aligning chunks is important

thank you again! I marked as merge. if anyone else has thoughts please do say, otherwise let's go ahead!

@josephnowak
Copy link
Contributor Author

josephnowak commented May 31, 2025

Hi, don't worry for it.

Yeah, I agree that the cardinality of the API has become quite large, probably some of the functionalities could be moved to Zarr as utilities functions to reduce the amount of code on Xarray.

Related to the logic of always aligning or not, I think we can take the opportunity of having both parameters to take a decision based on the users of Xarray, at least from my perspective the best would be to always align to avoid any possible data corruption, but I'm probably biased because I used Xarray only for the Finance which is not the most common use case I think.

For test that you are mentioning, I'm not sure on what scenario not aligning chunks is important, the only cases that I have in mind would be the following:

  1. You are using a synchronizer and you would like to avoid any modification of your array.
  2. Validate that you are not writing on partial chunks, this probably can mitigate issues when multiple uncoordinated process are writing on the same Zarr array and you would like to prevent that they overlap (or at least reduce the probability).

If you want I can add a test for the cases that I mentioned or if you have any other idea in mind for that test please tell me and I will try to implement it.

@max-sixty
Copy link
Collaborator

  • You are using a synchronizer and you would like to avoid any modification of your array.

  • Validate that you are not writing on partial chunks, this probably can mitigate issues when multiple uncoordinated process are writing on the same Zarr array and you would like to prevent that they overlap (or at least reduce the probability).

thanks, and then I'm trying to ensure this is consistent with:

            If True, the data will be rechunked before being written to the zarr store to
            prevent data corruption caused by the overlap of Dask and Zarr chunks.

can we be explicit on what we're aligning to? if we're aligning the dask chunks to the zarr chunks, then this seems safe for zarr...?

@josephnowak
Copy link
Contributor Author

josephnowak commented Jun 3, 2025

Do you think that this definition would be more explicit?

"""
align_chunks: bool, default False

If True, rechunks the Dask array to align with Zarr chunks before writing. This ensures each Dask chunk maps to one or more contiguous Zarr chunks (but not vice-versa), which avoids race conditions. Internally, the process sets safe_chunks=False and tries to preserve the original Dask chunking as much as possible. Note: While this alignment avoids write conflicts stemming from chunk boundary misalignment, it does not protect against race conditions if multiple uncoordinated processes write to the same Zarr array concurrently.
"""
Not sure if it would be good to mention Icechunk as a solution for the uncoordinated processes writes

@max-sixty
Copy link
Collaborator

I think that's great!! very clear

(maybe cut the (but not vice-versa), that's very explicit with the "one or more")

@josephnowak
Copy link
Contributor Author

I have updated the doc of the align_chunks parameter, but now the tests are failing but they looks unrelated to this PR, apparently is some kind of limit in the amount of requests

@max-sixty
Copy link
Collaborator

great, thank you very much @josephnowak

hopefully at some point we can find a way of simplifying this; possibly always aligning. if you see an opportunity for a code change like this in the future, very open to this...

@max-sixty max-sixty merged commit 15ed8e5 into pydata:main Jun 5, 2025
30 of 32 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
io plan to merge Final call for comments topic-backends topic-documentation topic-zarr Related to zarr storage library
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Automatic Dask-Zarr chunk alignment on the to_zarr method
2 participants