resample and boolean indexing with dask-arrays #356
You can try `resample_iterations` instead of `resample_iterations_idx`. That function doesn't use vectorised indexing and is more robust on very large datasets in my experience.
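For context, a minimal sketch of how that call might look on a chunked DataArray; the argument names follow my reading of the xskillscore signature and are an assumption, not code from this thread:

```python
import numpy as np
import xarray as xr
import xskillscore as xs

# small chunked toy field with the dimensions used elsewhere in this issue
da = xr.DataArray(15 + np.random.randn(200, 200, 150),
                  dims=["x", "y", "time"]).chunk({"x": 10, "y": 20})

# loop-based resampling along "time": draws with replacement, adds an
# "iteration" dimension, and avoids building one large vectorized index array
boot = xs.resample_iterations(da, iterations=100, dim="time", replace=True)
```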
I am so stupid. Maybe I found two solutions for this problem. I increased the size of the arrays to 200 x 200 x 150.

Solution A: prior to the resampling process, I can remove the chunks from the dask arrays using […]

Solution B: I remembered that the sign tests […]

```python
dsskill = xr.DataArray(data=15 + 2.1 * numpy.random.randn(200, 200, 150),
                       dims=["x", "y", "time"]).chunk({'x': 10, 'y': 20})
dsref = xr.DataArray(data=15 + 0.15 * numpy.random.randn(200, 200, 150),
                     dims=["x", "y", "time"]).chunk({'x': 10, 'y': 20})
dsproof = xr.DataArray(data=15 + 2.0 * numpy.random.randn(200, 200, 150),
                       dims=["x", "y", "time"]).chunk({'x': 10, 'y': 20})
```

And the memory statistics read: […]
The robustness of the memory consumption measurement is not clear. However, I decided to also track child processes during the execution of the bootstrapping code: […]

There is still a memory peak of 8 GB when using dask (it is hard to pin down the reason with this kind of tracking, but maybe the resampling of the indices requires grouping all the small chunks into larger chunks). However, the dask arrays (a) prevent the allocation of nearly 50 GB for […]

Nevertheless, I would like to further reduce the memory allocation and avoid applying the resampling process twice, i.e.

```python
p2divp1 = ( numpy.square( forecast1[rand_ind] - observations[rand_ind] ) ).mean(dim=coordtime) / \
          ( numpy.square( forecast2[rand_ind] - observations[rand_ind] ) ).mean(dim=coordtime)
```

where […]

Is there a small code example which would illustrate how to realize this?
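The tracking code itself is not included above; a minimal standard-library sketch of how peak memory of the process and its finished child processes can be read on Linux (my own illustration, not the author's script):

```python
import resource

def report_peak_memory():
    # ru_maxrss is reported in kilobytes on Linux
    self_peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    child_peak = resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss
    print(f"peak RSS (this process):    {self_peak / 1e6:.1f} GB")
    print(f"peak RSS (child processes): {child_peak / 1e6:.1f} GB")

# ... run the bootstrapping code here, then:
report_peak_memory()
```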
I would start from rebuilding `resample_iterations(_idx)` manually.
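A rough sketch of what such a manual rebuild could look like, assuming a plain loop over iterations with integer indexing (my own illustration, not xskillscore code):

```python
import numpy as np
import xarray as xr

def manual_resample(da, iterations, dim="time", seed=0):
    """Draw bootstrap samples along `dim`, one iteration at a time."""
    rng = np.random.default_rng(seed)
    n = da.sizes[dim]
    samples = []
    for _ in range(iterations):
        idx = rng.integers(0, n, size=n)       # indices drawn with replacement
        samples.append(da.isel({dim: idx}))    # stays lazy for dask-backed data
    return xr.concat(samples, dim="iteration")
```

With a fixed seed the same indices are drawn for every input array, so applying this to both forecasts and the observations reuses one resampling instead of drawing it twice, and only one integer index array lives in memory at any time.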
I want to use the `resample_iterations_idx` functionality to bootstrap evaluation metrics of hindcasts. The challenge with huge datasets is the memory allocation when storing all the iteration samples. I started with the mean squared error skill score and tried to understand the memory consumption. The following small example script demonstrates the usage of the metric and the memory consumption:
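The script itself did not survive the copy into this issue. A minimal sketch of what it might have looked like, assuming `xskillscore.resample_iterations_idx`, a plain MSE-based skill score, and that `dsref` plays the role of the observations; only the variable names `bsp1`/`bsp2` and the array definitions quoted elsewhere in this thread are taken from the original, the rest (including the iteration count) is my reconstruction, and the memory measurement itself is not reproduced here:

```python
import numpy as np
import xarray as xr
import xskillscore as xs

# three synthetic fields; shapes follow the definitions quoted elsewhere in this issue
dsskill = xr.DataArray(15 + 2.1 * np.random.randn(200, 200, 150), dims=["x", "y", "time"])
dsref = xr.DataArray(15 + 0.15 * np.random.randn(200, 200, 150), dims=["x", "y", "time"])
dsproof = xr.DataArray(15 + 2.0 * np.random.randn(200, 200, 150), dims=["x", "y", "time"])

iterations = 100  # illustrative value only

# resample both forecasts along "time"; each result gains an "iteration" dimension
bsp1 = xs.resample_iterations_idx(dsskill, iterations, dim="time")
bsp2 = xs.resample_iterations_idx(dsproof, iterations, dim="time")

# mean squared error skill score per bootstrap iteration
msess = 1.0 - ((bsp1 - dsref) ** 2).mean("time") / ((bsp2 - dsref) ** 2).mean("time")
print(msess)
```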
Running the script from the Linux console gives me the following: […]
Thus, we need 5.7 GB to store each of the iteration samples `bsp1` and `bsp2`. That is consistent with the size of the arrays. However, climate datasets are often much larger, so I started working with dask arrays and changed the three lines: […]

Now, the dask scheduler collects all operations and performs the computation as the netCDF file is written. However, `resample_iterations_idx` seems to refer to indices which belong to the unchunked fields, not to the chunked fields: […]

Is there a way to use the resampling functionality on dask arrays to save memory? It is not clear to me whether this is really parallelizable. As already commented in issue #221 by @dougiesquire (#221 (comment)), there is a problem with boolean indexing of numpy arrays when dask arrays are involved.
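The changed lines and the write-out step are not shown above. A self-contained sketch of the general pattern, where the chunked result stays lazy and the computation only runs when the netCDF file is written (names, shapes and the file name are placeholders, not from the original script):

```python
import numpy as np
import xarray as xr

# placeholder for the lazily computed, dask-backed skill-score field
msess = xr.DataArray(np.random.randn(200, 200), dims=["x", "y"],
                     name="msess").chunk({"x": 10, "y": 20})

# up to here only a task graph exists; to_netcdf triggers the actual
# computation and writes the result to disk
msess.to_netcdf("msess_bootstrap.nc")
```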