Poor performance with concatenating and saving netCDF files #4448
Comments
👍 Great example code! It will really help to assess performance. Also great timing...
My best guess is that it's a dask/netCDF chunking issue - but that is a fairly speculative guess (and, frankly, investigating it in more detail is a bit beyond my dask knowledge). Notice that there are a lot of times in the sample data, since the real data is 30 years of daily data.
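If the chunking guess is right, one quick check (a minimal sketch; the file pattern is a placeholder for the sample data) is to look at the Dask chunking Iris picks when the files are loaded lazily:

```python
import iris

# Hypothetical path; point this at the sample-data files.
cubes = iris.load("temp_*.nc")
for cube in cubes:
    # core_data() returns the underlying dask array without loading anything,
    # so printing its chunks is cheap even for very large files.
    print(cube.name(), cube.shape, cube.core_data().chunks)
```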
Just linking related issues: this might help #4572
So if I'm interpreting #4572 and our earlier conversation correctly, the issue here is that what we're wanting to happen is:

- for each time: load just that time's slice from the relevant tile(s), then write it out

whereas what is actually going on under the hood is:

- for each time: load all the times from all the tiles, then write out the single slice we need

(I initially wrote this with the slice after the concatenate - I don't think this can be the case, though, since that would imply loading all the data at once, which would exceed the memory allocation, and the job would be killed long before the 3-day timeout.) Hence there's a lot of unnecessary I/O. And the user-controlled chunking operations being added in #4572 should mean that we can do the load for each time in turn, rather than loading all the times from all the tiles for each time?
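A Dask-only sketch of that difference (the chunk shapes here are assumptions for illustration, not necessarily what Iris produces): concatenation preserves whatever chunking the loader chose, so the shape of those chunks determines how much source data each output chunk has to pull in, and it is the explicit rechunk afterwards (as in the workaround below) that changes the access pattern.

```python
import dask.array as da

# Pretend each "file" is one lazily loaded cube of daily data, with the
# horizontal dimensions split into many small chunks (an assumption).
tiles = [da.zeros((365, 500, 500), chunks=(365, 50, 50)) for _ in range(4)]

combined = da.concatenate(tiles, axis=0)
print(combined.chunks)      # the small 50 x 50 horizontal chunks are preserved
print(combined.numblocks)   # number of chunks along each dimension

# Rechunk as in the workaround below: one chunk along the innermost
# dimension, with dask choosing sizes for the other dimensions.
rechunked = combined.rechunk({0: "auto", 1: "auto", 2: -1})
print(rechunked.chunks)
print(rechunked.numblocks)
```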
Following advice from @pp-mo, rechunking the data so that the innermost dimension is a single chunk seems to resolve this issue for our use case. Adding two lines to `concatenate_and_save()`:

```python
import os

import iris
import iris.cube


def concatenate_and_save():
    filename_input = os.path.join(
        os.path.expandvars("$SCRATCH"), "iris_example", "temp_*.nc"
    )
    filename_output = os.path.join(
        os.path.expandvars("$SCRATCH"), "iris_example", "concatenated.nc"
    )
    source_cubes = iris.load(filename_input)
    cubes = iris.cube.CubeList(source_cubes)
    cubes = cubes.concatenate()
    # These are the additional two lines: make the innermost dimension a
    # single chunk, and let dask choose the chunking of the other dimensions.
    for cube in cubes:
        cube.data = cube.core_data().rechunk({0: "auto", 1: "auto", 2: -1})
    iris.save(cubes, filename_output)
```

is sufficient to get the save time down to around 1 hour. For a concatenation of ~0.5 TB of data in 80 files to a single file of ~0.5 TB, that doesn't seem excessive to me. I think this gives us a workaround we can use in ANTS, so I'm happy to close this issue. If more thorough testing reveals that this isn't sufficient, I can re-open it. It would be great if the eventual solution in #4572 either removed the need for this workaround, or at least didn't break it 😃
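For anyone checking whether the fix has taken effect on disk, a hedged sketch using the netCDF4 library (the filename is a placeholder): the chunk shapes actually written into the output file can be inspected directly.

```python
import netCDF4

# Hypothetical output path from the script above.
with netCDF4.Dataset("concatenated.nc") as dataset:
    for name, variable in dataset.variables.items():
        # chunking() returns "contiguous" or a list of per-dimension chunk sizes.
        print(name, variable.chunking())
```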
Also noted: comparing runs without / with the above rechunk fix, @hdyson sees the difference described above. All runs used 1 CPU / 2 cores only (with the numbers of workers/CPUs controlled for both NumPy and Dask). I don't really know what the memory cost is, but this is on SPICE, where there is no swapfile space, so the job should simply fail if it exceeded memory; so it can't be that. Interest here is from two points:
🐛 Bug Report
We have multiple netCDF files that need to be concatenated together and saved. Typically, this is about 80 files of 5 GB each. The load, concatenate and save takes multiple days, until the job is killed (so we don't know whether this is an infinite loop, or whether more time would have let the task complete). Below, I've included the script to do the concatenation and save, and a script to generate sample data analogous to the real data we're working with.
How To Reproduce
Steps to reproduce the behaviour:
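The original reproduction scripts are not reproduced here; a minimal sketch along the lines described above (file counts, grid sizes, paths, variable names and coordinates are all placeholders, not the original poster's values) could generate analogous sample data like this:

```python
import os

import numpy as np

import iris
import iris.coords
import iris.cube

output_dir = os.path.join(os.path.expandvars("$SCRATCH"), "iris_example")
os.makedirs(output_dir, exist_ok=True)

n_files = 4          # the real case uses ~80 files of ~5 GB each
times_per_file = 30  # the real data is ~30 years of daily data in total
ny, nx = 500, 500    # placeholder horizontal grid

latitude = iris.coords.DimCoord(
    np.linspace(-90.0, 90.0, ny), standard_name="latitude", units="degrees"
)
longitude = iris.coords.DimCoord(
    np.linspace(0.0, 359.0, nx), standard_name="longitude", units="degrees"
)

for index in range(n_files):
    # Consecutive, non-overlapping time ranges so the cubes concatenate.
    time = iris.coords.DimCoord(
        np.arange(times_per_file) + index * times_per_file,
        standard_name="time",
        units="days since 2000-01-01",
    )
    data = np.random.random((times_per_file, ny, nx)).astype("float32")
    cube = iris.cube.Cube(
        data,
        var_name="temp",
        dim_coords_and_dims=[(time, 0), (latitude, 1), (longitude, 2)],
    )
    iris.save(cube, os.path.join(output_dir, f"temp_{index:03d}.nc"))
```

The concatenate-and-save script itself is essentially the `concatenate_and_save()` function shown in the comments above (without the two rechunk lines).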
Expected behaviour
Saving to complete in a sensible time.
Environment
Additional context
For the real-world case that led to this example, we do have control over how the individual 80 files are saved, so if there are more sensible chunk sizes we should use for the netCDF files to work around this problem, that would be a solution for us.
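Since the individual files are under our control, one option (a hedged sketch: the Iris netCDF saver accepts a `chunksizes` keyword passed through `iris.save`, and the shape below is only a placeholder that would need to match the real data) would be to set the on-disk chunk shapes explicitly when each file is written:

```python
import iris

# Hypothetical single-tile file; the real workflow writes ~80 of these.
cube = iris.load_cube("temp_000.nc")

# Assumption: (time, y, x) ordering; one time per chunk, whole horizontal
# field per chunk, so a time-slice read touches a single contiguous chunk.
iris.save(cube, "temp_000_chunked.nc", chunksizes=(1, 500, 500))
```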