Description
🐛 Bug Report
We have multiple netCDF files that need to be concatenated together and saved. Typically, this is about 80 files of roughly 5 GB each. The load, concatenate and save takes multiple days, until the job is killed (so we don't know whether this is an infinite loop, or whether more time would have allowed the task to complete). Below, I've included a script to generate sample data analogous to the real data we're working with, and the script that does the load, concatenation and save.
How To Reproduce
Steps to reproduce the behaviour:
- Generate sample data with this script (you may need to adjust the destination path built in `save_cubes`) - this takes ~15 minutes:
```python
#!/usr/bin/env python
"""
Generates suitable sample data to demonstrate an issue where loading multiple
files, concatenating, and saving has unexpected performance.
"""
import itertools
import os

import iris
import numpy as np


def _build_coordinate(start, stop, number_of_points, name):
    coordinate = iris.coords.DimCoord(np.linspace(start, stop, number_of_points))
    coordinate.guess_bounds()
    coordinate.units = "degrees"
    coordinate.rename(name)
    return coordinate


def generate_cubes(times=11280, longitudes=2560, latitudes=1920, x_split=8, y_split=10):
    """
    Generates cubes that tessellate the globe.

    Specifically, each individual cube has all times, but contains a longitude
    range of longitudes/x_split (and similarly for latitude), where all of
    the cubes combined cover the range -180 to +180 degrees longitude, and
    -90 to +90 degrees latitude.

    All cubes are contiguously chunked.

    longitudes needs to be a multiple of x_split; latitudes needs to be a
    multiple of y_split. These are not checked.
    """
    tc = iris.coords.DimCoord(
        np.arange(times), units="days since epoch", standard_name="time"
    )
    delta_lon = 360.0 / longitudes
    delta_lat = 180.0 / latitudes
    xc = _build_coordinate(
        -180.0 + delta_lon / 2.0, 180.0 - delta_lon / 2.0, longitudes, "longitude"
    )
    yc = _build_coordinate(
        -90.0 + delta_lat / 2.0, 90.0 - delta_lat / 2.0, latitudes, "latitude"
    )
    # Check coordinates cover globe:
    assert np.min(xc.bounds) == -180.0
    assert np.max(xc.bounds) == 180.0
    assert np.min(yc.bounds) == -90.0
    assert np.max(yc.bounds) == 90.0
    x_points_per_cube = longitudes // x_split
    y_points_per_cube = latitudes // y_split
    for x_index, y_index in itertools.product(range(x_split), range(y_split)):
        subset_x = xc[x_index * x_points_per_cube : (x_index + 1) * x_points_per_cube]
        subset_y = yc[y_index * y_points_per_cube : (y_index + 1) * y_points_per_cube]
        x_points = len(subset_x.points)
        y_points = len(subset_y.points)
        data = np.arange(times * x_points * y_points).reshape(times, x_points, y_points)
        print(data.shape)
        cube = iris.cube.Cube(data)
        cube.rename("sample_data")
        cube.add_dim_coord(tc, 0)
        cube.add_dim_coord(subset_x, 1)
        cube.add_dim_coord(subset_y, 2)
        yield cube


def save_cubes(cubes):
    """Given an iterator of cubes, saves them in sequentially numbered files."""
    for index, cube in enumerate(cubes):
        filename = os.path.join(
            os.path.expandvars("$SCRATCH"), "iris_example", f"temp_{index}.nc"
        )
        iris.save(cube, filename)


def main():
    cubes = generate_cubes()
    save_cubes(cubes)


if __name__ == "__main__":
    main()
```
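For a quicker end-to-end check of the scripts themselves (the performance problem only really shows up at full size), the generator can be run with much smaller dimensions - a convenience sketch only, with arbitrary reduced sizes and a hypothetical module name:

```python
# Reduced-size run, purely to check the scripts work end to end; the
# dimensions here are arbitrary and much smaller than the real data.
# Assumes the script above was saved as generate_sample_data.py
# (hypothetical name) so that generate_cubes() can be imported.
from generate_sample_data import generate_cubes

for cube in generate_cubes(times=24, longitudes=64, latitudes=48, x_split=2, y_split=2):
    print(cube.summary(shorten=True))
```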
- Run this script to demonstrate the actual issue (again, filenames may need to be adjusted):
```python
import os

import iris


def main():
    filename_input = os.path.join(
        os.path.expandvars("$SCRATCH"), "iris_example", "temp_*.nc"
    )
    filename_output = os.path.join(
        os.path.expandvars("$SCRATCH"), "iris_example", "concatenated.nc"
    )
    source_cubes = iris.load(filename_input)
    cubes = iris.cube.CubeList(source_cubes)
    cubes = cubes.concatenate()
    iris.save(cubes, filename_output)


if __name__ == "__main__":
    main()
```
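For what it's worth, the dask chunking that results from the load and concatenate steps can be inspected with something like the snippet below (a sketch only, using the same paths as above; `Cube.lazy_data()` returns the underlying dask array without realising it):

```python
import os

import iris


def inspect_chunks():
    filename_input = os.path.join(
        os.path.expandvars("$SCRATCH"), "iris_example", "temp_*.nc"
    )
    cubes = iris.load(filename_input).concatenate()
    for cube in cubes:
        # .chunks reports the dask chunk shape per dimension; nothing is
        # loaded into memory here.
        print(cube.name(), cube.shape, cube.lazy_data().chunks)


if __name__ == "__main__":
    inspect_chunks()
```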
Expected behaviour
Saving to complete in a sensible time.
Environment
- OS & Version: RHEL7
- Iris Version: 2.3, but also observed with Iris 2.4 (and possibly later; @bjlittle was investigating)
Additional context
For the real-world case that led to this example, we do have control over how the individual 80 files are saved - so if there are more sensible chunk sizes we should be using for the netCDF files to work around this problem, that would be an acceptable solution for us.
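If explicit netCDF chunking on save is the way to go, this variant of `save_cubes` from the sample-data script is roughly what we had in mind (a sketch only: `save_cubes_chunked` and the chunk shape are illustrative, and `chunksizes` is simply forwarded by `iris.save` to the netCDF saver):

```python
import os

import iris


def save_cubes_chunked(cubes, time_chunk=1):
    """Variant of save_cubes with an explicit netCDF chunk shape (illustrative)."""
    for index, cube in enumerate(cubes):
        filename = os.path.join(
            os.path.expandvars("$SCRATCH"), "iris_example", f"temp_{index}.nc"
        )
        # Each chunk holds `time_chunk` time steps and the full lon/lat extent
        # of this cube; whether that is a sensible shape is exactly what we'd
        # like advice on.
        iris.save(
            cube,
            filename,
            chunksizes=(time_chunk, cube.shape[1], cube.shape[2]),
        )
```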