Poor performance with concatenating and saving netCDF files #4448

Closed · opened by @hdyson

Description

🐛 Bug Report

We have multiple netCDF files that need to be concatenated together and saved. Typically, this is about 80 files of 5 GB each. The load, concatenate and save takes multiple days, at which point the job is killed (so we don't know whether this is an infinite loop, or whether more time would have let the task complete). Below, I've included the script to do the concatenation and save, and a script to generate sample data analogous to the real data we're working with.

How To Reproduce

Steps to reproduce the behaviour:

  1. Generate sample data with this script (you may need to adjust the destination filename on line 74); this takes ~15 minutes:
#!/usr/bin/env python
"""
Generates suitable sample data to demonstrate issue where loading multiple
files, concatenating, and saving has unexpected performance.

"""

import itertools
import os

import iris
import numpy as np


def _build_coordinate(start, stop, number_of_points, name):
    coordinate = iris.coords.DimCoord(np.linspace(start, stop, number_of_points))
    coordinate.guess_bounds()
    coordinate.units = "degrees"
    coordinate.rename(name)
    return coordinate


def generate_cubes(times=11280, longitudes=2560, latitudes=1920, x_split=8, y_split=10):
    """
    Generates cubes that tesselate the globe.

    Specifically, each individual cube has all times, but contains a longitude
    range of longitudes/x_split  (and similarly for latitude), where all of
    the cubes combined cover the range -180 to +180 degrees longitude, and
    -90 to +90 degrees latitude.

    All cubes are contiguously chunked.

    longitudes needs to be multiple of x_split; latitudes needs to be a
    multiple of y_split.  These are not checked.

    """
    tc = iris.coords.DimCoord(
        np.arange(times), units="days since epoch", standard_name="time"
    )
    delta_lon = 360.0 / longitudes
    delta_lat = 180.0 / latitudes
    xc = _build_coordinate(
        -180.0 + delta_lon / 2.0, 180.0 - delta_lon / 2.0, longitudes, "longitude"
    )
    yc = _build_coordinate(
        -90.0 + delta_lat / 2.0, 90.0 - delta_lat / 2.0, latitudes, "latitude"
    )
    # Check coordinates cover globe:
    assert np.min(xc.bounds) == -180.0
    assert np.max(xc.bounds) == 180.0
    assert np.min(yc.bounds) == -90.0
    assert np.max(yc.bounds) == 90.0
    x_points_per_cube = longitudes // x_split
    y_points_per_cube = latitudes // y_split
    for x_index, y_index in itertools.product(range(x_split), range(y_split)):
        subset_x = xc[x_index * x_points_per_cube : (x_index + 1) * x_points_per_cube]
        subset_y = yc[y_index * y_points_per_cube : (y_index + 1) * y_points_per_cube]
        x_points = len(subset_x.points)
        y_points = len(subset_y.points)
        data = np.arange(times * x_points * y_points).reshape(times, x_points, y_points)
        print(data.shape)
        cube = iris.cube.Cube(data)
        cube.rename("sample_data")
        cube.add_dim_coord(tc, 0)
        cube.add_dim_coord(subset_x, 1)
        cube.add_dim_coord(subset_y, 2)
        yield cube


def save_cubes(cubes):
    """Given an iterator of cubes, saves them in sequentially numbered files."""
    for index, cube in enumerate(cubes):
        filename = os.path.join(
            os.path.expandvars("$SCRATCH"), "iris_example", f"temp_{index}.nc"
        )
        iris.save(cube, filename)


def main():
    cubes = generate_cubes()
    save_cubes(cubes)


if __name__ == "__main__":
    main()
  2. Run this script to demonstrate the actual issue (again, filenames may need to be adjusted; a sketch for timing the individual stages follows this step):
import os

import iris

def main():
    filename_input = os.path.join(
        os.path.expandvars("$SCRATCH"), "iris_example", "temp_*.nc"
    )
    filename_output = os.path.join(
        os.path.expandvars("$SCRATCH"), "iris_example", "concatenated.nc"
    )
    source_cubes = iris.load(filename_input)
    cubes = iris.cube.CubeList(source_cubes)
    cubes = cubes.concatenate()
    iris.save(cubes, filename_output)


if __name__ == "__main__":
    main()
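
To pin down where the time goes, here is a minimal timing harness around the three stages, assuming the same $SCRATCH layout as above. Iris loads netCDF data lazily, so load and concatenate should return quickly, and the cost is expected to surface at save time, when the data are actually read and written:

import os
import time

import iris


def main():
    base = os.path.join(os.path.expandvars("$SCRATCH"), "iris_example")

    t0 = time.perf_counter()
    cubes = iris.load(os.path.join(base, "temp_*.nc"))
    t1 = time.perf_counter()
    cubes = cubes.concatenate()
    t2 = time.perf_counter()
    iris.save(cubes, os.path.join(base, "concatenated.nc"))
    t3 = time.perf_counter()

    # Each stage is timed separately; with lazy loading, expect the
    # save stage to dominate.
    print(f"load:        {t1 - t0:.1f}s")
    print(f"concatenate: {t2 - t1:.1f}s")
    print(f"save:        {t3 - t2:.1f}s")


if __name__ == "__main__":
    main()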

Expected behaviour

Saving to complete in a sensible time.
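
For scale, a back-of-envelope check of the data volume (matching the "80 files of 5 GB each" above); the figures are simple arithmetic on the generator script's default parameters:

# Total volume of the default sample data (int64 = 8 bytes per value).
times, longitudes, latitudes = 11280, 2560, 1920
total_bytes = times * longitudes * latitudes * 8
print(f"total:    {total_bytes / 2**30:.0f} GiB")       # ~413 GiB
print(f"per file: {total_bytes / 80 / 2**30:.1f} GiB")  # ~5.2 GiB

Even at a modest sustained 100 MB/s, a straight sequential copy of ~440 GB would take roughly 75 minutes, so multiple days suggests a pathological access pattern rather than sheer volume.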

Environment

  • OS & Version: RHEL7
  • Iris Version: 2.3, but also observed with iris 2.4 (and possibly later - @bjlittle was investigating)

Additional context

For the real-world case that led to this example, we do have control over how the individual 80 files are saved, so if there are more sensible chunk sizes we should use for the netCDF files to work around this problem, that would be a solution for us.
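
On that point, the storage layout of an existing file can be inspected with the netCDF4 library, and iris's netCDF saver accepts a chunksizes keyword (passed through to netCDF4) controlling the layout of the saved variable. A minimal sketch: the paths follow the scripts above, and the output filename and chunk shape (240, 320, 192) are purely illustrative, not a recommendation:

import os

import iris
import netCDF4

base = os.path.join(os.path.expandvars("$SCRATCH"), "iris_example")

# Inspect the current layout: chunking() returns "contiguous" for
# contiguous storage, or a list of per-dimension chunk sizes.
with netCDF4.Dataset(os.path.join(base, "temp_0.nc")) as ds:
    print(ds.variables["sample_data"].chunking())

# Re-save with explicit chunks, one entry per dimension (time,
# longitude, latitude for the cubes generated above).
cube = iris.load_cube(os.path.join(base, "temp_0.nc"))
iris.save(cube, os.path.join(base, "temp_0_rechunked.nc"),
          chunksizes=(240, 320, 192))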
