Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Chunk issue with sdba.MBCn.train #1903

Open
1 task done
sylvainmarchi opened this issue Sep 3, 2024 · 10 comments
Open
1 task done

Chunk issue with sdba.MBCn.train #1903

sylvainmarchi opened this issue Sep 3, 2024 · 10 comments
Assignees

Comments

@sylvainmarchi
Copy link

Generic Issue

Xclim version: 0.52.0
Python version: 3.10.14
Operating System: Linux

Description

I apply the MBCn algorithm for bias-correcting Euro CORDEX data. I successfully reproduced the example presented in the notebook. Now I follow the same procedure using my data.

What I Did

I loaded my files using open_mfdataset and merged them (one file per variable) to create dref and dhist. The reference and historical datasets are made of a single chunk of size 7305 for the time coordinate.
This is my reference file:

<xarray.DataArray 'multivariate' (multivar: 2, time: 7305, lat: 25, lon: 41)> Size: 120MB
dask.array<rechunk-merge, shape=(2, 7305, 25, 41), dtype=float64, chunksize=(2, 7305, 25, 41), chunktype=numpy.ndarray>
Coordinates:
  * time      (time) object 58kB 1986-01-01 00:00:00 ... 2005-12-31 00:00:00
  * multivar  (multivar) <U3 24B 'pr' 'tas'
Dimensions without coordinates: lat, lon
Attributes:
    CDI:            Climate Data Interface version 1.9.5 (http://mpimet.mpg.d...
    Conventions:    CF-1.6
    history:        Thu Jul 11 11:29:46 2024: cdo remapcon,grid_description_0...
    creation_date:  25-06-2024
    creators:       Ghilain N., Van Schaeybroeck B., Vanderkelen I.
    contact:        [email protected]
    version:        1.1
    affiliation:    Royal Meteorological Institute of Belgium
    projection:     +proj=lcc +lat_2=50.569898649999999 +lat_1=50.56989864999...
    CDO:            Climate Data Operators version 1.9.5 (http://mpimet.mpg.d...
    units:

This is my historical file:

<xarray.DataArray 'multivariate' (multivar: 2, time: 7305, lat: 25, lon: 41)> Size: 60MB
dask.array<rechunk-merge, shape=(2, 7305, 25, 41), dtype=float32, chunksize=(2, 1827, 25, 41), chunktype=numpy.ndarray>
Coordinates:
  * time      (time) object 58kB 1986-01-01 12:00:00 ... 2005-12-31 12:00:00
    height    float64 8B ...
  * multivar  (multivar) <U3 24B 'pr' 'tas'
Dimensions without coordinates: lat, lon
Attributes: (12/32)
    CDI:                            Climate Data Interface version 1.9.6 (htt...
    history:                        Tue May 19 20:33:10 2020: cdo -z szip rem...
    source:                         CLMcom-CCLM4-8-17
    institution:                    Climate Limited-area Modelling Community ...
    Conventions:                    CF-1.4
    institute_id:                   CLMcom
    ...                             ...
    table_id:                       Table day (Sept 2013) 0cf1782745489246c9f...
    modeling_realm:                 atmos
    realization:                    1
    cmor_version:                   2.9.1
    CDO:                            Climate Data Operators version 1.9.6 (htt...
    units:

What I Received

I got the following error message when I executed:

group = 'time'
ADJ = sdba.MBCn.train(
    dref.chunk(dict(time=-1)),
    dhist,
    base_kws={"nquantiles": 20, "group": group},
    adj_kws={"interp": "nearest", "extrapolation": "constant"},
    n_iter=20,  # perform 20 iteration
    n_escore=1000,  # only send 1000 points to the escore metric
)

File /.../conda/envs/xclim_metpy/lib/python3.10/site-packages/xarray/core/computation.py:767, in apply_variable_ufunc(func, signature, exclude_dims, dask, output_dtypes, vectorize, keep_attrs, dask_gufunc_kwargs, *args)
765 for axis, dim in enumerate(core_dims, start=-len(core_dims)):
766 if len(data.chunks[axis]) != 1:
--> 767 raise ValueError(
768 f"dimension {dim} on {n}th function argument to "
769 "apply_ufunc with dask='parallelized' consists of "
770 "multiple chunks, but is also a core dimension. To "
771 "fix, either rechunk into a single array chunk along "
772 f"this dimension, i.e., .chunk(dict({dim}=-1)), or "
773 "pass allow_rechunk=True in dask_gufunc_kwargs "
774 "but beware that this may significantly increase memory usage."
775 )
776 dask_gufunc_kwargs["allow_rechunk"] = True
778 output_sizes = dask_gufunc_kwargs.pop("output_sizes", {})
ValueError: dimension time on 1th function argument to apply_ufunc with dask='parallelized' consists of multiple chunks, but is also a core dimension. To fix, either rechunk into a single array chunk along this dimension, i.e., .chunk(dict(time=-1)), or pass allow_rechunk=True in dask_gufunc_kwargs but beware that this may significantly increase memory usage.

What is going wrong here?

Code of Conduct

  • I agree to follow this project's Code of Conduct
@sylvainmarchi
Copy link
Author

It seems to be related to the group attribute. I do not face this error when I use

group = sdba.Grouper("time.dayofyear", window=31)

@coxipi
Copy link
Contributor

coxipi commented Sep 3, 2024

Hi!

The reference and historical datasets are made of a single chunk of size 7305 for the time coordinate.

Are you sure? That doesn't seem to be the case in the code snippet you shared, for your historical dataset you have chunksize != size of time, 1827 != 7305.

<xarray.DataArray 'multivariate' (multivar: 2, time: 7305 , lat: 25, lon: 41)> Size: 60MB
dask.array<rechunk-merge, shape=(2, 7305, 25, 41), dtype=float32, chunksize=(2, 1827, 25, 41), chunktype=numpy.ndarray>

Try running

group = 'time'
ADJ = sdba.MBCn.train(
    dref.chunk(dict(time=-1)),
    dhist.chunk(dict(time=-1)),
    base_kws={"nquantiles": 20, "group": group},
    adj_kws={"interp": "nearest", "extrapolation": "constant"},
    n_iter=20,  # perform 20 iteration
    n_escore=1000,  # only send 1000 points to the escore metric
)

Cheers,
Éric

@sylvainmarchi
Copy link
Author

Hello Éric,
My bad. It now works. Unfortunately, I am stuck at the next step. Maybe I should create a new "issue" or discussion. When executing the next piece of code

scenh, scens = (
    ADJ.adjust(
        sim=ds,
        ref=dref,
        hist=dhist,
        base=sdba.QuantileDeltaMapping,
        base_kws_vars={
            "pr": {
                "kind": "*",
                "jitter_under_thresh_value": "0.01 mm d-1", # Trick to remove zeros, as they can be problematic for bias-adusting (as far as I understood)
                "adapt_freq_thresh": "0.1 mm d-1", # Value used to biascorrect the frequency of dry days, that is days with almost no precipitation.
            },
            "tas": {"kind": "+"},
        },
        adj_kws={"interp": "nearest", "extrapolation": "constant"},
    )
    for ds in (dhist, dsim)
)

I get:
ValueError: The dimension(s) over which we group, reduce or interpolate cannot be chunked ({'time': (7305, 7305)}).

Below is how my inputs look like.
dref

<xarray.DataArray 'multivariate' (multivar: 2, time: 7305, lat: 25, lon: 41)> Size: 120MB
dask.array<rechunk-merge, shape=(2, 7305, 25, 41), dtype=float64, chunksize=(2, 7305, 25, 41), chunktype=numpy.ndarray>
Coordinates:
  * time      (time) object 58kB 1986-01-01 00:00:00 ... 2005-12-31 00:00:00
  * multivar  (multivar) <U3 24B 'pr' 'tas'
Dimensions without coordinates: lat, lon
Attributes:
    CDI:            Climate Data Interface version 1.9.5 (http://mpimet.mpg.d...
    Conventions:    CF-1.6
    history:        Thu Jul 11 11:29:46 2024: cdo remapcon,grid_description_0...
    creation_date:  25-06-2024
    creators:       Ghilain N., Van Schaeybroeck B., Vanderkelen I.
    contact:        [email protected]
    version:        1.1
    affiliation:    Royal Meteorological Institute of Belgium
    projection:     +proj=lcc +lat_2=50.569898649999999 +lat_1=50.56989864999...
    CDO:            Climate Data Operators version 1.9.5 (http://mpimet.mpg.d...
    units:

dhist

<xarray.DataArray 'multivariate' (multivar: 2, time: 7305, lat: 25, lon: 41)> Size: 60MB
dask.array<rechunk-merge, shape=(2, 7305, 25, 41), dtype=float32, chunksize=(2, 7305, 25, 41), chunktype=numpy.ndarray>
Coordinates:
  * time      (time) object 58kB 1986-01-01 12:00:00 ... 2005-12-31 12:00:00
    height    float64 8B ...
  * multivar  (multivar) <U3 24B 'pr' 'tas'
Dimensions without coordinates: lat, lon
Attributes: (12/32)
    CDI:                            Climate Data Interface version 1.9.6 (htt...
    history:                        Tue May 19 20:33:10 2020: cdo -z szip rem...
    source:                         CLMcom-CCLM4-8-17
    institution:                    Climate Limited-area Modelling Community ...
    Conventions:                    CF-1.4
    institute_id:                   CLMcom
    ...                             ...
    table_id:                       Table day (Sept 2013) 0cf1782745489246c9f...
    modeling_realm:                 atmos
    realization:                    1
    cmor_version:                   2.9.1
    CDO:                            Climate Data Operators version 1.9.6 (htt...
    units:

dsim

<xarray.DataArray 'multivariate' (multivar: 2, time: 7305, lat: 25, lon: 41)> Size: 60MB
dask.array<rechunk-merge, shape=(2, 7305, 25, 41), dtype=float32, chunksize=(2, 7305, 25, 41), chunktype=numpy.ndarray>
Coordinates:
  * time      (time) object 58kB 2036-01-01 12:00:00 ... 2055-12-31 12:00:00
    height    float64 8B ...
  * multivar  (multivar) <U3 24B 'pr' 'tas'
Dimensions without coordinates: lat, lon
Attributes: (12/32)
    CDI:                            Climate Data Interface version 1.9.6 (htt...
    history:                        Tue May 19 21:32:31 2020: cdo -z szip rem...
    source:                         CLMcom-CCLM4-8-17
    institution:                    Climate Limited-area Modelling Community ...
    Conventions:                    CF-1.4
    institute_id:                   CLMcom
    ...                             ...
    table_id:                       Table day (Sept 2013) 0cf1782745489246c9f...
    modeling_realm:                 atmos
    realization:                    1
    cmor_version:                   2.9.1
    CDO:                            Climate Data Operators version 1.9.6 (htt...
    units:

ADJ.ds

<xarray.Dataset> Size: 3MB
Dimensions:         (multivar_prime: 2, win_dim0: 1, lat: 25, lon: 41,
                     iterations: 20, quantiles: 20, multivar: 2)
Coordinates:
    height          float64 8B 2.0
  * multivar_prime  (multivar_prime) <U3 24B 'pr' 'tas'
  * quantiles       (quantiles) float64 160B 0.025 0.075 0.125 ... 0.925 0.975
  * win_dim0        (win_dim0) int64 8B -1
  * multivar        (multivar) <U3 24B 'pr' 'tas'
Dimensions without coordinates: lat, lon, iterations
Data variables:
    af_q            (win_dim0, lat, lon, iterations, multivar_prime, quantiles) float32 3MB dask.array<chunksize=(1, 25, 41, 20, 2, 20), meta=np.ndarray>
    escores         (win_dim0, lat, lon, iterations) float32 82kB dask.array<chunksize=(1, 25, 41, 20), meta=np.ndarray>
    rot_matrices    (iterations, multivar, multivar_prime) float32 320B 0.995...
Attributes:
    _xclim_adjustment:  {"py/object": "xclim.sdba.adjustment.MBCn", "py/state...
    adj_params:         MBCn(quantiles=array([0.025, 0.075, 0.125, 0.175, 0.2...

Is it OK to use proleptic gregorian calendar and 2 dimensions for space?

@coxipi
Copy link
Contributor

coxipi commented Sep 3, 2024

No worries!

Space dimensions are not an issue. I will look into it, but if you're using group = "time.dayofyear", you will need a regular calendar yes. I think this is assumed for other sdba methods as well. For group="time" this should not be a constraint.

Can you try:

scenh, scens = (
    ADJ.adjust(
        sim=ds.convert_calendar("noleap"),
        ref=dref.convert_calendar("noleap"),
        hist=dhist.convert_calendar("noleap"),
        base=sdba.QuantileDeltaMapping,
        base_kws_vars={
            "pr": {
                "kind": "*",
                "jitter_under_thresh_value": "0.01 mm d-1", # Trick to remove zeros, as they can be problematic for bias-adusting (as far as I understood)
                "adapt_freq_thresh": "0.1 mm d-1", # Value used to biascorrect the frequency of dry days, that is days with almost no precipitation.
            },
            "tas": {"kind": "+"},
        },
        adj_kws={"interp": "nearest", "extrapolation": "constant"},
    )
    for ds in (dhist, dsim)
)

and see if this removes the error messages? In the meantime I will investigate.

@coxipi
Copy link
Contributor

coxipi commented Sep 3, 2024

Can you share more information on the error? If you could share 3 .nc files for dref, dhist,dsim with a single lon/lat point it would be helpful so I can troubleshoot.

@sylvainmarchi
Copy link
Author

I still have the same error message after converting calendars. Please find my netCDF files as well as my python script at (.py and .nc files are not supported):
https://cloud.meteo.be/s/8c9KMpcwxGb4gE2

@coxipi
Copy link
Contributor

coxipi commented Sep 3, 2024

The python script in your folder is for the old way of working things with NpdfTransform? Do you have the script you used for this above?

@sylvainmarchi
Copy link
Author

I have updated the file.

@coxipi
Copy link
Contributor

coxipi commented Sep 4, 2024

Thanks! The problem is that you have different time coordinates:

dref.time 
>>> '1986-01-01T00:00:00.000000000', '1986-01-02T00:00:00.000000000',
       '1986-01-03T00:00:00.000000000', ...,
dhist.time 
>>> '1986-01-01T12:00:00.000000000', '1986-01-02T12:00:00.000000000',
       '1986-01-03T12:00:00.000000000', ...,

for instance if you try sdba.QuantileDeltaMapping.train(ref=dref, hist=dhist) you will get the same error. If you change the time in dhist:

dhist["time"] = dref.time

the error disappears.

I think we could control this earlier in the process since our methods expect equal time coordinates and give a clearer error message. Thanks for this feedback.

Let me know if there are other problems after that.

@coxipi
Copy link
Contributor

coxipi commented Sep 4, 2024

For xclim devs: We could have a similar function to _harmonize_units which simply checks if times match? This would look more like _check_inputs, but it could not really fit in _check_inputs since the input list is fairly general, inputs could be [ref, hist] which would work for our needs of matching times, but it can also be [ref, hist, sim] or [sim].

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants