Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Binary operations are not internal #23

Open
aasdelat opened this issue Jun 8, 2024 · 9 comments
Open

Binary operations are not internal #23

aasdelat opened this issue Jun 8, 2024 · 9 comments

Comments

@aasdelat
Copy link

aasdelat commented Jun 8, 2024

If I have a dataset ds, then the result of ds + ds is a matrix, not a dataset.
I have followed the example of the groupby function, that calculates anomalies, and we end up with a matrix, not a dataset. This prevents the use of the result again in the gropuby function if we want to calculate the standard deviations in order to get the standardized anomalies. The groupby function needs a dataset and does not work with a matrix.

@Alexander-Barth
Copy link
Member

It is not clear to me what you mean. Can your provide all information in the issue template,
(https://github.com/JuliaGeo/CommonDataModel.jl/issues/new) in particular an minimal reproducible code example?

@aasdelat
Copy link
Author

Sorry, I apologise for having been too short.
It is really a feature request, not a bug.
The fact is that, when I have a dataset, ds with a datavar, msl, the results of the operations involving that datavar is not a datavar, but a matrix.
For example, I have the following dataset:

julia> ds
Dataset: /home/antonio/Documentos/datos/ERA5/ERA5_N320_1h_sl_sfc_NH15S_1978_2023_msl/ERA5_N320_1h_sl_sfc_NH15S_1982-06-01_1982-06-30_msl.grib
Group: /

Dimensions
   values = 90076
   valid_time = 0

Variables
  valid_time   (0)
    Datatype:    DateTime
    Dimensions:  valid_time
    Attributes:
     units                = seconds since 1970-01-01T00:00:00
     calendar             = proleptic_gregorian
     long_name            = time
     standard_name        = time

  longitude   (90076)
    Datatype:    Float64
    Dimensions:  values
    Attributes:
     units                = degrees_east
     long_name            = longitude
     standard_name        = longitude

  latitude   (90076)
    Datatype:    Float64
    Dimensions:  values
    Attributes:
     units                = degrees_north
     long_name            = latitude
     standard_name        = latitude

  msl   (90076 × 0)
    Datatype:    Union{Missing, Float64}
    Dimensions:  values × valid_time
    Attributes:
     units                = Pa
     long_name            = Mean sea level pressure
     standard_name        = air_pressure_at_mean_sea_level

I will use the "msl" datavar. If I sum that datavar with itself, the result is not a datavar, but a matrix:

julia> ds["msl"] + ds["msl"]
90076×0 Matrix{Union{Missing, Float64}}

The same occurs when I multiply by a scalar:

julia> 2 .* ds["msl"]
90076×0 Matrix{Union{Missing, Float64}}

So, I am really asking for a new feature: that the result of operations with datavars, should be also datavars.

@rafaqz
Copy link
Member

rafaqz commented Jun 10, 2024

But why? If you do a broadcast like that all of the variable attributes like "units" will likely be invalid, and actually misleading to keep.

What do you lose making them a regular Array?

@aasdelat
Copy link
Author

Well, the only case I have faced so far, is when attempting to store the results into a NetCDF, just no not to have to re-compose the dimensions before to save to disk.

@Alexander-Barth
Copy link
Member

Actually, it is sometimes the case that attributes are wrong or missing in a NetCDF/Zarr file and that one wants to use the data without recreating a big file on the disk. It would be nice to have a wrapper type (like WrapperDataset, WrapperVariable, but with a better name :-)) that allows you to edit attributes (and maybe other thinks) in memory without affecting the data on the disk. For example:

ds_orig = NCDataset("original_file_with_missing_units.nc")
nctemp_orig = ds_orig["temp"];
nctemp = WrapperVariable(nctemp);
nctemp.attrib["units"] = "degree Celsuis" # nctemp has now the correct units but the file is not changed

I am wondering if such wrapper classes can not also be useful here to correct units (or other metadata), like

ds_orig = NCDataset("velocity.nc")
ncu = ds_orig["u"];
ncu2 = ncu.^2 # automatically create a WrapperVariable? computation eager or lazy?
ncu2.attrib["units"] = "m2 s-2" # the file will not be changed

There is also a MemoryDataset and MemoryVariable that I use for testing, which I guess could be adapted.

@aasdelat
Copy link
Author

And not only the attributes, but also the values should also be modifiable, so that, for example, you can convert longitudes in the range [0,360] to the range [-180, 180].
But these all (modifiable attributes and values, dataset independent of file) and much more, is already available in YAXArrays.
However, the selection in YAXArrays is not as powerful as in CommonDatasets (I have opened an issue in YAXArrays regarding this).
So, if the developers are interested in not to duplicate efforts, I would let YAXArrays implement all those high level things and maintain CommonDatamodel and GRIBDatasets and NCDatasets in order to guarantee access to these formats.
YAXSArrays cannot read Grib files (by now), but I think it could do it by means of GRIBDatasets (but give time to the developers, I think they are working on it).

@rafaqz
Copy link
Member

rafaqz commented Jun 11, 2024

I agree very much on not duplicating efforts.

Rasters.jl also does all this, pretty similar to YAX (it's Dimensional data underneath for both now) but Rasters wraps CommonDataModel and can read GRIB if you need GRIB. You can then easily convert a Raster to YAX too.

But I'm wondering what selection is more powerful here than in YAX?

@aasdelat
Copy link
Author

Wow! I tested Rasters a long time ago but after your post, I have tested it again right now. It correctly reads the grib files and I can change values and attributes, which means that the dataset is independent of the file.
I will test it and, if I need YAXArrays functionality, I will convert it into a YAXArray by means of YAXArray(dims(raster_dataset), raster_dataset.data)
Regarding the selection by YAXArrays, you can see the issue I posted:
JuliaDataCubes/YAXArrays.jl#404
I have to test if Rasters does it.

@rafaqz
Copy link
Member

rafaqz commented Jun 11, 2024

Great! I don't want to discourage doing this directly with the other packages like NCDatasets.jl.

But Rasters is pretty commited to using CommonDataModel.jl for file IO standardisation now, and doing that lazily with lazy=true when you need it.

So it's probably best we don't all try to cover the same functionality.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants