diff --git a/CHANGELOG.md b/CHANGELOG.md index 45e7d68..03159a1 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -11,6 +11,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0 ### Added +- add ability to derive variables from input datasets [\#34](https://github.com/mllam/mllam-data-prep/pull/34), @ealerskans, @mafdmi - add github PR template to guide development process on github [\#44](https://github.com/mllam/mllam-data-prep/pull/44), @leifdenby - add support for zarr 3.0.0 and above [\#51](https://github.com/mllam/mllam-data-prep/pull/51), @kashif diff --git a/README.md b/README.md index 5f5fcdf..fe19134 100644 --- a/README.md +++ b/README.md @@ -103,7 +103,7 @@ The package can also be used as a python module to create datasets directly, for import mllam_data_prep as mdp config_path = "example.danra.yaml" -config = mdp.Config.from_yaml_file(config_path) +config = mdp.Config.load_config(config_path) ds = mdp.create_dataset(config=config) ``` @@ -112,7 +112,7 @@ ds = mdp.create_dataset(config=config) A full example configuration file is given in [example.danra.yaml](example.danra.yaml), and reproduced here for completeness: ```yaml -schema_version: v0.5.0 +schema_version: v0.6.0 dataset_version: v0.1.0 output: @@ -175,6 +175,26 @@ inputs: variables: # use surface incoming shortwave radiation as forcing - swavr0m + derived_variables: + # derive variables to be used as forcings + toa_radiation: + kwargs: + time: time + lat: lat + lon: lon + function: mllam_data_prep.ops.derive_variable.physical_field.calculate_toa_radiation + hour_of_day_sin: + kwargs: + time: time + extra_kwargs: + component: sin + function: mllam_data_prep.ops.derive_variable.time_components.calculate_hour_of_day + hour_of_day_cos: + kwargs: + time: time + extra_kwargs: + component: cos + function: mllam_data_prep.ops.derive_variable.time_components.calculate_hour_of_day dim_mapping: time: method: rename @@ -286,15 +306,34 @@ inputs: grid_index: method: stack dims: [x, y] - target_architecture_variable: state + target_output_variable: state danra_surface: path: https://mllam-test-data.s3.eu-north-1.amazonaws.com/single_levels.zarr dims: [time, x, y] variables: - # shouldn't really be using sea-surface pressure as "forcing", but don't - # have radiation varibles in danra yet - - pres_seasurface + # use surface incoming shortwave radiation as forcing + - swavr0m + derived_variables: + # derive variables to be used as forcings + toa_radiation: + kwargs: + time: time + lat: lat + lon: lon + function: mllam_data_prep.derive_variable.physical_field.calculate_toa_radiation + hour_of_day_sin: + kwargs: + time: time + extra_kwargs: + component: sin + function: mllam_data_prep.ops.derive_variable.time_components.calculate_hour_of_day + hour_of_day_cos: + kwargs: + time: time + extra_kwargs: + component: cos + function: mllam_data_prep.ops.derive_variable.time_components.calculate_hour_of_day dim_mapping: time: method: rename @@ -305,7 +344,7 @@ inputs: forcing_feature: method: stack_variables_by_var_name name_format: "{var_name}" - target_architecture_variable: forcing + target_output_variable: forcing ... ``` @@ -315,11 +354,45 @@ The `inputs` section defines the source datasets to extract data from. Each sour - `path`: the path to the source dataset. This can be a local path or a URL to e.g. a zarr dataset or netCDF file, anything that can be read by `xarray.open_dataset(...)`. - `dims`: the dimensions that the source dataset is expected to have. 
This is used to check that the source dataset has the expected dimensions and also makes it clearer in the config file what the dimensions of the source dataset are.
- `variables`: selects which variables to extract from the source dataset. This may either be a list of variable names, or a dictionary where each key is the variable name and the value defines a dictionary of coordinates to do selection on. When doing selection you may also optionally define the units of the variable, which are then checked against the units expected by the model architecture.
-- `target_architecture_variable`: the variable in the model architecture that the source dataset should be mapped to.
+- `target_output_variable`: the variable in the model architecture that the source dataset should be mapped to.
- `dim_mapping`: defines how the dimensions of the source dataset should be mapped to the dimensions of the model architecture. This is done by defining a method to apply to each dimension. The methods are:
  - `rename`: simply rename the dimension to the new name
  - `stack`: stack the listed dimensions to create the dimension in the output
  - `stack_variables_by_var_name`: stack the dimension into the new dimension, and also stack the variable name into the new variable name. This is useful when you have multiple variables with the same dimensions that you want to stack into a single variable.
+- `derived_variables`: defines the variables to be derived from the variables available in the source dataset. This should be a dictionary where each key is the name of the variable to be derived and the value is a dictionary with the following entries. See also the 'Derived Variables' section for more details.
+  - `function`: the function used to derive the variable. This should be a string with the full namespace of the function, e.g. `mllam_data_prep.ops.derive_variable.physical_field.calculate_toa_radiation`.
+  - `kwargs`: `function` arguments that should be extracted from the source dataset. This is a dictionary where each key is the name of a variable to select from the source dataset and each value is the name of the corresponding argument to `function`.
+  - `extra_kwargs`: `function` arguments that should not be extracted from the source dataset, such as the extra argument `component` to `mllam_data_prep.ops.derive_variable.time_components.calculate_hour_of_day`, which is a string (either "sin" or "cos") that decides whether the returned field is the sine or the cosine component of the cyclically encoded hour-of-day variable.
+
+#### Derived Variables
+Variables that are not part of the source dataset but can be derived from variables in the source dataset can also be included. They should be defined in their own section, called `derived_variables`, as illustrated in the example config above and in the `example.danra.yaml` config file.
+
+To derive a variable, the function used to derive it (`function`) and the arguments to this function (`kwargs` and `extra_kwargs`) need to be specified, as explained above. In addition, an optional section called `attrs` can be added, in which the user can set attributes on the derived variable, as illustrated below.
+```yaml
+  derived_variables:
+    toa_radiation:
+      kwargs:
+        time: time
+        lat: lat
+        lon: lon
+      function: mllam_data_prep.ops.derive_variable.physical_field.calculate_toa_radiation
+      attrs:
+        units: W*m**-2
+        long_name: top-of-atmosphere incoming radiation
+```
+
+Note that the attributes `units` and `long_name` are required.
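+A function defined outside of `mllam-data-prep` receives the `kwargs`/`extra_kwargs` as named arguments and is expected to return an `xr.DataArray`; it can satisfy the attribute requirement by setting the attributes on the returned field itself. As a purely illustrative sketch (the function and module names below are hypothetical, not part of `mllam-data-prep`), a wind-speed function could look like:
+```python
+import numpy as np
+import xarray as xr
+
+
+def calculate_wind_speed(u: xr.DataArray, v: xr.DataArray) -> xr.DataArray:
+    """Hypothetical user-defined derived-variable function (wind speed)."""
+    wind_speed = np.sqrt(u**2 + v**2)
+    # Setting the required attributes here means the `attrs` section can be
+    # omitted for this derived variable in the config file
+    wind_speed.name = "wind_speed"
+    wind_speed.attrs["units"] = "m*s**-1"
+    wind_speed.attrs["long_name"] = "wind speed"
+    return wind_speed
+```
+The config would then reference it with e.g. `function: my_package.calculate_wind_speed` and `kwargs: {u: u, v: v}` (mapping dataset variable names to function argument names), assuming `my_package` is importable.
+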
+If the function used to derive a variable does not set these attributes, they are **required** to be set in the config file. If using a function defined in `mllam_data_prep.ops.derive_variable`, the `attrs` section is optional, as the attributes should already be set by the function. In this case, adding the `units` and `long_name` attributes to the `attrs` section of the derived variable in the config file will **overwrite** the attributes already defined in the function.
+
+Currently, the following derived variables are included as part of `mllam-data-prep`:
+- `toa_radiation`:
+  - Top-of-atmosphere incoming radiation
+  - function: `mllam_data_prep.ops.derive_variable.physical_field.calculate_toa_radiation`
+- `hour_of_day_[sin/cos]`:
+  - Sine or cosine part of the cyclically encoded hour of day
+  - function: `mllam_data_prep.ops.derive_variable.time_components.calculate_hour_of_day`
+- `day_of_year_[sin/cos]`:
+  - Sine or cosine part of the cyclically encoded day of year
+  - function: `mllam_data_prep.ops.derive_variable.time_components.calculate_day_of_year`

### Config schema versioning

diff --git a/example.danra.yaml b/example.danra.yaml
index 3edf126..5101005 100644
--- a/example.danra.yaml
+++ b/example.danra.yaml
@@ -1,4 +1,4 @@
-schema_version: v0.5.0
+schema_version: v0.6.0
 dataset_version: v0.1.0

 output:
@@ -61,6 +61,26 @@ inputs:
     variables:
       # use surface incoming shortwave radiation as forcing
       - swavr0m
+    derived_variables:
+      # derive variables to be used as forcings
+      toa_radiation:
+        kwargs:
+          time: time
+          lat: lat
+          lon: lon
+        function: mllam_data_prep.ops.derive_variable.physical_field.calculate_toa_radiation
+      hour_of_day_sin:
+        kwargs:
+          time: time
+        extra_kwargs:
+          component: sin
+        function: mllam_data_prep.ops.derive_variable.time_components.calculate_hour_of_day
+      hour_of_day_cos:
+        kwargs:
+          time: time
+        extra_kwargs:
+          component: cos
+        function: mllam_data_prep.ops.derive_variable.time_components.calculate_hour_of_day
     dim_mapping:
       time:
         method: rename
diff --git a/mllam_data_prep/config.py b/mllam_data_prep/config.py
index 14e80ef..9735b9b 100644
--- a/mllam_data_prep/config.py
+++ b/mllam_data_prep/config.py
@@ -9,6 +9,50 @@ class InvalidConfigException(Exception):
     pass


+def validate_config(config_inputs):
+    """
+    Validate that, in the config:
+    - either `variables` or `derived_variables` are present for each input dataset
+    - if both `variables` and `derived_variables` are present, that they don't
+      add the same variables to the dataset
+
+    Parameters
+    ----------
+    config_inputs: Dict[str, InputDataset]
+        The `inputs` section of the config
+
+    Raises
+    ------
+    InvalidConfigException
+        If neither `variables` nor `derived_variables` are defined for an input
+        dataset, or if the two sections define overlapping variable names
+    """
+
+    for input_dataset_name, input_dataset in config_inputs.items():
+        if not input_dataset.variables and not input_dataset.derived_variables:
+            raise InvalidConfigException(
+                f"Input dataset '{input_dataset_name}' is missing the keys `variables` and/or"
+                " `derived_variables`. Make sure that you update the config so that the input"
+                f" dataset '{input_dataset_name}' contains at least either a `variables` or"
+                " `derived_variables` section."
+            )
+        elif input_dataset.variables and input_dataset.derived_variables:
+            # Check that there are no overlapping variables
+            if isinstance(input_dataset.variables, list):
+                variable_vars = input_dataset.variables
+            elif isinstance(input_dataset.variables, dict):
+                variable_vars = input_dataset.variables.keys()
+            else:
+                raise TypeError(
+                    f"Expected an instance of list or dict, but got {type(input_dataset.variables)}."
+ ) + derived_variable_vars = input_dataset.derived_variables.keys() + common_vars = list(set(variable_vars) & set(derived_variable_vars)) + if len(common_vars) > 0: + raise InvalidConfigException( + "Both `variables` and `derived_variables` include the following variables name(s):" + f" '{', '.join(common_vars)}'. This is not allowed. Make sure that there" + " are no overlapping variable names between `variables` and `derived_variables`," + f" either by renaming or removing '{', '.join(common_vars)}' from one of them." + ) + + @dataclass class Range: """ @@ -52,6 +96,34 @@ class ValueSelection: units: str = None +@dataclass +class DerivedVariable: + """ + Defines a derived variables, where the kwargs (variables required for the + calculation, to be extracted from the input dataset) and the function (for + calculating the variable) are specified. Also, if the function has other arguments + which should not be extracted from the dataset (e.g. a string to indicate if the + sine or cosine component should be extracted) these can be specified in the extra_kwargs. + Optionally, in case a function does not return an `xr.DataArray` with the required + attributes (`units` and `long_name`) set, these should be specified in `attrs`, e.g.: + {"attrs": "units": "W*m**-2, "long_name": "top-of-the-atmosphere radiation"}. + Additional attributes can also be set if desired. + + Attributes: + kwargs: Variables required for calculating the derived variable, to be extracted + from the input dataset. + function: Function used to calculate the derived variable. + extra_kwargs: Extra arguments for `function` that should not be extracted from + the input dataset, such as a string. + attrs: Attributes (e.g. `units` and `long_name`) to set for the derived variable. + """ + + kwargs: Dict[str, str] + function: str + extra_kwargs: Optional[Dict[str, str]] = field(default_factory=dict) + attrs: Optional[Dict[str, str]] = field(default_factory=dict) + + @dataclass class DimMapping: """ @@ -120,7 +192,8 @@ class InputDataset: 1) the path to the dataset, 2) the expected dimensions of the dataset, 3) the variables to select from the dataset (and optionally subsection - along the coordinates for each variable) and finally + along the coordinates for each variable) or the variables to derive + from the dataset, and finally 4) the method by which the dimensions and variables of the dataset are mapped to one of the output variables (this includes stacking of all the selected variables into a new single variable along a new coordinate, @@ -134,11 +207,6 @@ class InputDataset: dims: List[str] List of the expected dimensions of the dataset. E.g. `["time", "x", "y"]`. These will be checked to ensure consistency of the dataset being read. - variables: Union[List[str], Dict[str, Dict[str, ValueSelection]]] - List of the variables to select from the dataset. E.g. `["temperature", "precipitation"]` - or a dictionary where the keys are the variable names and the values are dictionaries - defining the selection for each variable. E.g. `{"temperature": levels: {"values": [1000, 950, 900]}}` - would select the "temperature" variable and only the levels 1000, 950, and 900. dim_mapping: Dict[str, DimMapping] Mapping of the variables and dimensions in the input dataset to the dimensions of the output variable (`target_output_variable`). The key is the name of the output dimension to map to @@ -151,14 +219,23 @@ class InputDataset: (e.g. 
two datasets that coincide in space and time will only differ in the feature dimension, so the two will be combined by concatenating along the feature dimension). If a single shared coordinate cannot be found then an exception will be raised. + variables: Union[List[str], Dict[str, Dict[str, ValueSelection]]] + List of the variables to select from the dataset. E.g. `["temperature", "precipitation"]` + or a dictionary where the keys are the variable names and the values are dictionaries + defining the selection for each variable. E.g. `{"temperature": levels: {"values": [1000, 950, 900]}}` + would select the "temperature" variable and only the levels 1000, 950, and 900. + derived_variables: Dict[str, DerivedVariable] + Dictionary of variables to derive from the dataset, where the keys are the names variables will be given and + the values are `DerivedVariable` definitions that specify how to derive a variable. """ path: str dims: List[str] - variables: Union[List[str], Dict[str, Dict[str, ValueSelection]]] dim_mapping: Dict[str, DimMapping] target_output_variable: str - attributes: Dict[str, Any] = None + variables: Optional[Union[List[str], Dict[str, Dict[str, ValueSelection]]]] = None + derived_variables: Optional[Dict[str, DerivedVariable]] = None + attributes: Optional[Dict[str, Any]] = field(default_factory=dict) @dataclass @@ -258,7 +335,7 @@ class Output: variables: Dict[str, List[str]] coord_ranges: Dict[str, Range] = None - chunking: Dict[str, int] = None + chunking: Dict[str, int] = field(default_factory=dict) splitting: Splitting = None @@ -298,6 +375,9 @@ class Config(dataclass_wizard.JSONWizard, dataclass_wizard.YAMLWizard): dataset_version: str extra: Dict[str, Any] = None + def __post_init__(self): + validate_config(self.inputs) + class _(JSONWizard.Meta): raise_on_unknown_json_key = True diff --git a/mllam_data_prep/create_dataset.py b/mllam_data_prep/create_dataset.py index ce53986..b5d06df 100644 --- a/mllam_data_prep/create_dataset.py +++ b/mllam_data_prep/create_dataset.py @@ -11,19 +11,24 @@ from . import __version__ from .config import Config, InvalidConfigException -from .ops.loading import load_and_subset_dataset +from .ops.chunking import chunk_dataset +from .ops.derive_variable import derive_variable +from .ops.loading import load_input_dataset from .ops.mapping import map_dims_and_variables from .ops.selection import select_by_kwargs from .ops.statistics import calc_stats +from .ops.subsetting import extract_variable if Version(zarr.__version__) >= Version("3"): from zarr.codecs import BloscCodec, BloscShuffle else: from numcodecs import Blosc -# the `extra` field in the config that was added between v0.2.0 and v0.5.0 is -# optional, so we can support both v0.2.0 and v0.5.0 -SUPPORTED_CONFIG_VERSIONS = ["v0.2.0", "v0.5.0"] +# The config versions defined in SUPPORTED_CONFIG_VERSIONS are the ones currently supported. 
+# The `extra` field in the config that was added between v0.2.0 and v0.5.0 is optional, and +# the `derived_variables` field in the config added in v0.6.0 is also optional, so we can +# support v0.2.0, v0.5.0, and v0.6.0 +SUPPORTED_CONFIG_VERSIONS = ["v0.2.0", "v0.5.0", "v0.6.0"] def _check_dataset_attributes(ds, expected_attributes, dataset_name): @@ -36,11 +41,14 @@ def _check_dataset_attributes(ds, expected_attributes, dataset_name): # check for attributes having the wrong value incorrect_attributes = { - k: v for k, v in expected_attributes.items() if ds.attrs[k] != v + key: val for key, val in expected_attributes.items() if ds.attrs[key] != val } if len(incorrect_attributes) > 0: s_list = "\n".join( - [f"{k}: {v} != {ds.attrs[k]}" for k, v in incorrect_attributes.items()] + [ + f"{key}: {val} != {ds.attrs[key]}" + for key, val in incorrect_attributes.items() + ] ) raise ValueError( f"Dataset {dataset_name} has the following incorrect attributes: {s_list}" @@ -126,23 +134,58 @@ def create_dataset(config: Config): output_config = config.output output_coord_ranges = output_config.coord_ranges + chunking_config = config.output.chunking dataarrays_by_target = defaultdict(list) for dataset_name, input_config in config.inputs.items(): path = input_config.path - variables = input_config.variables + selected_variables = input_config.variables + derived_variables = input_config.derived_variables target_output_var = input_config.target_output_variable - expected_input_attributes = input_config.attributes or {} + expected_input_attributes = input_config.attributes expected_input_var_dims = input_config.dims output_dims = output_config.variables[target_output_var] logger.info(f"Loading dataset {dataset_name} from {path}") try: - ds = load_and_subset_dataset(fp=path, variables=variables) + ds_input = load_input_dataset(fp=path) except Exception as ex: raise Exception(f"Error loading dataset {dataset_name} from {path}") from ex + + # Initialize the output dataset and add dimensions + ds = xr.Dataset() + ds.attrs.update(ds_input.attrs) + for dim in ds_input.dims: + ds = ds.assign_coords({dim: ds_input.coords[dim]}) + + if selected_variables: + logger.info(f"Extracting selected variables from dataset {dataset_name}") + if isinstance(selected_variables, dict): + for var_name, coords_to_sample in selected_variables.items(): + ds[var_name] = extract_variable( + ds=ds_input, + var_name=var_name, + coords_to_sample=coords_to_sample, + ) + elif isinstance(selected_variables, list): + for var_name in selected_variables: + ds[var_name] = extract_variable(ds=ds_input, var_name=var_name) + else: + raise ValueError( + "The `variables` argument should be a list or a dictionary" + ) + + if derived_variables: + logger.info(f"Deriving variables from {dataset_name}") + for var_name, derived_variable in derived_variables.items(): + ds[var_name] = derive_variable( + ds=ds_input, + derived_variable=derived_variable, + chunking=chunking_config, + ) + _check_dataset_attributes( ds=ds, expected_attributes=expected_input_attributes, @@ -197,10 +240,9 @@ def create_dataset(config: Config): # default to making a single chunk for each dimension if chunksize is not specified # in the config - chunking_config = config.output.chunking or {} logger.info(f"Chunking dataset with {chunking_config}") - chunks = {d: chunking_config.get(d, int(ds[d].count())) for d in ds.dims} - ds = ds.chunk(chunks) + chunks = {dim: chunking_config.get(dim, int(ds[dim].count())) for dim in ds.dims} + ds = chunk_dataset(ds, chunks) splitting = 
config.output.splitting diff --git a/mllam_data_prep/ops/__init__.py b/mllam_data_prep/ops/__init__.py index e69de29..877cdfb 100644 --- a/mllam_data_prep/ops/__init__.py +++ b/mllam_data_prep/ops/__init__.py @@ -0,0 +1 @@ +from . import derive_variable diff --git a/mllam_data_prep/ops/chunking.py b/mllam_data_prep/ops/chunking.py new file mode 100644 index 0000000..dfac4b1 --- /dev/null +++ b/mllam_data_prep/ops/chunking.py @@ -0,0 +1,72 @@ +import numpy as np +from loguru import logger + +# Max chunk size warning +CHUNK_MAX_SIZE_WARNING = 1 * 1024**3 # 1GB + + +def check_chunk_size(ds, chunks): + """ + Check the chunk size and warn if it exceed CHUNK_MAX_SIZE_WARNING. + + Parameters + ---------- + ds: xr.Dataset + Dataset to be chunked + chunks: Dict[str, int] + Dictionary with keys as dimensions to be chunked and + chunk sizes as the values + + Returns + ------- + ds: xr.Dataset + Dataset with chunking applied + """ + + for var_name, var_data in ds.data_vars.items(): + total_size = 1 + + for dim, chunk_size in chunks.items(): + dim_size = ds.sizes.get(dim, None) + if dim_size is None: + raise KeyError(f"Dimension '{dim}' not found in the dataset.") + total_size *= chunk_size + + dtype = var_data.dtype + bytes_per_element = np.dtype(dtype).itemsize + + memory_usage = total_size * bytes_per_element + + if memory_usage > CHUNK_MAX_SIZE_WARNING: + logger.warning( + f"The chunk size for '{var_name}' exceeds '{CHUNK_MAX_SIZE_WARNING}' GB." + ) + + +def chunk_dataset(ds, chunks): + """ + Check the chunk size and chunk dataset. + + Parameters + ---------- + ds: xr.Dataset + Dataset to be chunked + chunks: Dict[str, int] + Dictionary with keys as dimensions to be chunked and + chunk sizes as the values + + Returns + ------- + ds: xr.Dataset + Dataset with chunking applied + """ + # Check the chunk size + check_chunk_size(ds, chunks) + + # Try chunking + try: + ds = ds.chunk(chunks) + except Exception as ex: + raise Exception(f"Error chunking dataset: {ex}") + + return ds diff --git a/mllam_data_prep/ops/derive_variable/__init__.py b/mllam_data_prep/ops/derive_variable/__init__.py new file mode 100644 index 0000000..cc455e7 --- /dev/null +++ b/mllam_data_prep/ops/derive_variable/__init__.py @@ -0,0 +1,3 @@ +from .main import derive_variable +from .physical_field import calculate_toa_radiation +from .time_components import calculate_day_of_year, calculate_hour_of_day diff --git a/mllam_data_prep/ops/derive_variable/main.py b/mllam_data_prep/ops/derive_variable/main.py new file mode 100644 index 0000000..da6ee58 --- /dev/null +++ b/mllam_data_prep/ops/derive_variable/main.py @@ -0,0 +1,276 @@ +""" +Handle deriving new variables (xr.DataArrays) from an individual input dataset +that has been loaded. This makes it possible to for example add fields that can +be derived from analytical expressions and are functions of coordinate values +(e.g. top-of-atmosphere incoming radiation is a function of time and lat/lon location), +but also of other physical fields (wind-speed is a function of both meridional +and zonal wind components). 
+""" + +import importlib +import sys + +import xarray as xr +from loguru import logger + +from ..chunking import chunk_dataset + +REQUIRED_FIELD_ATTRIBUTES = ["units", "long_name"] + + +def derive_variable(ds, derived_variable, chunking): + """ + Load the dataset, and derive the specified variables + + Parameters + --------- + ds : xr.Dataset + Input dataset + derived_variable : Dict[str, DerivedVariable] + Dictionary with the variables to derive with keys as the variable + names and values with entries for kwargs and function to use in + the calculation + chunking: Dict[str, int] + Dictionary with keys as the dimensions to chunk along and values + with the chunk size + + Returns + ------- + xr.Dataset + Dataset with derived variables included + """ + + target_dims = list(ds.sizes.keys()) + + ds_kwargs = derived_variable.kwargs + extra_kwargs = derived_variable.extra_kwargs + function_namespace = derived_variable.function + expected_field_attributes = derived_variable.attrs + + # Separate the lat,lon from the required variables as these will be derived separately + logger.warning( + "Assuming that the lat/lon coordinates are given as variables called" + " 'lat' and 'lon'." + ) + latlon_coords_to_include = {} + for key in list(ds_kwargs.keys()): + if key in ["lat", "lon"]: + latlon_coords_to_include[key] = ds_kwargs.pop(key) + + # Get subset of input dataset for calculating derived variables + ds_subset = ds[ds_kwargs.keys()] + + # Chunking is needed for coordinates used to derive a variable since they are + # not lazily loaded, as otherwise one might run into memory issues if using a + # large dataset as input. + # Any coordinates needed for the derivation, for which chunking should be performed, + # should be converted to variables since it is not possible for *indexed* coordinates + # to be chunked dask arrays + chunks = { + dim: chunking.get(dim, int(ds_subset[dim].count())) for dim in ds_subset.dims + } + required_coordinates = [ + ds_var for ds_var in ds_kwargs.keys() if ds_var in ds_subset.coords + ] + ds_subset = ds_subset.drop_indexes(required_coordinates, errors="ignore") + for req_coord in required_coordinates: + if req_coord in chunks: + ds_subset = ds_subset.reset_coords(req_coord) + + # Chunk the dataset + ds_subset = chunk_dataset(ds_subset, chunks) + + # Add function arguments to kwargs + kwargs = {} + # - Add lat, and lon, if used as arguments + if len(latlon_coords_to_include): + latlon = get_latlon_coords_for_input(ds) + for key, val in latlon_coords_to_include.items(): + kwargs[val] = latlon[key] + # Add variables extracted from the input dataset + kwargs.update({val: ds_subset[key] for key, val in ds_kwargs.items()}) + # Add extra arguments + kwargs.update(extra_kwargs) + + # Get the function + func = _get_derived_variable_function(function_namespace) + + # Calculate the derived variable + derived_field = func(**kwargs) + + if isinstance(derived_field, xr.DataArray): + # Check that the derived field has the necessary attributes + # (REQUIRED_FIELD_ATTRIBUTES) set, and set them if not + derived_field_attrs = _check_and_get_required_attributes( + derived_field, expected_field_attributes + ) + derived_field.attrs.update(derived_field_attrs) + + # Return any dropped/reset coordinates + derived_field = _return_dropped_coordinates( + derived_field, ds_subset, required_coordinates, chunks + ) + + # Align the derived field to the output dataset dimensions (if necessary) + derived_field = _align_derived_variable(derived_field, ds, target_dims) + else: + raise TypeError( + 
f"Expected an instance of xr.DataArray, but got {type(derived_field)}." + ) + + return derived_field + + +def _get_derived_variable_function(function_namespace): + """ + Function for getting the function for deriving + the specified variable. + + Parameters + ---------- + function_namespace: str + The full function namespace + + Returns + ------- + function: object + Function for deriving the specified variable + """ + # Get module and function names + module_name, _, function_name = function_namespace.rpartition(".") + + # Import the module (if necessary) + if module_name in sys.modules: + module = sys.modules[module_name] + else: + module = importlib.import_module(module_name) + + # Get the function from the module + function = getattr(module, function_name) + + return function + + +def _check_and_get_required_attributes(field, expected_attributes): + """ + Check if the required attributes of the derived variable are set. + If not set, get them from the config. + If set and defined in the config, get the attributes from the config + and use them for overwriting the attributes defined in the function. + + Parameters + ---------- + field: xr.DataArray + The derived field + expected_attributes: Dict[str, str] + Dictionary with expected attributes for the derived variables. + Defined in the config file. + + Returns + ------- + field: xr.DataArray + The derived field + """ + + attrs = {} + for attribute in REQUIRED_FIELD_ATTRIBUTES: + if attribute not in field.attrs or field.attrs[attribute] is None: + if attribute in expected_attributes.keys(): + attrs[attribute] = expected_attributes[attribute] + else: + # The expected attributes are empty and the attributes have not been + # set during the calculation of the derived variable + raise KeyError( + f'The attribute "{attribute}" has not been set for the derived' + f' variable "{field.name}". This is most likely because you are' + " using a function external to `mlllam-data-prep` to derive the field," + f" in which the required attributes ({', '.join(REQUIRED_FIELD_ATTRIBUTES)})" + " are not set. If they are not set in the function call when deriving the field," + ' they can be set in the config file by adding an "attrs" section under the' + f' "{field.name}" derived variable section. For example, if the required attributes' + f" ({', '.join(REQUIRED_FIELD_ATTRIBUTES)}) are not set for a derived variable named" + f' "toa_radiation" they can be set by adding the following to the config file:' + ' {"attrs": {"units": "W*m**-2", "long_name": "top-of-atmosphere incoming radiation"}}.' + ) + elif attribute in expected_attributes.keys(): + logger.warning( + f"The attribute '{attribute}' of the derived field" + f" {field.name} is being overwritten from" + f" '{field.attrs[attribute]}' to" + f" '{expected_attributes[attribute]}' according" + " to the specification in the config file." + ) + attrs[attribute] = expected_attributes[attribute] + else: + # Attributes are set in the function and nothing has been defined in the config file + attrs[attribute] = field.attrs[attribute] + + return attrs + + +def _return_dropped_coordinates(field, ds, required_coordinates, chunks): + """ + Return the coordinates that have been dropped/reset. 
+ + Parameters + ---------- + field: xr.DataArray + Derived variable + ds: xr.Dataset + Dataset with required coordinatwes + required_coordinates: List[str] + List of coordinates required for the derived variable + chunks: Dict[str, int] + Dictionary with keys as dimensions to be chunked and + chunk sizes as the values + + Returns + ------- + field: xr.DataArray + Derived variable, now also with dropped coordinates returned + """ + for req_coord in required_coordinates: + if req_coord in chunks: + field.coords[req_coord] = ds[req_coord] + + return field + + +def _align_derived_variable(field, ds, target_dims): + """ + Align a derived variable to the target dimensions (ignoring non-dimension coordinates). + + Parameters + ---------- + field: xr.DataArray + Derived field to align + ds: xr.Dataset + Target dataset + target_dims: List[str] + Dimensions to align to (e.g. 'time', 'y', 'x') + + Returns + ------- + field: xr.DataArray + The derived field aligned to the target dimensions + """ + # Ensure that dimensions are ordered correctly + field = field.transpose( + *[dim for dim in target_dims if dim in field.dims], missing_dims="ignore" + ) + + # Add missing dimensions explicitly + for dim in target_dims: + if dim not in field.dims: + field = field.expand_dims({dim: ds.sizes[dim]}) + + # Broadcast to match only the target dimensions + broadcast_shape = {dim: ds[dim] for dim in target_dims if dim in ds.dims} + field = field.broadcast_like(xr.Dataset(coords=broadcast_shape)) + + return field + + +def get_latlon_coords_for_input(ds): + """Dummy function for getting lat and lon.""" + return ds[["lat", "lon"]].chunk(-1, -1) diff --git a/mllam_data_prep/ops/derive_variable/physical_field.py b/mllam_data_prep/ops/derive_variable/physical_field.py new file mode 100644 index 0000000..d7b9617 --- /dev/null +++ b/mllam_data_prep/ops/derive_variable/physical_field.py @@ -0,0 +1,74 @@ +""" +Contains functions used to derive physical fields. This can be both +fields that can be derived from analytical expressions and are functions +of coordinate values (e.g. top-of-atmosphere incoming radiation is a function +of time and lat/lon location), but also of other physical fields, such as +wind speed, which is a function of both meridional and zonal wind components. +""" +import datetime + +import numpy as np +import xarray as xr +from loguru import logger + + +def calculate_toa_radiation(lat, lon, time): + """ + Function for calculating top-of-atmosphere incoming radiation + + Parameters + ---------- + lat : Union[xr.DataArray, float] + Latitude values. Should be in the range [-90, 90] + lon : Union[xr.DataArray, float] + Longitude values. Should be in the range [-180, 180] or [0, 360] + time : Union[xr.DataArray, datetime.datetime] + Time + + Returns + ------- + toa_radiation : Union[xr.DataArray, float] + Top-of-atmosphere incoming radiation + """ + logger.info("Calculating top-of-atmosphere incoming radiation") + + # Solar constant + solar_constant = 1366 # W*m**-2 + + # Different handling if xr.DataArray or datetime object + if isinstance(time, xr.DataArray): + day = time.dt.dayofyear + hour_utc = time.dt.hour + elif isinstance(time, datetime.datetime): + day = time.timetuple().tm_yday + hour_utc = time.hour + else: + raise TypeError( + "Expected an instance of xr.DataArray or datetime object," + f" but got {type(time)}." + ) + + # Eq. 1.6.1a in Solar Engineering of Thermal Processes 4th ed. + # dec: declination - angular position of the sun at solar noon w.r.t. 
+ # the plane of the equator + dec = np.pi / 180 * 23.45 * np.sin(2 * np.pi * (284 + day) / 365) + + utc_solar_time = hour_utc + lon / 15 + hour_angle = 15 * (utc_solar_time - 12) + + # Eq. 1.6.2 with beta=0 in Solar Engineering of Thermal Processes 4th ed. + # cos_sza: Cosine of solar zenith angle + cos_sza = np.sin(lat * np.pi / 180) * np.sin(dec) + np.cos( + lat * np.pi / 180 + ) * np.cos(dec) * np.cos(hour_angle * np.pi / 180) + + # Where TOA radiation is negative, set to 0 + toa_radiation = xr.where(solar_constant * cos_sza < 0, 0, solar_constant * cos_sza) + + if isinstance(toa_radiation, xr.DataArray): + # Add attributes + toa_radiation.name = "toa_radiation" + toa_radiation.attrs["long_name"] = "top-of-atmosphere incoming radiation" + toa_radiation.attrs["units"] = "W*m**-2" + + return toa_radiation diff --git a/mllam_data_prep/ops/derive_variable/time_components.py b/mllam_data_prep/ops/derive_variable/time_components.py new file mode 100644 index 0000000..5329e12 --- /dev/null +++ b/mllam_data_prep/ops/derive_variable/time_components.py @@ -0,0 +1,113 @@ +""" +Contains functions used to derive time component fields, such as e.g. day of year +and hour of day. +""" +import datetime + +import numpy as np +import xarray as xr +from loguru import logger + + +def calculate_hour_of_day(time, component): + """ + Function for calculating hour of day features with a cyclic encoding + + Parameters + ---------- + time: Union[xr.DataArray, datetime.datetime] + Time + component: str + String indicating if the sine or cosine component of the encoding + should be returned + + Returns + ------- + hour_of_day_encoded: Union[xr.DataArray, float] + sine or cosine of the hour of day + """ + logger.info("Calculating hour of day") + + # Get the hour of the day + if isinstance(time, xr.DataArray): + hour_of_day = time.dt.hour + elif isinstance(time, datetime.datetime): + hour_of_day = time.hour + else: + raise TypeError( + "Expected an instance of xr.DataArray or datetime object," + f" but got {type(time)}." + ) + + # Cyclic encoding of hour of day + if component == "sin": + hour_of_day_encoded = np.sin((hour_of_day / 24) * 2 * np.pi) + elif component == "cos": + hour_of_day_encoded = np.cos((hour_of_day / 24) * 2 * np.pi) + else: + raise ValueError( + f"Invalid value of `component`: '{component}'. Expected one of: 'cos' or 'sin'." + " Please update the config accordingly." + ) + + if isinstance(hour_of_day_encoded, xr.DataArray): + # Add attributes + hour_of_day_encoded.name = "hour_of_day_" + component + hour_of_day_encoded.attrs[ + "long_name" + ] = f"{component.capitalize()} component of cyclically encoded hour of day" + hour_of_day_encoded.attrs["units"] = "1" + + return hour_of_day_encoded + + +def calculate_day_of_year(time, component): + """ + Function for calculating day of year features with a cyclic encoding + + Parameters + ---------- + time : Union[xr.DataArray, datetime.datetime] + Time + component: str + String indicating if the sine or cosine component of the encoding + should be returned + + Returns + ------- + day_of_year_encoded: Union[xr.DataArray, float] + sine or cosine of the day of year + """ + logger.info("Calculating day of year") + + # Get the day of year + if isinstance(time, xr.DataArray): + day_of_year = time.dt.dayofyear + elif isinstance(time, datetime.datetime): + day_of_year = time.timetuple().tm_yday + else: + raise TypeError( + "Expected an instance of xr.DataArray or datetime object," + f" but got {type(time)}." 
+ ) + + # Cyclic encoding of day of year - use 366 to include leap years! + if component == "sin": + day_of_year_encoded = np.sin((day_of_year / 366) * 2 * np.pi) + elif component == "cos": + day_of_year_encoded = np.cos((day_of_year / 366) * 2 * np.pi) + else: + raise ValueError( + f"Invalid value of `component`: '{component}'. Expected one of: 'cos' or 'sin'." + " Please update the config accordingly." + ) + + if isinstance(day_of_year_encoded, xr.DataArray): + # Add attributes + day_of_year_encoded.name = "day_of_year_" + component + day_of_year_encoded.attrs[ + "long_name" + ] = f"{component.capitalize()} component of cyclically encoded day of year" + day_of_year_encoded.attrs["units"] = "1" + + return day_of_year_encoded diff --git a/mllam_data_prep/ops/loading.py b/mllam_data_prep/ops/loading.py index 955fafd..f6bfc34 100644 --- a/mllam_data_prep/ops/loading.py +++ b/mllam_data_prep/ops/loading.py @@ -1,20 +1,20 @@ import xarray as xr -def load_and_subset_dataset(fp, variables): +def load_input_dataset(fp): """ - Load the dataset, subset the variables along the specified coordinates and - check coordinate units + Load the dataset Parameters ---------- fp : str Filepath to the source dataset, for example the path to a zarr dataset or a netCDF file (anything that is supported by `xarray.open_dataset` will work) - variables : dict - Dictionary with the variables to subset - with keys as the variable names and values with entries for each - coordinate and coordinate values to extract + + Returns + ------- + ds: xr.Dataset + Source dataset """ try: @@ -22,36 +22,4 @@ def load_and_subset_dataset(fp, variables): except ValueError: ds = xr.open_dataset(fp) - ds_subset = xr.Dataset() - ds_subset.attrs.update(ds.attrs) - if isinstance(variables, dict): - for var, coords_to_sample in variables.items(): - da = ds[var] - for coord, sampling in coords_to_sample.items(): - coord_values = sampling.values - try: - da = da.sel(**{coord: coord_values}) - except KeyError as ex: - raise KeyError( - f"Could not find the all coordinate values `{coord_values}` in " - f"coordinate `{coord}` in the dataset" - ) from ex - expected_units = sampling.units - coord_units = da[coord].attrs.get("units", None) - if coord_units is not None and coord_units != expected_units: - raise ValueError( - f"Expected units {expected_units} for coordinate {coord}" - f" in variable {var} but got {coord_units}" - ) - ds_subset[var] = da - elif isinstance(variables, list): - try: - ds_subset = ds[variables] - except KeyError as ex: - raise KeyError( - f"Could not find the all variables `{variables}` in the dataset. " - f"The available variables are {list(ds.data_vars)}" - ) from ex - else: - raise ValueError("The `variables` argument should be a list or a dictionary") - return ds_subset + return ds diff --git a/mllam_data_prep/ops/subsetting.py b/mllam_data_prep/ops/subsetting.py new file mode 100644 index 0000000..abdd59a --- /dev/null +++ b/mllam_data_prep/ops/subsetting.py @@ -0,0 +1,50 @@ +def extract_variable(ds, var_name, coords_to_sample=dict()): + """ + Extract specified variable from the provided the input dataset. If + coordinates for subsetting are defines, then subset the variable along + them and check coordinate units + + Parameters + ---------- + ds : xr.Dataset + Input dataset + var_name : Union[Dict, List] + Either a list or dictionary with variables to extract. 
+ If a dictionary the keys are the variable name and the values are + entries for each coordinate and coordinate values to extract + coords_to_sample: Dict + Optional argument for subsetting/sampling along the specified + coordinates + + Returns + ---------- + da: xr.DataArray + Extracted variable (subsetted along the specified coordinates) + """ + + try: + da = ds[var_name] + except KeyError as ex: + raise KeyError( + f"Could not find the variable `{var_name}` in the dataset. " + f"The available variables are {list(ds.data_vars)}" + ) from ex + + for coord, sampling in coords_to_sample.items(): + coord_values = sampling.values + try: + da = da.sel(**{coord: coord_values}) + except KeyError as ex: + raise KeyError( + f"Could not find the all coordinate values `{coord_values}` in " + f"coordinate `{coord}` in the dataset" + ) from ex + expected_units = sampling.units + coord_units = da[coord].attrs.get("units", None) + if coord_units is not None and coord_units != expected_units: + raise ValueError( + f"Expected units {expected_units} for coordinate {coord}" + f" in variable {var_name} but got {coord_units}" + ) + + return da diff --git a/pyproject.toml b/pyproject.toml index 3edf350..0059e32 100644 --- a/pyproject.toml +++ b/pyproject.toml @@ -9,6 +9,7 @@ authors = [ {name = "Eleni Briola", email = "elb@dmi.dk"}, {name = "Joel Oskarsson", email = "joel.oskarsson@liu.se"}, {name = "Kashif Rasul", email = "kashif.rasul@gmail.com"}, + {name = "Martin Frølund", email = "maf@dmi.dk"}, ] dependencies = [ "xarray>=2024.2.0", diff --git a/tests/derive_variable/conftest.py b/tests/derive_variable/conftest.py new file mode 100644 index 0000000..913d0cc --- /dev/null +++ b/tests/derive_variable/conftest.py @@ -0,0 +1,32 @@ +"""Fixtures for the derive_variable module tests.""" + +import datetime +from typing import List + +import isodate +import numpy as np +import pandas as pd +import pytest +import xarray as xr + + +@pytest.fixture(name="time") +def fixture_time(request) -> List[np.datetime64 | datetime.datetime | xr.DataArray]: + """Fixture that returns test time data + + The fixture has to be indirectly parametrized with the number of time steps. 
+ """ + ntime = request.param + return [ + np.datetime64("2004-06-11T00:00:00"), # invalid type + isodate.parse_datetime("1999-03-21T00:00"), + xr.DataArray( + pd.date_range( + start=isodate.parse_datetime("1999-03-21T00:00"), + periods=ntime, + freq=isodate.parse_duration("PT1H"), + ), + dims=["time"], + name="time", + ), + ] diff --git a/tests/derive_variable/test_main.py b/tests/derive_variable/test_main.py new file mode 100644 index 0000000..804213a --- /dev/null +++ b/tests/derive_variable/test_main.py @@ -0,0 +1,115 @@ +"""Unit tests for the main module of the derive_variable operations.""" + +import sys +from types import ModuleType +from typing import Generator +from unittest.mock import MagicMock, patch + +import pytest +import xarray as xr + +from mllam_data_prep.ops.derive_variable.main import ( + _check_and_get_required_attributes, + _get_derived_variable_function, +) + + +@pytest.fixture(name="mock_import_module") +def fixture_mock_import_module() -> Generator[MagicMock, None, None]: + """Fixture to mock importlib.import_module.""" + with patch("importlib.import_module") as mock: + yield mock + + +@pytest.fixture() +def fixture_mock_sys_modules() -> Generator[None, None, None]: + """Fixture to mock sys.modules.""" + with patch.dict("sys.modules", {}): + yield + + +class TestGetDerivedVariableFunction: + """Tests for the _get_derived_variable_function.""" + + @pytest.mark.usefixtures("fixture_mock_sys_modules") + def test_function_in_sys_modules(self, mock_import_module: MagicMock) -> None: + """Test when the function to import is already in sys.modules.""" + # Mock the module and function + mock_module: ModuleType = MagicMock() + mock_function: MagicMock = MagicMock() + sys.modules["mock_module"] = mock_module + mock_module.mock_function = mock_function + + # Call the function + result = _get_derived_variable_function("mock_module.mock_function") + + # Assert the function is returned correctly + assert result == mock_function + + # Assert the module was not imported + mock_import_module.assert_not_called() + + def test_function_not_in_sys_modules(self, mock_import_module: MagicMock) -> None: + """Test when the function to import is not in sys.modules.""" + # Mock the module and function + mock_module: ModuleType = MagicMock() + mock_function: MagicMock = MagicMock() + mock_import_module.return_value = mock_module + mock_module.mock_function = mock_function + + # Call the function + result = _get_derived_variable_function("mock_module.mock_function") + + # Assert the function is returned correctly + assert result == mock_function + + +@patch( + "mllam_data_prep.ops.derive_variable.main.REQUIRED_FIELD_ATTRIBUTES", + ["units", "long_name"], +) +class TestCheckAndGetRequiredAttributes: + """Tests for the _check_and_get_required_attributes function.""" + + @pytest.mark.parametrize( + ["field_attrs", "expected_attributes", "expected_result"], + [ + [ + {"units": "m", "long_name": "test"}, + {"units": "m", "long_name": "test"}, + {"units": "m", "long_name": "test"}, + ], + [ + {"units": "m", "long_name": "test"}, + {}, + {"units": "m", "long_name": "test"}, + ], + [ + {"units": "m"}, + {"units": "m", "long_name": "test"}, + {"units": "m", "long_name": "test"}, + ], + [ + {"units": "m", "long_name": "old_name"}, + {"units": "m", "long_name": "new_name"}, + {"units": "m", "long_name": "new_name"}, + ], + ], + ) + def test_valid_input( + self, field_attrs, expected_attributes, expected_result + ) -> None: + """Test that the function returns the correct attributes with valid input.""" + 
field = xr.DataArray([1, 2, 3], attrs=field_attrs) + + result = _check_and_get_required_attributes(field, expected_attributes) + + assert result == expected_result + + def test_missing_attributes_raises_key_error(self) -> None: + """Test when required attributes are missing and not in expected attributes.""" + field = xr.DataArray([1, 2, 3], attrs={"units": "m"}) + expected_attributes = {"units": "m"} + + with pytest.raises(KeyError): + _check_and_get_required_attributes(field, expected_attributes) diff --git a/tests/derive_variable/test_physical_field.py b/tests/derive_variable/test_physical_field.py new file mode 100644 index 0000000..a3ee7b5 --- /dev/null +++ b/tests/derive_variable/test_physical_field.py @@ -0,0 +1,77 @@ +"""Unit tests for the `mllam_data_prep.ops.derive_variable.physical_field` module.""" + +import datetime +from typing import List + +import numpy as np +import pytest +import xarray as xr + +from mllam_data_prep.ops.derive_variable.physical_field import calculate_toa_radiation + + +@pytest.fixture(name="lat") +def fixture_lat(request) -> List[float | xr.DataArray]: + """Fixture that returns test latitude data + + The fixture has to be indirectly parametrized with the number of coordinates, + the minimum and maximum latitude values. + """ + ncoord, lat_min, lat_max = request.param + return [ + 55.711, + xr.DataArray( + np.random.uniform(lat_min, lat_max, size=(ncoord, ncoord)), + dims=["x", "y"], + coords={"x": np.arange(ncoord), "y": np.arange(ncoord)}, + name="lat", + ), + ] + + +@pytest.fixture(name="lon") +def fixture_lon(request) -> List[float | xr.DataArray]: + """Fixture that returns test longitude data + + The fixture has to be indirectly parametrized with the number of coordinates, + the minimum and maximum longitude values. + """ + ncoord, lon_min, lon_max = request.param + return [ + 12.564, + xr.DataArray( + np.random.uniform(lon_min, lon_max, size=(ncoord, ncoord)), + dims=["x", "y"], + coords={"x": np.arange(ncoord), "y": np.arange(ncoord)}, + name="lon", + ), + ] + + +@pytest.mark.parametrize( + "lat", + # Format: (ncoord, lat_min, lat_max) + [(10, -90, 90), (10, -40, 40), (10, 40, -40), (10, -10, 10), (1000, -40, 40)], + indirect=True, +) +@pytest.mark.parametrize( + "lon", + # Format: (ncoord, lon_min, lon_max) + [(10, 0, 360), (10, -180, 180), (10, -90, 90), (10, 100, 110), (1000, -180, 180)], + indirect=True, +) +@pytest.mark.parametrize("time", [1, 10, 100], indirect=True) +def test_toa_radiation( + lat: float | xr.DataArray, + lon: float | xr.DataArray, + time: np.datetime64 | datetime.datetime | xr.DataArray, +): + """Test the `calculate_toa_radiation` function. + + Function from mllam_data_prep.ops.derive_variable.physical_field. 
+ """ + if isinstance(time, (xr.DataArray, datetime.datetime)): + calculate_toa_radiation(lat, lon, time) + else: + with pytest.raises(TypeError): + calculate_toa_radiation(lat, lon, time) diff --git a/tests/derive_variable/test_time_components.py b/tests/derive_variable/test_time_components.py new file mode 100644 index 0000000..69c8d54 --- /dev/null +++ b/tests/derive_variable/test_time_components.py @@ -0,0 +1,57 @@ +"""Unit tests for the `mllam_data_prep.ops.derive_variable.time_components` module.""" + +import datetime + +import numpy as np +import pytest +import xarray as xr + +from mllam_data_prep.ops.derive_variable.time_components import ( + calculate_day_of_year, + calculate_hour_of_day, +) + + +@pytest.mark.parametrize("time", [1, 10, 1000], indirect=True) +@pytest.mark.parametrize( + "component", + [ + "cos", + "sin", + ], +) +def test_hour_of_day( + time: np.datetime64 | datetime.datetime | xr.DataArray, component: str +): + """Test the `calculate_hour_of_day` function. + + Function from mllam_data_prep.ops.derive_variable.time_components. + """ + if isinstance(time, (xr.DataArray, datetime.datetime)): + calculate_hour_of_day(time, component=component) + else: + with pytest.raises(TypeError): + calculate_hour_of_day(time, component=component) + + +@pytest.mark.parametrize("time", [1, 10, 1000], indirect=True) +@pytest.mark.parametrize( + "component", + [ + "cos", + "sin", + ], +) +def test_day_of_year( + time: np.datetime64 | datetime.datetime | xr.DataArray, component: str +): + """Test the `calculate_day_of_year` function. + + Function from mllam_data_prep.ops.derive_variable.time_components. + """ + + if isinstance(time, (xr.DataArray, datetime.datetime)): + calculate_day_of_year(time, component=component) + else: + with pytest.raises(TypeError): + calculate_day_of_year(time, component=component) diff --git a/tests/old_config_schema_examples/v0.5.0/example.danra.yaml b/tests/old_config_schema_examples/v0.5.0/example.danra.yaml new file mode 100644 index 0000000..3edf126 --- /dev/null +++ b/tests/old_config_schema_examples/v0.5.0/example.danra.yaml @@ -0,0 +1,99 @@ +schema_version: v0.5.0 +dataset_version: v0.1.0 + +output: + variables: + static: [grid_index, static_feature] + state: [time, grid_index, state_feature] + forcing: [time, grid_index, forcing_feature] + coord_ranges: + time: + start: 1990-09-03T00:00 + end: 1990-09-09T00:00 + step: PT3H + chunking: + time: 1 + splitting: + dim: time + splits: + train: + start: 1990-09-03T00:00 + end: 1990-09-06T00:00 + compute_statistics: + ops: [mean, std, diff_mean, diff_std] + dims: [grid_index, time] + val: + start: 1990-09-06T00:00 + end: 1990-09-07T00:00 + test: + start: 1990-09-07T00:00 + end: 1990-09-09T00:00 + +inputs: + danra_height_levels: + path: https://mllam-test-data.s3.eu-north-1.amazonaws.com/height_levels.zarr + dims: [time, x, y, altitude] + variables: + u: + altitude: + values: [100,] + units: m + v: + altitude: + values: [100, ] + units: m + dim_mapping: + time: + method: rename + dim: time + state_feature: + method: stack_variables_by_var_name + dims: [altitude] + name_format: "{var_name}{altitude}m" + grid_index: + method: stack + dims: [x, y] + target_output_variable: state + + danra_surface: + path: https://mllam-test-data.s3.eu-north-1.amazonaws.com/single_levels.zarr + dims: [time, x, y] + variables: + # use surface incoming shortwave radiation as forcing + - swavr0m + dim_mapping: + time: + method: rename + dim: time + grid_index: + method: stack + dims: [x, y] + forcing_feature: + method: 
stack_variables_by_var_name + name_format: "{var_name}" + target_output_variable: forcing + + danra_lsm: + path: https://mllam-test-data.s3.eu-north-1.amazonaws.com/lsm.zarr + dims: [x, y] + variables: + - lsm + dim_mapping: + grid_index: + method: stack + dims: [x, y] + static_feature: + method: stack_variables_by_var_name + name_format: "{var_name}" + target_output_variable: static + +extra: + projection: + class_name: LambertConformal + kwargs: + central_longitude: 25.0 + central_latitude: 56.7 + standard_parallels: [56.7, 56.7] + globe: + semimajor_axis: 6367470.0 + semiminor_axis: 6367470.0
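
As a quick, illustrative usage sketch (not part of the diff above), the derived-variable helpers introduced in this changeset can also be called directly; the snippet below only assumes the functions added under `mllam_data_prep/ops/derive_variable/`:

```python
import pandas as pd
import xarray as xr

from mllam_data_prep.ops.derive_variable.physical_field import calculate_toa_radiation
from mllam_data_prep.ops.derive_variable.time_components import calculate_hour_of_day

# Small hourly time coordinate, mirroring the test fixtures above
time = xr.DataArray(
    pd.date_range("1990-09-03T00:00", periods=24, freq=pd.Timedelta(hours=1)),
    dims=["time"],
    name="time",
)

# Sine component of the cyclically encoded hour of day, as requested by the
# `hour_of_day_sin` entry in example.danra.yaml
hour_sin = calculate_hour_of_day(time, component="sin")
print(hour_sin.attrs["units"], hour_sin.attrs["long_name"])

# Top-of-atmosphere incoming radiation for a single lat/lon point
toa = calculate_toa_radiation(lat=55.711, lon=12.564, time=time)
print(float(toa.max()))
```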