Move to dataclasses based config #8

Merged · 4 commits · Jun 26, 2024
14 changes: 14 additions & 0 deletions CHANGELOG.md
@@ -9,6 +9,20 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

- split dataset creation and storage to zarr into separate functions `mllam_data_prep.create_dataset(...)` and `mllam_data_prep.create_dataset_zarr(...)` respectively [#7](https://github.com/mllam/mllam-data-prep/pull/7)

- changes to spec from v0.1.0:
  - `sampling_dim` removed from the `architecture` section of the spec, as it is not needed to create the training data
  - selection on variable coordinate values is now set with `inputs.{dataset_name}.variables.{variable_name}.values`
    rather than `inputs.{dataset_name}.variables.{variable_name}.sel`
  - when the dimension-mapping method `stack_variables_by_var_name` is used, the formatting string for the new variable
    is now called `name_format` rather than `name`
  - when dimension-mapping is done by simply renaming a dimension, this now needs to be configured by providing
    the named method (`rename`) explicitly through the `method` key, i.e. rather than `{to_dim}: {from_dim}` it is now
    `{to_dim}: {method: rename, dim: {from_dim}}`, to match the signature of the other dimension-mapping methods
  - the coordinate value ranges for the dimensions that the architecture expects as input have been renamed from
    `architecture.input_ranges` to `architecture.input_coord_ranges` to make their purpose clearer
  - the `inputs.{dataset_name}.name` attribute has been removed, as it is superfluous given the `dataset_name` key

## [v0.1.0](https://github.com/mllam/mllam-data-prep/releases/tag/v0.1.0)

First tagged release of `mllam-data-prep` which includes functionality to
173 changes: 172 additions & 1 deletion README.md
@@ -7,7 +7,7 @@ A training dataset is constructed by declaring in a yaml configuration file (for

The configuration is principally a means to represent how the dimensions of a given variable in a source dataset should be mapped to the dimensions and input variables of the model architecture to be trained.

The full configuration file specification is given in [mllam_data_prep/config/spec.py](mllam_data_prep/config/spec.py).
The configuration is given in yaml format, and the file specification is defined using Python 3 [dataclasses](https://docs.python.org/3/library/dataclasses.html) (serialised to/from yaml using [dataclass-wizard](https://dataclass-wizard.readthedocs.io/en/latest/)) in [mllam_data_prep/config.py](mllam_data_prep/config.py).
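
As a rough illustration of this dataclasses-based approach, the sketch below shows how a nested yaml config can be parsed straight into dataclasses with `dataclass-wizard`'s `YAMLWizard` mixin. The class and field names here are illustrative only; the actual spec lives in [mllam_data_prep/config.py](mllam_data_prep/config.py).

```python
# A minimal sketch (not the package's actual spec) of parsing a nested yaml
# config into dataclasses using dataclass-wizard's YAMLWizard mixin.
from dataclasses import dataclass
from typing import Dict, List

from dataclass_wizard import YAMLWizard


@dataclass
class Architecture:
    input_variables: Dict[str, List[str]]


@dataclass
class Config(YAMLWizard):
    schema_version: str
    dataset_version: str
    architecture: Architecture


# parse a yaml file directly into the (nested) dataclass tree
config = Config.from_yaml_file("example.danra.yaml")
print(config.architecture.input_variables["state"])
```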


## Installation
@@ -55,3 +55,174 @@ python -m mllam_data_prep example.danra.yaml
Example output:

![](docs/example_output.png)

## Configuration file

A full example configuration file is given in [example.danra.yaml](example.danra.yaml), and reproduced here for completeness:

```yaml
schema_version: v0.2.0
dataset_version: v0.1.0

architecture:
  input_variables:
    static: [grid_index, static_feature]
    state: [time, grid_index, state_feature]
    forcing: [time, grid_index, forcing_feature]
  input_coord_ranges:
    time:
      start: 1990-09-03T00:00
      end: 1990-09-09T00:00
      step: PT3H
  chunking:
    time: 1

inputs:
  danra_height_levels:
    path: https://mllam-test-data.s3.eu-north-1.amazonaws.com/height_levels.zarr
    dims: [time, x, y, altitude]
    variables:
      u:
        altitude:
          values: [100,]
          units: m
      v:
        altitude:
          values: [100, ]
          units: m
    dim_mapping:
      time:
        method: rename
        dim: time
      state_feature:
        method: stack_variables_by_var_name
        dims: [altitude]
        name_format: f"{var_name}{altitude}m"
      grid_index:
        method: stack
        dims: [x, y]
    target_architecture_variable: state

  danra_surface:
    path: https://mllam-test-data.s3.eu-north-1.amazonaws.com/single_levels.zarr
    dims: [time, x, y]
    variables:
      # shouldn't really be using sea-surface pressure as "forcing", but don't
      # have radiation variables in danra yet
      - pres_seasurface
    dim_mapping:
      time:
        method: rename
        dim: time
      grid_index:
        method: stack
        dims: [x, y]
      forcing_feature:
        method: stack_variables_by_var_name
        name_format: f"{var_name}"
    target_architecture_variable: forcing

  danra_lsm:
    path: https://mllam-test-data.s3.eu-north-1.amazonaws.com/lsm.zarr
    dims: [x, y]
    variables:
      - lsm
    dim_mapping:
      grid_index:
        method: stack
        dims: [x, y]
      static_feature:
        method: stack_variables_by_var_name
        name_format: f"{var_name}"
    target_architecture_variable: static
```
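
A config like this can then be turned into a training dataset with the functions named in the changelog above. A minimal usage sketch; the function name is from this package's changelog, but passing the config as a file path is an assumption for illustration:

```python
import mllam_data_prep

# Build the training dataset described by the config and write it to zarr.
# The exact call signature is assumed here; see the package source for details.
mllam_data_prep.create_dataset_zarr("example.danra.yaml")
```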

Apart from identifiers to keep track of the configuration file format version and the dataset version, the configuration file is divided into two main sections:

- `architecture`: defines the input variables and dimensions of the model architecture to be trained. These are the variables and dimensions that the input datasets will be mapped to.
- `inputs`: a list of source datasets to extract data from. These are the datasets that will be mapped to the architecture defined in the `architecture` section.

### The `architecture` section

```yaml
architecture:
  input_variables:
    static: [grid_index, static_feature]
    state: [time, grid_index, state_feature]
    forcing: [time, grid_index, forcing_feature]
  input_coord_ranges:
    time:
      start: 1990-09-03T00:00
      end: 1990-09-09T00:00
      step: PT3H
  chunking:
    time: 1
```

The `architecture` section defines three things:

1. `input_variables`: what input variables the model architecture you are targeting expects, and what the dimensions are for each of these variables.
2. `input_coord_ranges`: the range of values for each of the dimensions that the model architecture expects as input. This is optional, but allows you to ensure that the training dataset is created with the correct range of values for each dimension.
3. `chunking`: the chunk sizes to use when writing the training dataset to zarr. This is optional, but can be used to optimise the performance of the zarr dataset. By default the chunk sizes are set to the size of the dimension, but this can be overridden by setting the chunk size in the configuration file. A common choice is to set the chunk size along the dimension you are batching over to align with the size of each training item (e.g. if you are training a model with a time-step roll-out of 10 timesteps, you might choose a chunk size of 10 along the time dimension), as sketched below.
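
For intuition, here is a small sketch in plain xarray (not mllam-data-prep internals) of what a `chunking` entry of `time: 1` amounts to when the dataset is written to zarr:

```python
import numpy as np
import xarray as xr

# Sketch only: rechunk a toy dataset along `time` so each zarr chunk holds a
# single timestep, mirroring `chunking: {time: 1}` in the config.
ds = xr.Dataset(
    {"state": (("time", "grid_index"), np.zeros((8, 100)))},
    coords={"time": np.arange(8), "grid_index": np.arange(100)},
)
ds.chunk({"time": 1}).to_zarr("training_data.zarr", mode="w")
```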

### The `inputs` section

```yaml
inputs:
  danra_height_levels:
    path: https://mllam-test-data.s3.eu-north-1.amazonaws.com/height_levels.zarr
    dims: [time, x, y, altitude]
    variables:
      u:
        altitude:
          values: [100,]
          units: m
      v:
        altitude:
          values: [100, ]
          units: m
    dim_mapping:
      time:
        method: rename
        dim: time
      state_feature:
        method: stack_variables_by_var_name
        dims: [altitude]
        name_format: f"{var_name}{altitude}m"
      grid_index:
        method: stack
        dims: [x, y]
    target_architecture_variable: state

  danra_surface:
    path: https://mllam-test-data.s3.eu-north-1.amazonaws.com/single_levels.zarr
    dims: [time, x, y]
    variables:
      # shouldn't really be using sea-surface pressure as "forcing", but don't
      # have radiation variables in danra yet
      - pres_seasurface
    dim_mapping:
      time:
        method: rename
        dim: time
      grid_index:
        method: stack
        dims: [x, y]
      forcing_feature:
        method: stack_variables_by_var_name
        name_format: f"{var_name}"
    target_architecture_variable: forcing

...
```

The `inputs` section defines the source datasets to extract data from. Each source dataset is defined by a key (e.g. `danra_height_levels`) which names the source, together with the following attributes:

- `path`: the path to the source dataset. This can be a local path or a URL to e.g. a zarr dataset or netCDF file, anything that can be read by `xarray.open_dataset(...)`.
- `dims`: the dimensions that the source dataset is expected to have. This is used to check that the source dataset has the expected dimensions, and also makes it clearer in the config file what the dimensions of the source dataset are.
- `variables`: selects which variables to extract from the source dataset. This may either be a list of variable names, or a dictionary where each key is a variable name and the value is a dictionary of coordinates to do selection on. When doing selection you may also optionally define the units of the variable, to check that they match the units expected by the model architecture.
- `target_architecture_variable`: the variable in the model architecture that the source dataset should be mapped to.
- `dim_mapping`: defines how the dimensions of the source dataset should be mapped to the dimensions of the model architecture. This is done by defining a method to apply to each dimension (sketched in plain xarray below this list). The methods are:
  - `rename`: simply rename the dimension to the new name
  - `stack`: stack the listed dimensions to create the dimension in the output
  - `stack_variables_by_var_name`: stack the dimension into the new dimension, and also stack the variable name into the new variable name. This is useful when you have multiple variables with the same dimensions that you want to stack into a single variable.
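
To make the three methods concrete, here is a rough sketch of approximately equivalent operations in plain xarray; this is illustrative only, and mllam-data-prep's actual implementation may differ:

```python
import numpy as np
import xarray as xr

# A toy source dataset with the same layout as danra_height_levels above.
ds = xr.Dataset(
    {
        "u": (("time", "x", "y", "altitude"), np.zeros((2, 3, 4, 1))),
        "v": (("time", "x", "y", "altitude"), np.zeros((2, 3, 4, 1))),
    },
    coords={"altitude": [100]},
)

# `rename` simply relabels a dimension, e.g. ds.rename({"analysis_time": "time"})

# `stack`: collapse x and y into a single grid_index dimension
ds = ds.stack(grid_index=("x", "y"))

# `stack_variables_by_var_name`: fold the altitude dimension and the variable
# names into one feature dimension, naming each entry with `name_format`
name_format = "{var_name}{altitude}m"
state = xr.concat(
    [
        ds[name]
        .sel(altitude=alt)
        .drop_vars("altitude")
        .assign_coords(state_feature=name_format.format(var_name=name, altitude=alt))
        for name in ds.data_vars
        for alt in ds["altitude"].values
    ],
    dim="state_feature",
)
print(state["state_feature"].values)  # ['u100m' 'v100m']
```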
23 changes: 13 additions & 10 deletions example.danra.yaml
```diff
@@ -1,13 +1,12 @@
-schema_version: v0.1.0
+schema_version: v0.2.0
 dataset_version: v0.1.0
 
 architecture:
-  sampling_dim: time
   input_variables:
     static: [grid_index, static_feature]
     state: [time, grid_index, state_feature]
     forcing: [time, grid_index, forcing_feature]
-  input_range:
+  input_coord_ranges:
     time:
       start: 1990-09-03T00:00
       end: 1990-09-09T00:00
@@ -22,20 +21,22 @@ inputs:
     variables:
       u:
         altitude:
-          sel: [100, ]
+          values: [100,]
           units: m
       v:
         altitude:
-          sel: [100, ]
+          values: [100, ]
           units: m
     dim_mapping:
-      time: time
+      time:
+        method: rename
+        dim: time
       state_feature:
         method: stack_variables_by_var_name
         dims: [altitude]
         name_format: f"{var_name}{altitude}m"
       grid_index:
-        method: flatten
+        method: stack
         dims: [x, y]
     target_architecture_variable: state
 
@@ -47,9 +48,11 @@ inputs:
       # have radiation varibles in danra yet
       - pres_seasurface
     dim_mapping:
-      time: time
+      time:
+        method: rename
+        dim: time
       grid_index:
-        method: flatten
+        method: stack
         dims: [x, y]
       forcing_feature:
         method: stack_variables_by_var_name
@@ -63,7 +66,7 @@ inputs:
       - lsm
     dim_mapping:
       grid_index:
-        method: flatten
+        method: stack
         dims: [x, y]
       static_feature:
         method: stack_variables_by_var_name
```