Move to dataclasses based config #8

Merged · 4 commits · Jun 26, 2024
14 changes: 14 additions & 0 deletions CHANGELOG.md
@@ -9,6 +9,20 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

- split dataset creation and storage to zarr into separate functions `mllam_data_prep.create_dataset(...)` and `mllam_data_prep.create_dataset_zarr(...)` respectively [#7](https://github.com/mllam/mllam-data-prep/pull/7)

- changes to spec from v0.1.0:
  - `sampling_dim` removed from the `architecture` section of the spec, as it is not needed to create the training data
  - selection on variable coordinate values is now set with `inputs.{dataset_name}.variables.{variable_name}.values`
    rather than `inputs.{dataset_name}.variables.{variable_name}.sel`
  - when the dimension-mapping method `stack_variables_by_var_name` is used, the formatting string for the new variable
    is now called `name_format` rather than `name`
  - when dimension-mapping is done by simply renaming a dimension, this now needs to be configured by providing
    the named method (`rename`) explicitly through the `method` key, i.e. rather than `{to_dim}: {from_dim}` it is now
    `{to_dim}: {method: rename, dim: {from_dim}}`, to match the signature of the other dimension-mapping methods
  - the coordinate value ranges for the dimensions that the architecture expects as input have been renamed from
    `architecture.input_ranges` to `architecture.input_coord_ranges` to make their purpose clearer
  - the `inputs.{dataset_name}.name` attribute has been removed, as it is superfluous given the `dataset_name` key

## [v0.1.0](https://github.com/mllam/mllam-data-prep/releases/tag/v0.1.0)

First tagged release of `mllam-data-prep` which includes functionality to
173 changes: 172 additions & 1 deletion README.md
@@ -7,7 +7,7 @@ A training dataset is constructed by declaring in a yaml configuration file (for

The configuration is principally a means to represent how the dimensions of a given variable in a source dataset should be mapped to the dimensions and input variables of the model architecture to be trained.

The full configuration file specification is given in [mllam_data_prep/config/spec.py](mllam_data_prep/config/spec.py).
The configuration is given in yaml format, and the file specification is defined using Python 3 [dataclasses](https://docs.python.org/3/library/dataclasses.html) (serialised to/from yaml using [dataclass-wizard](https://dataclass-wizard.readthedocs.io/en/latest/)) in [mllam_data_prep/config.py](mllam_data_prep/config.py).
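
As a rough illustration of this dataclasses-based approach, the sketch below shows how a nested yaml config can be parsed straight into dataclasses with `dataclass-wizard`'s `YAMLWizard` mixin. The class and field names here are illustrative only; the actual spec lives in [mllam_data_prep/config.py](mllam_data_prep/config.py).

```python
# A minimal sketch (not the package's actual spec) of parsing a nested yaml
# config into dataclasses using dataclass-wizard's YAMLWizard mixin.
from dataclasses import dataclass
from typing import Dict, List

from dataclass_wizard import YAMLWizard


@dataclass
class Architecture:
    input_variables: Dict[str, List[str]]


@dataclass
class Config(YAMLWizard):
    schema_version: str
    dataset_version: str
    architecture: Architecture


# parse a yaml file directly into the (nested) dataclass tree
config = Config.from_yaml_file("example.danra.yaml")
print(config.architecture.input_variables["state"])
```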


## Installation
@@ -55,3 +55,174 @@ python -m mllam_data_prep example.danra.yaml
Example output:

![](docs/example_output.png)

## Configuration file

A full example configuration file is given in [example.danra.yaml](example.danra.yaml), and reproduced here for completeness:

```yaml
schema_version: v0.2.0
dataset_version: v0.1.0

architecture:
  input_variables:
    static: [grid_index, static_feature]
    state: [time, grid_index, state_feature]
    forcing: [time, grid_index, forcing_feature]
  input_coord_ranges:
    time:
      start: 1990-09-03T00:00
      end: 1990-09-09T00:00
      step: PT3H
  chunking:
    time: 1

inputs:
  danra_height_levels:
    path: https://mllam-test-data.s3.eu-north-1.amazonaws.com/height_levels.zarr
    dims: [time, x, y, altitude]
    variables:
      u:
        altitude:
          values: [100,]
          units: m
      v:
        altitude:
          values: [100, ]
          units: m
    dim_mapping:
      time:
        method: rename
        dim: time
      state_feature:
        method: stack_variables_by_var_name
        dims: [altitude]
        name_format: f"{var_name}{altitude}m"
      grid_index:
        method: stack
        dims: [x, y]
    target_architecture_variable: state

  danra_surface:
    path: https://mllam-test-data.s3.eu-north-1.amazonaws.com/single_levels.zarr
    dims: [time, x, y]
    variables:
      # shouldn't really be using sea-surface pressure as "forcing", but don't
      # have radiation variables in danra yet
      - pres_seasurface
    dim_mapping:
      time:
        method: rename
        dim: time
      grid_index:
        method: stack
        dims: [x, y]
      forcing_feature:
        method: stack_variables_by_var_name
        name_format: f"{var_name}"
    target_architecture_variable: forcing

  danra_lsm:
    path: https://mllam-test-data.s3.eu-north-1.amazonaws.com/lsm.zarr
    dims: [x, y]
    variables:
      - lsm
    dim_mapping:
      grid_index:
        method: stack
        dims: [x, y]
      static_feature:
        method: stack_variables_by_var_name
        name_format: f"{var_name}"
    target_architecture_variable: static
```
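
A config like this can then be turned into a training dataset with the functions named in the changelog above. A minimal usage sketch; the function name is from this package's changelog, but passing the config as a file path is an assumption for illustration:

```python
import mllam_data_prep

# Build the training dataset described by the config and write it to zarr.
# The exact call signature is assumed here; see the package source for details.
mllam_data_prep.create_dataset_zarr("example.danra.yaml")
```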

Apart from identifiers to keep track of the configuration file format version and the dataset version, the configuration file is divided into two main sections:

- `architecture`: defines the input variables and dimensions of the model architecture to be trained. These are the variables and dimensions that the input datasets will be mapped to.
- `inputs`: a list of source datasets to extract data from. These are the datasets that will be mapped to the architecture defined in the `architecture` section.

### The `architecture` section

```yaml
architecture:
  input_variables:
    static: [grid_index, static_feature]
    state: [time, grid_index, state_feature]
    forcing: [time, grid_index, forcing_feature]
  input_coord_ranges:
    time:
      start: 1990-09-03T00:00
      end: 1990-09-09T00:00
      step: PT3H
  chunking:
    time: 1
```

The `architecture` section defines three things:

1. `input_variables`: what input variables the model architecture you are targeting expects, and what the dimensions are for each of these variables.
2. `input_coord_ranges`: the range of values for each of the dimensions that the model architecture expects as input. This is optional, but allows you to ensure that the training dataset is created with the correct range of values for each dimension.
3. `chunking`: the chunk sizes to use when writing the training dataset to zarr. This is optional, but can be used to optimise the performance of the zarr dataset. By default the chunk sizes are set to the size of the dimension, but this can be overridden by setting the chunk size in the configuration file. A common choice is to set the chunk size along the dimension you are batching over to align with the size of each training item (e.g. if you are training a model with a time-step roll-out of 10 timesteps, you might choose a chunk size of 10 along the time dimension), as sketched below.
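
For intuition, here is a small sketch in plain xarray (not mllam-data-prep internals) of what a `chunking` entry of `time: 1` amounts to when the dataset is written to zarr:

```python
import numpy as np
import xarray as xr

# Sketch only: rechunk a toy dataset along `time` so each zarr chunk holds a
# single timestep, mirroring `chunking: {time: 1}` in the config.
ds = xr.Dataset(
    {"state": (("time", "grid_index"), np.zeros((8, 100)))},
    coords={"time": np.arange(8), "grid_index": np.arange(100)},
)
ds.chunk({"time": 1}).to_zarr("training_data.zarr", mode="w")
```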

### The `inputs` section

```yaml
inputs:
  danra_height_levels:
    path: https://mllam-test-data.s3.eu-north-1.amazonaws.com/height_levels.zarr
    dims: [time, x, y, altitude]
    variables:
      u:
        altitude:
          values: [100,]
          units: m
      v:
        altitude:
          values: [100, ]
          units: m
    dim_mapping:
      time:
        method: rename
        dim: time
      state_feature:
        method: stack_variables_by_var_name
        dims: [altitude]
        name_format: f"{var_name}{altitude}m"
      grid_index:
        method: stack
        dims: [x, y]
    target_architecture_variable: state

  danra_surface:
    path: https://mllam-test-data.s3.eu-north-1.amazonaws.com/single_levels.zarr
    dims: [time, x, y]
    variables:
      # shouldn't really be using sea-surface pressure as "forcing", but don't
      # have radiation variables in danra yet
      - pres_seasurface
    dim_mapping:
      time:
        method: rename
        dim: time
      grid_index:
        method: stack
        dims: [x, y]
      forcing_feature:
        method: stack_variables_by_var_name
        name_format: f"{var_name}"
    target_architecture_variable: forcing

...
```

The `inputs` section defines the source datasets to extract data from. Each source dataset is defined by a key (e.g. `danra_height_levels`) which names the source, together with the following attributes:

- `path`: the path to the source dataset. This can be a local path or a URL to e.g. a zarr dataset or netCDF file, anything that can be read by `xarray.open_dataset(...)`.
- `dims`: the dimensions that the source dataset is expected to have. This is used to check that the source dataset has the expected dimensions, and also makes it clearer in the config file what the dimensions of the source dataset are.
- `variables`: selects which variables to extract from the source dataset. This may either be a list of variable names, or a dictionary where each key is a variable name and the value is a dictionary of coordinates to do selection on. When doing selection you may also optionally define the units of the variable, to check that they match the units expected by the model architecture.
- `target_architecture_variable`: the variable in the model architecture that the source dataset should be mapped to.
- `dim_mapping`: defines how the dimensions of the source dataset should be mapped to the dimensions of the model architecture. This is done by defining a method to apply to each dimension (sketched in plain xarray below this list). The methods are:
  - `rename`: simply rename the dimension to the new name
  - `stack`: stack the listed dimensions to create the dimension in the output
  - `stack_variables_by_var_name`: stack the dimension into the new dimension, and also stack the variable name into the new variable name. This is useful when you have multiple variables with the same dimensions that you want to stack into a single variable.
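
To make the three methods concrete, here is a rough sketch of approximately equivalent operations in plain xarray; this is illustrative only, and mllam-data-prep's actual implementation may differ:

```python
import numpy as np
import xarray as xr

# A toy source dataset with the same layout as danra_height_levels above.
ds = xr.Dataset(
    {
        "u": (("time", "x", "y", "altitude"), np.zeros((2, 3, 4, 1))),
        "v": (("time", "x", "y", "altitude"), np.zeros((2, 3, 4, 1))),
    },
    coords={"altitude": [100]},
)

# `rename` simply relabels a dimension, e.g. ds.rename({"analysis_time": "time"})

# `stack`: collapse x and y into a single grid_index dimension
ds = ds.stack(grid_index=("x", "y"))

# `stack_variables_by_var_name`: fold the altitude dimension and the variable
# names into one feature dimension, naming each entry with `name_format`
name_format = "{var_name}{altitude}m"
state = xr.concat(
    [
        ds[name]
        .sel(altitude=alt)
        .drop_vars("altitude")
        .assign_coords(state_feature=name_format.format(var_name=name, altitude=alt))
        for name in ds.data_vars
        for alt in ds["altitude"].values
    ],
    dim="state_feature",
)
print(state["state_feature"].values)  # ['u100m' 'v100m']
```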
23 changes: 13 additions & 10 deletions example.danra.yaml
```diff
@@ -1,13 +1,12 @@
-schema_version: v0.1.0
+schema_version: v0.2.0
 dataset_version: v0.1.0
 
 architecture:
-  sampling_dim: time
   input_variables:
     static: [grid_index, static_feature]
     state: [time, grid_index, state_feature]
     forcing: [time, grid_index, forcing_feature]
-  input_range:
+  input_coord_ranges:
     time:
       start: 1990-09-03T00:00
       end: 1990-09-09T00:00
@@ -22,20 +21,22 @@ inputs:
     variables:
       u:
         altitude:
-          sel: [100, ]
+          values: [100,]
           units: m
       v:
         altitude:
-          sel: [100, ]
+          values: [100, ]
           units: m
     dim_mapping:
-      time: time
+      time:
+        method: rename
+        dim: time
       state_feature:
         method: stack_variables_by_var_name
         dims: [altitude]
         name_format: f"{var_name}{altitude}m"
       grid_index:
-        method: flatten
+        method: stack
         dims: [x, y]
     target_architecture_variable: state
 
@@ -47,9 +48,11 @@ inputs:
       # have radiation varibles in danra yet
       - pres_seasurface
     dim_mapping:
-      time: time
+      time:
+        method: rename
+        dim: time
       grid_index:
-        method: flatten
+        method: stack
         dims: [x, y]
       forcing_feature:
         method: stack_variables_by_var_name
@@ -63,7 +66,7 @@ inputs:
       - lsm
     dim_mapping:
       grid_index:
-        method: flatten
+        method: stack
         dims: [x, y]
       static_feature:
         method: stack_variables_by_var_name
```