
Feat/boundary dataloader #90

Open

sadamov wants to merge 108 commits into base: main
Conversation

@sadamov (Collaborator) commented Nov 21, 2024

This PR introduces support for handling boundary data through an additional, optional datastore. This allows forcing the model's interior state with boundary conditions during training and evaluation. The PR implements boundary data handling up to WeatherDataset.__getitem__, which now returns 5 elements: init_states, target_states, forcing, **boundary**, target_times (#84 will later implement boundary handling on the model side).
Currently the boundary datastore must be of type mdp and contain forcing category data. The code was developed so that this requirement can easily be relaxed in the future.
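Not part of the PR itself, but a minimal sketch of the 5-element sample structure described above. The ToyWeatherDataset class and all shapes are illustrative assumptions, not the actual neural-lam API:

```python
import numpy as np

class ToyWeatherDataset:
    """Hypothetical stand-in for WeatherDataset, illustrating the
    5-element sample described in the PR. All shapes are made up."""

    def __getitem__(self, idx):
        init_states = np.zeros((2, 10, 3))    # (n_init_steps, n_grid, n_state_features)
        target_states = np.zeros((4, 10, 3))  # (ar_steps, n_grid, n_state_features)
        forcing = np.zeros((4, 10, 5))        # interior forcing per target step
        boundary = np.zeros((4, 8, 5))        # boundary forcing from the second datastore
        target_times = np.arange(idx, idx + 4)  # one timestamp per target step
        return init_states, target_states, forcing, boundary, target_times

# unpack exactly as a training loop consuming this dataset would
init_states, target_states, forcing, boundary, target_times = ToyWeatherDataset()[0]
```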

Motivation and Context:

In weather modeling, incorporating realistic boundary conditions is crucial for accurately simulating atmospheric processes. By providing an optional boundary datastore, the model gains the flexibility to accept external boundary data, enhancing prediction accuracy and allowing for more complex simulations. The current workflow, which uses the n outermost grid cells as the boundary, will be fully replaced.
The creation of a new ERA5 WeatherBench MDP example datastore demonstrates the implementation of these features and serves as a template for future models.

Key Changes:

  • Introduced num_past/future_forcing_steps and num_past/future_boundary_steps as separate arguments and CLI flags.
  • Created a new ERA5 WeatherBench MDP example datastore: Developed an example datastore using ERA5 WeatherBench data within the Model-Data-Platform (MDP) framework.
  • Ensured num_grid_points also works for a forcing-only datastore (i.e. one not containing state variables).
  • Combined the time slicing for state and forcing into one timestep-based function. The same function now handles state, forcing, and boundary data and matches the corresponding samples based on actual time (using the minimal time difference between state and forcing/boundary).
  • Updated global configuration to support two datastores: Modified the global configuration to accommodate both state and boundary/forcing datastores simultaneously.
  • Added temporal embedding of time_step_diffs: Introduced temporal embeddings by adding time step differences as additional input features to improve the model's temporal awareness.
  • Removed interior/exterior masks from the codebase: Eliminated the use of data masks to simplify the codebase. This is now handled by having two separate datastores.
  • Accept boundary data in common_step but do not return it: Modified the common_step method to accept boundary data as input without returning it, preparing for future boundary/state implementation on the model side.
  • Moved the handling of ensemble members to the initialization of WeatherDataset to prevent duplicate time steps in the dataset.
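As a rough illustration of the timestep-based matching described above, here is a sketch with made-up arrays; the actual implementation operates on xarray time coordinates, and the function name and signature below are hypothetical:

```python
import numpy as np

def match_nearest_times(state_times, other_times):
    """For each state time, return the index of the nearest
    forcing/boundary time (minimal absolute time difference).
    Sketch only, not the neural-lam implementation."""
    state_times = np.asarray(state_times, dtype="datetime64[ns]")
    other_times = np.asarray(other_times, dtype="datetime64[ns]")
    # pairwise |t_state - t_other|, then argmin along the other axis
    diffs = np.abs(state_times[:, None] - other_times[None, :])
    return diffs.argmin(axis=1)

# hourly interior state vs 3-hourly boundary forcing
state = np.arange("2024-01-01T00", "2024-01-01T06", dtype="datetime64[h]")
boundary = np.arange("2024-01-01T00", "2024-01-01T07", 3, dtype="datetime64[h]")
idx = match_nearest_times(state, boundary)
```

Matching on actual time rather than on array index is what lets datastores with different step sizes (e.g. hourly state, 3-hourly boundary) line up correctly.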

Bugfix:

  • Fixed a bug in analysis time retrieval in the MEPS store: corrected an issue where retrieving analysis_time from the MEPS datastore returned duplicate time steps for one member.

Notes:

  • I have introduced a strict assertion that time, analysis_time and elapsed_forecast_time all have a consistent step size. This is not strictly necessary, but it is probably what most users want, and it helps with the temporal encoding.

Introduced Dependencies

  • "gcsfs>=2021.10.0" — Where should we download ERA5 from? Do we want this dependency on Google Cloud Storage?

TODOs:

  • Upload a MEPS example with dates close to Danra and with test/val/train periods adjacent to each other @leifdenby. This allows testing both npy and mdp datastores with the same boundary datastore and limits the required data.

Type of change

  • 🐛 Bug fix (non-breaking change that fixes an issue)
  • ✨ New feature (non-breaking change that adds functionality)
  • 💥 Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • 📖 Documentation (Addition or improvements to documentation)

Checklist before requesting a review

  • My branch is up-to-date with the target branch - if not update your fork with the changes from the target branch (use pull with --rebase option if possible).
  • I have performed a self-review of my code
  • For any new/modified functions/classes I have added docstrings that clearly describe its purpose, expected inputs and returned values
  • I have placed in-line comments to clarify the intent of any hard-to-understand passages of my code
  • I have updated the README to cover introduced code changes
  • I have added tests that prove my fix is effective or that my feature works
  • I have given the PR a name that clearly describes the change, written in imperative form (context).
  • I have requested a reviewer and an assignee (assignee is responsible for merging). This applies only if you have write access to the repo, otherwise feel free to tag a maintainer to add a reviewer and assignee.

Checklist for reviewers

Each PR comes with its own improvements and flaws. The reviewer should check the following:

  • the code is readable
  • the code is well tested
  • the code is documented (including return types and parameters)
  • the code is easy to maintain

Author checklist after completed review

  • I have added a line to the CHANGELOG describing this change, in a section
    reflecting type of change (add section where missing):
    • added: when you have added new functionality
    • changed: when default behaviour of the code has been changed
    • fixes: when your contribution fixes a bug

Checklist for assignee

  • PR is up to date with the base branch
  • the tests pass
  • author has added an entry to the changelog (and designated the change as added, changed or fixed)
  • Once the PR is ready to be merged, squash commits and merge the PR.

sadamov and others added 30 commits November 21, 2024 13:49
combine two slicing fcts into one
@sadamov sadamov marked this pull request as ready for review December 5, 2024 12:51
@joeloskarsson (Collaborator) commented:
I think that if we merge main into here the tests will pass. They seem to have failed for the same reason as on main, which was fixed with #94.

@leifdenby (Member) left a comment:

First read-through: this was a joy to read 😊 I have a few comments, but in general I think this is very close to done.

Review threads on: neural_lam/config.py, neural_lam/datastore/mdp.py, neural_lam/weather_dataset.py (multiple threads, most resolved)
def _process_windowed_data(self, da_windowed, da_state, da_target_times):
    """Helper function to process windowed data. This function stacks the
    'forcing_feature' and 'window' dimensions and adds the time step
    differences to the existing features as a temporal embedding."""
Member:
Would it maybe make sense to split adding the temporal embedding into a separate function? To me this doesn't have anything to do with the windowing, but maybe I am missing something. Something like _add_temporal_embedding(...).

Collaborator:
I think this is highly related. The temporal embedding (which we should probably not refer to as such, since this is not an embedding; so from here on: the delta times) is the time differences for the different entries in the window. After these have been stacked, it is not clear to me how you would know what the delta times are, since you've gotten rid of the window dimension.

Collaborator (Author):
Yes, these are indeed highly related, since the time_deltas directly correspond to the windowing. As the method is concise, I suggest keeping it as is. With the comments it should be clear what is going on. My naming, however, is bad. I propose renaming time_diff_steps to time_deltas everywhere in weather_dataset.py, see 7e1a246.
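To make this thread concrete, here is a rough numpy sketch of the stack-then-append-deltas step being discussed. The real _process_windowed_data works on xarray DataArrays with .stack(); the function name, shapes, and ordering below are assumptions:

```python
import numpy as np

def stack_window_with_deltas(windowed, time_deltas):
    """Flatten the (feature, window) axes into a single feature axis,
    then append the per-window-entry time deltas as extra features so
    the delta information survives the stacking. numpy sketch only."""
    n_time, n_grid, n_feat, n_window = windowed.shape
    stacked = windowed.reshape(n_time, n_grid, n_feat * n_window)
    # one delta per window entry, broadcast over time and grid
    deltas = np.broadcast_to(time_deltas, (n_time, n_grid, n_window))
    return np.concatenate([stacked, deltas], axis=-1)

out = stack_window_with_deltas(
    np.zeros((4, 10, 5, 3)),     # (time, grid, forcing_feature, window)
    np.array([-1.0, 0.0, 1.0]),  # delta times for the 3 window entries
)
```

Appending the deltas after the stacking is what makes them recoverable once the window dimension is gone, which is the coupling both commenters point out.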

Review threads on: neural_lam/weather_dataset.py, tests/test_datasets.py, tests/test_time_slicing.py (multiple threads, most resolved)
Comment on lines 180 to 187
check_time_overlap(
self.da_state,
self.da_boundary_forcing,
da1_is_forecast=self.datastore.is_forecast,
da2_is_forecast=self.datastore_boundary.is_forecast,
num_past_steps=self.num_past_boundary_steps,
num_future_steps=self.num_future_boundary_steps,
)
@sadamov (Collaborator, Author) commented Dec 20, 2024:
Joel Oskarsson:
I've been trying out loading data in neural-lam that I created using the DANRA and ERA5 configs from https://github.com/leifdenby/mllam-data-prep/tree/feat/crop-with-other-dataset. In those configs both datasets cover exactly the same time period. Running neural-lam with num_past_boundary_steps > 0 raises this error https://github.com/sadamov/neural-lam/blob/f0a7046b7ce0c5c955762b67ebe017fdef4eae4e/neural_lam/weather_dataset.py#L207 (even with the fix #90 (comment)). This makes sense, as there is no boundary data before the first time point for the interior. However, an alternative behavior would be to adjust the first time point for the interior (e.g. ignoring the first time steps of the interior data until data loading is possible for the set num_past_boundary_steps), to still allow training in this case. What are your thoughts on this? Right now one always has to leave some extra time steps in the boundary forcing, depending on num_past_boundary_steps, which has to be decided beforehand when the dataset is created. But maybe that makes more sense than allowing a different number of samples depending on num_past_boundary_steps?

Simon Adamov:
True, this is indeed by design. We could either tell the user to include sufficient time steps and have a very clear requirement and warning, OR expect most users to choose the same datetimes for boundary and forcing and give a warning about how many time steps were removed from the interior. As long as only a few time steps are removed from the interior, I think option 2 is nicer to handle. But no user would want half their training data to be ignored "quietly". So I think both options are fine, as long as we have clear warnings and error messages. Is 2 your preference?

Joel Oskarsson:
Those are good ideas. I agree that we should really avoid cutting away much of the training data without any proper warning. But cutting away a few time steps is not really a problem (though we could still show a warning explaining how much is being removed). Yes, option 2 is quite natural I think, since it would be nice to prepare the data by just specifying the exact same time period for the interior and boundary when building the corresponding zarrs 😁 I'm thinking that this might be easy to achieve by "just" slicing away a few time steps in the interior xr.DataArray used in WeatherDataset, but I have not looked at any details of what such an implementation would look like.

sadamov (Collaborator, Author):
I suggest implementing an additional stateless function called crop_time_if_needed, see 7e5797e. This function gets called automatically before any time-slicing operations in WeatherDataset. I left all the existing checks and warnings in the code; they can't hurt. I also added a print statement reporting how many time steps were removed from the state data. Let me know what you think!
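A minimal sketch of what such a cropping step might look like. This reuses the name crop_time_if_needed for illustration only; the actual function in 7e5797e operates on xarray DataArrays and handles more cases:

```python
import numpy as np

def crop_time_if_needed(state_times, boundary_times, num_past_steps):
    """Drop leading state (interior) time steps that do not have
    num_past_steps boundary time steps available before them, and
    report how many were removed. Illustrative sketch only."""
    state_times = np.asarray(state_times)
    boundary_times = np.asarray(boundary_times)
    kept = np.array([t for t in state_times
                     if (boundary_times < t).sum() >= num_past_steps])
    n_removed = len(state_times) - len(kept)
    if n_removed > 0:
        # the warning both commenters asked for: never crop "quietly"
        print(f"Removed {n_removed} leading time step(s) from state data")
    return kept

# identical time axes: the first num_past_steps interior steps are cropped
kept = crop_time_if_needed(np.arange(10), np.arange(10), num_past_steps=2)
```

This corresponds to "option 2" from the thread above: users can build interior and boundary zarrs over the exact same period, at the cost of losing a few (loudly reported) interior samples.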

@sadamov (Collaborator, Author) commented Dec 20, 2024

@leifdenby @joeloskarsson I have given this a thorough rework and implemented all your feedback from here and from Slack. There are two unresolved threads, mostly to give you a chance to intervene if you disagree with my solution. Please give this your final review so we can merge soon. It is my understanding that everything up to the return of WeatherDataset should now be ready for running "realistic" experiments with neural-lam.

PS: locally the tests pass for me

Labels: enhancement (New feature or request)

3 participants