Move OSGB transformation hack out of nwp-consumer and into ocf-datapipes #26
@dfulu is working on this in datapipes, I think.
It may already be fixed without me intending to do so. But what would the coordinates look like without #27?
Does the PVNet version right now use lat/lon or OSGB?
The PVNet model doesn't use either: it just uses the NWP image. But the pvnet datapipe inside ocf_datapipes is currently using OSGB. I think there are some issues to sort out before we could drop the OSGB coordinates. At least we need to change the data loading function: https://github.com/openclimatefix/ocf_datapipes/blob/main/ocf_datapipes/load/nwp/providers/ukv.py But I was also looking at the current production NWP data.
So the latitude and longitude are each 2-dimensional? When I select
Which means the latitude varies a little bit depending on x-OSGB. We currently don't have support in ocf_datapipes for two-dimensional coordinates. We always assume the coordinates are independent and form a regular grid. So it would actually be a decent chunk of work to switch to using lat-lons.
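To make the difference concrete, here's a minimal xarray illustration (sizes and values made up):

```python
import numpy as np
import xarray as xr

# A regular grid: latitude depends only on y, longitude only on x.
regular = xr.Dataset(
    coords={
        "latitude": ("y", np.linspace(49.0, 61.0, 100)),
        "longitude": ("x", np.linspace(-12.0, 5.0, 80)),
    }
)

# The production UKV data: each coordinate is a full 2D field, so
# latitude varies slightly along x as well as y (and vice versa).
ny, nx = 100, 80
lat2d = np.linspace(49.0, 61.0, ny)[:, None] + 0.01 * np.random.rand(ny, nx)
lon2d = np.linspace(-12.0, 5.0, nx)[None, :] + 0.01 * np.random.rand(ny, nx)
curvilinear = xr.Dataset(
    coords={
        "latitude": (("y", "x"), lat2d),
        "longitude": (("y", "x"), lon2d),
    }
)

print(regular["latitude"].dims)      # ('y',)      -> independent 1D coords
print(curvilinear["latitude"].dims)  # ('y', 'x')  -> 2D curvilinear coords
```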
Although it looks like the OSGB coordinates we calculate might also each be two-dimensional, and we just ignore that?
Yeah, the lat-lons are a 2-dimensional grid. So the rough order of work would be
Or perhaps just 1. Move the OSGB calculation into
I made a plot of the OSGB coordinates we calculate in client.py#L281-L290 vs what they would be using 2D OSGB. In the figure below, the red points are our current grid points, and the blue points are the points using 2D OSGB. There are some lines showing the offset between the two. This means that in production we are accidentally offsetting our windows by 3-4 pixels. This is likely something we want to fix. NB: this figure needs a download and zoom-in to see the grid points and offsets properly.
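For context, the values in that client.py snippet come from a point-wise lat/lon to OSGB transform; with pyproj that looks roughly like this (a sketch, not the exact production code):

```python
import numpy as np
from pyproj import Transformer

# WGS84 lat/lon (EPSG:4326) -> OSGB36 / British National Grid (EPSG:27700).
transformer = Transformer.from_crs("EPSG:4326", "EPSG:27700", always_xy=True)

lons = np.array([-5.0, -2.5, 1.5])
lats = np.array([50.0, 54.9, 52.5])
x_osgb, y_osgb = transformer.transform(lons, lats)

# Applied to the full 2D lat/lon fields this yields 2D x/y fields, which is
# why collapsing them to 1D coordinates shifts the windows by a few pixels.
```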
Nice, thanks for doing this. How do we go about fixing it?
@jacobbieker had suggested we might regrid to the same grid used in our training data. That might be the easiest. It should also work if we can get datapipes working with 2D spatial grids and select directly from lat-lon, but our input images might be slightly distorted compared to the training data. So I don't know which would be best.
Can you do this re-gridding in
We'd have to write it in; it won't have any regridding functionality yet. However, could it not make sense to do it within the consumer? It's likely that every model that uses the UKV data in production will have been trained on the UKV CEDA archive data, and so trained using that grid. Otherwise each model would need to perform this regridding independently. I can't really think of any uses for the UKV data on the grid it comes in on.
Sorry, maybe I'm confused: what is the regridding from and to?
Ahh right, so @jacobbieker was suggesting, and I think I would agree, that we should regrid the data to the same grid as the training data. Currently, the production data is on a different grid than the training data. It's not just how we access the coordinates of the data; the data is centred on different spatial points. This also means that we have a slight spatial distortion between the training and production data. In the figure below, the blue points are the actual (2D) grid of the production data. In black is the grid of the training data. I'm not sure how much the distortion will affect our results now (probably quite little), but if we moved to bigger image patch sizes it could.
And what grid is the training data on? What grid is the live data on? Are they both lat-lons? Perhaps the grid doesn't have a name and it's just how the data comes in. I like the idea that the nwp-consumer is pretty clean and doesn't do that kind of stuff, or perhaps there's an option to turn this regridding on in the nwp-consumer. Would we have to save the grid of the training data in a file, load this in the nwp-consumer, and then re-grid?
So far as I'm aware, the live (Met Office) data arrives in lat/lon, and it is regridded to OSGB x/y, as the training (CEDA) data is in OSGB.
The training data is on a regularly-spaced OSGB coordinate grid, whereas the production data isn't on a regularly spaced OSGB or lat-lon grid. I suspect the production data is actually on some grid of its own.
Yeah, that's fair enough. I guess another option is just to have some utility function which could be used in any current and future model apps.

I'm not sure it makes sense to do the regridding in datapipes. For regridding, we need to actually load the underlying data, not just the coordinates, in order to do the interpolation to the new grid. To find a value on the new grid via interpolation, other surrounding spatial points must be loaded. I think this means that if we select an NWP window in datapipes, we will need to load the chunk containing the window of interest as well as all surrounding chunks. So if we try to do the regridding lazily, it's going to be a lot of repeated computation to do the regridding 317 times (once for each GSP), and it would also add to the RAM footprint of each worker. And if we did it eagerly, regridding the whole dataset in RAM before slicing windows means each worker will also have a copy of the NWP in RAM. One copy of the dataset (data-national) is currently 240MB in RAM.

So I think it makes more sense to do the regridding once and save it to disk, then pull from that. We probably could put that regridding, disk caching, and reloading into datapipes, but maybe that's messy?
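To make the "regrid once, then slice" idea concrete, a minimal sketch (function names, paths, and the upstream regridding itself are placeholders):

```python
import xarray as xr

def build_regridded_cache(regridded: xr.Dataset, cache_path: str) -> None:
    """Run once, e.g. after each consumer download: persist the eagerly
    regridded dataset so workers never regrid or hold it fully in RAM."""
    regridded.to_zarr(cache_path, mode="w")

def load_gsp_window(cache_path: str, x_centre: float, y_centre: float,
                    half_width_m: float) -> xr.Dataset:
    """Run per GSP window: open the cache lazily, so slicing only reads
    the zarr chunks that overlap the requested window."""
    ds = xr.open_zarr(cache_path)
    return ds.sel(
        x=slice(x_centre - half_width_m, x_centre + half_width_m),
        y=slice(y_centre - half_width_m, y_centre + half_width_m),
    )
```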
Yeah, I think so.
I think we're talking about different things when we say "regridding". I might be wrong, but I thought what we do currently is a translation of the coordinates from lat-lon to OSGB, so the underlying data values stay the same. "Regridding" the way I'm talking about it is an interpolation to different spatial points. Like in the figure above, we'd interpolate all the blue data points (the production grid) to the locations of the black ones (the training grid).
Yeah, I would agree with @dfulu on regridding the data once and resaving it to disk, or having it in the consumer, since one of the ideas of the consumer is that it keeps the prod and training data the same. According to this, the operational UKV is not on a regular grid at all: there is an inner 1.5km grid that then slowly becomes less dense until it is a 4km grid at the edges, so that might be the cause of the mismatch.
Ah ok, I think I agree. So perhaps the task is for the nwp-consumer to regrid the data to a regular OSGB grid; then it'll be the same as the training data.
Indeed, and with this in mind, perhaps it makes sense to interpolate in the consumer. As long as we're happy with the zarrs it produces not being the data exactly as it was from that source! However, if we are interpolating, is OSGB the best thing to choose? What with us now consuming NWP data from India, Malta etc., perhaps we should set the standard now to lat/long for forward compatibility, and do a batch regrid of the training data?
I think in general we should try to use lat/lon grids, as this is what other people do. The Met Office is probably a special case, where the training data is on an OSGB grid, and therefore I think the nwp-consumer should also make it like this. Perhaps we can have an option to turn this on/off for the Met Office data. That means we could leave it on a lat/lon grid if we decided to use that in the future.
I think the training data must have been regridded at some point, so I think that's fine. Also, like Jacob said, for model performance the biggest concern should be that the training and production data are the same. So I'd go towards making our production data look as close as possible to the training data, rather than as close as possible to the source data.
Do we save the Met Office data as we go into a rolling archive? If we're downloading it anyway for the production systems, we should push it to leonardo too; that way, going forward, we'd have an archive of data exactly as it was in live, also in lat/long. Then we could do a batch interpolation of the old CEDA training data to match, and then we'd be consistent going forward. I appreciate this would maybe mean retraining (although in live the data is lat/long anyway, so the model must be able to cope). It seems to me that hinging our systems around legacy training data, which is incompatible with both the norm (as per Peter's comment) and our ability to more easily expand into other parts of the globe, would artificially hold us back.
Having a rolling archive would be great for tracking down any blips in the forecast!
Yeah, that could be a good task for Dagster to go and do.
Could we (just discussing this with @devsjc)
I think we need to do the regridding step somewhere, not just transform the coords. Here's how we might do the regridding properly:
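A sketch of the kind of thing meant here, interpolating each field from the curvilinear production grid onto the regular OSGB training grid with scipy's griddata (argument and coordinate names are illustrative):

```python
import numpy as np
import xarray as xr
from scipy.interpolate import griddata

def regrid_to_training_grid(
    field: xr.DataArray,      # one 2D (y, x) slice of the production data
    x_osgb_2d: np.ndarray,    # 2D OSGB eastings of the production grid
    y_osgb_2d: np.ndarray,    # 2D OSGB northings of the production grid
    x_train: np.ndarray,      # 1D regular eastings of the training grid
    y_train: np.ndarray,      # 1D regular northings of the training grid
) -> xr.DataArray:
    # Flatten the curvilinear source points...
    src_points = np.column_stack([x_osgb_2d.ravel(), y_osgb_2d.ravel()])
    # ...and linearly interpolate the field values onto the regular target grid.
    xx, yy = np.meshgrid(x_train, y_train)
    values = griddata(src_points, field.values.ravel(), (xx, yy), method="linear")
    return xr.DataArray(values, coords={"y": y_train, "x": x_train}, dims=("y", "x"))
```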
It takes about 10s on my GCP instance, but it's got a lot of cores.
In the image above the white border is missing data, i.e. in production we have a smaller spatial area than we do in training. I didn't know this was the case, actually, and it is something we'll need to think about if we want to use bigger NWP window sizes, especially considering the Cornwall and East Anglia GSPs.
That's good to check. Perhaps you can run training with these bigger areas first, see the effect, and then we can think about what to do about it. It could be a case of using ECMWF a bit earlier, as we won't be restricted then.
This shows all the patches we are currently using. So thankfully, at the minute, we should be far enough from the boundaries that this doesn't matter. This is with a window size of 24 pixels. If we were to double to 48 (semi-planned at some point), I think we'll be at the cusp of needing to deal with this properly.
Should we move this to a separate GitHub issue about making the NWP windows bigger?
I think @devsjc and I have found the grid that the production data comes on. It is using a Lambert Azimuthal Equal Area projection, which I presume has been centred on the UK. Here's the link to the Met Office doc. So I think that means we will still want to reproject the production NWP data onto another grid to match the training data.
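For anyone who wants to reproduce this, the projection can be expressed with pyproj, though the centre parameters below are my reading of the doc and should be checked against it:

```python
from pyproj import CRS, Transformer

# Lambert Azimuthal Equal Area centred on the UK; lat_0/lon_0 here are
# assumed from the Met Office standard-grid documentation linked above.
laea_uk = CRS.from_proj4(
    "+proj=laea +lat_0=54.9 +lon_0=-2.5 +x_0=0 +y_0=0 +ellps=WGS84 +units=m"
)

# Reprojecting the LAEA grid points onto OSGB shows how far the production
# grid sits from the regular OSGB training grid.
to_osgb = Transformer.from_crs(laea_uk, "EPSG:27700", always_xy=True)
x_osgb, y_osgb = to_osgb.transform(0.0, 0.0)  # the projection centre in OSGB metres
```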
@dfulu do you mind adding the code that you used on GCP? The other option is that the NWP-consumer saves the data how it is, and then PVNet does the regridding. There's something quite nice about the NWP-consumer saving the data how it comes, and then if PVNet needs it in a certain way, it redoes it.
So I think we found that the grid from the live consumer isn't quite the same as the CEDA grid. The picture above from @dfulu illustrates that well. It's a few pixels different. This means the ML model will see a slightly different image (note PVNet takes an image in, not a point). So I think the options are
I think I'm tempted to put this in 2.
I think my preference would be 1 or 4.
Thanks @dfulu for this. Maybe we should try it in 4 and see how it works; otherwise we could move it to 1.
In order to quickly deploy the consumer and have it work with the soon-to-be-retired PVNet1, which expects the "x" and "y" coordinates to be (poorly labelled) OSGB coordinates, a "quick fix" was put in place in the consumer to approximate those values. (See
inputs/metoffice/client.py
) This is not an optimal end position, however, as OCF are moving away from OSGB and towards lat/long anyway, not to mention the ambiguity that bare "x" and "y" coordinates add to the dataset. Instead, this logic to convert from lat/long to OSGB should be implemented in ocf-datapipes; then the version of that library being used in PVNet1/2 can be bumped accordingly.
As such, this issue mostly requires code changes in ocf-datapipes and the PVNet1/2 repos. The net change in this repo should be exactly the inverse of what was added with the implementation of the "quick fix" in #27, i.e. removing everything that was added there and nothing else.
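For illustration, the datapipes-side replacement could look something like this (the helper and coordinate names are hypothetical, not existing ocf-datapipes API):

```python
import xarray as xr
from pyproj import Transformer

# WGS84 lat/lon -> OSGB36 / British National Grid.
_to_osgb = Transformer.from_crs("EPSG:4326", "EPSG:27700", always_xy=True)

def assign_osgb_coords(ds: xr.Dataset) -> xr.Dataset:
    """Derive proper (2D) OSGB coordinates from the dataset's own
    latitude/longitude, instead of trusting the consumer's approximate
    1D "x"/"y" values."""
    x_osgb, y_osgb = _to_osgb.transform(ds["longitude"].values, ds["latitude"].values)
    return ds.assign_coords(
        x_osgb=(ds["longitude"].dims, x_osgb),
        y_osgb=(ds["latitude"].dims, y_osgb),
    )
```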