
copernicusmarine speed up retrieval of time extents #1082

Open
veenstrajelmer opened this issue Jan 23, 2025 · 4 comments

@veenstrajelmer (Collaborator) commented Jan 23, 2025

Retrieving the time extents for bio/phyc for my/my-int/anfc was very slow. An easy fix was already implemented in #1058. However, it can be significantly faster via copernicusmarine.describe(), which also prevents quite a few unnecessary logging messages.

As suggested by Leïla from copernicusmarine servicedesk, this is significantly faster:

import copernicusmarine

data = copernicusmarine.describe(
    dataset_id='cmems_mod_glo_phy-cur_anfc_0.083deg_P1D-m',
    disable_progress_bar=True)

# walk the nested describe structure and print each unique time extent
seen_times = set()
for product in data.products:
    for dataset in product.datasets:
        for version in dataset.versions:
            for part in version.parts:
                for service in part.services:
                    for variable in service.variables:
                        for coordinate in variable.coordinates:
                            if coordinate.coordinate_id == 'time':
                                time_values = (coordinate.minimum_value, coordinate.maximum_value)
                                if time_values not in seen_times:
                                    seen_times.add(time_values)
                                    print(f"Minimum Value: {coordinate.minimum_value}")
                                    print(f"Maximum Value: {coordinate.maximum_value}")

However, the above code is not as robust as I would prefer, so I tried an alternative method below that includes some assertions and makes more use of the copernicusmarine describe dataset structure:

import copernicusmarine
import pandas as pd
# importing a private xarray function here
from xarray.coding.times import decode_cf_datetime

data = copernicusmarine.describe(
    dataset_id='cmems_mod_glo_phy-cur_anfc_0.083deg_P1D-m',
    disable_progress_bar=True,
    )

def convert_time(time_raw, time_units):
    time_np = decode_cf_datetime(num_dates=[time_raw], units=time_units)
    time_pd = pd.Timestamp(time_np[0])
    return time_pd

# check if there is indeed only one of products/datasets/versions/parts
assert len(data.products) == 1
assert len(data.products[0].datasets) == 1
assert len(data.products[0].datasets[0].versions) == 1
assert len(data.products[0].datasets[0].versions[0].parts) == 1

# there are four services, but only geoseries and timeseries contain coordinates in their data variables.
# I expect that if the variables came from the "original-files" service, the time
# values would be 12 hours different from those in geoseries/timeseries (netcdf vs ARCO)
part = data.products[0].datasets[0].versions[0].parts[0]
service_arco_geo_series = part.get_service_by_service_name(service_name="arco-geo-series")

# I guess service.from_metadata_item() could be useful, but I cannot get it to work
# service_arco_geo_series.from_metadata_item()

# Therefore, we get the coordinates from the first data variable in the service.
# We assume for now that all data variables contain the same coordinates, which might be tricky.
# It would be useful to attach the coordinates to the service itself as well,
# since I expect they are valid there. This would avoid some nesting.
var0 = service_arco_geo_series.variables[0]
var0_coordinates = var0.coordinates

# get time coordinate by searching for coordinate_id="time"
coordinate_ids = [x.coordinate_id for x in var0_coordinates]
timecoord_idx = coordinate_ids.index("time")
time_coord = var0_coordinates[timecoord_idx]

# the time extents are raw numbers w.r.t. a reference date
time_units = time_coord.coordinate_unit
time_min_raw = time_coord.minimum_value
time_max_raw = time_coord.maximum_value

# convert to pandas timestamps
time_min = convert_time(time_min_raw, time_units)
time_max = convert_time(time_max_raw, time_units)
print(f"Minimum Value: {time_min}")
print(f"Maximum Value: {time_max}")

@renaudjester: You might have some ideas for improving the code above. My comments in the second code block state some issues I encountered. I was hesitant to report this issue on the copernicusmarine GitHub, since it is not a bug or feature request (at least not yet). Please let me know if you prefer that anyway, or if this channel also works for you. If the above use case is a reason to open a feature request on the copernicusmarine GitHub, please feel free to link it to this issue as well.

@renaudjester

Hi!

Your question is about this, isn't it:

there are four services, but only geoseries and timeseries contain coordinates in their data variables.
I expect that if the variables came from the "original-files" service, the time values would be 12 hours different from those in geoseries/timeseries (netcdf vs ARCO)

Indeed, we don't have the coordinates for the native data because in STAC we parse the coordinates per asset, and the native data assets contain no coordinates. I am not sure whether this information is accurate in the metadata 🤔 (for example, whether the time values in the metadata are always the ones of ARCO, or also from the native files).
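
For example, here is a quick way to see which services actually carry coordinate metadata, reusing the describe() result from the snippets above. This is only a sketch: it assumes each service object exposes a service_name attribute, as the get_service_by_service_name getter above suggests, which I have not verified.

# count coordinate entries per service; the native "original-files" service
# is expected to show none, per the STAC parsing described above
# (service_name is an assumed attribute)
part = data.products[0].datasets[0].versions[0].parts[0]
for service in part.services:
    n_coords = sum(len(variable.coordinates) for variable in service.variables)
    print(f"{service.service_name}: {n_coords} coordinate entries")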

About your code:

As for your concern:

Therefore, we get the coordinates from the first data variable in the service. We assume for now that all data variables contain the same coordinates, which might be tricky.

Actually, what I have seen is that some datasets have variables with only some of the coordinates. For example, the variable "thetao" will have longitude, latitude, depth, and time, while the variable "surface_thetao" has only longitude, latitude, and time.

The coordinates really are variable-dependent. I guess they are on the same grid, so we could indeed expose generic coordinate values, but I would need to check that this is actually the case.

Maybe something like part.get_coordinates, which returns the grid of the dataset, could help?
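
A rough, hypothetical sketch of what such a getter could do. Note that get_coordinates does not exist in the toolbox (yet), and deduplicating by coordinate_id assumes all variables share one grid, which is exactly what would need checking:

# hypothetical part-level getter: union of all coordinates over every
# service/variable, deduplicated by coordinate_id
def get_coordinates(part):
    coordinates = {}
    for service in part.services:
        for variable in service.variables:
            for coordinate in variable.coordinates:
                coordinates.setdefault(coordinate.coordinate_id, coordinate)
    return list(coordinates.values())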

@veenstrajelmer (Collaborator, Author) commented Jan 27, 2025

Thanks for your reply. Something like part.get_coordinates that returns the coordinates of the dataset would indeed be useful. I have created mercator-ocean/copernicus-marine-toolbox#277 just now. As you also mentioned, the coordinates can differ per variable (e.g. depth won't be present on a surface/bottom variable), but the dataset/part coordinates would include all variable coordinates.

Some remaining questions:

  • I was also wondering what .from_metadata_item() is supposed to do. I have seen it at several levels of the nested classes returned by describe, but I could not get any of them to work. Is it meant to be helpful in this case, or is it an unrelated function?
  • When reading about parts, they are for instance used for in-situ data. I guess the temporal extents are equal for all parts of a dataset version (see the sketch after this list)? It might also be sensible to get the coordinates from the dataset version instead (so one level up).
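
A quick, untested sketch to check that assumption, reusing the nesting from the first snippet. It assumes parts expose a name attribute, which I have not verified:

# untested: collect one (part, time_min, time_max) tuple per part of a
# version; if all parts share the same min/max, the assumption above holds
# (part.name is an assumed attribute)
version = data.products[0].datasets[0].versions[0]
extents = set()
for part in version.parts:
    for service in part.services:
        for variable in service.variables:
            for coordinate in variable.coordinates:
                if coordinate.coordinate_id == "time":
                    extents.add((part.name, coordinate.minimum_value, coordinate.maximum_value))
print(extents)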

@renaudjester

Thanks for the issue! We will look at it :D

About your questions:

  • It's a function used in the code to parse the STAC metadata, so it is not intended for users (I should probably have made it private). You can see that it is not in the documentation. So I would say that it's not related.
  • Not sure about the in-situ data, to be honest! But some datasets now have ARCO data in stereographic projection (the originalGrid part), and in this case the coordinates are different.

Writing this, I am thinking of something: maybe we could reverse the describe result to some extent 🤔 so that get_coordinates returns something like:

[
    Coordinate(
        ...,  # all the coordinate information
        variables=["uo", ...],
        service=["arco-geo-series", ...],
        part=["default", "originalGrid", ...],
    ),
    ...
]
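
For illustration, selecting from such a flattened structure could then become a single filter. This is hypothetical: the coordinates list below follows the sketched structure above and does not exist in the toolbox.

# hypothetical selection over the reversed structure sketched above
time_coords = [
    coordinate for coordinate in coordinates
    if coordinate.coordinate_id == "time"
    and "uo" in coordinate.variables
    and "arco-geo-series" in coordinate.service
    and "default" in coordinate.part
]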

Would that make it easier to select?

@veenstrajelmer (Collaborator, Author)

Do you mean decreasing the layering like:

  • product >> dataset >> version >> coordinates/variables/services/parts
    instead of:
  • product >> dataset >> version >> part >> service >> variable >> coordinates

Not sure if I get it completely. However, I guess any solution that makes it easier and more robust to get the metadata would be great.
