Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

dynamic params and facets #4

Closed
wants to merge 72 commits into from

Conversation

larsbuntemeyer
Copy link

@larsbuntemeyer larsbuntemeyer commented Oct 2, 2022

This is just adding my first experience with ESGF CORDEX API. It's basically just adapting request parameters and facets for different ESGF projects.

Copy link
Owner

@jbusecke jbusecke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Dear @larsbuntemeyer thank you so much for the engagement here! I really appreciate your help with this.
Your Pr has exposed some issues in my original implementation(especially the lack of tests) and made me think of more general cases (which is very great).
I think that I would like to understand how we can generalize this code to even more mips in a general way (hopefully increasing the usefulness).
I left some detailed comments, but except for some smaller issues (hardcoding table_id=Amo) I think this Pr is a very good addition. Lets iterate on the details and then we can merge this?

) -> Tuple[List[str], Dict[str, Dict[str, str]]]:

table_id = facets_from_iid(iid).get("table_id")
# really hacky!
table_id = "Amon" # facets_from_iid(iid).get("table_id")
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure this is a good solution. I think basically we need a "facet string" argument that encodes the names of facets as they appear in the instance id.
It would be great to get a sample of more (if there are) different archives (not sure what the right term for the highest level (e.g CMIP, CORDEX) is.
Once we get an idea how many there are and how the different instance ids are formatted we might be able to generalize this more.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, this was just a quick hack (i didn't actually want to commit that)... I think the highest level term is called project as done, e.g., in the facet search example...

@@ -0,0 +1,27 @@
# dataset id templates
cmip6_template = "mip_era.activity_id.institution_id.source_id.experiment_id.variant_label.table_id.variable_id.grid_label.version"
cordex_template = "project.product.domain.institute.driving_model.experiment.ensemble.rcm_name.rcm_version.time_frequency.variable"
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah this actually partially fits into my earlier comment. But hardcoding "table_id" is still not ideal.
Which facet describes time frequency in CORDEX? I assume its "time_frequency"? Are the values "conpatible" with CMIP Controlled vocabulary?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I dived a little more into the ESGF API documentation and added two functions to retrieve that information from the API now.

Copy link
Author

@larsbuntemeyer larsbuntemeyer Oct 5, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, the time frequency in CORDEX is described in "time_frequency", the values are not 100% percent compatible but they contain basically the same vocabulary as in CMIP5, e.g., "day", "mon", "1hr", etc... However, there is no indication of the time cell method, e.g., "1hrPt" as in CMIP6...

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we could also access the time frequency facet although it's not part of the dataset_id. It still has different names for the ESGF projects, (e.g. time_frequency in CORDEX but frequency in CMIP6). In general, the CORDEX vocabulary is derived from CMIP5 vocabulary. However, this could easily become a class attribute as you suggested...


recipe_inputs = asyncio.run(generate_recipe_inputs_from_iids(iids, ssl=sslcontext))
print("DONE")
print(recipe_inputs)
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is very nice! I need to add tests for CMIP cases too.
Can we actually assert here that recipes were generated (and that they have attributes we desire?).
Testing here will be tricky, since we might have to account for certain glitchy things on the ESGF server side. But fir now this is a great step forward!

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to mention here, that the actual recipe creation won't work with the latest version of pange-forge-recipes because of pangeo-forge/pangeo-forge-recipes#418... when having an SSLObject in the FilePattern...

@@ -140,6 +146,9 @@ async def _esgf_api_request(
if params["latest"] == "true":
if "version" in facets:
del facets["version"]
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have previously wondered that this seems yanky. If I set "latest" to false in the requests I can actually provide the version. Is that behavior the same for CORDEX?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't really know.. I think in CORDEX, datasets are always the "latest" version while retracted versions are removed from ESGF...

@larsbuntemeyer
Copy link
Author

larsbuntemeyer commented Oct 5, 2022

Thanks for your all your comment @jbusecke ! I really appreciate it! I'll clean this up now and let you know when i am ready to merge!

Comment on lines 19 to 37
"""Requests the dataset_id string template for an ESGF project"""
if url is None:
url = "https://esgf-node.llnl.gov/esg-search/search"
params = {
"project": project,
"fields": "*",
"limit": 1,
"format": "application/solr+json"
}
r = requests.get(url, params)
#print(r.status_code)
return r.json()["response"]["docs"][0]["dataset_id_template_"][0]


def facets_from_template(template: str):
"""Parse the (dataset_id) string template into a list of (facet) keys"""
regex = r"\((.*?)\)"
return re.findall(regex, template)
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I dived a little bit into the ESGF API documentation and added now these two functions that should be able to retrieve the dataset_id_template_ (and parse it into facets) for any ESGF project. I don't use them yet but i think this is better than what i did in the params module, e.g.,

template = utils.get_dataset_id_template("CMIP6")
# returns '%(mip_era)s.%(activity_drs)s.%(institution_id)s.%(source_id)s.%(experiment_id)s.%(member_id)s.%(table_id)s.%(variable_id)s.%(grid_label)s'

which can be parsed into:

utils.facets_from_template(template)
['mip_era',
 'activity_drs',
 'institution_id',
 'source_id',
 'experiment_id',
 'member_id',
 'table_id',
 'variable_id',
 'grid_label']

Copy link
Author

@larsbuntemeyer larsbuntemeyer Oct 5, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmmm, here we have member_id instead of variant_label, that seems not right, seems that both work...

Copy link
Owner

@jbusecke jbusecke Oct 6, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh this was a long standing issue of confusion for me. I have only ever encountered different value for these keys in DCPP experiments. An example here:
image
Here you can see that variant_label has a different value then member_id:
image

So long story short I think I mixed those values up! I believe that this occurence here, needs to be changed (which your PR does) since in some cases the value of 'member_id' does not uniquely identify a single dataset. For most experiments the values are interchangeable AFAIK.

@jbusecke
Copy link
Owner

jbusecke commented Oct 6, 2022

This looks really cool. Ultimately I wonder if we should substantially refactor this logic and make it more object oriented.
Just thinking out loud:
Should we have something like this:

class ESGFBuilder:
    def __init__(project, ):
        ...defines class properties like the facets from project type via the request you outlined above...
    
    ... set up the connection to the server for the duration of the session...
    ...logic that retrieves urls etc...
    ...method that returns recipes...

I will have to invest some more time into this refactor, and wrapping my mind around async, but I just wanted to brain dump this here in case you had some input.

I really like the direction this is going for, generalizing this tool for different ESGF projects!

```

using a different url:
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you clarify if this is necessary or if this is just to show that the functionality works on all the ESGF search nodes?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that was just to show, that you can choose a different url. CORDEX searches should actually also work for all ESGF nodes.

@@ -96,9 +97,11 @@ async def response_data_processing(
session: aiohttp.ClientSession,
response_data: List[Dict[str, str]],
iid: str,
ssl: ssl.SSLContext = None,
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For now this is great, but adding all these input arguments is a bit of a code smell to me. I think we should aim to set up a 'Connection' or 'Client' object once and then pass this around the different logic that needs to make a request?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed!

@@ -110,7 +113,7 @@ async def response_data_processing(
# print(urls)
print(f"{iid}: Check for netcdf 3 files")
pattern_kwargs = {}
netcdf3_check = await is_netcdf3(session, urls[-1])
netcdf3_check = await is_netcdf3(session, urls[-1], ssl)
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here.

@@ -48,6 +49,8 @@
"esgf.ichec.ie",
"esgf.nci.org.au",
"esgf.rcec.sinica.edu.tw",
"esgf1.dkrz.de",
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice. Thanks

picked_data_node = data_node
print(f"DEBUG: Picking preferred data_node: {picked_data_node}")
break
else:
print(f"Got status {status} for {matching_data_node['instance_id']}")
elif len(matching_data_nodes) == 0:
print(f'DEBUG: Data node: {data_node} not available')
print(f"DEBUG: Data node: {data_node} not available")
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are a lot of these formatting issues here. Do you think it would be useful if I prepend a PR implementing proper pre-commit linting and you can refactor this? Or would that cause more trouble at this point?

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#5 would be ready to merge. Let me know if it would help merging that before.
I personally would have a much easier time to focus on the functional changes you made here if we can get rid of the formatting stuff.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, absolutely! I woud very much appreciate the pre-commit config!

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok let me merge #5 and then you can rebase this.

@larsbuntemeyer
Copy link
Author

larsbuntemeyer commented Oct 17, 2022

Just some thoughts here for myself: I have to rethink the mip keyword, it does not make sense. It should be project which is the API keyword. There is an additional complication in the CORDEX project since there is an additional project keyword in the facet search, e.g., there can be project=cordex where the project facet is hardcode in the dataset_id_template:

"id":"cordex.output.EUR-44.MPI-CSC.MPI-M-MPI-ESM LR.historical.r1i1p1.REMO2009.v1.mon.tas.v20150609|esgf1.dkrz.de",
"dataset_id_template_":["cordex.%(product)s.%(domain)s.%(institute)s.%(driving_model)s.%(experiment)s.%(ensemble)s.%(rcm_name)s.%(rcm_version)s.%(time_frequency)s.%(variable)s"]

or project=cordex-reklies where the project facet can be parsed:

"id":"cordex-reklies.output.EUR-11.GERICS.MIROC-MIROC5.historical.r1i1p1.REMO2015.v1.mon.tas.v20170329|esgf1.dkrz.de",
 "dataset_id_template_":["%(project)s.%(product)s.%(domain)s.%(institute)s.%(driving_model)s.%(experiment)s.%(ensemble)s.%(rcm_name)s.%(rcm_version)s.%(time_frequency)s.%(variable)s"]

I have to make sure to search all cordex project not only project=cordex...

@larsbuntemeyer
Copy link
Author

larsbuntemeyer commented Oct 18, 2022

More thoughts:

If i chose project=CORDEX-Reklies, the product facet gets inactive:
grafik
That's really weired since it is still part of the dataset_id.... hmmm.

It works now, but required some ugly exceptions. Have to clean it up some more.

@jbusecke
Copy link
Owner

If i chose project=CORDEX-Reklies, the product facet gets inactive:

Is there a value when you query the API directly? I wonder if this is some glitch in the GUI

Also is there a list of all available projects? I have to admit we are moving into somewhat unfamiliar territory for me here, and I hope that having a full list will give us an idea how to most efficiently generalize this.

@larsbuntemeyer
Copy link
Author

larsbuntemeyer commented Nov 5, 2022

I'll try to remove some of the if statements that are hacked in depending on the project. I think, this could then be more generalized. The ESGF CORDEX project is not always as consistent as it could be, that's why there might be come exceptions. Sorry for the late responses here, i had to dig through some old grumpy Fortran code the last week... :-/

@larsbuntemeyer
Copy link
Author

Although the recipe run works locally, feedstock creation is not yet possible due to pangeo-forge/pangeo-forge-recipes#79 and pangeo-forge/pangeo-forge-recipes#418.

@larsbuntemeyer larsbuntemeyer marked this pull request as ready for review November 7, 2022 20:39
@jbusecke
Copy link
Owner

Hey @larsbuntemeyer just checking in here (sorry for the long silence). I just wanted to make you aware of some new developments in PGF-recipes (see #9 (comment)) that in my mind now completely eliminate the need for the dynamic_kwargs.py module. I am planning to fully deprecate this logic and move towards handling those issues in the recipe generation itself.

I have started (and will continue) to work on #15 which reflects these plans. I would be curious if you agree that the upstream changes eliminate your need for the dynamic_kwargs module as well?

@larsbuntemeyer
Copy link
Author

Hey @jbusecke , thanks for looking back into this after all! Sorry for the long absence, i had to kick some model runs regularly which took all my attention! I agree, that the refactoring makes this redundant. I also have to get myself up to date with the Beam refactor in PGF now.

@jbusecke
Copy link
Owner

Hey @larsbuntemeyer just digging out of a mountain of gh notifications and found this. I have quite substantially refactored this package since the new pangeo-forge version (based on apache beam) became available.

  • I removed the dynamic_kwargs.py entirely. The process was very imprecise and slow anyways. We can actually now dynamically chunk by providing a custom function to the pangeo-forge-recipe stage. Check out this recent presentation for some details. I have made a new repo specifically for chunking algorithms, so if you had some custom logic in mind, please check it out: https://github.com/jbusecke/dynamic_chunks
  • I also substantially refactored the ESGF client. This was all really done to serve my CMIP6 ingestion (), which now works really well, but I have to apologize for moving fast and destructive here. I think we could still accomodate CORDEX data here, but I think this would have to be a new PR?

Please let me know if you are still interested in this, or have (understandibly) moved on.

@larsbuntemeyer
Copy link
Author

Hey @jbusecke thanks for the updates! I have been following a bit your work and just watched your Pangeo showcase recently, great work! I have been moving on a bit with pange-forge-cordex to ingest CORDEX data on AWS.

However, this is all based on pangeo-forge-recipes < 0.10.0, so instead of refactoring that, i will check out your work and see how far i get. But a new PR will definitely make sense, so this one should be closed!

@jbusecke jbusecke closed this Jan 18, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants