Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for daily output frequency #9

Closed
wants to merge 7 commits into from

Conversation

duncanwp
Copy link

I needed to support the daily table_id for my climatebench-feedstock so have made a first pass at fixing #3 here.

Happy for any feedback / other approaches.

Copy link
Owner

@jbusecke jbusecke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Dear @duncanwp, thank you very much for this very helpful contribution. I think the changes are all good as is.
I think it would be a good idea to add some simple unit tests for get_timesteps_simple to ensure we get consistent results for a set of given inputs.

@@ -13,21 +13,35 @@
# TODO: I might not want to allow this in the ocean and ice fields. Lets see
)

daily_divisors = sorted(
[1, 30, 360, 365, 3653]
# TODO: It might be more efficient to default to the underlying file chunks?
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I dont think so. Dealing with the underlying chunks, and outputting them in a way that makes sense to the user is one of the core tasks pangeo-forge aims to solve. These divisons should rather reflect scientifically useful units to e.g. avoid splitting e.g. a yearly chunk with a few days left, as this would make certain operations (annual average) much slower to compute.
These look good for now, but I was wondering about different calendars and how we could maybe influence which divisors to use depending on the calendar of the data.
Just thinking out loud here.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, that makes sense. As far as I know there isn't any ESGF metadata to describe the calendar though and we'd have to open the files to inspect the time coordinate metadata directly which would be pain.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I think this is a problem to be pushed to later work.

allowed_divisors = {
"Omon": monthly_divisors,
"SImon": monthly_divisors,
"Amon": monthly_divisors,
"day": daily_divisors,
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would these dividers also work generally for other daily data (see here: https://github.com/WCRP-CMIP/CMIP6_CVs/blob/master/CMIP6_table_id.json)?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I honestly have no idea, I didn't realise there were different day table_id and can't find any documentation on what they represent. They might be the same but it seems a bit redundant. Happy to include them or not depending on how much you want to protect against potential future issues.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will try to check this PR out in the next few days, and maybe run some tests with the 'other' daily frequencies.

elif "day" in table_id:
timesteps = [
m * 30 for m in months
] # This is close enough without worrying about calendars etc.
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See my comment above. I guess dealing with calendars is a whole can of worms, but I am thinking it might be worth to put some of these thoughts into a dedicated issue to keep track of it? Again for now this seems fine for me!

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, that sounds like a plan

@duncanwp
Copy link
Author

duncanwp commented Nov 4, 2022

Great, happy to add a test or two - but there doesn't seem to be a framework. How do you want to set it up? Would it make sense as a separate PR?

Copy link
Owner

@jbusecke jbusecke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can merge this soon. I hope to get some time to play around with this in the next days. @duncanwp could you provide me with the instance ids of the data you are using for climatebench right now?

pangeo_forge_esgf/dynamic_kwargs.py Outdated Show resolved Hide resolved
@@ -13,21 +13,35 @@
# TODO: I might not want to allow this in the ocean and ice fields. Lets see
)

daily_divisors = sorted(
[1, 30, 360, 365, 3653]
# TODO: It might be more efficient to default to the underlying file chunks?
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I think this is a problem to be pushed to later work.

allowed_divisors = {
"Omon": monthly_divisors,
"SImon": monthly_divisors,
"Amon": monthly_divisors,
"day": daily_divisors,
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will try to check this PR out in the next few days, and maybe run some tests with the 'other' daily frequencies.

@duncanwp
Copy link
Author

I don't have an explicit list sorry but I extract all NorESM2-LM files for all ensemble members matching the following experiments and variables:

experiments = [
               '1pctCO2', 'abrupt-4xCO2', 'historical', 'piControl', # CMIP
               'hist-GHG', 'hist-aer', # DAMIP
               'ssp126', 'ssp245', 'ssp370', 'ssp370-lowNTCF', 'ssp585' #	ScenarioMIP
]
variables = [
             'tas', 'tasmin', 'tasmax', 'pr'
]

See the original (hacky!) script here: https://github.com/duncanwp/ClimateBench/blob/main/prepare_data.py

@duncanwp
Copy link
Author

duncanwp commented Feb 2, 2023

bump

1 similar comment
@duncanwp
Copy link
Author

bump

@jbusecke
Copy link
Owner

Hi @duncanwp, sorry for the radio silence here. I am just now returning to this work in my context of transforming CMIP data from ESGF. I think that some work I have started upstream in pangeo-forge-recipes could entirely eliminate the need for dynamic kwargs generation in pangeo-forge-esgf. This would be really nice since the issue is solved on a much more general level. The new beam-refactor enables a lot more flexibility.
If you are ok with this I am tending towards moving this discussion/feature further upstream?

@jbusecke jbusecke mentioned this pull request Jul 28, 2023
@duncanwp
Copy link
Author

duncanwp commented Aug 2, 2023

No worries, sounds good. I'd still be keen to look at daily ClimateBench variables as a potential use-case so would appreciate a pointer on where to get started with that!

@jbusecke
Copy link
Owner

jbusecke commented Aug 2, 2023

I am also still very curious about that use case @duncanwp, and I truly appreciate your patience. I think we can try to support that from the LEAP data-ingestion side (or at least focus discussions there).
Could you raise an issue here so we can discuss this usecase with respect to the new beam refactor (hopefully entraining @cisaacstern in the process 😄).

FYI here is some more info on LEAP and how we are trying to handle data ingestion (even though your use case is more of a derived data product than a canonical ingestion): https://leap-stc.github.io/leap-pangeo/jupyterhub.html#i-have-a-dataset-and-want-to-work-with-it-on-the-hub-how-do-i-upload-it

Either way having an outline of your plan there would be very helpful to keep things focused.

@duncanwp
Copy link
Author

duncanwp commented Aug 3, 2023

Sure, sounds good! I just created an issue over there: leap-stc/data-management#43

@duncanwp duncanwp closed this Aug 3, 2023
@jbusecke
Copy link
Owner

jbusecke commented Aug 7, 2023

Fantastic. Thank you so much.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants