add an ExcludeFilter #106

Open
mathause opened this issue Oct 14, 2024 · 1 comment

@mathause
Collaborator

In my IPCC analyses I had to remove many simulations (model-variable, model-scenario-ensemble, or other combinations). Even for the cleaned 'new-generation' repos, mesmer has to remove some of the simulations. For IPCC I did that in the data processing loop, but I think it would be better done in the 'find the simulations' part (i.e. in filefinder). So we should add an ExcludeFilter (better names always welcome). We'd need to think about how the metadata of the excluded simulations is passed.

For IPCC I have a function which identifies matching metadata:

https://github.com/IPCC-WG1/Chapter-11/blob/d1a3a99f242a568fb4cefc36a038c888a90b9d37/code/fixes/_fixes_common.py#L47

But maybe we could also use pandas machinery, e.g. isin():

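# NOTE: untested

import numpy as np
import pandas as pd

conditions = [
    # remove AWI ocean data: has an unstructured grid
    {
        "table": ["Oday", "Ofx", "Omon", "SIday", "SImon"],
        "model": ["AWI-CM-1-1-MR", "AWI-ESM-1-1-LR"],
    },
    # tasmax and tasmin are wrong for CESM
    {
        "table": "day",
        "varn": ["tasmax", "tasmin"],
        "model": ["CESM2", "CESM2-WACCM"],
    },
    ...
]

# start by keeping everything, then drop the rows matching any condition
to_keep = pd.Series(True, index=df.index)

for condition in conditions:
    # a row matches only if all keys of the condition match;
    # np.atleast_1d allows scalars (e.g. "day") as well as lists
    matches = np.logical_and.reduce(
        [df[key].isin(np.atleast_1d(cond)) for key, cond in condition.items()]
    )
    to_keep &= ~matches

# to_keep is a boolean Series, so index with .loc (not .iloc)
df = df.loc[to_keep]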
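
To check the masking logic on something concrete, here is a small self-contained sketch. The helper name `exclude_matching` and the toy metadata rows are made up for illustration. Only the two conditions come from the list above, and this is not how filefinder currently works.

```python
import numpy as np
import pandas as pd


def exclude_matching(df, conditions):
    """Drop rows that match all keys of any one condition (hypothetical helper)."""
    to_keep = pd.Series(True, index=df.index)
    for condition in conditions:
        matches = pd.Series(True, index=df.index)
        for key, cond in condition.items():
            # np.atleast_1d allows scalars ("day") as well as lists
            matches &= df[key].isin(np.atleast_1d(cond))
        to_keep &= ~matches
    return df.loc[to_keep]


# toy metadata table; the rows are invented for this example
df = pd.DataFrame(
    {
        "model": ["CESM2", "CESM2", "AWI-CM-1-1-MR", "AWI-CM-1-1-MR"],
        "table": ["day", "day", "Omon", "Amon"],
        "varn": ["tasmax", "tas", "tos", "tas"],
    }
)

conditions = [
    {
        "table": ["Oday", "Ofx", "Omon", "SIday", "SImon"],
        "model": ["AWI-CM-1-1-MR", "AWI-ESM-1-1-LR"],
    },
    {
        "table": "day",
        "varn": ["tasmax", "tasmin"],
        "model": ["CESM2", "CESM2-WACCM"],
    },
]

kept = exclude_matching(df, conditions)
print(kept)
```

Only the CESM2 daily tasmax row and the AWI Omon row should be dropped here. The CESM2 daily tas row stays because a condition only applies when all of its keys match.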

FYI @veni-vidi-vici-dormivi

@veni-vidi-vici-dormivi
Collaborator

veni-vidi-vici-dormivi commented Oct 14, 2024

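That is a useful idea! As far as I can see, DataFrame.isin can take dictionaries (column name -> list of values), so even

for condition in conditions:
    # isin returns False for columns not in condition, so only check its keys
    to_keep &= ~df.isin(condition)[list(condition)].all(axis=1)

could work (the dict values need to be list-like, so "table": "day" would have to become ["day"]).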

I like ExcludeFilter. Also find_files could have an exclude= parameter but it might become too bulky...
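
A rough sketch of what such an ExcludeFilter could look like, assuming the metadata of the found files is available as a pandas DataFrame. The class, its `split` method, and the toy data are all hypothetical and not part of filefinder. Returning the excluded rows as well is one possible answer to the question of how their metadata gets passed on.

```python
import pandas as pd


class ExcludeFilter:
    """Hypothetical filter, not part of filefinder."""

    def __init__(self, conditions):
        # DataFrame.isin needs list-like dict values, so wrap scalars
        self.conditions = [
            {
                key: [val] if isinstance(val, str) else list(val)
                for key, val in cond.items()
            }
            for cond in conditions
        ]

    def split(self, df):
        """Return (kept, excluded) so the excluded metadata is not lost."""
        to_keep = pd.Series(True, index=df.index)
        for cond in self.conditions:
            # isin gives False for columns not named in cond: select its keys
            matches = df.isin(cond)[list(cond)].all(axis=1)
            to_keep &= ~matches
        return df.loc[to_keep], df.loc[~to_keep]


# toy metadata; the rows are invented for this example
df = pd.DataFrame(
    {
        "model": ["CESM2", "CESM2", "AWI-CM-1-1-MR"],
        "table": ["day", "day", "Omon"],
        "varn": ["tasmin", "pr", "tos"],
    }
)

exclude = ExcludeFilter(
    [
        {"table": ["Omon"], "model": ["AWI-CM-1-1-MR"]},
        {"table": "day", "varn": ["tasmax", "tasmin"], "model": ["CESM2"]},
    ]
)

kept, excluded = exclude.split(df)
print(kept)
print(excluded)
```

An exclude= parameter on find_files could simply accept the same list of condition dicts and build such a filter internally, which might keep the signature from getting too bulky.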
