
Check time steps against expectations as part of the QC tests #99

Open
1 of 3 tasks
jbusecke opened this issue Feb 12, 2024 · 1 comment


jbusecke commented Feb 12, 2024

I just checked on the iids requested in #72 and noticed that there might have been a bug in the ingestion of previous LEAP ingested stores.

I ran the following:

```python
import intake

iids = [
    'CMIP6.HighResMIP.MOHC.HadGEM3-GC31-HH.highres-future.r1i1p1f1.Omon.so.gn.v20200514',
    'CMIP6.HighResMIP.NERC.HadGEM3-GC31-HH.hist-1950.r1i1p1f1.Omon.so.gn.v20200514',
    'CMIP6.HighResMIP.NERC.HadGEM3-GC31-HH.hist-1950.r1i1p1f1.Omon.thetao.gn.v20200514',
    'CMIP6.HighResMIP.MOHC.HadGEM3-GC31-HH.highres-future.r1i1p1f1.Omon.thetao.gn.v20200514'
]

# Only stores that pass the current tests are listed in this catalog
url = "https://storage.googleapis.com/cmip6/cmip6-pgf-ingestion-test/catalog/catalog.json"
col = intake.open_esm_datastore(url)

CMIP6_naming_schema = "mip_era.activity_id.institution_id.source_id.experiment_id.member_id.table_id.variable_id.grid_label.version"

for iid in iids:
    print("==============================================================================")
    print(f'Checking for {iid=}')
    facet_dict = dict(zip(CMIP6_naming_schema.split('.'), iid.split('.')))
    # We do not catalog the mip_era... which we probably should? TODO: raise an issue
    del facet_dict['mip_era']
    cat = col.search(**facet_dict)
    if len(cat) == 0:
        print('Not found in catalog')
    elif len(cat) == 1:
        path = cat.df['zstore'].tolist()[0]
        ddict = cat.to_dataset_dict()
        name, ds = ddict.popitem()
        print(f'{name=} Found in catalog at \n{path=}')
        display(ds)  # `display` is available in IPython/Jupyter sessions
    else:
        print("Found more than one entry. ⛔️ This should not happen.")
```

This showed that all iids are already in the catalog (under the 'LEAP legacy' prefix 'gs://cmip6/CMIP6_LEAP_legacy/'). Each of these datasets has only 2 time steps, which leads me to conclude that we have ingested pruned datasets into the main catalog. The way this is currently set up, that should not happen, but we need to overwrite these stores now (related to #76) and build better checks in the future.
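The check the issue title asks for could be sketched roughly like this. This is a minimal sketch, not the pipeline's actual QC code: the 1950–2014 span used for hist-1950 is an illustrative assumption, and in practice the expected range would come from the experiment metadata rather than being hard-coded.

```python
import pandas as pd

def expected_monthly_steps(start: str, end: str) -> int:
    """Number of monthly time steps between two dates, inclusive."""
    return len(pd.date_range(start=start, end=end, freq="MS"))

def check_time_steps(n_actual: int, start: str, end: str) -> bool:
    """QC check: does the store's time axis length match expectations?

    A pruned dataset (e.g. only 2 time steps) fails this check.
    """
    return n_actual == expected_monthly_steps(start, end)

# hist-1950 nominally spans 1950-2014 (assumption for illustration),
# i.e. 65 years * 12 months = 780 monthly steps for an Omon table.
print(expected_monthly_steps("1950-01-01", "2014-12-31"))  # 780
print(check_time_steps(2, "1950-01-01", "2014-12-31"))     # False: pruned store
```

Applied to the stores above, `len(ds.time)` would be compared against the expected count and a 2-step pruned store would be flagged immediately.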

Here are the steps I am taking next:

@jbusecke (Collaborator, Author) commented:

This is now possible due to the full API request injection (see #133 (comment)).
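One way the stored API response could feed such a check is by reading the dataset's time bounds from the ESGF Search API. This is a hedged sketch only: the `datetime_start`/`datetime_stop` fields and the `instance_id` facet come from the ESGF Search API, but the endpoint choice and the helper itself are assumptions, not the pipeline's actual code.

```python
import urllib.parse

# Assumed index node; any ESGF search endpoint with the same API would do.
ESGF_SEARCH = "https://esgf-node.llnl.gov/esg-search/search"

def esgf_query_url(iid: str) -> str:
    """Build an ESGF Search API request for one instance id (hypothetical helper).

    The response's datetime_start/datetime_stop fields could supply the
    expected time range for a QC comparison against the ingested store.
    """
    params = {
        "type": "Dataset",
        "format": "application/solr+json",
        "instance_id": iid,
        "fields": "instance_id,datetime_start,datetime_stop",
    }
    return ESGF_SEARCH + "?" + urllib.parse.urlencode(params)

print(esgf_query_url(
    "CMIP6.HighResMIP.NERC.HadGEM3-GC31-HH.hist-1950.r1i1p1f1.Omon.so.gn.v20200514"
))
```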
