update get_das_info to include empty and broken files #63

Merged: 7 commits merged into master from feature/broken_files_query on Oct 14, 2024

Conversation

@mafrahm mafrahm commented Aug 14, 2024

This PR adds auxiliary information to each dataset inst generated via the get_das_info script. The main difference is that we now need to load the metadata for each file instead of for the full dataset (e.g. f"dasgoclient -query='file dataset={dataset}' -json").
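
For illustration, a minimal sketch of what such a per-file query and classification could look like (the exact layout of the dasgoclient JSON output and the "nevents"/"is_file_valid" field names are assumptions here, not taken from this PR):

import json
import subprocess

def get_file_infos(dataset: str) -> list[dict]:
    # query DAS for the per-file metadata of the given dataset
    cmd = f"dasgoclient -query='file dataset={dataset}' -json"
    out = subprocess.check_output(cmd, shell=True).decode()
    # assumption: the output is a list of records, each wrapping a "file" list
    return [f for record in json.loads(out) for f in record.get("file", [])]

def classify_files(file_infos: list[dict]) -> tuple[list[str], list[str]]:
    # assumption: DAS exposes "nevents" and "is_file_valid" per file
    empty = [f["name"] for f in file_infos if f.get("nevents", 0) == 0]
    broken = [f["name"] for f in file_infos if not f.get("is_file_valid", 1)]
    return empty, broken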

I timed the runtime of get_das_info and compared it with new_get_das_info using the following command:

python3 get_das_info.py -d "/TTto*/Run3Summer22EENanoAODv12-130X*/NANOAODSIM"

The increase in runtime due to having to check the info of each file was negligible (50s vs 48s), even when running over 65 datasets with O(10k) files in total.

TODO:

  • decide how to store the metadata
    • two separate aux entries or one combined (--> changed to one combined + comments)
    • should we be more explicit when writing the n_files (e.g. 95-2 instead of 93 when there are 95 files, two of which are empty/broken) (--> added comment)
  • add the aux entries to all convert functions
  • consistency checks: is it sufficient to sum over all files ourselves, or should we compare this result with the one from the filesummaries service? (--> should be fine without checks)

@mafrahm mafrahm requested a review from pkausw August 14, 2024 11:35
@mafrahm mafrahm self-assigned this Aug 14, 2024

@pkausw pkausw left a comment

Hey @mafrahm, thanks for preparing this draft so fast! I put some small comments/suggestions in the code, but I'm also happy to discuss some more 👍

pkausw commented Aug 15, 2024

Regarding your todos:

  * two separate aux entries or one combined

Personally, I have a slight preference for one combined aux entry if possible. In principle, a file is broken (or rather unusable) for us if it's either invalid or empty. Having one combined entry would reduce the overhead for downstream checks, since you would only need to consider one entry instead of two. That said, it would still be nice to somehow document why a file is unusable, e.g. with a comment in the code. If this can't be realized easily in the automated compilation of the DAS info, I would also be happy with two aux entries.

  * should we be more explicit when writing the n_files (e.g. `95-2` instead of `93` when there are 95 files with two empty/broken ones)

I think explicitly writing 95-2 is an elegant way to do this, because we encode both the "official" number of files and the real (usable) number in one line.

  * consistency checks: is it sufficient to sum over all files ourselves, or should we compare this result with the one from the `filesummaries` service?

It might be nice to have such a consistency check. On the other hand, both numbers originate from the official CMS database, so if there were an inconsistency there, it would be very fundamental indeed. I personally don't think that we need to debug CMS services and am therefore happy with summing the number of events from usable files, but maybe others have different opinions.
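
If such a check were added later, it could look roughly like this (a sketch only; it assumes the DAS `summary` query exposes the filesummaries numbers under "nevents"/"nfiles" keys):

import json
import subprocess

def check_event_sum(dataset: str, summed_events: int) -> bool:
    # fetch the official dataset summary (backed by the filesummaries service)
    cmd = f"dasgoclient -query='summary dataset={dataset}' -json"
    out = subprocess.check_output(cmd, shell=True).decode()
    # assumption: the first record carries a "summary" list with aggregate numbers
    summary = json.loads(out)[0]["summary"][0]
    # the official count includes empty/broken files, so the sum over
    # usable files must not exceed it
    return summed_events <= summary["nevents"]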

@mafrahm mafrahm force-pushed the feature/broken_files_query branch from a6a701e to 315d002 on August 31, 2024 12:38

mafrahm commented Aug 31, 2024

I just added the discussed points to the code. Here is an exemplary dataset entry with empty files:

python3 get_das_info.py -d "/MuonEG/Run2022F-22Sep2023-v1/NANOAOD"
cpn.add_dataset(
    name="PLACEHOLDER",
    id=14784482,
    processes=[procs.PLACEHOLDER],
    keys=[
        "/MuonEG/Run2022F-22Sep2023-v1/NANOAOD",  # noqa
    ],
    aux={
        "broken_files": [
            "/store/data/Run2022F/MuonEG/NANOAOD/22Sep2023-v1/50000/4d76213a-ef14-411a-9558-559a6df3f978.root",  # empty
            "/store/data/Run2022F/MuonEG/NANOAOD/22Sep2023-v1/50000/4fb72196-3b02-4499-8f6c-a54e15692b32.root",  # empty
        ],
    },
    n_files=93,  # 95-2
    n_events=38219969,
)

When empty files are also broken, they will be marked as "# broken" instead.
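
For clarity, a possible sketch of that marking logic (a hypothetical helper, not the actual code in this PR):

def file_comment(is_empty: bool, is_broken: bool) -> str:
    # "broken" takes precedence: an empty file that is also broken
    # is marked as broken rather than empty
    if is_broken:
        return "# broken"
    return "# empty" if is_empty else ""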

And here is one example without empty or broken files (of course we could also skip the "broken_files" aux entirely; not sure what would be preferred):

python3 get_das_info.py -d "/Muon/Run2022F-22Sep2023-v2/NANOAOD"
cpn.add_dataset(
    name="PLACEHOLDER",
    id=14826624,
    processes=[procs.PLACEHOLDER],
    keys=[
        "/Muon/Run2022F-22Sep2023-v2/NANOAOD",  # noqa
    ],
    aux={
        "broken_files": [],
    },
    n_files=359,  # 359-0
    n_events=449887248,
)

@mafrahm mafrahm force-pushed the feature/broken_files_query branch from 315d002 to f2350e1 on August 31, 2024 12:43
@mafrahm mafrahm marked this pull request as ready for review October 14, 2024 11:43

@pkausw pkausw left a comment

LGTM, thanks!

@pkausw pkausw merged commit f1193f5 into master Oct 14, 2024
4 checks passed
@pkausw pkausw deleted the feature/broken_files_query branch October 14, 2024 13:03