update get_das_info to include empty and broken files #63

Merged: 7 commits merged into master from feature/broken_files_query on Oct 14, 2024

Conversation

@mafrahm mafrahm commented Aug 14, 2024

This PR adds auxiliary information to each dataset inst generated via the get_das_info script. The main difference is that we now need to load the metadata for each file instead of for the full dataset (e.g. f"dasgoclient -query='file dataset={dataset}' -json").
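
For illustration, a minimal sketch of what such a per-file query and classification could look like (the exact layout of the dasgoclient JSON output and the "nevents"/"is_file_valid" field names are assumptions here, not taken from this PR):

import json
import subprocess

def get_file_infos(dataset: str) -> list[dict]:
    # query DAS for the per-file metadata of the given dataset
    cmd = f"dasgoclient -query='file dataset={dataset}' -json"
    out = subprocess.check_output(cmd, shell=True).decode()
    # assumption: the output is a list of records, each wrapping a "file" list
    return [f for record in json.loads(out) for f in record.get("file", [])]

def classify_files(file_infos: list[dict]) -> tuple[list[str], list[str]]:
    # assumption: DAS exposes "nevents" and "is_file_valid" per file
    empty = [f["name"] for f in file_infos if f.get("nevents", 0) == 0]
    broken = [f["name"] for f in file_infos if not f.get("is_file_valid", 1)]
    return empty, broken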

I timed the runtime of get_das_info and compared it with new_get_das_info using the following command:

python3 get_das_info.py -d "/TTto*/Run3Summer22EENanoAODv12-130X*/NANOAODSIM"

The increase in runtime due to having to check the info of each file was negligible (50s vs 48s), even when running over 65 datasets with O(10k) files in total.

TODO:

  • decide how to store the metadata
    • two separate aux entries or one combined (--> changed to one combined + comments)
    • should we be more explicit when writing the n_files (e.g. 95-2 instead of 93 when there are 95 files, two of which are empty/broken) (--> added comment)
  • add the aux entries to all convert functions
  • consistency checks: is it sufficient to sum over all files ourselves, or should we compare this result with the one from the filesummaries service? (--> should be fine without checks)

@mafrahm mafrahm requested a review from pkausw August 14, 2024 11:35
@mafrahm mafrahm self-assigned this Aug 14, 2024

@pkausw pkausw left a comment

Hey @mafrahm, thanks for preparing this draft so fast! I put some small comments/suggestions in the code, but I'm also happy to discuss some more 👍

pkausw commented Aug 15, 2024

Regarding your todos:

  * two separate aux entries or one combined

Personally, I have a slight preference for one combined aux entry if possible. In principle, a file is broken (or rather unusable) for us if it's either invalid or empty. Having one combined entry would reduce the overhead for downstream checks, since you would only need to consider one entry instead of two. That said, it would still be nice to somehow document why a file is unusable, e.g. with a comment in the code. If this can't be realized easily in the automated compilation of the DAS info, I would also be happy with two aux entries.

  * should we be more explicit when writing the n_files (e.g. `95-2` instead of `93` when there are 95 files with two empty/broken ones)

I think explicitly writing 95-2 is an elegant way to do this, because we encode both the "official" number of files and the real (usable) number in one line.

  * consistency checks: is it sufficient to sum over all files ourselves, or should we compare this result with the one from the `filesummaries` service?

It might be nice to have such a consistency check. On the other hand, both numbers originate from the official CMS database, so if there were an inconsistency there, it would be very fundamental indeed. I personally don't think that we need to debug CMS services and am therefore happy with summing the number of events from usable files, but maybe others have different opinions.
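
If such a check were added later, it could look roughly like this (a sketch only; it assumes the DAS `summary` query exposes the filesummaries numbers under "nevents"/"nfiles" keys):

import json
import subprocess

def check_event_sum(dataset: str, summed_events: int) -> bool:
    # fetch the official dataset summary (backed by the filesummaries service)
    cmd = f"dasgoclient -query='summary dataset={dataset}' -json"
    out = subprocess.check_output(cmd, shell=True).decode()
    # assumption: the first record carries a "summary" list with aggregate numbers
    summary = json.loads(out)[0]["summary"][0]
    # the official count includes empty/broken files, so the sum over
    # usable files must not exceed it
    return summed_events <= summary["nevents"]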

@mafrahm mafrahm force-pushed the feature/broken_files_query branch from a6a701e to 315d002 on August 31, 2024 12:38

mafrahm commented Aug 31, 2024

I just added the discussed points to the code. Here is an exemplary dataset entry with empty files:

python3 get_das_info.py -d "/MuonEG/Run2022F-22Sep2023-v1/NANOAOD"
cpn.add_dataset(
    name="PLACEHOLDER",
    id=14784482,
    processes=[procs.PLACEHOLDER],
    keys=[
        "/MuonEG/Run2022F-22Sep2023-v1/NANOAOD",  # noqa
    ],
    aux={
        "broken_files": [
            "/store/data/Run2022F/MuonEG/NANOAOD/22Sep2023-v1/50000/4d76213a-ef14-411a-9558-559a6df3f978.root",  # empty
            "/store/data/Run2022F/MuonEG/NANOAOD/22Sep2023-v1/50000/4fb72196-3b02-4499-8f6c-a54e15692b32.root",  # empty
        ],
    },
    n_files=93,  # 95-2
    n_events=38219969,
)

When empty files are also broken, they will be marked as "# broken" instead.
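
For clarity, a possible sketch of that marking logic (a hypothetical helper, not the actual code in this PR):

def file_comment(is_empty: bool, is_broken: bool) -> str:
    # "broken" takes precedence: an empty file that is also broken
    # is marked as broken rather than empty
    if is_broken:
        return "# broken"
    return "# empty" if is_empty else ""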

And here is one example without empty or broken files (of course we could also skip the "broken_files" aux entirely; not sure what would be preferred):

python3 get_das_info.py -d "/Muon/Run2022F-22Sep2023-v2/NANOAOD"
cpn.add_dataset(
    name="PLACEHOLDER",
    id=14826624,
    processes=[procs.PLACEHOLDER],
    keys=[
        "/Muon/Run2022F-22Sep2023-v2/NANOAOD",  # noqa
    ],
    aux={
        "broken_files": [],
    },
    n_files=359,  # 359-0
    n_events=449887248,
)

@mafrahm mafrahm force-pushed the feature/broken_files_query branch from 315d002 to f2350e1 on August 31, 2024 12:43
@mafrahm mafrahm marked this pull request as ready for review October 14, 2024 11:43

@pkausw pkausw left a comment

LGTM, thanks!

@pkausw pkausw merged commit f1193f5 into master Oct 14, 2024
4 checks passed
@pkausw pkausw deleted the feature/broken_files_query branch October 14, 2024 13:03