update get_das_info to include empty and broken files #63
Conversation
Hey @mafrahm, thanks for preparing this draft so fast! I put some small comments/suggestions in the code, but I'm also happy to discuss some more 👍
Regarding your todos:
Personally, I have a small preference for one combined aux entry if possible. In principle, a file is broken (or rather unusable) to us if it is either invalid or empty. Having one combined entry would reduce the overhead for downstream checks, since you only need to consider one entry instead of two. That said, it would still be nice to document why a file is unusable, e.g. with a comment in the code. If this can't be realized easily in the automated compilation of the DAS info, I would also be happy with two aux entries.
I think explicitly writing the subtraction in a comment (e.g. "95 - 2" instead of "93") would be nice.
It might be nice to have such a consistency check. On the other hand, both pieces of info originate from the official CMS database, so if there were an inconsistency there, it would be very fundamental indeed. I personally don't think that we need to debug CMS services and am therefore happy with summing the number of events from usable files, but maybe others have other opinions.
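To illustrate the point about a single combined aux entry: with one "broken_files" list, downstream code only needs a single lookup to skip unusable files. The sketch below is purely illustrative; the function and variable names (usable_files, dataset_aux) are hypothetical and not part of the actual codebase.

```python
# Hypothetical sketch of a downstream filter that relies on one
# combined "broken_files" aux entry (covering both invalid and empty files).

def usable_files(all_lfns, dataset_aux):
    """Return the LFNs that are neither broken nor empty."""
    broken = set(dataset_aux.get("broken_files", []))
    return [lfn for lfn in all_lfns if lfn not in broken]

# usage with made-up paths
aux = {"broken_files": ["/store/data/X/bad.root"]}
files = ["/store/data/X/good.root", "/store/data/X/bad.root"]
print(usable_files(files, aux))  # -> ['/store/data/X/good.root']
```

With two separate aux entries (e.g. one for invalid and one for empty files), every consumer would have to merge both lists first, which is the overhead mentioned above.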
I just added the discussed points to the code. Here is an example dataset entry with empty files:
cpn.add_dataset(
name="PLACEHOLDER",
id=14784482,
processes=[procs.PLACEHOLDER],
keys=[
"/MuonEG/Run2022F-22Sep2023-v1/NANOAOD", # noqa
],
aux={
"broken_files": [
"/store/data/Run2022F/MuonEG/NANOAOD/22Sep2023-v1/50000/4d76213a-ef14-411a-9558-559a6df3f978.root", # empty
"/store/data/Run2022F/MuonEG/NANOAOD/22Sep2023-v1/50000/4fb72196-3b02-4499-8f6c-a54e15692b32.root", # empty
],
    },
n_files=93, # 95-2
n_events=38219969,
)
When empty files are also broken, they will be marked as "# broken" instead. And here is an example without empty files (of course we could also skip the "broken_files" aux entirely, not sure what would be preferred):
cpn.add_dataset(
name="PLACEHOLDER",
id=14826624,
processes=[procs.PLACEHOLDER],
keys=[
"/Muon/Run2022F-22Sep2023-v2/NANOAOD", # noqa
],
aux={
"broken_files": [],
},
n_files=359, # 359-0
n_events=449887248,
)
LGTM, thanks!
This PR adds auxiliary information to each dataset inst that is generated via the get_das_info script. The main difference is that we now need to load the metadata for each file instead of for the full dataset (e.g. f"dasgoclient -query='file dataset={dataset}' -json"). I timed the runtime of the original get_das_info and compared it with the new get_das_info. The increased runtime due to having to check the info of each file was negligible (50s vs 48s), even when running over 65 datasets with O(10k) files in total.
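The per-file query shape quoted above ("dasgoclient -query='file dataset={dataset}' -json") is from the description; everything else in this sketch is an assumption, in particular the flattening of the JSON output and the field names (name, nevents, is_file_valid), which may differ in the real DAS response.

```python
import json
import subprocess

def broken_or_empty(file_records):
    """Pick out LFNs that are flagged invalid or contain zero events.

    Assumed record layout: {"name": <lfn>, "nevents": <int>, "is_file_valid": 0|1}.
    """
    bad = []
    for rec in file_records:
        if rec.get("is_file_valid", 1) == 0 or rec.get("nevents", 0) == 0:
            bad.append(rec["name"])
    return bad

def query_files(dataset):
    # per-file query as quoted in the PR description; the JSON structure
    # returned by dasgoclient is assumed here
    out = subprocess.check_output(
        ["dasgoclient", f"-query=file dataset={dataset}", "-json"]
    )
    return [f for entry in json.loads(out) for f in entry.get("file", [])]
```

Checking each file individually like this is what produces the per-dataset "broken_files" lists shown in the examples above.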
TODO:
- Should the file count comment state the subtraction explicitly (e.g. "95 - 2" instead of "93" when there are 95 files with two empty/broken ones)? ( --> added comment)
- Should we cross-check the number of events against the filesummaries service? ( --> should be fine without checks)