Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP] Make it possible to read only a subset of available collections in ROOTFrameReader #504

Open
wants to merge 10 commits into
base: master
Choose a base branch
from

Conversation

tmadlener
Copy link
Collaborator

@tmadlener tmadlener commented Oct 11, 2023

BEGINRELEASENOTES

  • Add a collsToRead argument to readEntry and readNextEntry for the ROOTFrameReader to allow limiting the collections that are actually read.

ENDRELEASENOTES

This is an (early) proposal for introducing this functionality which would in the end address #499

  • Interface design: Do we want to have the possibility to change this frame-by-frame?
    • Alternatively would have to introduce a configuration method that takes a category name and a list of collections.
  • Should we only pass on the limited collection id table?
    • Interplay with Frame and assumptions there
  • Add tests and checks to avoid breaking things when collection names are passed that are not even available from the data
    • The way the check is implemented only names that exist in the frame will be checked. Non-existent names in collsToRead will be silently ignored.
  • Same capabilities for RNTuple reader
  • Documentation (once the rest has settled)
  • Ensure that files that have been read with a limited set of collections and are then written again, can be read without issues.
  • Needs Require C++20 and update to C++20 #698 as it also starts using std::ranges.
  • Needs Make sure RNTupleReader builds with ROOT > 6.34 #719 to pass on dev3 stacks

@tmadlener tmadlener linked an issue Oct 11, 2023 that may be closed by this pull request
@tmadlener
Copy link
Collaborator Author

Revived this as this is likely to become an issue for moving EDM4hep forwards otherwise. I have settled on making it possible to pass the collections that should be made available on a frame-by-frame basis as that is the most flexible and is also rather simple to implement.

I still have to make sure that the idTable that is available from the constructed Frames is properly cleaned up and also add some tests that passing in unavailable collection names doesn't break the whole thing.

@tmadlener tmadlener force-pushed the select-colls-root branch 4 times, most recently from 1123457 to 7cc113d Compare January 8, 2025 16:15
Comment on lines +181 to +184
if (!collsToRead.empty() &&
std::ranges::find(collsToRead, m_collectionInfo[category].name[i]) == collsToRead.end()) {
continue;
}
Copy link
Member

@jmcarcell jmcarcell Jan 9, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if (!collsToRead.empty() &&
std::ranges::find(collsToRead, m_collectionInfo[category].name[i]) == collsToRead.end()) {
continue;
}
if (std::ranges::find(collsToRead, m_collectionInfo[category].name[i]) == collsToRead.end()) {
continue;
}

Probably nothing is gained by checking if collsToRead is empty if in the end find checks or does a for loop over its elements? Same below

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not entirely sure I follow here. Are you suggesting to do a loop over collsToRead instead of looping over the potentially available collections and then checking each of them to be present in collsToRead?

The advantage of the current way is that collections that are not present are simply ignored, but on the other hand that might be something we want to warn about.

Thinking about it, I don't think we can get rid of the find in any case, because if we loop over collsToRead we still have to check whether all collections are also actually available. Or am I missing something?

Copy link
Member

@jmcarcell jmcarcell Jan 10, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just edited; I just meant the first empty check as it's deleted in the suggestion, if it's empty find will not find it anyway

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, sorry then I completely misunderstood. The check is negated, so we only want to do the find if there are collections to check. If we remove the check, we will stop reading any collections (because if collsToRead is empty then find will never find a name in it).

Copy link
Member

@jmcarcell jmcarcell Jan 10, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think my suggestion is still incorrect because how it is now if collsToRead is empty then the first check will always fail and will always continue. So it should be like this, only continue if at least one collection is passed to be read and it is not found there:

    if (!collsToRead.empty()) {
      if (std::ranges::find(collsToRead, m_collectionInfo[category].name[i]) == collsToRead.end()) {
        continue;
      }
    }

then the existing calls will just pass an empty collsToRead and the continue will not trigger.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unless I am missing something the nested if and the current implementation with the && are equivalent?

The main question this discussion has triggered for me is whether the continue should actually be an exception, because arguably, if I put a collection name into the collsToRead, I expect it to be in the Frame I get in the end.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah yes, it's correct. Probably an exception is better than simply not reading it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Allow to limit the collections that are read
2 participants