Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add a tool to merge several podio files into a single one #681

Open
wants to merge 4 commits into
base: master
Choose a base branch
from

Conversation

jmcarcell
Copy link
Member

@jmcarcell jmcarcell commented Sep 19, 2024

BEGINRELEASENOTES

  • Add a tool to merge several podio files into a single one:
    • Metadata for every event can be saved or not
    • The same format as the first input file will be used (TTree or RNTuple)

ENDRELEASENOTES

Useful not to have to deal with many small files, even though the readers can read them fine.

Copy link
Collaborator

@tmadlener tmadlener left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This does a lot of heavy lifting that might not be strictly necessary, as it unpacks and than repacks every collection in every frame, whereas a simple hadd would almost be enough for merging TTree based files. Additionally, it will do schema evolution (if applicable), so that there could be subtle changes to the produced output file. On the other hand, this will correctly handle input files with different schema versions, so that might be not too bad.

How much slower is this than a c++ implementation? There is #620 after all to solve a quite significant performance issue with podio-dump, and the difference is not only the long startup time in that case.

tools/podio-merge-files Outdated Show resolved Hide resolved
@jmcarcell
Copy link
Member Author

Ah I didn't think about hadd, this is a tool I have used a couple of times (well only the for loop) so I thought it would be nice to have. On small files (1 MB TTree files) it's much faster than in python, but that's the case with the worst relative overhead I would imagine. For RNTuples it crashes with the ones I tried.

@tmadlener
Copy link
Collaborator

For RNTuples it crashes with the ones I tried.

Can you specify what crashes in this case? hadd or the c++ version of this script?

@jmcarcell
Copy link
Member Author

hadd is crashing

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants