feat: Re-enable eager parquet reading #1273
base: master
Conversation
@lgray If this is useful to do, I could use your (hopefully really quick) input on the scattered questions in the opening message. Not time critical, as I'm prioritizing unit tests for the weighted cutflows (+ 2 other features that are equally substantial, actually), then other parts of column-joining.
Oh - sure, I can take a look. I'll go through this next week!
On first pass I'm surprised it was that easy to set up! Thanks! Looking into the rest as the day goes.
WIP!

Eager parquet reading had a mismatch between the uproot-like shims and the parquet mappings, so `allow_missing` is inserted into the appropriate methods, with a copy of the uproot `IndexedOptionArray` creation added (I hope appropriately... let me know if not, Q1).
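For context on Q1, a minimal sketch (not the actual shim code) of the idea: a branch that is absent from the file can be surfaced as an all-missing column via an `IndexedOptionArray`, assuming awkward 2.x. The helper name is made up for illustration.

```python
import awkward as ak
import numpy as np


def all_missing_column(length, dtype="float64"):
    # An index of -1 everywhere marks every entry as missing, so the
    # (empty) content is never actually dereferenced.
    index = ak.index.Index64(np.full(length, -1, dtype=np.int64))
    content = ak.contents.NumpyArray(np.empty(0, dtype=dtype))
    return ak.contents.IndexedOptionArray(index, content)


print(ak.Array(all_missing_column(3)).to_list())  # [None, None, None]
```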
Running only `test_nanoevents.py` locally, the tests succeed, but I'm not sure where the dask/delayed version gets (or would get) tested, Q2 (that should still fail due to the lack of `form_mapping` in dask-awkward). The test for a column being available dives through the file, `arrow_schema`, etc., which may be more direct than desired, Q3.
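For reference on Q3, what that check boils down to (illustrative only, not the actual test code) is inspecting the parquet file's arrow schema directly; the file path and column name below are placeholders.

```python
import pyarrow.parquet as pq

schema = pq.read_schema("nano_dy.parquet")
# Column availability reduces to membership in the arrow schema's field names.
print("Electron_pt" in schema.names)
```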
I'm not expecting that `allow_missing` is tested for parquet, nor am I sure whether there are files already available for this... Q4. If needed, I could probably make some, potentially with both HLT and GenModel branches.

The old `nano_dy.parquet` file triggers an error related to the awkward arrow extension type, c.f. scikit-hep/awkward#3393.
However... newer parquet files made with hepconvert, even with `extensionarray=True`, pass. It might be possible to use the `storage_type` to bypass the error (a rough sketch of what I mean is below), but I didn't have a complete implementation before I swapped to using newer parquet files. Q5.

The new parquet files use hepconvert's default compression, which is zstd. Okay? Better to use another? Q6.
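To make Q5 concrete, a rough sketch of the `storage_type` idea, assuming the offending fields carry `pa.ExtensionType` instances: swap extension-typed fields for their underlying storage types before interpreting the schema. Top-level fields only here; this is a sketch, not a finished workaround, and the path is a placeholder.

```python
import pyarrow as pa
import pyarrow.parquet as pq

schema = pq.read_schema("nano_dy.parquet")  # placeholder path

# Replace extension-typed fields with their plain storage types so that code
# which does not understand the awkward extension type can still read them.
plain_schema = pa.schema(
    [
        field.with_type(field.type.storage_type)
        if isinstance(field.type, pa.ExtensionType)
        else field
        for field in schema
    ]
)
```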
The equivalent of preprocess for parquet files would be a nice addition, and for reasons of consistency (and related work on experimental column-joining, which benefits from similar interfaces plus building the union form pre-emptively), I've inserted dict-like filespec handling à la the uproot5 interface. Improvements can certainly be made.
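Purely to illustrate the shape being mirrored (every key and value below is a placeholder, not the final interface):

```python
# The uproot5-style dict filespec being mirrored: keys are file paths, values
# carry per-file information (for ROOT files, the object path of the tree).
uproot_style = {"nano_dy.root": "Events"}

# Hypothetical parquet analogue, mapping file paths to per-file options;
# the option dict is empty here because the contents are still up for discussion.
parquet_style = {"nano_dy.parquet": {}}
```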
@lgray this will need quite some work still, but if you could approve/veto some of the choices made (or give thoughts on the open questions above), that would be greatly appreciated!