Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor to decouple from relying directly on proto #44

Merged
merged 2 commits into from
Nov 18, 2023

Conversation

Jefffrey
Copy link
Collaborator

Part of datafusion-contrib/orc-rust#18

Start introducing some of our own structs/types to represent their protobuf counter part

Allows us to have more control, such as ensuring fields are actually present (since proto has them as Option usually)

Comment on lines +49 to +55
compression: Option<Compression>,
type_description: Arc<TypeDescription>,
number_of_rows: u64,
/// Statistics of columns across entire file
column_statistics: Vec<ColumnStatistics>,
stripes: Vec<StripeMetadata>,
user_custom_metadata: HashMap<String, Vec<u8>>,
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doesn't contain all fields yet, but can fill in more as needed (such as rowIndexStride)

Comment on lines +56 to +58
// TODO: for now keeping this, but ideally won't want all stripe footers here
// since don't want to require parsing all stripe footers in file unless actually required
stripe_footers: Vec<StripeFooter>,
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hoping to move this out of this struct, as don't want to require parsing all stripe footers in file upfront (at least not in this metadata section)

Comment on lines +106 to +109
// TODO: false count?
Some(TypeStatistics::Bucket {
true_count: stats.count[0], // TODO: safety check this
})
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ORC spec is kinda vague on details here, been referring to https://github.com/apache/orc/blob/e1c7e4d9150820fa64b151ad942bffdf81ef34d0/c%2B%2B/src/Statistics.cc#L197

Might need further investigation

Copy link
Collaborator

@WenyXu WenyXu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@Jefffrey Jefffrey merged commit cae1014 into main Nov 18, 2023
6 checks passed
@Jefffrey Jefffrey deleted the refactor_own_proto_structs branch November 18, 2023 05:35
waynexia pushed a commit that referenced this pull request Oct 24, 2024
* Refactor to decouple from relying directly on proto

* Fix
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants