Document Arrow <--> Parquet schema conversion better #7479
Conversation
Thanks @alamb. I was just looking at this again because I'd like to see a coercion in DF to keep dictionaries if possible. Currently Comet only knows the primitive datatype and ignores encoding when generating the schema, which results in dictionaries being unpacked in the Parquet reader when we might have preferred to keep them encoded. I'll take another pass through this later to think about what might be missing, but first read was good.
I think it is always possible to read Parquet data as dictionaries (at least for strings, we might have to add code / tests for other types). I think it would make sense to allow writing DictionaryArrays for other types if it isn't already supported. As I recall one potential complication is that the same column is stored in multiple pages, and each page can have different encodings (e.g. some pages are dictionary encoded and some are plain).
parquet/src/arrow/mod.rs
//! [`BinaryViewArray`] or [`BinaryArray`].
//!
//! To recover the original Arrow types, the writers in this module add
//! metadata in the [`ARROW_SCHEMA_META_KEY`] key to record the original Arrow
I was reminded on #5626 that this metadata is the same format as used by arrow-cpp, which is an important caveat. I will add to this doc
/// This option is only required if you want to cast columns to a different type.
/// For example, if you wanted to cast from an Int64 in the Parquet file to a Timestamp
/// in the Arrow schema.
/// If provided, this schema takes precedence over the schema inferred from
This is not true, the schema in the parquet file must be authoritative. The arrow schema is merely a hint - see #1663
Edit: it may take precedence over the embedded arrow schema though, I don't recognise this particular codepath
I am not sure what you mean by "authoritative"
What this method does is override any embedded arrow schema hint
I have reworded it - let me know what you think.
I think the major confusion, which this PR didn't create but also doesn't really address, is that the arrow schema provided may not be what the reader actually uses. If, say, the arrow schema says TimestampNanoseconds but the parquet data is actually TimestampMilliseconds, IIRC it will return TimestampMilliseconds.
The reason for me writing this PR is that I don't think it is clear how parquet / arrow schema conversions are handled, including the embedded arrow schema hint and then the APIs that let people supply / modify their own hint
My experience is that if the hint schema is provided but doesn't match what is read from the file, an error is raised.
The error message is actually pretty bad. I'll make a new PR to improve it.
Force-pushed from 05c519e to 3426c8a
Co-authored-by: Raphael Taylor-Davies <[email protected]>
Aah yes, I remember now. If you provide a schema it will use it as a hint for the schema inference process, but if that inference process ignores the hints for any reason it will return an error. If, however, the schema is embedded in the file, it does not error and behaves as I described above
Since I don't see any reason to rush this PR in I'll plan to leave it open for another day or two to have time to gather comments
Which issue does this PR close?
Closes #.
Rationale for this change
There have been several questions / changes related to Arrow and Parquet schemas recently, for example:
- Dictionary(_, FixedSizeBinary(_)) to Parquet #7446 (comment) from @albertlockett

I realized the high level behavior was not well documented anywhere, so I wanted to do that (mostly so I have a place to refer people to)
What changes are included in this PR?
Document the schema conversion process
Are there any user-facing changes?
Documentation-only changes; no code or behavior changes