
Document Arrow <--> Parquet schema conversion better #7479


Open

alamb wants to merge 15 commits into main

Conversation

@alamb (Contributor) commented May 7, 2025

Which issue does this PR close?

Closes #.

Rationale for this change

There have been several recent questions and changes related to how Arrow and Parquet schemas are converted.

I realized the high-level behavior was not well documented anywhere, so I wanted to fix that (mostly so I have a place to refer people to).

What changes are included in this PR?

Document the schema conversion process

Are there any user-facing changes?

Documentation-only changes; no code or behavior changes.

@alamb alamb added the documentation Improvements or additions to documentation label May 7, 2025
@github-actions github-actions bot added the parquet Changes to the parquet crate label May 7, 2025
@mbutrovich (Contributor) commented:

Thanks @alamb. I was just looking at this again because I'd like to see a coercion in DF (DataFusion) to keep dictionaries if possible. Currently Comet only knows the primitive data type and ignores encoding when generating the schema, which results in dictionaries being unpacked in the Parquet reader when we might have preferred to keep them encoded. I'll take another pass through this later to think about what might be missing, but the first read was good.

@alamb (Contributor, Author) commented May 7, 2025

> Thanks @alamb. I was just looking at this again because I'd like to see a coercion in DF to keep dictionaries if possible.

I think it is always possible to read Parquet data as dictionaries (at least for strings; we might have to add code / tests for other types).

I think it would make sense to allow writing DictionaryArrays for other types if it isn't already supported.

As I recall, one potential complication is that the same column is stored across multiple pages, and each page can have a different encoding (e.g. some pages are dictionary encoded and some are plain).
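
To illustrate the reading side, here is a minimal sketch (not from this PR) of asking the reader to produce a `DictionaryArray` for a string column by supplying an Arrow schema hint. The file and column names are hypothetical, and it assumes the file contains a single nullable Utf8 column:

```rust
use std::{fs::File, sync::Arc};

use arrow_schema::{DataType, Field, Schema};
use parquet::arrow::arrow_reader::{ArrowReaderOptions, ParquetRecordBatchReaderBuilder};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Hypothetical file with one string column named "city".
    let file = File::open("strings.parquet")?;

    // Hint: read the column as Dictionary(Int32, Utf8) rather than plain Utf8.
    let hint = Arc::new(Schema::new(vec![Field::new(
        "city",
        DataType::Dictionary(Box::new(DataType::Int32), Box::new(DataType::Utf8)),
        true,
    )]));
    let options = ArrowReaderOptions::new().with_schema(hint);

    let reader = ParquetRecordBatchReaderBuilder::try_new_with_options(file, options)?.build()?;
    for batch in reader {
        // Each batch's "city" column should now be dictionary encoded.
        println!("{:?}", batch?.schema());
    }
    Ok(())
}
```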

//! [`BinaryViewArray`] or [`BinaryArray`].
//!
//! To recover the original Arrow types, the writers in this module add
//! metadata in the [`ARROW_SCHEMA_META_KEY`] key to record the original Arrow
@alamb (Contributor, Author) commented on the lines above:

I was reminded in #5626 that this metadata is the same format as used by arrow-cpp, which is an important caveat. I will add it to this doc.
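
For reference, a small sketch (assuming a hypothetical local `data.parquet` written by an Arrow-aware writer) of where that hint lives: the serialized Arrow schema is stored in the footer key/value metadata under `ARROW_SCHEMA_META_KEY` (`"ARROW:schema"`), the same convention arrow-cpp uses:

```rust
use std::fs::File;

use parquet::arrow::ARROW_SCHEMA_META_KEY;
use parquet::file::reader::{FileReader, SerializedFileReader};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Hypothetical file written by an Arrow writer.
    let file = File::open("data.parquet")?;
    let reader = SerializedFileReader::new(file)?;

    // The footer key/value metadata; the value under "ARROW:schema" is a
    // base64-encoded Arrow IPC schema message.
    if let Some(kvs) = reader.metadata().file_metadata().key_value_metadata() {
        for kv in kvs {
            if kv.key == ARROW_SCHEMA_META_KEY {
                let len = kv.value.as_ref().map_or(0, |v| v.len());
                println!("found embedded Arrow schema hint ({len} bytes)");
            }
        }
    }
    Ok(())
}
```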

/// This option is only required if you want to cast columns to a different type.
/// For example, if you wanted to cast from an Int64 in the Parquet file to a Timestamp
/// in the Arrow schema.
/// If provided, this schema takes precedence over the schema inferred from
@tustvold (Contributor) commented May 7, 2025 on the lines above:

This is not true; the schema in the parquet file must be authoritative. The arrow schema is merely a hint (see #1663).

Edit: it may take precedence over the embedded arrow schema, though; I don't recognise this particular codepath.

@alamb (Contributor, Author) replied:

I am not sure what you mean by "authoritative".

What this method does is override any embedded arrow schema hint.

I have reworded it - let me know what you think.
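
Assuming the method under review is `ArrowReaderOptions::with_schema` (the thread does not name it explicitly), a hedged sketch of the Int64-to-Timestamp example from the doc comment, using a hypothetical single-column file:

```rust
use std::{fs::File, sync::Arc};

use arrow_schema::{DataType, Field, Schema, TimeUnit};
use parquet::arrow::arrow_reader::{ArrowReaderOptions, ParquetRecordBatchReaderBuilder};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Hypothetical file whose only column, "ts", is a physical Int64.
    let file = File::open("events.parquet")?;

    // Override any embedded "ARROW:schema" hint and read "ts" as a timestamp.
    let hint = Arc::new(Schema::new(vec![Field::new(
        "ts",
        DataType::Timestamp(TimeUnit::Microsecond, None),
        false,
    )]));
    let options = ArrowReaderOptions::new().with_schema(hint);

    let mut reader =
        ParquetRecordBatchReaderBuilder::try_new_with_options(file, options)?.build()?;
    if let Some(batch) = reader.next().transpose()? {
        // Expect ts: Timestamp(Microsecond, None) in the output schema.
        println!("{:?}", batch.schema());
    }
    Ok(())
}
```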

@tustvold (Contributor) left a review comment:

I think the major confusion, which this PR didn't create but also doesn't really address, is that the arrow schema provided may not be what the reader actually uses. If, say, the arrow schema says TimestampNanoseconds but the parquet data is actually TimestampMilliseconds, IIRC it will return TimestampMilliseconds.

@alamb (Contributor, Author) commented May 7, 2025

The reason I wrote this PR is that I don't think it is clear how parquet / arrow schema conversions are handled, including the embedded arrow schema hint and the APIs that let people supply or modify their own hint.

> I think the major confusion, which this PR didn't create but also doesn't really address, is that the arrow schema provided may not be what the reader actually uses. If, say, the arrow schema says TimestampNanoseconds but the parquet data is actually TimestampMilliseconds, IIRC it will return TimestampMilliseconds.

My experience is that if the hint schema is provided but doesn't match what is read from the file, an error is raised:

https://github.com/apache/arrow-rs/blob/812160005efe3afc63531b8ea051e1fa44a91f67/parquet/src/arrow/arrow_reader/mod.rs#L541-L540

called `Result::unwrap()` on an `Err` value: ArrowError("incompatible arrow schema, the following fields could not be cast: [column1]")

The error message is actually pretty bad. I'll make a new PR to improve it.


@tustvold (Contributor) commented May 7, 2025

> My experience is that if the hint schema is provided but doesn't match what is read from the file, an error is raised:

Aah yes, I remember now. If you provide a schema, it will be used as a hint for the schema inference process, but if that inference process ignores the hint for any reason, an error is returned.

If, however, the schema is embedded in the file, it does not error and behaves as I described above.
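
A small sketch of that error path (hypothetical file and column names; as I read the code linked above, the incompatibility is detected when the builder is constructed):

```rust
use std::{fs::File, sync::Arc};

use arrow_schema::{DataType, Field, Schema};
use parquet::arrow::arrow_reader::{ArrowReaderOptions, ParquetRecordBatchReaderBuilder};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let file = File::open("data.parquet")?;

    // Deliberately incompatible hint: claim "column1" is a Boolean even
    // though the file presumably stores something that cannot be cast to it.
    let bad_hint = Arc::new(Schema::new(vec![Field::new(
        "column1",
        DataType::Boolean,
        false,
    )]));
    let options = ArrowReaderOptions::new().with_schema(bad_hint);

    match ParquetRecordBatchReaderBuilder::try_new_with_options(file, options) {
        Ok(_) => println!("hint accepted"),
        // Expect something like: "incompatible arrow schema, the following
        // fields could not be cast: [column1]"
        Err(e) => println!("schema hint rejected: {e}"),
    }
    Ok(())
}
```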

alamb and others added 2 commits May 7, 2025 14:37
@alamb (Contributor, Author) commented May 7, 2025

Since I don't see any reason to rush this PR in, I plan to leave it open for another day or two to gather comments.
