
Document Arrow <--> Parquet schema conversion better #7479


Open

alamb wants to merge 15 commits into main

Conversation

@alamb (Contributor) commented May 7, 2025

Which issue does this PR close?

Closes #.

Rationale for this change

There have been several recent questions and changes related to how Arrow and Parquet schemas are converted.

I realized the high-level behavior was not well documented anywhere, so I wanted to fix that (mostly so I have a place to refer people to).

What changes are included in this PR?

Document the schema conversion process

Are there any user-facing changes?

Documentation-only changes; no code or behavior changes.

@alamb alamb added the documentation Improvements or additions to documentation label May 7, 2025
@github-actions github-actions bot added the parquet Changes to the parquet crate label May 7, 2025
@mbutrovich (Contributor) commented:

Thanks @alamb. I was just looking at this again because I'd like to see a coercion in DF (DataFusion) to keep dictionaries if possible. Currently Comet only knows the primitive data type and ignores encoding when generating the schema, which results in dictionaries being unpacked in the Parquet reader when we might have preferred to keep them encoded. I'll take another pass through this later to think about what might be missing, but the first read was good.

@alamb (Contributor, Author) commented May 7, 2025

> Thanks @alamb. I was just looking at this again because I'd like to see a coercion in DF to keep dictionaries if possible.

I think it is always possible to read Parquet data as dictionaries (at least for strings; we might have to add code / tests for other types).

I think it would make sense to allow writing DictionaryArrays for other types if it isn't already supported.

As I recall, one potential complication is that the same column is stored across multiple pages, and each page can have a different encoding (e.g. some pages are dictionary encoded and some are plain).
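
To illustrate the reading side, here is a minimal sketch (not from this PR) of asking the reader to produce a `DictionaryArray` for a string column by supplying an Arrow schema hint. The file and column names are hypothetical, and it assumes the file contains a single nullable Utf8 column:

```rust
use std::{fs::File, sync::Arc};

use arrow_schema::{DataType, Field, Schema};
use parquet::arrow::arrow_reader::{ArrowReaderOptions, ParquetRecordBatchReaderBuilder};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Hypothetical file with one string column named "city".
    let file = File::open("strings.parquet")?;

    // Hint: read the column as Dictionary(Int32, Utf8) rather than plain Utf8.
    let hint = Arc::new(Schema::new(vec![Field::new(
        "city",
        DataType::Dictionary(Box::new(DataType::Int32), Box::new(DataType::Utf8)),
        true,
    )]));
    let options = ArrowReaderOptions::new().with_schema(hint);

    let reader = ParquetRecordBatchReaderBuilder::try_new_with_options(file, options)?.build()?;
    for batch in reader {
        // Each batch's "city" column should now be dictionary encoded.
        println!("{:?}", batch?.schema());
    }
    Ok(())
}
```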

//! [`BinaryViewArray`] or [`BinaryArray`].
//!
//! To recover the original Arrow types, the writers in this module add
//! metadata in the [`ARROW_SCHEMA_META_KEY`] key to record the original Arrow
@alamb (Contributor, Author) commented on the lines above:

I was reminded in #5626 that this metadata is the same format as used by arrow-cpp, which is an important caveat. I will add it to this doc.
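
For reference, a small sketch (assuming a hypothetical local `data.parquet` written by an Arrow-aware writer) of where that hint lives: the serialized Arrow schema is stored in the footer key/value metadata under `ARROW_SCHEMA_META_KEY` (`"ARROW:schema"`), the same convention arrow-cpp uses:

```rust
use std::fs::File;

use parquet::arrow::ARROW_SCHEMA_META_KEY;
use parquet::file::reader::{FileReader, SerializedFileReader};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Hypothetical file written by an Arrow writer.
    let file = File::open("data.parquet")?;
    let reader = SerializedFileReader::new(file)?;

    // The footer key/value metadata; the value under "ARROW:schema" is a
    // base64-encoded Arrow IPC schema message.
    if let Some(kvs) = reader.metadata().file_metadata().key_value_metadata() {
        for kv in kvs {
            if kv.key == ARROW_SCHEMA_META_KEY {
                let len = kv.value.as_ref().map_or(0, |v| v.len());
                println!("found embedded Arrow schema hint ({len} bytes)");
            }
        }
    }
    Ok(())
}
```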

/// This option is only required if you want to cast columns to a different type.
/// For example, if you wanted to cast from an Int64 in the Parquet file to a Timestamp
/// in the Arrow schema.
/// If provided, this schema takes precedence over the schema inferred from
@tustvold (Contributor) commented May 7, 2025 on the lines above:

This is not true; the schema in the parquet file must be authoritative. The arrow schema is merely a hint (see #1663).

Edit: it may take precedence over the embedded arrow schema, though; I don't recognise this particular codepath.

@alamb (Contributor, Author) replied:

I am not sure what you mean by "authoritative".

What this method does is override any embedded arrow schema hint.

I have reworded it - let me know what you think.
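
Assuming the method under review is `ArrowReaderOptions::with_schema` (the thread does not name it explicitly), a hedged sketch of the Int64-to-Timestamp example from the doc comment, using a hypothetical single-column file:

```rust
use std::{fs::File, sync::Arc};

use arrow_schema::{DataType, Field, Schema, TimeUnit};
use parquet::arrow::arrow_reader::{ArrowReaderOptions, ParquetRecordBatchReaderBuilder};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Hypothetical file whose only column, "ts", is a physical Int64.
    let file = File::open("events.parquet")?;

    // Override any embedded "ARROW:schema" hint and read "ts" as a timestamp.
    let hint = Arc::new(Schema::new(vec![Field::new(
        "ts",
        DataType::Timestamp(TimeUnit::Microsecond, None),
        false,
    )]));
    let options = ArrowReaderOptions::new().with_schema(hint);

    let mut reader =
        ParquetRecordBatchReaderBuilder::try_new_with_options(file, options)?.build()?;
    if let Some(batch) = reader.next().transpose()? {
        // Expect ts: Timestamp(Microsecond, None) in the output schema.
        println!("{:?}", batch.schema());
    }
    Ok(())
}
```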

@tustvold (Contributor) left a review comment:

I think the major confusion, which this PR didn't create but also doesn't really address, is that the arrow schema provided may not be what the reader actually uses. If, say, the arrow schema says TimestampNanoseconds but the parquet data is actually TimestampMilliseconds, IIRC it will return TimestampMilliseconds.

@alamb (Contributor, Author) commented May 7, 2025

The reason I wrote this PR is that I don't think it is clear how parquet / arrow schema conversions are handled, including the embedded arrow schema hint and the APIs that let people supply or modify their own hint.

> I think the major confusion, which this PR didn't create but also doesn't really address, is that the arrow schema provided may not be what the reader actually uses. If, say, the arrow schema says TimestampNanoseconds but the parquet data is actually TimestampMilliseconds, IIRC it will return TimestampMilliseconds.

My experience is that if the hint schema is provided but doesn't match what is read from the file, an error is raised:

https://github.com/apache/arrow-rs/blob/812160005efe3afc63531b8ea051e1fa44a91f67/parquet/src/arrow/arrow_reader/mod.rs#L541-L540

called `Result::unwrap()` on an `Err` value: ArrowError("incompatible arrow schema, the following fields could not be cast: [column1]")

The error message is actually pretty bad. I'll make a new PR to improve it.


@tustvold (Contributor) commented May 7, 2025

> My experience is that if the hint schema is provided but doesn't match what is read from the file, an error is raised:

Aah yes, I remember now. If you provide a schema, it will be used as a hint for the schema inference process, but if that inference process ignores the hint for any reason, an error is returned.

If, however, the schema is embedded in the file, it does not error and behaves as I described above.
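
A small sketch of that error path (hypothetical file and column names; as I read the code linked above, the incompatibility is detected when the builder is constructed):

```rust
use std::{fs::File, sync::Arc};

use arrow_schema::{DataType, Field, Schema};
use parquet::arrow::arrow_reader::{ArrowReaderOptions, ParquetRecordBatchReaderBuilder};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let file = File::open("data.parquet")?;

    // Deliberately incompatible hint: claim "column1" is a Boolean even
    // though the file presumably stores something that cannot be cast to it.
    let bad_hint = Arc::new(Schema::new(vec![Field::new(
        "column1",
        DataType::Boolean,
        false,
    )]));
    let options = ArrowReaderOptions::new().with_schema(bad_hint);

    match ParquetRecordBatchReaderBuilder::try_new_with_options(file, options) {
        Ok(_) => println!("hint accepted"),
        // Expect something like: "incompatible arrow schema, the following
        // fields could not be cast: [column1]"
        Err(e) => println!("schema hint rejected: {e}"),
    }
    Ok(())
}
```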

alamb and others added 2 commits May 7, 2025 14:37
@alamb (Contributor, Author) commented May 7, 2025

Since I don't see any reason to rush this PR in, I plan to leave it open for another day or two to gather comments.
