Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(parquet): Add support for parquet version config #11151

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

svm1
Copy link
Collaborator

@svm1 svm1 commented Oct 2, 2024

Allow the Presto session property parquet_writer_version, which is currently ignored by Velox, to toggle the parquet writer datapage version (V1 or V2). The value can be set as a session property or can be provided in the Hive config. Defaults to V1.

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Oct 2, 2024
Copy link

netlify bot commented Oct 2, 2024

Deploy Preview for meta-velox canceled.

Name Link
🔨 Latest commit d9fc5ec
🔍 Latest deploy log https://app.netlify.com/sites/meta-velox/deploys/67997ada84532c00084cd9f1

velox/dwio/parquet/writer/Writer.h Outdated Show resolved Hide resolved
velox/dwio/parquet/writer/Writer.cpp Outdated Show resolved Hide resolved
velox/connectors/hive/HiveConnectorUtil.cpp Outdated Show resolved Hide resolved
velox/dwio/parquet/writer/Writer.h Outdated Show resolved Hide resolved
@svm1 svm1 force-pushed the parq_2 branch 2 times, most recently from 40a6c6c to 4982e1f Compare November 26, 2024 09:47
@svm1
Copy link
Collaborator Author

svm1 commented Nov 26, 2024

@yingsu00 @majetideepak Thank you for reviewing - made all necessary changes, please take a look!

velox/connectors/hive/HiveConnectorUtil.cpp Outdated Show resolved Hide resolved
velox/connectors/hive/HiveConnectorUtil.cpp Outdated Show resolved Hide resolved
velox/dwio/parquet/tests/writer/ParquetWriterTest.cpp Outdated Show resolved Hide resolved
velox/dwio/parquet/writer/Writer.h Outdated Show resolved Hide resolved
@majetideepak majetideepak changed the title Add support for parquet_writer_version session property feat(parquet): Add support for parquet version config Jan 29, 2025
@majetideepak
Copy link
Collaborator

majetideepak commented Jan 29, 2025

@svm1 Arrow has two versions a user can set ParquetVersion and ParquetDataPageVersion. Based on the defaults (2.6), and the issue here prestodb/presto#22595, I see the Presto ParquetWriterVersion maps to ParquetVersion.

https://github.com/apache/arrow/blob/main/cpp/src/parquet/properties.h#L258

@jkhaliqi Can you please evaluate these options with respect to RLE V2?

@@ -108,6 +109,7 @@ struct WriterOptions : public dwio::common::WriterOptions {
/// Timestamp time zone for Parquet write through Arrow bridge.
std::optional<std::string> parquetWriteTimestampTimeZone;
bool writeInt96AsTimestamp = false;
std::optional<arrow::ParquetDataPageVersion> parquetDataPageVersion;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Velox should have its own enum and not borrow from Arrow.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I must've misinterpreted one of your previous comments, and also felt that as we are no longer working in the HiveConnector, there was no need for an intermediary enum. I can add it back.

@svm1
Copy link
Collaborator Author

svm1 commented Jan 29, 2025

@svm1 Arrow has two versions a user can set ParquetVersion and ParquetDataPageVersion. Based on the defaults (2.6), and the issue here prestodb/presto#22595, I see the Presto ParquetWriterVersion maps to ParquetVersion.

https://github.com/apache/arrow/blob/main/cpp/src/parquet/properties.h#L258

@jkhaliqi Can you please evaluate these options with respect to RLE V2?

@majetideepak

I understand the two different versions - ParquetVersion referring to the format version (2.4, 2.6, etc) and ParquetDataPageVersion referring to the datapage (V1 or V2).

I think we had a discussion about this very debate back when I first began working on this fix, and we determined that the Presto parquet_writer_version property does map to ParquetDataPageVersion - #9700 (comment)

I also investigated the Presto Java source - parquet_writer_version there maps to the org.apache.parquet.column.ParquetProperties.WriterVersion enum, which itself contains only PARQUET_1_0 and PARQUET_2_0, likely representing the V1/V2 datapage - https://www.javadoc.io/doc/org.apache.parquet/parquet-column/1.10.1/org/apache/parquet/column/ParquetProperties.WriterVersion.html#PARQUET_1_0

The Presto docs also seem to point to this mapping:

Presto now supports Parquet writer versions V1 and V2 for the Hive catalog. It can be toggled using the session property parquet_writer_version and the config property hive.parquet.writer.version. Valid values for these properties are PARQUET_1_0 and PARQUET_2_0. Default is PARQUET_1_0.

The format version also appears far more granular than simply "1 vs 2" (https://github.com/apache/parquet-format/blob/master/CHANGES.md).

Therefore I believe the mapping established in this PR would be correct - please correct me if I am mistaken.

@majetideepak
Copy link
Collaborator

@svm1 The issue reported here prestodb/presto#22595 hints that it is not the DataPageVersion but ParquetVersion.
Can you verify this?

@majetideepak
Copy link
Collaborator

majetideepak commented Jan 29, 2025

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants