-
Notifications
You must be signed in to change notification settings - Fork 3.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add bloom filter write support to ParquetWriter #20662
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few things to work on from first look:
- Check where in the parquet file Hive and Spark store the bloom filter (they use parquet-mr directly). We will want to make sure that we're not differing from that.
- We need product tests that files with bloom filter written by trino work on Apache Hive and Spark
- We don't want this getting enabled by default. Users should opt into it on a per-table basis similar to the way bloom filter columns can be specified in ORC in Trino
- BaseTestParquetWithBloomFilters should be updated to use Trino to write bloom filters
c04e5ac
to
beca40d
Compare
8172007
to
11a130a
Compare
There remains some work to decide on how to configure the bloom filters in Hive (and also Iceberg and Delta). We could do the following:
These are the same properties used by parquet-mr: https://github.com/apache/parquet-mr/blob/20d43639b5a380335953742ad6c9b3dd98e09f29/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetOutputFormat.java#L152-L155 If any of the Hive table properties have invalid values we treat them as unspecified. For the Iceberg table properties we will likely want to do For the Trino table properties we define:
|
11a130a
to
79fa7d6
Compare
This pull request has gone a while without any activity. Tagging the Trino developer relations team: @bitsondatadev @colebow @mosabua |
lib/trino-parquet/src/main/java/io/trino/parquet/writer/ColumnWriter.java
Show resolved
Hide resolved
lib/trino-parquet/src/main/java/io/trino/parquet/writer/ParquetWriter.java
Outdated
Show resolved
Hide resolved
lib/trino-parquet/src/main/java/io/trino/parquet/writer/ParquetWriterOptions.java
Outdated
Show resolved
Hide resolved
lib/trino-parquet/src/main/java/io/trino/parquet/writer/ParquetWriterOptions.java
Outdated
Show resolved
Hide resolved
lib/trino-parquet/src/main/java/io/trino/parquet/writer/valuewriter/BigintValueWriter.java
Outdated
Show resolved
Hide resolved
lib/trino-parquet/src/main/java/io/trino/parquet/writer/ParquetWriter.java
Outdated
Show resolved
Hide resolved
lib/trino-parquet/src/main/java/io/trino/parquet/writer/ParquetWriter.java
Outdated
Show resolved
Hide resolved
lib/trino-parquet/src/main/java/io/trino/parquet/writer/ParquetWriter.java
Outdated
Show resolved
Hide resolved
lib/trino-parquet/src/main/java/io/trino/parquet/writer/ParquetWriter.java
Outdated
Show resolved
Hide resolved
plugin/trino-hive/src/main/java/io/trino/plugin/hive/HiveMetadata.java
Outdated
Show resolved
Hide resolved
eb309f0
to
e4ecdd0
Compare
This pull request has gone a while without any activity. Tagging the Trino developer relations team: @bitsondatadev @colebow @mosabua |
lib/trino-parquet/src/test/java/io/trino/parquet/writer/BenchmarkBinaryColumnWriter.java
Outdated
Show resolved
Hide resolved
lib/trino-parquet/src/test/java/io/trino/parquet/writer/BenchmarkLongColumnWriter.java
Outdated
Show resolved
Hide resolved
lib/trino-parquet/src/main/java/io/trino/parquet/writer/ParquetWriters.java
Outdated
Show resolved
Hide resolved
lib/trino-parquet/src/test/java/io/trino/parquet/writer/AbstractColumnWriterBenchmark.java
Outdated
Show resolved
Hide resolved
lib/trino-parquet/src/test/java/io/trino/parquet/writer/BenchmarkLongColumnWriter.java
Outdated
Show resolved
Hide resolved
plugin/trino-hive/src/main/java/io/trino/plugin/hive/HiveMetadata.java
Outdated
Show resolved
Hide resolved
plugin/trino-hive/src/main/java/io/trino/plugin/hive/util/HiveUtil.java
Outdated
Show resolved
Hide resolved
plugin/trino-hive/src/main/java/io/trino/plugin/hive/util/HiveUtil.java
Outdated
Show resolved
Hide resolved
plugin/trino-hive/src/main/java/io/trino/plugin/hive/util/HiveUtil.java
Outdated
Show resolved
Hide resolved
plugin/trino-hive/src/main/java/io/trino/plugin/hive/parquet/ParquetFileWriterFactory.java
Outdated
Show resolved
Hide resolved
70504f9
to
baed59e
Compare
lib/trino-parquet/src/main/java/io/trino/parquet/writer/ParquetWriters.java
Outdated
Show resolved
Hide resolved
plugin/trino-hive/src/main/java/io/trino/plugin/hive/HiveMetadata.java
Outdated
Show resolved
Hide resolved
plugin/trino-hive/src/main/java/io/trino/plugin/hive/util/HiveUtil.java
Outdated
Show resolved
Hide resolved
plugin/trino-hive/src/main/java/io/trino/plugin/hive/parquet/ParquetFileWriterFactory.java
Outdated
Show resolved
Hide resolved
I've pushed some minor fixups, please apply them into the appropriate commits |
59fcecd
to
ab34493
Compare
lib/trino-parquet/src/main/java/io/trino/parquet/writer/ColumnWriter.java
Outdated
Show resolved
Hide resolved
lib/trino-parquet/src/main/java/io/trino/parquet/writer/ParquetWriter.java
Outdated
Show resolved
Hide resolved
lib/trino-parquet/src/main/java/io/trino/parquet/writer/ParquetWriter.java
Outdated
Show resolved
Hide resolved
lib/trino-parquet/src/main/java/io/trino/parquet/writer/ParquetWriters.java
Outdated
Show resolved
Hide resolved
...rino-parquet/src/main/java/io/trino/parquet/writer/valuewriter/TrinoValuesWriterFactory.java
Outdated
Show resolved
Hide resolved
lib/trino-parquet/src/test/java/io/trino/parquet/writer/TestParquetWriter.java
Outdated
Show resolved
Hide resolved
plugin/trino-hive/src/main/java/io/trino/plugin/hive/HiveMetadata.java
Outdated
Show resolved
Hide resolved
plugin/trino-hive/src/main/java/io/trino/plugin/hive/HiveMetadata.java
Outdated
Show resolved
Hide resolved
plugin/trino-hive/src/main/java/io/trino/plugin/hive/parquet/ParquetFileWriterFactory.java
Outdated
Show resolved
Hide resolved
plugin/trino-hive/src/main/java/io/trino/plugin/hive/util/HiveUtil.java
Outdated
Show resolved
Hide resolved
ab34493
to
b5babe6
Compare
plugin/trino-hive/src/main/java/io/trino/plugin/hive/HiveTableProperties.java
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
minor comments, lgtm
lib/trino-parquet/src/main/java/io/trino/parquet/writer/ParquetWriterOptions.java
Outdated
Show resolved
Hide resolved
b5babe6
to
e5c415d
Compare
e5c415d
to
f04a803
Compare
lib/trino-parquet/src/main/java/io/trino/parquet/writer/ParquetWriterOptions.java
Outdated
Show resolved
Hide resolved
lib/trino-parquet/src/main/java/io/trino/parquet/writer/ParquetWriterOptions.java
Outdated
Show resolved
Hide resolved
The bloom filters are added after all the row groups, right before the footer, similar to the first option described [here](https://github.com/apache/parquet-format/blob/master/BloomFilter.md#file-format). We do not support writing bloom filters for types for which we do not have read support.
Bloom filters must be explicitly enabled for a column of a table by setting the Hive table property `parquet.bloom.filter.columns=<column-name-1>,<column-name-2>` This Hive table propery can be enabled with the Trino table property `parquet_bloom_filter_columns = ARRAY['<column-name>']`. We do not support configuring the NDV or FPP of the filter.
f04a803
to
2499483
Compare
Description
The bloom filters are added after all the row groups, right before the footer, similar to the first option described here.
We do not support writing bloom filters for types for which we do not have read support.
Additional context and related issues
Fixes #16536
Release notes
( ) This is not user-visible or is docs only, and no release notes are required.
( ) Release notes are required. Please propose a release note for me.
(x) Release notes are required, with the following suggested text: