Description
Is your feature request related to a problem or challenge?
BloomFilter support was added in #7821 by @hengfeiyang ❤️
There is partial support for optimizing queries that have IN List predicates, as suggested by @Ted-Jiang: #7821 (comment) and tested via https://github.com/apache/arrow-datafusion/blob/0d7cab055cb39d6df751e070af5a0bf5444e3849/datafusion/core/src/datasource/physical_plan/parquet/row_groups.rs#L1056-L1084
However, this only supports queries where there are three or fewer items in the IN list:
SELECT *
FROM parquet_file
WHERE col IN ('foo', 'bar', 'baz')
It only works for small numbers of constants because the current implementation only checks for predicates like col = 'foo' OR col = 'bar'. The reason this works for InLists is that those with a small number of items (three or fewer) are rewritten to OR chains by the optimizer.
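To illustrate the rewrite described above, here is a minimal sketch using a toy expression type. It is not DataFusion's optimizer code: the `Expr` enum and `inline_small_in_list` function are hypothetical, and the threshold value of 3 is assumed from the "three or fewer" behavior described in this issue.

```rust
// Toy expression type for illustration only; DataFusion's real
// expression types are much richer than this.
#[derive(Debug, Clone, PartialEq)]
enum Expr {
    Eq(String, String),
    Or(Box<Expr>, Box<Expr>),
    InList { col: String, values: Vec<String> },
}

// Assumed value, based on the "three or fewer items" behavior described above.
const THRESHOLD_INLINE_INLIST: usize = 3;

// Rewrite `col IN (a, b, c)` into `col = a OR col = b OR col = c`
// when the list is small enough; leave larger lists untouched.
fn inline_small_in_list(expr: Expr) -> Expr {
    match expr {
        Expr::InList { col, values }
            if !values.is_empty() && values.len() <= THRESHOLD_INLINE_INLIST =>
        {
            let mut eqs = values.into_iter().map(|v| Expr::Eq(col.clone(), v));
            let first = eqs.next().unwrap(); // safe: the list is non-empty
            eqs.fold(first, |acc, eq| Expr::Or(Box::new(acc), Box::new(eq)))
        }
        other => other,
    }
}

fn main() {
    let small = Expr::InList {
        col: "col".to_string(),
        values: vec!["foo".to_string(), "bar".to_string(), "baz".to_string()],
    };
    // Becomes a chain of OR'ed equality predicates.
    println!("{:?}", inline_small_in_list(small));

    let large = Expr::InList {
        col: "col".to_string(),
        values: (1..=100).map(|i| format!("constant{i}")).collect(),
    };
    // Stays an InList, which the current bloom filter code does not handle.
    println!("{:?}", inline_small_in_list(large));
}
```

Only the first form reaches the bloom filter code as equality predicates, which is why larger lists currently get no pruning.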
Thus, the current bloom filter code will not work for queries with large numbers of constants (more than THRESHOLD_INLINE_INLIST) in the IN list, such as:
SELECT *
FROM parquet_file
WHERE col IN (
'constant1',
'constant2',
..,
'constant99',
'constant100'
)
Describe the solution you'd like
I would like the bloom filter code to directly support InListExpr and thus also support IN / NOT IN queries with large numbers of constants.
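As a rough sketch of what direct IN list support could look like, the following uses a toy bloom filter rather than the Parquet bloom filter API. `ToyBloom` and `can_prune_row_group` are hypothetical names introduced for illustration. The idea is that a row group can be skipped for `col IN (...)` only when the filter reports every listed constant as definitely absent.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Toy bloom filter for illustration; not the Parquet `Sbbf` implementation.
struct ToyBloom {
    bits: Vec<bool>,
}

impl ToyBloom {
    fn new(num_bits: usize) -> Self {
        assert!(num_bits > 0, "filter needs at least one bit");
        Self { bits: vec![false; num_bits] }
    }

    // Two bit positions per value, derived from two seeded hashes.
    fn positions(&self, value: &str) -> [usize; 2] {
        let mut out = [0usize; 2];
        for (seed, slot) in out.iter_mut().enumerate() {
            let mut hasher = DefaultHasher::new();
            seed.hash(&mut hasher);
            value.hash(&mut hasher);
            *slot = (hasher.finish() % self.bits.len() as u64) as usize;
        }
        out
    }

    fn insert(&mut self, value: &str) {
        for p in self.positions(value) {
            self.bits[p] = true;
        }
    }

    // `false` means definitely absent; `true` means possibly present.
    fn might_contain(&self, value: &str) -> bool {
        self.positions(value).iter().all(|&p| self.bits[p])
    }
}

// A row group can be pruned for `col IN (list)` only if every constant
// in the list is definitely absent from the row group's filter.
fn can_prune_row_group(filter: &ToyBloom, in_list: &[&str]) -> bool {
    in_list.iter().all(|v| !filter.might_contain(v))
}

fn main() {
    let mut filter = ToyBloom::new(1024);
    filter.insert("constant1");

    // "constant1" may be present, so this row group must be read.
    let list = ["constant1", "constant2", "constant3", "constant4"];
    println!("prune: {}", can_prune_row_group(&filter, &list));
}
```

This check is independent of the list length, so it would not rely on the OR-chain rewrite. The sketch covers only IN: a bloom filter can show that a value is absent, but not that every row holds one of the listed values, so NOT IN would need separate handling.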
In terms of implementation, after #8437 is merged and #8376 is closed, this should be a straightforward matter of:
- Adding support in the LiteralGuarantee code (see "Add LiteralGuarantee on columns to extract conditions required for PhysicalExpr expressions to evaluate to true" #8437)
- Adding tests in LiteralGuarantee
- Adding an integration test for Bloom filters in datafusion/core/src/datasource/physical_plan/parquet/row_groups.rs
Describe alternatives you've considered
No response
Additional context
Found while I was working on #8376