Support IN lists with more than three constants in predicates for bloom filters #8436

Closed
@alamb

Is your feature request related to a problem or challenge?

BloomFilter support was added in #7821 by @hengfeiyang ❤️

There is partial support for optimizing queries that have IN list predicates, as suggested by @Ted-Jiang in #7821 (comment) and tested via https://github.com/apache/arrow-datafusion/blob/0d7cab055cb39d6df751e070af5a0bf5444e3849/datafusion/core/src/datasource/physical_plan/parquet/row_groups.rs#L1056-L1084

However, this only supports queries where there are three or fewer items in the IN list:

SELECT * 
FROM parquet_file 
WHERE col IN ('foo', 'bar', 'baz')

It only works for small numbers of constants because the current implementation only checks for predicates like col = 'foo' OR col = 'bar'. The reason this works for IN lists at all is that lists with a small number of items (three or fewer) are rewritten to OR chains by this code in the optimizer:

https://github.com/apache/arrow-datafusion/blob/0d7cab055cb39d6df751e070af5a0bf5444e3849/datafusion/optimizer/src/simplify_expressions/expr_simplifier.rs#L500-L549
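To make the rewrite concrete, here is a minimal sketch of the idea, not the actual optimizer code. The `Expr` enum and the `inline_small_in_list` function are hypothetical stand-ins invented for illustration; only the constant name THRESHOLD_INLINE_INLIST and its value of 3 come from the description above.

```rust
// Toy expression type for illustration only; DataFusion's real `Expr` is much richer.
#[derive(Debug, Clone, PartialEq)]
enum Expr {
    /// `col = 'value'`
    Eq { col: String, value: String },
    /// `left OR right`
    Or(Box<Expr>, Box<Expr>),
    /// `col IN (values...)`
    InList { col: String, values: Vec<String> },
}

// Name taken from the issue text; 3 matches the "three or fewer" behavior it describes.
const THRESHOLD_INLINE_INLIST: usize = 3;

/// Hypothetical helper: rewrite a small IN list into a chain of `OR`ed equalities.
/// Lists longer than the threshold are returned unchanged, which is why an
/// OR-only bloom filter check never sees them as equality predicates.
fn inline_small_in_list(expr: Expr) -> Expr {
    match expr {
        Expr::InList { col, values }
            if !values.is_empty() && values.len() <= THRESHOLD_INLINE_INLIST =>
        {
            let mut eqs = values
                .into_iter()
                .map(|value| Expr::Eq { col: col.clone(), value });
            let first = eqs.next().expect("list is non-empty");
            // Fold the remaining equalities onto the first one: (a OR b) OR c
            eqs.fold(first, |acc, eq| Expr::Or(Box::new(acc), Box::new(eq)))
        }
        other => other,
    }
}
```

With this sketch, `col IN ('foo', 'bar')` becomes `col = 'foo' OR col = 'bar'`, while a four-item list stays an `InList` node and so is invisible to code that only looks for OR chains.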

Thus, the current bloom filter code will not work for queries with large numbers (more than THRESHOLD_INLINE_INLIST) of constants in the IN list, such as

SELECT * 
FROM parquet_file 
WHERE col IN (
  'constant1',
  'constant2',
  ..,
  'constant99',
  'constant100'
)

Describe the solution you'd like

I would like the bloom filter code to directly support InListExpr and thus also support IN / NOT IN queries with large numbers of constants.

In terms of implementation, after #8437 is merged and #8376 is closed, this should be a straightforward matter of:

  1. Adding support in the LiteralGuarantee code (see "Add LiteralGuarantee on columns to extract conditions required for PhysicalExpr expressions to evaluate to true", #8437)
  2. Adding tests in LiteralGuarantee
  3. Adding an integration test for bloom filters in datafusion/core/src/datasource/physical_plan/parquet/row_groups.rs
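As a rough illustration of what the steps above are aiming for, the sketch below shows how a row group could be pruned for an IN predicate once the full list of constants is available to the bloom filter check. Everything here is an assumption made for illustration: the `BloomProbe` trait and `can_prune_row_group` function are hypothetical and are not DataFusion or parquet APIs, and a `HashSet` stands in for a real bloom filter (a real filter may also report false positives, which only makes pruning less frequent, never incorrect).

```rust
use std::collections::HashSet;

/// Hypothetical abstraction over a bloom filter probe.
/// `false` means "definitely absent"; `true` means "possibly present".
trait BloomProbe {
    fn might_contain(&self, value: &str) -> bool;
}

// Stand-in for a real bloom filter: an exact set, so it has no false positives.
impl BloomProbe for HashSet<String> {
    fn might_contain(&self, value: &str) -> bool {
        self.contains(value)
    }
}

/// Hypothetical helper: a row group can be skipped for `col IN (values...)`
/// only when the filter reports every constant as definitely absent.
/// The check is independent of how many constants the list contains.
fn can_prune_row_group(filter: &impl BloomProbe, values: &[&str]) -> bool {
    !values.iter().any(|v| filter.might_contain(v))
}
```

For example, with a filter built over `constant1` and `constant7`, a list containing `constant1` keeps the row group, while a list of five constants that are all absent allows it to be skipped.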

Describe alternatives you've considered

No response

Additional context

Found while I was working on #8376

Labels: enhancement
