Skip to content

Wrong query results for filters that involve partition columns and data file columns and pushdown_filters is enabled #15912

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
adriangb opened this issue May 1, 2025 · 7 comments · Fixed by #15935
Labels
bug Something isn't working

Comments

@adriangb
Copy link
Contributor

adriangb commented May 1, 2025

Describe the bug

Filters such as partition_col = col_from_file are never applied if datafusion.execution.parquet.pushdown_filters = true

To Reproduce

With datafusion-cli:

COPY  (
  SELECT arrow_cast('a', 'Utf8') AS val
)  TO 'test_files/scratch/test/part=a/123.parquet'
STORED AS PARQUET;
COPY  (
  SELECT arrow_cast('b', 'Utf8') AS val
)  TO 'test_files/scratch/test/part=b/123.parquet'
STORED AS PARQUET;
COPY  (
  SELECT arrow_cast('xyz', 'Utf8') AS val
)  TO 'test_files/scratch/test/part=c/123.parquet'
STORED AS PARQUET;

set datafusion.execution.parquet.pushdown_filters = true;

CREATE EXTERNAL TABLE test(part text, val text)
STORED AS PARQUET
PARTITIONED BY (part)
LOCATION 'test_files/scratch/test/';

SELECT * FROM test;

explain analyze
select * from test
where part != val;
> select * from test
where part != val;
+-----+------+
| val | part |
+-----+------+
| a   | a    |
| xyz | c    |
| b   | b    |
+-----+------+
3 row(s) fetched. 

Which is clearly wrong.

Expected behavior

> select * from test
where part != val;
+-----+------+
| val | part |
+-----+------+
| xyz | c    |
+-----+------+

Additional context

No response

@adriangb adriangb added the bug Something isn't working label May 1, 2025
@alamb
Copy link
Contributor

alamb commented May 2, 2025

sounds serious

@alamb alamb changed the title Wrong query results for filters that involve partition columns and data file columns Wrong query results for filters that involve partition columns and data file columns and pushdown_filters is enabled May 2, 2025
@alamb
Copy link
Contributor

alamb commented May 2, 2025

This is all the more reason I think to avoid adding more complexity to ListingTable as we are disucssing in

@adriangb
Copy link
Contributor Author

adriangb commented May 2, 2025

I think the fix is relatively simple though: any filters that reference both partition columns and data columns need to be marked as Inexact. I'm traveling so don't know that I can make a PR soon but it's a couple LOC.

@adriangb
Copy link
Contributor Author

adriangb commented May 2, 2025

Also although yes this is serious I suspect is pretty rare to have a filter that depends on both a partition column and data column. It hasn't been reported for years...

@alamb
Copy link
Contributor

alamb commented May 2, 2025

Also although yes this is serious I suspect is pretty rare to have a filter that depends on both a partition column and data column. It hasn't been reported for years...

I also think it is not that common to turn on filter_pushdown and use it with ListingTable - e.g. InfluxData and Comet both do not use ListingTable 🤔

@adriangb
Copy link
Contributor Author

adriangb commented May 2, 2025

Good point

@adriangb
Copy link
Contributor Author

adriangb commented May 3, 2025

Looking at the git blame... it looks like I introduced the bug in #15263.
Feeling guilty so here is the fix: #15935

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants