Skip to content

Remove redundant statistics from FileScanConfig #14955

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 2 commits into from
Apr 3, 2025

Conversation

Standing-Man
Copy link
Contributor

Which issue does this PR close?

Rationale for this change

Both FileScanConfig and DataSource has same statistics, it make that statistics may have out-of-sync bugs.

What changes are included in this PR?

Remove redundant statistics from FileScanConfig and only a single statistics that held on the DataSource.

Are these changes tested?

Are there any user-facing changes?

@github-actions github-actions bot added core Core DataFusion crate proto Related to proto crate labels Mar 1, 2025
@Standing-Man Standing-Man marked this pull request as draft March 1, 2025 13:01
@Standing-Man Standing-Man force-pushed the redundant-statistics branch from f3eb12a to c819b16 Compare March 2, 2025 01:11
@github-actions github-actions bot removed the core Core DataFusion crate label Mar 2, 2025
@alamb
Copy link
Contributor

alamb commented Mar 3, 2025

Looks like there are some CI issues to address

Note that @blaginin fixed some issues recently, so if you merge up from main it might be better now

@Standing-Man
Copy link
Contributor Author

Standing-Man commented Mar 3, 2025

Looks like there are some CI issues to address

Note that @blaginin fixed some issues recently, so if you merge up from main it might be better now

Hi @alamb, i will try to merge up from main, Thanks.

@Standing-Man Standing-Man force-pushed the redundant-statistics branch from c819b16 to bdfb9ac Compare March 3, 2025 12:58
@github-actions github-actions bot added the datasource Changes to the datasource crate label Mar 3, 2025
@Standing-Man
Copy link
Contributor Author

Hi @alamb and @blaginin, I found that four tests failed due to the statistics num_rows and total_byte_size. I'm confused about how to proceed with fixing this issue, and I noticed that it is related to #14936.
image

@blaginin
Copy link
Contributor

blaginin commented Mar 3, 2025

i feel it may be easier if we fix #14936 first. I was planning to do it this week, but feel free to take over (just take the issue then)

@Standing-Man Standing-Man force-pushed the redundant-statistics branch from bdfb9ac to 6743003 Compare March 8, 2025 05:20
@blaginin
Copy link
Contributor

hey @Standing-Man #15352 just got merged so this pr may be easier to finish fyi 🌻

@xudong963 xudong963 self-requested a review March 29, 2025 02:40
@Standing-Man Standing-Man force-pushed the redundant-statistics branch from 9397785 to 2d6e298 Compare March 30, 2025 01:04
Signed-off-by: Alan Tang <[email protected]>

chore: fix fmt and clippy errors
@Standing-Man Standing-Man force-pushed the redundant-statistics branch from cbeb95e to b99e658 Compare March 30, 2025 01:26
@Standing-Man
Copy link
Contributor Author

Standing-Man commented Mar 30, 2025

Hi @blaginin, Thanks for your valuable contributions! I’ll continue working on fixing this issue.

@Standing-Man Standing-Man marked this pull request as ready for review March 30, 2025 01:29
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @Standing-Man and @blaginin

FYI @xudong963 perhaps you can review this too as you are also working on statistics

@alamb alamb added the api change Changes the API exposed to users of the crate label Mar 30, 2025
Copy link
Member

@xudong963 xudong963 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@alamb
Copy link
Contributor

alamb commented Mar 30, 2025

It seems the statistics generated in line 884 will be lost.

https://github.com/apache/datafusion/blob/main/datafusion/core/src/datasource/listing/table.rs#L884-L960

I think the idea is that the DataSource has the statistics rather than the FileScanConfig 🤔

@xudong963
Copy link
Member

It seems the statistics generated in line 884 will be lost.

https://github.com/apache/datafusion/blob/main/datafusion/core/src/datasource/listing/table.rs#L884-L960

I think the idea is that the DataSource has the statistics rather than the FileScanConfig 🤔

yeah i know, but i don't see the statistics is set to the datasource. I will check the code again later.

Copy link
Member

@xudong963 xudong963 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the confusion, I saw this line: https://github.com/apache/datafusion/pull/14955/files#diff-6ddd45bba34238ec207080fd56251fffeee8ef263a3ae25be172362bab753a5cL401, so the statistics isn't lost.

Thank you for your contribution!

@alamb alamb merged commit 190634b into apache:main Apr 3, 2025
27 checks passed
@alamb
Copy link
Contributor

alamb commented Apr 3, 2025

Thanks again @Standing-Man and @xudong963

nirnayroy pushed a commit to nirnayroy/datafusion that referenced this pull request May 2, 2025
* enhance: Remove redundant statistics from FileScanConfig

Signed-off-by: Alan Tang <[email protected]>

* chore: fix some ci error

Signed-off-by: Alan Tang <[email protected]>

chore: fix fmt and clippy errors

---------

Signed-off-by: Alan Tang <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api change Changes the API exposed to users of the crate datasource Changes to the datasource crate proto Related to proto crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Remove redundant statistics from FileScanConfig
4 participants