[spark] Fix reported statistics does not do column pruning #4137
Conversation
cc @JingsongLi @Zouxxyy, thank you
@Zouxxyy We may need to run a TPC-DS test to show the impact.
@ulysses-you Thanks for the contribution, have you tested the effect in production scenarios?
Thank you @Zouxxyy. I tested with 20 columns. Without this PR:
With this PR:
In general, the
.map(_.avgLen())
.filter(_.isPresent)
.map(_.getAsLong)
.getOrElse(field.`type`().defaultSize().toLong)
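For context, the chain above falls back to the data type's default size when the column statistics lack an average length. A minimal standalone sketch of the same pattern (the `ColStats` holder and `estimatedFieldSize` helper are hypothetical, not the Paimon API):

```scala
import java.util.OptionalLong

// Hypothetical column-stats holder: the average length may be absent.
final case class ColStats(avgLen: OptionalLong)

// Same fallback pattern as the diff above: use avgLen from column
// statistics when present, otherwise the data type's default size.
def estimatedFieldSize(stats: Option[ColStats], defaultSize: Long): Long =
  stats
    .map(_.avgLen)
    .filter(_.isPresent)
    .map(_.getAsLong)
    .getOrElse(defaultSize)
```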
Should we use the default size of Spark's field here? Because the data will actually be converted to a Spark row.
I prefer to use Paimon's data type default size. During scan we have not yet converted to a Spark row; Paimon returns a SparkInternalRow to Spark, which is maintained using Paimon's memory structure. The row conversion happens when Spark wraps a Project, and if there is a Project, Spark re-calculates sizeInBytes using Spark's data type default sizes anyway.
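The effect under discussion can be illustrated with a toy size estimate (all names here are hypothetical, not the actual API): after column pruning, only the projected columns contribute their default sizes to the reported estimate.

```scala
// Illustrative per-type default sizes, in the spirit of the
// DataType default-size idea discussed in this PR (values made up).
val defaultSizes: Map[String, Long] =
  Map("INT" -> 4L, "BIGINT" -> 8L, "DOUBLE" -> 8L, "STRING" -> 20L)

// Estimate sizeInBytes from the *pruned* schema only: reading 2 of
// 20 columns shrinks the estimate accordingly.
def estimateSizeInBytes(rowCount: Long, prunedTypes: Seq[String]): Long =
  rowCount * prunedTypes.map(defaultSizes).sum
```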
Got it!
+1
+1
Purpose
SupportsReportStatistics#estimateStatistics should report statistics after column pruning and filter push-down. This PR fixes the issue that column pruning was not applied. In addition, the metadata column size should be taken into account. This PR also introduces a defaultSize method for DataType, so that we can get the size in bytes without column statistics.
Tests
add test
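The defaultSize idea from the Purpose section can be sketched as follows. This is a hypothetical stand-in, not the actual Paimon DataType API: each type reports a fixed byte width, so a row-width estimate needs no column statistics.

```scala
// Hypothetical sketch (not the real Paimon API): a per-type default
// size lets us estimate row width without any column statistics.
sealed trait SketchType { def defaultSize: Int }
case object IntType extends SketchType { val defaultSize = 4 }
case object BigIntType extends SketchType { val defaultSize = 8 }
// Variable-length types get a rough fixed estimate.
case object StringType extends SketchType { val defaultSize = 20 }

// Row width in bytes for a (possibly pruned) list of fields.
def rowWidth(fields: Seq[SketchType]): Int = fields.map(_.defaultSize).sum
```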
API and Format
no
Documentation