-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add example for building an external secondary index for parquet files #10549
Conversation
55f92ed
to
b44bffb
Compare
b44bffb
to
814bff7
Compare
8be8b09
to
460c419
Compare
use tempfile::TempDir; | ||
use url::Url; | ||
|
||
/// This example demonstrates building a secondary index over multiple Parquet |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the example speaks for itself in terms of comments, so I don't plan to add additional ones on the PR unless something is unclear
This PR is now ready for review |
@crepererum and @NGA-TRAN -- here is a PR ready for your review that shows how to do file level pruning with statistics. I will make an example of how to do row group level / page level pruning next |
I start reviewing this |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The example is a bit long (in lines of code) but I also don't have any concrete recommendation on how to make it shorter w/o sacrificing its good readability. Maybe -- but that's personal taste -- collapse the use
statements a bit, e.g. instead of
use foo::Bar;
use foo::Baz;
use
use foo::{Bar, Baz};
// specific language governing permissions and limitations | ||
// under the License. | ||
|
||
use arrow::array::{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I collapsed this a bit but paradoxically when I tried to collapse them even more the number of lines actually increased
For example, this goes from 4 lines
use datafusion::datasource::listing::PartitionedFile;
use datafusion::datasource::physical_plan::parquet::{
RequestedStatistics, StatisticsConverter,
};
to 6 lines (though it is less dense)
use datafusion::datasource::{
listing::PartitionedFile;
physical_plan::parquet::{
RequestedStatistics, StatisticsConverter,
}
};
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I personally find the grouped ones easier to parse though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another reason I have heard for using the single line includes is that it reduces merge conflicts when people change / update the includes. I am not sure how relevant this is in this case
FWIW the includes were automatically created by my editor (rust rover).
Thank you very much for the review @crepererum |
apache#10549) * Add example for building an external index for parquet filtes * Use register_object_store api * use FileScanConfig API * Udpate to use new API * Collapose `use` statements * fix typo
Note: While this PR looks very large (715 lines) around half of the content is comments / docstrings
Which issue does this PR close?
Closes #10546
Rationale for this change
See #10546
Building and using external indexes in DataFusion is an important feature. Adding an example of how to do so will help drive the design and APIs
What changes are included in this PR?
New Example
Are these changes tested?
CI
Are there any user-facing changes?
No -- just an example
TODOs