Add example for building an external secondary index for parquet files #10549

alamb · 2024-05-16T15:32:53Z

Note: While this PR looks very large (715 lines), around half of the content is comments / docstrings.

Which issue does this PR close?

Rationale for this change

Building and using external indexes in DataFusion is an important feature. Adding an example of how to do so will help drive the design and APIs.

What changes are included in this PR?

New Example

Are these changes tested?

CI

Are there any user-facing changes?

No -- just an example

TODOs

Propose a nicer API for extracting the statistics
Connect pruning predicate into scan to avoid scanning files
File tickets / PRs to make creating ParquetExec easier
Try to contribute some PRs / documentation upstream in the parquet crate to make it easier to work with parquet statistics

alamb · 2024-05-22T12:05:10Z

datafusion-examples/examples/parquet_index.rs

+use tempfile::TempDir;
+use url::Url;
+
+/// This example demonstrates building a secondary index over multiple Parquet


I think the example speaks for itself in terms of comments, so I don't plan to add additional ones on the PR unless something is unclear

alamb · 2024-05-22T20:43:55Z

This PR is now ready for review

alamb · 2024-05-27T12:16:02Z

@crepererum and @NGA-TRAN -- here is a PR ready for your review that shows how to do file level pruning with statistics.

I will make an example of how to do row group level / page level pruning next
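For readers new to the idea, file level pruning with statistics can be sketched without any DataFusion APIs. The snippet below is only an illustration under stated assumptions: `FileStats`, `MinMaxIndex`, `files_for_eq`, and `sample_index` are hypothetical names that do not come from this PR, and the index tracks a single integer column. The real example in `parquet_index.rs` reads the statistics from Parquet metadata instead.

```rust
/// Per-file summary kept in a (hypothetical) external index: the minimum
/// and maximum value of one integer column in that file.
#[derive(Debug)]
struct FileStats {
    file_name: String,
    min: i64,
    max: i64,
}

/// A toy secondary index: one `FileStats` entry per Parquet file.
struct MinMaxIndex {
    files: Vec<FileStats>,
}

impl MinMaxIndex {
    /// Return the names of files that *may* contain rows where the column
    /// equals `value`. Files whose [min, max] range excludes `value` are
    /// pruned and never need to be opened.
    fn files_for_eq(&self, value: i64) -> Vec<&str> {
        self.files
            .iter()
            .filter(|f| f.min <= value && value <= f.max)
            .map(|f| f.file_name.as_str())
            .collect()
    }
}

/// Build a small index with made-up file names and ranges.
fn sample_index() -> MinMaxIndex {
    let entry = |name: &str, min: i64, max: i64| FileStats {
        file_name: name.to_string(),
        min,
        max,
    };
    MinMaxIndex {
        files: vec![
            entry("file1.parquet", 0, 99),
            entry("file2.parquet", 100, 199),
            entry("file3.parquet", 150, 299),
        ],
    }
}

fn main() {
    let index = sample_index();
    // Only files whose range covers 175 remain as scan candidates.
    let candidates = index.files_for_eq(175);
    println!("files to scan for value = 175: {candidates:?}");
}
```

The ranges only rule files out. A file that survives pruning may still contain no matching rows, so the scan must still apply the filter.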

NGA-TRAN · 2024-05-28T14:19:31Z

I am starting to review this.

crepererum

The example is a bit long (in lines of code) but I also don't have any concrete recommendation on how to make it shorter w/o sacrificing its good readability. Maybe -- but that's personal taste -- collapse the use statements a bit, e.g. instead of

use foo::Bar;
use foo::Baz;

write

use foo::{Bar, Baz};

alamb · 2024-05-31T10:28:36Z

datafusion-examples/examples/parquet_index.rs

+// specific language governing permissions and limitations
+// under the License.
+
+use arrow::array::{


I collapsed this a bit, but paradoxically, when I tried to collapse them even more, the number of lines actually increased.

For example, this goes from 4 lines

use datafusion::datasource::listing::PartitionedFile;
use datafusion::datasource::physical_plan::parquet::{
    RequestedStatistics, StatisticsConverter,
};

to 6 lines (though it is less dense)

use datafusion::datasource::{
    listing::PartitionedFile,
    physical_plan::parquet::{
        RequestedStatistics, StatisticsConverter,
    },
};

I personally find the grouped ones easier to parse though.

Another reason I have heard for using the single line includes is that it reduces merge conflicts when people change / update the includes. I am not sure how relevant this is in this case

FWIW the includes were automatically created by my editor (RustRover).

alamb · 2024-05-31T10:35:04Z

Thank you very much for the review @crepererum

apache#10549)
* Add example for building an external index for parquet filtes
* Use register_object_store api
* use FileScanConfig API
* Udpate to use new API
* Collapose `use` statements
* fix typo

alamb changed the title ~~Alamb/external parquet index~~ Add example for building an external index for parquet files May 16, 2024

alamb added the documentation Improvements or additions to documentation label May 16, 2024

alamb force-pushed the alamb/external_parquet_index branch from 55f92ed to b44bffb Compare May 16, 2024 16:12

github-actions bot removed the documentation Improvements or additions to documentation label May 16, 2024

alamb force-pushed the alamb/external_parquet_index branch from b44bffb to 814bff7 Compare May 17, 2024 11:28

github-actions bot added the core Core DataFusion crate label May 17, 2024

Add example for building an external index for parquet filtes

460c419

alamb force-pushed the alamb/external_parquet_index branch from 8be8b09 to 460c419 Compare May 22, 2024 11:17

github-actions bot removed the core Core DataFusion crate label May 22, 2024

alamb changed the title ~~Add example for building an external index for parquet files~~ Add example for building an external secondary index for parquet files May 22, 2024

alamb mentioned this pull request May 22, 2024

Clean up parquet_index example #10618

Closed

2 tasks

alamb marked this pull request as ready for review May 22, 2024 12:03

This was referenced May 22, 2024

Minor: Improve ObjectStoreUrl docs + examples #10619

Merged

Add SessionContext::register_object_store #10621

Merged

Add FileScanConfig::new() API #10623

Merged

alamb mentioned this pull request May 23, 2024

Add ParquetExec::builder(), deprecate ParquetExec::new #10636

Merged

Merge remote-tracking branch 'apache/main' into alamb/external_parquet_index

d5542b1

This was referenced May 23, 2024

Simplify ParquetExec::new() #10643

Closed

Improve ParquetExec and related documentation #10647

Merged

alamb added 3 commits May 27, 2024 08:05

Merge remote-tracking branch 'apache/main' into alamb/external_parquet_index

e6efa10

Use register_object_store api

d83adb2

use FileScanConfig API

8d5237e

alamb mentioned this pull request May 28, 2024

Example for building an external index for parquet files #10546

Closed

alamb added 5 commits May 29, 2024 05:37

Merge remote-tracking branch 'apache/main' into alamb/external_parquet_index

230d785

Merge remote-tracking branch 'apache/main' into alamb/external_parquet_index

472f3be

Merge remote-tracking branch 'apache/main' into alamb/external_parquet_index

e62498d

Udpate to use new API

0b38b53

Merge remote-tracking branch 'apache/main' into alamb/external_parquet_index

2aa269e

crepererum approved these changes May 31, 2024

alamb added 2 commits May 31, 2024 06:26

Collapose use statements

468c51f

Merge remote-tracking branch 'apache/main' into alamb/external_parquet_index

8bdbfce

fix typo

578a346

alamb merged commit 09dde27 into apache:main May 31, 2024
23 checks passed

alamb deleted the alamb/external_parquet_index branch May 31, 2024 11:12