Improve ParquetExec and related documentation #10647
Conversation
@thinkharderdev, @tustvold, @Ted-Jiang and @crepererum: if you have time, could you double check that this correctly describes:

/// * Multi-threaded (aka multi-partition): read from one or more files in
///   parallel. Can read concurrently from multiple row groups from a single file.
I would call this "concurrency" instead of "multi-threading". IIRC we don't implement ANY threading in this operator and solely rely on tokio to dispatch concurrent bits for us. I think it's fine to mention that the concurrency in this operator CAN lead to multi-core usage under specific circumstances.
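To illustrate that distinction, here is a minimal, self-contained sketch (the names `PartitionStream` and `drain_concurrently` are invented for illustration and are not DataFusion API): two partition streams are polled in turn on a single thread, so their rows interleave without any threads being spawned. Actual parallelism only appears if a runtime such as tokio schedules such streams onto multiple worker threads.

```rust
// A toy "partition stream" that yields rows one at a time.
struct PartitionStream {
    rows: Vec<u32>,
    pos: usize,
}

impl PartitionStream {
    fn poll_next(&mut self) -> Option<u32> {
        let row = self.rows.get(self.pos).copied();
        self.pos += 1;
        row
    }
}

// Poll every stream round-robin on the current thread: the streams make
// progress concurrently (their outputs interleave) with zero threading.
fn drain_concurrently(streams: &mut [PartitionStream]) -> Vec<u32> {
    let mut out = Vec::new();
    loop {
        let mut progressed = false;
        for s in streams.iter_mut() {
            if let Some(row) = s.poll_next() {
                out.push(row);
                progressed = true;
            }
        }
        if !progressed {
            break;
        }
    }
    out
}
```

Draining two streams `[1, 2]` and `[10, 20]` yields the interleaving `[1, 10, 2, 20]`, even though only one thread ever runs.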
/// table schema. This can be used to implement "schema evolution". See
/// [`SchemaAdapterFactory`] for more details.
///
/// * metadata_size_hint: controls the number of bytes read from the end of the
FWIW this is passed on to the reader (custom or builtin) and the reader uses it to gather the metadata. The reader CAN however use another, more precise source for this information, or not read the metadata from the object store at all (e.g. it could use an extra service, a dataset-based source, or some sort of cache).
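As a hedged sketch of that idea (the types below are simplified stand-ins invented for illustration; the real `AsyncFileReader` trait in the parquet crate is async and returns `ParquetMetaData`): a reader that answers metadata requests from an in-memory cache and only falls back to a simulated footer read on a miss, which is the only point where the size hint would matter.

```rust
use std::collections::HashMap;
use std::sync::Mutex;

// Simplified stand-in for `ParquetMetaData`; illustrative only.
#[derive(Clone, Debug, PartialEq)]
struct FileMetadata {
    num_rows: usize,
}

// Placeholder for an object-store read of the last `size_hint` bytes
// of the file (the real footer read the hint is meant to size).
fn read_footer(_path: &str, _size_hint: Option<usize>) -> FileMetadata {
    FileMetadata { num_rows: 100 }
}

// A metadata source that consults an in-memory cache first; the size hint
// only matters on a cache miss, and a real implementation could instead
// ask an external service or a dataset catalog and skip the footer read.
struct CachingSource {
    cache: Mutex<HashMap<String, FileMetadata>>,
}

impl CachingSource {
    fn metadata(&self, path: &str, size_hint: Option<usize>) -> FileMetadata {
        let mut cache = self.cache.lock().unwrap();
        cache
            .entry(path.to_string())
            .or_insert_with(|| read_footer(path, size_hint))
            .clone()
    }
}
```

The second lookup for the same path never touches `read_footer` at all, which is why the hint is best understood as advice to the reader rather than a guarantee about I/O.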
///
/// * Limit pushdown: stop execution early after some number of rows are read.
///
/// * Custom readers: controls I/O for accessing pages. See |
Suggested change:
- /// * Custom readers: controls I/O for accessing pages. See
+ /// * Custom readers: implements I/O for accessing byte ranges and the metadata object. See
It's not steering the IO process, it's actually responsible for performing (or not performing) it. For example, a custom impl. could totally NOT use an object store (which is esp. interesting for the metadata bit, see other comment below).
good call -- updated
/// * Step 3: The `ParquetOpener` gets the file metadata by reading the footer,
/// and applies any predicates and projections to determine what pages must be
/// read.
It gets the metadata from the `ParquetFileReaderFactory`, or more specifically from the `AsyncFileReader` that this factory returns. The `ParquetOpener` doesn't care where the metadata comes from.
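A minimal sketch of that indirection (the trait and type names below are invented for illustration and are not the real DataFusion signatures, which are async): the opener only ever talks to whatever reader the factory hands back, so the footer and metadata can come from object storage, memory, or anywhere else.

```rust
// Illustrative stand-ins for `AsyncFileReader` / `ParquetFileReaderFactory`.
trait FileReader {
    fn footer_bytes(&self) -> Vec<u8>;
}

trait FileReaderFactory {
    fn create(&self, path: &str) -> Box<dyn FileReader>;
}

// One possible implementation: the "footer" is served straight from
// memory, with no object store involved at all.
struct InMemoryReader {
    footer: Vec<u8>,
}

impl FileReader for InMemoryReader {
    fn footer_bytes(&self) -> Vec<u8> {
        self.footer.clone()
    }
}

struct InMemoryFactory;

impl FileReaderFactory for InMemoryFactory {
    fn create(&self, _path: &str) -> Box<dyn FileReader> {
        // "PAR1" is the real magic string that ends every parquet footer.
        Box::new(InMemoryReader { footer: b"PAR1".to_vec() })
    }
}

// The "opener" never constructs a reader itself; it only uses the factory,
// so it is oblivious to where the bytes actually come from.
fn open_footer(factory: &dyn FileReaderFactory, path: &str) -> Vec<u8> {
    factory.create(path).footer_bytes()
}
```

Swapping in a different factory (cached, object-store-backed, service-backed) requires no change to `open_footer`, which is the point of the indirection.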
/// Interface for creating [`AsyncFileReader`]s to read parquet files.
///
/// This interface is used by [`ParquetOpener`] in order to create readers for
/// parquet files. Implementations of this trait can be used to provide custom
What's "this trait" in this case? I guess you're referring to `AsyncFileReader`, not `ParquetFileReaderFactory`, here. To avoid confusion and give the user more freedom in how/where they implement "pre-cached data, I/O ..." etc., I suggest starting a new paragraph and saying:

The combined implementations of [`ParquetFileReaderFactory`] and [`AsyncFileReader`]
can be used to provide custom data access operations such as
pre-cached data, I/O coalescing, etc.
Excellent idea. I did so
datafusion/core/src/datasource/physical_plan/parquet/schema_adapter.rs
lgtm thanks @alamb
* Improve ParquetExec and related documentation
* Improve documentation
* Update datafusion/core/src/datasource/physical_plan/parquet/schema_adapter.rs
* fix link

Co-authored-by: Oleks V <[email protected]>
Which issue does this PR close?
Part of #10549
Rationale for this change
While trying to make an example that uses ParquetExec, I found its documentation was sparse and could be improved
What changes are included in this PR?
Are these changes tested?
By existing CI
Are there any user-facing changes?
Just doc strings, no functional changes