-
Notifications
You must be signed in to change notification settings - Fork 1.5k
Datafusion-cli: Redesign the datafusion-cli execution and print, make it totally streaming printing without memory overhead #14877
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @zhuqi-lucas -- this is really cool!
I have a suggestion that I think would make the code better, but we could do it as a follow on PR or never in my opinion.
In addition to the very nice tests you have in this PR I tested it manaully (notes below) and it worked great
Testing without limit:
cargo run -p datafusion-cli -- --maxrows inf -c "select * from 'benchmarks/data/hits.parquet'"
And output was produced immediately and streamed out
I also tested with
cargo run -p datafusion-cli -- -c "select * from 'benchmarks/data/hits.parquet'"
Which printed out the first 40 rows and then continued running but did not print anything, as expected
@@ -250,37 +245,43 @@ pub(super) async fn exec_and_print( | |||
} | |||
// As the input stream comes, we can generate results. | |||
// However, memory safety is not guaranteed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not related to this PR, but I don't understand this comment about Memory safety
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's also confusing to me, maybe we can remove the message. It should be memory safe, we only keep one batch size.
@@ -209,6 +211,145 @@ impl PrintFormat { | |||
} | |||
Ok(()) | |||
} | |||
|
|||
#[allow(clippy::too_many_arguments)] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
most of the newly added functions don't use fields on PrintFormat
I think we could probably avoid the
#[allow(clippy::too_many_arguments)]
By encapsulating all the mut
variables here in a new struct
Maybe something like
struct OutputStreamState {
preview_batches: Vec<RecordBatch>,
preview_row_count: usize,
preview_limit: usize,
precomputed_widths: Option<Vec<usize>>,
header_printed: bool,
writer: &mut dyn std::io::Write,
}
impl OutputStreamState {
fn process_batch(&mut self, batch: RecordBatch) { ... }
...
fn compute_column_widths(&self,
batches: &Vec<RecordBatch>,
schema: SchemaRef,
) {...}
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @alamb for review, this is a very good suggestion!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Created a follow-up ticket:
#14886
} | ||
|
||
// test print_batch with different batch widths | ||
// and preview count is less than the first batch |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
Thank you @alamb for review, i created a follow-up to do the code clean: And we can do some code clean work including:
|
Thanks again @zhuqi-lucas |
FYI we found one issue with this code here (fixed in #14921): |
I believe I also found another related issue with explain plans |
…nt, make it totally streaming printing without memory overhead (apache#14877)" This reverts commit 53fc94f.
…t, make it totally streaming printing without memory overhead (#14948) * Revert "Fix: New Datafusion-cli streaming printing way should handle corner case for only one small batch which lines are less than max_rows (#14921)" This reverts commit 463ef3b. * Revert "Datafusion-cli: Redesign the datafusion-cli execution and print, make it totally streaming printing without memory overhead (#14877)" This reverts commit 53fc94f.
Which issue does this PR close?
Rationale for this change
Redesign the datafusion-cli execution and print, make it totally streaming printing without memory overhead.
Almost no extra memory usage for datafusion-cli now after this PR.
What changes are included in this PR?
Redesign the datafusion-cli execution and print, make it totally streaming printing without memory overhead.
Almost no extra memory usage for datafusion-cli now after this PR.
Are these changes tested?
Unit test added.
Also mannul tested the datafusion-cli.
Are there any user-facing changes?
Almost no extra memory usage for datafusion-cli now!