Skip to content

Datafusion-cli: Redesign the datafusion-cli execution and print, make it totally streaming printing without memory overhead #14877

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 8 commits into from
Feb 26, 2025

Conversation

zhuqi-lucas
Copy link
Contributor

@zhuqi-lucas zhuqi-lucas commented Feb 25, 2025

Which issue does this PR close?

Rationale for this change

Redesign the datafusion-cli execution and print, make it totally streaming printing without memory overhead.
Almost no extra memory usage for datafusion-cli now after this PR.

What changes are included in this PR?

Redesign the datafusion-cli execution and print, make it totally streaming printing without memory overhead.
Almost no extra memory usage for datafusion-cli now after this PR.

  1. Change all datafusion-cli printing to streaming
  2. Support preview the column width format printing, default is 1000 lines
  3. Also support csv/json, etc format, max rows support which not supporting before
  4. Totally removing the datafusion-cli memory overhead, the only occupied memory is the current batch for batchs iterator.

Are these changes tested?

Unit test added.

Also mannul tested the datafusion-cli.

Are there any user-facing changes?

Almost no extra memory usage for datafusion-cli now!

@alamb alamb mentioned this pull request Feb 25, 2025
10 tasks
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @zhuqi-lucas -- this is really cool!

I have a suggestion that I think would make the code better, but we could do it as a follow on PR or never in my opinion.

In addition to the very nice tests you have in this PR I tested it manaully (notes below) and it worked great

Testing without limit:

cargo run -p datafusion-cli -- --maxrows inf -c "select * from 'benchmarks/data/hits.parquet'"

And output was produced immediately and streamed out

I also tested with

cargo run -p datafusion-cli --  -c "select * from 'benchmarks/data/hits.parquet'"

Which printed out the first 40 rows and then continued running but did not print anything, as expected

@@ -250,37 +245,43 @@ pub(super) async fn exec_and_print(
}
// As the input stream comes, we can generate results.
// However, memory safety is not guaranteed.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not related to this PR, but I don't understand this comment about Memory safety

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's also confusing to me, maybe we can remove the message. It should be memory safe, we only keep one batch size.

@@ -209,6 +211,145 @@ impl PrintFormat {
}
Ok(())
}

#[allow(clippy::too_many_arguments)]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

most of the newly added functions don't use fields on PrintFormat

I think we could probably avoid the

    #[allow(clippy::too_many_arguments)]

By encapsulating all the mut variables here in a new struct

Maybe something like

struct OutputStreamState {
        preview_batches: Vec<RecordBatch>,
        preview_row_count: usize,
        preview_limit: usize,
        precomputed_widths: Option<Vec<usize>>,
        header_printed: bool,
        writer: &mut dyn std::io::Write,
}

impl OutputStreamState {
  fn process_batch(&mut self, batch: RecordBatch) { ... }
...
  fn compute_column_widths(&self, 
    batches: &Vec<RecordBatch>,
    schema: SchemaRef,
  ) {...}

}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @alamb for review, this is a very good suggestion!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Created a follow-up ticket:
#14886

}

// test print_batch with different batch widths
// and preview count is less than the first batch
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@zhuqi-lucas
Copy link
Contributor Author

Thank you @zhuqi-lucas -- this is really cool!

I have a suggestion that I think would make the code better, but we could do it as a follow on PR or never in my opinion.

In addition to the very nice tests you have in this PR I tested it manaully (notes below) and it worked great

Testing without limit:

cargo run -p datafusion-cli -- --maxrows inf -c "select * from 'benchmarks/data/hits.parquet'"

And output was produced immediately and streamed out

I also tested with

cargo run -p datafusion-cli --  -c "select * from 'benchmarks/data/hits.parquet'"

Which printed out the first 40 rows and then continued running but did not print anything, as expected

Thank you @alamb for review, i created a follow-up to do the code clean:
#14886

And we can do some code clean work including:

  1. Add streaming state struct and clean up print batch logic
  2. Remove older logics about max-rows
  3. Add more docs and remove confusing docs
  4. Etc

@alamb alamb merged commit 53fc94f into apache:main Feb 26, 2025
25 checks passed
@alamb
Copy link
Contributor

alamb commented Feb 26, 2025

Thanks again @zhuqi-lucas

@alamb
Copy link
Contributor

alamb commented Feb 28, 2025

I believe I also found another related issue with explain plans

alamb added a commit to alamb/datafusion that referenced this pull request Feb 28, 2025
…nt, make it totally streaming printing without memory overhead (apache#14877)"

This reverts commit 53fc94f.
xudong963 pushed a commit that referenced this pull request Mar 1, 2025
…t, make it totally streaming printing without memory overhead (#14948)

* Revert "Fix: New Datafusion-cli streaming printing way should handle corner case for only one small batch which lines are less than max_rows (#14921)"

This reverts commit 463ef3b.

* Revert "Datafusion-cli: Redesign the datafusion-cli execution and print, make it totally streaming printing without memory overhead (#14877)"

This reverts commit 53fc94f.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Further improve datafusion-cli memory usage if we setting huge number for maxrow size.
2 participants