Skip to content

Improve spill performance: Disable re-validation of spilled files #15454

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 11 commits into from
Apr 5, 2025

Conversation

zebsme
Copy link
Contributor

@zebsme zebsme commented Mar 27, 2025

Which issue does this PR close?

Rationale for this change

Speed up spill read by enabling skip_validation

What changes are included in this PR?

  • Add a benchmark to spill_read
  • Change arrow version to 54.3.0 and update related sqllogictests

Are these changes tested?

Yes

Are there any user-facing changes?

No

@github-actions github-actions bot added the sqllogictest SQL Logic Tests (.slt) label Mar 27, 2025
Copy link
Contributor

@2010YOUY01 2010YOUY01 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you, this is great.

read_spill/StreamReader/read_100/with_validation/
                        time:   [3.7400 ms 3.7508 ms 3.7639 ms]
Found 9 outliers among 100 measurements (9.00%)
  4 (4.00%) low mild
  3 (3.00%) high mild
  2 (2.00%) high severe
read_spill/StreamReader/read_100/skip_validation/
                        time:   [2.4500 ms 2.4623 ms 2.4785 ms]
Found 7 outliers among 100 measurements (7.00%)
  3 (3.00%) low mild
  2 (2.00%) high mild
  2 (2.00%) high severe

However, I'm not familiar with the arrow upgrade process, here are some very basic questions:
Should we upgrade all arrow sub-crates in lockstep (all arrow-* to 54.3.0) to avoid conflict? @alamb
This StreamReader update should be inside arrow-ipc crate, why only updating arrow dependency in cargo.toml make this update available?

criterion_group!(benches, bench_spill_read);
criterion_main!(benches);

fn read_spill(sender: Sender<Result<RecordBatch>>, path: &Path) -> Result<()> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest writing the benchmark by timing SpillManager's IO speed for test batches, instead of several thin wrappers on arrow functions, in the future it might get hard to figure out what this micro-bench is for. This way we can also track the IO performance over time, and it's okay not able to compare the validation difference in this PR, we can apply some manual changes to verify the speed-up.

Copy link
Contributor Author

@zebsme zebsme Mar 27, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you so much for the review, I'll follow your suggestion in the pr :)

// SAFETY: DataFusion's spill writer strictly follows Arrow IPC specifications
// with validated schemas and buffers. Skip redundant validation during read
// to speedup read operation. This is a deliberate safety-performance tradeoff.
let reader = unsafe { StreamReader::try_new(file, None)?.with_skip_validation(true) };
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍🏼

@@ -87,7 +87,7 @@ ahash = { version = "0.8", default-features = false, features = [
"runtime-rng",
] }
apache-avro = { version = "0.17", default-features = false }
arrow = { version = "54.2.1", features = [
arrow = { version = "54.3.0", features = [
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the other arrow dependencies need to be upgraded as well (this is automatically done by cargo already, but makes sense to update it in this file as well).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I agree -- we need to specify 54.3.0 as we are now using a feature from 54.3.0

and I think we should do all the other crates too as @Dandandan suggests

@alamb
Copy link
Contributor

alamb commented Mar 27, 2025

Thank you @zebsme

@2010YOUY01
Copy link
Contributor

Thank you! I think it's almost ready. The only remaining task is to upgrade the other Arrow dependencies to 54.3.0 similar to #14153 did.

@zebsme zebsme requested a review from alamb March 31, 2025 15:52
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @zebsme and @2010YOUY01 and @Dandandan

I'll wait to merge this until @2010YOUY01 has had a chance to review

@zebsme zebsme requested a review from 2010YOUY01 March 31, 2025 18:40

// BENCHMARK: REVALIDATION OVERHEAD COMPARISON
// ---------------------------------------------------------
// To compare performance with/without Arrow IPC validation:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍🏼

@alamb
Copy link
Contributor

alamb commented Apr 1, 2025

Merged up to get latest changes and rerun CI

@alamb
Copy link
Contributor

alamb commented Apr 2, 2025

Merged up from main to try and get a clean CI run

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the cargo audit is failing on this PR because that job only runs when there are changes to Cargo.toml

https://github.com/apache/datafusion/actions/runs/14228726321/job/39874423613

So in other words cargo audit is failing on main, we just happen not to be running the job. I will file a ticket to get it fixed

Update:

@alamb
Copy link
Contributor

alamb commented Apr 5, 2025

The audit check is failing on main as well, so no need to hold this PR

Thanks again @zebsme

@alamb alamb merged commit 5c31692 into apache:main Apr 5, 2025
28 of 29 checks passed
@getChan getChan mentioned this pull request Apr 25, 2025
3 tasks
nirnayroy pushed a commit to nirnayroy/datafusion that referenced this pull request May 2, 2025
…ache#15454)

* add skip_validation and comment

* add benchmark

* fmt

* update sqllogictests and fmt Cargo.toml

* use SpillManager in benchmark

* fix clippy

* upgrade dependencies

* Update comment

Co-authored-by: Daniël Heres <[email protected]>

---------

Co-authored-by: zeb <[email protected]>
Co-authored-by: Daniël Heres <[email protected]>
Co-authored-by: Andrew Lamb <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
sqllogictest SQL Logic Tests (.slt)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Improve spill performance: Disable re-validation of spilled files
4 participants