Skip to content

Improve spill performance: Disable re-validation of spilled files #15320

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
Tracked by #15271
alamb opened this issue Mar 19, 2025 · 4 comments · Fixed by #15454
Closed
Tracked by #15271

Improve spill performance: Disable re-validation of spilled files #15320

alamb opened this issue Mar 19, 2025 · 4 comments · Fixed by #15454
Assignees
Labels
enhancement New feature or request

Comments

@alamb
Copy link
Contributor

alamb commented Mar 19, 2025

Is your feature request related to a problem or challenge?

Today when DataFusion spills files to disk, it uses the Arrow IPC format

Here is the code:

pub(crate) fn spill_record_batches(
batches: &[RecordBatch],
path: PathBuf,
schema: SchemaRef,
) -> Result<(usize, usize)> {
let mut writer = IPCStreamWriter::new(path.as_ref(), schema.as_ref())?;
for batch in batches {
writer.write(batch)?;
}
writer.finish()?;
debug!(
"Spilled {} batches of total {} rows to disk, memory released {}",
writer.num_batches,
writer.num_rows,
human_readable_size(writer.num_bytes),
);
Ok((writer.num_rows, writer.num_bytes))
}
fn read_spill(sender: Sender<Result<RecordBatch>>, path: &Path) -> Result<()> {
let file = BufReader::new(File::open(path)?);
let reader = StreamReader::try_new(file, None)?;
for batch in reader {
sender
.blocking_send(batch.map_err(Into::into))
.map_err(|e| exec_datafusion_err!("{e}"))?;
}
Ok(())
}

The IPC reader currently re-validates that all the data written is valid arrow data (for example, that the strings are valid utf8)

Disabling the validation resulted in a 3x performance increase in the arrow benchmarks

Here are the relvant arrow-rs prs / issues:

Describe the solution you'd like

I would like to disable validation when reading the spill files back in.

Describe alternatives you've considered

  1. Disable validation when reading spill files in
  2. Justify that change with comments explaining that we trust that nothing messed with the files after datafusion wrote them
  3. Add / use a benchmark showing the peformance benefit of doing this

Additional context

@shruti2522
Copy link
Contributor

take

@zebsme
Copy link
Contributor

zebsme commented Mar 27, 2025

After disable the validation, tpch benchmark on my m1pro :

┏━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓ 
┃ Query        ┃     main ┃ issue-15320 ┃        Change ┃ 
┡━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩ 
| QQuery 1     | 108.66ms |    101.82ms | +1.07x faster | 
| QQuery 2     |  26.13ms |     24.84ms |     no change | 
| QQuery 3     |  44.42ms |     44.23ms |     no change | 
| QQuery 4     |  36.61ms |     32.20ms | +1.14x faster | 
| QQuery 5     |  69.34ms |     62.23ms | +1.11x faster | 
| QQuery 6     |  22.63ms |     20.72ms | +1.09x faster | 
| QQuery 7     |  84.86ms |     80.36ms | +1.06x faster | 
| QQuery 8     |  57.36ms |     56.96ms |     no change | 
| QQuery 9     |  81.47ms |     79.90ms |     no change | 
| QQuery 10    |  68.18ms |     65.22ms |     no change | 
| QQuery 11    |  19.23ms |     18.54ms |     no change | 
| QQuery 12    |  42.77ms |     43.42ms |     no change | 
| QQuery 13    |  46.25ms |     49.34ms |  1.07x slower | 
| QQuery 14    |  33.68ms |     31.13ms | +1.08x faster | 
| QQuery 15    |  51.21ms |     46.60ms | +1.10x faster | 
| QQuery 16    |  17.94ms |     17.65ms |     no change | 
| QQuery 17    |  87.29ms |     81.23ms | +1.07x faster | 
| QQuery 18    | 122.48ms |    113.89ms | +1.08x faster | 
| QQuery 19    |  53.20ms |     48.69ms | +1.09x faster | 
| QQuery 20    |  43.22ms |     40.43ms | +1.07x faster | 
| QQuery 21    |  99.25ms |     92.03ms | +1.08x faster | 
| QQuery 22    |  18.60ms |     18.09ms |     no change | 
+--------------+----------+-------------+---------------+ 

@zebsme
Copy link
Contributor

zebsme commented Mar 27, 2025

Since it's a prerequisite task for #15321, I made a pr here :)

@alamb
Copy link
Contributor Author

alamb commented Mar 27, 2025

After disable the validation, tpch benchmark on my m1pro :

Have you configured the TPCH benchmark to spill (as in limit the memory used)? If not I wouldn't expect any performance improvement 🤔

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants