Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

rust engine consume a lot of memory compared to pyarrow #2968

Open
djouallah opened this issue Oct 31, 2024 · 9 comments
Open

rust engine consume a lot of memory compared to pyarrow #2968

djouallah opened this issue Oct 31, 2024 · 9 comments
Assignees
Labels
binding/rust Issues for the Rust crate bug Something isn't working
Milestone

Comments

@djouallah
Copy link

Environment

Delta-rs version:
0.21.0

Binding:

Environment:

  • OS:
    Linux

Bug

switching from pyarrow engine to rust increase memory usage by nearly 3X, the job used to works fine, but now, getting OOM errors.

I added a reproducible example with only 60 input files to demo the issue

https://colab.research.google.com/drive/1fahlV0FgKSAS8sQvRMu47s3bDP1ekLbb#scrollTo=333a177b-f075-412e-8ca1-32d44f8c07eb

@djouallah djouallah added the bug Something isn't working label Oct 31, 2024
@rtyler rtyler added the binding/python Issues for the Python package label Oct 31, 2024
@rtyler
Copy link
Member

rtyler commented Oct 31, 2024

@djouallah 👋 in the attached notebook, which write_deltalake call is resulting in memory pressure? I'm less familiar with duckdb, but I assume the df objects that it is producing are pyarrow.DataSet? or are they another type?

@djouallah
Copy link
Author

it is an Arrow RecordBatchReader I think

@ion-elgreco
Copy link
Collaborator

ion-elgreco commented Nov 2, 2024

Rust engine materalizes everything to memory prior to starting the whole process. Pyarrow probably writes batch by batch when you pass a reader

@djouallah
Copy link
Author

just for my own understanding, is this something that can be fixed by datafusion ?

@ion-elgreco
Copy link
Collaborator

@djouallah there is a PR to address this but the contributor didn't have time to finish it yet: #2289

@rtyler
Copy link
Member

rtyler commented Nov 23, 2024

I have been analyzing the performance around this issue, even with today's latest HEAD of main

Rust engine

grafik

Pyarrow engine

grafik

The root cause of this behavior is, in my opinion, the collection of Vec<RecordBatch> by the Rust writer. I have been experimenting with some alternative approaches that allow the writer to use either an iterator or a stream of RecordBatch to perform incremental read/write behavior rather than full-read, then full-write behaviors as it is doing today.

There is some trickiness around Send +Sync here because of how the WriteBuilder works with its IntoFuture implementation.

Nothing required from anybody, I just wanted to share the analysis and progress here. 🤔 🏃

@rtyler rtyler added binding/rust Issues for the Rust crate and removed binding/python Issues for the Python package labels Nov 23, 2024
@djouallah
Copy link
Author

using 0.22, I just use the pyarrow engine, yes, processing 2300 csv files and it works great !!!
image

@rtyler rtyler modified the milestones: Rust v1.0.0, v0.23 Dec 1, 2024
@ldacey
Copy link
Contributor

ldacey commented Dec 3, 2024

Yeah, I saw the "pyarrow" deprecation message and swapped to the rust engine for one of my tables. It is only about 2.6 million rows per partition but it is very wide and all text data (200+ columns). The pods were failing due to OOM errors after doing this so I need to use the pyarrow engine and overwrite the partition still.

@rtyler
Copy link
Member

rtyler commented Jan 5, 2025

I'm returning to this and I have some exciting developments! Datafusion 44 introduced LazyMemoryExec which is the missing piece to make this a lot simpler.

My prior attempts at solving this problem were effectively in a similar solution domain as LazyMemoryExec, except I was blurring the lines between Python and Rust more than was probably safe or sane 😆

Conceptually I am building from @aersam's great work in #2289 and using channels to safely bring RecordBatches from the Python layer where there's frankly a lot going on with the Tokio and Python concurrency approaches. With a channel effectively just sending over RecordBatch after RecordBatch.

On the Rust side, the LazyMemoryExec approach makes it more viable to just have a generator receiving these RecordBatches and streaming them into Datafusion.

I hope to have things cleaned up and ready to rock this weekend

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
binding/rust Issues for the Rust crate bug Something isn't working
Projects
None yet
Development

No branches or pull requests

4 participants