Add spilling in SortMergeJoin #9359
Comments
After reading some code and the already opened issues on the same topic, it's probably possible to summarize what's needed, at least for a POC:
I'll try to start by creating a test query that fails on memory.
Related to #9846
I think I managed to find the test query.
This suggests that ExternalSorter is used but spilling is not. By playing with the sort params, I managed to make the SMJ merge phase complain about memory exhaustion.
My next steps
I can reproduce this pretty reliably in lance if it's any help. I run SortExec on a column of 100 million strings (each ~30 bytes long) with a 100 MiB fair pool, and it triggers in about 5 minutes. Let me know if there is anything I can do to assist you.
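To make the budget in that reproduction concrete, here is a minimal sketch of how an evenly divided pool limits each spillable consumer. It assumes the pool hands every spillable consumer an equal share of whatever the unspillable consumers have not taken; that is my understanding of a fair pool, not something confirmed in this thread. `FairPoolModel` and its fields are hypothetical names, not DataFusion's API, and the partition count in the usage note is made up.

```rust
/// Hypothetical model of an evenly divided memory pool (not DataFusion's API).
pub struct FairPoolModel {
    /// Total bytes the pool may hand out.
    pub pool_size: usize,
    /// Bytes held by consumers that cannot spill.
    pub unspillable: usize,
    /// Number of registered consumers that can spill.
    pub num_spillable: usize,
}

impl FairPoolModel {
    /// Assumed rule: spillable consumers share what is left evenly.
    pub fn per_consumer_budget(&self) -> usize {
        if self.num_spillable == 0 {
            return 0;
        }
        self.pool_size.saturating_sub(self.unspillable) / self.num_spillable
    }

    /// Would a consumer currently holding `current` bytes be allowed
    /// to reserve `additional` more?
    pub fn can_grow(&self, current: usize, additional: usize) -> bool {
        current + additional <= self.per_consumer_budget()
    }
}
```

With a 100 MiB pool and, say, 4 spillable partitions, each one gets 25 MiB under this model. A consumer that cannot spill and asks for more than its share has no fallback, which is the kind of failure described in this issue.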
FWIW, I believe I looked at this problem some time ago. If I remember correctly, the issue was that one of the memory consumers, presumably My suspect is: or or somewhere around it. line
may back my claim. I can't find more information; it looks like that branch is MIA.
Thanks. I'm planning to check next week whether spilling works for the ExternalSorter used for SMJ, and to see how spilling can be made to work for the merge phase as well. The memory pool is injected, and SMJ gets new allocations through
Finally starting on it. The test doesn't work anymore.
Yes, I tried the above test a few days ago, but it doesn't work now.
I made it work that way (notice
Weird: the same thing run as a test doesn't respect the memory pool.
The first use case is to try spilling for the buffered data, since the buffered data comes in at full size and eats the memory.
UPD: Buffered data comes in by partitions, every partition gets processed sequentially. The flow is approx:
It looks like spilling is needed in one place only.
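The idea of spilling buffered data when the reservation cannot grow can be sketched in plain Rust, with no DataFusion types. This is a rough illustration under stated assumptions, not the eventual implementation: `Reservation`, `BufferedBatch`, and `BufferedSide` are hypothetical names, batches are raw bytes rather than Arrow record batches, and spill files are plain files in a caller-supplied directory.

```rust
use std::fs;
use std::io;
use std::path::PathBuf;

/// Hypothetical stand-in for a memory reservation with a hard limit.
pub struct Reservation {
    limit: usize,
    used: usize,
}

impl Reservation {
    /// Grow the reservation, or fail and leave it unchanged.
    pub fn try_grow(&mut self, bytes: usize) -> Result<(), String> {
        if self.used + bytes > self.limit {
            return Err(format!("cannot reserve {} bytes ({} of {} used)", bytes, self.used, self.limit));
        }
        self.used += bytes;
        Ok(())
    }
}

/// A buffered batch is either kept in memory or written to disk.
pub enum BufferedBatch {
    InMemory(Vec<u8>),
    Spilled(PathBuf),
}

/// Buffered side of the join: the single place where spilling happens.
pub struct BufferedSide {
    reservation: Reservation,
    batches: Vec<BufferedBatch>,
    spill_dir: PathBuf,
    spill_count: usize,
}

impl BufferedSide {
    pub fn new(limit: usize, spill_dir: PathBuf) -> Self {
        Self { reservation: Reservation { limit, used: 0 }, batches: Vec::new(), spill_dir, spill_count: 0 }
    }

    /// Keep the batch in memory if the reservation can grow; otherwise spill it.
    pub fn push(&mut self, batch: Vec<u8>) -> io::Result<()> {
        if self.reservation.try_grow(batch.len()).is_ok() {
            self.batches.push(BufferedBatch::InMemory(batch));
            return Ok(());
        }
        let name = format!("smj-spill-{}-{}.bin", std::process::id(), self.spill_count);
        let path = self.spill_dir.join(name);
        fs::write(&path, &batch)?;
        self.spill_count += 1;
        self.batches.push(BufferedBatch::Spilled(path));
        Ok(())
    }

    /// Return all batches in arrival order, reading spilled ones back from disk.
    pub fn read_all(&self) -> io::Result<Vec<Vec<u8>>> {
        self.batches
            .iter()
            .map(|b| match b {
                BufferedBatch::InMemory(data) => Ok(data.clone()),
                BufferedBatch::Spilled(path) => fs::read(path),
            })
            .collect()
    }

    pub fn spilled(&self) -> usize {
        self.spill_count
    }

    pub fn memory_used(&self) -> usize {
        self.reservation.used
    }
}

impl Drop for BufferedSide {
    /// Best-effort cleanup of spill files.
    fn drop(&mut self) {
        for b in &self.batches {
            if let BufferedBatch::Spilled(path) = b {
                let _ = fs::remove_file(path);
            }
        }
    }
}
```

The join logic only ever calls `push` and `read_all`, so whether a batch lives in memory or on disk stays hidden behind one type. A real version would read spilled batches lazily instead of loading them all at once.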
Is your feature request related to a problem or challenge?
SortMergeJoin can run out of memory when it needs extra memory to hold polled buffered batches. We can consider adding spilling support there to make the operator resilient to memory pressure.
Describe the solution you'd like
Add spilling support in SortMergeJoin.
Describe alternatives you've considered
No response
Additional context
No response