Reduce memory use when writing tables with very short columns to ORC #18136

Merged (12 commits) on Mar 3, 2025

@vuule vuule commented Feb 28, 2025

Description

Closes #18059

To avoid computing the maximum compressed size for every block in the file, the ORC writer uses a single estimate based on the (uncompressed) block size limit, which defaults to 256KB. However, when a file consists of many small blocks, this estimate is far larger than any block needs, which leads to high memory use for wide, short tables.

This PR takes the actual block sizes into account and bases the estimate on the largest block actually present in the file, not on the largest possible block. This reduces memory usage by orders of magnitude in some tests.
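
As a rough illustration of why this matters, here is a minimal sketch of the two sizing strategies. It is not the libcudf implementation: `max_compressed_size`, `scratch_bytes_from_limit`, and `scratch_bytes_from_actual` are hypothetical names, and the bound used (`32 + n + n / 6`) only mirrors the shape of a typical worst-case compression bound.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical worst-case bound for compressing `n` bytes. Illustrative only;
// the real writer queries the compression backend for this value.
constexpr std::size_t max_compressed_size(std::size_t n) { return 32 + n + n / 6; }

// Previous sizing (as described above): every block is assumed to be as large
// as the block size limit, regardless of how much data it actually holds.
constexpr std::size_t scratch_bytes_from_limit(std::size_t num_blocks,
                                               std::size_t block_size_limit)
{
  return num_blocks * max_compressed_size(block_size_limit);
}

// Sizing described in this PR: use the largest block actually present.
std::size_t scratch_bytes_from_actual(std::vector<std::size_t> const& block_sizes,
                                      std::size_t block_size_limit)
{
  if (block_sizes.empty()) { return 0; }
  auto const largest = *std::max_element(block_sizes.begin(), block_sizes.end());
  // No block may exceed the configured limit.
  assert(largest <= block_size_limit);
  return block_sizes.size() * max_compressed_size(largest);
}
```

Under this illustrative bound, 10,000 blocks of 64 bytes each with a 256KB limit would need about 3 GB with the first strategy and about 1 MB with the second, which is consistent with the orders-of-magnitude difference mentioned above.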

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

copy-pr-bot bot commented Feb 28, 2025

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@vuule vuule added the bug (Something isn't working) and non-breaking (Non-breaking change) labels Feb 28, 2025
@github-actions github-actions bot added the libcudf (Affects libcudf (C++/CUDA) code) label Feb 28, 2025
@github-actions github-actions bot added the CMake (CMake build issue) label Mar 1, 2025
@vuule vuule marked this pull request as ready for review March 1, 2025 00:56
@vuule vuule requested review from a team as code owners March 1, 2025 00:56
@vuule vuule requested review from vyasr and shrshi March 1, 2025 00:56
@vuule vuule changed the title from "Fix for failed nightly tests on 11.4" to "Reduce memory use when writing tables with very short columns to ORC" Mar 1, 2025

@vyasr vyasr left a comment

The logic looks right to me, although I don't know the stripe stream bits well enough to understand why we start off with a 2D span that has one dimension always of size one. However, this does seem to reduce memory usage enough to get CI to pass on the test we've been having problems with, so I'm approving this in the hope that we can get more data from nightlies over the weekend. Thanks @vuule!

bdice commented Mar 3, 2025

I think we still need this to unblock nightly CI? I will update this with the latest branch-25.04 now that we've completed the forward merge of CuPy 13.4 changes.

@raydouglass raydouglass merged commit 1c0ea5e into rapidsai:branch-25.04 Mar 3, 2025
103 of 110 checks passed
@vyasr vyasr mentioned this pull request Mar 3, 2025
3 tasks
raydouglass pushed a commit that referenced this pull request Mar 3, 2025
This test is failing in multiple places right now, such as [this
run](https://github.com/rapidsai/cudf/actions/runs/13595690128/job/38014725800)
on #18133 and [this
run](https://github.com/rapidsai/cudf/actions/runs/13636334843/job/38118996773?pr=18136)
on #18136. Let's skip it until we
can debug why so that we unblock other CI.

---------

Co-authored-by: Peter Andreas Entschev <[email protected]>
Labels
bug: Something isn't working; CMake: CMake build issue; libcudf: Affects libcudf (C++/CUDA) code; non-breaking: Non-breaking change
Development

Successfully merging this pull request may close these issues.

[BUG] OrcWriterTest.MultipleBlocksInStripeFooter reports out of memory in nightly build
5 participants