
Reduce memory usage in parquet format code path #737


Merged

Conversation


@PatrickJin-db PatrickJin-db commented May 24, 2025

This PR contains two optimizations:

  1. When converting from arrow to pandas, read batches of rows at a time instead of entire parquet files. This change should eliminate pyarrow overhead almost entirely, which may amount to a 40-50% reduction in memory usage.
  2. Set split_blocks=False. I believe split_blocks=True was initially introduced in case it might help with memory usage during the arrow to pandas conversion. However, it should be set to False for three reasons:
  • In my testing, it doesn't affect the memory usage much (if at all).
  • We are now converting small batches at a time, so the memory usage of the conversion is negligible (except perhaps when the table contains an extremely large number of columns).
  • As per the documentation, split_blocks=True may cause pandas consolidations when other operations are done on the dataframe later on. In some of my test tables, this has indeed introduced unnecessary consolidations and actually increased memory usage.

Because change 1 may hurt performance, it is gated behind the parameter convert_in_batches, which defaults to False.

Also updated the docstrings of load_as_pandas and load_table_changes_as_pandas.

Tested with unit tests.

@PatrickJin-db PatrickJin-db force-pushed the PatrickJin-db/parquet-file-batches branch 3 times, most recently from cb4784f to 772fa0f Compare May 28, 2025 05:05
@PatrickJin-db PatrickJin-db requested a review from linzhou-db May 28, 2025 05:05
@PatrickJin-db PatrickJin-db force-pushed the PatrickJin-db/parquet-file-batches branch from 772fa0f to 9673a7b Compare May 28, 2025 05:16
@linzhou-db (Collaborator) commented

The pre-merge failures don't seem related to the Python changes.

@littlegrasscao could you check whether it's related?

 - spark read limit *** FAILED ***
[info]   java.lang.NullPointerException: Cannot invoke "java.io.File.getCanonicalPath()" because the return value of "io.delta.sharing.spark.DeltaSharingSuite.testProfileFile()" is null
[info]   at io.delta.sharing.spark.DeltaSharingSuite.$anonfun$new$94(DeltaSharingSuite.scala:577)
[info]   at org.scalatest.enablers.Timed$$anon$1.timeoutAfter(Timed.scala:127)
[info]   at org.scalatest.concurrent.TimeLimits.failAfterImpl(TimeLimits.scala:239)
[info]   at org.scalatest.concurrent.TimeLimits.failAfter(TimeLimits.scala:230)
[info]   at org.scalatest.concurrent.TimeLimits.failAfter$(TimeLimits.scala:229)
[info]   at org.apache.spark.SparkFunSuite.failAfter(SparkFunSuite.scala:69)
[info]   at org.apache.spark.SparkFunSuite.$anonfun$test$2(SparkFunSuite.scala:155)
[info]   at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
[info]   at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
[info]   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)

@PatrickJin-db PatrickJin-db force-pushed the PatrickJin-db/parquet-file-batches branch from 9673a7b to 9f342d0 Compare May 28, 2025 23:09
@PatrickJin-db PatrickJin-db merged commit 71c9920 into delta-io:main May 28, 2025
5 of 6 checks passed
@PatrickJin-db added a commit to PatrickJin-db/delta-sharing that referenced this pull request on May 29, 2025
@PatrickJin-db added a commit that referenced this pull request on May 29, 2025
* fix python lint and reformat scripts (#668)

* Reduce memory usage in parquet format code path (#737)

* refactor: Resolve lint errors in python release script. (#660)

* Reduce memory usage in delta format code paths (#723)

* Update Python connector version to 1.3.3

---------

Co-authored-by: Kyle Chui <[email protected]>