
Reduce memory usage in parquet format code path #737


Merged

Conversation


@PatrickJin-db PatrickJin-db commented May 24, 2025

This PR contains two optimizations:

  1. When converting from arrow to pandas, read batches of rows at a time instead of entire parquet files. This change should eliminate pyarrow overhead almost entirely, which may amount to a 40-50% reduction in memory usage.
  2. Set split_blocks=False. I believe split_blocks=True was initially introduced in case it might help with memory usage during the arrow to pandas conversion. However, it should be set to False for three reasons:
  • In my testing, it doesn't affect the memory usage much (if at all).
  • We are now converting small batches at a time, so the memory usage of the conversion is negligible (except perhaps when the table contains an extremely large number of columns).
  • As per the documentation, split_blocks=True may cause pandas consolidations when other operations are done on the dataframe later on. In some of my test tables, this has indeed introduced unnecessary consolidations and actually increased memory usage.

Because change 1 may hurt performance, it is gated behind the parameter convert_in_batches, which defaults to False.

Also updated the docstrings of load_as_pandas and load_table_changes_as_pandas.

Tested with unit tests.

@PatrickJin-db PatrickJin-db force-pushed the PatrickJin-db/parquet-file-batches branch 3 times, most recently from cb4784f to 772fa0f Compare May 28, 2025 05:05
@PatrickJin-db PatrickJin-db requested a review from linzhou-db May 28, 2025 05:05
@PatrickJin-db PatrickJin-db force-pushed the PatrickJin-db/parquet-file-batches branch from 772fa0f to 9673a7b Compare May 28, 2025 05:16
@linzhou-db (Collaborator) commented

The pre-merge failures don't seem related to the Python changes.

@littlegrasscao could you check whether it's related?

 - spark read limit *** FAILED ***
[info]   java.lang.NullPointerException: Cannot invoke "java.io.File.getCanonicalPath()" because the return value of "io.delta.sharing.spark.DeltaSharingSuite.testProfileFile()" is null
[info]   at io.delta.sharing.spark.DeltaSharingSuite.$anonfun$new$94(DeltaSharingSuite.scala:577)
[info]   at org.scalatest.enablers.Timed$$anon$1.timeoutAfter(Timed.scala:127)
[info]   at org.scalatest.concurrent.TimeLimits.failAfterImpl(TimeLimits.scala:239)
[info]   at org.scalatest.concurrent.TimeLimits.failAfter(TimeLimits.scala:230)
[info]   at org.scalatest.concurrent.TimeLimits.failAfter$(TimeLimits.scala:229)
[info]   at org.apache.spark.SparkFunSuite.failAfter(SparkFunSuite.scala:69)
[info]   at org.apache.spark.SparkFunSuite.$anonfun$test$2(SparkFunSuite.scala:155)
[info]   at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
[info]   at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
[info]   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)

@PatrickJin-db PatrickJin-db force-pushed the PatrickJin-db/parquet-file-batches branch from 9673a7b to 9f342d0 Compare May 28, 2025 23:09
@PatrickJin-db PatrickJin-db merged commit 71c9920 into delta-io:main May 28, 2025
5 of 6 checks passed
@PatrickJin-db added a commit to PatrickJin-db/delta-sharing that referenced this pull request on May 29, 2025
@PatrickJin-db added a commit that referenced this pull request on May 29, 2025
* fix python lint and reformat scripts (#668)

* Reduce memory usage in parquet format code path (#737)

* refactor: Resolve lint errors in python release script. (#660)

* Reduce memory usage in delta format code paths (#723)

* Update Python connector version to 1.3.3

---------

Co-authored-by: Kyle Chui <[email protected]>