Support writing to multiple files in a directory with write/sink_parquet #17163

Open
rgasper opened this issue Jun 24, 2024 · 1 comment
Labels
A-io-parquet (Area: reading/writing Parquet files) · A-io-partitioning (Area: reading/writing (Hive) partitioned files) · accepted (Ready for implementation) · enhancement (New feature or an improvement of an existing feature)

Comments


rgasper commented Jun 24, 2024

Description

I'm currently using polars to perform ETL where the final destination is a data lake, and there's a limitation when working with LazyFrames that's causing significant performance issues for consumers of that table (who aren't using polars!) which I hope could be fixed. The problem is that LazyFrame.sink_parquet(), as far as I can tell, can only write to a single file; in my current use case this single file can be many tens of GBs in size. The data lake, however, performs best when large tables are stored as many smaller files, certainly less than 1 GB each. Currently, to support this, it seems I need to do something like the following:

# imports
import polars as pl
from uuid import uuid4

# constants
table_location = 's3://somebucket/someprefix/'
slice_size = int(1e6)  # I have to tune this manually to try and achieve a desired file size

# do stuff
lz = pl.scan_parquet('source.parquet')
...
df = lz.collect(streaming=True)
for frame_slice in df.iter_slices(slice_size):
    # write each slice to its own uniquely named file
    frame_slice.write_parquet(table_location + f"{uuid4()}.parquet")

This requires both a manual decision about slice size and collecting the LazyFrame, neither of which is ideal. Given that polars knows the dtypes of the columns and their sizes, the feature I want is to roll the batching logic into the sink_parquet and write_parquet interfaces. My first guess at the high-level interface is to let the user specify a directory as the destination path, a desired file size (accepting that it may not be possible to enforce this strictly), and a naming scheme for the files in the directory. When .sink_parquet() is called this way, the LazyFrame would be written out to multiple files of approximately the desired size, with the number of rows per file determined dynamically from the columns and their dtypes (a rough sketch of that estimation follows, for context).
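For context only, a minimal sketch of how that per-file row count might be estimated today, assuming the frame is already collected and using DataFrame.estimated_size() as a rough proxy for on-disk size (it ignores Parquet compression, so actual files would typically come out smaller):

import polars as pl
from uuid import uuid4

table_location = 's3://somebucket/someprefix/'
target_bytes = 128 * 1024 * 1024  # aim for roughly 128 MB per file (pre-compression)

df = pl.scan_parquet('source.parquet').collect(streaming=True)

# estimate rows per file from the frame's in-memory footprint
bytes_per_row = df.estimated_size() / max(df.height, 1)
rows_per_file = max(1, int(target_bytes / bytes_per_row))

for frame_slice in df.iter_slices(rows_per_file):
    frame_slice.write_parquet(table_location + f"{uuid4()}.parquet")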

Example:

# imports
import polars as pl

# constants
table_location = 's3://somebucket/someprefix/'
file_size = 1024*1024*128

# do stuff
lz = pl.scan_parquet('source.parquet')
...
lz.sink_parquet(table_location, file_size=file_size, file_naming_strategy="uuid")

Then the resulting directory has contents like:

abc-123-def-456-s8.parquet    121 MB
rs09-1sr8-129-rsiecc.parquet   138 MB
rise12-r0s9c-icy-s8s.parquet    12 MB
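
In the meantime, one possible workaround, sketched here on the assumption that collecting the frame first is acceptable, is handing the collected data to pyarrow.dataset.write_dataset, which can already split output into multiple files (by row count rather than by byte size):

import polars as pl
import pyarrow.dataset as ds

df = pl.scan_parquet('source.parquet').collect(streaming=True)

# write_dataset splits the table across multiple parquet files;
# the limit is row-based, so target file size still has to be tuned indirectly
ds.write_dataset(
    df.to_arrow(),
    'someprefix/',  # a local or filesystem-backed directory
    format='parquet',
    basename_template='part-{i}.parquet',
    max_rows_per_file=1_000_000,
    max_rows_per_group=1_000_000,  # must not exceed max_rows_per_file
    existing_data_behavior='overwrite_or_ignore',
)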
@rgasper rgasper added the enhancement New feature or an improvement of an existing feature label Jun 24, 2024
@stinodego stinodego added the A-io-partitioning Area: reading/writing (Hive) partitioned files label Jun 24, 2024
stinodego (Contributor) commented

Thanks for the request. This is related to #11500, but not quite a duplicate.

We should definitely support this.

@stinodego stinodego added the accepted Ready for implementation label Jun 24, 2024
@github-project-automation github-project-automation bot moved this to Ready in Backlog Jun 24, 2024
@stinodego stinodego added the A-io-parquet Area: reading/writing Parquet files label Jun 24, 2024
@stinodego stinodego changed the title LazyFrame.sink_parquet & Dataframe.write_parquet write to multiple files in a directory Support writing to multiple files in a directory with write/sink_parquet Jun 24, 2024