Support writing to multiple files in a directory with write/sink_parquet #17163

Open
rgasper opened this issue Jun 24, 2024 · 1 comment
Labels
A-io-parquet (Area: reading/writing Parquet files) · A-io-partitioning (Area: reading/writing (Hive) partitioned files) · accepted (Ready for implementation) · enhancement (New feature or an improvement of an existing feature)

Comments


rgasper commented Jun 24, 2024

Description

I'm currently using polars to perform ETL where the final destination is a data lake, and there's a limitation when working with LazyFrames that's causing significant performance issues for consumers of that table (who aren't using polars!) which I hope could be fixed. The problem is that LazyFrame.sink_parquet(), as far as I can tell, can only write to a single file; in my current use case this single file can be many tens of GBs in size. The data lake, however, performs best when large tables are stored as many smaller files, certainly less than 1 GB each. Currently, to support this, it seems I need to do something like the following:

# imports
import polars as pl
from uuid import uuid4

# constants
table_location = 's3://somebucket/someprefix/'
slice_size = int(1e6)  # I have to tune this manually to try and achieve a desired file size

# do stuff
lz = pl.scan_parquet('source.parquet')
...
df = lz.collect(streaming=True)
for frame_slice in df.iter_slices(slice_size):
    # write each slice to its own uniquely named file
    frame_slice.write_parquet(table_location + f"{uuid4()}.parquet")

This requires both a manual decision about slice size and collecting the LazyFrame, neither of which is ideal. Given that polars knows the dtypes of the columns and their sizes, the feature I want is to roll the batching logic into the sink_parquet and write_parquet interfaces. My first guess at the high-level interface is to let the user specify a directory as the destination path, a desired file size (accepting that it may not be possible to enforce this strictly), and a naming scheme for the files in the directory. When .sink_parquet() is called this way, the LazyFrame would be written out to multiple files of approximately the desired size, with the number of rows per file determined dynamically from the columns and their dtypes (a rough sketch of that estimation follows, for context).
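For context only, a minimal sketch of how that per-file row count might be estimated today, assuming the frame is already collected and using DataFrame.estimated_size() as a rough proxy for on-disk size (it ignores Parquet compression, so actual files would typically come out smaller):

import polars as pl
from uuid import uuid4

table_location = 's3://somebucket/someprefix/'
target_bytes = 128 * 1024 * 1024  # aim for roughly 128 MB per file (pre-compression)

df = pl.scan_parquet('source.parquet').collect(streaming=True)

# estimate rows per file from the frame's in-memory footprint
bytes_per_row = df.estimated_size() / max(df.height, 1)
rows_per_file = max(1, int(target_bytes / bytes_per_row))

for frame_slice in df.iter_slices(rows_per_file):
    frame_slice.write_parquet(table_location + f"{uuid4()}.parquet")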

Example:

# imports
import polars as pl

# constants
table_location = 's3://somebucket/someprefix/'
file_size = 1024*1024*128

# do stuff
lz = pl.scan_parquet('source.parquet')
...
lz.sink_parquet(table_location, file_size=file_size, file_naming_strategy="uuid")

Then the resulting directory has contents like:

abc-123-def-456-s8.parquet    121 MB
rs09-1sr8-129-rsiecc.parquet   138 MB
rise12-r0s9c-icy-s8s.parquet    12 MB
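
In the meantime, one possible workaround, sketched here on the assumption that collecting the frame first is acceptable, is handing the collected data to pyarrow.dataset.write_dataset, which can already split output into multiple files (by row count rather than by byte size):

import polars as pl
import pyarrow.dataset as ds

df = pl.scan_parquet('source.parquet').collect(streaming=True)

# write_dataset splits the table across multiple parquet files;
# the limit is row-based, so target file size still has to be tuned indirectly
ds.write_dataset(
    df.to_arrow(),
    'someprefix/',  # a local or filesystem-backed directory
    format='parquet',
    basename_template='part-{i}.parquet',
    max_rows_per_file=1_000_000,
    max_rows_per_group=1_000_000,  # must not exceed max_rows_per_file
    existing_data_behavior='overwrite_or_ignore',
)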
@rgasper rgasper added the enhancement New feature or an improvement of an existing feature label Jun 24, 2024
@stinodego stinodego added the A-io-partitioning Area: reading/writing (Hive) partitioned files label Jun 24, 2024
stinodego (Contributor) commented

Thanks for the request. This is related to #11500, but not quite a duplicate.

We should definitely support this.

@stinodego stinodego added the accepted Ready for implementation label Jun 24, 2024
@github-project-automation github-project-automation bot moved this to Ready in Backlog Jun 24, 2024
@stinodego stinodego added the A-io-parquet Area: reading/writing Parquet files label Jun 24, 2024
@stinodego stinodego changed the title LazyFrame.sink_parquet & Dataframe.write_parquet write to multiple files in a directory Support writing to multiple files in a directory with write/sink_parquet Jun 24, 2024