Support writing to multiple files in a directory with write/sink_parquet
#17163
Labels: A-io-parquet (Area: reading/writing Parquet files), A-io-partitioning (Area: reading/writing (Hive) partitioned files), accepted (Ready for implementation), enhancement (New feature or an improvement of an existing feature)
Description
I'm currently using polars to perform ETL where the final destination is a data lake, and there's an incompatibility when working with LazyFrames that causes significant performance issues for consumers of that table (who aren't using polars!) that I hope could be fixed. The problem is that `LazyFrame.sink_parquet()`, as far as I can tell, can only write to a single file; in my current use case this single file can be many tens of GBs in size. The data lake, however, works best when large tables are stored as many smaller files, certainly less than 1 GB each. Currently, in order to support this, it seems I need to do something like the following:
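A minimal sketch of that kind of workaround; the input path, output directory, and slice size here are placeholders chosen for illustration:

```python
from pathlib import Path

import polars as pl

out_dir = Path("output")
out_dir.mkdir(parents=True, exist_ok=True)

lf = pl.scan_parquet("big_input.parquet")

# Collect the whole LazyFrame into memory (the step I'd like to avoid),
# then write it back out in manually sized row slices.
df = lf.collect()

rows_per_file = 5_000_000  # manual guess; no direct relationship to on-disk file size
for i, offset in enumerate(range(0, df.height, rows_per_file)):
    df.slice(offset, rows_per_file).write_parquet(out_dir / f"part-{i:05d}.parquet")
```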
This requires manual decisions about the slice size, as well as collecting the LazyFrame, neither of which is ideal. Given that polars knows the dtypes of the columns and their sizes, the feature I want is to roll the batching logic into the `sink_parquet` and `write_parquet` interfaces. A first guess at the high-level interface is to allow the user to specify a directory as the destination path, a desired file size (with the expectation that it may not be possible to enforce this strictly), and a naming scheme for the files in the directory. When `.sink_parquet()` is called through this interface, the LazyFrame would be dumped to multiple files of approximately the desired size, with the number of rows in each file dynamically determined from the columns and their dtypes.

Example:
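A rough sketch of what such a call might look like; the `target_file_size` and `file_name_scheme` parameter names are made up for illustration and are not an existing polars API:

```python
import polars as pl

lf = pl.scan_parquet("big_input.parquet")

# Hypothetical interface: a directory destination plus a best-effort size target
# and a naming pattern for the generated files.
lf.sink_parquet(
    "output/",                                # a directory instead of a single file
    target_file_size="512MB",                 # approximate size per file, best effort
    file_name_scheme="part-{:05d}.parquet",   # pattern used to name each file
)
```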
Then the resulting directory has contents like:
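(Illustrative only, assuming the hypothetical `part-{:05d}.parquet` naming scheme above.)

```text
output/
├── part-00000.parquet
├── part-00001.parquet
├── part-00002.parquet
└── ...
```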