define minimum row group #9756
-
I am using delta-rs (which is based on DataFusion) to write a Delta table, but by default it produces Parquet files with row groups of 1 M rows. In my case, I want the minimum row group size to be 8 M rows. Does DataFusion support that?
-
Yes! See the configuration variable datafusion.execution.parquet.max_row_group_size at https://arrow.apache.org/datafusion/user-guide/configs.html. I am not familiar enough with delta-rs to know how you can access this configuration directly, but it is likely possible.
-
Thanks, I am familiar with max_row_group_size. What I want is rather something like min_row_group_size: I want the minimum to be 8 M rows.
-
Hm... perhaps the name "max_row_group_size" is confusing. It sort of means the same thing as "minimum row group size" depending on your perspective. The parquet writer will continue to write data into a row group until it reaches "max_row_group_size" rows. Then, it will open a new row group and start writing to that. As a result, all but the very last row group will have exactly "max_row_group_size" rows.
So, if you set max_row_group_size to 8M, the row groups will have 8M rows. The very last row group could be smaller if the total number of rows is not divisible by 8M. I am not aware of a mechanism in arrow-rs or datafusion to force even the last row group to be above a minimum v…
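To make the arithmetic in the reply above concrete, here is a minimal Python sketch of the row-group sizes it describes. It only models the stated behavior (fill each row group up to the cap, then open a new one) and does not call arrow-rs, datafusion, or delta-rs. The function name `row_group_sizes` is hypothetical, and reading "8M" as 8,000,000 rows is an assumption.

```python
def row_group_sizes(total_rows: int, max_row_group_size: int) -> list[int]:
    """Model the row count of each row group for a single writer.

    Assumes the behavior described in the thread: every row group is
    filled to exactly max_row_group_size rows, and only the last one
    may be smaller.
    """
    if total_rows < 0 or max_row_group_size <= 0:
        raise ValueError("total_rows must be >= 0 and max_row_group_size > 0")
    full_groups, remainder = divmod(total_rows, max_row_group_size)
    sizes = [max_row_group_size] * full_groups
    if remainder:
        # Leftover rows form one final, smaller row group.
        sizes.append(remainder)
    return sizes


EIGHT_M = 8_000_000  # assumed meaning of "8M" in the thread

# 20M rows: two full row groups plus a smaller final one.
print(row_group_sizes(20_000_000, EIGHT_M))  # [8000000, 8000000, 4000000]

# 16M rows divide evenly, so every row group is full.
print(row_group_sizes(16_000_000, EIGHT_M))  # [8000000, 8000000]
```

Under this model, the cap acts as a minimum for every row group except possibly the last, which is the case the question is really about.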