Describe the bug
When using `s3.to_parquet` to update a parquet file that is partitioned by a time interval or a timestamp "attribute" (such as year, month, hour, etc.), the function fails, because for this mode the implementation assumes that the values of `partition_cols` are names of the parquet / table columns, and it does not find something like `hour(column)` among the dataframe columns.
I think the problem is this line, which uses the function `delete_from_iceberg_table`, which expects column names.
How to Reproduce
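No reproduction was attached, so here is a hypothetical minimal sketch. The database, table, and bucket names are placeholders, and the Iceberg write path via `wr.athena.to_iceberg` is assumed from the mention of `delete_from_iceberg_table` in the report; running it requires AWS credentials and an existing Glue database.

```python
import datetime as dt

# A partition transform that works with mode="append" or mode="overwrite",
# but fails with mode="overwrite_partitions" because "hour(ts)" is a
# partition transform expression, not a dataframe column name:
partition_cols = ["hour(ts)"]


def reproduce():
    # Not executed here -- needs AWS credentials and an existing Glue database.
    import awswrangler as wr
    import pandas as pd

    df = pd.DataFrame(
        {
            "ts": [dt.datetime(2024, 1, 1, h) for h in range(3)],
            "value": [1, 2, 3],
        }
    )
    wr.athena.to_iceberg(
        df=df,
        database="my_db",                  # placeholder
        table="my_table",                  # placeholder
        temp_path="s3://my-bucket/tmp/",   # placeholder
        partition_cols=partition_cols,
        # Fails: the implementation looks for a column literally named
        # "hour(ts)" in df when deleting the affected partitions.
        mode="overwrite_partitions",
    )
```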
Expected behavior
I expect the `partition_cols` option to accept anything that can be used to partition a parquet. In particular, it should accept anything that is accepted when the `mode` argument is `append` or `overwrite` instead of `overwrite_partitions`.
Your project
No response
Screenshots
No response
OS
Ubuntu 22.04
Python version
3.10
AWS SDK for pandas version
3.7.3
Additional context
No response
Unfortunately, because this implementation of `to_iceberg` relies on a mesh of Pandas and Athena queries, we can't currently support using a partition transform function with `mode="overwrite_partitions"`. However, we are exploring other APIs for refactoring `to_iceberg`, such as PyIceberg or other AWS Glue APIs, which would allow us to support this in the future.
Marking this issue as stale due to inactivity. This helps our maintainers find and focus on the active issues. If this issue receives no comments in the next 7 days it will automatically be closed.