
feat: make output file name of write task consistent with java api #1720


Open
sharkdtu wants to merge 2 commits into main
Conversation

sharkdtu

@sharkdtu sharkdtu commented Feb 25, 2025

Resolves: #1719

The output file name produced by the Java API is "{partitionId}-{taskId}-{operationId}-{counterId}.{extension}".

java api ref: https://github.com/apache/iceberg/blob/a582968975dd30ff4917fbbe999f1be903efac02/core/src/main/java/org/apache/iceberg/io/OutputFileFactory.java#L92-L101

However, the output file name of pyiceberg is "00000-{task_id}-{write_uuid}.{extension}", where task_id is effectively assigned as a counter ID for a target file. That is fine for non-distributed writing, but for distributed writing it's better to be consistent with the Java API.
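For comparison, a rough sketch of the two conventions in Python (the helper names and zero-padding below are illustrative assumptions, not the actual pyiceberg or Java implementation):

```python
import uuid

# Current pyiceberg-style name: "00000-{task_id}-{write_uuid}.{extension}",
# where task_id effectively plays the role of a per-file counter.
def pyiceberg_style_name(task_id: int, write_uuid: uuid.UUID, extension: str) -> str:
    return f"00000-{task_id}-{write_uuid}.{extension}"

# Java-API-style name: "{partitionId}-{taskId}-{operationId}-{counterId}.{extension}",
# where operationId is shared by all files of one write operation and counterId
# distinguishes multiple files written by the same task.
def java_style_name(partition_id: int, task_id: int, operation_id: uuid.UUID,
                    counter_id: int, extension: str) -> str:
    return f"{partition_id:05d}-{task_id}-{operation_id}-{counter_id}.{extension}"
```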

Background
I am implementing distributed writing to Iceberg tables based on a Ray Datasink, where each Ray task is responsible for writing data files (by calling _dataframe_to_data_files), and the driver collects the data file information and commits the transaction. To make the files easier to attribute, it would be ideal to have a naming convention for the files written by each Ray task under the same transaction (a sketch follows the list below):

  • All file names written by Ray tasks share a common UUID.
  • Each file name written by a Ray task includes the Ray task ID.
  • When a single Ray task writes multiple files, the file names should include a counter ID.
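A minimal sketch of the naming this implies on the Ray side (illustrative only; the helper below is not part of pyiceberg or Ray):

```python
import uuid

# Illustrative: the driver generates one UUID for the whole write and passes it
# to every Ray task, so all files of one transaction can be grouped together.
write_uuid = uuid.uuid4()

def file_names_for_task(ray_task_id: int, num_files: int, extension: str = "parquet") -> list[str]:
    """Names of the files written by one Ray task: shared UUID, the task's own
    ID, and a counter for each file the task produces."""
    return [
        f"00000-{ray_task_id}-{write_uuid}-{counter_id}.{extension}"
        for counter_id in range(num_files)
    ]

# e.g. task 7 writing 3 files:
#   00000-7-<write_uuid>-0.parquet
#   00000-7-<write_uuid>-1.parquet
#   00000-7-<write_uuid>-2.parquet
```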

@sharkdtu
Author

@Fokko Could you please take a look at this PR? Thanks!

@Fokko
Contributor

Fokko commented Feb 27, 2025

Hey @sharkdtu Thanks for raising this. What would be the benefit of adding this counter to the output? I think it is unique without the counter.

@sharkdtu
Author

sharkdtu commented Feb 28, 2025

> Hey @sharkdtu Thanks for raising this. What would be the benefit of adding this counter to the output? I think it is unique without the counter.

@Fokko Sorry for not providing detailed background information. I have updated the PR description, please take a look.
The 'counter id' you mentioned is used to distinguish the scenario where a single task writes multiple files during a distributed write, just like Spark does.

@Fokko
Contributor

Fokko commented Mar 4, 2025

@sharkdtu Thanks for the added context. Still, I don't think this is the right place to add this.

Would each of the Ray workers call _dataframe_to_data_files? In the worst case, this might lead to partitions * workers number of data files. Instead, the idea behind the notion of Tasks is that they can be fed into a distributed system. The current _dataframe_to_data_files does both the generation of Tasks and writes the Parquet files. How about splitting this into _dataframe_to_write_tasks and _write_tasks_to_parquet, where Ray would implement a distributed variant of the latter. Thoughts?
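A rough sketch of what that split could look like (the two helper names come from the comment above, but the signatures, bodies, and Ray-side flow are assumptions, not existing pyiceberg APIs):

```python
# Illustrative sketch of the proposed split; the helper names come from the
# suggestion above, but the signatures and bodies are assumptions.
from typing import Iterable, Iterator

def _dataframe_to_write_tasks(df, table_metadata, write_uuid) -> Iterator["WriteTask"]:
    """Planning only: bin the dataframe into per-partition chunks and yield
    WriteTasks; no files are written here."""
    ...

def _write_tasks_to_parquet(tasks: Iterable["WriteTask"], io, table_metadata) -> Iterator["DataFile"]:
    """Execution: write each task out as a Parquet file and yield the
    resulting DataFile entries."""
    ...

# A distributed engine such as Ray would call _dataframe_to_write_tasks on the
# driver, ship the tasks to workers, run _write_tasks_to_parquet on each worker,
# and collect the DataFiles back on the driver to commit a single transaction.
```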

@@ -1874,7 +1875,7 @@ class WriteTask:
def generate_data_file_filename(self, extension: str) -> str:
Contributor

@Fokko Fokko Mar 4, 2025


How about:

Suggested change
-    def generate_data_file_filename(self, extension: str) -> str:
+    def generate_data_file_filename(self, extension: str, task_id: Optional[int] = None) -> str:

To only include the task ID for distributed engines?
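For illustration, a sketch of how such an optional parameter might be used (an assumption about the intent of the suggestion, not the code in this PR):

```python
import uuid
from typing import Optional

class WriteTask:  # sketch only; the real WriteTask in pyiceberg has more fields
    write_uuid: uuid.UUID
    task_id: int

    def generate_data_file_filename(self, extension: str, task_id: Optional[int] = None) -> str:
        # Keep the existing name for local writes and only add the caller-supplied
        # task ID for distributed engines, letting self.task_id act as the counter.
        if task_id is None:
            return f"00000-{self.task_id}-{self.write_uuid}.{extension}"
        return f"00000-{task_id}-{self.write_uuid}-{self.task_id}.{extension}"
```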

@sharkdtu
Author

sharkdtu commented Mar 5, 2025

> @sharkdtu Thanks for the added context. Still, I don't think this is the right place to add this.
>
> Would each of the Ray workers call _dataframe_to_data_files? In the worst case, this might lead to partitions * workers number of data files. Instead, the idea behind the notion of Tasks is that they can be fed into a distributed system. The current _dataframe_to_data_files does both the generation of Tasks and writes the Parquet files. How about splitting this into _dataframe_to_write_tasks and _write_tasks_to_parquet, where Ray would implement a distributed variant of the latter. Thoughts?

@Fokko Thanks for the comments. I think a WriteTask is not a task of a distributed system; it's more like a writer within a task. One task may have a single writer or multiple writers. The number of data files can be controlled by repartitioning before writing, as in Spark.
