feat: make output file name of write task consistent with java api #1720
Conversation
@Fokko Could you please take a look at this PR? Thanks!

Hey @sharkdtu, thanks for raising this. What would be the benefit of adding this counter to the output? I think it is unique without the counter.

@Fokko Sorry for not providing detailed background information. I have updated the PR description; please take a look.

@sharkdtu Thanks for the added context. Still, I don't think this is the right place to add this. Would each of the Ray workers call …
@@ -1874,7 +1875,7 @@ class WriteTask:
    def generate_data_file_filename(self, extension: str) -> str:
How about:

- def generate_data_file_filename(self, extension: str) -> str:
+ def generate_data_file_filename(self, extension: str, task_id: Optional[int] = None) -> str:

To only include the task ID for distributed engines?
@Fokko Thanks for the comments. I think a …
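The optional-parameter suggestion above could be sketched as follows. This is a minimal illustration, not the actual pyiceberg implementation: the `WriteTask` attribute names (`write_uuid`, `task_id`) are assumptions inferred from the filename template quoted in the discussion, and the distributed-name format is hypothetical.

```python
import uuid
from typing import Optional


class WriteTask:
    """Minimal sketch of a write task; the real pyiceberg class has more fields."""

    def __init__(self, write_uuid: uuid.UUID, task_id: int):
        self.write_uuid = write_uuid  # shared per transaction
        self.task_id = task_id        # per-file counter in the local case

    def generate_data_file_filename(self, extension: str, task_id: Optional[int] = None) -> str:
        # Local writes keep the existing name; a distributed engine passes
        # an explicit task_id so files from different workers cannot collide.
        if task_id is None:
            return f"00000-{self.task_id}-{self.write_uuid}.{extension}"
        # Hypothetical distributed variant: engine task ID plus local counter.
        return f"00000-{task_id}-{self.write_uuid}-{self.task_id}.{extension}"
```

With this shape, single-process callers are unaffected, while a Ray worker could pass its own task ID to namespace its output files.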
Resolves: #1719
The output file name of the Java API is "{partitionId}-{taskId}-{operationId}-{counterId}.{extension}"; however, the output file name of pyiceberg is "00000-{task_id}-{write_uuid}.{extension}", where `task_id` is assigned as a counter id for a target file. This is fine for non-distributed writing, but for distributed writing it is better to be consistent with the Java API.

Background: I am implementing distributed writing to Iceberg tables based on Ray Datasink, where each Ray task is responsible for writing data files (by calling `_dataframe_to_data_files`), and the driver collects the data file information and commits the transaction. To better differentiate the files, it would be ideal to have a naming convention for the files written by each Ray task under the same transaction:
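The Java-style convention quoted above can be sketched as a small generator. The function name and the zero-padding width are illustrative assumptions; only the "{partitionId}-{taskId}-{operationId}-{counterId}.{extension}" template comes from the description.

```python
import itertools
import uuid


def data_file_filenames(partition_id: int, task_id: int,
                        operation_id: uuid.UUID, extension: str):
    """Yield successive file names following the Java API convention
    {partitionId}-{taskId}-{operationId}-{counterId}.{extension}."""
    for counter_id in itertools.count():
        yield f"{partition_id:05d}-{task_id}-{operation_id}-{counter_id}.{extension}"
```

Under this scheme each Ray task keeps its own `task_id` while sharing the transaction-level `operation_id`, so names are unique both across tasks and across the files one task writes.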