Skip to content

Support commit retrie #964

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
3 tasks
ZENOTME opened this issue Feb 12, 2025 · 11 comments
Open
3 tasks

Support commit retrie #964

ZENOTME opened this issue Feb 12, 2025 · 11 comments

Comments

@ZENOTME
Copy link
Contributor

ZENOTME commented Feb 12, 2025

I would like to separate this task into multiple steps:

  1. Identify the RetryableCommitError type.
    We can introduce a new ErrorKind::RetryableCommitError to abstract kinds of catalog errors.
  2. Support to store the update actions and reapply them to the table when the commit fails.
  3. Add retry commit, this requires a retry library.
    About the retry library, personally, I think https://github.com/Xuanwo/backon can be a good candidate. Its maintainer is @Xuanwo. (Thanks for this great job!)

Welcome more suggestions and elaborations. cc @Fokko @Xuanwo @liurenjie1024 @sdd

@liurenjie1024
Copy link
Contributor

Thanks @ZENOTME for raising this. The core part of commit of conflict detection for different isolation levels, which is quite hard to implement correctly. Retry itself is not a big problem, it's just a reload of table metadata and do conflict detection. I'm not familiar with backon, so not sure if it's a fit here.

@ZENOTME
Copy link
Contributor Author

ZENOTME commented Feb 12, 2025

Thanks @ZENOTME for raising this. The core part of commit of conflict detection for different isolation levels, which is quite hard to implement correctly. Retry itself is not a big problem, it's just a reload of table metadata and do conflict detection. I'm not familiar with backon, so not sure if it's a fit here.

Thanks @liurenjie1024. I will take some effort to investigate more about conflict detection.

@ZENOTME
Copy link
Contributor Author

ZENOTME commented Feb 12, 2025

Hi, I take some time to figure out the whole commit process. The whole commit phase can be described as follows:

  1. load current metadata from the catalog
  2. create UpdateAction and apply them to the metadata
    • When applying, there is the conflict detection process based on the current local metadata load at step 1
    • The conflict detection process is specific for the Update Action type. e.g. FastAppend just appends data files so it doesn't have conflict detection.
    • If conflict detection in the apply function fails, it means that the table has some conflict and we can't commit. This process abort
  3. If conflict detection passes, we can send the commit message to the catalog.
    • If the commit fails and the catalog returns CommitFailedException, which means that we commit with stale metadata we can jump back to step 1 and try to commit again.
  4. If commit success, then done.

For now, we only support FastAppend. So we can complete the whole process based on the FastAppend first and complete conflict detection when we introduce other update actions. How do you think? @liurenjie1024

@liurenjie1024
Copy link
Contributor

For now, we only support FastAppend. So we can complete the whole process based on the FastAppend first and complete conflict detection when we introduce other update actions.

This sounds reasonable to me.

@ZENOTME
Copy link
Contributor Author

ZENOTME commented Feb 13, 2025

For now, we only support FastAppend. So we can complete the whole process based on the FastAppend first and complete conflict detection when we introduce other update actions.

This sounds reasonable to me.

Let's move. I will work on this later

@jonathanc-n
Copy link
Contributor

@ZENOTME Are you currently working on this? I'm looing to make some progress on this

@Xuanwo
Copy link
Member

Xuanwo commented Mar 19, 2025

  1. We can introduce a new ErrorKind::RetryableCommitError to abstract kinds of catalog errors.

Hi, I believe it should not be an extra kind. Instead, we can add a retryable field in error.

@ZENOTME
Copy link
Contributor Author

ZENOTME commented Mar 19, 2025

@ZENOTME Are you currently working on this? I'm looing to make some progress on this

Hi @jonathanc-n, for the transaction part, we better waiting for #949 to merge first because it involves interface change. Maybe you can on the retryable error first.

@ZENOTME
Copy link
Contributor Author

ZENOTME commented Mar 19, 2025

  1. We can introduce a new ErrorKind::RetryableCommitError to abstract kinds of catalog errors.

Hi, I believe it should not be an extra kind. Instead, we can add a retryable field in error.

LGTM!

@jonathanc-n
Copy link
Contributor

  1. create UpdateAction and apply them to the metadata

    • When applying, there is the conflict detection process based on the current local metadata load at step 1
    • The conflict detection process is specific for the Update Action type. e.g. FastAppend just appends data files so it doesn't have conflict detection.
    • If conflict detection in the apply function fails, it means that the table has some conflict and we can't commit. This process abort

I'm a bit confused about this part. First of all should we do a operation match and not have to append requirements if there are only append operations.

What do you think about a metadata check (a check if new metadata was already committed, if table already exists, etc.) before we append updates or requirements.

The process would be to:

  1. Call refresh on the table metadata in SnapshotProduceAction
    • This would mean that we would need to pass in a reference of Catalog to TableBuilder to be able to perform the refresh.
    • SnapshotProduceAction can use its transaction instance to use its referenced Table to call the refresh.
  2. Perform the metadata check before appending the updates and requirements to the transaction. This will check the metadata that was used during the snapshot produce process against the current refreshed metadata.
  3. If it fails we can call a CommitExceptionError and have it be retried. Update and requirements never get appended to the Transaction so we do not need to connect to catalog.

Any alternate suggestions or fixes? cc @liurenjie1024 @Fokko @Xuanwo @sdd @ZENOTME @c-thiel

@ZENOTME
Copy link
Contributor Author

ZENOTME commented Mar 31, 2025

Sorry for replying late.

I'm a bit confused about this part. First of all should we do a operation match and not have to append requirements if there are only append operations.

I think we still need an append requirement for the append operation. It just doesn't need the conflict detection, but it still needs a requirement. According to my understanding, the requirement is to check whether there are concurrent snapshot updates. It uses TableRequirement::UuidMatch and TableRequirement::RefSnapshotIdMatch for this.

If the catalog identifies that uuid didn't match, it means that there are concurrent updates. The catalog returns CommitExceptionError. and we should load new metadata and do the conflict detection. If there is conflict then the commit will fail. Otherwise, we can reapply(generate a new update and requirement) and re commit again.

If it fails we can call a CommitExceptionError and have it be retried.

If metadata check fail, I think we should abort because it means that not also there are concurrent update but also conflict. As do in java: https://github.com/apache/iceberg/blob/6d8653bb2bbcecdbe38e98da60ef5578f083e933/core/src/main/java/org/apache/iceberg/BaseTransaction.java#L512

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants