Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[flink] Remove Flink state from write-only writers #4185

Closed
wants to merge 2 commits into from

Conversation

tsreaper
Copy link
Contributor

Purpose

Currently Paimon writer operators in Flink have states, which record the commitUser (for all writers) and the list of active buckets (for lookup or full-compaction changelog producer). These states prevent a writer object to be closed too early and cause conflicts.

However for write-only writers, because they only create new files, there is no conflict. So for these writers we can remove Flink state. This optimization is also useful for bounded stream jobs, because when chaining source and write operator together, if some source parallelism have finished, checkpoint cannot be performed if there are union list states in the operators which are still running.

Tests

Existing tests should cover this change.

API and Format

No format changes.

Documentation

No new feature.

@@ -52,37 +55,54 @@ public TableWriteOperator(
this.initialCommitUser = initialCommitUser;
}

private boolean needState() {
if (table.coreOptions().writeOnly()) {
// Commit user for writers are used to avoid conflicts.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But users may use the commit user to determine if a table's snapshot is generated by the other job. See : #3474

So, I think you can add a config for users to decide whether they need the commitUser state (default true). And explain in the doc that in bounded stream scenarios, if this configuration is true, there will be side effects(when chaining source and write operator together, if some source parallelism have finished, checkpoint cannot be performed).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This commitUser is only for writers. What you have mentioned is the commitUser for comitters.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok

@wwj6591812
Copy link
Contributor

+1

@tsreaper tsreaper marked this pull request as draft September 18, 2024 10:39
@tsreaper tsreaper marked this pull request as ready for review September 18, 2024 11:13
@tsreaper
Copy link
Contributor Author

Closing because write-only writers should also wait for snapshot. They need to read latest sequenceId.

@tsreaper tsreaper closed this Sep 18, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants