[BugFix] Fix the bug in the FE clone scheduling that causes an infinite loop #50561

Merged
merged 5 commits into from
Sep 6, 2024

Conversation

@sevev sevev commented Sep 2, 2024

Why I'm doing:

Some VERSION_INCOMPLETE replicas may never be scheduled for clone repair. Consider the following scenario:

  1. The partition's visible version is 100, because some replicas are still at version 100 and publishing version 101 failed.
  2. Ingestion is continuous, and the latest committed version is 120.
  3. There are two VERSION_INCOMPLETE replicas. The first replica failed to commit versions 101 and 120, so its version is 100 and its last failed version is 120. The second replica failed to commit versions 102 and 108, so its version is 101 and its last failed version is 108.
  4. Both replicas are considered in need of repair because their last failed versions are greater than 0, and the replica with the smaller last failed version is chosen for repair first.
  5. However, the clone task uses the visible version, because a clone task can only clone data up to the visible version. So the second replica's version is still 101 and its last failed version is still 108.
  6. After the clone, the second replica is still VERSION_INCOMPLETE, so it is selected for repair again and again. As a result, the first replica is never repaired and the partition's visible version stays blocked.
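
The selection loop in steps 4 to 6 can be sketched as follows. This is a minimal illustration, not StarRocks code: `StuckReplica`, `OldOrderDemo`, and their methods are hypothetical names, and the clone is modeled as leaving the chosen replica unchanged, which is what step 5 describes for this scenario.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical, simplified replica model (not the StarRocks Replica class).
final class StuckReplica {
    final String name;
    final long version;
    final long lastFailedVersion;

    StuckReplica(String name, long version, long lastFailedVersion) {
        this.name = name;
        this.version = version;
        this.lastFailedVersion = lastFailedVersion;
    }

    // Step 4: a replica is a repair candidate when its last failed version is > 0.
    boolean needsRepair() {
        return lastFailedVersion > 0;
    }
}

final class OldOrderDemo {
    // Old ordering from step 4: the smaller last failed version goes first.
    static StuckReplica pickByLastFailedVersion(List<StuckReplica> replicas) {
        return replicas.stream()
                .filter(StuckReplica::needsRepair)
                .min(Comparator.comparingLong((StuckReplica r) -> r.lastFailedVersion))
                .orElse(null);
    }

    // Runs a fixed number of scheduling rounds. Assumption taken from steps 5 and 6:
    // the clone leaves the chosen replica's state unchanged, so nothing is updated
    // between rounds.
    static List<String> schedule(List<StuckReplica> replicas, int rounds) {
        List<String> picked = new ArrayList<>();
        for (int i = 0; i < rounds; i++) {
            StuckReplica chosen = pickByLastFailedVersion(replicas);
            if (chosen == null) {
                break;
            }
            picked.add(chosen.name);
        }
        return picked;
    }

    // The two replicas from step 3.
    static List<StuckReplica> scenario() {
        return Arrays.asList(
                new StuckReplica("first", 100, 120),
                new StuckReplica("second", 101, 108));
    }

    public static void main(String[] args) {
        System.out.println(schedule(scenario(), 3));
    }
}
```

Under these assumptions every round selects the second replica, so the first replica never gets a turn.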

What I'm doing:

Repair the replica with the smaller version first.
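
A minimal sketch of that ordering, assuming the fix amounts to comparing candidates by replica version instead of by last failed version. `RepairCandidate` and `VersionFirstDemo` are hypothetical names; the actual change is in `TabletSchedCtx.java` and may differ in detail.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

// Hypothetical, simplified replica model (not the StarRocks Replica class).
final class RepairCandidate {
    final String name;
    final long version;
    final long lastFailedVersion;

    RepairCandidate(String name, long version, long lastFailedVersion) {
        this.name = name;
        this.version = version;
        this.lastFailedVersion = lastFailedVersion;
    }
}

final class VersionFirstDemo {
    // Assumed shape of the fix: among replicas with a failure mark,
    // prefer the one whose version is smallest.
    static Optional<RepairCandidate> pickBySmallestVersion(List<RepairCandidate> replicas) {
        return replicas.stream()
                .filter(r -> r.lastFailedVersion > 0)
                .min(Comparator.comparingLong((RepairCandidate r) -> r.version));
    }

    // The two replicas from the scenario in the description.
    static List<RepairCandidate> scenario() {
        return Arrays.asList(
                new RepairCandidate("first", 100, 120),
                new RepairCandidate("second", 101, 108));
    }

    public static void main(String[] args) {
        pickBySmallestVersion(scenario()).ifPresent(r -> System.out.println(r.name));
    }
}
```

With the scenario's numbers, the first replica (version 100) is now chosen ahead of the second (version 101).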

Fixes #issue

What type of PR is this:

  • BugFix
  • Feature
  • Enhancement
  • Refactor
  • UT
  • Doc
  • Tool

Does this PR entail a change in behavior?

  • Yes, this PR will result in a change in behavior.
  • No, this PR will not result in a change in behavior.

If yes, please specify the type of change:

  • Interface/UI changes: syntax, type conversion, expression evaluation, display information
  • Parameter changes: default values, similar parameters but with different default values
  • Policy changes: use new policy to replace old one, functionality automatically enabled
  • Feature removed
  • Miscellaneous: upgrade & downgrade compatibility, etc.

Checklist:

  • I have added test cases for my bug fix or my new feature
  • This pr needs user documentation (for new or modified features or behaviors)
    • I have added documentation for my new feature or new function
  • This is a backport pr

Bugfix cherry-pick branch check:

  • I have checked the version labels which the pr will be auto-backported to the target branch
    • 3.3
    • 3.2
    • 3.1
    • 3.0
    • 2.5

@sevev sevev requested a review from a team as a code owner September 2, 2024 09:57
@mergify mergify bot assigned sevev Sep 2, 2024
Signed-off-by: sevev <[email protected]>

sonarcloud bot commented Sep 3, 2024


github-actions bot commented Sep 3, 2024

[Java-Extensions Incremental Coverage Report]

pass : 0 / 0 (0%)


github-actions bot commented Sep 3, 2024

[FE Incremental Coverage Report]

pass : 4 / 4 (100.00%)

file detail

| path | covered_line | new_line | coverage | not_covered_line_detail |
|---|---|---|---|---|
| 🔵 com/starrocks/clone/TabletSchedCtx.java | 4 | 4 | 100.00% | [] |


github-actions bot commented Sep 3, 2024

[BE Incremental Coverage Report]

pass : 0 / 0 (0%)

@gengjun-git gengjun-git self-assigned this Sep 4, 2024
@sevev sevev merged commit 19b833f into StarRocks:main Sep 6, 2024
49 checks passed

github-actions bot commented Sep 6, 2024

@Mergifyio backport branch-3.3

@github-actions github-actions bot removed the 3.3 label Sep 6, 2024

github-actions bot commented Sep 6, 2024

@Mergifyio backport branch-3.2

@github-actions github-actions bot removed the 3.2 label Sep 6, 2024

github-actions bot commented Sep 6, 2024

@Mergifyio backport branch-3.1


mergify bot commented Sep 6, 2024

backport branch-3.3

✅ Backports have been created

@github-actions github-actions bot removed the 3.1 label Sep 6, 2024

mergify bot commented Sep 6, 2024

backport branch-3.2

✅ Backports have been created


mergify bot commented Sep 6, 2024

backport branch-3.1

✅ Backports have been created

mergify bot pushed a commit that referenced this pull request Sep 6, 2024
…te loop (#50561)

## Why I'm doing:
Some `VERSION_INCOMPLETE` replicas may never be scheduled for clone repair. Consider the following scenario:
1. The partition's visible version is `100`, because some replicas are still at version `100` and publishing version `101` failed.
2. Ingestion is continuous, and the latest committed version is `120`.
3. There are two `VERSION_INCOMPLETE` replicas. The first replica failed to commit versions `101` and `120`, so its version is `100` and its last failed version is `120`. The second replica failed to commit versions `102` and `108`, so its version is `101` and its last failed version is `108`.
4. Both replicas are considered in need of repair because their `last failed version` values are greater than 0, and the replica with the smaller `last failed version` is chosen for repair first.
5. However, the clone task uses the visible version, because a clone task can only clone data up to the visible version. So the second replica's version is still `101` and its last failed version is still `108`.
6. After the clone, the second replica is still `VERSION_INCOMPLETE`, so it is selected for repair again and again. As a result, the first replica is never repaired and the partition's visible version stays blocked.

## What I'm doing:
Repair the replica with the smaller version first.

Signed-off-by: sevev <[email protected]>
(cherry picked from commit 19b833f)
mergify bot pushed a commit that referenced this pull request Sep 6, 2024
mergify bot pushed a commit that referenced this pull request Sep 6, 2024
wanpengfei-git pushed a commit that referenced this pull request Sep 6, 2024

sevev commented Sep 6, 2024

@mergify backport branch-2.5


mergify bot commented Sep 6, 2024

backport branch-2.5

✅ Backports have been created

mergify bot pushed a commit that referenced this pull request Sep 6, 2024
wanpengfei-git pushed a commit that referenced this pull request Sep 6, 2024
xiangguangyxg pushed a commit to xiangguangyxg/starrocks that referenced this pull request Sep 6, 2024
wanpengfei-git pushed a commit that referenced this pull request Sep 9, 2024
HangyuanLiu pushed a commit to HangyuanLiu/starrocks that referenced this pull request Sep 12, 2024
wanpengfei-git pushed a commit that referenced this pull request Sep 23, 2024
renzhimin7 pushed a commit to renzhimin7/starrocks that referenced this pull request Nov 7, 2024
Signed-off-by: sevev <[email protected]>
Signed-off-by: zhiminr.ren <[email protected]>
3 participants