-
Notifications
You must be signed in to change notification settings - Fork 2.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Refactor Disk Stall implementation and mark tablet not serving if disk is stalled #17624
base: main
Are you sure you want to change the base?
Conversation
…l status output Signed-off-by: Manan Gupta <[email protected]>
Signed-off-by: Manan Gupta <[email protected]>
Review ChecklistHello reviewers! 👋 Please follow this checklist when reviewing this Pull Request. General
Tests
Documentation
New flags
If a workflow is added or modified:
Backward compatibility
|
@@ -105,4 +105,5 @@ message FullStatus { | |||
uint32 semi_sync_wait_for_replica_count = 20; | |||
bool super_read_only = 21; | |||
replicationdata.Configuration replication_configuration = 22; | |||
bool disk_stalled = 23; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was going to propose the same change, great!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM 👍
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #17624 +/- ##
==========================================
- Coverage 67.64% 67.61% -0.03%
==========================================
Files 1586 1586
Lines 255629 255640 +11
==========================================
- Hits 172910 172858 -52
- Misses 82719 82782 +63 ☔ View full report in Codecov by Sentry. |
@@ -777,7 +778,7 @@ func (sm *stateManager) IsServing() bool { | |||
} | |||
|
|||
func (sm *stateManager) isServingLocked() bool { | |||
return sm.state == StateServing && sm.wantState == StateServing && sm.replHealthy && !sm.demotePrimaryStalled && !sm.lameduck | |||
return sm.state == StateServing && sm.wantState == StateServing && sm.replHealthy && !sm.demotePrimaryStalled && !sm.lameduck && !sm.dhMonitor.IsDiskStalled() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
wow, the list of conditions here is getting long.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, but I see that as a good thing 😆
@@ -97,6 +97,7 @@ type stateManager struct { | |||
replHealthy bool | |||
demotePrimaryStalled bool | |||
lameduck bool | |||
dhMonitor DiskHealthMonitor |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let's give this a descriptive name.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you want me to change DiskHealthMonitor
? or dhMonitor
?
Description
After reading up the feature request #17610, I looked at the implementation of disk stall monitor added in #17470 and how we could use that to address this request.
As stated in #17610 (comment), I changed the code to now make the vttablet broadcast itself as non-serving if the disk is stalled. This would prevent the vtgate from sending queries to this tablet until the disk has recovered.
While at this change, I also refactored how disk stall information was communicated in the full status output. In the original PR I had a comment #16050 (comment) requesting to not use the error as a means of communication in the FullStatus RPC. I forgot to address this comment when I revived the PR. I have made the change in this PR, and introduced a new field in the full status result.
Since the changes haven't gone into a released version of vitess, we don't need to worry about backward compatibility for this refactor.
Related Issue(s)
vtgate
handles replicas with stalled disk #17610Checklist
Deployment Notes