-
Notifications
You must be signed in to change notification settings - Fork 2.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add a way to know if DemotePrimary is blocked and send it in the health stream #17289
Changes from all commits
1863256
fcfd133
0c20bed
743db09
3dd0376
e278e11
0b8c3f4
7258704
50f311a
a7cbc4a
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -19,6 +19,7 @@ package tabletmanager | |
import ( | ||
"context" | ||
"fmt" | ||
"runtime" | ||
"strings" | ||
"time" | ||
|
||
|
@@ -29,6 +30,7 @@ import ( | |
"vitess.io/vitess/go/vt/log" | ||
"vitess.io/vitess/go/vt/mysqlctl" | ||
"vitess.io/vitess/go/vt/proto/vtrpc" | ||
"vitess.io/vitess/go/vt/topo" | ||
"vitess.io/vitess/go/vt/topo/topoproto" | ||
"vitess.io/vitess/go/vt/vterrors" | ||
"vitess.io/vitess/go/vt/vttablet/tabletserver" | ||
|
@@ -520,6 +522,23 @@ func (tm *TabletManager) demotePrimary(ctx context.Context, revertPartialFailure | |
} | ||
defer tm.unlock() | ||
|
||
finishCtx, cancel := context.WithCancel(context.Background()) | ||
defer cancel() | ||
go func() { | ||
select { | ||
case <-finishCtx.Done(): | ||
// Finished running DemotePrimary. Nothing to do. | ||
case <-time.After(10 * topo.RemoteOperationTimeout): | ||
GuptaManan100 marked this conversation as resolved.
Show resolved
Hide resolved
|
||
// We waited for over 10 times of remote operation timeout, but DemotePrimary is still not done. | ||
// Collect more information and signal demote primary is indefinitely stalled. | ||
log.Errorf("DemotePrimary seems to be stalled. Collecting more information.") | ||
tm.QueryServiceControl.SetDemotePrimaryStalled() | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. We don't seem to ever reset this. Is that because once it is stalled the only solution is to restart the tablet? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yes, exactly! There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. if we read the end of There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yes, there is an inherent race between the There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
hm I think that is potentially a problem. because if the operator gets notified that a tablet is stalled, it's going to forcefully throw that tablet away with the assumption that (a) there is another tablet that is the real primary and (b) the stalled primary is not serving any traffic. if the stalled primary is able to rejoin the set of serving tablets, both of those assumptions go out the window, and it is unsafe for the operator to safely throw it away. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yes, that is true, this would trigger an ERS. Let me see if we can make the tablet not become serving ever again if it is stalled. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Okay! I added few more safeties to ensure nothing goes wrong -
I think with these safeguards we can be sure htat a vttablet is not going to accept any new writes once we mark it as stalled. WDYT @maxenglander? Let's also wait for @deepthi to be able to take a look. |
||
buf := make([]byte, 1<<16) // 64 KB buffer size | ||
stackSize := runtime.Stack(buf, true) | ||
log.Errorf("Stack trace:\n%s", string(buf[:stackSize])) | ||
} | ||
}() | ||
|
||
tablet := tm.Tablet() | ||
wasPrimary := tablet.Type == topodatapb.TabletType_PRIMARY | ||
wasServing := tm.QueryServiceControl.IsServing() | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍