=cluster handle aggressively rejoining/replacing nodes from same host/port pair #1083
Conversation
…pected crash in the future
```swift
    actorIdentifier = "[\(id)]"
    _ = metadata.removeValue(forKey: "actor/path") // discard it
} else if let path = metadata.removeValue(forKey: "actor/path") {
    actorIdentifier = "[\(path)]"
```
Revamping this now that we have more distributed actor types and fewer paths 👍
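For context, a minimal sketch of what this hunk does: prefer the actor id over the path when both are present in the log metadata, and discard the path. The `[String: String]` dictionary and the `"actor/id"` key here are stand-ins I'm assuming for illustration; the real code operates on `Logger.Metadata`.

```swift
// Minimal sketch, assuming a [String: String] stand-in for Logger.Metadata
// and an "actor/id" key; sample values are made up.
var metadata: [String: String] = [
    "actor/id": "/user/swim",
    "actor/path": "/user/swim",
]

let actorIdentifier: String
if let id = metadata.removeValue(forKey: "actor/id") {
    actorIdentifier = "[\(id)]"
    _ = metadata.removeValue(forKey: "actor/path") // discard it; the id is the more precise identity
} else if let path = metadata.removeValue(forKey: "actor/path") {
    actorIdentifier = "[\(path)]"
} else {
    actorIdentifier = ""
}

print(actorIdentifier) // prints: [/user/swim]
```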
```diff
@@ -106,7 +106,7 @@ struct SamplePrettyLogHandler: LogHandler {
     let file = file.split(separator: "/").last ?? ""
     let line = line
-    print("\(self.timestamp()) [\(file):\(line)] [\(nodeInfo)\(Self.CONSOLE_BOLD)\(label)\(Self.CONSOLE_RESET)] [\(level)] \(message)\(metadataString)")
+    print("\(self.timestamp()) \(level) [\(nodeInfo)\(Self.CONSOLE_BOLD)\(label)\(Self.CONSOLE_RESET)][\(file):\(line)] \(message)\(metadataString)")
```
Same formatting as our example formatters in the samples.
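Roughly, the change moves the level to the front and attaches `file:line` to the bracketed prefix. A standalone sketch of the new layout (made-up values, console bold/reset escapes omitted for brevity):

```swift
// Sketch of the reordered log line; values are invented, not from the PR.
let timestamp = "13:08:56"
let level = "trace"
let nodeInfo = "sact://System@127.0.0.1:7337 "
let label = "MyActor"
let file = "Example.swift"
let line = 42
let message = "received ping"
let metadataString = ""

// level first, then node/label, then file:line
print("\(timestamp) \(level) [\(nodeInfo)\(label)][\(file):\(line)] \(message)\(metadataString)")
// prints: 13:08:56 trace [sact://System@127.0.0.1:7337 MyActor][Example.swift:42] received ping
```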
```swift
        .lazy
        .filter { member in member.status <= status }
        .count
}
```
Used in tests, but generally useful.
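A hedged sketch of the kind of helper this hunk adds. The exact signature isn't shown in the quoted lines, so the names and the `Int` stand-in for `Cluster.MemberStatus` (which is Comparable) are assumptions:

```swift
// Stand-in types; in the PR this lives on Cluster.Membership.
struct Member {
    var status: Int
}

struct Membership {
    var members: [Member]

    /// Count members whose status has not progressed past `status`.
    func count(atMost status: Int) -> Int {
        self.members
            .lazy
            .filter { member in member.status <= status }
            .count
    }
}

let membership = Membership(members: [Member(status: 0), Member(status: 2)])
print(membership.count(atMost: 1)) // prints: 1
```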
```diff
@@ -223,6 +238,21 @@ extension Cluster {
     }
 }
 
+extension Cluster.Membership: Sequence {
```
Generally useful, so one can write `for member in membership`.
👍
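A minimal sketch of what a `Sequence` conformance like this enables. The internal storage below is an assumption for illustration, not the actual `Cluster.Membership` layout:

```swift
// Stand-in types; the real conformance is declared on Cluster.Membership
// over its stored members.
struct Member {
    var node: String
    var status: String
}

struct Membership {
    var members: [String: Member] = [:]
}

extension Membership: Sequence {
    // Expose the stored members so callers can simply iterate the membership.
    func makeIterator() -> AnyIterator<Member> {
        var base = self.members.values.makeIterator()
        return AnyIterator { base.next() }
    }
}

let membership = Membership(members: [
    "a": Member(node: "sact://System@127.0.0.1:7337", status: "up"),
    "b": Member(node: "sact://System@127.0.0.1:8337", status: "joining"),
])
for member in membership {
    print(member.node, member.status)
}
```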
```swift
   change.member.node.asClusterNode == self.id.node
{
    return
}
```
The fix™
Though this deserves a deeper dig into SWIM as well.
Also, SwiftFormat's formatting of multi-line if is annoying omg!
SWIM follow-up: apple/swift-cluster-membership#91
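A hedged sketch of the shape of the fix in the hunk above: when a SWIM-reported "dead" change refers to our own host/port pair, return early instead of acting on it. The surrounding types and property names below are assumptions based on the quoted lines, not the PR's actual code:

```swift
// Stand-in types; in the PR this check lives where the cluster processes
// SWIM membership changes.
struct Node: Equatable {
    var host: String
    var port: Int
}

struct SWIMChange {
    var node: Node
    var isDead: Bool
}

struct ClusterShell {
    var selfNode: Node

    // Ignore SWIM-level "dead" information that refers to our own host/port pair:
    // we know this (new) node is alive; only high-level gossip marking us .down
    // should ever down "ourselves".
    func receive(_ change: SWIMChange) {
        if change.isDead,
           change.node == self.selfNode
        {
            return // stale info about a previous incarnation on the same host/port
        }
        // ... act on changes concerning other nodes ...
    }
}

let shell = ClusterShell(selfNode: Node(host: "127.0.0.1", port: 7337))
shell.receive(SWIMChange(node: Node(host: "127.0.0.1", port: 7337), isDead: true)) // ignored
```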
Co-authored-by: Yim Lee <[email protected]>
How I love a trailing space being the only failure :P
Resolves: #1082
Short version: there is an edge case that was not handled well when very aggressively rejoining the cluster from the same host/port pair.
This can happen in Kubernetes when a pod gets aggressively restarted, or in command-line apps when someone joins a cluster, sends just one request, and kills the app. In both cases the previous node may still (correctly) be present in SWIM gossip as dead, and the cluster may misinterpret that information as being about "itself" when the new node joins.
More analysis:
steps:
```
2022-11-01T13:08:56+0900 trace Client : actor/id=/user/swim actor/path=/user/swim cluster/node=sact://[email protected]:7337 swim/incarnation=0 swim/members/all=["SWIM.Member(SWIMActor(id:sact://RemoteCluster:[email protected]:8337/user/swim, node:sact://RemoteCluster:[email protected]:8337, alive(incarnation: 0), protocolPeriod: 1)", "SWIM.Member(SWIMActor(id:/user/swim, node:sact://[email protected]:7337, alive(incarnation: 0), protocolPeriod: 0)"] swim/members/count=2 swim/ping/origin=sact://RemoteCluster:[email protected]:8337/user/swim swim/ping/payload=membership([SWIM.Member(SWIMActor(id:sact://[email protected]:7337/user/swim, node:sact://[email protected]:7337, suspect(incarnation: 0, suspectedBy: Set([sact://[email protected]:8337#7602583950674506995])), protocolPeriod: 56), SWIM.Member(SWIMActor(id:sact://RemoteCluster:[email protected]:8337/user/swim, node:sact://RemoteCluster:[email protected]:8337, alive(incarnation: 0), protocolPeriod: 0), SWIM.Member(SWIMActor(id:/user/swim, node:sact://[email protected]:7337, alive(incarnation: 0), protocolPeriod: 56)]) swim/ping/seqNr=4 swim/protocolPeriod=1 swim/suspects/count=0 swim/timeoutSuspectsBeforePeriodMax=11 swim/timeoutSuspectsBeforePeriodMin=4 [DistributedCluster] Received ping@4
```

The ping payload from that line, reformatted for readability:

```
swim/ping/payload=membership([
  SWIM.Member(SWIMActor(id:sact://[email protected]:7337/user/swim, node:sact://[email protected]:7337, suspect(incarnation: 0, suspectedBy: Set([sact://[email protected]:8337#7602583950674506995])), protocolPeriod: 56),
  SWIM.Member(SWIMActor(id:sact://[email protected]:7337/user/swim, node:sact://[email protected]:7337, alive(incarnation: 0), protocolPeriod: 56),
  SWIM.Member(SWIMActor(id:sact://RemoteCluster:[email protected]:8337/user/swim, node:sact://RemoteCluster:[email protected]:8337, alive(incarnation: 0), protocolPeriod: 0)
```
…confirmDead yet -- THIS IS OK. But continuing to act on the removed node's information is NOT ok.
Long story short: SWIM tells us that a node at this address was dead, but we know we are not dead -- acting on that should only happen at the high-level gossip layer, when we see a `.down` about ourselves. So we can ignore this at the SWIM level.
This should also get fixed in SWIM itself though; I'll follow up there.