Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP] Forward ClusterIP traffic to inter-node communication port before CQL is ready #1614

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

zimnx
Copy link
Collaborator

@zimnx zimnx commented Nov 30, 2023

Kubernetes reconciles Endpoints and EndpointSlices for Services having
selector by taking entire Pod readiness into account. Because we need to
allow inter-node communication before CQL traffic we used
PublishNonReadyEndpoints on Services to overcome this.
But at the same time, we would like stop accepting new connections when
Pod is tearing down. These two requirements contradicts with each other.

To satisfy both requirements, Operator will reconcile Endpoints and
EndpointSlice resources in per-container way. If Pod container specifies
port and has its own Readiness probe, Endpoint/EndpointSlice
will become ready for this port when given container is ready. If these
two conditions aren't met, given port becomes ready when entire Pod is
ready.
Controller logic will consider Pod being deleted (non-nil
deletionTimestamp) as fully non-ready so that all endpoints for all ports
and containers will removed.

Additional container added to Scylla Pod controls readiness of ports
used for inter-node communication.

With above logic, nodes observing other node going down, won't be
able to reconnect until restarted node opens a port for inter-node
communication because traffic will be rejected immediately when node
starts terminating. This fixes an issue of traffic disruption during rolling
restarts happening on ScyllaClusters using ClusterIP for inter-node
communication which was caused by nodes being stuck on reconnection
attempts.

Prerequisites:

Fixes #1077

@scylla-operator-bot scylla-operator-bot bot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. do-not-merge/needs-kind Indicates a PR lacks a `kind/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Nov 30, 2023
@scylla-operator-bot scylla-operator-bot bot added the size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. label Nov 30, 2023
@scylla-operator-bot scylla-operator-bot bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Nov 30, 2023
@zimnx zimnx force-pushed the mz/endpointslices-controller branch 5 times, most recently from 4dc2f8f to d77e0f4 Compare December 4, 2023 17:33
@zimnx zimnx force-pushed the mz/endpointslices-controller branch 3 times, most recently from 8d9c038 to 28f6bd1 Compare December 8, 2023 15:41
@zimnx zimnx changed the title [WIP] Endpointslices controller [WIP] Fix traffic disruption happening during rolling restarts Dec 8, 2023
@zimnx zimnx changed the title [WIP] Fix traffic disruption happening during rolling restarts Fix traffic disruption happening during rolling restarts Dec 8, 2023
@scylla-operator-bot scylla-operator-bot bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Dec 8, 2023
@zimnx zimnx added kind/bug Categorizes issue or PR as related to a bug. kind/feature Categorizes issue or PR as related to a new feature. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. labels Dec 8, 2023
@scylla-operator-bot scylla-operator-bot bot removed do-not-merge/needs-kind Indicates a PR lacks a `kind/foo` label and requires one. labels Dec 8, 2023
@zimnx zimnx added do-not-merge/needs-kind Indicates a PR lacks a `kind/foo` label and requires one. and removed kind/feature Categorizes issue or PR as related to a new feature. labels Dec 8, 2023
@scylla-operator-bot scylla-operator-bot bot removed the do-not-merge/needs-kind Indicates a PR lacks a `kind/foo` label and requires one. label Dec 8, 2023
@zimnx zimnx added priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. do-not-merge/needs-kind Indicates a PR lacks a `kind/foo` label and requires one. labels Dec 8, 2023
@scylla-operator-bot scylla-operator-bot bot removed the needs-priority Indicates a PR lacks a `priority/foo` label and requires one. label Dec 8, 2023
@zimnx zimnx removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Dec 8, 2023
@zimnx zimnx requested a review from tnozicka March 27, 2024 22:20
@tnozicka
Copy link
Member

Hi, apologies for the delay getting back with everything that has been going on. This is large enough so I don't want you and me to context switch in small batches and rather allocate a continuous chunk to get this finished. I'll be on PTO next week but I have scheduled this for the top of my list when I get back. If I don't get this submitted by Apr 30th EOD, please ping me!

Copy link
Member

@rzetelskik rzetelskik left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

left a couple of comments, only skimmed through the tests for now

},
)
if err != nil {
objectErrs = append(objectErrs, err)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know the other invocations don't wrap this error, but is there actually a reason for not wrapping it?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no reason, just consistency with others

PublishNotReadyAddresses: true,
Type: corev1.ServiceTypeClusterIP,
Ports: servicePorts(sc),
Selector: nil,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: shall we have a comment explaining why there's no selector? Or maybe even a higher-level comment how we work around it? It's obvious in this PR's context, but probably not so much for someone looking at it in the future. On the other hand it can be backtracked to this commit, so I'm just leaving it as a nit.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added a comment

`,
},
ReadinessProbe: &corev1.Probe{
TimeoutSeconds: int32(30),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: shall we have comments explaining these values?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

they are not special, there's nothing to comment on.

return endpointSlices, fmt.Errorf("can't get Pod %q: %w", naming.ManualRef(sc.Namespace, svc.Name), err)
}

// Don't publish endpoints for Pod that are being deleted.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// Don't publish endpoints for Pod that are being deleted.
// Don't publish endpoints for Pods that are being deleted.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

// Don't publish endpoints for Pod that are being deleted.
// Removing endpoints prevents from new connections being established while still allowing
// for existing connections to survive and finish their requests.
// We need to do this early, as gap between when Pod is actually deleted and Endpoint
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// We need to do this early, as gap between when Pod is actually deleted and Endpoint
// We need to do this early, as a gap between when Pod is actually deleted and Endpoint

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

}

for i := range tt {
tc := tt[i]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no need for this anymore

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

}

for i := range tt {
tc := tt[i]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

}

for i := range tt {
tc := tt[i]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

t.Errorf("expected found %#v, got %#v", tc.expectedFound, found)
}
if !reflect.DeepEqual(container, tc.expectedContainer) {
t.Errorf("expected container %#v, got %#v", tc.expectedContainer, container)
Copy link
Member

@rzetelskik rzetelskik Apr 23, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we use cmp.Diff instead?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done


collidedContainerName, collision := hashes[hash]
if collision {
t.Errorf("found collision on container endpoint ports, both pod and it's controlled ports and %q container with its serving port generate same hash", collidedContainerName)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
t.Errorf("found collision on container endpoint ports, both pod and it's controlled ports and %q container with its serving port generate same hash", collidedContainerName)
t.Errorf("found collision on container endpoint ports, both pod and its controlled ports and %q container with its serving port generate same hash", collidedContainerName)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

t.Fatal(err)
}
if !apiequality.Semantic.DeepEqual(endpoints, tc.expectedEndpoints) {
t.Errorf("expected and actual Endpoints(s) differ: %s", cmp.Diff(tc.expectedEndpoints, endpoints))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Endpoints(s) doesn't really make sense

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

},
},
{
name: "EndpointSlices to Service ports backed by container having readiness probe are ready when container is ready regardless of Pod readiness",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
name: "EndpointSlices to Service ports backed by container having readiness probe are ready when container is ready regardless of Pod readiness",
name: "EndpointSlices for Service ports backed by container having readiness probe are ready when container is ready regardless of Pod readiness",

for consistency mostly

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

},
},
{
name: "EndpointSlice have IPV6 AddressType when PodIP from Pod.Status is an IPv6 address",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
name: "EndpointSlice have IPV6 AddressType when PodIP from Pod.Status is an IPv6 address",
name: "EndpointSlice has IPV6 AddressType when PodIP from Pod.Status is an IPv6 address",

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Comment on lines 4268 to 4310
for _, svcPort := range svc.Spec.Ports {
containerServingPort, ok, err := controllerhelpers.FindContainerServingPort(svcPort, sts.Spec.Template.Spec.Containers)
if err != nil {
t.Fatal(err)
}

ep := discoveryv1.EndpointPort{
Name: pointer.Ptr(svcPort.Name),
Port: pointer.Ptr(svcPort.Port),
Protocol: pointer.Ptr(svcPort.Protocol),
AppProtocol: svcPort.AppProtocol,
}

if ok && containerServingPort.ReadinessProbe != nil {
containerPortsMap[containerServingPort.Name] = append(containerPortsMap[containerServingPort.Name], ep)
} else {
podPorts = append(podPorts, ep)
}
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could this be abstracted to a func shared between the test and implementation, so that we're sure they don't diverge?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

},
},
{
name: "multi EndpointSlices per Service",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: multiple

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

t.Errorf("expected error %#v, got %#v", tc.expectedError, err)
}
if !reflect.DeepEqual(found, tc.expectedFound) {
t.Errorf("expected found %#v, got %#v", tc.expectedFound, found)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

%t instead of %#v?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why? these two produce the same output

@tnozicka
Copy link
Member

tnozicka commented May 2, 2024

@zimnx can you please link the issue on the kubernetes repo we've talked about filling in the past that explains the problem and why the existing API concepts didn't cover it? Was there any feedback you got on the approach or alternative ideas from the community?

pkg/controller/scyllacluster/resource.go Show resolved Hide resolved
pod, err := podLister.Pods(sc.Namespace).Get(svc.Name)
if err != nil {
if apierrors.IsNotFound(err) {
continue
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

V4 log?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

pkg/controller/scyllacluster/resource.go Outdated Show resolved Hide resolved
@@ -249,3 +251,38 @@ func GetScyllaContainerID(pod *corev1.Pod) (string, error) {

return cs.ContainerID, nil
}

func FindContainerServingPort(port corev1.ServicePort, containers []corev1.Container) (corev1.Container, bool, error) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
func FindContainerServingPort(port corev1.ServicePort, containers []corev1.Container) (corev1.Container, bool, error) {
func FindContainerServingPort(svcPort corev1.ServicePort, containers []corev1.Container) (corev1.Container, bool, error) {

port is ambiguous

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

pkg/controllerhelpers/core.go Outdated Show resolved Hide resolved
pkg/controller/scyllacluster/resource.go Outdated Show resolved Hide resolved
pkg/naming/constants.go Outdated Show resolved Hide resolved
return h, err
}

fnvHash := fnv.New32a()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

encoding sha512 wouldn't solve it, as it's too long to be used as part of a name

I didn't get that. Encoding converts the char sets. Hashes (and their encodings) can be arbitrarily truncated as they have random distribution.

pkg/controller/scyllacluster/conditions.go Show resolved Hide resolved
@scylla-operator-bot scylla-operator-bot bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jun 11, 2024
@zimnx zimnx force-pushed the mz/endpointslices-controller branch from 1f80b92 to 48d9bb0 Compare July 9, 2024 20:16
@scylla-operator-bot scylla-operator-bot bot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jul 9, 2024
@tnozicka
Copy link
Member

/assign @rzetelskik
/approve

Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: tnozicka, zimnx

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@zimnx zimnx force-pushed the mz/endpointslices-controller branch from 48d9bb0 to cc8e792 Compare July 10, 2024 08:09
Kubernetes reconciles Endpoints and EndpointSlices for Services having
selector by taking entire Pod readiness into account. Because we need to
allow inter-node communication before CQL traffic we used
PublishNonReadyEndpoints on Services to overcome this.
But at the same time, we would like stop accepting new connections when
Pod is tearing down. These two requirements contradicts with each other.

To satisfy both requirements, Operator will reconcile Endpoints and
EndpointSlice resources in per-container way. If Pod container specifies
port and has its own Readiness probe, Endpoint/EndpointSlice
will become ready for this port when given container is ready. If these
two conditions aren't met, given port becomes ready when entire Pod is
ready.
Controller logic will consider Pod being deleted (non-nil
deletionTimestamp) as fully non-ready so that all endpoints for all
ports
and containers will removed.

Additional container added to Scylla Pod controls readiness of ports
used for inter-node communication.
@zimnx zimnx force-pushed the mz/endpointslices-controller branch from cc8e792 to b14579a Compare July 10, 2024 08:15
Copy link
Contributor

@zimnx: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/e2e-gke-release-script-latest 1f80b92 link true /test e2e-gke-release-script-latest
ci/prow/e2e-gke-parallel-clusterip b14579a link true /test e2e-gke-parallel-clusterip

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@zimnx zimnx changed the title Forward ClusterIP traffic to inter-node communication port before CQL is ready [WIP] Forward ClusterIP traffic to inter-node communication port before CQL is ready Jul 10, 2024
@scylla-operator-bot scylla-operator-bot bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jul 10, 2024
@zimnx
Copy link
Collaborator Author

zimnx commented Jul 10, 2024

Marking as WIP, I need to triage those clusterip failures

Copy link
Contributor

PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@scylla-operator-bot scylla-operator-bot bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jul 24, 2024
Copy link
Contributor

The Scylla Operator project currently lacks enough contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 30d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle stale
  • Close this PR with /close
  • Offer to help out

/lifecycle stale

@scylla-operator-bot scylla-operator-bot bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Aug 23, 2024
Copy link
Contributor

The Scylla Operator project currently lacks enough contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 30d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle rotten
  • Close this PR with /close
  • Offer to help out

/lifecycle rotten

@scylla-operator-bot scylla-operator-bot bot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Sep 22, 2024
@ylebi ylebi removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Sep 23, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. kind/bug Categorizes issue or PR as related to a bug. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

During rollout restart operator doesn't wait until previously restarted pod becomes part of a Scylla cluster
4 participants