
Support reconnection/resync with Typha #10306

Open
wants to merge 3 commits into
base: master
from

fasaxc (Member) commented Apr 25, 2025

Description

  • Add optional callback for reconnection-aware clients.

  • Adjust Typha discovery to reset after all Typhas have been tried.

  • Make the dedupe buffer reconnection-aware. It now

    • Stores off the keys that it had previously seen when it gets the OnTyphaConnectionRestarted() call.
    • Discards those seen keys as the resync progresses.
    • Synthesises deletions for KVs that weren't seen during the resync.
    • Recalculates the UpdateType when sending keys downstream so that the calculation graph sees a resync as a sequence of updates for existing keys.
  • Refactor the client so that it

    • Does one connection synchronously (including connection attempts to multiple Typha instances as before).
    • Reconnects in the background after a failure.
    • Sends WaitForDatastore/ResyncInProgress messages when it's doing a reconnection. (This should make the felix_resyncs_started and felix_resync_state Prometheus metrics useful again.)
    • Re-uses a single connection attempt tracker so that we cycle through Typha instances on reconnect.
  • Various minor changes:

    • Add "done" channels to various components to avoid "log to testing.T after test finished" errors in new tests.
    • Add 32 bit random value to connection ID. Makes it a lot more greppable in logs.
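The dedupe buffer behaviour listed above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the `DedupeBuffer` type, its fields, the `OnUpdate`/`OnInSync` methods, and the local `UpdateType` constants are hypothetical stand-ins; only the `OnTyphaConnectionRestarted()` callback name comes from the description.

```go
package main

import (
	"fmt"
	"sort"
)

// UpdateType is a local stand-in for the update type that is sent downstream.
type UpdateType int

const (
	UpdateTypeKVNew UpdateType = iota
	UpdateTypeKVUpdated
	UpdateTypeKVDeleted
)

// Update is what this toy buffer emits to the downstream consumer.
type Update struct {
	Key   string
	Value string
	Type  UpdateType
}

// DedupeBuffer is a hypothetical, simplified reconnection-aware buffer.
type DedupeBuffer struct {
	live    map[string]string   // keys already sent downstream
	pending map[string]struct{} // keys seen before the restart, not yet re-seen
	sent    []Update            // stands in for the downstream callback
}

func NewDedupeBuffer() *DedupeBuffer {
	return &DedupeBuffer{
		live:    map[string]string{},
		pending: map[string]struct{}{},
	}
}

// OnTyphaConnectionRestarted stores off every key seen so far.
func (d *DedupeBuffer) OnTyphaConnectionRestarted() {
	d.pending = make(map[string]struct{}, len(d.live))
	for k := range d.live {
		d.pending[k] = struct{}{}
	}
}

// OnUpdate discards the key from the pending set and recalculates the
// update type: a key that downstream already knows is an update, not new.
func (d *DedupeBuffer) OnUpdate(key, value string) {
	delete(d.pending, key)
	t := UpdateTypeKVNew
	if _, known := d.live[key]; known {
		t = UpdateTypeKVUpdated
	}
	d.live[key] = value
	d.sent = append(d.sent, Update{Key: key, Value: value, Type: t})
}

// OnInSync synthesises deletions for keys that were not seen in the resync.
func (d *DedupeBuffer) OnInSync() {
	keys := make([]string, 0, len(d.pending))
	for k := range d.pending {
		keys = append(keys, k)
	}
	sort.Strings(keys) // deterministic order, for the example only
	for _, k := range keys {
		delete(d.live, k)
		d.sent = append(d.sent, Update{Key: k, Type: UpdateTypeKVDeleted})
	}
	d.pending = map[string]struct{}{}
}

// Sent returns everything emitted downstream so far.
func (d *DedupeBuffer) Sent() []Update { return d.sent }

func main() {
	d := NewDedupeBuffer()
	d.OnUpdate("a", "1")
	d.OnUpdate("b", "2")
	d.OnTyphaConnectionRestarted()
	d.OnUpdate("a", "1") // re-sent during the resync
	d.OnInSync()         // "b" was not re-sent, so a deletion is synthesised
	for _, u := range d.Sent() {
		fmt.Println(u.Key, u.Type)
	}
}
```

In this sketch the synthesised deletions are only emitted once the resync completes, since that is the first point at which a missing key can be told apart from one that has not arrived yet.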

Related issues/PRs

CORE-11348

Todos

The client will still bail out if it can't connect to any Typha. Not sure if that's desirable or not; we could keep retrying but issues like running out of file handles might be better handled with a restart.

  • Tests
  • Documentation
  • Release note

Release Note

Typha clients (such as Felix) now try to reconnect and resync with Typha on connection failure.  This reduces the impact of Typha restart and load rebalancing.

Reminder for the reviewer

Make sure that this PR has the correct labels and milestone set.

Every PR needs one docs-* label.

  • docs-pr-required: This change requires a change to the documentation that has not been completed yet.
  • docs-completed: This change has all necessary documentation completed.
  • docs-not-required: This change has no user-facing impact and requires no docs.

Every PR needs one release-note-* label.

  • release-note-required: This PR has user-facing changes. Most PRs should have this label.
  • release-note-not-required: This PR has no user-facing changes.

Other optional labels:

  • cherry-pick-candidate: This PR should be cherry-picked to an earlier release. For bug fixes only.
  • needs-operator-pr: This PR is related to install and requires a corresponding change to the operator.

@marvin-tigera marvin-tigera added this to the Calico v3.31.0 milestone Apr 25, 2025
@marvin-tigera marvin-tigera added release-note-required Change has user-facing impact (no matter how small) docs-pr-required Change is not yet documented labels Apr 25, 2025
@fasaxc fasaxc force-pushed the typha-reconnection branch from ca8aba2 to a25c0f5 Compare April 28, 2025 15:01
@fasaxc fasaxc force-pushed the typha-reconnection branch 4 times, most recently from 0106904 to d599369 Compare April 29, 2025 14:46
@fasaxc fasaxc marked this pull request as ready for review April 29, 2025 14:51
@fasaxc fasaxc requested a review from a team as a code owner April 29, 2025 14:51
@fasaxc fasaxc added docs-not-required Docs not required for this change and removed docs-pr-required Change is not yet documented labels Apr 29, 2025
@fasaxc fasaxc force-pushed the typha-reconnection branch from d599369 to 8a00314 Compare April 29, 2025 15:21
log.Debugf("Typha supports node resource updates: %v", supportsNodeResourceUpdates)
configParams.SetUseNodeResourceUpdates(supportsNodeResourceUpdates)
// Up-to-date Typha client will refuse to connect unless Typha signals
// that it supports node resource updates.
Contributor

Any issue with version skew on-upgrade?

Member Author

I think the feature was added in 2019 so, yes, but only if you're skipping a dozen versions. Even in that case, it will only block new Felix from talking to old Typha; Felix will just keep trying until it connects to an up-level Typha.

Labels
docs-not-required Docs not required for this change release-note-required Change has user-facing impact (no matter how small)
3 participants