Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement Circuit Breaker Pattern to Protect PD Leader #8678

Open
Tema opened this issue Oct 3, 2024 · 11 comments
Open

Implement Circuit Breaker Pattern to Protect PD Leader #8678

Tema opened this issue Oct 3, 2024 · 11 comments
Labels
type/feature-request Categorizes issue or PR as related to a new feature.

Comments

@Tema
Copy link
Contributor

Tema commented Oct 3, 2024

Feature Request

Describe your feature request related problem

In large TiDB clusters with hundreds of tidb and tikv nodes, the PD leader can become overwhelmed during certain failure conditions, leading to a "retry storm" or other feedback loop scenarios. Once triggered, PD transitions into a metastable state and cannot recover autonomously, leaving the cluster in a degraded or unavailable state. Existing mechanisms, such as the one discussed in Issue #4480 and the PR #6834, introduce rate-limiting and backoff strategies but are insufficient to prevent PD from being overloaded by a high volume of traffic, even before reaching the server-side limits.

Describe the feature you'd like

I propose implementing a circuit breaker pattern to protect the PD leader from overloading due to retry storms or similar feedback loops. This circuit breaker would:

  • Actively monitor incoming requests through both gRPC and HTTP channels from TiDB, TiKV, TiFlash, CDC, and PD-ctl components.
  • Trip the circuit breaker when a predefined threshold of errors, retries, or resource exhaustion is detected, preventing further requests from overwhelming the PD leader.
  • Allow PD to enter a fail-fast state, limiting incoming traffic until the system has had time to recover or the underlying issue has been resolved.
  • Gradually restore normal operations by allowing a limited number of requests to flow through once the circuit has cooled down and conditions have stabilized.

This feature is especially critical for large clusters where a high number of pods can continuously hammer a single PD instance during failures, leading to cascading effects that worsen recovery times and overall cluster health.

Describe alternatives you've considered

While existing solutions like rate-limiting in Issue #4480 and PR #6834 provide some protection, they are reactive and dependent on the server-side limiter thresholds being hit. These protections do not adequately account for sudden traffic spikes or complex feedback loops that can overload PD before those thresholds are reached. A proactive circuit breaker would mitigate these scenarios by preemptively tripping before PD becomes overwhelmed, ensuring a smoother recovery process.

Teachability, Documentation, Adoption, Migration Strategy

Introducing the circuit breaker pattern would likely require adjustments in the client request logic across TiDB, TiKV, TiFlash, CDC, and PD-ctl components. This feature could be made configurable, allowing users to set custom thresholds and recovery parameters to fit their specific cluster sizes and workloads.

Documentation would need to include clear guidelines on:

  • How the circuit breaker operates across the different components.
  • Configurable options for tuning the circuit breaker thresholds.
  • Best practices for monitoring and adjusting the circuit breaker behavior to avoid false positives or unnecessary tripping.

Scenarios where this feature could be helpful include:

  • A large number of TiDB nodes restart simultaneously due to a failure, causing a surge of reconnection requests to the PD leader and rebuilding local region cache
  • Continuous retries from components like TiKV, TiDB, TiFlash or CDC during network partitions, causing feedback loops that overwhelm PD.
@Tema Tema added the type/feature-request Categorizes issue or PR as related to a new feature. label Oct 3, 2024
@zhangjinpeng87
Copy link
Member

@niubell PTAL

@siddontang
Copy link
Contributor

how about TiKV or other services? do we also need to consider together?

@Tema
Copy link
Contributor Author

Tema commented Oct 30, 2024

how about TiKV or other services? do we also need to consider together?

@siddontang I agree. Given that a single key could be served only by on TiKV, we can get many tidb servers hammering single tikv server, so this looks like a similar problem as with PD.

@rleungx
Copy link
Member

rleungx commented Nov 5, 2024

Stage 1: Introduce circuit breaker for go client

  1. RFC ready

  2. Implement circuit breaker in PD client client: introduce circuit breaker for region calls #8856

    • Implement the basic framework.
    • Wrap some of the current functions, like GetRegion with the circuit breaker.
  3. Support changing settings dynamically through TiDB variables

    • Make circuit breaker dynamically configurable.
    • Add a new TiDB variable to switch on/off the circuit breaker, e.g., tidb_cb_pd_metadata_error_rate_threshold_pct.
  4. Verify the functionality

    • Add endless cases for the circuit breaker.
    • All existing tests, after we enable the circuit breaker, can pass without a drawback.

The more details can be expanded when we implement it.

@rleungx rleungx pinned this issue Nov 5, 2024
@niubell
Copy link
Contributor

niubell commented Nov 8, 2024

how about TiKV or other services? do we also need to consider together?

@siddontang Yes, the circuit-breaker pattern also applies to client-go(TiKV), this RFC only focuses on the high QPS domains/APIs in PD, and other components can reuse some base libs about circuit-breaker, cc @Tema

@niubell
Copy link
Contributor

niubell commented Nov 8, 2024

@niubell PTAL

@zhangjinpeng87 Thanks for the reminder, we have discussed details several times online before the formal RFC, LGTM

@rleungx
Copy link
Member

rleungx commented Dec 12, 2024

@Tema Would you like to implement the TiDB variables? If not, I will do it in the next few days.

@Tema
Copy link
Contributor Author

Tema commented Dec 13, 2024

@rleungx please allow me to finish it. I plan to cut a PR yearly next week.

@Tema
Copy link
Contributor Author

Tema commented Dec 16, 2024

@rleungx I have difficulties to bump pd/client library version to bring circuit breaker into tidb. The problem is github.com/tikv/pd/client/retry was moved to github.com/tikv/pd/client/pkg/retry and some other methods were moved around. Can someone help to bring latest pd/client library to tidb?

@rleungx
Copy link
Member

rleungx commented Dec 17, 2024

@rleungx I have difficulties to bump pd/client library version to bring circuit breaker into tidb. The problem is github.com/tikv/pd/client/retry was moved to github.com/tikv/pd/client/pkg/retry and some other methods were moved around. Can someone help to bring latest pd/client library to tidb?

Sure, @JmPotato is working on it. See tikv/client-go#1525

ti-chi-bot bot pushed a commit that referenced this issue Dec 18, 2024
ti-chi-bot bot added a commit that referenced this issue Dec 18, 2024
ref #8678, ref #8690

Signed-off-by: Ryan Leung <[email protected]>

Co-authored-by: ti-chi-bot[bot] <108142056+ti-chi-bot[bot]@users.noreply.github.com>
@Tema
Copy link
Contributor Author

Tema commented Dec 20, 2024

tikv/client-go#1525 is landed.
Once pingcap/tidb#58440 landed, we would need to bump it one more time for #8936 then I can work on sysvar.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
type/feature-request Categorizes issue or PR as related to a new feature.
Projects
None yet
Development

No branches or pull requests

5 participants