Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

client: introduce circuit breaker for region calls #8856

Merged
merged 11 commits into from
Dec 10, 2024

Conversation

Tema
Copy link
Contributor

@Tema Tema commented Nov 26, 2024

What problem does this PR solve?

Issue Number: Ref #8678, tikv/rfcs#115

What is changed and how does it work?

This PR implements circuit breaker from tikv/rfcs#115 and integrates it into http and grpc PD clients for get region metadata calls, but disabled by default.

Check List

Tests

  • Unit test
  • Integration test
  • Manual test (add detailed scripts or steps below)
  • No code

Code changes

Side effects

  • Possible performance regression
  • Increased code complexity
  • Breaking backward compatibility

Related changes

Release note

None.

@ti-chi-bot ti-chi-bot bot added release-note-none Denotes a PR that doesn't merit a release note. dco-signoff: yes Indicates the PR's author has signed the dco. needs-ok-to-test Indicates a PR created by contributors and need ORG member send '/ok-to-test' to start testing. labels Nov 26, 2024
Copy link
Contributor

ti-chi-bot bot commented Nov 26, 2024

Hi @Tema. Thanks for your PR.

I'm waiting for a tikv member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@ti-chi-bot ti-chi-bot bot added the size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. label Nov 26, 2024
Signed-off-by: artem_danilov <[email protected]>
@rleungx
Copy link
Member

rleungx commented Nov 26, 2024

/ok-to-test

@ti-chi-bot ti-chi-bot bot added ok-to-test Indicates a PR is ready to be tested. and removed needs-ok-to-test Indicates a PR created by contributors and need ORG member send '/ok-to-test' to start testing. labels Nov 26, 2024
@rleungx rleungx changed the title [pd-client] introduce circuit breaker for region calls client: introduce circuit breaker for region calls Nov 26, 2024
@rleungx rleungx requested review from nolouch and okJiang November 26, 2024 06:23
@nolouch nolouch requested a review from rleungx November 27, 2024 10:19
var ErrOpenState = errors.New("circuit breaker is open")

// Overloading is a type describing service return value
type Overloading int
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about just using bool?

@@ -541,6 +543,11 @@ func (c *client) GetReplicateConfig(ctx context.Context) (map[string]any, error)
return config, nil
}

// UpdateCircuitBreakerSettings updates config for circuit breaker
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's focus on gRPC calls in this PR.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure, I can stash http client changes for now

)

// ErrOpenState is returned when the CircuitBreaker is open or half-open with pending requests.
var ErrOpenState = errors.New("circuit breaker is open")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about moving it to errs/errno.go?

state *State[T]

successCounter prometheus.Counter
failureCounter prometheus.Counter
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When will it be used?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

failureCounter should be used next to successCounter in onResult instead of fastFailCounter.
The idea of having these metrics is to see overall error rate over time to decide on ErrorRateThresholdPct value to set/overrie

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't find where it is used in the current codebase.

@Tema
Copy link
Contributor Author

Tema commented Nov 28, 2024

@rleungx can you help to take a look at https://github.com/tikv/pd/actions/runs/12043529055/job/33580779593?pr=8856 as I don't see why it marks it as FAILED while as there is no FAIL in the tests.

@rleungx
Copy link
Member

rleungx commented Nov 28, 2024

@rleungx can you help to take a look at https://github.com/tikv/pd/actions/runs/12043529055/job/33580779593?pr=8856 as I don't see why it marks it as FAILED while as there is no FAIL in the tests.

=== RUN   TestCircuitBreaker_OpenState_Not_Enough_QPS
    circuit_breaker_test.go:58: 
        	Error Trace:	/home/runner/work/pd/pd/client/circuit_breaker/circuit_breaker_test.go:58
        	Error:      	Not equal: 
        	            	expected: circuit_breaker.StateType(0)
        	            	actual  : *circuit_breaker.State[int](&circuit_breaker.State[int]{stateType:0, cb:(*circuit_breaker.CircuitBreaker[int])(0xc00012e7e0), end:time.Time{wall:0xc1c9c8f10d7e21ae, ext:30011969707, loc:(*time.Location)(0xd37300)}, pendingCount:0x0, successCount:0x0, failureCount:0x0})
        	Test:       	TestCircuitBreaker_OpenState_Not_Enough_QPS
--- FAIL: TestCircuitBreaker_OpenState_Not_Enough_QPS (0.00s)

Signed-off-by: artem_danilov <[email protected]>
Copy link

codecov bot commented Nov 29, 2024

Codecov Report

Attention: Patch coverage is 80.80808% with 38 lines in your changes missing coverage. Please review.

Project coverage is 76.19%. Comparing base (7df0ff9) to head (4c0cc9e).
Report is 1 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #8856      +/-   ##
==========================================
- Coverage   76.21%   76.19%   -0.03%     
==========================================
  Files         460      461       +1     
  Lines       70239    70431     +192     
==========================================
+ Hits        53532    53662     +130     
- Misses      13362    13409      +47     
- Partials     3345     3360      +15     
Flag Coverage Δ
unittests 76.19% <80.80%> (-0.03%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

@Tema
Copy link
Contributor Author

Tema commented Nov 29, 2024

@rleungx thanks for the comments. I think I've addressed all of those and also fixed all test and lint failures.

client/circuitbreaker/circuit_breaker.go Outdated Show resolved Hide resolved
client/client.go Show resolved Hide resolved
client/circuitbreaker/circuit_breaker.go Outdated Show resolved Hide resolved
Comment on lines +81 to +85
cb := NewCircuitBreaker[int]("test_cb", settings)
re.Equal(StateClosed, cb.state.stateType)
driveQPS(cb, minCountToOpen, Yes, re)
cb.advance(settings.ErrorRateWindow)
assertFastFail(cb, re)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about wraping a function to reduce redundant code?

client/circuitbreaker/circuit_breaker_test.go Show resolved Hide resolved
<-ended[i]
}
assertSucceeds(cb, re)
re.Equal(StateClosed, cb.state.stateType)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This check will be failed. Is it expected?

Suggested change
re.Equal(StateClosed, cb.state.stateType)
re.Equal(StateClosed, cb.state.stateType)
re.Equal(uint32(2), cb.state.successCount)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we test multi-thread scene more?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This check will be failed. Is it expected?

No, first two requests are counted inside HalfOpen state, then we sent only once request after closing circuit breaker, so successCount is expected to be 1. I'm adding this assert.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we test multi-thread scene more?

Multi-threaded testing are hard and verbose, so I tend to add them only to critical functionality like this one. I can add a few more tests to validate that we don't double or undercount, but I feel they are much less critical.

Copy link
Member

@okJiang okJiang Dec 2, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a case to test two threads call Execute during one state(StateOpen)? Just like this case. I misunderstand them. I think we need them all.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure. What invariant I would be testing/asserting on after holding two threads in the Execute?

Copy link
Member

@okJiang okJiang Dec 9, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just test count. Do not need to assert State.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a case to test two threads call Execute during one state(StateOpen)?
Open state does not execute calls, but short circuits them, so can't write a pending requests test.

However, I've added TestCircuitBreaker_Count_Only_Requests_In_Same_Window. PTAL.

client/circuitbreaker/circuit_breaker.go Outdated Show resolved Hide resolved
HalfOpenSuccessCount uint32
}

// AlwaysOpenSettings is a configuration that never trips the circuit breaker.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

never trips = AlwaysCloseSettings?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, circuit breaker trips == opens (the current can't flow anymore), meaning never trips == always closed. Or I misunderstood your question? Feel free to suggest a better comment.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So should it be named AlwaysCloseSettings?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, thank you! Fixed.

Signed-off-by: artem_danilov <[email protected]>
@Tema
Copy link
Contributor Author

Tema commented Dec 5, 2024

@rleungx @nolouch @okJiang thanks for review. I think I've addressed most of comments in the last commit:

  • aligned metric names with state counters
  • added more documentation to state machine
  • checked total for 0 and moved ErrorRateThresholdPct > 0 check earlier

@Tema
Copy link
Contributor Author

Tema commented Dec 5, 2024

/retest

@Tema
Copy link
Contributor Author

Tema commented Dec 6, 2024

TestConnected failed. Does not seem related to this PR. Pass locally.

@Tema
Copy link
Contributor Author

Tema commented Dec 6, 2024

/retest

@ti-chi-bot ti-chi-bot bot added needs-1-more-lgtm Indicates a PR needs 1 more LGTM. approved labels Dec 9, 2024
Copy link
Member

@okJiang okJiang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Except #8856 (comment). Rest lgtm

Signed-off-by: artem_danilov <[email protected]>
@Tema
Copy link
Contributor Author

Tema commented Dec 10, 2024

@nolouch @okJiang I've addressed your comments. PTAL.

Also could you please stamp tikv/rfcs#115

Copy link
Member

@okJiang okJiang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, but wait for @nolouch last review

@ti-chi-bot ti-chi-bot bot added lgtm and removed needs-1-more-lgtm Indicates a PR needs 1 more LGTM. labels Dec 10, 2024
Copy link
Contributor

ti-chi-bot bot commented Dec 10, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: nolouch, rleungx

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Copy link
Contributor

ti-chi-bot bot commented Dec 10, 2024

[LGTM Timeline notifier]

Timeline:

  • 2024-12-09 02:32:50.05079489 +0000 UTC m=+232960.139597433: ☑️ agreed by rleungx.
  • 2024-12-10 03:58:24.664723915 +0000 UTC m=+324494.753526459: ☑️ agreed by nolouch.

@ti-chi-bot ti-chi-bot bot merged commit 1e76110 into tikv:master Dec 10, 2024
25 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved dco-signoff: yes Indicates the PR's author has signed the dco. lgtm ok-to-test Indicates a PR is ready to be tested. release-note-none Denotes a PR that doesn't merit a release note. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants