Enhance gorouter retry logic #437
Comments
We discussed this further offline: the underlying issue is that the retry mechanism makes assumptions about the round robin iterator which are not true. It assumes that it will always go over every endpoint exactly once when it calls
The only simple solution I can think of right now is to introduce local round robin to each iterator, but this will most likely make it less round robin, if you get what I mean 😄
A middle ground would be to use the pool index to "seed" each iterator, e.g. give them their first index since we assume that most requests will pass, and
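To make the "seeded" middle ground concrete, here is a minimal Go sketch. It is not gorouter code: `sharedPool`, `seededIterator`, `NewIterator` and `Next` are hypothetical names chosen for illustration. The assumption is that each request takes its start index from the pool-wide counter once and then walks the endpoints locally, so a retry never lands on an endpoint that the same request has already tried.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// sharedPool is a hypothetical stand-in for a route pool.
// Its cursor is shared by all requests and only used to seed iterators.
type sharedPool struct {
	endpoints []string
	cursor    uint64
}

// seededIterator walks the endpoints locally, starting at its seed.
type seededIterator struct {
	endpoints []string
	start     int
	visited   int
}

// NewIterator takes one value from the shared cursor as the start index.
func (p *sharedPool) NewIterator() *seededIterator {
	it := &seededIterator{endpoints: p.endpoints}
	if len(p.endpoints) == 0 {
		return it
	}
	seed := atomic.AddUint64(&p.cursor, 1) - 1
	it.start = int(seed % uint64(len(p.endpoints)))
	return it
}

// Next returns each endpoint at most once per iterator, then reports false.
func (it *seededIterator) Next() (string, bool) {
	if it.visited >= len(it.endpoints) {
		return "", false
	}
	e := it.endpoints[(it.start+it.visited)%len(it.endpoints)]
	it.visited++
	return e, true
}

func main() {
	p := &sharedPool{endpoints: []string{"e0", "e1", "e2"}}
	a := p.NewIterator() // first request
	b := p.NewIterator() // concurrent second request

	a1, _ := a.Next()
	b1, _ := b.Next()
	a2, _ := a.Next() // retry of the first request

	fmt.Println(a1, b1, a2) // prints "e0 e1 e1"
}
```

Across requests the first attempts still rotate through the pool, while retries within one request stay local. The trade-off mentioned above remains: retries no longer advance the shared counter, so the overall distribution is less strictly round robin.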
Co-authored-by: Maximilian Moehl <[email protected]>
Current behavior
Start to send requests:
hey -c 1000 -n 1000 -H "X-Cf-App-Instance":"1edeb42a-076a-41a2-84a6-a70056e9938e:0" "https://retry-test.<domain>"
After 1-2 seconds:
hey -c 1000 -n 1000 "https://retry-test.<domain>"
All 1000 requests with the X-Cf-App-Instance header are successful.
About 1/4 of the requests without the X-Cf-App-Instance header finished with 502 because gorouter sent the retries to the same app instance.
Observations:
Found logs (written by the same gorouter):
The request with vcap_request_id=703d8ce7-b187-41b6-4a02-a122d1ee8b60 was sent to the same instance_id=0 two times, reached the max_retries value and finished with 502.
In other words, according to the round robin logic for selecting the next endpoint:
The Gorouter gets a first request (703d8ce7-b187-41b6-4a02-a122d1ee8b60) ->
The Gorouter gets a request (ffed058e-a289-4ca9-7d22-0e345c2bd54f)
The Gorouter retries the first request (703d8ce7-b187-41b6-4a02-a122d1ee8b60) ->
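The interleaving above can be reproduced with a small Go sketch. It is a simplified model, not gorouter's implementation: `pool` and `simulateInterleaving` are hypothetical names, and the assumption is that all requests share a single round robin cursor over two app instances.

```go
package main

import "fmt"

// pool is a hypothetical route pool with one cursor shared by all requests.
type pool struct {
	endpoints []string
	next      int
}

// Next returns the endpoint under the shared cursor and advances it.
func (p *pool) Next() string {
	e := p.endpoints[p.next%len(p.endpoints)]
	p.next++
	return e
}

// simulateInterleaving replays the sequence from the logs:
// first request, another request, then the retry of the first request.
func simulateInterleaving() (first, other, retry string) {
	p := &pool{endpoints: []string{"instance-0", "instance-1"}}
	first = p.Next() // first request
	other = p.Next() // a different request advances the shared cursor
	retry = p.Next() // retry of the first request
	return first, other, retry
}

func main() {
	first, other, retry := simulateInterleaving()
	fmt.Println(first, other, retry) // prints "instance-0 instance-1 instance-0"
}
```

With two instances, a single request arriving between the first attempt and its retry is enough to send the retry back to the instance that just failed.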
EOF is classified as a retriable backend error, but it does not belong to FailableClassifiers.
Desired behavior
Affected Version
routing-release: 0.301