Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enhance gorouter retry logic #437

Open
b1tamara opened this issue Sep 10, 2024 · 1 comment
Open

Enhance gorouter retry logic #437

b1tamara opened this issue Sep 10, 2024 · 1 comment
Labels

Comments

@b1tamara
Copy link
Contributor

b1tamara commented Sep 10, 2024

Current behavior

  • balancing_algorithm: round_robin
  • cf app with two instances
  • each instance has max_connection limit = 1000
  • app code:
import express from 'express';
import http from 'http';
import { setTimeout } from 'timers/promises';

const app = express();

app.get('/', async (req, res) => {
    await setTimeout(5000);
    res.send('Hello, World!');
});

const server = http.createServer(app);
server.maxConnections = 1000;

const PORT = process.env.PORT || 3000;
server.listen(PORT, () => {
    console.log(`Server is running on http://localhost:${PORT}`);
    console.log(`Max connections: ${server.maxConnections}`);
});

Start to send requests:
hey -c 1000 -n 1000 -H "X-Cf-App-Instance":"1edeb42a-076a-41a2-84a6-a70056e9938e:0" "https://retry-test.<domain>"
After 1-2 seconds:
hey -c 1000 -n 1000 "https://retry-test.<domain>"

All 1000 requests with the CF-App-Instance Header are successful.
Ca. 1/4 of request without CF-App-Instance Header finished with 502 as gorouter sent retries to the same app instance.

Observations:

  • Every RequestInfo has a pointer to the EndpointPool. For the same uri(route) it points to the same EndpointPool.
  • The issue is related to the iteration over pool endpoints.
  • Gorouter gets a lot of concurrent requests to the same route.
  • Max_retries = number of app instances.
  • A couple of seconds the app instance with instance_id = 0 cannot accept new connections because of max connection limit in the app code.

Found logs (written by the same gorouter):

"timestamp":"2024-09-10T09:46:47.350182234Z","message":"backend","data":{"instance_id":"0"},"attempt":1,"vcap_request_id":"703d8ce7-b187-41b6-4a02-a122d1ee8b60"

"timestamp":"2024-09-10T09:46:47.350406060Z","message":"backend","data":{"instance_id":"1"},"attempt":1,"vcap_request_id":"ffed058e-a289-4ca9-7d22-0e345c2bd54f"

"timestamp":"2024-09-10T09:46:47.357750671Z","message":"backend","data":{"instance_id":"0"},"attempt":2,"vcap_request_id":"703d8ce7-b187-41b6-4a02-a122d1ee8b60"

"timestamp":"2024-09-10T09:46:47.357588430Z","message":"backend-endpoint-failed","data":{"instance_id":"0"},"error":"EOF (via idempotent request)","attempt":1,"vcap_request_id":"703d8ce7-b187-41b6-4a02-a122d1ee8b60"

"timestamp":"2024-09-10T09:46:47.363893483Z","message":"backend-endpoint-failed","data":{"instance_id":"0"},"error":"EOF (via idempotent request)","attempt":2,"vcap_request_id":"703d8ce7-b187-41b6-4a02-a122d1ee8b60"

The request with vcap_request_id=703d8ce7-b187-41b6-4a02-a122d1ee8b60 was sent to the same instance_id=0 two times, reached the max_retries value and finished with 502.

In other words according to the Round Robin logic for seeking of next endpoint:
The Gorouter gets a first request (703d8ce7-b187-41b6-4a02-a122d1ee8b60) ->

  • pool.currentIndex=0
  • pool.nextIdx = 1
  • attempt = 1
  • selectedEndpoint: app instance with instance_id=0
  • Result: backend-endpoint-failed (EOF) & It should be retried.

The Gorouter gets a request (ffed058e-a289-4ca9-7d22-0e345c2bd54f)

  • pool.currentIndex=1
  • pool.nextIdx = 0
  • attempt=1
  • selectedEndpoint: app instance with instance_id=1
  • Result: Success

The Gorouter retries the first request (703d8ce7-b187-41b6-4a02-a122d1ee8b60) ->

  • pool.currentIndex=0
  • pool.nextIdx = 1
  • attempt=2
  • selectedEndpoint: app instance with instance_id=0
  • Result: backend-endpoint-failed (EOF) & 502 as max_retries reached.

EOF is classified as retriable backend error but it does not belongt to FailableClassifiers.

Desired behavior

  • All requests have to be successful as two app instances can accept and process 2000 requests.
  • Gorouter's retry should not go to the same app instance twice.

Affected Version

routing-release: 0.301

@maxmoehl
Copy link
Member

We discussed this further offline: the underlying issue is that the retry mechanism makes assumptions about the round robin iterator which are not true. It assumes that it will always go over every endpoint exactly once when it calls next (we documented that but apparently never properly verified it 😞) which is not true because the iterator uses a global counter from the referenced pool which may be incremented by parallel requests.

The only simple solution I can think of right now is to introduce local round robin to each iterator but this will most likely make it less round robin if you get what I mean 😄

A middle ground would be to use the pool index to "seed" each iterator, e.g. give them their first index since we assume that most requests will pass, and next calls on that iterator will then loop over the entire pool exactly once.

b1tamara added a commit to sap-contributions/gorouter that referenced this issue Oct 2, 2024
b1tamara added a commit to sap-contributions/gorouter that referenced this issue Oct 2, 2024
ameowlia pushed a commit that referenced this issue Nov 4, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
Development

No branches or pull requests

2 participants