Enhance gorouter retry logic #437
Description
Current behavior
- balancing_algorithm: round_robin
- cf app with two instances
- each instance has max_connection limit = 1000
- app code:
import express from 'express';
import http from 'http';
import { setTimeout } from 'timers/promises';
const app = express();
app.get('/', async (req, res) => {
await setTimeout(5000);
res.send('Hello, World!');
});
const server = http.createServer(app);
server.maxConnections = 1000;
const PORT = process.env.PORT || 3000;
server.listen(PORT, () => {
console.log(`Server is running on http://localhost:${PORT}`);
console.log(`Max connections: ${server.maxConnections}`);
});
Start to send requests:
hey -c 1000 -n 1000 -H "X-Cf-App-Instance":"1edeb42a-076a-41a2-84a6-a70056e9938e:0" "https://retry-test.<domain>"
After 1-2 seconds:
hey -c 1000 -n 1000 "https://retry-test.<domain>"
All 1000 requests with the CF-App-Instance Header are successful.
Ca. 1/4 of request without CF-App-Instance Header finished with 502 as gorouter sent retries to the same app instance.
Observations:
- Every RequestInfo has a pointer to the EndpointPool. For the same uri(route) it points to the same EndpointPool.
- The issue is related to the iteration over pool endpoints.
- Gorouter gets a lot of concurrent requests to the same route.
- Max_retries = number of app instances.
- A couple of seconds the app instance with instance_id = 0 cannot accept new connections because of max connection limit in the app code.
Found logs (written by the same gorouter):
"timestamp":"2024-09-10T09:46:47.350182234Z","message":"backend","data":{"instance_id":"0"},"attempt":1,"vcap_request_id":"703d8ce7-b187-41b6-4a02-a122d1ee8b60"
"timestamp":"2024-09-10T09:46:47.350406060Z","message":"backend","data":{"instance_id":"1"},"attempt":1,"vcap_request_id":"ffed058e-a289-4ca9-7d22-0e345c2bd54f"
"timestamp":"2024-09-10T09:46:47.357750671Z","message":"backend","data":{"instance_id":"0"},"attempt":2,"vcap_request_id":"703d8ce7-b187-41b6-4a02-a122d1ee8b60"
"timestamp":"2024-09-10T09:46:47.357588430Z","message":"backend-endpoint-failed","data":{"instance_id":"0"},"error":"EOF (via idempotent request)","attempt":1,"vcap_request_id":"703d8ce7-b187-41b6-4a02-a122d1ee8b60"
"timestamp":"2024-09-10T09:46:47.363893483Z","message":"backend-endpoint-failed","data":{"instance_id":"0"},"error":"EOF (via idempotent request)","attempt":2,"vcap_request_id":"703d8ce7-b187-41b6-4a02-a122d1ee8b60"
The request with vcap_request_id=703d8ce7-b187-41b6-4a02-a122d1ee8b60 was sent to the same instance_id=0 two times, reached the max_retries value and finished with 502.
In other words according to the Round Robin logic for seeking of next endpoint:
The Gorouter gets a first request (703d8ce7-b187-41b6-4a02-a122d1ee8b60) ->
- pool.currentIndex=0
- pool.nextIdx = 1
- attempt = 1
- selectedEndpoint: app instance with instance_id=0
- Result: backend-endpoint-failed (EOF) & It should be retried.
The Gorouter gets a request (ffed058e-a289-4ca9-7d22-0e345c2bd54f)
- pool.currentIndex=1
- pool.nextIdx = 0
- attempt=1
- selectedEndpoint: app instance with instance_id=1
- Result: Success
The Gorouter retries the first request (703d8ce7-b187-41b6-4a02-a122d1ee8b60) ->
- pool.currentIndex=0
- pool.nextIdx = 1
- attempt=2
- selectedEndpoint: app instance with instance_id=0
- Result: backend-endpoint-failed (EOF) & 502 as max_retries reached.
EOF is classified as retriable backend error but it does not belongt to FailableClassifiers.
Desired behavior
- All requests have to be successful as two app instances can accept and process 2000 requests.
- Gorouter's retry should not go to the same app instance twice.
Affected Version
routing-release: 0.301