Enhance gorouter retry logic

### Current behavior

- balancing_algorithm: round_robin
- cf app with two instances
- each instance has max_connection limit = 1000 
- app code:
```
import express from 'express';
import http from 'http';
import { setTimeout } from 'timers/promises';

const app = express();

app.get('/', async (req, res) => {
    await setTimeout(5000);
    res.send('Hello, World!');
});

const server = http.createServer(app);
server.maxConnections = 1000;

const PORT = process.env.PORT || 3000;
server.listen(PORT, () => {
    console.log(`Server is running on http://localhost:${PORT}`);
    console.log(`Max connections: ${server.maxConnections}`);
});
```

Start to send requests:
`hey -c 1000 -n 1000 -H "X-Cf-App-Instance":"1edeb42a-076a-41a2-84a6-a70056e9938e:0" "https://retry-test.<domain>"`
After 1-2 seconds:
`hey -c 1000 -n 1000 "https://retry-test.<domain>"`

All 1000 requests with the CF-App-Instance Header are successful. 
Ca. 1/4 of request without  CF-App-Instance Header finished with 502 as gorouter sent retries to the same app instance.

Observations: 
- Every RequestInfo has a pointer to the [EndpointPool](https://github.com/cloudfoundry/gorouter/blob/6331451d30dbf996cc9132ab617f07b25707b06b/handlers/lookup.go#L108). For the same uri(route) it points to the same EndpointPool. 
- The issue is related to the iteration [over pool endpoints](https://github.com/cloudfoundry/gorouter/blob/6331451d30dbf996cc9132ab617f07b25707b06b/route/roundrobin.go#L74). 
- Gorouter gets a lot of concurrent requests to the same route. 
- Max_retries = number of app instances.
- A couple of seconds the app instance with instance_id = 0 cannot accept new connections because of max connection limit in the app code. 

Found logs (written by the same gorouter):
```
"timestamp":"2024-09-10T09:46:47.350182234Z","message":"backend","data":{"instance_id":"0"},"attempt":1,"vcap_request_id":"703d8ce7-b187-41b6-4a02-a122d1ee8b60"

"timestamp":"2024-09-10T09:46:47.350406060Z","message":"backend","data":{"instance_id":"1"},"attempt":1,"vcap_request_id":"ffed058e-a289-4ca9-7d22-0e345c2bd54f"

"timestamp":"2024-09-10T09:46:47.357750671Z","message":"backend","data":{"instance_id":"0"},"attempt":2,"vcap_request_id":"703d8ce7-b187-41b6-4a02-a122d1ee8b60"

"timestamp":"2024-09-10T09:46:47.357588430Z","message":"backend-endpoint-failed","data":{"instance_id":"0"},"error":"EOF (via idempotent request)","attempt":1,"vcap_request_id":"703d8ce7-b187-41b6-4a02-a122d1ee8b60"

"timestamp":"2024-09-10T09:46:47.363893483Z","message":"backend-endpoint-failed","data":{"instance_id":"0"},"error":"EOF (via idempotent request)","attempt":2,"vcap_request_id":"703d8ce7-b187-41b6-4a02-a122d1ee8b60"
```
The request with vcap_request_id=703d8ce7-b187-41b6-4a02-a122d1ee8b60 was sent to the same instance_id=0 two times, reached the max_retries value and finished with 502. 

In other words according to the [Round Robin logic for seeking of next endpoint](https://github.com/cloudfoundry/gorouter/blob/6331451d30dbf996cc9132ab617f07b25707b06b/route/roundrobin.go#L91):
The Gorouter gets a first request (703d8ce7-b187-41b6-4a02-a122d1ee8b60) ->
- pool.currentIndex=0
- pool.nextIdx = 1  
- attempt = 1
- selectedEndpoint: app instance with instance_id=0
- Result: backend-endpoint-failed (EOF) & It should be retried. 

The Gorouter gets a request (ffed058e-a289-4ca9-7d22-0e345c2bd54f)
- pool.currentIndex=1
- pool.nextIdx = 0
- attempt=1
- selectedEndpoint: app instance with instance_id=1
- Result: Success

The Gorouter retries the first request (703d8ce7-b187-41b6-4a02-a122d1ee8b60) ->
- pool.currentIndex=0
- pool.nextIdx = 1  
- attempt=2
- selectedEndpoint: app instance with instance_id=0
- Result: backend-endpoint-failed (EOF) & 502 as max_retries reached. 

EOF is classified as retriable backend error but it does not belongt to FailableClassifiers.

### Desired behavior

- All requests have to be successful as two app instances can accept and process 2000 requests. 
- Gorouter's retry should not go to the same app instance twice. 

### Affected Version

routing-release: 0.301

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Enhance gorouter retry logic #437

Current behavior

Desired behavior

Affected Version

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Enhance gorouter retry logic #437

Description

Current behavior

Desired behavior

Affected Version

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions