otlpmetrichttp: load-balance between multiple endpoint ips #5838

Open
sh0rez opened this issue Sep 24, 2024 · 2 comments
Labels
enhancement (New feature or request), response needed (Waiting on user input before progress can be made)

Comments

@sh0rez
Member

sh0rez commented Sep 24, 2024

Problem Statement

I want to horizontally scale the OTel collector and have the SDK (somewhat evenly) distribute requests to collector instances.

I have a Headless Service for my collector that returns all instances when querying via DNS:

$ dig otelcol
;; ANSWER SECTION:
otelcol.                600     IN      A       172.22.0.5
otelcol.                600     IN      A       172.22.0.8

However, the Go HTTP client that this package uses keeps the TCP connection alive and reuses it, so the SDK sticks to the first address it connected to until that address becomes unreachable.

This also applies to regular k8s Services: once the TCP connection is open, no further load balancing takes place on the k8s side.

There is golang/go#34511 requesting this for the standard library, but no real progress has been made since 2019.

Proposed Solution

Instead of relying on the HTTP Client to determine the endpoint out of the DNS list, do the following:

  • manually keep a list of endpoints
  • refresh it every n seconds (once per minute?)
  • for each write request, choose an IP from the list above on a round-robin or random basis

If deemed acceptable, I am happy to contribute this functionality.

Alternatives

Disable Keepalive

By disabling HTTP keep-alives (connection reuse), a new connection is made on every request, which includes a DNS lookup.
I confirmed this works by patching SDK internals, but it is inefficient.

Use custom RoundTripper

In the Go issue the use of https://github.com/CAFxX/balancer is suggested.

This, however, leads to a DNS lookup on every request, which is undesirable.

Have users deploy server-side loadbalancers

Of course, this can be fixed server-side by deploying another layer of load-balancing proxies (nginx, etc.) in front of the OTel collector.
This greatly complicates the pipeline setup though, as one might end up with three layers (HTTP load balancing, a stateless collector for sticky OTLP load balancing, and a stateful collector for processing).

@sh0rez sh0rez added the enhancement New feature or request label Sep 24, 2024
@dmathieu
Member

Keeping a list of multiple endpoints is something that would break the specification requirements for OTLP exporters.
https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/protocol/exporter.md#configuration-options

Also, if we start doing that, it's a feature we're introducing to a stable component. We won't be able to remove it when/if Go fixes this and it's no longer necessary.

Using a custom round tripper/transport is also not going to be possible for now. See #2632

Disabling keep-alives could be a valid option to add to the HTTP exporter clients.

@dmathieu
Member

@sh0rez can this be closed?

@dmathieu dmathieu added the response needed Waiting on user input before progress can be made label Nov 27, 2024