Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[exporter/loadbalancing] feat(lb): Introduce the ability to load balance on composite keys in lb #36567

Open
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

jpkrohling
Copy link
Member

@jpkrohling jpkrohling commented Nov 27, 2024

Right now, there's a problem at high throughput using the load balancer
and the service.name resource attribute: The load balancers themself
get slow. While it's possible to vertically scale them to a point (e.g.
about 100k req/sec), as they get slow they star tot back up traffic and
block on requests. Applications then can't write as many spans out, and
start dropping spans.

This commit seeks to address that by extending the load balancing
collector to allow create a composite from attributes that can still
keep the load balancing decision "consistent enough" to reduce
cardinallity, but still spread the load across ${N} collectors.

It doesn't make too many assumptions about how the operators will use
this, except that the underlying data (the spans) are unlikely to be
complete in all cases, and the key generation is "best effort". This is
a deviation from the existing design, in which hard-requires
"span.name".

== Design Notes
=== Contributor Skill

As a contributor, I'm very much new to the opentelemetry collector, and
do not anticipate I will be contributing much except for as needs
require to tune the collectors that I am responsible for. Given this,
the code may violate certain assumptions that are otherwise "well
known".

=== Required Knowledge

The biggest surprise in this code was that despite accepting a slice,
the routingIdentifierFromTraces function assumes spans have been
processed with the batchpersignal.SplitTraces() function, which appears
to ensure taht each "trace" contains only a single span (thus allowing
them to be multiplexed effectively)

This allows the function to be simplified quite substantially.

=== Use case

The primary use case I am thinking about when writing this work is
calculating metrics in the spanmetricsconnector component. Essentially,
services drive far too much traffic for a single collector instance to
handle, so we need to multiplex it in a way that still allows them to be
calculated in a single place (limiting cardinality) but also, spreads
the load across ${N} collectors.

=== Traces only implementation

This commit addreses this only for traces, as I only care about traces.
The logic can likely be extended easily, however.

Fixes #35320
Fixes #33660

Right now, there's a problem at high throughput using the load balancer
and the `service.name` resource attribute: The load balancers themself
get slow. While it's possible to vertically scale them to a point (e.g.
about 100k req/sec), as they get slow they star tot back up traffic and
block on requests. Applications then can't write as many spans out, and
start dropping spans.

This commit seeks to address that by extending the load balancing
collector to allow create a composite from attributes that can still
keep the load balancing decision "consistent enough" to reduce
cardinallity, but still spread the load across ${N} collectors.

It doesn't make too many assumptions about how the operators will use
this, except that the underlying data (the spans) are unlikely to be
complete in all cases, and the key generation is "best effort". This is
a deviation from the existing design, in which hard-requires
"span.name".

== Design Notes
=== Contributor Skill

As a contributor, I'm very much new to the opentelemetry collector, and
do not anticipate I will be contributing much except for as needs
require to tune the collectors that I am responsible for. Given this,
the code may violate certain assumptions that are otherwise "well
known".

=== Required Knowledge

The biggest surprise in this code was that despite accepting a slice,
the routingIdentifierFromTraces function assumes spans have been
processed with the batchpersignal.SplitTraces() function, which appears
to ensure taht each "trace" contains only a single span (thus allowing
them to be multiplexed effectively)

This allows the function to be simplified quite substantially.

=== Use case

The primary use case I am thinking about when writing this work is
calculating metrics in the spanmetricsconnector component. Essentially,
services drive far too much traffic for a single collector instance to
handle, so we need to multiplex it in a way that still allows them to be
calculated in a single place (limiting cardinality) but also, spreads
the load across ${N} collectors.

=== Traces only implementation

This commit addreses this only for traces, as I only care about traces.
The logic can likely be extended easily, however.
@jpkrohling
Copy link
Member Author

Original work by @dhftah on #35125. This is just a rebase of that to address conflicts and linter issues.

@jpkrohling jpkrohling changed the title feat(lb): Introduce the ability to load balance on composite keys in lb [exporter/loadbalancing] feat(lb): Introduce the ability to load balance on composite keys in lb Nov 27, 2024
Signed-off-by: Juraci Paixão Kröhling <[email protected]>
@dhftah
Copy link
Contributor

dhftah commented Dec 9, 2024

(How do we move this forward?)

@jpkrohling
Copy link
Member Author

(How do we move this forward?)

I think @atoulme was AFK for a few days. If he's OK, I'd prefer to merge this as is, and work on the follow-up items on new PRs.

Copy link
Contributor

This PR was marked stale due to lack of activity. It will be closed in 14 days.

@github-actions github-actions bot added the Stale label Dec 24, 2024
Copy link
Contributor

github-actions bot commented Jan 7, 2025

Closed as inactive. Feel free to reopen if this PR is still being worked on.

@github-actions github-actions bot closed this Jan 7, 2025
@atoulme atoulme reopened this Feb 3, 2025
Copy link
Contributor

@atoulme atoulme left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for missing this PR. LGTM

@github-actions github-actions bot removed the Stale label Feb 4, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Routing by origin - service is not good enough Load Balancing by Attribute
5 participants