[meta] Tail-based sampling (TBS) improvements #14931
This was referenced Dec 19, 2024
Moving this to it106, as the TBS changes are still not closed.
Here's my take on which tasks should be handled in the current iteration and which ones could be moved to another iteration if it is not feasible to tackle them all.
it-107:
@raultorrecilla some of the subtasks aren't groomed yet; they still need to be added to an iteration.
This is a meta-issue on tail-based sampling.
Tail-based sampling comes up frequently in bug reports, as there is minimal documentation and guidance on TBS configuration. It is not clear to users how TBS works, which leads to misconfigured TBS storage size, and consequently apm-server and ES issues.
When TBS local storage (badger) fills up, writing traces fails and apm-server logs
received error writing sampled trace: configured storage limit reached (current: 127210377485, limit: 126000000000)
TBS is then bypassed as the sampling rate jumps to 100%, causing a performance cliff with downstream effects: a surprising, significant increase in writes to ES, which either slows ES and causes backpressure to apm-server, or leads to unexpectedly high storage usage in ES.
The task list contains tasks to document TBS properly, investigate and fix bugs, and provide escape hatches for compromises.
Impact: TBS is a popular feature among heavy apm-server users who rely on TBS to reduce ES storage requirements while retaining the value of the sampled traces. We need to ensure and show that TBS is good for high load, like the rest of apm-server.
Tasks
sampling.tail.storage_limit and storage limit handling #14933