[meta] Tail-based sampling (TBS) improvements #14931

carsonip · 2024-12-12T16:23:38Z

This is a meta-issue on tail-based sampling.

Tail-based sampling comes up frequently in bug reports, as there is minimal documentation and guidance on TBS configuration. It is not clear to users how TBS works, which leads to misconfigured TBS storage size, and consequently apm-server and ES issues.

When TBS local storage (badger) fills up, writes of sampled traces fail and apm-server logs "received error writing sampled trace: configured storage limit reached (current: 127210377485, limit: 126000000000)". TBS is then bypassed and the effective sampling rate jumps to 100%. This causes a performance cliff with downstream effects: a surprising, significant increase in writes to ES, which either slows ES down and causes backpressure to apm-server, or leads to unexpectedly high storage usage in ES.
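To make the numbers in that log message easier to reason about, here is a minimal sketch that extracts the `current` and `limit` values from the message text quoted above and reports how far over the limit the storage is. The function names are hypothetical, and the regex only assumes the message fragment shown in this issue, not any particular log envelope around it.

```python
import re

# Matches only the message fragment quoted in this issue; any surrounding
# log envelope (timestamp, level, JSON fields) is deliberately ignored.
LIMIT_RE = re.compile(
    r"configured storage limit reached \(current: (\d+), limit: (\d+)\)"
)


def parse_storage_limit_error(line):
    """Return (current_bytes, limit_bytes) if the line reports a TBS
    storage limit hit, otherwise None. Hypothetical helper."""
    match = LIMIT_RE.search(line)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def overage_ratio(current, limit):
    """Fraction by which current usage exceeds the configured limit."""
    return (current - limit) / limit


line = (
    "received error writing sampled trace: configured storage limit reached "
    "(current: 127210377485, limit: 126000000000)"
)
current, limit = parse_storage_limit_error(line)
print(f"over limit by {current - limit} bytes ({overage_ratio(current, limit):.2%})")
```

For the example message, usage exceeds the limit by 1210377485 bytes, which is a little under 1% of the configured limit.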

The task list contains tasks to document TBS properly, investigate and fix bugs, and provide escape hatches where compromises are needed.

Impact: TBS is a popular feature among heavy apm-server users who rely on it to reduce ES storage requirements while retaining the value of the sampled traces. We need to ensure, and demonstrate, that TBS holds up under high load, like the rest of apm-server.

Tasks

- Benchmark and document tail-based sampling performance #11346 (labels: docs, performance)
- Configurable option to handle events failed to be processed by TBS #11127 (labels: enhancement)
- Expose TBS TTL config via integration policy #13525 (labels: enhancement)
- TBS: apm-server never recovers from storage limit exceeded in rare cases #14923 (labels: bug)
- Update badger to latest version #11546 (labels: 9.0-candidate, enhancement)
- Revisit default TBS storage size limit sampling.tail.storage_limit and storage limit handling #14933 (labels: 9.0-candidate, enhancement)
- TBS: Document monitoring of disk space used by Tail Based Sampling in public docs. #14996 (labels: enhancement)
- TBS: Expired entries stay much longer than TTL and consume disk space #15121 (labels: enhancement)
- TBS: Explore replacing badger with pebble #15246 (labels: enhancement)

lucabelluccini · 2024-12-19T14:12:00Z

[ESS/ECE only] Ability to see the disk size on Integration servers on the fly (even better, the available live disk usage) in Admin Console https://github.com/elastic/cloud/issues/128879
- Mitigation until then: point users to documentation that explains how to find the disk size
[ESS priority] Ability to automatically set the TBS max disk usage in the Integration policy as percentage of the whole disk OR set it automatically to a sane max value and freeze it (so the customer cannot exceed the maximum)
[ESS/ECE and on-premise] Ability to monitor TBS disk-related metrics on self-hosted APM Servers, Integration Servers and Integration Servers in ESS via at least a Dashboard (likely not possible to add new graphs to Stack Monitoring). The dashboards could be shipped with the apm input package or with the Elastic Agent
- [ALL] Make sure the necessary metrics are shipped (see "monitoring: apm-server not shiping all of its monitoring metrics" #14247) and are available for search and aggregations
- [ESS/ECE only], the prerequisite is to enable Metrics shipping via L&M on the deployment. This has to be documented.
- [On-premise], the prerequisite is to put in place a dedicated Metricbeat to monitor the Integration Server, which is odd
  - It would be great to have this integrated with the monitoring of all the other components via the EA Monitoring collection instead of relying on an external Metricbeat. I do not get why we are able to collect metrics from Filebeat, Metricbeat and other components, but not APM Server.
    - An alternative might be to develop a beats integration that collects monitoring data by reusing the Metricbeat module https://www.elastic.co/guide/en/beats/metricbeat/current/metricbeat-module-beat.html
  - [DOCS] The instructions we give here are no longer necessary. Since 8.15 it is possible to customize the Elastic Agent policy to set agent.monitoring.http.enabled to true (but not for the Elastic Cloud agent policy) via the dedicated Monitoring settings. See screenshot. Opened an issue: [DOCS] Better / improved instructions for APM Integration monitoring for on-premise observability-docs#4726
[ALL] Once the metrics are shipped, it would be nice to provide an out-of-the-box alert that fires when the disk is getting full due to TBS or when the soft limit is hit, so that users know when APM Server is about to let all transactions through.
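As a rough illustration of two of the ideas above (deriving the TBS max disk usage as a percentage of the whole disk with an optional frozen maximum, and alerting before the limit is hit), here is a minimal sketch. It is not how apm-server or the integration policy computes anything today. All function names, the 50% fraction and the 80% warning threshold are assumptions chosen for the example, and where the "used bytes" figure would come from is left open because that depends on which metrics end up being shipped.

```python
import shutil


def tbs_limit_from_disk(total_bytes, fraction, hard_cap_bytes=None):
    """Derive a storage limit as a fraction of total disk size.

    hard_cap_bytes models a 'sane max value' that cannot be exceeded.
    Hypothetical helper, not an existing apm-server setting.
    """
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    limit = int(total_bytes * fraction)
    if hard_cap_bytes is not None:
        limit = min(limit, hard_cap_bytes)
    return limit


def should_alert(used_bytes, limit_bytes, warn_fraction=0.8):
    """True once usage reaches warn_fraction of the limit (assumed threshold)."""
    return used_bytes >= warn_fraction * limit_bytes


# Total size of the disk holding the current directory; on a real
# deployment this would be the volume that holds the TBS storage.
total = shutil.disk_usage(".").total
limit = tbs_limit_from_disk(total, fraction=0.5)
print(f"disk: {total} bytes, derived TBS limit: {limit} bytes")
```

Alerting at a fraction of the limit, rather than at the limit itself, leaves time to react before sampling is bypassed.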

raultorrecilla · 2025-01-10T17:54:13Z

moving this to it106 as TBS changes are still not closed

simitt · 2025-01-13T12:51:54Z

Here's my take on which tasks should be handled in the current iteration and which ones could be moved to another iteration if it is not feasible to tackle them all.

it-106:

it-107:

Benchmark and document tail-based sampling performance #11346
Expose TBS TTL config via integration policy #13525 (should become part of 9.0, but is not a breaking change)
TBS: Document monitoring of disk space used by Tail Based Sampling in public docs. #14996
TBS: Expired entries stay much longer than TTL and consume disk space #15121

@raultorrecilla some of the subtasks aren't groomed yet, they need to be added to an iteration.

carsonip added the meta label Dec 12, 2024

This was referenced Dec 19, 2024

Add option to enable TBS in benchmark terraform #14985

Merged

Record tbs disk usage stats in benchtest #14995

Merged

mergify bot mentioned this issue Dec 19, 2024

[8.x] Record tbs disk usage stats in benchtest (backport #14995) #15002

Merged

raultorrecilla assigned carsonip Jan 13, 2025
