Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix long task timer implementation #1674

Merged
merged 1 commit into from
Jul 31, 2024
Merged

Fix long task timer implementation #1674

merged 1 commit into from
Jul 31, 2024

Conversation

tylerwowen
Copy link
Contributor

@tylerwowen tylerwowen commented Jul 24, 2024

The implementation of CumulativeHistogramLongTaskTimer in micrometer has a known issue. Besides, the implementation of writeLongTaskTimer is not compatible with CumulativeHistogramLongTaskTimer and results in non-sense histogram. The +Inf bucket should be the total counts of elements in the histogram, but it's set to the active count.

double count = timer.activeTasks();
List<String> metrics = new ArrayList<>();
metrics.add(writeMetricWithSuffix(timer.getId(), "active.count", wallTime, count));
metrics.add(
writeMetricWithSuffix(
timer.getId(), "duration.sum", wallTime, timer.duration(baseTimeUnit)));
metrics.add(writeMetricWithSuffix(timer.getId(), "max", wallTime, timer.max(baseTimeUnit)));
if (percentileValues.length > 0) {
metrics.addAll(writePercentiles(timer, wallTime, percentileValues));
}
if (histogramCounts.length > 0) {
metrics.addAll(writeHistogram(wallTime, timer, histogramCounts, count, baseTimeUnit));

Because PinStats supports GaugeHistogram, this PR adopts the second recommendation in the micrometer issue.

Alternatively, GaugeHistogram is exactly what we want and doesn't require the logic of adding the previous snapshot to maintain monotonicity. Unfortunately, GaugeHistogram is not available in Prometheus' text format; it's only defined in OpenMetrics.

GaugeHistogram works for Teletraan's use case because the purpose of this metrics is to allow users to detect out of SLA host launches immediately. It's not intended to provide the histogram of actual host launch durations.

The fix is simply to inherit from DefaultLongTaskTimer instead of CumulativeHistogramLongTaskTimer.

@tylerwowen tylerwowen requested a review from a team as a code owner July 24, 2024 18:45
@github-actions github-actions bot added the deploy-service Includes changes to deploy-service label Jul 24, 2024
@tylerwowen tylerwowen merged commit d144d49 into master Jul 31, 2024
6 checks passed
@tylerwowen tylerwowen deleted the touyang/ltt_fix branch July 31, 2024 18:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
deploy-service Includes changes to deploy-service
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants