From c24bb0c67b4251d26a80a83cc79eee2706d54a7c Mon Sep 17 00:00:00 2001
From: Paul Gottschling
Date: Wed, 2 Oct 2024 10:48:44 -0400
Subject: [PATCH] Respond to evanfreed feedback

- Describe `backend_write_requests_failed_precondition_total`
- Include the precondition metric in the write availability formula.
- Turn the `registered_servers` discussion into a discussion of Teleport
  instance version, since it's not possible to group this metric by
  service and subtract the count of Auth Service/Proxy Service instances
  from the count of all registered services.
---
 .../management/diagnostics/metrics.mdx | 53 ++++++++++++-------
 1 file changed, 34 insertions(+), 19 deletions(-)

diff --git a/docs/pages/admin-guides/management/diagnostics/metrics.mdx b/docs/pages/admin-guides/management/diagnostics/metrics.mdx
index 037c24cb3c2b..08acf7acca44 100644
--- a/docs/pages/admin-guides/management/diagnostics/metrics.mdx
+++ b/docs/pages/admin-guides/management/diagnostics/metrics.mdx
@@ -67,11 +67,31 @@ The following backend operation metrics are available:
 |Delete a range of items|`batch_write_requests`|
 |Update the keepalive status of an item|`write_requests`|
 
-You can use these metrics to define an availability formula, i.e., the
-percentage of reads or writes that succeeded. Take the sum of requests that
-succeeded (including batch requests) over the total sum of requests, multiplied
-by 100. If your backend begins to appear unavailable, you can investigate your
-backend infrastructure.
+During failed backend writes, a Teleport process also increments the
+`backend_write_requests_failed_precondition_total` metric if the cause of the
+failure is expected. For example, the metric increments during a create
+operation if a record already exists, during an update or delete operation if
+the record is not found, and during an atomic write if the resource was modified
+concurrently. All of these conditions can hold in a well-functioning Teleport
+cluster.
+
+`backend_write_requests_failed_total` also increments whenever
+`backend_write_requests_failed_precondition_total` increments, so you can use
+the precondition metric to distinguish potentially expected write failures from
+unexpected, problematic ones.
+
+You can use backend operation metrics to define an availability formula, i.e.,
+the percentage of reads or writes that succeeded. For example, in Prometheus,
+you can define a query similar to the following (the `5m` window is only an
+example). It subtracts the fraction of write requests that failed for unexpected
+reasons from 1 to get the fraction of successful writes:
+
+```
+1 - (sum(rate(backend_write_requests_failed_total[5m])) - sum(rate(backend_write_requests_failed_precondition_total[5m]))) / sum(rate(backend_write_requests_total[5m]))
+```
+
+If your backend begins to appear unavailable, you can investigate your backend
+infrastructure.
 
 ### Backend operation performance
 
@@ -160,27 +180,22 @@ by type over a given interval:
 max(teleport_reverse_tunnels_connected) by (type))
 ```
 
-### Count and version of Teleport Agents
-
-Alongside the number of connected resources and reverse tunnels, you can track
-the number of Agents in your Teleport cluster. Since you can run multiple
-Teleport services on a single Agent instance, this metric helps you understand
-the architecture of your Teleport Agent deployment so you can diagnose issues
-with resource utilization.
+## Teleport instance versions
 
 At regular intervals (around 7 seconds with jitter), the Auth Service refreshes
-its count of registered Agents. You can measure this count with the metric,
-`teleport_registered_servers`. To get the number of registered Agents by
-version, you can use this query in Grafana:
+its count of registered Teleport instances, including Agents and Teleport
+processes that run the Auth Service and Proxy Service. You can measure this
+count with the metric, `teleport_registered_servers`. To get the number of
+registered instances by version, you can use this query in Grafana:
 
 ```text
 sum by (version)(teleport_registered_servers)
 ```
 
-Since this metric is grouped by version, you can also tell how many of your
-Agents are behind the version of the Auth Service and Proxy Service, which can
-help you identify any that are at risk of violating the Teleport [version
-compatibility guarantees](../../../upgrading/overview.mdx).
+You can use this metric to tell how many of your registered Teleport instances
+are behind the version of the Auth Service and Proxy Service, which can help you
+identify any that are at risk of violating the Teleport [version compatibility
+guarantees](../../../upgrading/overview.mdx).
 
 We strongly encourage self-hosted Teleport users to enroll their Agents in
 automatic updates. You can track the count of Teleport Agents that are not