Respond to evanfreed feedback
- Describe `backend_write_requests_failed_precondition_total`
- Include the precondition metric in the write availability formula.
- Turn the `registered_servers` discussion into a discussion of Teleport
  instance version, since it's not possible to group this metric by
  service and subtract the count of Auth Service/Proxy Service instances
  from the count of all registered services.
ptgott authored and github-actions committed Oct 9, 2024
1 parent 4e4536d commit c24bb0c
Showing 1 changed file: `docs/pages/admin-guides/management/diagnostics/metrics.mdx` (34 additions, 19 deletions)
The following backend operation metrics are available:
|Delete a range of items|`batch_write_requests`|
|Update the keepalive status of an item|`write_requests`|

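For example, to chart overall backend write throughput, you can take the rate of
the write-request counter. A minimal sketch, assuming a five-minute rate window
and the `teleport_` metric prefix that the other queries in this guide use:

```text
sum(rate(teleport_backend_write_requests_total[5m]))
```
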
During failed backend writes, a Teleport process also increments the
`backend_write_requests_failed_precondition_total` metric if the cause of the
failure is expected. For example, the metric increments during a create
operation if a record already exists, during an update or delete operation if
the record is not found, and during an atomic write if the resource was modified
concurrently. All of these conditions can hold in a well-functioning Teleport
cluster.
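For example, to watch the rate of expected write failures on its own, you could
chart a query similar to the following:

```text
sum(rate(teleport_backend_write_requests_failed_precondition_total[5m]))
```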

Whenever `backend_write_requests_failed_precondition_total` increments,
`backend_write_requests_failed_total` increments as well, so expected failures
form a subset of all failed writes. The difference between the two metrics lets
you distinguish unexpected, problematic write failures from potentially
expected ones.
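A minimal PromQL sketch of that difference, which approximates the rate of
unexpected write failures:

```text
sum(rate(teleport_backend_write_requests_failed_total[5m]))
-
sum(rate(teleport_backend_write_requests_failed_precondition_total[5m]))
```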

You can use backend operation metrics to define an availability formula, i.e.,
the percentage of reads or writes that succeeded. For example, in Prometheus,
you can define a query similar to the following, which takes the fraction of
write requests that failed for unexpected reasons and subtracts it from 1 to
get the fraction of successful writes (the five-minute rate window is an
arbitrary choice):

```text
1 - (
  (
    sum(rate(teleport_backend_write_requests_failed_total[5m]))
    -
    sum(rate(teleport_backend_write_requests_failed_precondition_total[5m]))
  )
  /
  sum(rate(teleport_backend_write_requests_total[5m]))
)
```

If your backend begins to appear unavailable, you can investigate your backend
infrastructure.
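If you alert on write availability, one option is to fire when the ratio of
unexpectedly failed writes crosses a threshold. A sketch of the alert condition,
where the `0.01` (one percent) threshold is an arbitrary value to tune for your
environment:

```text
(
  sum(rate(teleport_backend_write_requests_failed_total[5m]))
  -
  sum(rate(teleport_backend_write_requests_failed_precondition_total[5m]))
)
/
sum(rate(teleport_backend_write_requests_total[5m]))
> 0.01
```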

### Backend operation performance

by type over a given interval:

```text
max(teleport_reverse_tunnels_connected) by (type)
```
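If you want Prometheus itself to compute the interval maximum rather than your
graphing tool, a subquery sketch (the one-hour window and one-minute resolution
are assumptions to adjust):

```text
max_over_time(max(teleport_reverse_tunnels_connected) by (type)[1h:1m])
```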

## Teleport instance versions

At regular intervals (around 7 seconds with jitter), the Auth Service refreshes
its count of registered Teleport instances, including Agents and Teleport
processes that run the Auth Service and Proxy Service. You can measure this
count with the `teleport_registered_servers` metric. To get the number of
registered instances by version, you can use this query in Grafana:

```text
sum by (version)(teleport_registered_servers)
```

You can use this metric to tell how many of your registered Teleport instances
are behind the version of the Auth Service and Proxy Service, which can help you
identify any that are at risk of violating the Teleport [version compatibility
guarantees](../../../upgrading/overview.mdx).
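To rank versions by instance count, for example in the Prometheus expression
browser's table view, you can sort the same aggregation:

```text
sort_desc(sum by (version)(teleport_registered_servers))
```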

We strongly encourage self-hosted Teleport users to enroll their Agents in
automatic updates. You can track the count of Teleport Agents that are not
