Apply suggestions from code review
Co-authored-by: Nina Hingerl <[email protected]>
hisarbalik and NHingerl authored Sep 11, 2024
1 parent 947b97c commit 734c66d
Showing 1 changed file with 7 additions and 7 deletions.
14 changes: 7 additions & 7 deletions docs/contributor/arch/014-telemetry-self-monitor-storage.md
@@ -1,4 +1,4 @@
# 14. Telemetry Self Monitoring Storage
# 14. Telemetry Self-Monitoring Storage

Date: 2024-10-09

@@ -9,23 +9,23 @@ Proposed
## Context

The Telemetry module self-monitoring is crucial for the overall health of the system. The self-monitoring data is used to detect issues in the Telemetry module and to provide insights into the system's health. The self-monitoring data is stored in a time-series database (TSDB) and is used to generate alerts.
The current storage configuration and retention policy for the self-monitoring data are not well-defined, currently, some installation faces the issue self-monitoring storage fill-up and exceed the storage limit despite the retention policies 2 hours or 50 MBytes.
The Telemetry self-monitoring data is stored in the Prometheus TSDB, which is designed for large scale deployments, the amount data collected by the Telemetry self-monitoring is actually small compared to the Prometheus capabilities (currently few MBytes) nevertheless the storage size and retention policies have to be carefully configured.
The current storage configuration and retention policy for the self-monitoring data are not well-defined. Currently, some installations face the issue that self-monitoring storage fills up and exceeds the storage limit despite the retention policies of 2 hours or 50 MBytes.
The Telemetry self-monitoring data is stored in the Prometheus TSDB, which is designed for large-scale deployments. The amount of data collected by the Telemetry self-monitoring is actually small compared to the Prometheus capabilities (a few MBytes). Nevertheless, the storage size and retention policies must be carefully configured.


### Storage and Retention with TSDB

The TSDB size-based retention counts all data on disk: the WAL, checkpoints, m-mapped chunks, and persistent blocks. Although the TSDB counts all of these when deciding whether to delete data, the WAL, checkpoints, and m-mapped chunks are required for the normal operation of the TSDB.
Only persistence blocks are deleted even if all those data blocks go beyond the configured retention size. The WAL segments can grow up to 128MB before compacting, and Prometheus will keep at least 3 WAL files so called 2/3 rules. To ensure the Telemetry self-monitoring doesn't exceed the storage limit, minimum storage volume size should be calculated at least 3 * WAL segment size + some more space for other data types.
Even if all those data blocks go beyond the configured retention size, only persistence blocks are deleted. The WAL segments can grow up to 128MB before compacting, and Prometheus will keep at least 3 WAL files; so-called 2/3 rules. To ensure that Telemetry self-monitoring doesn't exceed the storage limit, minimum storage volume size should be calculated to be at least 3 * WAL segment size + some more space for other data types.
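
The sizing rule above can be sketched as simple arithmetic. The following is a minimal illustration, not part of the Telemetry module: the function name and the headroom value are assumptions chosen for the example, while the 128MB segment size and the minimum of 3 WAL files come from the text.

```python
# Hypothetical sizing helper; the names and the headroom value are
# illustrative assumptions, not part of the Telemetry module or Prometheus.
WAL_SEGMENT_MAX_MB = 128  # a WAL segment can grow up to 128MB (from the text)
MIN_WAL_SEGMENTS = 3      # Prometheus keeps at least 3 WAL files (from the text)


def min_volume_size_mb(headroom_mb: int) -> int:
    """Return a minimum volume size in MB: WAL floor plus extra headroom.

    headroom_mb covers the other data types (checkpoints, m-mapped chunks,
    persistent blocks); its value is an assumption the operator must choose.
    """
    if headroom_mb < 0:
        raise ValueError("headroom_mb must not be negative")
    return MIN_WAL_SEGMENTS * WAL_SEGMENT_MAX_MB + headroom_mb


# The WAL alone can occupy 3 * 128 = 384 MB.
wal_floor_mb = min_volume_size_mb(0)

# With an assumed 116 MB of headroom, the result is 500 MB.
volume_mb = min_volume_size_mb(116)
```

With zero headroom, the WAL floor alone is 384MB, which is why a volume sized near the 50 MBytes retention limit can still fill up.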

### TSDB Storage architecture and retention

For more information and insight into the Prometheus storage architecture and retention policy, please refer to the [Prometheus TSDB: Compaction and Retention](https://ganeshvernekar.com/blog/prometheus-tsdb-compaction-and-retention).
The TSDB WAL and checkpoint architecture is described in the [Prometheus TSDB: WAL and Checkpoint](https://ganeshvernekar.com/blog/prometheus-tsdb-wal-and-checkpoint/).
For more information about the Prometheus storage architecture and retention policy, see [Prometheus TSDB: Compaction and Retention](https://ganeshvernekar.com/blog/prometheus-tsdb-compaction-and-retention).
For the TSDB WAL and checkpoint architecture, see [Prometheus TSDB: WAL and Checkpoint](https://ganeshvernekar.com/blog/prometheus-tsdb-wal-and-checkpoint/).


## Consequences

The Telemetry self-monitoring requires or collect very small amount of data for operation (currently few MBytes), despite the small amount of data, the storage size have to be at least 500MByte for a normal and safe operation.
Even though the Telemetry self-monitoring collects very little data for operation (currently, a few MBytes), the storage size must be at least 500MByte for a normal and safe operation.

