# SLA Reporting

A Service Level Agreement (SLA) is a legally binding contract between an outsourcing and technology provider and its
customer. Its purpose is to define the level of service that the supplier promises to deliver to the customer.
In short, SLA reports play an integral role for any service provider that must demonstrate compliance with such agreements.

Icinga DB is designed to automatically identify and record the most relevant host and service events in dedicated
history tables. By default, these events are retained forever unless you have set the retention
[`sla-days` option](03-Configuration.md#retention). It is important to note that Icinga DB records the raw events in
the database without any interpretation. In order to generate and visualise SLA reports of specific hosts and services
based on the accumulated events over time, [Icinga Reporting](https://icinga.com/docs/icinga-reporting/latest/doc/02-Installation/)
is the optimal complement, facilitating comprehensive SLA report generation within a specific timeframe.

## Technical Description

!!! info

This documentation provides a detailed technical explanation of how Icinga DB fulfils all the
necessary requirements for the generation of an accurate service level agreement (SLA).

Icinga DB provides built-in support for automatically storing the relevant events of your hosts and services without
any manual intervention. Generally, these events are every **hard** state change a particular host or service encounters
and all the downtimes scheduled for that host or service throughout its entire lifetime. It is important to note that the
aforementioned events are not analogous to those utilised by [Icinga DB Web](https://icinga.com/docs/icinga-db-web/latest/)
for visualising host and service states.

In case of a hard state change of a monitored host or service, Icinga DB records the precise temporal occurrence of
that state change in milliseconds within the `sla_history_state` table. The following image serves as a visual
illustration of the relational and representational aspects of state change events.

![SLA history state](images/sla_history_state.png)

In contrast, two timestamps are retained for downtimes, one indicating the commencement of the downtime as
`downtime_start` and the other denoting its end time, designated as `downtime_end` within the `sla_history_downtime`
table. For the sake of completeness, the following image also provides a visual representation of the
`sla_history_downtime` table and its relations.

![SLA history downtime](images/sla_history_downtime.png)

In certain circumstances, namely when a host or service is created and subsequently never deleted, this approach has
been empirically demonstrated to be sufficient. Nevertheless, in the case of a host being deleted and then recreated
a couple of days later, the generation of SLA reports in
[Icinga Reporting](https://icinga.com/docs/icinga-reporting/latest/doc/02-Installation/) for that host at
the end of the week may yield disparate results, depending on the state of the host prior to its deletion.

In order to generate SLA reports with the greatest possible accuracy, we have decided to supplement the existing data
with information regarding the `creation` and `deletion` of hosts and services in a new `sla_lifecycle` table, introduced
in Icinga DB `1.3.0`. The upgrade script for `1.3.0` generates a `create_time` SLA lifecycle entry for all existing
hosts and services. However, since that script has no knowledge of the creation time of these existing objects, the
timestamp for them is produced in the following manner: as previously outlined, Icinga DB has been able to store
timestamps for both `hard state` changes and downtimes since its first stable release. This enables the upgrade script
to identify the earliest event time of a given host or service from the `sla_history_state` and `sla_history_downtime`
tables. In cases where no timestamps can be obtained from the aforementioned tables, it simply falls back to `now`,
i.e. in such situations, the creation time of the host or service in question is set to the current timestamp.
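The fallback described above can be sketched in a few lines of Python. Note that `guess_create_time` and its inputs are hypothetical names for this illustration, not part of the actual upgrade script, which implements the same logic in SQL:

```python
import time

def guess_create_time(state_event_times, downtime_start_times, now_ms=None):
    """Mimic the upgrade script's fallback: use the earliest known SLA event
    time of an object, or the current time if no history exists at all.

    Both arguments are lists of millisecond timestamps, standing in for the
    event times found in sla_history_state and sla_history_downtime.
    """
    candidates = list(state_event_times) + list(downtime_start_times)
    if candidates:
        # The earliest recorded event is the best available lower bound
        # for the object's creation time.
        return min(candidates)
    # No recorded history: fall back to "now".
    return now_ms if now_ms is not None else int(time.time() * 1000)
```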

Unfortunately, Icinga 2 lacks the capability to accurately determine the deletion time of an object, which this
proposed release aims to address. It should be noted that Icinga DB is also incapable of identifying the precise
timestamp of an object's deletion. Instead, it simply records the time at which the deletion event for that particular
object occurred and populates the `sla_lifecycle` table accordingly. Consequently, if a host or service is deleted
while Icinga DB is stopped or not in an operational state, the events that Icinga DB would otherwise record once it is
restarted will not reflect the actual deletion or creation time. Nevertheless, there is no superior method for
addressing this issue in a manner that is both expedient and graceful.

### Implementation



#### Computing SLA OK percent

The following is a simplified explanation of the current (Icinga DB `1.3.0`) methodology behind the `get_sla_ok_percent`
SQL procedure, which is used to calculate the SLA OK percentage. Since Icinga Reporting only generates reports covering
a specific timeframe, the `get_sla_ok_percent` SQL procedure requires as input the start and end of the timeframe
within which the SLA is to be calculated.

First, it determines the total time of the specified timeframe, expressed in milliseconds, for which the
SLA OK percentage is to be computed (`total_time = timeline_end - timeline_start`).

Next, it is necessary to identify the latest [`hard_state`](#hard-state-vs-previous-hard-state) of the service or host
that occurred at or prior to the timeline start date, and to mark it as the initial state. In case the first query fails
to determine a `hard_state` entry, it proceeds to search for a [`previous_hard_state`](#hard-state-vs-previous-hard-state)
entry in the `sla_history_state` table that has been recorded after the start of the timeline. If this approach also
fails to retrieve the desired outcome, the regular non-historical `host_state` or `service_state` table is then
examined for the current state. Should this also produce no results, `OK` is used as the initial state.
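This fallback chain can be summarised in a short Python sketch; the function and parameter names are illustrative, and `None` stands in for a lookup that returned no row:

```python
OK = 0  # Icinga's numeric code for the OK/Up state

def initial_hard_state(latest_hard_state_before,
                       earliest_prev_hard_state_after,
                       current_state):
    """Pick the initial state for the SLA calculation, trying in order:
    1. the latest hard_state at or before timeline_start,
    2. the earliest previous_hard_state recorded after timeline_start,
    3. the current state from host_state/service_state,
    4. OK as the final default.
    """
    for candidate in (latest_hard_state_before,
                      earliest_prev_hard_state_after,
                      current_state):
        if candidate is not None:
            return candidate
    return OK
```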

Afterwards, it traverses all the state and downtime events within the provided timeframe, performing a series of
simple arithmetic operations. The complete algorithmic process is illustrated in the following pseudocode.

```
total_time := timeline_end - timeline_start
// Mark the timeline start date as our last event time for now.
last_event_time := timeline_start
// The problem time of a given host or service is initially set to zero.
problem_time := 0
// The previous_hard_state is determined dynamically as described above, however,
// for the purposes of this analysis, we'll just set it to 'OK'.
previous_hard_state := OK
// Loop through all the state and downtime events within the provided timeframe and ordered by their timestamp.
for event in (sla_history_state, sla_history_downtime) do
if (event.previous_hard_state is PENDING) then
// A PENDING state event indicates that the host or service in question has not yet had a check result that
// clearly identifies its state. Consequently, such events become irrelevant for the purposes of calculating
// the SLA and we must exclude the duration of that PENDING state from the total time.
total_time = total_time - (event.event_time - last_event_time)
else if (previous_hard_state is greater than OK/UP AND previous_hard_state is not PENDING AND checkable is not in DOWNTIME) then
// If the previous_hard_state is set to a non-OK state and the host or service in question was not in downtime,
// we consider that time slot to be problematic and add the duration to the problem time.
problem_time = problem_time + event.event_time - last_event_time
endif
// Set the "last_event_time" to the timestamp of the event being currently processed.
last_event_time = event.event_time
if (event.type is "state change event") then
// If the event being currently processed is a state change event, we mark its
// latest hard state as the previous one for the next iteration.
previous_hard_state = event.hard_state
endif
endloop
```
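For readers who prefer executable code, the pseudocode above can be rendered as the following Python sketch. It is a simplification, not the actual SQL procedure: downtimes are modelled as explicit start/end events toggling an `in_downtime` flag, the PENDING check uses the tracked state rather than the event's `previous_hard_state` column, and the tail between the last event and `timeline_end` is accounted for explicitly. The state codes are assumptions for this sketch:

```python
OK, PENDING = 0, 99  # assumed numeric state codes for this sketch

def sla_ok_percent(events, timeline_start, timeline_end):
    """Compute the SLA OK percentage from a merged, time-ordered stream of
    state and downtime events, each a dict such as
    {"time": ..., "type": "state", "hard_state": ...} or
    {"time": ..., "type": "downtime_start"} / {"time": ..., "type": "downtime_end"}.
    """
    total_time = timeline_end - timeline_start
    last_event_time = timeline_start
    problem_time = 0
    previous_hard_state = OK  # determined dynamically in the real procedure
    in_downtime = False

    for ev in sorted(events, key=lambda e: e["time"]):
        if previous_hard_state == PENDING:
            # Exclude PENDING periods from the total time.
            total_time -= ev["time"] - last_event_time
        elif previous_hard_state > OK and not in_downtime:
            # Non-OK state outside a downtime counts as problem time.
            problem_time += ev["time"] - last_event_time
        last_event_time = ev["time"]
        if ev["type"] == "state":
            previous_hard_state = ev["hard_state"]
        elif ev["type"] == "downtime_start":
            in_downtime = True
        elif ev["type"] == "downtime_end":
            in_downtime = False

    # Account for the span between the last event and the timeline end.
    if previous_hard_state == PENDING:
        total_time -= timeline_end - last_event_time
    elif previous_hard_state > OK and not in_downtime:
        problem_time += timeline_end - last_event_time

    return 100 * (total_time - problem_time) / total_time
```

For example, a service that goes hard CRITICAL at `t=200` and recovers at `t=600` within a `0..1000` timeframe yields 60 % SLA OK; if a downtime covers `300..500`, only 200 ms count as problem time, yielding 80 %.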

At this point, we have computed the problem time of a particular host or service for a given timeframe. The final
step is to determine the percentage of the remaining total time. In other words, we want to find out how much of the
total time is taken up by the problem time, so that we can obtain our final SLA OK percentage result.

```
sla_ok_percent := 100 * (total_time - problem_time) / total_time
```
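As a worked example, one hour of problem time within a one-week timeframe yields roughly 99.4 % SLA OK:

```python
total_time = 7 * 24 * 3600 * 1000   # one week, in milliseconds
problem_time = 3600 * 1000          # one hour of problem time
sla_ok_percent = 100 * (total_time - problem_time) / total_time
# round(sla_ok_percent, 2) == 99.4
```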

## Appendix

### Hard State vs. Previous Hard State

The `hard_state` column denotes the most recent hard state of the host or service.
Conversely, the `previous_hard_state` column indicates the preceding hard state that was formerly stored in the
`hard_state` column prior to the host or service transitioning to a new hard state. Please refer to the tabular
representation below for a visual representation of this information.

| previous_hard_state | hard_state |
|-------------------------------|------------|
| PENDING (no check result yet) | OK |
| OK | Warning |
| Warning | Critical |
| Critical | OK |
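The pattern in the table above, where each row's `previous_hard_state` is simply the `hard_state` of the preceding transition, can be illustrated with a short snippet (the state names are taken from the table):

```python
# A checkable's hard states over time, starting before its first check result.
hard_states = ["PENDING", "OK", "Warning", "Critical", "OK"]

# Each recorded transition pairs the old hard state with the new one,
# so previous_hard_state always trails hard_state by one transition.
rows = list(zip(hard_states, hard_states[1:]))
# rows == [("PENDING", "OK"), ("OK", "Warning"),
#          ("Warning", "Critical"), ("Critical", "OK")]
```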
