# SLA Reporting

A Service Level Agreement (SLA) is a legally binding contract between an outsourcing and technology provider and its
customer. Its purpose is to define the level of service that the supplier promises to deliver to the customer.
In short, SLA reports play an integral role for any service provider that must demonstrate compliance with such agreements.

Icinga DB is designed to automatically identify and record the most relevant host and service events in dedicated
history tables. By default, these events are retained forever unless you have set the retention
[`sla-days` option](03-Configuration.md#retention). It is important to note that Icinga DB records the raw events in
the database without any interpretation. In order to generate and visualise SLA reports of specific hosts and services
based on the accumulated events over time, [Icinga Reporting](https://icinga.com/docs/icinga-reporting/latest/doc/02-Installation/)
is the optimal complement, facilitating comprehensive SLA report generation within a specific timeframe.

## Technical Description

!!! info

This documentation provides a detailed technical explanation of how Icinga DB fulfils all the
necessary requirements for the generation of an accurate service level agreement (SLA).

Icinga DB provides built-in support for automatically storing the relevant events of your hosts and services without
any manual intervention. Generally, these events are every **hard** state change a particular host or service encounters
and all the downtimes scheduled for that host or service throughout its entire lifetime. It is important to note that the
aforementioned events are not analogous to those utilised by [Icinga DB Web](https://icinga.com/docs/icinga-db-web/latest/)
for visualising host and service states.

In case of a hard state change of a monitored host or service, Icinga DB records the precise temporal occurrence of
that state change in milliseconds within the `sla_history_state` table. The following image serves as a visual
illustration of the relational and representational aspects of state change events.

![SLA history state](images/sla_history_state.png)

In contrast, two timestamps are retained for downtimes, one indicating the commencement of the downtime as
`downtime_start` and the other denoting its end time, designated as `downtime_end` within the `sla_history_downtime`
table. For the sake of completeness, the following image also provides a visual representation of the
`sla_history_downtime` table and its relations.

![SLA history downtime](images/sla_history_downtime.png)

In certain circumstances, namely when a host or service is created and subsequently never deleted, this approach has
been empirically demonstrated to be sufficient. Nevertheless, in the case of a host being deleted and then recreated
a couple of days later, the generation of SLA reports in
[Icinga Reporting](https://icinga.com/docs/icinga-reporting/latest/doc/02-Installation/) for that host at
the end of the week may yield disparate results, depending on the state of the host prior to its deletion.

In order to generate SLA reports with the greatest possible accuracy, we have decided to supplement the existing data
with information regarding the `creation` and `deletion` of hosts and services in a new `sla_lifecycle` table, introduced
in Icinga DB `1.3.0`. The upgrade script for `1.3.0` generates a `create_time` SLA lifecycle entry for all existing
hosts and services. However, since that script has no knowledge of the creation time of these existing objects, the
timestamp for them is produced in the following manner: as previously outlined, Icinga DB has been able to store
timestamps for both `hard state` changes and downtimes since its first stable release. This enables the upgrade script
to identify the earliest event time of a given host or service from the `sla_history_state` and `sla_history_downtime`
tables. In cases where no timestamps can be obtained from the aforementioned tables, it simply falls back to `now`,
i.e. in such situations, the creation time of the host or service in question is set to the current timestamp.
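The fallback described above can be sketched in a few lines of Python. Note that `guess_create_time` and its inputs are hypothetical names for this illustration, not part of the actual upgrade script, which implements the same logic in SQL:

```python
import time

def guess_create_time(state_event_times, downtime_start_times, now_ms=None):
    """Mimic the upgrade script's fallback: use the earliest known SLA event
    time of an object, or the current time if no history exists at all.

    Both arguments are lists of millisecond timestamps, standing in for the
    event times found in sla_history_state and sla_history_downtime.
    """
    candidates = list(state_event_times) + list(downtime_start_times)
    if candidates:
        # The earliest recorded event is the best available lower bound
        # for the object's creation time.
        return min(candidates)
    # No recorded history: fall back to "now".
    return now_ms if now_ms is not None else int(time.time() * 1000)
```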

Unfortunately, Icinga 2 lacks the capability to accurately determine the deletion time of an object, which this
proposed release aims to address. It should be noted that Icinga DB is also incapable of identifying the precise
timestamp of an object's deletion. Instead, it simply records the time at which the deletion event for that particular
object occurred and populates the `sla_lifecycle` table accordingly. Consequently, if a host or service is deleted
while Icinga DB is stopped or not in an operational state, the events that Icinga DB would otherwise record once it is
restarted will not reflect the actual deletion or creation time. Nevertheless, there is no superior method for
addressing this issue in a manner that is both expedient and graceful.

### Implementation



#### Computing SLA OK percent

The following is a simplified explanation of the current (Icinga DB `1.3.0`) methodology behind the `get_sla_ok_percent`
SQL procedure, which is used to calculate the SLA OK percentage. Since Icinga Reporting only generates reports covering
a specific timeframe, the `get_sla_ok_percent` SQL procedure requires as input the start and end of the timeframe
within which the SLA is to be calculated.

First, it determines the total time of the specified timeframe, expressed in milliseconds, for which the
SLA OK percentage is to be computed (`total_time = timeline_end - timeline_start`).

Next, it is necessary to identify the latest [`hard_state`](#hard-state-vs-previous-hard-state) of the service or host
that occurred at or prior to the timeline start date, and to mark it as the initial state. In case the first query fails
to determine a `hard_state` entry, it proceeds to search for a [`previous_hard_state`](#hard-state-vs-previous-hard-state)
entry in the `sla_history_state` table that has been recorded after the start of the timeline. If this approach also
fails to retrieve the desired outcome, the regular non-historical `host_state` or `service_state` table is then
examined for the current state. Should this also produce no results, `OK` is used as the initial state.
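This fallback chain can be summarised in a short Python sketch; the function and parameter names are illustrative, and `None` stands in for a lookup that returned no row:

```python
OK = 0  # Icinga's numeric code for the OK/Up state

def initial_hard_state(latest_hard_state_before,
                       earliest_prev_hard_state_after,
                       current_state):
    """Pick the initial state for the SLA calculation, trying in order:
    1. the latest hard_state at or before timeline_start,
    2. the earliest previous_hard_state recorded after timeline_start,
    3. the current state from host_state/service_state,
    4. OK as the final default.
    """
    for candidate in (latest_hard_state_before,
                      earliest_prev_hard_state_after,
                      current_state):
        if candidate is not None:
            return candidate
    return OK
```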

Afterwards, it traverses all the state and downtime events within the provided timeframe, performing a series of
simple arithmetic operations. The complete algorithmic process is illustrated in the following pseudocode.

```
total_time := timeline_end - timeline_start
// Mark the timeline start date as our last event time for now.
last_event_time := timeline_start
// The problem time of a given host or service is initially set to zero.
problem_time := 0
// The previous_hard_state is determined dynamically as described above, however,
// for the purposes of this analysis, we'll just set it to 'OK'.
previous_hard_state := OK
// Loop through all the state and downtime events within the provided timeframe and ordered by their timestamp.
for event in (sla_history_state, sla_history_downtime) do
if (event.previous_hard_state is PENDING) then
// A PENDING state event indicates that the host or service in question has not yet had a check result that
// clearly identifies its state. Consequently, such events become irrelevant for the purposes of calculating
// the SLA and we must exclude the duration of that PENDING state from the total time.
total_time = total_time - (event.event_time - last_event_time)
else if (previous_hard_state is greater than OK/UP AND previous_hard_state is not PENDING AND checkable is not in DOWNTIME) then
// If the previous_hard_state is set to a non-OK state and the host or service in question was not in downtime,
// we consider that time slot to be problematic and add the duration to the problem time.
problem_time = problem_time + event.event_time - last_event_time
endif
// Set the "last_event_time" to the timestamp of the event being currently processed.
last_event_time = event.event_time
if (event.type is "state change event") then
// If the event being currently processed is a state change event, we mark its
// latest hard state as the previous one for the next iteration.
previous_hard_state = event.hard_state
endif
endloop
```
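For readers who prefer executable code, the pseudocode above can be rendered as the following Python sketch. It is a simplification, not the actual SQL procedure: downtimes are modelled as explicit start/end events toggling an `in_downtime` flag, the PENDING check uses the tracked state rather than the event's `previous_hard_state` column, and the tail between the last event and `timeline_end` is accounted for explicitly. The state codes are assumptions for this sketch:

```python
OK, PENDING = 0, 99  # assumed numeric state codes for this sketch

def sla_ok_percent(events, timeline_start, timeline_end):
    """Compute the SLA OK percentage from a merged, time-ordered stream of
    state and downtime events, each a dict such as
    {"time": ..., "type": "state", "hard_state": ...} or
    {"time": ..., "type": "downtime_start"} / {"time": ..., "type": "downtime_end"}.
    """
    total_time = timeline_end - timeline_start
    last_event_time = timeline_start
    problem_time = 0
    previous_hard_state = OK  # determined dynamically in the real procedure
    in_downtime = False

    for ev in sorted(events, key=lambda e: e["time"]):
        if previous_hard_state == PENDING:
            # Exclude PENDING periods from the total time.
            total_time -= ev["time"] - last_event_time
        elif previous_hard_state > OK and not in_downtime:
            # Non-OK state outside a downtime counts as problem time.
            problem_time += ev["time"] - last_event_time
        last_event_time = ev["time"]
        if ev["type"] == "state":
            previous_hard_state = ev["hard_state"]
        elif ev["type"] == "downtime_start":
            in_downtime = True
        elif ev["type"] == "downtime_end":
            in_downtime = False

    # Account for the span between the last event and the timeline end.
    if previous_hard_state == PENDING:
        total_time -= timeline_end - last_event_time
    elif previous_hard_state > OK and not in_downtime:
        problem_time += timeline_end - last_event_time

    return 100 * (total_time - problem_time) / total_time
```

For example, a service that goes hard CRITICAL at `t=200` and recovers at `t=600` within a `0..1000` timeframe yields 60 % SLA OK; if a downtime covers `300..500`, only 200 ms count as problem time, yielding 80 %.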

At this point, we have computed the problem time of a particular host or service for a given timeframe. The final
step is to determine the percentage of the remaining total time. In other words, we want to find out how much of the
total time is taken up by the problem time, so that we can obtain our final SLA OK percentage result.

```
sla_ok_percent := 100 * (total_time - problem_time) / total_time
```
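As a worked example, one hour of problem time within a one-week timeframe yields roughly 99.4 % SLA OK:

```python
total_time = 7 * 24 * 3600 * 1000   # one week, in milliseconds
problem_time = 3600 * 1000          # one hour of problem time
sla_ok_percent = 100 * (total_time - problem_time) / total_time
# round(sla_ok_percent, 2) == 99.4
```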

## Appendix

### Hard State vs. Previous Hard State

The `hard_state` column denotes the most recent hard state of the host or service.
Conversely, the `previous_hard_state` column indicates the preceding hard state that was formerly stored in the
`hard_state` column prior to the host or service transitioning to a new hard state. Please refer to the tabular
representation below for a visual representation of this information.

| previous_hard_state | hard_state |
|-------------------------------|------------|
| PENDING (no check result yet) | OK |
| OK | Warning |
| Warning | Critical |
| Critical | OK |
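The pattern in the table above, where each row's `previous_hard_state` is simply the `hard_state` of the preceding transition, can be illustrated with a short snippet (the state names are taken from the table):

```python
# A checkable's hard states over time, starting before its first check result.
hard_states = ["PENDING", "OK", "Warning", "Critical", "OK"]

# Each recorded transition pairs the old hard state with the new one,
# so previous_hard_state always trails hard_state by one transition.
rows = list(zip(hard_states, hard_states[1:]))
# rows == [("PENDING", "OK"), ("OK", "Warning"),
#          ("Warning", "Critical"), ("Critical", "OK")]
```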
