# SLA Reporting

A Service Level Agreement (SLA) is a legally binding contract between a service provider and its customer.
Its purpose is to define the level of service that the provider promises to deliver to the customer.
In short, SLA reports play an integral role for any service provider operating in today's digital world.

Icinga DB is designed to automatically identify and record the most relevant host and service events in a separate
table. By default, these events are retained forever unless you have set the retention
[`sla-days` option](03-Configuration.md#retention). It is important to note that Icinga DB records the raw events in
the database without any interpretation. To generate and visualise SLA reports of specific hosts and services
based on the accumulated events over time, [Icinga Reporting](https://icinga.com/docs/icinga-reporting/latest/doc/02-Installation/)
is the optimal complement, facilitating comprehensive SLA report generation within a specific timeframe.

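For instance, to keep SLA events for 90 days instead of forever, the retention section of the Icinga DB `config.yml`
could look roughly like this (a sketch; see the [retention configuration](03-Configuration.md#retention) for the
authoritative syntax):

```
retention:
  # Remove SLA history events older than 90 days.
  # If unset, SLA events are retained forever.
  sla-days: 90
```
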
## Technical Description

!!! info

    This documentation provides a detailed technical explanation of how Icinga DB fulfils all the
    necessary requirements for the generation of accurate service level agreement (SLA) reports.

Icinga DB provides built-in support for automatically storing the relevant events of your hosts and services without
any manual action. Generally, these events are every **hard** state change a particular host or service encounters and
all the downtimes scheduled for that host or service throughout its entire lifetime. It is important to note that the
aforementioned events are not analogous to those utilised by [Icinga DB Web](https://icinga.com/docs/icinga-db-web/latest/)
for visualising host and service states.

In the case of a hard state change of a monitored host or service, Icinga DB records the precise time of that state
change, in milliseconds, within the `sla_history_state` table. The following image serves as a visual illustration of
the relational and representational aspects of state change events.

![SLA history state](images/sla_history_state.png)

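For illustration, a sketch of a query listing the raw state change events of a single host, assuming MySQL and a
`@host` placeholder for the binary host ID:

```
SELECT event_time, previous_hard_state, hard_state
  FROM sla_history_state
 WHERE host_id = @host
 ORDER BY event_time;
```
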
In contrast, two timestamps are retained for downtimes: one indicating the commencement of the downtime, stored as
`downtime_start`, and the other denoting its end time, designated as `downtime_end`, within the `sla_history_downtime`
table. For the sake of completeness, the following image also provides a visual representation of the
`sla_history_downtime` table and its relations.

![SLA history downtime](images/sla_history_downtime.png)

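Analogously, a sketch of a query listing the downtime periods recorded for that host:

```
SELECT downtime_start, downtime_end
  FROM sla_history_downtime
 WHERE host_id = @host
 ORDER BY downtime_start;
```
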
In certain circumstances, namely when a host or service is created and subsequently never deleted, this approach has
been empirically demonstrated to be sufficient. Nevertheless, in the case of a host being deleted and then recreated
a couple of days later, the generation of SLA reports in
[Icinga Reporting](https://icinga.com/docs/icinga-reporting/latest/doc/02-Installation/) for that host at
the end of the week may yield disparate results, depending on the state of the host prior to its deletion.

In order to generate SLA reports with the greatest possible accuracy, we have decided to supplement the existing data
with information regarding the `creation` and `deletion` of hosts and services in a new `sla_lifecycle` table, introduced
in Icinga DB `1.3.0`. The upgrade script for `1.3.0` generates a `create_time` SLA lifecycle entry for all existing
hosts and services. However, since that script has no knowledge of the creation time of these existing objects, the
timestamp for them is produced in the following manner. As previously outlined, Icinga DB has been able to store
timestamps for both `hard state` changes and downtimes since its first stable release. This enables the upgrade script
to identify the earliest event time of a given host or service from the `sla_history_state` and `sla_history_downtime`
tables. In cases where no timestamps can be obtained from the aforementioned tables, it simply falls back to `now`,
i.e. in such situations, the creation time of the host or service in question is set to the current timestamp.

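The following sketch (not the actual upgrade script) illustrates how such a `create_time` could be derived for a single
host on MySQL, with `@host` as a placeholder: the outer `MIN` skips `NULL`s from tables without matching history, and
`COALESCE` provides the `now` fallback.

```
SELECT COALESCE(
    (SELECT MIN(t) FROM (
        SELECT MIN(event_time) AS t FROM sla_history_state WHERE host_id = @host
        UNION ALL
        SELECT MIN(downtime_start) FROM sla_history_downtime WHERE host_id = @host
    ) AS events),
    -- No recorded history at all: fall back to the current time in milliseconds.
    UNIX_TIMESTAMP() * 1000
) AS create_time;
```
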
Unfortunately, Icinga 2 lacks the capability to accurately determine the deletion time of an object, which this
release aims to address. It should be noted that Icinga DB is likewise incapable of identifying the precise
timestamp of an object's deletion. Instead, it simply records the time at which it processes the deletion event for
that particular object and populates the `sla_lifecycle` table accordingly. Consequently, if a host or service is
deleted while Icinga DB is stopped or not in an operational state, the events that Icinga DB records once it is
restarted will not reflect the actual deletion or creation time. Nevertheless, there is no superior method for
addressing this issue in a manner that is both expedient and graceful.

### Implementation

#### Computing SLA OK percent

The following is a simplified explanation of the current (Icinga DB `1.3.0`) methodology behind the `get_sla_ok_percent`
SQL procedure, which is used to calculate the SLA OK percent. By design, Icinga Reporting only generates reports
covering a specific timeframe. Accordingly, the `get_sla_ok_percent` SQL procedure requires as input the start and end
of the timeframe within which the SLA is to be calculated.

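Assuming `get_sla_ok_percent` is available as a stored function in the MySQL schema, taking the host ID, the service ID
and the millisecond start and end timestamps as parameters (the exact signature may differ), computing the weekly SLA
OK percent of a host could look roughly like this:

```
SELECT get_sla_ok_percent(
           host.id, -- binary(20) host ID
           NULL,    -- service ID; NULL when computing a host SLA
           UNIX_TIMESTAMP('2024-01-01') * 1000,
           UNIX_TIMESTAMP('2024-01-08') * 1000
       ) AS sla_ok_percent
  FROM host
 WHERE host.name = 'example-host';
```
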
First, it determines the total time of the specified timeframe, expressed in milliseconds, for which we're going to
compute the SLA OK percent (`total_time = timeline_end - timeline_start`).

Next, it is necessary to identify the latest [`hard_state`](#hard-state-vs-previous-hard-state) of the service or host
that occurred at or prior to the timeline start date, and to mark it as the initial one. In case the first query fails
to determine a `hard_state` entry, it proceeds to search for a [`previous_hard_state`](#hard-state-vs-previous-hard-state)
entry in the `sla_history_state` table that has been recorded after the start of the timeline. If this approach also
fails to retrieve the desired outcome, the regular non-historical `host_state` or `service_state` table is then
examined for the current state. Should this also produce no results, `OK` is used as the initial state. A sketch of
this cascading lookup is shown below.

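A hypothetical MySQL sketch of the lookup for a host, with `@host` and `@timeline_start` as placeholders (the actual
procedure handles services analogously):

```
SELECT COALESCE(
    -- 1) The latest hard state recorded at or before the timeline start.
    (SELECT hard_state FROM sla_history_state
      WHERE host_id = @host AND event_time <= @timeline_start
      ORDER BY event_time DESC LIMIT 1),
    -- 2) Otherwise, the oldest previous_hard_state recorded after the timeline start.
    (SELECT previous_hard_state FROM sla_history_state
      WHERE host_id = @host AND event_time > @timeline_start
      ORDER BY event_time ASC LIMIT 1),
    -- 3) Otherwise, the current state from the regular state table.
    (SELECT hard_state FROM host_state WHERE host_id = @host),
    -- 4) As a last resort, assume OK (0).
    0
) AS initial_hard_state;
```
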
Afterward, it traverses all the state and downtime events within the provided timeframe, performing a series of
simple arithmetic operations. The complete algorithmic process is illustrated in the following pseudocode.

```
total_time := timeline_end - timeline_start
// Mark the timeline start date as our last event time for now.
last_event_time := timeline_start
// The problem time of a given host or service is initially set to zero.
problem_time := 0
// The previous_hard_state is determined dynamically as described above; however,
// for the purposes of this analysis, we'll just set it to 'OK'.
previous_hard_state := OK
// Loop through all the state and downtime events within the provided timeframe, ordered by their timestamp.
for event in (sla_history_state, sla_history_downtime) do
    if (event.previous_hard_state is PENDING) then
        // A PENDING state event indicates that the host or service in question has not yet had a check result that
        // clearly identifies its state. Consequently, such events are irrelevant for the purposes of calculating
        // the SLA, and we must exclude the duration of that PENDING state from the total time.
        total_time = total_time - (event.event_time - last_event_time)
    else if (previous_hard_state is greater than OK/UP AND previous_hard_state is not PENDING AND checkable is not in DOWNTIME) then
        // If the previous_hard_state is set to a non-OK state and the host or service in question was not in downtime,
        // we consider that time slot to be problematic and add its duration to the problem time.
        problem_time = problem_time + event.event_time - last_event_time
    endif
    // Set the "last_event_time" to the timestamp of the event being currently processed.
    last_event_time = event.event_time
    if (event.type is "state change event") then
        // If the event being currently processed is a state change event, we mark its
        // latest hard state as the previous one for the next iteration.
        previous_hard_state = event.hard_state
    endif
endloop
```

At this point, we have computed the problem time of a particular host or service for a given timeframe. The final
step is to determine the percentage of the remaining total time. In other words, we want to find out how much of the
total time is taken up by the problem time, so that we can obtain our final SLA OK percentage result.

```
sla_ok_percent := 100 * (total_time - problem_time) / total_time
```

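For example, over a one-week timeframe (`total_time` = 604,800,000 ms) with an accumulated problem time of one hour
(3,600,000 ms), this yields `100 * (604800000 - 3600000) / 604800000`, i.e. an SLA OK percentage of roughly `99.4`.
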
## Appendix

### Hard State vs. Previous Hard State

The `hard_state` column denotes the most recent hard state of the host or service.
Conversely, the `previous_hard_state` column indicates the preceding hard state that was formerly stored in the
`hard_state` column before the host or service transitioned to a new hard state. Please refer to the table below
for an illustration of this relationship.

| previous_hard_state           | hard_state |
|-------------------------------|------------|
| PENDING (no check result yet) | OK         |
| OK                            | Warning    |
| Warning                       | Critical   |
| Critical                      | OK         |