-
Notifications
You must be signed in to change notification settings - Fork 21
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
always use the selected timeframe for sla calculation #834
Comments
First, thanks a lot for this very detailed and well written issue. I was able to reproduce your reported behavior and have integrated it in the SLA tests as follows. As one might expect, the second test "i834-2" is going to fail. diff --git a/tests/sql/sla_test.go b/tests/sql/sla_test.go
index de2bace..bde20c2 100644
--- a/tests/sql/sla_test.go
+++ b/tests/sql/sla_test.go
@@ -210,6 +210,26 @@ func TestSla(t *testing.T) {
Start: 1000,
End: 2000,
Expected: 60,
+ }, {
+ Name: "i834",
+ Events: []SlaHistoryEvent{
+ &State{Time: 1609459200000, State: 0, PreviousState: 99}, // 2021-01-01 00:00:00.0000
+ &State{Time: 1609545600000, State: 3, PreviousState: 0}, // 2021-01-02 00:00:00.0000
+ &State{Time: 1609576898000, State: 0, PreviousState: 3}, // 2021-01-02 08:41:38.0000
+ },
+ Start: 1609459200000, // 2021-01-01 00:00:00.0000
+ End: 1640995200000, // 2022-01-01 00:00:00.0000
+ Expected: 99.9008,
+ }, {
+ Name: "i834-2",
+ Events: []SlaHistoryEvent{
+ &State{Time: 1609459200000, State: 0, PreviousState: 99}, // 2021-01-01 00:00:00.0000
+ &State{Time: 1609545600000, State: 3, PreviousState: 0}, // 2021-01-02 00:00:00.0000
+ &State{Time: 1609576898000, State: 0, PreviousState: 3}, // 2021-01-02 08:41:38.0000
+ },
+ Start: 1578009600000, // 2020-01-03 00:00:00.0000
+ End: 1609632000000, // 2021-01-03 00:00:00.0000
+ Expected: 99.901,
}}
for _, test := range tests { However, the issue's core lies in the handling of initially pending data. The ongoing PR #566 tries to explain this with prose and pseudo code, https://github.com/Icinga/icingadb/blob/add-create-delete-history-events/doc/10-Sla-Reporting.md#computing-sla-ok-percent. As @yhabteab wrote over there, the current logic follows the idea that an initial pending state is equivalent to no usable data so far. This can be demonstrated with the following test case, effectively "ignoring" the first half of the tested time window. When applying your suggested schema fix, this test will fail with a SLA of 80%. icingadb/tests/sql/sla_test.go Lines 191 to 201 in 34ac386
The bug description addresses missing data for the time frame, but as I have tried to explain above, this is not the issue at hand. For example, altering "i843-2" to have a first Go Code of Changed Test
{
Name: "i834-3",
Events: []SlaHistoryEvent{
&State{Time: 1609459200000, State: 0, PreviousState: 0}, // 2021-01-01 00:00:00.0000
&State{Time: 1609545600000, State: 3, PreviousState: 0}, // 2021-01-02 00:00:00.0000
&State{Time: 1609576898000, State: 0, PreviousState: 3}, // 2021-01-02 08:41:38.0000
},
Start: 1578009600000, // 2020-01-03 00:00:00.0000
End: 1609632000000, // 2021-01-03 00:00:00.0000
Expected: 99.901,
} The concrete issue is due to initially pending data, not due to absent data. Thus, we are dealing with a design decision. Please feel free to comment on this or suggest changes. For the moment, I just wanted to describe why things are happening the way they are. |
Preface
I noticed this bug when we evaluated the icinga reporting module for our internal SLA Monitoring. The Values seemed to be off when a Host did not exist for the whole timeframe used in the report.
Describe the bug
The SLA value is not calculated based on the total time selected by the timeframe, instead, if there is no data for the whole timeframe, the total time starts with the first entry, which results in a wrong value.
For example, a timeframe of a year is selcted and the data starts after half a year, then the total_time should always be a year. Let's assume the following, we have a downtime of 08 hours 41 minutes and 38 seconds, which should result in a SLA value 99,9 over a year.
Expected: timeframe:
1 Year
total_time:1 Year
, with a downtime of08:41:38
, which results in99,9
Actual: timeframe:
1 Year
total_time:1/2 Year
, with a downtime of08:41:38
, which results in lessTo Reproduce
Steps to reproduce the behavior:
Expected behavior
Both queries have a timeframe of one year, therefore i would expect both to have the same result.
Your Environment
Additional context
The Database function is located in the icingadb schema file and in my tests the issue is resolved when its changed liked this:
And i could not think of a situation where this behaviour would be needed, but i guess it has its reason why its in there 🙂, maybe someone who has a deeper understanding of the reporting module may have a clue?
I hope its not a stupid mistake on my end, even if i tried to check everything throughly.
Last but not least, thanks to anyone who spent the time looking into it and espacially thanks for the software and all the effort everyone has put into it.
The text was updated successfully, but these errors were encountered: