-
Notifications
You must be signed in to change notification settings - Fork 9
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[RFC] How to deal in handlers with check exit status out of bounds #127
Comments
I have been thinking about this, I want you to know I have not been ignoring this I just wanted to gather some thoughts and allow time for some other people to weigh in. It's been a number of years since I have played with nagios so my memory is a bit hazy. I can't find any documentation on their definition of "Out of Bounds". If you could point me to that it would be helpful. I only found sparse references in the cpp code (which its been a while since I have written):
Generally when I think "Out of Bounds" I think in context of running out of a resource, going beyond the index of an array, etc. To me this translates to a bash 255 which indicates that a process tried to return a status code that was greater than 255. After reading through this section of code it becomes a bit more clear as to its intent. I am not opposed to creating a specific library method to map status exit codes to more meaningful error messages. I think I could be talked into supporting "Out of Bounds" on the whole but I would want to see others weigh in on here as well. |
Naively...it could seem useful to map error code to whatever errno thinks the status means in some handlers.. as errno defines a defacto standard for errors messages.
We have to be careful here...what the return code could mean maybe sensu client/agent specific..and maybe different than what the sensu server/backend understands as standard errors codes... But to do this right, we'd have to be able to instruct the handler as to which error code override mapping to use. |
Hi, "If we follow the convention of returning only standard [0-3] status codes, we might be masking any of the above under an unknown error message." My opinion: |
I have thought more about this and I think I am leaning back towards my initial stance that sensu should concern itself with error code
It's very easy to say "oh a 127 means command/file not found". But is that really the case on all platforms, shells, and methods of execution? While everything I maintain is linux and as a consequence sometimes do not consider the various behaviors on other platforms we need to be cautious of this trap. From a sensu server standpoint it can't know what it means as it is simply responding to information that the client has supplied. The client simply captures the input, output, exit code, and message and forwards this to the server. The message, knowledge of the platform/system, exit code, and method of execution is allows us to determine the meaning of the alert and how to respond. As skilled practitioners we often do not realize how much we intrinsically know about a system because we are so familiar with it. Unless we want to build a very comprehensive system that inspects all of these in (near) real time which would incur significant performance penalties and deep understanding of every system I think we should stick to having sensu concern itself with alert severity rather than trying to guess what it means. |
I recently launched a PR for
handler-slack-multichannel.rb
(see here) in which I suggested changing the approach on how to deal with exit codes from checks that are out of bounds (out of the [0-3] range).In the discussion that started in the PR I suggested following what (at least in my experience) has been a standard from nagios conventions into pretty much anything I used that follows the nagios path (including icinga 1&2 and sensu), that when a check script returns a non [0-3] exit code it should not be masked as UNKNOWN status.
The logic behind this approach is the following:
curl
) that exits in error code and is not caughtIf we follow the convention of returning only standard [0-3] status codes, we might be masking any of the above under an unknown error message.
In my personal experience it has always been useful to be able to see the explicit error code and report it to the operator, since this helps identifying the issue faster and clearer.
Alternatively, I would have this as a configurable option that would allow handlers to set an override for any non-standard status code as whatever we prefer (some people would like to have it as UNKNOWN but I have also been in scenarios where it was preferred to report such failures as CRITICAL).
I would like to hear more opinions regarding this.
The text was updated successfully, but these errors were encountered: