Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

handler-slack-multichannel.rb: several improvements #82

Open
wants to merge 7 commits into
base: master
Choose a base branch
from

Conversation

kali-brandwatch
Copy link
Contributor

  • Add support for single channel defined as a string rather than an array (removes the need of having 2 different handlers)
  • Implement a retry strategy for handling slack api transitient errors (like timeouts). Retries hardcoded to 5 with 10 second timeout, but we should make this a setting
  • Exits with non zero status on api call errors, to allow sensu-server to catch the output and log it
  • Deals with non-nagios standard exit codes for checks (0-3), also known as "out of bounds" errors
    • Sets the color the same as UNKNOWN
    • Displays the exit code as it was returned by the check, for detailed information

Pull Request Checklist

Is this in reference to an existing issue?

General

  • Update Changelog following the conventions laid out here

  • Update README with any necessary configuration snippets

  • Binstubs are created if needed

  • RuboCop passes

  • Existing tests pass

New Plugins

  • Tests

  • Add the plugin to the README

  • Does it have a complete header as outlined here

Purpose

Known Compatibility Issues

- Add support for single channel defined as a string rather than an array (removes the need of having 2 different handlers)
- Implement a retry strategy for handling slack api transitient errors (like timeouts). Retries hardcoded to 5 with 10 second timeout, but we should make this a setting
- Exits with non zero status on api call errors, to allow sensu-server to catch the output and log it
- Deals with non-nagios standard exit codes for checks (0-3), also known as "out of bounds" errors
	- Sets the color the same as UNKNOWN
	- Displays the exit code as it was returned by the check, for detailed information
@majormoses
Copy link
Member

Thanks for your contribution to Sensu plugins! Without people like you submitting PRs we couldn't run the project. I will review it shortly.

Copy link
Member

@majormoses majormoses left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall I think this looks great, let me know if you have any questions about my suggestions.

req.body = payload(text, channel).to_json
# Implement a retry strategy, with 5 tries max and a timeout of 10 seconds each
tries ||= 5
Timeout.timeout(10) do
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if we are gonna use timeout we should expose options to specify http level timeouts otherwise they will be obscured by a generic timeout message. See sensu-plugins/sensu-plugins-http#123 for additional context.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As for configurable options, we probably would like to have not only the timeout but also the number of retries, and the sleeping time in between retries, as configurable settings. Something like I proposed on https://github.com/sensu-plugins/sensu-plugins-slack/pull/83/files

On the other hand, I am not sure we do need to go as deep as specifying dns, open and read timeouts. Could we not just rescue these errors from Net::?

bin/handler-slack-multichannel.rb Outdated Show resolved Hide resolved
bin/handler-slack-multichannel.rb Outdated Show resolved Hide resolved
bin/handler-slack-multichannel.rb Outdated Show resolved Hide resolved
bin/handler-slack-multichannel.rb Outdated Show resolved Hide resolved
bin/handler-slack-multichannel.rb Outdated Show resolved Hide resolved
@@ -321,7 +341,8 @@ def color
2 => '#FF0000',
3 => '#6600CC'
}
color.fetch(check_status.to_i)
# When the sensu check status is out of bounds, use color for UNK
color.fetch(check_status.to_i, '#6600CC')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think given the overlap we should break this out into a library method that can be shared to avoid duplication: https://github.com/sensu-plugins/sensu-plugins-slack/blob/4.0.0/bin/handler-slack.rb#L206-L228

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See following comment

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should be resolved by fac988a, please review

@@ -335,6 +356,7 @@ def translate_status
2 => :CRITICAL,
3 => :UNKNOWN
}
status[check_status.to_i]
# When the sensu check status is out of bounds, don't hide the status but show it for detailed info
status.fetch(check_status.to_i, "OUT OF BOUNDS: #{check_status.to_i}")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know that this is the right approach, I think we should just fall back to an unknown. See previous comment about merging the functionality via a library method.

Copy link
Contributor Author

@kali-brandwatch kali-brandwatch Apr 10, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do not entirely agree with having OUT OF BOUNDS messages declared as UNKNOWN. An UNK in nagios-style means the check code has pruposely set this exit status and reflects the purpose of the programmer to explicitly tell the operator there is no way to know the state of the service checked.
An OOB from nagios standards means that the check script has exited somewhere in a nasty way. Typical cases are uncaught bad curl exits, missing files, missing permissions (on check script or files to read) and a myriad more.
The very reason to have OOB is to let the operator know that the check itself is failing, rather than working in an expected manner that is not able to monitor a service or resource.
I can see how it would make sense to avoid duplicity here, although you could argue likewise about the whole having a def translate_status and a def color which are based on the same input. We could return both in a hash and access the key we want specifically in the code.
In any case I strongly believe having OOB explicitly defined and the exit code returned to the operator for visibility is crucial to understand what is going on with the service monitored.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with your point about translate_status and color duplication, I did not initially write that bit. I added in the error handling to return a status (and corresponding color) which is understood in the sensu ecosystem in #62. More often than not changes to existing plugins are made within the confines of how it already works. I did not use that slack handler at the time so I did not really have anywhere to test it. I made the smallest and safest change that solved the problem someone was facing.

For the moment I think we should stick to the status[1] of unknown for non [0-2] codes because there is lack of precedence in the ecosystem. In ~5 years of consuming, contributing, and maintaining sensu I have yet to see anyone follow this convention you are proposing. While we mirror certain aspects of nagios I don't want to hold ourselves to all of their conventions as it ties our hands. I am not opposed to it in general but I think we should put out an RFC and get community feedback before declaring a new convention to be followed. I am fine with adding some additional output to tell the operator more in the meantime.

[1]: leaving the status code intact

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right that it probably make sense to open this discussion to the whole community. I still advocate for my proposed convention but I will move the conversation to there for openness and greater feedback.

Copy link
Contributor Author

@kali-brandwatch kali-brandwatch Apr 11, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, added here: sensu-plugins/community#127

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Regarding the de-duplication of translate_status and color, I will check if this would mean lots of further changes. If it doesn't I will go ahead with it, otherwise I will try to keep it simple for now.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should be resolved bu fac988a, please review

@kali-brandwatch
Copy link
Contributor Author

Ok I will add a commit with the suggested changes here, and mark conversations as resolved accordingly.

majormoses and others added 6 commits April 11, 2019 10:57
Keep client and check names in the title

Co-Authored-By: kali-brandwatch <[email protected]>
Change to string interpolation

Co-Authored-By: kali-brandwatch <[email protected]>
Change to string interpolation

Co-Authored-By: kali-brandwatch <[email protected]>
Change to string interpolation

Co-Authored-By: kali-brandwatch <[email protected]>
- Remove unnecessary puts for the whole request.
- Fix syntax error on string interpolation.
- Shorten error messages to keep clearer error messages and make rubocop happy about line length.
…dard exit codes

As discussed on sensu-plugins#82 (comment) and sensu-plugins#82 (comment)
Add a configurable setting for which exit status code will be used when the check returns a non standard status (non [0..3]), defaulting to 3 (UNK)
Set rescues on `color` and `translate_status` to use this setting.
Should resolve both comments above
@majormoses
Copy link
Member

@kali-brandwatch this PR should be updated to reflect that #83 was merged. A rebase is probably in order. I think it would be best to split out the OOB to another PR so we can get the rest of this merged as I don't think we have come to a consensus on how best to proceed on that front.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants