neutron: set a failure-timeout on neutron-ha-tool #2063
base: master
@@ -154,6 +154,9 @@
agent "systemd:neutron-l3-ha-service"
op node[:neutron][:ha][:neutron_l3_ha_resource][:op]
action :update
meta ({
Lint/ParenthesesAsGroupedExpression: (...) interpreted as grouped expression. (https://github.com/bbatsov/ruby-style-guide#parens-no-spaces)
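The warning concerns the space between `meta` and the opening parenthesis. A minimal Ruby sketch of the difference follows. It uses a hypothetical stand-in `meta` method rather than the real resource attribute, and the `failure-timeout` key and value are assumptions for illustration, since the hunk above is cut off after `meta ({`:

```ruby
# Hypothetical stand-in for the resource's `meta` attribute setter;
# the real one comes from the cookbook and is not defined here.
def meta(attrs)
  attrs
end

# Flagged form (kept as a comment so the sketch runs without warnings):
#   meta ({ "failure-timeout" => "600" })
# With the space, Ruby treats `(...)` as a grouped expression that is
# then passed as the argument, which is what the cop complains about.

# Preferred form: no space between the method name and the parenthesis.
# Key and value are assumed, not taken from the actual change.
result = meta({ "failure-timeout" => "600" })
```

Both forms pass the same hash here; the cop is about avoiding the ambiguous parse, not about a change in behaviour for this particular call.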
The commit message references the l3 agent but the change affects neutron-l3-ha-service. It's not clear to me what the exact problem is or why timing out a failure of neutron-l3-ha-service would address it. I'm guessing there is some missing detail regarding the interaction between the two - please can you clarify in the commit message?
We don't want the neutron-ha-tool service to be stopped after 3 weeks of weekly patching and rebooting the rabbitmq cluster. Set a failure-timeout so that a failure is disregarded once it is more than 10 minutes old.
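The intended effect of the timeout can be sketched roughly as below. This is a simplified model, not Pacemaker's actual implementation: the function name, the 600-second constant (10 minutes, per the commit message) and the rule that all failures are forgotten once the most recent one is older than the timeout are assumptions made for illustration.

```ruby
# Simplified model of failure expiry; not Pacemaker's real logic.
FAILURE_TIMEOUT = 600 # seconds (10 minutes, as in the commit message)

# Returns how many failures still count at time `now`.
# Assumption: every recorded failure is forgotten once the most
# recent one is older than the timeout.
def effective_fail_count(failure_times, now, timeout = FAILURE_TIMEOUT)
  return 0 if failure_times.empty?

  (now - failure_times.max) > timeout ? 0 : failure_times.size
end

now = 10_000 # arbitrary timestamp in seconds

# Two failures within the last 10 minutes still count.
recent = effective_fail_count([now - 300, now - 120], now)

# Two failures that are both older than 10 minutes no longer count.
stale = effective_fail_count([now - 900, now - 700], now)
```

Under this model, errors spread over weeks of patching never accumulate, while a burst of failures within the window still counts against the resource.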
f194469 to 25eab46
@aspiers sorry, fixed the typo. This is about the neutron-l3-ha-service, which randomly but regularly gets stopped by pacemaker because of some sequence of consecutive errors. For example, somebody recently broke keystone for 15 minutes, and that caused pacemaker to stop the service due to repeated failures. This does not help high availability: pacemaker just kills the service that is supposed to take care of availability.
OK thanks, that makes sense now. Ideally I would prefer that info to be in the commit message too, since the commit message doesn't feel entirely self-explanatory yet. But the main problem seems to be that the CI is currently failing:
I guess that's probably related to this change somehow.
@aspiers commented on March 31, 2019 1:26 PM:
[snipped] I'm going to see if
We don't want the l3 ha tool service to be stopped after 3 weeks of weekly
patching and rebooting the rabbitmq cluster. Set a failure-timeout so that
a failure is disregarded once it is more than 10 minutes old.