
Combine batches of successive roles for same nodes #1345

Open
wants to merge 1 commit into base: master

Conversation


@sjamgade sjamgade commented Oct 2, 2017

We can speed up the application of the (n+1)th role if both the nth and (n+1)th roles are being applied on the same node. This speeds up deployment of ceilometer by at least 1m20s (measured: 90 sec) and swift by ~20s.

e.g. In our 2-node deployment, ceilometer-{server,central} are always applied on the same node; given that they have different priorities, they are applied one after the other.

This does not violate any order constraints, as the application procedure of the (n+1)th role is transparent to the nth role.
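The merging idea can be sketched roughly as follows. This is a hypothetical, flattened helper for illustration only, not the PR's actual code (the real implementation appends nested role groups, as visible in the test output further down the thread):

```ruby
# Hypothetical sketch of the batch-merging optimization (illustration
# only, not the PR's exact implementation). A batch is [roles, nodes];
# adjacent batches that target the identical node set are collapsed so
# chef-client runs once for the combined roles instead of once per batch.
def merge_batches(batches)
  merged = []
  batches.each do |roles, nodes|
    if !merged.empty? && merged.last[1] == nodes
      # Same nodes as the previous batch: fold the roles into one run.
      merged.last[0].concat(roles)
    else
      merged << [roles.dup, nodes]
    end
  end
  merged
end
```

With two ceilometer batches landing on the same node, this collapses them into a single batch, so chef-client runs once instead of twice, which is where the measured saving comes from.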

Why is this change necessary?
Speed up the chef-client execution

@aspiers : it's actually reducing the number of times chef-client is run rather than speeding up execution of any single run

@sjamgade sjamgade requested a review from vuntz October 2, 2017 13:36
batches << [roles, nodes_in_batch] unless nodes_in_batch.empty?
if pnodes_in_batch == nodes_in_batch && !proles.nil?
# proles is same as batches[-1][0]
proles << roles


Layout/IndentationWidth: Use 2 (not 4) spaces for indentation. (https://github.com/bbatsov/ruby-style-guide#spaces-indentation)

@sjamgade sjamgade force-pushed the mergebatch branch 3 times, most recently from 058b6e4 to ae3a82f on October 2, 2017 13:43
@sjamgade
Author

sjamgade commented Oct 2, 2017

aspiers
aspiers previously requested changes Oct 6, 2017
Member

@aspiers aspiers left a comment


In principle I think this sounds like a great idea, so thanks a lot for working on this!

As you can see from my comments, there are some serious code legibility issues here. The vast majority of these issues existed with #apply_role for many years before you joined SUSE ;-) However it's super-important that any changes to it should remove technical debt, not add new technical debt, otherwise we just continue the trend of this method getting worse and worse over the years.

In particular, #apply_role has quite clearly delimited blocks of logic, so it lends itself very well to extraction of a few new methods (one commit at a time!). This would make it far easier to see the inputs, outputs, and actions of each chunk, and thus the overall flow of logic. If you could do this before enhancing it then you'd probably get a +1000 from me :-)

@@ -1071,6 +1071,8 @@ def apply_role(role, inst, in_queue, bootstrap = false)

# element_order is an Array where each item represents a batch of roles and
# the batches must be applied sequentially in this order.
pnodes_in_batch = nil
Member


Please use easily understandable variable names. It is not clear from the names what these two variables are for. I'm guessing p means "previously added to batches" rather than "previously seen in element_order", is that right? IIUC the distinction is subtle but crucial to how you've written this code. If so, call them something like nodes_in_prev_batch and roles_for_prev_batch.

Also, by inserting these assignments here, you are divorcing the comment explaining element_order from its usage, so please don't do that.

@@ -1169,7 +1171,14 @@ def apply_role(role, inst, in_queue, bootstrap = false)
end
end # roles.each

batches << [roles, nodes_in_batch] unless nodes_in_batch.empty?
if pnodes_in_batch == nodes_in_batch && !proles.nil?
Member


The #apply_role method is already WAY too long, and hopelessly unreadable as a result, so I don't want even 8 new lines to be added. Please can you find a way to make this change which makes the method smaller not larger, by extracting out 1 or more blocks into new methods, one commit at a time?

Also, what if nodes_in_batch is a subset of pnodes_in_batch? Couldn't that also be a situation where the batches could be merged?

Author

@sjamgade sjamgade Oct 6, 2017


element_order = [
  [b1],
  [c1, c2]
]
batch 1: node1=[b1] node2=[b1]
batch 2: node1=[c1]

I think this could break if c1 needs b1 to be entirely functional.

Member

@aspiers aspiers Oct 6, 2017


this could break, if c1 needs b1 to be entirely functional

Not necessarily, but that demonstrates my point 2) below - it depends on whether all the recipes for b1 were ordered to happen before the recipes for c1. For example:

element_order = [
  [b1],
  [c1, c2]
]
batch 1: node1=[b1] node2=[b1]
batch 2: node1=[c1] node2=[c1]

This PR would change that to:

batch 1: node1=[b1,c1] node2=[b1,c1]

If some of c1's recipes come before b1's in the runlist then there's a problem. Normally runlist ordering is determined for all roles within a barclamp by chef_order so this should not be an issue (assuming barclamps are deployed in an order matching chef_order?), but this can be overridden per role by element_run_list_order. The only place we actually use that is with pacemaker remotes, but even there I'm not sure we actually use it - maybe the code only has remnants of a previous usage which we reverted when @vuntz came up with a different approach? I can't remember but @vuntz will know for sure.

Author


If some of c1's recipes come before b1's in the runlist then there's a problem.
Yes, this is correct.

I did look at element_run_list_order in nova, ceilometer, swift, and other barclamps. I didn't find the issue you mentioned, as each one of them had explicit priorities.

Member


Yeah, I think I must have done the grep wrong last time because I see them now.

batches << [roles, nodes_in_batch] unless nodes_in_batch.empty?
if pnodes_in_batch == nodes_in_batch && !proles.nil?
# proles is same as batches[-1][0]
proles << roles
Member


I'm really struggling to understand this because #apply_role is such horrible spaghetti code and I don't look at it often enough to remember all the details (I guess @vuntz is the only one who does - which is maybe why he fails to see the need for it to be refactored, but that's another discussion I'll continue with him separately ;-) ...

But IIUC it seems like this would result in say, ceilometer-agent being appended to proles so that it runs in the same batch as ceilometer-central, and moving the line which appends to batches into the else clause means that no new batch is added during this iteration of the element_order.each loop. Is that right? If so, I think it deserves more comments explaining what's going on, because it just took me about 20 minutes to understand these few lines.

Author

@sjamgade sjamgade Oct 6, 2017


Exactly.
Will add comments.

# proles is same as batches[-1][0]
proles << roles
else
proles = roles
Member


OK, so these two lines are only in the else block because that allows merging of even more than two batches into one?

Author


yes

@aspiers
Member

aspiers commented Oct 6, 2017

Two more thoughts:

  1. The answer given to "Why is this change necessary?" was "Speed up the chef-client execution", but IIUC that's not precisely what's happening; it's actually reducing the number of times chef-client is run rather than speeding up execution of any single run - right? So if I'm right please change the wording on that to clarify.

  2. Actually, this also assumes that the cookbooks for (say) ceilometer would be ordered so that in the chef-client runs for the combined batch, any actions for ceilometer-agent would be performed after ceilometer-central. And that sounds like it could potentially be a fundamental flaw in this optimization. I'll leave it to @vuntz to comment on whether or not this is likely to be an issue. I suspect it's fine for most if not all cookbooks, but it would be introducing a new assumption into the design so we'd have to go through all of them to make sure. In particular I wonder whether it would break compute HA deployment.
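Point 2 can be illustrated with a toy example (hedged: the role names and priorities below are invented, and this is not Crowbar's actual run-list API; it only shows why merging assumes priorities agree with batch order):

```ruby
# Hypothetical illustration: once two batches are merged into a single
# chef-client run, recipe order is decided by run-list priority, not by
# the original batch sequence.
priorities = { "b1" => 50, "c1" => 40 } # c1 overridden to an earlier priority
merged_run_list = ["b1", "c1"].sort_by { |role| priorities[role] }
# => ["c1", "b1"]: c1's recipes now run before b1's, violating the
# original batch order [b1] -> [c1].
```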

@aspiers
Member

aspiers commented Oct 6, 2017

And another:

  1. Orchestration is hard ;-/

@sjamgade
Author

sjamgade commented Oct 6, 2017

  1. Done.

  2. ceilometer-{central,agent} have different priorities which makes them n and n+1.
    Chef client respects these priorities.

  3. Cannot agree more.

@Itxaka
Member

Itxaka commented Oct 6, 2017

Sorry to barge in here but:

As you can see from my comments, there are some serious code legibility issues here. The vast majority of these issues existed with #apply_role for many years before you joined SUSE ;-) However it's super-important that any changes to it should remove technical debt, not add new technical debt, otherwise we just continue the trend of this method getting worse and worse over the years.

In particular, #apply_role has quite clearly delimited blocks of logic, so it lends itself very well to extraction of a few new methods (one commit at a time!). This would make it far easier to see the inputs, outputs, and actions of each chunk, and thus the overall flow of logic. If you could do this before enhancing it then you'd probably get a +1000 from me :-)

Totally agree here after having to deal with apply_role, and exactly what I was thinking of doing shortly, extracting apply_role into a minimum of 4 methods (as those are the four big steps inside) in order to be able to refactor that monstrous 500 line method into smaller chunks for clarity and maybe being able to (kind of) easily try to reduce/refactor smaller pieces inside it.

Plus the test coverage for it is minimal.

@sjamgade I understand this falls outside of the scope of the card, but let me know if you plan on doing anything to apply_role so we don't do the same work twice in case I get some free time to refactor it :D

@sjamgade
Author

sjamgade commented Oct 6, 2017

@aspiers gating passed.
I am not sure if that covers the HA-Compute case you mentioned.

https://ci.suse.de/job/openstack-mkcloud/86719/

https://ci.suse.de/job/openstack-mkcloud/86733/

@vuntz
Member

vuntz commented Oct 9, 2017

I have mixed feelings here: this is probably fine at the moment, but it would conflict with a potential change where we restrict what chef does on apply (which is something we were slowly heading towards). So if anything, we should have a big comment explaining the shortcut and noting that it is not compatible with the minimal chef-client approach on apply.

Member

@aspiers aspiers left a comment


Yeah, so I now think this is probably OK (modulo the nitpick), but I agree with @vuntz that it could conflict with the future direction. However if we ever get to that, we can make sure to properly refactor #apply_role (thanks @Itxaka, great to hear you agree with me on this!) and then it should be relatively easy to adapt the logic accordingly.

@@ -1069,6 +1069,8 @@ def apply_role(role, inst, in_queue, bootstrap = false)
proposal = Proposal.where(barclamp: @bc_name, name: inst).first
save_proposal = false

nodes_previousbatch = nil
Member


Nitpick: please be consistent in your variable naming style. "previous batch" is two words, not one. Since the existing style is to put _ between words, put it between all words ("nodes_previous_batch"). Also, since maximising legibility is an important goal, I prefer names which are as close as possible to natural English language, which is why I suggested nodes_in_prev_batch and roles_for_prev_batch. Admittedly that's more of a personal preference, but still worth considering.

@aspiers aspiers dismissed their stale review October 13, 2017 11:28

Complaints addressed / invalid

@aspiers
Member

aspiers commented Oct 13, 2017

Totally agree here after having to deal with apply_role, and exactly what I was thinking of doing shortly, extracting apply_role into a minimum of 4 methods (as those are the four big steps inside) in order to be able to refactor that monstrous 500 line method into smaller chunks for clarity and maybe being able to (kind of) easily try to reduce/refactor smaller pieces inside it.

YES that's exactly what I was thinking of!

Plus the test coverage for it is minimal.

@sjamgade I understand this falls outside of the scope of the card, but let me know if you plan on doing anything to apply_role so we don't do the same work twice in case I get some free time to refactor it :D

It would make me sooo happy to see this refactoring done :-)

end


Layout/TrailingWhitespace: Trailing whitespace detected. (https://github.com/bbatsov/ruby-style-guide#no-trailing-whitespace)

# role (which would be an improvement over the requirement to
# explicitly list all nodes).
new_nodes.reject! do |node_name|
not pre_cached_nodes[node_name].nil? && Rails.logger.debug("skipping deleted node #{node_name}")


Style/Not: Use ! instead of not. (https://github.com/bbatsov/ruby-style-guide#bang-not-not)
Metrics/LineLength: Line is too long. [106/100] (https://github.com/SUSE/style-guides/blob/master/Ruby.md#metricslinelength)

# have some alias that be used to assign all existing nodes to a
# role (which would be an improvement over the requirement to
# explicitly list all nodes).
new_nodes.reject! do |node_name|


Style/InverseMethods: Use select! instead of inverting reject!.

# Don't add deleted nodes to the run order, they clearly won't have
# the old role
old_nodes.reject! do |node_name|
not pre_cached_nodes[node_name].nil? && Rails.logger.debug("skipping deleted node #{node_name}")


Style/Not: Use ! instead of not. (https://github.com/bbatsov/ruby-style-guide#bang-not-not)
Metrics/LineLength: Line is too long. [106/100] (https://github.com/SUSE/style-guides/blob/master/Ruby.md#metricslinelength)

end
# Don't add deleted nodes to the run order, they clearly won't have
# the old role
old_nodes.reject! do |node_name|


Style/InverseMethods: Use select! instead of inverting reject!.

end
merged_batches << current_batch
end
return merged_batches


Style/RedundantReturn: Redundant return detected. (https://github.com/bbatsov/ruby-style-guide#no-explicit-return)

@@ -919,6 +919,35 @@ def self.proposal_to_role(proposal, bc_name)
RoleObject.new role
end

# we can speed-up the application of (n+1)th role if both(n,n+1)
# roles are being applied on the same node.
#


Layout/TrailingWhitespace: Trailing whitespace detected. (https://github.com/bbatsov/ruby-style-guide#no-trailing-whitespace)

@@ -919,6 +919,35 @@ def self.proposal_to_role(proposal, bc_name)
RoleObject.new role
end

# we can speed-up the application of (n+1)th role if both(n,n+1)
# roles are being applied on the same node.


Layout/TrailingWhitespace: Trailing whitespace detected. (https://github.com/bbatsov/ruby-style-guide#no-trailing-whitespace)

# a batch is [roles, nodes]
def mergebatches(batches)
merged_batches = []
unless bathces.empty?
Member


batches typo. If you are pushing code before testing it, please clearly label and mark the PR as WIP, otherwise there is a risk broken code will get merged.

# merging batches together
#
# a batch is [roles, nodes]
def mergebatches(batches)
Member


Yes, much better to have this as a separate method!

But as mentioned before, please stay consistent with the existing style, so this should be merge_batches.

#
# eg. In our 2node deployment ceilometer{server,central} are always
# applied on the same node, given that they have different priorities,
# they are to be applied, one after the other.
Member


It would be helpful to explicitly mention element_run_list_order here and explain its effect on the run-list, otherwise the reader will be wondering why/how they end up with different priorities.

@@ -919,6 +919,35 @@ def self.proposal_to_role(proposal, bc_name)
RoleObject.new role
end

# we can speed-up the application of (n+1)th role if both(n,n+1)
Member


Thanks for this comment block - definitely important to have it since this batch merging is not immediately obvious without explanation!

# is run rather than speeding up execution of any single run, by
# merging batches together
#
# a batch is [roles, nodes]
Member


Thanks - I find comments like this really helpful for understanding the data structures being manipulated.

@sjamgade sjamgade removed the wip label Oct 19, 2017
@sjamgade sjamgade dismissed aspiers’s stale review October 20, 2017 08:39

please do review the latest changes

@jloehel

jloehel commented Feb 19, 2018

I tested these changes in a cloud environment with 4 nodes and develcloud7 as source. I compared the time with and without the patch. You can see the results here:
[graph: chef-client apply times with vs. without the patch]
It's definitely a performance gain, for example for the barclamp "ceilometer". The roles are merged correctly.

(Pdb) batches["A"]["ceilometer"]
'[[["ceilometer-server", ["ceilometer-central"], ["ceilometer-agent"]], ["d52-54-77-77-01-01.xxx"]]]'
(Pdb) batches["B"]["ceilometer"]
'[[["ceilometer-server"], ["d52-54-77-77-01-01.xxx"]], [["ceilometer-central"], ["d52-54-77-77-01-01.xxx"]], [["ceilometer-agent"], ["d52-54-77-77-01-01.xxx"]]]'

It looks good to me. 👍

# eg. In our 2node deployment ceilometer{server,central} are always
# applied on the same node, given that they have different priorities,
# they are to be applied, one after the other, these priorities come
# from element_run_list_order
Member


Two nitpicks here:

  1. The priorities come from chef_order by default but can be overridden per role by element_run_list_order
  2. ceilometer-{server,central} is just one example, but this code will merge any adjacent batches regardless of what run_list priority the roles are in. Is it possible to make the code automatically check that the run_list priority of batch n is lower than the priority of batch n+1, and only merge if that is true? If that was added then I think this would feel safe enough to me for a +1 vote.
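The suggested guard could look something like this sketch (hypothetical: `safe_to_merge?` and the `priority_of` lookup are invented names; in Crowbar the priorities would come from chef_order / element_run_list_order):

```ruby
# Hypothetical guard: only merge two adjacent batches when every role in
# the later batch has a run-list priority >= the highest priority in the
# earlier batch (assuming a lower number means the role runs earlier).
def safe_to_merge?(prev_roles, next_roles, priority_of)
  prev_max = prev_roles.map { |r| priority_of[r] }.max
  next_roles.all? { |r| priority_of[r] >= prev_max }
end
```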
