-
Notifications
You must be signed in to change notification settings - Fork 933
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Modified ChassisGroupChassisDelete to convert chassis_name #14904
base: main
Are you sure you want to change the base?
Conversation
…o fix later comparison Signed-off-by: Eric Gelinas <[email protected]>
} | ||
|
||
// Check if chassis group exists. ovn-nbctl doesn't provide an "--if-exists" option for this. | ||
output, err = o.nbctl("--no-headings", "--data=bare", "--colum=name,ha_chassis", "find", "ha_chassis_group", fmt.Sprintf("name=%s", string(haChassisGroupName))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
output, err = o.nbctl("--no-headings", "--data=bare", "--colum=name,ha_chassis", "find", "ha_chassis_group", fmt.Sprintf("name=%s", string(haChassisGroupName))) | |
output, err = o.nbctl("--no-headings", "--data=bare", "--colum=name,ha_chassis", "find", "ha_chassis_group", "name="+string(haChassisGroupName)) |
fmt.Sprintf()
is slow so we recently started replacing it with simpler/faster string concatenation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ultimately this issue was introduced by myself in 1d98019 trying to workaround the issue that the ovn-nbctl ha-chassis-group-remove-chassis
command doesn't provide an --if-exists
option where if the chassis member doesn't exist in the group, then no error will be raised.
Previously if the chassis was a member of multiple chassis groups (OVN networks) then if one of the OVN networks couldn't
be started (and thus its chassis wasn't added to that network's chassis group) then attempting to remove unavailable OVN
network would fail because the previous implementation thought the chassis was in the chassis group, when infact it was
in another chassis group.
However this PR as it stands introduces another issue because it does not consider that the same chassis can have multiple ha_chassis
entries (one per ovn network). This is done to achieve the per-ovn network chassis failover priority, in order to distribute uplink traffic across the cluster.
Taking the example of a microcloud with 2 OVN networks:
ovn-nbctl find ha_chassis
_uuid : f139a82d-7d32-4799-80ce-6feecbb4b3df
chassis_name : micro01
external_ids : {}
priority : 26056
_uuid : 6fe7b2ed-d98b-4e64-9585-0eef369ba082
chassis_name : micro01
external_ids : {}
priority : 1138
_uuid : a1a8d8c2-2cc6-4806-8bd8-5a4ddc485890
chassis_name : micro03
external_ids : {}
priority : 15471
_uuid : a07811af-327d-4467-ae7e-3be7afc98098
chassis_name : micro02
external_ids : {}
priority : 10207
_uuid : a7efafc7-9f65-4383-aa9d-ab8cad5443ae
chassis_name : micro03
external_ids : {}
priority : 29049
_uuid : 701780b3-8930-4365-abe7-85ab0865356e
chassis_name : micro02
external_ids : {}
priority : 22819
ovn-nbctl find ha_chassis_group
_uuid : 2df9866e-03a8-4bb9-9447-5f928014f338
external_ids : {}
ha_chassis : [a07811af-327d-4467-ae7e-3be7afc98098, a1a8d8c2-2cc6-4806-8bd8-5a4ddc485890, f139a82d-7d32-4799-80ce-6feecbb4b3df]
name : lxd-net3
_uuid : 9cc51612-237f-43dd-bb32-aebf191bf3c6
external_ids : {}
ha_chassis : [6fe7b2ed-d98b-4e64-9585-0eef369ba082, 701780b3-8930-4365-abe7-85ab0865356e, a7efafc7-9f65-4383-aa9d-ab8cad5443ae]
name : lxd-net2
We see that micro01
has 2 ha_chassis
entries:
_uuid : f139a82d-7d32-4799-80ce-6feecbb4b3df
chassis_name : micro01
external_ids : {}
priority : 26056
and
_uuid : 6fe7b2ed-d98b-4e64-9585-0eef369ba082
chassis_name : micro01
external_ids : {}
priority : 1138
The f139a82d-7d32-4799-80ce-6feecbb4b3df
is used in the ha_chassis_group
for lxd-net3
network and 6fe7b2ed-d98b-4e64-9585-0eef369ba082
is used in the has_chassis_group
for lxd-net2
network respectively.
So the problem with the PR is that it takes the entries in the ha_chassis
table and matches the first entry that matches the cluster member's chassis ID.
Then it will use that UUID to match against the ha_chassis_group
members entries for the ha_chassis_group
entry for the associated network.
This is likely to not match correctly, at least some of the time.
The naive approach we could use is to simply lookup each chassis member using the UUID and convert it to a chassis ID/name via:
ovn-nbctl get ha_chassis <group member UUID> chassis_name
But because each ovn-nbctl
invocation is expensive, because it connects to the ovn cluster and loads part of the database, I think we should load all ha_chassis members and iterate through them.
Once we switch to libovsdb, we can switch to just doing a chassis name lookup for each chassis member like Incus do here:
https://github.com/lxc/incus/blob/main/internal/server/network/ovn/ovn_nb_actions.go#L2401-L2410
So for now I think we should do this:
- First, get chassis members from the
ha_chassis_group
table, using the existing command we are using. - Next, get list of
ha_chassis
entries, and see if there is an entry that matches the cluster member's chassis ID and where its UUID is inside the chassis members list for this network's chassis group.
We can also add a test to https://github.com/canonical/lxd/blob/main/test/suites/network_ovn.sh such that when lxd shutdown
is called, the chassis group member entry should be removed.
@escabo id like to get this into 5.21/candidate as a cherry-pick before we push 5.21.3 to 5.21/stable if timing works out. |
Now gets the list of ha_chassis first to convert the received chassis_name into a _uuid for later comparison.
Fixes #14902.