-
Notifications
You must be signed in to change notification settings - Fork 149
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Avoid resetting the VF if the netns does not exist #313
base: master
Are you sure you want to change the base?
Conversation
Signed-off-by: Andrea Panattoni <[email protected]>
Pull Request Test Coverage Report for Build 12802985585Details
💛 - Coveralls |
cmd/sriov/main.go
Outdated
|
||
if !netConf.DPDKMode { | ||
netns, err := ns.GetNS(args.Netns) | ||
netns, err = ns.GetNS(args.Netns) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should always do this not only for non dpdk devices.
if not we should be able to reproduce the issue on a vfio-pci device for example (maybe not using the ipam)
but for example if the driver is slow and we need to reset the vlan on the vf and that part takes a lot of time we maybe be able to reproduce the issue no?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
that's a good point. I updated the code so that if the net namespace doesn't exist, then Release and Reset are not called.
WDYT?
95fcfd3
to
dcdaec1
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This LGTM! nice work!
@adrianchiris can you please give this one a look so we can merge it? |
cmd/sriov/main.go
Outdated
"func", "cmdDel", | ||
"netConf.DeviceID", netConf.DeviceID, | ||
"args.Netns", args.Netns) | ||
return nil |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
in this case dont we want to skip to L315 and mark pci address as released ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's better not:
The scenario I'm trying to solve here is when a CmdDelete is called when the device has already been allocated to another Pod. If we call DeleteAllocatedPCI(...)
, we are freeing up a device on a running pod.
WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
took another look at the codebase, can this really happen ? we use the pci allocator to check if the device is allocated in CmdAdd, see [1]
so even if NS was deleted and device was re-assigned to a new pod, its CNI call would fail on ADD until we released the device.
if my understanding of the above is true then why dont we want to reset Vf config in this case ?
[1]
sriov-cni/pkg/config/config.go
Line 60 in fca6591
isAllocated, err := allocator.IsAllocated(n.DeviceID) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're right; in the scenario I just described, it probably can't happen.
What we saw happening is related to a very long execution of ipam.ExecDel(...)
.
This scenario includes 3 sriov-cni calls:
a. first_cniDEL starts and call ipam.ExecDel(...). this is going to take a very long time (~2 minutes)
b. second_cniDEL is invoked and, for some reason, it completes quite quickly (ipam deletion, vlan/mac reset, ...). PIC is deallocated, netns is deleted.
c. cniADD is invoked. It assigns the VF to a new pod, configuring VLAN, MAC, ...
d. first_cniDEL continues its execution, as the ipam.ExecDel
finished. Here, what I want to avoid is resetting the VF MAC, VLAN and so on.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe a better solution is to use lock file to handle on a single cmd ADD/DEL at a time for a given ns, network, interface ?
that way second cni DEL will fail and prevent freeing up the PCI in the allocator ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sounds good.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
instead of /tmp, we should go through all the files we have put on the fs when we start up the operator controller, and ensure our local db (files we put on the file system) is clean. There is no guarantee that /tmp will be cleared.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we can flock the pci file if it works. just need to make sure we release when we exit ADD and DEL cmds.
so , according to flock docs[1] it would release the lock if the process exits so in case of reboot we would not be stuck in a deadlock (it would be released when process exits)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm trying to create a reproducer (not easy at all!) and see if the flock solves the problem.
instead of /tmp, we should go through all the files we have put on the fs when we start up the operator controller, and ensure our local db (files we put on the file system) is clean. There is no guarantee that /tmp will be cleared.
This is a good point. We might move the sriov-cni operation folder from /var/lib/cni/sriov
to /run/sriov
. That folder is supposed to be clean at every boot.
https://refspecs.linuxfoundation.org/FHS_3.0/fhs/ch03s15.html
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like how you say "is supposed to be". I would like to build code that does not have "supposed to be". If you do, and the assumption is wrong, then you have a bug. Better to not assume anything and build robust code that checks the system state instead of comparing to some cached version of the state stored on a file system
nasty bug :) |
723d1e5
to
3a78a13
Compare
When running on heavy a load, sriov-cni might be invoked multiple times by the kubelet or the container runtime. In these situations, it might happen that a `cmdAdd` or a `cmdDel` run on the same device that is still handled by a `cmdDel` instance (e.g. when the IPAM plugin takes a long time to finish). This commit adds a Lock mechanism around the `/var/lib/cni/sriov/<PCI>` address, to avoid two or more instances handling the same the device. The lock is based on `flock` [1] using the `O_CLOEXEC` flag, which garantees that the lock is released when the process finishes. [1] https://linux.die.net/man/2/flock Signed-off-by: Andrea Panattoni <[email protected]>
3a78a13
to
2f6d2f0
Compare
Depends on: - k8snetworkplumbingwg#313 Signed-off-by: Andrea Panattoni <[email protected]>
Depends on: - k8snetworkplumbingwg#313 Signed-off-by: Andrea Panattoni <[email protected]>
Depends on: - k8snetworkplumbingwg#313 Signed-off-by: Andrea Panattoni <[email protected]>
I implemented the fix via the I manually tested these changes on OpenShift 4.17 and with automated tests (See #318) Please take another look |
The VF might have been assigned to another running
Pod if the network namespace is not present. In cases
like this, resetting the VF might break the configuration
of the running Pod.
The above scenario might occur when the IPAM plugin takes a lot of time to complete, and the CmdDel gets called
multiple times.
Check if the namespace is still present before resetting the VF
@bn222 @SchSeba WDYT about this approach?