Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Avoid resetting the VF if the netns does not exist #313

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

zeeke
Copy link
Member

@zeeke zeeke commented Nov 28, 2024

The VF might have been assigned to another running
Pod if the network namespace is not present. In cases
like this, resetting the VF might break the configuration
of the running Pod.
The above scenario might occur when the IPAM plugin takes a lot of time to complete, and the CmdDel gets called
multiple times.

Check if the namespace is still present before resetting the VF

@bn222 @SchSeba WDYT about this approach?

Signed-off-by: Andrea Panattoni <[email protected]>
@coveralls
Copy link

coveralls commented Nov 28, 2024

Pull Request Test Coverage Report for Build 12802985585

Details

  • 7 of 57 (12.28%) changed or added relevant lines in 3 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage decreased (-1.0%) to 40.416%

Changes Missing Coverage Covered Lines Changed/Added Lines %
pkg/config/config.go 7 9 77.78%
cmd/sriov/main.go 0 22 0.0%
pkg/utils/pci_allocator.go 0 26 0.0%
Totals Coverage Status
Change from base Build 12743187155: -1.0%
Covered Lines: 563
Relevant Lines: 1393

💛 - Coveralls

@zeeke zeeke requested review from SchSeba and Eoghan1232 December 6, 2024 16:13

if !netConf.DPDKMode {
netns, err := ns.GetNS(args.Netns)
netns, err = ns.GetNS(args.Netns)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should always do this not only for non dpdk devices.
if not we should be able to reproduce the issue on a vfio-pci device for example (maybe not using the ipam)
but for example if the driver is slow and we need to reset the vlan on the vf and that part takes a lot of time we maybe be able to reproduce the issue no?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that's a good point. I updated the code so that if the net namespace doesn't exist, then Release and Reset are not called.
WDYT?

@zeeke zeeke force-pushed the us/ocpbugs-45028 branch 2 times, most recently from 95fcfd3 to dcdaec1 Compare December 17, 2024 12:21
Copy link
Collaborator

@SchSeba SchSeba left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This LGTM! nice work!

@SchSeba
Copy link
Collaborator

SchSeba commented Dec 17, 2024

@adrianchiris can you please give this one a look so we can merge it?

"func", "cmdDel",
"netConf.DeviceID", netConf.DeviceID,
"args.Netns", args.Netns)
return nil
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in this case dont we want to skip to L315 and mark pci address as released ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's better not:

The scenario I'm trying to solve here is when a CmdDelete is called when the device has already been allocated to another Pod. If we call DeleteAllocatedPCI(...), we are freeing up a device on a running pod.

WDYT?

Copy link
Contributor

@adrianchiris adrianchiris Dec 18, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

took another look at the codebase, can this really happen ? we use the pci allocator to check if the device is allocated in CmdAdd, see [1]

so even if NS was deleted and device was re-assigned to a new pod, its CNI call would fail on ADD until we released the device.

if my understanding of the above is true then why dont we want to reset Vf config in this case ?

[1]

isAllocated, err := allocator.IsAllocated(n.DeviceID)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right; in the scenario I just described, it probably can't happen.
What we saw happening is related to a very long execution of ipam.ExecDel(...).
This scenario includes 3 sriov-cni calls:
a. first_cniDEL starts and call ipam.ExecDel(...). this is going to take a very long time (~2 minutes)
b. second_cniDEL is invoked and, for some reason, it completes quite quickly (ipam deletion, vlan/mac reset, ...). PIC is deallocated, netns is deleted.
c. cniADD is invoked. It assigns the VF to a new pod, configuring VLAN, MAC, ...
d. first_cniDEL continues its execution, as the ipam.ExecDel finished. Here, what I want to avoid is resetting the VF MAC, VLAN and so on.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe a better solution is to use lock file to handle on a single cmd ADD/DEL at a time for a given ns, network, interface ?
that way second cni DEL will fail and prevent freeing up the PCI in the allocator ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sounds good.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

instead of /tmp, we should go through all the files we have put on the fs when we start up the operator controller, and ensure our local db (files we put on the file system) is clean. There is no guarantee that /tmp will be cleared.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we can flock the pci file if it works. just need to make sure we release when we exit ADD and DEL cmds.

so , according to flock docs[1] it would release the lock if the process exits so in case of reboot we would not be stuck in a deadlock (it would be released when process exits)

[1] https://man7.org/linux/man-pages/man2/flock.2.html

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm trying to create a reproducer (not easy at all!) and see if the flock solves the problem.

instead of /tmp, we should go through all the files we have put on the fs when we start up the operator controller, and ensure our local db (files we put on the file system) is clean. There is no guarantee that /tmp will be cleared.

This is a good point. We might move the sriov-cni operation folder from /var/lib/cni/sriov to /run/sriov. That folder is supposed to be clean at every boot.

https://refspecs.linuxfoundation.org/FHS_3.0/fhs/ch03s15.html

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like how you say "is supposed to be". I would like to build code that does not have "supposed to be". If you do, and the assumption is wrong, then you have a bug. Better to not assume anything and build robust code that checks the system state instead of comparing to some cached version of the state stored on a file system

@adrianchiris
Copy link
Contributor

adrianchiris commented Dec 17, 2024

nasty bug :)

@zeeke zeeke force-pushed the us/ocpbugs-45028 branch 2 times, most recently from 723d1e5 to 3a78a13 Compare January 15, 2025 12:20
When running on heavy a load, sriov-cni might be invoked multiple times
by the kubelet or the container runtime. In these situations, it might happen
that a `cmdAdd` or a `cmdDel` run on the same device that is still handled by
a `cmdDel` instance (e.g. when the IPAM plugin takes a long time to finish).

This commit adds a Lock mechanism around the `/var/lib/cni/sriov/<PCI>` address,
to avoid two or more instances handling the same the device.
The lock is based on `flock` [1] using the `O_CLOEXEC` flag, which  garantees that the lock
is released when the process finishes.

[1] https://linux.die.net/man/2/flock

Signed-off-by: Andrea Panattoni <[email protected]>
zeeke added a commit to zeeke/sriov-cni that referenced this pull request Jan 16, 2025
Depends on:
 - k8snetworkplumbingwg#313

Signed-off-by: Andrea Panattoni <[email protected]>
zeeke added a commit to zeeke/sriov-cni that referenced this pull request Jan 16, 2025
Depends on:
 - k8snetworkplumbingwg#313

Signed-off-by: Andrea Panattoni <[email protected]>
zeeke added a commit to zeeke/sriov-cni that referenced this pull request Jan 16, 2025
Depends on:
 - k8snetworkplumbingwg#313

Signed-off-by: Andrea Panattoni <[email protected]>
@zeeke
Copy link
Member Author

zeeke commented Jan 16, 2025

I implemented the fix via the unix.Flock strategy.

I manually tested these changes on OpenShift 4.17 and with automated tests (See #318)

Please take another look

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants